In [1]:
# HIDDEN
Base.displaysize() = (5, 80)
using DataFrames
using CSV

## Julia String Methods

Julia provides a variety of methods for basic string manipulation. Although simple, these methods are the primitives from which more complex string operations are built. We will introduce Julia's string methods in the context of a common use case for working with text: data cleaning.

## Cleaning Text Data

Data often comes from several different sources that each implements its own way of encoding information. In the following example, we have one table that records the state that a county belongs to and another that records the population of the county.

In [8]:
# HIDDEN
state = DataFrame(
    County=[
        "De Witt County",
        "Lac qui Parle County",
        "Lewis and Clark County",
        "St John the Baptist Parish",
    ],
    State=[
        "IL",
        "MN",
        "MT",
        "LA"
    ]
)
population = DataFrame(
    County=[
        "DeWitt  ",
        "Lac Qui Parle",
        "Lewis & Clark",
        "St. John the Baptist"
    ],
    Population=[
        16798,
        8067,
        55716,
        43044
    ]
);

In [9]:
state

| Row | County                     | State  |
| --- | -------------------------- | ------ |
|     | String                     | String |
| 1   | De Witt County             | IL     |
| 2   | Lac qui Parle County       | MN     |
| 3   | Lewis and Clark County     | MT     |
| 4   | St John the Baptist Parish | LA     |


In [10]:
population

| Row | County               | Population |
| --- | -------------------- | ---------- |
|     | String               | Int64      |
| 1   | DeWitt               | 16798      |
| 2   | Lac Qui Parle        | 8067       |
| 3   | Lewis & Clark        | 55716      |
| 4   | St. John the Baptist | 43044      |


We would naturally like to join the `state` and `population` tables using the `County` column. Unfortunately, not a single county is spelled the same in the two tables. This example is illustrative of the following common issues in text data:

1.  Capitalization: `qui` vs `Qui`
1.  Different punctuation conventions: `St.` vs `St` 
1.  Omission of words: `County`/`Parish` is absent in the `population` table
1.  Use of whitespace: `DeWitt` vs `De Witt`
1.  Different abbreviation conventions: `&` vs `and`
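
As a quick check of the claim that no county name matches, the sketch below copies the two `County` columns into plain vectors (the variable names are ours, chosen for illustration) and looks for names that appear verbatim in both. It also shows that lowercasing alone is not enough to reconcile them.

```julia
# County names copied from the two tables above.
state_counties = ["De Witt County", "Lac qui Parle County",
                  "Lewis and Clark County", "St John the Baptist Parish"]
population_counties = ["DeWitt  ", "Lac Qui Parle",
                       "Lewis & Clark", "St. John the Baptist"]

# Names that appear verbatim in both vectors.
shared = intersect(state_counties, population_counties)

# The same comparison after lowercasing every name.
shared_lower = intersect(lowercase.(state_counties),
                         lowercase.(population_counties))

println(isempty(shared))        # no exact matches
println(isempty(shared_lower))  # still no matches after lowercasing
```

Both intersections are empty, so a join on the raw `County` column would return no rows.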

## String Methods

Julia's string methods allow us to start resolving these issues. These functions are part of Julia's `Base` module, so they work on all Julia strings without importing other modules. It is worth familiarizing yourself with [the complete list of string methods](https://docs.julialang.org/en/v1/base/strings/) and with the comprehensive introduction to strings in the [Julia Manual](https://docs.julialang.org/en/v1/manual/strings/#String-Basics-1); here we describe a few of the most commonly used methods in the table below.

| Method                 | Description                                                                        |
| ---------------------- | ---------------------------------------------------------------------------------- |
| `str[x:y]`             | Slices `str`, returning the characters from index `x` to index `y`, both inclusive |
| `lowercase(str)`       | Returns a copy of `str` with all letters converted to lowercase                    |
| `replace(str, a => b)` | Replaces all instances of the substring `a` in `str` with the substring `b`        |
| `split(str, a)`        | Returns an array of the substrings of `str` separated by the delimiter `a`         |
| `strip(str)`           | Removes leading and trailing whitespace from `str`                                 |
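
To make the table concrete, here is a minimal sketch that applies each method to one illustrative string (the string and variable names are ours, not taken from the tables). Note that slicing with `str[x:y]` uses byte indices, which coincide with character positions for ASCII text like this but not for all Unicode strings.

```julia
# Illustrative input with surrounding whitespace.
s = "  St. John the Baptist  "

trimmed = strip(s)                     # remove leading/trailing whitespace
lowered = lowercase(trimmed)           # convert to lowercase
no_dots = replace(lowered, "." => "")  # delete every "."
words   = split(no_dots, " ")          # split on single spaces
first2  = no_dots[1:2]                 # first two characters (ASCII only)

println(trimmed)  # St. John the Baptist
println(lowered)  # st. john the baptist
println(no_dots)  # st john the baptist
println(words)
println(first2)   # st
```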


We select the string for St. John the Baptist parish from the `state` and `population` tables and apply string methods to remove capitalization, punctuation, and `county`/`parish` occurrences.

In [28]:
john1 = state[4, :County]
john2 = population[4, :County]

new_john1 = lowercase(john1) |>
    strip |>
    x->replace(x, " parish" => "") |>
    x->replace(x, " county" => "") |>
    x->replace(x, "&" => "and") |>
    x->replace(x, "." => "") |>
    x->replace(x, " " => "")
new_john1

"stjohnthebaptist"

Applying the same set of methods to `john2` allows us to verify that the two strings are now identical.

In [29]:
new_john2 = lowercase(john2) |>
    strip |>
    x->replace(x, " parish" => "") |>
    x->replace(x, " county" => "") |>
    x->replace(x, "&" => "and") |>
    x->replace(x, "." => "") |>
    x->replace(x, " " => "")
new_john2

"stjohnthebaptist"

Satisfied, we create a function called `clean_county` that normalizes an input county name.

In [31]:
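function clean_county(county)
    # Lowercase and trim the name, drop the " parish"/" county" suffix,
    # spell out "&", then remove periods and all remaining spaces.
    return (lowercase(county) |>
    strip |>
    x->replace(x, " parish" => "") |>
    x->replace(x, " county" => "") |>
    x->replace(x, "&" => "and") |>
    x->replace(x, "." => "") |>
    x->replace(x, " " => ""))
end;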
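
One possible variation, shown here only as a sketch, keeps the replacement rules in a list and applies them in order with `foldl`. The names `rules` and `clean_county_rules` are hypothetical and not part of the original code. The rules themselves are the same five replacements used above.

```julia
# Hypothetical rule list: each pair is pattern => replacement, applied in order.
rules = [" parish" => "", " county" => "", "&" => "and", "." => "", " " => ""]

# Hypothetical helper: trim and lowercase, then apply every rule in turn.
function clean_county_rules(county)
    start = lowercase(strip(county))
    return foldl((s, rule) -> replace(s, rule), rules; init = start)
end

# A few of the raw names from the two tables.
raw_names = ["De Witt County", "DeWitt  ", "Lewis & Clark", "St. John the Baptist"]
cleaned = clean_county_rules.(raw_names)
println(cleaned)
```

With this layout, adding a new normalization rule means appending one pair to `rules` rather than extending the pipeline. The order of the rules still matters: the suffix rules must run before the rule that deletes spaces.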

We may now verify that the `clean_county` function produces matching names for all the counties in both tables:

In [45]:
([clean_county(county) for county in state.County],
 [clean_county(county) for county in population.County])

(["dewitt", "lacquiparle", "lewisandclark", "stjohnthebaptist"], ["dewitt", "lacquiparle", "lewisandclark", "stjohnthebaptist"])

Because each county in both tables has the same transformed representation, we may successfully join the two tables using the transformed county names.

## String Methods in DataFrames

In the code above we used array comprehensions to transform each county name. Julia's dot syntax (broadcasting) provides a more concise way to apply a string function to every element of a DataFrame column. First, here is the vector of county names in the `state` table:

In [42]:
print(state.County)

["De Witt County", "Lac qui Parle County", "Lewis and Clark County", "St John the Baptist Parish"]

Adding a `.` after a function name, as in `lowercase.(x)`, broadcasts the function: it is called on each element of the array.

In [47]:
print(lowercase.(state.County))

["de witt county", "lac qui parle county", "lewis and clark county", "st john the baptist parish"]

This allows us to transform each string in the column without writing an explicit loop.

In [48]:
print(clean_county.(state.County))

["dewitt", "lacquiparle", "lewisandclark", "stjohnthebaptist"]
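
For comparison, the following sketch shows that the dot syntax gives the same result as `map` and as an array comprehension. It uses a short illustrative vector rather than the full column, and the variable names are ours.

```julia
# Two of the county names from the `state` table.
counties = ["De Witt County", "Lac qui Parle County"]

with_dot  = lowercase.(counties)               # broadcast (dot syntax)
with_map  = map(lowercase, counties)           # higher-order function
with_loop = [lowercase(c) for c in counties]   # array comprehension

println(with_dot)
println(with_dot == with_map == with_loop)  # all three agree
```

The dot syntax works for any function, including user-defined ones such as `clean_county`.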

We save the transformed counties back into their originating tables:

In [53]:
state[!, :County] = clean_county.(state.County)
population[!, :County] = clean_county.(population.County);

Now, the two tables contain the same string representation of the counties:

In [50]:
state

| Row | County           | State  |
| --- | ---------------- | ------ |
|     | String           | String |
| 1   | dewitt           | IL     |
| 2   | lacquiparle      | MN     |
| 3   | lewisandclark    | MT     |
| 4   | stjohnthebaptist | LA     |


In [51]:
population

| Row | County           | Population |
| --- | ---------------- | ---------- |
|     | String           | Int64      |
| 1   | dewitt           | 16798      |
| 2   | lacquiparle      | 8067       |
| 3   | lewisandclark    | 55716      |
| 4   | stjohnthebaptist | 43044      |


It is simple to join these tables once the counties match.

In [52]:
innerjoin(state, population, on = :County)

| Row | County           | State  | Population |
| --- | ---------------- | ------ | ---------- |
|     | String           | String | Int64      |
| 1   | dewitt           | IL     | 16798      |
| 2   | lacquiparle      | MN     | 8067       |
| 3   | lewisandclark    | MT     | 55716      |
| 4   | stjohnthebaptist | LA     | 43044      |
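
An inner join silently drops rows whose keys do not match, so after cleaning it can be useful to check for leftovers. The sketch below assumes a version of DataFrames.jl that provides `innerjoin` and `antijoin` (0.21 or later). The `cook` row is a hypothetical addition, included only so that there is something unmatched to find.

```julia
using DataFrames

# Small tables with already-cleaned keys; "cook" is a hypothetical extra row.
state = DataFrame(County = ["dewitt", "lacquiparle", "cook"],
                  State = ["IL", "MN", "IL"])
population = DataFrame(County = ["dewitt", "lacquiparle"],
                       Population = [16798, 8067])

# Rows whose County appears in both tables.
matched = innerjoin(state, population, on = :County)

# Rows of `state` with no matching County in `population`.
unmatched = antijoin(state, population, on = :County)

println(matched)
println(unmatched)
```

An empty `unmatched` table suggests that every key in `state` found a partner in `population`. Any rows that remain point to names the cleaning function did not reconcile.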


## Summary

Julia's string methods form a set of simple and useful operations for string manipulation. By using the `.` (broadcast) syntax on a column of a `DataFrame`, we can apply a Julia function to each element of that column.

You may find the complete documentation on Julia's `string` methods [here](https://docs.julialang.org/en/v1/base/strings/).