# 3.4 String Operations

With ordinary Python strings, we can find the length of a string, take its first or last few characters, split it into several strings, and check whether it contains a pattern of characters. Pandas lets us perform the same kinds of string operations on entire columns, so we can extract additional information from textual data.

### About the data

The data used in this notebook describes passengers on the *Titanic*, an ocean liner that set out from Southampton, U.K., to cross the Atlantic Ocean and tragically sank after colliding with an iceberg. For each passenger, the dataset records passenger class, name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, ticket fare, cabin number, and port of embarkation, along with survival status. The dataset is extremely popular among data scientists and works well for demonstrating Pandas concepts.

In [1]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")

## Introduction
In data analysis, textual data is often one of the richest sources of information available to the analyst. Text frequently encodes attributes that can be pulled out into new columns, giving the data additional dimensions that can be aggregated in new, insightful ways. For example:

1. A product description might include information about its color and type, allowing new columns to be created for color and type. This would allow aggregation across new dimensions to learn about the highest selling color or the most common type of product.
2. Data about employees might include information about their hire dates, where the date is saved as a string. From this data, you could create columns for year and month and then find how the rate of new hires has changed over time.
3. Manufacturing data might include codes like `M6C991` whose format has a special meaning, such as "Machine 6, Code 991". Using string operations on this data, you could work out how often each machine reports errors.
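
As a rough illustration of the third example, the sketch below pulls the machine number and error code out of a few made-up codes. The data and the column names `machine` and `error_code` are assumptions for illustration only, and the code relies on the `.str` accessor that is introduced later in this section.

```python
import pandas as pd

# Hypothetical machine codes in the "M<machine>C<code>" format described above.
codes = pd.Series(["M6C991", "M6C104", "M2C991"], name="code")

# Each named group in the regular expression becomes a column in the result.
parts = codes.str.extract(r"M(?P<machine>\d+)C(?P<error_code>\d+)")

# Count how many codes were logged for each machine.
errors_per_machine = parts["machine"].value_counts()
print(errors_per_machine)
```

Note that the extracted values are still strings; they would need to be converted before being treated as numbers.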

Before anything else, let's start by printing out the first five rows of the dataframe.

In [2]:
df.head()

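|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |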


### Checking for strings

Not all of the columns in our dataframe contain textual data. Which ones do you see that *do* contain text?

Pandas infers a data type for each column when it reads the file; numeric columns become integers or floats. You can check the data type that Pandas assigned to each column by using the `.info()` method.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In Pandas, columns of strings are stored with the `object` dtype by default, as seen under the `Dtype` column above. Thus, you can see that the columns `Name`, `Sex`, `Ticket`, `Cabin`, and `Embarked` contain strings.
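
If you would rather list the text columns programmatically than read them off the `.info()` output, something like the following sketch may help. The small `toy` dataframe is made up for illustration, and `is_string_dtype` is used instead of comparing against `object` directly because newer Pandas versions may report a dedicated string dtype.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Small made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
    "Pclass": [3, 3],
})

# Keep the names of the columns that Pandas considers to hold strings.
text_columns = [col for col in toy.columns if is_string_dtype(toy[col])]
print(text_columns)
```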

### The problem with normal string operations
You can select a column that contains strings fairly easily, as seen below:

In [5]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

However, applying string operations to each string in the column is not as straightforward. For example, normally we could take the length of a string by using the `len()` function...

In [7]:
len("I was molded by the darkness")

28

...but we can't use the same function on our column of strings:

In [8]:
len(df['Name'])

891

The code didn't throw any errors, though. Instead of getting the length of each string, we got the length of the Series, that is, the total number of rows in the column. That's not what we wanted. The key point is that a built-in function like `len()` is applied *once to the Series* rather than *once to each row*.
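
The difference can be seen on a two-row Series. This is only a sketch on hand-picked names; the list comprehension is a plain-Python workaround shown for contrast, not the approach recommended in this section.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Dooley, Mr. Patrick"])

# Built-in len() sees a single object, the Series, and counts its rows.
series_length = len(names)

# To get one length per row, len() has to be called on each element.
lengths_per_row = [len(name) for name in names]

print(series_length)    # number of rows
print(lengths_per_row)  # one length per name
```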

### The `.str` accessor object
The `.str` accessor object is an attribute available on columns (Series) that contain strings; it allows string operations to be applied to each row individually. The `.str` accessor object should be written immediately after selecting the column, but it **doesn't have any functionality in itself**. It is merely an object that exposes string methods for the column.

Notice that adding `.str` to the column doesn't return a Series or a dataframe. Instead, we just get an accessor object.

In [13]:
df['Name'].str

<pandas.core.strings.accessor.StringMethods at 0x1de996a9b40>

The `.str` accessor object comes equipped with many methods for applying string operations to each row. Some of the more common ones are shown below, although there are many others.

| Method                        | Description                                                                                                            |
|-------------------------------|------------------------------------------------------------------------------------------------------------------------|
| `.len()`                      | Returns the length of the string.                                                                                      |
| `.contains(text)`             | Returns True if the string contains `text` and False otherwise. `text` is treated as a regular expression by default. |
| `.upper()`                    | Turns all of the characters in the string to uppercase.                                                                |
| `.lower()`                    | Turns all of the characters in the string to lowercase.                                                                |
| `.title()`                    | Capitalizes the first letter of each word in the string and lowercases the rest.                                       |
| `.replace(text, replacement)` | Replaces each occurrence of `text` with `replacement`.                                                                 |
| `.split(character)`           | Turns each string into a list of pieces, split at `character`.                                                         |
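
As a quick sketch of a few of these methods in action, the snippet below runs them on two names taken from the dataset. Passing `regex=False` is a choice made here so that the search text is matched literally; it is not required.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# True where the name contains the literal text "Miss".
has_miss = names.str.contains("Miss", regex=False)

# Uppercase every name.
shouting = names.str.upper()

# Remove the comma after the last name.
no_comma = names.str.replace(",", "", regex=False)

# Split each name into [last name, rest of name].
pieces = names.str.split(", ")

print(has_miss.tolist())
print(shouting.tolist())
print(no_comma.tolist())
print(pieces.tolist())
```

Each call returns a new Series with one result per row; the original `names` Series is left unchanged.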

Each of the methods above is called on the `.str` accessor object rather than on the column directly. For example, we could find the length of each name using the following code:

In [14]:
df['Name'].str.len()

0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

The code above returns the number of characters in each name as a Series object.

We can then save the name lengths into a new column of our dataframe called `NameLength`.

In [23]:
df['NameLength'] = df['Name'].str.len()

Let's print out the dataframe again to see the new column.

In [24]:
df

|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NameLength |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 23 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 51 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 22 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 44 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 24 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S | 21 |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 28 |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S | 40 |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 21 |


### Accessing specific characters of a string by index
We can also use the `.str` accessor object to index or slice each string by position. For example, we might want to simplify the passengers' names and create a column with just the first four letters of each last name (the last name is written first). We can do so by adding square brackets immediately after the `.str` accessor object.

In [26]:
df['Name'].str[0:4]

0      Brau
1      Cumi
2      Heik
3      Futr
4      Alle
       ... 
886    Mont
887    Grah
888    John
889    Behr
890    Dool
Name: Name, Length: 891, dtype: object
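
The brackets accept the same positions and slices as ordinary Python strings. The following sketch, on two names from the dataset, shows a single index, a negative slice, and the `.str.slice()` method, which is an equivalent method form of the bracket syntax.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Dooley, Mr. Patrick"])

# A single position returns one character per row.
first_letter = names.str[0]

# Negative positions count from the end of each string.
last_three = names.str[-3:]

# Method form of names.str[0:4].
first_four = names.str.slice(0, 4)

print(first_letter.tolist())
print(last_three.tolist())
print(first_four.tolist())
```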

### Splitting a string
Splitting strings takes a little more care than the operations above. For example, observe that some of the passengers of the *Titanic* have a second name in parentheses. For these passengers, the name in parentheses is the passenger's own name, whereas the name outside the parentheses is her husband's name.

Let's use the `.str` accessor object and the `.split()` method to split every name in the `Name` column at the left parenthesis character, `(`. For passengers with a name in parentheses, the husband's name becomes the first item of the list and the passenger's own name becomes the second item. Passengers with just one name get a list containing that single name.

Notice that the `.split()` method returns a Series in which each element is a list.

In [27]:
df['Name'].str.split("(")

0                              [Braund, Mr. Owen Harris]
1      [Cumings, Mrs. John Bradley , Florence Briggs ...
2                               [Heikkinen, Miss. Laina]
3        [Futrelle, Mrs. Jacques Heath , Lily May Peel)]
4                             [Allen, Mr. William Henry]
                             ...                        
886                              [Montvila, Rev. Juozas]
887                       [Graham, Miss. Margaret Edith]
888           [Johnston, Miss. Catherine Helen "Carrie"]
889                              [Behr, Mr. Karl Howell]
890                                [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object

If you save this Series back to the dataframe as column `RealName`...

In [28]:
df['RealName'] = df['Name'].str.split("(")

...you can see that the new column `RealName` does not contain names, but rather lists of names.

In [29]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NameLength | RealName |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 23 | [Braund, Mr. Owen Harris] |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 51 | [Cumings, Mrs. John Bradley , Florence Briggs ... |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 22 | [Heikkinen, Miss. Laina] |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 44 | [Futrelle, Mrs. Jacques Heath , Lily May Peel)] |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 24 | [Allen, Mr. William Henry] |
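
If separate columns are more convenient than lists, `.split()` also accepts `expand=True`. The sketch below uses two names from the dataset; it is an alternative to the list-based approach used in this section, not a replacement for it.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# expand=True spreads the pieces across columns instead of packing them into lists.
parts = names.str.split("(", expand=True)
print(parts)
```

Rows with fewer pieces than the widest row are padded with missing values, so the first passenger here has nothing in the second column.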


We can chain another `.str` accessor object onto the result of `.split()` and get the last item of each list by passing in an index of `-1`. This returns the only item for passengers who had one name, and the second item for passengers with two names. Note that the closing parenthesis `)` is still attached to the extracted names.

In [30]:
df['Name'].str.split("(").str[-1]

0                       Braund, Mr. Owen Harris
1                       Florence Briggs Thayer)
2                        Heikkinen, Miss. Laina
3                                Lily May Peel)
4                      Allen, Mr. William Henry
                         ...                   
886                       Montvila, Rev. Juozas
887                Graham, Miss. Margaret Edith
888    Johnston, Miss. Catherine Helen "Carrie"
889                       Behr, Mr. Karl Howell
890                         Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Again, we can save this to a new column and print out the final dataframe.

In [31]:
df['RealName'] = df['Name'].str.split("(").str[-1]
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NameLength | RealName |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|------------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 23 | Braund, Mr. Owen Harris |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 51 | Florence Briggs Thayer) |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 22 | Heikkinen, Miss. Laina |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 44 | Lily May Peel) |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 24 | Allen, Mr. William Henry |
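
The names taken from inside the parentheses still end with a stray `)`. One possible cleanup, sketched here on two names from the dataset, is to chain `.str.rstrip(")")` onto the same expression; this extra step is a suggestion and is not part of the notebook's original code.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

real_names = (
    names.str.split("(")  # split at the left parenthesis
    .str[-1]              # keep the last piece of each list
    .str.rstrip(")")      # drop the leftover closing parenthesis
)
print(real_names.tolist())
```

Names without parentheses pass through unchanged because they have no trailing `)` to strip.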
