# Workshop Lecture 7, Exercise 2
Much of the data we deal with contains strings, i.e., text data (names, addresses, etc.). Often, such data is not in the format needed for analysis, and we have to perform additional string manipulation to extract the exact pieces we need. This can be achieved using the pandas string methods.
To illustrate, we use the Titanic data set for this exercise.
1. Load the Titanic data and restrict the sample to men. (This simplifies the task: women in this data set have more complicated names, which often contain both their husband’s and their maiden name.)
2. Print the first five observations of the Name column. As you can see, the data is stored in the format “Last name, Title First name” where title is something like Mr., Rev., etc.
3. Split the Name column at the comma (,) to extract the last name and the remainder as separate columns. You can achieve this using the partition() string method.
4. Split the remainder (containing the title and first name) using the space character " " as separator to obtain individual columns for the title and the first name.
5. Store the three data series in the original DataFrame (using the column names FirstName, LastName and Title) and delete the Name column, which is no longer needed.
6. Finally, extract the ship deck from the values in Cabin. The ship deck is the first character in the string stored in Cabin (A, B, C, ...). You can extract the first character using the get() string method. Store the result in the column Deck.

Hint: pandas string methods can be accessed using the .str attribute. For example, to partition the values in the column Name, use
df['Name'].str.partition()
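
To make the hint concrete, here is a minimal sketch of the two string methods used in this exercise. The Series `s` and its two names are made up for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Toy data, invented for illustration (not from the Titanic file)
s = pd.Series(["Smith, Mr. John", "Doe, Dr. Alan"])

# partition() splits each string at the FIRST occurrence of the separator
# and returns three columns: text before, the separator itself, text after
parts = s.str.partition(",")
print(parts)

# get(0) returns the character at position 0 of each string
first_char = s.str.get(0)
print(first_char)
```

Note that the text after the comma still starts with a space, which is why stripping whitespace is usually the next step.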


In [96]:
import pandas as pd

#load the Titanic data set (read_csv already returns a DataFrame)
DATA_PATH = '/Users/lilapfageraas/Downloads/nhh/tech2/TECH2-H24/data'
df = pd.read_csv(f'{DATA_PATH}/titanic.csv')

In [97]:
#let's have a look at the dataframe to see what we are dealing with
df.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1,C123,S


In [98]:
#let's use .loc to create a copy of the dataframe with only males
df = df.loc[df['Sex'] == 'male'].copy()
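
The filtering step combines a boolean mask with an explicit copy. The sketch below shows the same pattern on a small invented DataFrame (`toy`, `mask`, `men` and the `Flag` column are hypothetical names); the point is that the copy lets us add columns later without touching the original data.

```python
import pandas as pd

# Invented toy data for illustration
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
})

# The comparison yields a boolean Series, one True/False per row
mask = toy["Sex"] == "male"

# .loc keeps the rows where the mask is True; .copy() makes the result
# an independent DataFrame
men = toy.loc[mask].copy()

# Adding a column to the copy leaves the original untouched
men["Flag"] = 1
print(men)
print(toy.columns.tolist())
```

The filtered frame keeps the original row labels (here 0 and 2), which is why the index in the notebook output jumps from 0 to 4.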

In [99]:
#inspect the name column
df['Name'].head()

0          Braund, Mr. Owen Harris
4         Allen, Mr. William Henry
5                 Moran, Mr. James
6          McCarthy, Mr. Timothy J
7    Palsson, Master Gosta Leonard
Name: Name, dtype: object

In [100]:
#split the name column by ,
names = df['Name'].str.partition(sep=',')
names.head()


Unnamed: 0,0,1,2
0,Braund,",",Mr. Owen Harris
4,Allen,",",Mr. William Henry
5,Moran,",",Mr. James
6,McCarthy,",",Mr. Timothy J
7,Palsson,",",Master Gosta Leonard
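
partition() always returns the separator as its own middle column. If that column is not needed, one possible alternative (not what the exercise asks for) is split() with `n=1` and `expand=True`, sketched here on two names from the output above.

```python
import pandas as pd

# Two names copied from the output above
toy = pd.Series(["Braund, Mr. Owen Harris", "Moran, Mr. James"])

# Split at most once at the comma and expand the pieces into columns;
# unlike partition(), the separator is dropped
pieces = toy.str.split(pat=",", n=1, expand=True)
print(pieces)
```

As with partition(), the second column still carries a leading space and needs to be stripped.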


In [101]:
#make a series with just the last names (surrounding whitespace stripped)
last_name = names[0].str.strip()
last_name.head()

0      Braund
4       Allen
5       Moran
6    McCarthy
7     Palsson
Name: 0, dtype: object

In [102]:
#extract the title and first name: strip the remainder, then partition at the first space
title_first = names[2].str.strip().str.partition(sep=' ')
title_first.head()

Unnamed: 0,0,1,2
0,Mr.,,Owen Harris
4,Mr.,,William Henry
5,Mr.,,James
6,Mr.,,Timothy J
7,Master,,Gosta Leonard


In [103]:
#extract the title and strip it
title = title_first[0].str.strip()
title.head()

0       Mr.
4       Mr.
5       Mr.
6       Mr.
7    Master
Name: 0, dtype: object

In [104]:
#extract the first name and strip it
first_name = title_first[2].str.strip()
first_name.head()

0      Owen Harris
4    William Henry
5            James
6        Timothy J
7    Gosta Leonard
Name: 2, dtype: object
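
The three partition-and-strip steps could also be collapsed into a single regular expression with str.extract(). This is an alternative sketch, not the method the exercise prescribes. The pattern is an assumption that holds for the simple “Last name, Title First name” layout with a one-word title and may fail on more complex names.

```python
import pandas as pd

# Two names copied from the output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Palsson, Master Gosta Leonard",
])

# Named groups become column names in the result:
#   LastName  = everything before the comma
#   Title     = first whitespace-free token after the comma
#   FirstName = the rest of the string
pattern = r"^(?P<LastName>[^,]+),\s*(?P<Title>\S+)\s+(?P<FirstName>.*)$"
parsed = names.str.extract(pattern)
print(parsed)
```

Rows that do not match the pattern would come back as NaN in all three columns, so it is worth checking for missing values after extracting.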

In [105]:
#store the new columns in the dataframe, using the column names requested in the exercise
df['Title'] = title
df['FirstName'] = first_name
df['LastName'] = last_name

#delete the original name column
del df['Name']

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,FirstName,LastName
0,1,0,3,male,22.0,A/5 21171,7.25,,S,Mr.,Owen Harris,Braund
4,5,0,3,male,35.0,373450,8.05,,S,Mr.,William Henry,Allen
5,6,0,3,male,,330877,8.4583,,Q,Mr.,James,Moran
6,7,0,1,male,54.0,17463,51.8625,E46,S,Mr.,Timothy J,McCarthy
7,8,0,3,male,2.0,349909,21.075,,S,Master,Gosta Leonard,Palsson


In [113]:
#extract the deck: strip whitespace, then take the first character of Cabin
#(passengers without a cabin entry stay NaN)
deck = df['Cabin'].str.strip().str.get(0)
deck.head()

0    NaN
4    NaN
5    NaN
6      E
7    NaN
Name: Cabin, dtype: object

In [114]:
#store the deck in the dataframe
df['Deck'] = deck

In [115]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Title,FirstName,LastName,Deck
0,1,0,3,male,22.0,A/5 21171,7.25,,S,Mr.,Owen Harris,Braund,
4,5,0,3,male,35.0,373450,8.05,,S,Mr.,William Henry,Allen,
5,6,0,3,male,,330877,8.4583,,Q,Mr.,James,Moran,
6,7,0,1,male,54.0,17463,51.8625,E46,S,Mr.,Timothy J,McCarthy,E
7,8,0,3,male,2.0,349909,21.075,,S,Master,Gosta Leonard,Palsson,
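
Most passengers have no cabin recorded, so most Deck values are missing. The sketch below uses an invented `cabin` Series (not the real data) to show that the string methods pass missing values through unchanged, and that value_counts() can include them in a tally.

```python
import numpy as np
import pandas as pd

# Invented cabin values for illustration, including a missing entry
# and one entry with a stray leading space
cabin = pd.Series(["E46", np.nan, "C85", " B5"])

# strip() and get() skip over NaN instead of raising an error
deck = cabin.str.strip().str.get(0)
print(deck)

# Indexing via .str[0] is an equivalent shorthand for .str.get(0)
deck_alt = cabin.str.strip().str[0]

# Count passengers per deck, keeping the missing values as their own row
counts = deck.value_counts(dropna=False)
print(counts)
```

Without strip(), the last entry would yield a space rather than the deck letter B.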
