# Kaggle Titanic

In [1]:
import numpy as np
import pandas as pd
import re

## Dataset Observations

Import the dataset from `train.csv`.

In [2]:
titanic = pd.read_csv("train.csv")

In [3]:
titanic

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


With the dataset imported, pandas can be used to make some initial observations, starting with the number of missing values in each column.

In [4]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

This shows that the only features with null values are age, cabin number, and port of embarkation. Out of 891 passengers, 177 (about 20%) are missing an age, 687 (about 77%) are missing a cabin number, and 2 are missing an embarkation port. (The recorded cabin numbers belong mostly to first-class passengers.)
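The proportions quoted above can be checked directly rather than worked out by hand. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a small hand-built sample of the rows displayed earlier in place of the full `train.csv`:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isna().sum()
    percent = (df.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually contain missing values
    return summary[summary["missing"] > 0]


# Small sample taken from the first five rows displayed above (not the full dataset)
sample = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1, 3],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
        "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    }
)

summary = missing_summary(sample)
print(summary)

# Fraction of passengers in each class that have a recorded cabin
cabin_by_class = sample["Cabin"].notna().groupby(sample["Pclass"]).mean()
print(cabin_by_class)
```

On the full dataset, `missing_summary(titanic)` should report roughly 19.9% for Age and 77.1% for Cabin, given the counts of 177 and 687 out of 891 shown above. The per-class breakdown is one way to check how the missing cabin numbers are distributed across passenger classes.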

## Parsing the Names

Based on discussion in the forums, the name can be divided further into family name, title, name/married name, and maiden name (in the case of a married woman).

The names of married women are the most complicated. For example, in the name "Futrelle, Mrs. Jacques Heath (Lily May Peel)":
    "Futrelle" is the family name
    "Mrs" is the title (indicating a married woman)
    "Jacques Heath" is the married name (which is the name of the husband)
    "(Lily May Peel)" is the maiden name
    
Another notable feature of the names is nicknames, which are placed between double quotes, e.g. "Nellie".
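The breakdown described above can be sketched with regular expressions. This is only an illustration: the helper `parse_name`, the pattern `NAME_PATTERN`, and the dictionary keys are assumptions, not part of the notebook, and the name in parentheses is treated as the maiden name purely on the basis of the example above.

```python
import re

# Pattern for "Family, Title. Rest"; the title is everything up to the first period
NAME_PATTERN = re.compile(r"^(?P<family>[^,]+), (?P<title>[^.]+)\. ?(?P<rest>.*)$")


def parse_name(full_name: str) -> dict:
    """Split a passenger name into its components.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    match = NAME_PATTERN.match(full_name)
    if match is None:
        raise ValueError(f"Unexpected name format: {full_name!r}")
    rest = match.group("rest")

    # Nicknames appear between double quotes
    nickname = re.search(r'"([^"]*)"', rest)
    # A name in parentheses (the maiden name in the example above)
    bracketed = re.search(r"\(([^)]*)\)", rest)

    # What remains after removing both is the given (or husband's) name
    given = re.sub(r'"[^"]*"|\([^)]*\)', "", rest)
    given = " ".join(given.split())

    return {
        "family": match.group("family"),
        "title": match.group("title"),
        "given": given,
        "maiden": bracketed.group(1) if bracketed else None,
        "nickname": nickname.group(1) if nickname else None,
    }


married = parse_name("Futrelle, Mrs. Jacques Heath (Lily May Peel)")
nicknamed = parse_name('Johnston, Miss. Catherine Helen "Carrie"')
print(married)
print(nicknamed)
```

Returning `None` for absent parts keeps the missing pieces explicit, which may be easier to handle later than empty strings.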


In [5]:
# Dictionary of columns for the parsed-name DataFrame
names = {
    "family": [],
    "title": [],
    "name": [],
}

for full_name, sex in zip(titanic["Name"], titanic["Sex"]):
    # Names have the form "Family, Title. Rest". Split only once on each
    # delimiter so that a later ". " (e.g. after an initial) does not cut off the rest.
    family, remainder = full_name.split(", ", 1)
    title, rest = remainder.split(". ", 1)

    names["family"].append(family)
    names["title"].append(title)

    # Get the person's name (location depends on gender and married status)
    if sex == "male":
        names["name"].append(rest)
    else:
        # A married woman's own name is given in parentheses
        maiden_name = re.findall(r"\((.*?)\)", rest)
        if maiden_name:
            names["name"].append(maiden_name[0])
        else:
            names["name"].append(rest)

namedf = pd.DataFrame(data=names)

In [8]:
namedf

| | family | title | name |
|---|---|---|---|
| 0 | Braund | Mr | Owen Harris |
| 1 | Cumings | Mrs | Florence Briggs Thayer |
| 2 | Heikkinen | Miss | Laina |
| 3 | Futrelle | Mrs | Lily May Peel |
| 4 | Allen | Mr | William Henry |
| ... | ... | ... | ... |
| 886 | Montvila | Rev | Juozas |
| 887 | Graham | Miss | Margaret Edith |
| 888 | Johnston | Miss | Catherine Helen "Carrie" |
| 889 | Behr | Mr | Karl Howell |
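The same split can also be expressed without an explicit loop by using the pandas string accessor. The sketch below is an alternative, not the notebook's method: it runs on a few names copied from the table above, and it takes the name in parentheses for every passenger rather than only for women, which is an assumption made to keep the example short.

```python
import pandas as pd

# Small sample of names copied from the table above (not the full dataset)
sample_names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
        "Montvila, Rev. Juozas",
    ],
    name="Name",
)

# Named groups become the column names of the resulting DataFrame
parts = sample_names.str.extract(
    r"^(?P<family>[^,]+), (?P<title>[^.]+)\. (?P<rest>.*)$"
)

# Prefer the name in parentheses when present, otherwise keep the rest as is
in_parens = parts["rest"].str.extract(r"\((.*?)\)", expand=False)
parts["name"] = in_parens.fillna(parts["rest"])

sample_namedf = parts[["family", "title", "name"]]
print(sample_namedf)
```

A vectorized version like this tends to be shorter, while the explicit loop makes it easier to add special cases such as the check on `Sex`.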


## Create and Export Updated Dataset

Create a new dataset to hold the updated data, starting from a deep copy of the original.

In [11]:
updated_titanic = titanic.copy(deep=True)

In [12]:
updated_titanic

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
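At this point the copy still holds only the original columns. One possible way to attach the parsed name columns and export the result is sketched below. The two small frames stand in for `titanic` and `namedf`, and the output filename mentioned in the comments is an assumption; the notebook does not specify one.

```python
import io

import pandas as pd

# Stand-ins for `titanic` and `namedf`, built from two rows shown above
titanic_sample = pd.DataFrame(
    {
        "PassengerId": [1, 4],
        "Name": [
            "Braund, Mr. Owen Harris",
            "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        ],
        "Sex": ["male", "female"],
    }
)
namedf_sample = pd.DataFrame(
    {
        "family": ["Braund", "Futrelle"],
        "title": ["Mr", "Mrs"],
        "name": ["Owen Harris", "Lily May Peel"],
    }
)

# Copy first so the original frame is left untouched, then append the parsed
# columns. Both frames share the default 0..n-1 index, so a column-wise
# concat lines the rows up correctly.
updated_sample = pd.concat(
    [titanic_sample.copy(deep=True), namedf_sample], axis=1
)

# Export without the index column. A StringIO buffer stands in for a file
# path here; with a real file this might look like
# updated_titanic.to_csv("titanic_updated.csv", index=False)  # assumed filename
buffer = io.StringIO()
updated_sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` avoids writing the row index as an extra column, which would otherwise reappear as an `Unnamed: 0` column when the file is read back with `pd.read_csv`.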
