## Explore The Data: Explore Categorical Features

We use the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

This dataset contains information about 891 people who were on board the Titanic when it sank on April 15th, 1912. As noted in the description on Kaggle's website, some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model to predict which people would survive based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class (1st, 2nd, or 3rd)
- **Sex** (str) - Gender of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**This section focuses on exploring the `Name`, `Sex`, `Ticket`, `Cabin`, and `Embarked` features.**

### Read In Data

In [1]:
import numpy as np
import pandas as pd

titanic_df = pd.read_csv('../Data/titanic.csv')
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [2]:
# Drop the numeric features (and the PassengerId identifier) so only the categorical features remain
cont_features = ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
titanic_df = titanic_df.drop(cont_features, axis=1)
titanic_df.head()

| | Survived | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|---|
| 0 | 0 | Braund, Mr. Owen Harris | male | A/5 21171 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | 1 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | 0 | Allen, Mr. William Henry | male | 373450 | NaN | S |


### Explore Categorical Features

In [3]:
# Check if there are any missing values
titanic_df.isnull().sum()

Survived      0
Name          0
Sex           0
Ticket        0
Cabin       687
Embarked      2
dtype: int64

In [5]:
# Explore the number of unique values for each feature
for col in titanic_df.columns:
    print('{} : {} unique values'.format(col, titanic_df[col].nunique()))

Survived : 2 unique values
Name : 891 unique values
Sex : 2 unique values
Ticket : 681 unique values
Cabin : 147 unique values
Embarked : 3 unique values


Based on the unique-value counts above, the columns fall into two groups:
- those with very few unique values (Survived, Sex, Embarked)
- those with many unique values (Name, Ticket, Cabin)
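The split described here can be made explicit with a cardinality cutoff. The following is a minimal sketch, assuming a hypothetical threshold `max_levels` and a small made-up frame `toy_df` standing in for `titanic_df` (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for titanic_df (made-up rows, not the real data)
toy_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
})

max_levels = 3  # hypothetical cutoff; a larger value would suit the real dataset

# Count distinct values per column, then split the columns on the cutoff
n_unique = toy_df.nunique()
low_cardinality = n_unique[n_unique <= max_levels].index.tolist()
high_cardinality = n_unique[n_unique > max_levels].index.tolist()

print('Few unique values :', low_cardinality)
print('Many unique values:', high_cardinality)
```

Low-cardinality columns can usually be grouped on directly, while high-cardinality columns tend to need some parsing before they become useful.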

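<strong>NOTE</strong>: a quick way to gauge the relationship between a feature and the target variable is to
- group by that feature
- then look at the average value of the target variable within each group

Because `Survived` is coded as 0 or 1, this average is the survival rate of each group.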
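The group-then-average idea can be wrapped in a small helper and applied to several features at once. This is only a sketch: the helper name `target_rate_by` is hypothetical, and `toy_df` is made-up data rather than the Titanic file.

```python
import pandas as pd

# Toy stand-in for titanic_df (made-up rows, not the real data)
toy_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'S', 'C', 'Q'],
})

def target_rate_by(df, feature, target='Survived'):
    """Return the mean of `target` for each level of `feature`."""
    return df.groupby(feature)[target].mean()

# One Series of rates per low-cardinality feature
rates = {col: target_rate_by(toy_df, col) for col in ['Sex', 'Embarked']}
for col, rate in rates.items():
    print(rate, end='\n\n')
```

Selecting the target column before calling `mean()` keeps the aggregation away from text columns, which matters in recent pandas versions.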

In [7]:
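# Check survival rate by gender
# Select Survived explicitly: pandas 2.0+ raises a TypeError if mean() is applied to text columns
titanic_df.groupby('Sex')[['Survived']].mean()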

| Sex | Survived |
|---|---|
| female | 0.742038 |
| male | 0.188908 |


In [8]:
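# Check survival rate by port of embarkation
# Select Survived explicitly: pandas 2.0+ raises a TypeError if mean() is applied to text columns
titanic_df.groupby('Embarked')[['Survived']].mean()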

| Embarked | Survived |
|---|---|
| C | 0.553571 |
| Q | 0.38961 |
| S | 0.336957 |


In [10]:
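# Is Cabin missing at random? Group by whether Cabin is null and compare survival rates
# Select Survived explicitly: pandas 2.0+ raises a TypeError if mean() is applied to text columns
titanic_df.groupby(titanic_df['Cabin'].isnull())[['Survived']].mean()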

| Cabin (is null) | Survived |
|---|---|
| False | 0.666667 |
| True | 0.299854 |


The Cabin output above shows that about 67% of passengers with a recorded cabin survived, compared with about 30% of passengers whose `Cabin` value is missing.
Whether `Cabin` is missing therefore appears to be a strong indicator of survival.
- One hypothesis is that passengers without a recorded cabin literally didn't have a cabin and were perhaps stuck in the bowels of the ship, which is why so few of them survived.
- The reason doesn't really matter, though. In this case, the missing value for cabin carries information.
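One way to keep that information is to turn missingness into its own binary feature. The following is a minimal sketch, assuming a hypothetical column name `Cabin_ind` and a small made-up frame in place of `titanic_df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df (made-up rows, not the real data)
toy_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Cabin': [np.nan, 'C85', 'C123', np.nan],
})

# 1 if a cabin value is present, 0 if it is missing
toy_df['Cabin_ind'] = toy_df['Cabin'].notnull().astype(int)

# Survival rate for each value of the indicator
cabin_rates = toy_df.groupby('Cabin_ind')['Survived'].mean()
print(cabin_rates)
```

An indicator like this lets a model use the presence of a cabin record without having to interpret the cabin strings themselves.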

In [11]:
# Look at unique values for the Ticket feature
titanic_df['Ticket'].value_counts()

347082               7
CA. 2343             7
1601                 7
CA 2144              6
3101295              6
                    ..
392091               1
65304                1
250643               1
A./5. 2152           1
SOTON/O.Q. 392087    1
Name: Ticket, Length: 681, dtype: int64

As for the `Name` field, each name includes a title (such as `Mr` or `Mrs`) that reflects the passenger's status. This may be correlated with the chance of survival.

In [13]:
# Create a title feature by parsing passenger name
titanic_df['Title'] = titanic_df['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
titanic_df.head()

| | Survived | Name | Sex | Ticket | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Braund, Mr. Owen Harris | male | A/5 21171 | NaN | S | Mr |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C | Mrs |
| 2 | 1 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | NaN | S | Miss |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S | Mrs |
| 4 | 0 | Allen, Mr. William Henry | male | 373450 | NaN | S | Mr |
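The split-based parsing relies on every name following the `Last, Title. First` pattern and raises an `IndexError` for a name without a comma. A regex-based alternative, sketched here with `Series.str.extract`, returns `NaN` for names that don't match instead. The first two names come from the output above; the third is a hypothetical malformed value added to show the fallback.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'NameWithoutComma',  # hypothetical malformed entry
])

# Capture the text between the first comma and the next period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles)
```

Whether `NaN` or an exception is preferable depends on how strictly the input should be validated. In the dataset as shown, the split-based version runs without errors.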


In [14]:
# Look at survival rate by title
# index: the columns to group by; aggfunc: count of passengers and mean of Survived per group
titanic_df.pivot_table('Survived', index=['Sex', 'Title'], aggfunc=['count', 'mean'])

| Sex | Title | count (Survived) | mean (Survived) |
|---|---|---|---|
| female | Dr | 1 | 1.0 |
| female | Lady | 1 | 1.0 |
| female | Miss | 182 | 0.697802 |
| female | Mlle | 2 | 1.0 |
| female | Mme | 1 | 1.0 |
| female | Mrs | 125 | 0.792 |
| female | Ms | 1 | 1.0 |
| female | the Countess | 1 | 1.0 |
| male | Capt | 1 | 0.0 |
| male | Col | 2 | 0.5 |


As the data above shows, passengers with female titles mostly had high survival rates. An interesting exception among the male titles is `Master`, with a survival rate of about 57.5%.