## Dataset

We're about to work with the Titanic dataset[1]. From the dataset documentation:

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the Titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.

For more information about how this dataset was constructed: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt

The dataset itself can be downloaded from:
https://www.openml.org/search?type=data&sort=runs&id=40945

[1] Author: Frank E. Harrell Jr., Thomas Cason Source: Vanderbilt Biostatistics

### Overview

- PassengerId is the unique id of the row and has no effect on the target
- Survived is the target variable we are trying to predict (0 or 1):
  - 1 = Survived
  - 0 = Not Survived
- Pclass (Passenger Class) is the socio-economic status of the passenger and it is a categorical ordinal feature which has 3 unique values (1, 2 or 3):
  - 1 = Upper Class
  - 2 = Middle Class
  - 3 = Lower Class
- Name, Sex and Age are self-explanatory
- SibSp is the number of the passenger's siblings and spouses aboard
- Parch is the number of the passenger's parents and children aboard
- Ticket is the ticket number of the passenger
- Fare is the passenger fare
- Cabin is the cabin number of the passenger
- Embarked is port of embarkation and it is a categorical feature which has 3 unique values (C, Q or S):
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton
- boat is the lifeboat the passenger was on (if survived)
- body is the body number (if the passenger did not survive and the body was recovered)
- home.dest is the passenger's home/destination
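
The coded values listed above can be decoded into readable labels for inspection. The snippet below is a minimal sketch on a hypothetical toy frame (the rows are made up, and the lowercase column names are assumed to match the ARFF attributes used later in this notebook).

```python
import pandas as pd

# Toy rows for illustration only; these are not real passengers.
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "embarked": ["S", "C", "Q"],
    "survived": [1, 0, 1],
})

# Label tables taken from the overview list.
PCLASS_LABELS = {1: "Upper", 2: "Middle", 3: "Lower"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Map the codes to labels without modifying the original frame.
decoded = toy.assign(
    pclass=toy["pclass"].map(PCLASS_LABELS),
    embarked=toy["embarked"].map(EMBARKED_LABELS),
)
print(decoded)
```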

In [None]:
%%capture
!wget -O titanic.arff https://www.openml.org/data/download/16826755/phpMYEkMl

In [None]:
%%capture
!pip install liac-arff
!pip install pandas --upgrade
!pip install seaborn --upgrade

In [None]:
import numpy as np
import pandas as pd
import arff

# load the ARFF file; the context manager closes the file handle afterwards
with open('titanic.arff', 'r') as f:
    raw_data = arff.load(f)

In [None]:
# build the DataFrame from the raw rows (column names are assigned in the next cell)
df = pd.DataFrame(raw_data['data'])

In [None]:
# each attribute is a (name, type) pair; use the names as column headers
df.columns = [name for name, _ in raw_data['attributes']]

### 1. Missing Values

In [None]:
df.isnull().sum()
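
Raw counts are easier to judge next to the share of rows they represent. The following is a small sketch on a hypothetical toy frame (values are made up) that reports both the count and the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "embarked": ["S", "C", None, "S"],
})

# isnull().sum() counts missing entries; isnull().mean() gives their share per column.
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "fraction": toy.isnull().mean(),
}).sort_values("fraction", ascending=False)
print(missing)
```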

In [None]:
# fill the two missing embarked values with the most frequent value, which is "S"
df["embarked"] = df["embarked"].fillna("S")
# fill the missing fare values with the median fare
df["fare"] = df["fare"].fillna(df["fare"].median())

In [None]:
# get mean, std, and number of NaN values of the age column
average_age_titanic   = df["age"].mean()
std_age_titanic       = df["age"].std()
count_nan_age_titanic = df["age"].isnull().sum()

# draw random integer ages in [mean - std, mean + std); the bounds are cast to int explicitly
rand_age = np.random.randint(int(average_age_titanic - std_age_titanic), int(average_age_titanic + std_age_titanic), size=count_nan_age_titanic)

# keep the observed ages so they can be compared with the imputed column later
original_age_values = df['age'].dropna().astype(int)

# replace NaN values with the random ages
df.loc[df['age'].isnull(), 'age'] = rand_age

df['age'] = df['age'].astype(int)
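
Because the imputed ages are random, the result changes on every run. The sketch below shows one way to make the same idea reproducible with a seeded NumPy generator. It runs on a hypothetical toy series, and the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Seeded generator so the imputation is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)

# Toy ages for illustration only; NaN marks a missing age.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Integer bounds one standard deviation either side of the mean.
mean, std = ages.mean(), ages.std()
low, high = int(mean - std), int(mean + std)
n_missing = int(ages.isnull().sum())

# Fill only the missing positions with integers drawn from [low, high).
filled = ages.copy()
filled.loc[filled.isnull()] = rng.integers(low, high, size=n_missing)
filled = filled.astype(int)
print(filled.tolist())
```

Re-running the cell with the same seed yields the same imputed values, which keeps downstream plots and model scores comparable between runs.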

### Converting Types

In [None]:
df['survived'] = df['survived'].astype(int)
df['fare'] = df['fare'].astype(int)

### 2. Dropping unused features

In [None]:
# drop unnecessary columns, these columns won't be useful in analysis and prediction
df = df.drop(['name','ticket'], axis=1)

# Cabin has a lot of NaN values, so it is unlikely to contribute much to prediction
df.drop("cabin", axis=1, inplace=True)

# boat leaks the target: a non-null value in boat means the passenger survived,
# and vice versa
df.drop("boat", axis=1, inplace=True)
# Same for body: a non-null value means the passenger did not survive
df.drop("body", axis=1, inplace=True)

# We have a lot of missing values for this feature, and it has no/poor predictive power
df.drop("home.dest", axis=1, inplace=True)
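
The reasoning for dropping boat and body can be checked with a cross-tabulation before the columns are removed. This is a sketch on a hypothetical toy frame built so that the relationship is exact; the real data may contain exceptions.

```python
import pandas as pd

# Toy frame for illustration only; values are made up.
toy = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "boat": ["2", "11", None, None, None],
    "body": [None, None, 135.0, None, 22.0],
})

# A feature that is only populated for one outcome reveals the target.
has_boat = toy["boat"].notnull()
has_body = toy["body"].notnull()

boat_table = pd.crosstab(has_boat, toy["survived"])
body_table = pd.crosstab(has_body, toy["survived"])
print(boat_table)
print(body_table)
```

If almost all of the mass sits on one diagonal of the table, the column encodes the outcome and should not be used as a predictor.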

### 3. Some Data Visualisations

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#### Embark location

In [None]:
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))

# plot
sns.countplot(x='embarked', data=df, ax=axis1)
sns.countplot(x='survived', hue='embarked', data=df, ax=axis2)

# group by embarked, and get the mean for survived passengers for each value in Embarked
embark_perc = df[["embarked", "survived"]].groupby(['embarked'],as_index=False).mean()
sns.barplot(x='embarked', y='survived', data=embark_perc,order=['S','C','Q'],ax=axis3)

#### Fares

In [None]:
# get fares of passengers who survived and of those who did not
fare_not_survived = df["fare"][df["survived"] == 0]
fare_survived     = df["fare"][df["survived"] == 1]

# get average and std for fare of survived/not survived passengers
average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])

# plot
df['fare'].plot(kind='hist', figsize=(15,3),bins=100, xlim=(0,300))

average_fare.index.names = std_fare.index.names = ["Survived"]
average_fare.plot(yerr=std_fare,kind='bar',legend=False)

#### Age

In [None]:
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')

original_age_values.hist(bins=70, ax=axis1)
df['age'].hist(bins=70, ax=axis2)

In [None]:
# continue plotting the age column

# peaks for survived/not survived passengers by their age
facet = sns.FacetGrid(df, hue="survived",aspect=4)
# fill=True replaces the deprecated shade=True in recent seaborn versions
facet.map(sns.kdeplot,'age',fill=True)
facet.set(xlim=(0, df['age'].max()))
facet.add_legend()

# average survived passengers by age
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
average_age = df[["age", "survived"]].groupby(['age'],as_index=False).mean()
sns.barplot(x='age', y='survived', data=average_age)

#### Family (created feature)

In [None]:
# Instead of keeping two columns, parch & sibsp,
# we can use a single column that indicates whether the passenger had any family member aboard,
# i.e. whether having any family member (parent, sibling, etc.) affects the chances of survival.
# 1 = at least one family member aboard, 0 = travelling alone
df['family'] = ((df["parch"] + df["sibsp"]) > 0).astype(int)

# drop parch & sibsp
df = df.drop(['sibsp','parch'], axis=1)

In [None]:
# plot
fig, (axis1,axis2) = plt.subplots(1,2,sharex=True,figsize=(10,5))

sns.countplot(x='family', data=df, order=[1,0], ax=axis1)

# average of survived for those who had/didn't have any family member
family_perc = df[["family", "survived"]].groupby(['family'],as_index=False).mean()

sns.barplot(x='family', y='survived', data=family_perc, order=[1,0], ax=axis2)

axis1.set_xticklabels(["With Family","Alone"], rotation=0)

#### Person (created feature)

In [None]:
# Sex

# As we see, children (age < ~16) aboard seem to have a high chance of survival.
# So, we can classify passengers as male, female, or child
df['person'] = df[['sex', 'age']].apply(lambda row: 'child' if row['age'] < 16 else row['sex'], axis=1)
# No need to keep the sex column since we created the person column
df.drop(['sex'],axis=1,inplace=True)


In [None]:
# First plot person column chances of survival
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,5))

sns.countplot(x='person', data=df, ax=axis1)
# average of survived for each Person(male, female, or child)
person_perc = df[["person", "survived"]].groupby(['person'],as_index=False).mean()
sns.barplot(x='person', y='survived', data=person_perc, ax=axis2, order=['male','female','child'])

#### PClass

In [None]:
# Pclass

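# factorplot was removed from recent seaborn versions; catplot with kind='point' is its replacement
sns.catplot(x='pclass', y='survived', kind='point', order=[1,2,3], data=df, height=5)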

### Dummy Encoding

In [None]:
#Embarked

df = pd.get_dummies(df, columns = ['embarked'])

In [None]:
#Person

# create dummy variables for Person column
df  = pd.get_dummies(df, columns = ['person'])

In [None]:
# create dummy variables for pclass
df  = pd.get_dummies(df, columns = ['pclass'])

In [None]:
print(df.info())
df.head()
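
The three encoding cells above can also be written as a single `pd.get_dummies` call. The sketch below shows this on a hypothetical toy frame (values are made up); it is an alternative formulation, not a change to the notebook's result.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "embarked": ["S", "C", "Q"],
    "person": ["male", "female", "child"],
    "pclass": [1, 2, 3],
    "fare": [7, 71, 8],
})

# One call encodes all listed columns; untouched columns such as fare are kept as they are.
encoded = pd.get_dummies(toy, columns=["embarked", "person", "pclass"])
print(list(encoded.columns))
```

Passing `drop_first=True` drops one level per encoded feature, which can be useful for linear models that are sensitive to perfectly collinear inputs.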