# Titanic Kaggle Competition

The sinking of the Titanic is one of the best-known maritime tragedies: a brand-new ship, one of the largest of its time, sank on its maiden voyage across the Atlantic Ocean.

The goal of this project is to build a model that predicts the likelihood of survival from a set of features associated with each person. I intend to try several models to compare how each algorithm performs and to identify the pros and cons of each one.

## Background

The Titanic sank with a majority of the people on board because there were not enough lifeboats for the passengers and crew. Accounts of the sinking describe a "women and children first" policy that prioritized those perceived as most vulnerable. Several other factors may also matter, including socioeconomic status and whether a person was crew or a passenger.

In [1]:
# Import libraries and load the training data
import numpy as np
import pandas as pd

titanic_set = pd.read_csv('./dataset/train.csv')
print(titanic_set.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
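
With the data loaded, the "women and children first" story from the background can be checked by grouping on `Sex` and `Pclass`. The sketch below uses a small hand-made frame so it runs on its own; the `toy` frame and its values are illustrative only, and `titanic_set` would take its place to get real numbers.

```python
import pandas as pd

# Illustrative rows only; replace `toy` with titanic_set for real numbers.
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'female', 'male', 'male'],
    'Pclass':   [3, 1, 3, 1, 3, 1],
    'Survived': [0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_sex_class = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```

A large gap between the groups would suggest that `Sex` and `Pclass` are worth keeping as model features.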


In [2]:
# Looking at the results, we can remove information that cannot easily be quantified.
# The name and ticket are definitely removed. The cabin number is less clear-cut
# because it may indicate location on the Titanic. For now, we also remove the cabin.

usable_fields = titanic_set.columns.drop(['Name','Ticket', 'Cabin'])
usable_titanic_dataset = titanic_set[usable_fields].copy()
print(usable_titanic_dataset.head())

   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0            1         0       3    male  22.0      1      0   7.2500        S
1            2         1       1  female  38.0      1      0  71.2833        C
2            3         1       3  female  26.0      0      0   7.9250        S
3            4         1       1  female  35.0      1      0  53.1000        S
4            5         0       3    male  35.0      0      0   8.0500        S
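
If the cabin turns out to matter, one option is to keep only its leading letter, which is commonly read as the deck, instead of dropping the column. This is a sketch of that idea rather than something this notebook does; the `Deck` column name and the `'U'` placeholder for unknown cabins are my own choices, and the `cabins` frame is a small stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Small stand-in for titanic_set['Cabin'], including missing entries.
cabins = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123', 'E46']})

# Take the first character of the cabin string; missing cabins become 'U'.
cabins['Deck'] = cabins['Cabin'].str[0].fillna('U')
print(cabins)
```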


In [5]:
# Encode the categorical columns as integers: Sex and port of embarkation
male_rows = usable_titanic_dataset['Sex'] == 'male'
female_rows = usable_titanic_dataset['Sex'] == 'female'
usable_titanic_dataset.loc[male_rows, 'Sex'] = 0
usable_titanic_dataset.loc[female_rows, 'Sex'] = 1

S_rows = usable_titanic_dataset['Embarked'] == 'S'
C_rows = usable_titanic_dataset['Embarked'] == 'C'
Q_rows = usable_titanic_dataset['Embarked'] == 'Q'
usable_titanic_dataset.loc[S_rows, 'Embarked'] = 0
usable_titanic_dataset.loc[C_rows, 'Embarked'] = 1
usable_titanic_dataset.loc[Q_rows, 'Embarked'] = 2

print(usable_titanic_dataset.head())

   PassengerId  Survived  Pclass Sex   Age  SibSp  Parch     Fare Embarked
0            1         0       3   0  22.0      1      0   7.2500        0
1            2         1       1   1  38.0      1      0  71.2833        1
2            3         1       3   1  26.0      0      0   7.9250        0
3            4         1       1   1  35.0      1      0  53.1000        0
4            5         0       3   0  35.0      0      0   8.0500        0
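
Integer codes imply an ordering between ports that does not really exist, and some models may pick up on it. A hedged alternative is to one-hot encode `Embarked` with `pd.get_dummies` and map the binary `Sex` column with `map`. The `sample` frame below is a small stand-in for `usable_titanic_dataset`.

```python
import pandas as pd

sample = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# Binary column: a plain 0/1 mapping is enough.
sample['Sex'] = sample['Sex'].map({'male': 0, 'female': 1})

# Nominal column: one indicator column per port.
encoded = pd.get_dummies(sample, columns=['Embarked'], dtype=int)
print(encoded)
```

Tree-based models usually cope with either encoding, while linear models tend to benefit from the indicator columns.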


In [7]:
# Show rows with a missing value in any column other than Age
print(usable_titanic_dataset[usable_titanic_dataset.drop(columns=['Age']).isna().any(axis=1)])

     PassengerId  Survived  Pclass Sex   Age  SibSp  Parch  Fare Embarked
61            62         1       1   1  38.0      0      0  80.0      NaN
829          830         1       1   1  62.0      0      0  80.0      NaN


In [None]:
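from sklearn import tree

# X was not defined in the original cell. Assumption: it uses the same three
# features as NEW_X below, encoded the same way.
X = titanic_set[['Pclass', 'Sex', 'Age']].copy()
X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
# Age becomes a child flag: 1 if under 18, otherwise 0 (missing ages count as 0)
X['Age'] = (X['Age'] < 18).astype(int)
Y = titanic_set['Survived']

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

# Load the test set and encode it the same way as the training features
titanic_set = pd.read_csv('./dataset/test.csv')
NEW_X = titanic_set[['Pclass', 'Sex', 'Age']].copy()
NEW_X['Sex'] = NEW_X['Sex'].map({'male': 0, 'female': 1})
NEW_X['Age'] = (NEW_X['Age'] < 18).astype(int)
print(NEW_X)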
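
Since the goal is to compare several algorithms, it helps to score each one with cross-validation before predicting on the test set. The sketch below does this for a decision tree and a logistic regression on synthetic data from `make_classification`. The data, the choice of second model, and the `models` dictionary are illustrative assumptions; the real `X` and `Y` would be passed in instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Candidate models to compare (hypothetical selection).
models = {
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

Scoring every model on the same folds keeps the comparison fair, and it avoids tuning against the Kaggle test set.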

In [None]:
predictions = clf.predict(NEW_X)

In [None]:
print(titanic_set)
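
To submit, the predictions need to be paired with the test set's `PassengerId` values in a two-column CSV (`PassengerId`, `Survived`). A minimal sketch, assuming the predictions line up row for row with the test set; the ids, the predictions, and the `submission.csv` file name below are placeholders.

```python
import pandas as pd

# Placeholder values standing in for titanic_set['PassengerId'] and predictions.
passenger_ids = [892, 893, 894]
demo_predictions = [0, 1, 0]

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': demo_predictions,
})

# index=False keeps the row index out of the file.
submission.to_csv('submission.csv', index=False)
print(submission)
```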