<img src='resources/images/titanic-disaster.jpg' height=900 width=600>

Imports [pandas](https://pandas.pydata.org/) and the [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class from the [tree module](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree) of the [scikit-learn library](http://scikit-learn.org/stable/documentation.html)

In [85]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

Loads the train and test CSV files into pandas data frames

In [86]:
train_df_raw = pd.read_csv('resources/datasets/train.csv')
test_df_raw = pd.read_csv('resources/datasets/test.csv')

Shows the first **five** rows of the train data frame

In [87]:
train_df_raw.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Shows the first **five** rows of the test data frame

In [88]:
test_df_raw.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## Data Wrangling

Lists the columns treated as **irrelevant** for this model (`Name`, `Ticket` and `Cabin`) and removes them from both data frames

In [89]:
irrelevant_columns = ['Name', 'Ticket', 'Cabin']
train_df_raw.drop(irrelevant_columns, axis=1, inplace=True)
test_df_raw.drop(irrelevant_columns, axis=1, inplace=True)

Shows the first **five** rows of the train data frame after removing the **irrelevant columns**

In [90]:
train_df_raw.head()

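| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |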


Shows the same for the **test data frame**

In [91]:
test_df_raw.head()

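| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |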


Creates new variables for the **train** and **test** data frames in which the categorical features (`Sex` and `Embarked`) are transformed into binary indicator columns

In [92]:
train_df = pd.get_dummies(train_df_raw)
test_df = pd.get_dummies(test_df_raw)
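Encoding the two data frames separately only yields matching columns when every category appears in both files. The sketch below is not part of the original notebook: it uses small made-up frames (`toy_train`, `toy_test`) to show one way of aligning the test columns to the train columns with `reindex`. Passing `dtype=int` is an assumption made here to keep 0/1 indicators, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frames: the test frame never contains port 'C'.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Sex": ["male", "female"],
                         "Embarked": ["S", "Q"]})

# One-hot encode each frame independently.
toy_train_enc = pd.get_dummies(toy_train, dtype=int)
toy_test_enc = pd.get_dummies(toy_test, dtype=int)

# Align the test columns to the train columns; indicator columns that are
# missing from the test frame are created and filled with 0.
toy_test_enc = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(toy_test_enc.columns.tolist())
```

In this notebook the real files happen to produce the same indicator columns, so the alignment step would be a no-op there.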

Shows the first five rows of the **new train data frame**. For example, the first passenger was **male** (Sex_male = 1) and embarked at **Southampton** (Embarked_S = 1).

In [93]:
train_df.head()

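| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 0 | 1 |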


Shows the first five rows of the **new test data frame**

In [94]:
test_df.head()

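| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 34.5 | 0 | 0 | 7.8292 | 0 | 1 | 0 | 1 | 0 |
| 1 | 893 | 3 | 47.0 | 1 | 0 | 7.0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 894 | 2 | 62.0 | 0 | 0 | 9.6875 | 0 | 1 | 0 | 1 | 0 |
| 3 | 895 | 3 | 27.0 | 0 | 0 | 8.6625 | 0 | 1 | 0 | 0 | 1 |
| 4 | 896 | 3 | 22.0 | 1 | 1 | 12.2875 | 1 | 0 | 0 | 0 | 1 |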


Counts the **missing values** in each column of the train data frame, sorted from most to fewest

In [95]:
train_df.isnull().sum().sort_values(ascending=False)

Age            177
Embarked_S       0
Embarked_Q       0
Embarked_C       0
Sex_male         0
Sex_female       0
Fare             0
Parch            0
SibSp            0
Pclass           0
Survived         0
PassengerId      0
dtype: int64

Counts the **missing values** in each column of the test data frame

In [96]:
test_df.isnull().sum().sort_values(ascending=False)

Age            86
Fare            1
Embarked_S      0
Embarked_Q      0
Embarked_C      0
Sex_male        0
Sex_female      0
Parch           0
SibSp           0
Pclass          0
PassengerId     0
dtype: int64

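Fills the missing **Age** values in the train data frame with the column mean, then checks that no missing values remain

In [97]:
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())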
train_df.isnull().sum().sort_values(ascending=False)

Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Fare           0
Parch          0
SibSp          0
Age            0
Pclass         0
Survived       0
PassengerId    0
dtype: int64

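Fills the missing **Age** and **Fare** values in the test data frame with the respective column means, then checks again

In [98]:
# Assign the results back instead of using fillna(inplace=True) on a column selection.
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())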
test_df.isnull().sum().sort_values(ascending=False)

Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Fare           0
Parch          0
SibSp          0
Age            0
Pclass         0
PassengerId    0
dtype: int64
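The cells above fill the gaps in each data frame with that frame's own mean. A common alternative, shown here only as a sketch on made-up data, is to compute the statistic once on the train data and reuse it for the test data, so that both frames are imputed with the same value. The names `toy_train`, `toy_test` and `train_age_mean` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with one missing age each.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Mean of the observed train ages; NaN values are skipped by default.
train_age_mean = toy_train["Age"].mean()

# Reuse the train statistic for both frames.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)

print(toy_train["Age"].tolist(), toy_test["Age"].tolist())
```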

## Creating and fitting model

Separates the **features** and the **target** from the **train data frame** into two new variables

In [100]:
features = train_df.drop('Survived', axis=1)
target = train_df['Survived']
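Note that `features` as built above still contains `PassengerId`, which is a row identifier rather than a property of the passenger. The notebook keeps it, and the test data frame carries the same column, so the prediction step works as written. The sketch below, on a made-up frame, shows how the identifier could be left out if desired; `features_no_id` is a hypothetical name, and the same columns would then have to be selected from the test data before predicting.

```python
import pandas as pd

# Hypothetical miniature of the encoded train data frame.
toy_train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex_male": [1, 0, 0],
})

# Drop both the target and the identifier from the feature matrix.
features_no_id = toy_train_df.drop(columns=["PassengerId", "Survived"])
toy_target = toy_train_df["Survived"]

print(features_no_id.columns.tolist())
```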

Creates and fits the **model**

In [108]:
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(features, target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
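Once a tree is fitted, its `feature_importances_` attribute gives a rough idea of which columns drive the splits. The following is a minimal sketch on made-up data, not output from this notebook: `toy_features` and `toy_target` are hypothetical, and the target is constructed so that it depends only on `Sex_male`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: survival is fully determined by Sex_male here.
toy_features = pd.DataFrame({
    "Sex_male": [1, 0, 0, 1, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 2, 2, 1, 3],
})
toy_target = pd.Series([0, 1, 1, 0, 1, 0, 1, 0])

toy_model = DecisionTreeClassifier(max_depth=5, random_state=0)
toy_model.fit(toy_features, toy_target)

# Pair each importance with its column name and sort in descending order.
importances = pd.Series(toy_model.feature_importances_,
                        index=toy_features.columns).sort_values(ascending=False)
print(importances)
```

The same two lines that build `importances` could be applied to `model` and `features` from the cells above.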

## Score model

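Shows the **accuracy** of the model on the train data it was fitted on. Because the tree has already seen these rows, this value is not an estimate of the accuracy on new data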

In [109]:
model.score(features, target)

0.8507295173961841
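The score above is computed on the same rows used for fitting. To estimate how the model behaves on unseen passengers, part of the labelled data can be held out, or cross-validation can be used. The sketch below illustrates both with synthetic data from `make_classification`, which is an assumption made so that the example runs without the CSV files; in the notebook, `features` and `target` would take the place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 25% of the labelled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

holdout_model = DecisionTreeClassifier(max_depth=5, random_state=0)
holdout_model.fit(X_fit, y_fit)

train_accuracy = holdout_model.score(X_fit, y_fit)    # rows seen during fitting
holdout_accuracy = holdout_model.score(X_val, y_val)  # rows never seen

# 5-fold cross-validation: five separate fits, each scored on its held-out fold.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0), X, y, cv=5)

print(train_accuracy, holdout_accuracy, cv_scores.mean())
```

The hold-out and cross-validated scores are usually lower than the score on the fitting rows, and the gap gives a sense of how much the tree overfits.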

## Predicting and saving to new file

Creates a new data frame to store the predictions for the test data

In [111]:
predictions = pd.DataFrame()
predictions['PassengerId'] = test_df['PassengerId']
predictions['Survived'] = model.predict(test_df)

Shows the first fifteen rows of the predictions data frame

In [114]:
predictions.head(15)

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| 5 | 897 | 0 |
| 6 | 898 | 1 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |


Saves the result into a **CSV file**

In [115]:
predictions.to_csv('resources/results/predictions.csv', index=False)
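Passing `index=False` keeps the pandas row index out of the file, so the CSV contains only the `PassengerId` and `Survived` columns. The sketch below checks this with a round trip through a temporary directory; the frame `toy_predictions` and the temporary path are made up for illustration and are not the notebook's actual results.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the predictions data frame.
toy_predictions = pd.DataFrame({"PassengerId": [892, 893],
                                "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "predictions.csv")
    # Without index=False an extra unnamed index column would be written.
    toy_predictions.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```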