# Test with Titanic Movie Characters

In the last tutorial, we trained a Random Forest Classifier to predict whether a passenger survived. For this exercise, you can choose to play a character from the movie.

**Note:** The details in the dataset for these characters do not exactly match those of the movie. I found actual passengers in the titanic_train dataset that matched these characters as closely as possible, and then replaced their names with the names of the movie characters.

<div>
<img src = "TitanicCharacters.png" width="700">
</div>

### Use the Random Forest Classifier to predict if the character survived or not.  

#### Import libraries and the stored Random Forest classifier
Let's start by importing the libraries and loading the classifier we trained and saved in the previous lesson.

In [1]:
# Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Workshop Functions
import sys
sys.path.append('..')
from Wksp722_functions import * 

In [2]:
# Read in the classifier trained in the previous lesson
import pickle
# Use a context manager so the file is closed after loading
with open('RF_Final.pkl', 'rb') as f:
    RF_Final = pickle.load(f)

### Load the Test dataset
* To save time, this dataset has already been cleaned with the same steps we applied to the training dataset. Normally you would need to clean a test dataset yourself before using it.

In [3]:
# Load the test dataset
chars = pd.read_csv("titanicMovieCharacters.csv")
chars.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Salutation
0,13,3,"Dawson, Mr. Jack",male,20,0,0,A/5. 2151,8.05,S,Mr.
1,586,1,"Dewitt-Bukater, Miss. Rose",female,18,0,2,110413,79.65,S,Miss.
2,98,1,"Hockley, Mr. Caledon",male,23,0,1,PC 17759,63.3583,C,Mr.
3,880,1,"Dewitt-Bukater, Mrs. Ruth",female,56,0,1,11767,83.1583,C,Mrs.
4,185,3,"Cartmell, Miss. Cora",female,4,0,2,315153,22.025,S,Miss.


### Process Test dataset
* Next, we need to process the dataset further, in the same way as in the previous lecture.

Reuse the function defined in the previous lecture:

def titanicNumericalConverter(df):
    # work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()

    # convert the categorical variable 'Sex' to numerical 0 and 1 using mapping
    mapping = {'male': 0, 'female': 1}
    df['Sex'] = df['Sex'].map(mapping)

    # convert columns using one-hot encoding (dtype=int gives 0/1 rather than True/False)
    dfTemp = pd.get_dummies(df[['Embarked', 'Salutation']], dtype=int)
    df = pd.concat([df, dfTemp], axis=1)
    df = df.drop(['PassengerId', 'Embarked', 'Name', 'Ticket', 'Salutation'], axis=1)
    return df

In [4]:
x_test = titanicNumericalConverter(chars)

In [5]:
x_test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_S,Salutation_Miss.,Salutation_Mr.,Salutation_Mrs.
0,3,0,20,0,0,8.05,0,1,0,1,0
1,1,1,18,0,2,79.65,0,1,1,0,0
2,1,0,23,0,1,63.3583,1,0,0,1,0
3,1,1,56,0,1,83.1583,1,0,0,0,1
4,3,1,4,0,2,22.025,0,1,1,0,0


Notice that some columns are missing. The test dataset does not contain every category seen in training (for example, no character embarked at Q), so `get_dummies` did not create columns for them. The Random Forest classifier still expects these columns, so we need to insert them in the right positions.

In [6]:
# This function will insert the missing columns.  I made a function to reduce typing during class.  
x_test = M3L3_titanicTest_colInsert(x_test)


Embarked_Q 7
Salutation_Capt. 9
Salutation_Col. 10
Salutation_Countess. 11
Salutation_Don. 12
Salutation_Dr. 13
Salutation_Jonkheer. 14
Salutation_Lady. 15
Salutation_Major. 16
Salutation_Master. 17
Salutation_Mlle. 19
Salutation_Mme. 20
Salutation_Ms. 23
Salutation_Rev. 24
Salutation_Sir. 25
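
The source of `M3L3_titanicTest_colInsert` is not shown in this notebook. As a minimal sketch of the same idea, assuming you have the list of feature columns used during training (the `train_columns` list below is a shortened, hypothetical stand-in), pandas' `reindex` can add any missing columns filled with 0 and put all columns in the training order:

```python
import pandas as pd

# Hypothetical, shortened stand-in for the training feature columns.
train_columns = ['Pclass', 'Sex', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

# A small test frame that lacks 'Embarked_Q', as in the movie-character data.
small_test = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': [0, 1],
    'Embarked_C': [0, 1],
    'Embarked_S': [1, 0],
})

# reindex adds the missing columns (filled with 0) and enforces the training order.
aligned = small_test.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())
```

Keep in mind that `reindex` also drops any column that is not in the list, so the list should contain every feature the classifier was trained on.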


In [7]:
print(x_test.head())
print(x_test.columns)

   Pclass  Sex  Age  SibSp  Parch     Fare  Embarked_C  Embarked_Q  \
0       3    0   20      0      0   8.0500           0           0   
1       1    1   18      0      2  79.6500           0           0   
2       1    0   23      0      1  63.3583           1           0   
3       1    1   56      0      1  83.1583           1           0   
4       3    1    4      0      2  22.0250           0           0   

   Embarked_S  Salutation_Capt.  ...  Salutation_Major.  Salutation_Master.  \
0           1                 0  ...                  0                   0   
1           1                 0  ...                  0                   0   
2           0                 0  ...                  0                   0   
3           0                 0  ...                  0                   0   
4           1                 0  ...                  0                   0   

   Salutation_Miss.  Salutation_Mlle.  Salutation_Mme.  Salutation_Mr.  \
0                 0           

In [8]:
# Next use the model to predict the survival of the passengers in this new test data
y_pred = RF_Final.predict(x_test)

In [9]:
# get the actual answers
temp = pd.read_csv("titanicMovieCharacters_Answers.csv")
y_test = temp.loc[:,'Survived']

In [10]:
for i in range(0,len(y_pred)):
    print(chars.loc[i,'Name'], '||', 'predicted to survive: ', y_pred[i], '||', 'actual: ', y_test[i])

Dawson, Mr. Jack || predicted to survive:  0 || actual:  0
Dewitt-Bukater, Miss. Rose || predicted to survive:  1 || actual:  1
Hockley, Mr. Caledon || predicted to survive:  0 || actual:  1
Dewitt-Bukater, Mrs. Ruth || predicted to survive:  1 || actual:  1
Cartmell, Miss. Cora || predicted to survive:  1 || actual:  1
Andrews, Mr. Thomas Jr || predicted to survive:  0 || actual:  0
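
To summarize the comparison in a single number, one option is the fraction of characters predicted correctly. The sketch below hard-codes the predicted and actual values printed above rather than reusing `y_pred` and `y_test`, so it runs on its own:

```python
import pandas as pd

# Values copied from the printed output above
# (order: Jack, Rose, Caledon, Ruth, Cora, Thomas).
predicted = pd.Series([0, 1, 0, 1, 1, 0])
actual = pd.Series([0, 1, 1, 1, 1, 0])

# Element-wise comparison gives True/False; the mean of booleans is the fraction correct.
correct = predicted == actual
accuracy = correct.mean()
print(f"{correct.sum()} of {len(correct)} correct, accuracy = {accuracy:.2f}")
```

With these values the model gets 5 of 6 characters right; Caledon Hockley is the only miss. In the notebook itself, `(y_pred == y_test).mean()` should give the same figure.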


#### How well did the algorithm do for your character?  

***Curiosity Points (15 points)***
Clean and process the "titanic_test.csv" dataset using the methods in the last 3 notebooks.  Predict the survival of the passengers in this dataset.  Then go to the Kaggle challenge website (https://www.kaggle.com/competitions/titanic) and submit your results.  See how well you did.  
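
For the submission step, Kaggle's Titanic competition expects a CSV with two columns, `PassengerId` and `Survived`. Below is a minimal sketch of building such a file. The IDs and predictions are made-up placeholders; in practice you would use the `PassengerId` column from "titanic_test.csv" and the output of `RF_Final.predict`. The file name `submission.csv` mentioned in the comment is also just an example.

```python
import io
import pandas as pd

# Placeholder values for illustration only, not real predictions.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})

# index=False keeps the row index out of the file.
# Written to an in-memory buffer here; pass a path such as 'submission.csv' to write a file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Because `titanicNumericalConverter` drops `PassengerId`, save that column before converting the test data.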