## Process the titanic_test dataset
In the "M3L2 - Process (Train)" notebook, we made the following adjustments to the titanic_train dataset:
* Dropped the cabin column
* Dropped any passengers with 'Embarked' data missing
* Replaced index with PassengerId
* Created a Salutation column
* Replaced missing Age data with the median of their Salutation group

We must repeat these steps for the test dataset.  
Note that Kaggle does not provide the Survived column for the test dataset.  This prevents researchers from tuning their models specifically to the test dataset and thereby artificially inflating their accuracy.  

The code below is an abridged version of the "M3L2 - Process (Train)" notebook.

In [None]:
# Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Workshop Functions
import sys
sys.path.append('..')
from Wksp722_functions import * 

In [None]:
df = pd.read_csv("titanic_test.csv")
df.head(2)

In [None]:
# reset index to PassengerId
df.set_index('PassengerId', inplace=True)

In [None]:
df.isnull().sum()

In [None]:
# Replace the missing Fare value (1 passenger) with the median Fare

print(df.shape) # size of df before 
df.loc[df.loc[:,'Fare'].isnull(),'Fare'] = df.loc[:,'Fare'].median()
print(df.shape) # size of df after
df.head()

In [None]:
# Drop Cabin column

df.drop(['Cabin'], axis=1, inplace=True)
df.head()

### Create 'Salutation' Column

In [None]:
# Split each name on whitespace into at most 4 parts; part 1 holds the salutation (e.g. 'Mr.')
split_name = df.loc[:,'Name'].str.split(n=3, expand=True)

In [None]:
# Store the salutation (second part of the split name) in a new column
df.loc[:,'Salutation']=split_name[1]

In [None]:
#  Count the number of passengers for each salutation
df.groupby('Salutation').count().loc[:,'Name']
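
The sketch below illustrates what the positional split used above returns. The names are hypothetical examples in the same "Surname, Title. Given names" layout, and the pattern-based extraction is an assumption shown only for comparison, not the method used in this workshop. It suggests that a surname containing a space shifts the salutation out of position 1, which is worth checking in the salutation counts.

```python
import pandas as pd

# Hypothetical names in the same "Surname, Title. Given names" layout
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "van Billiard, Mr. Austin Blyler",   # surname made of two words
])

# Same call as in the notebook: split on whitespace, at most 3 splits
parts = names.str.split(n=3, expand=True)
by_position = parts[1]

# Alternative (an assumption, not the notebook's method): take the text
# between the comma and the first period that follows it
by_pattern = names.str.extract(r',\s*([^.]+\.)', expand=False)

print(by_position.tolist())  # ['Mr.', 'Mrs.', 'Billiard,']
print(by_pattern.tolist())   # ['Mr.', 'Mrs.', 'Mr.']
```

If the extraction were changed here, the training notebook would need the same change so that the salutation labels still match the ones used to compute the median ages.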

It's important to replace missing ages with the median ages calculated from the training set.  That way, we're using the same substitution values for both sets.  

In [None]:
median_age = pd.read_csv('median_age.csv')
median_age.set_index('Salutation',inplace=True)

In [None]:
# let's remember how many null values there are in the existing Age column
df.loc[:,'Age'].isnull().sum()

In [None]:
for ind in df.index:
    if np.isnan(df.loc[ind,'Age']): 
        df.loc[ind,'Age'] = median_age.loc[df.loc[ind,'Salutation'],'Age']
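
The loop above raises a `KeyError` if a passenger with a missing age has a salutation that does not appear in the training medians. As a minimal sketch of a vectorized alternative, the example below uses `Series.map` and `fillna` on toy data; the frames `toy` and `toy_median_age` and their values are hypothetical stand-ins for `df` and `median_age`, not real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for median_age.csv (illustrative values only)
toy_median_age = pd.DataFrame(
    {'Salutation': ['Mr.', 'Mrs.'], 'Age': [30.0, 35.0]}
).set_index('Salutation')

# Hypothetical miniature of the test DataFrame
toy = pd.DataFrame(
    {'Salutation': ['Mr.', 'Mrs.', 'Mr.', 'Dona.'],
     'Age': [22.0, np.nan, np.nan, np.nan]},
    index=pd.Index([892, 893, 894, 895], name='PassengerId'),
)

# Look up the training median for every row; unknown salutations map to NaN
lookup = toy['Salutation'].map(toy_median_age['Age'])

# Fill only the missing ages; existing ages are left untouched
toy['Age'] = toy['Age'].fillna(lookup)

# Salutations that still have no age were absent from the training medians
unmatched = toy.loc[toy['Age'].isnull(), 'Salutation'].unique()
print(toy)
print(unmatched)
```

Any salutations listed in `unmatched` would need a separate decision, for example a fallback to the overall training median.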

### Save Test dataset

In [None]:
df.to_csv('titanic_test_cleaned.csv')
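
As an optional sanity check, the saved file can be read back to confirm that the index and values survive the round trip. The sketch below uses a hypothetical two-row frame and a temporary file rather than the real `titanic_test_cleaned.csv`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row frame indexed by PassengerId (illustrative values only)
toy = pd.DataFrame(
    {'Age': [22.0, 35.0], 'Fare': [7.8, 7.0]},
    index=pd.Index([892, 893], name='PassengerId'),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cleaned.csv')
    toy.to_csv(path)  # the index is written as the first column
    # Restore PassengerId as the index when loading
    reloaded = pd.read_csv(path, index_col='PassengerId')

# Count remaining missing values per column
remaining_nulls = reloaded.isnull().sum()
print(remaining_nulls)
```

Passing `index_col='PassengerId'` when loading avoids ending up with PassengerId as an ordinary column next to a fresh integer index.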