# Introduction

This notebook briefly explores exploratory data analysis (EDA) and how it can aid data preparation and feature generation. A supervised machine learning (ML) task serves as the overarching project for exploring these topics.

The ML task comes from the Kaggle competition [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic), in which the goal is to classify whether individual passengers on the Titanic survived, based on the passenger information provided. To complete this task, the scikit-learn Python library will be used to train a Random Forest classifier. Seaborn and Matplotlib will provide data visualization for EDA and other parts of the project. Finally, the pandas and NumPy libraries will be used for data handling.

In [1]:
#Python Packages
import time #Used to timestamp submission files

#Data Handling
import pandas as pd
import numpy as np

#Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Machine Learning and data manipulation
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier

#Ensure Notebook displays visuals
sns.set_style('white')
%matplotlib inline

## Access Data

This notebook uses Google Drive to access stored data. The cell below should be updated if the notebook is run in a different location or environment, or by another person.

The data comes from the Kaggle competition described above and can be found [here](https://www.kaggle.com/c/titanic/data).

In [4]:
#Access the project data
datapath = "input/titanic/"

#The files with paths for training and test data
train_file = datapath + "train.csv"
test_file = datapath + "test.csv"

# Data Exploration

Data exploration is crucial for any data mining project. It is important to understand the dataset you are working with, and exploring that dataset gives the analyst insight into which parts of it are useful and which are not. Exploration can also help an analyst uncover information within the dataset that was not apparent at first. This EDA will allow us to understand the work needed to prepare the data for further analysis later in the project.

## First Look at the Data

A first look at the dataset makes it clear that there is potentially a lot of work to do to get the data ready for an ML classifier. There are multiple string columns, some of which contain free text (Name and Ticket) while others are categorical (like Sex and Embarked). There are also multiple numeric columns, including categorical values and real numbers. Diving into the data with EDA will help show how to handle the various attributes in the dataset, and may provide insight into useful features that could be generated.

* Passenger Class (Pclass) is numeric
* Name is a string
* Sex is a string that is either male or female
* Age is a float
* SibSp and Parch are both integers
* Ticket is a string
* Fare is a float
* Cabin is a string
* Embarked is a char

In [5]:
#Open dataframes for both training and test data
train_df = pd.read_csv(train_file, index_col='PassengerId')
test_data = pd.read_csv(test_file)

#Show head of training data
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Exploratory Data Analysis

### Missing Values

One important thing to investigate is missing values within a dataset. If an attribute is missing values, the analyst must decide what to do with that attribute, which could include imputing replacements for the missing values or even dropping the attribute altogether. It is important to consider each attribute individually to determine the best course of action.


In [None]:
train_df.info()
#We can see that some columns are missing data in the training data

In [None]:
#A heatmap can be used to visualize missing data, making it easier to see where focus is needed
#Initial Formatting
fig, axes = plt.subplots(figsize=(6,6), constrained_layout=True)

#Graph
train_miss = sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False,cmap='cividis',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Missing Values',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_format = axes.set(ylabel=None)
axes_xlabels = axes.set_xticklabels(train_miss.get_xticklabels(),size=14)

In [None]:
#We should also investigate the test data to see if we need to clean or handle any of that data.
#Initial Formatting
fig, axes = plt.subplots(figsize=(6,6), constrained_layout=True)

#Graph
test_miss = sns.heatmap(test_data.isnull(),yticklabels=False,cbar=False,cmap='cividis',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Missing Values',fontsize=24)
axes_title = axes.set_title('Testing Dataset',fontsize=16)
axes_format = axes.set(ylabel=None)
axes_xlabels = axes.set_xticklabels(test_miss.get_xticklabels(),size=14)

As we can see, the Age attribute has some missing values (about 20% of values are missing in the training dataset), but that attribute seems like a good candidate for having values imputed for records missing age. The same can be said about the Fare and Embarked attributes.

The Cabin attribute is missing large numbers of values (about 77% of values are missing in the training dataset) and could be a candidate for being dropped. However, this feature should still be explored and researched some before determining whether to keep or drop the attribute.
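The percentages quoted above can be computed directly rather than estimated from the heatmap. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative assumptions, not the Titanic data); the same expression should apply to `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration only
toy_df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.93, 53.10, 8.05],
})

# isnull() gives a boolean frame; the mean of each boolean column
# is the fraction of missing values in that column
missing_pct = toy_df.isnull().mean() * 100

print(missing_pct.round(1))
```

Sorting the result with `sort_values(ascending=False)` would put the attributes that need the most attention at the top.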

### Analyzing Survival

Since the task at hand is determining whether a passenger survived, the EDA will be mainly focused on looking at the various attributes and their relationship to survival.

#### Class Distribution

Class distribution can influence many choices in a ML project, so it is important to understand that as soon as possible.

While more passengers did not survive than passengers that did, the class distribution isn't so heavily skewed that it has to be taken into account.

In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
cls_dst = sns.countplot(x='Survived',data=train_df,ax=axes,palette='colorblind')

#Additional Formatting
fig_title = fig.suptitle('Class Distribution',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Count',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in cls_dst.get_yticks()],size=12)
axes_xlabel = axes.set_xlabel('Survival',size=14)
axes_xticklabel = axes.set_xticklabels(['Did Not Survive','Survived'],size=12)

In [None]:
#Calculate class distribution
dns = (train_df['Survived'].value_counts()[0]/len(train_df))*100
surv = (train_df['Survived'].value_counts()[1]/len(train_df))*100

#Show distribution
print('Class Distribution - Training Dataset')
print(f'Did Not Survive: {dns:4.2f}%')
print(f'Survived: {surv:4.2f}%')

#### Sex

Being male meant you were more likely to not survive, while females were more likely to survive.

This will be a useful feature for the ML model.
* This attribute will need to be converted to a numerical categorical attribute.
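As a rough illustration of that conversion, the sketch below shows two common ways to turn a two-valued string column into a 0/1 column. The series `sex` is a made-up stand-in for `train_df['Sex']`.

```python
import pandas as pd

# Toy column standing in for train_df['Sex']
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# Option 1: explicit mapping from label to integer code
sex_mapped = sex.map({'female': 0, 'male': 1})

# Option 2: one-hot encode and drop the first category (alphabetically 'female'),
# which leaves a single 'male' indicator column
sex_dummies = pd.get_dummies(sex, drop_first=True).astype(int)

print(sex_mapped.tolist())
print(sex_dummies)
```

Both options carry the same information for a binary attribute; the one-hot route generalizes more naturally to attributes with more than two categories.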

In [None]:
#We can further dive into the data by using hue

#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
srv_sex = sns.countplot(x='Survived', hue='Sex', data=train_df,ax=axes,palette='colorblind')

#Additional Formatting
fig_title = fig.suptitle('Survival by Sex',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Count',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in srv_sex.get_yticks()],size=12) 
axes_xlabel = axes.set_xlabel('Survival',size=14)
axes_xticklabel = axes.set_xticklabels(['Did Not Survive','Survived'],size=12)
graph_legend = srv_sex.legend(['Male','Female'],fontsize=14,title='Sex',title_fontsize=14)

In [None]:
#Pivot tables can be used to see how various attributes correlate to survival
train_df.pivot_table(values='Survived',index='Sex',aggfunc=np.mean)
#We can see that 74.2% of females survived while only 18.9% of males survived

#### Passenger Class

Being a 3rd class passenger was also bad for survival chances (only 24.2% survived), while being in a higher class was better for survival (63.0% of 1st class passengers survived).

This will be a useful feature for the ML model.

In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
srv_pclass = sns.countplot(x='Survived', hue='Pclass', data=train_df,ax=axes,palette='colorblind')

#Additional Formatting
fig_title = fig.suptitle('Survival by Passenger Class',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Count',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in srv_pclass.get_yticks()],size=12) 
axes_xlabel = axes.set_xlabel('Survival',size=14)
axes_xticklabel = axes.set_xticklabels(['Did Not Survive','Survived'],size=14)
graph_legend = srv_pclass.legend(['1st Class','2nd Class','3rd Class'],fontsize=14,title='Passenger Class',title_fontsize=14)

In [None]:
train_df.pivot_table(values='Survived',index='Pclass',aggfunc=np.mean)
#Class 2 passengers are almost split, leaning slightly towards not surviving, with only 47.3% surviving and 52.7% not surviving.

#### Age

The correlation between age and survival isn't as directly apparent, but there are some correlations.

For example, children under 5 years old had a very high likelihood of being among the survivors. Meanwhile individuals in their 20s-30s were at a higher likelihood of not surviving.

In [None]:
#Initial Formatting
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), constrained_layout=True, sharey=True)

#Graphs
age_surv = sns.histplot(data=train_df[train_df['Survived']==1],x='Age',bins=30,ax=axes[1])
age_dead = sns.histplot(data=train_df[train_df['Survived']==0],x='Age',bins=30,ax=axes[0])

#Additional Formatting
figtitle = fig.suptitle('Survival by Age - Training Dataset',fontsize=24)
axeszero_ylabel = axes[0].set_ylabel('Count',size=14)
axeszero_yticklabel = axes[0].set_yticklabels([int(x) for x in age_dead.get_yticks()],size=12)
axeszero_title = axes[0].set_title('Did Not Survive', fontsize=14)
axeszero_xticklabel = axes[0].set_xticklabels([int(x) for x in age_dead.get_xticks()],size=12) 
axeszero_xlabel = axes[0].set_xlabel('Age',size=14)
axesone_title = axes[1].set_title('Survived', fontsize=14)
axesone_xticklabel = axes[1].set_xticklabels([int(x) for x in age_surv.get_xticks()],size=12)
axesone_xlabel = axes[1].set_xlabel('Age',size=14)

We see that most 3rd class passengers are in their 20s and 30s, which helps explain the higher numbers of non-survivors in those age ranges.

Because passenger class is also correlated with passenger survival, and that attribute is complete (no missing values), the average age of passengers by class could be used to impute missing age values for records missing the age.

This will be a useful feature for the ML model.
* New age values must be imputed for missing values. The average age per class will be used to impute age for missing values based on the record's class.
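The per-class imputation described above can be sketched with a grouped transform. This is an illustration on made-up values (`toy_df` is an assumption, not the Titanic data) and is one possible way to express the idea, not necessarily how the rest of this notebook implements it.

```python
import numpy as np
import pandas as pd

# Toy data: two classes, each with one missing age
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 36.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age of each class, broadcast back to every row of that class
# (missing ages are skipped when the mean is computed)
class_mean_age = toy_df.groupby('Pclass')['Age'].transform('mean')

# Only rows with a missing Age take the class mean; known ages are kept
toy_df['Age'] = toy_df['Age'].fillna(class_mean_age)

print(toy_df)
```

When the same imputation has to be applied to a separate test set, the class means would normally be computed on the training data only and then reused, so that no information from the test set leaks into the preparation step.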


In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
agedist_pclass = sns.boxplot(x='Pclass',y='Age',data=train_df,palette='colorblind',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Age Distribution by Passenger Class',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Age',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in agedist_pclass.get_yticks()],size=12)
axes_xlabel = axes.set_xlabel('Passenger Class',size=14)
axes_xticklabel = axes.set_xticklabels(['1st Class','2nd Class','3rd Class'],size=12)

In [None]:
train_df.pivot_table(values='Age',index='Pclass',aggfunc=np.mean)
#We can use the mean age per class to impute missing age values

#### Fare

Like Age above, there is no major correlation of fare to survival apparent. Most notably, those who paid lower fares were more likely to not survive. This is in line with the rest of the data: passengers in 3rd class were more likely to not survive and a 3rd class ticket would cost less on average.

So, while there is not a strong apparent correlation for Fare, there seems to be some useful information captured within this attribute.

Because there is a correlation between Fare and Passenger Class, and there is a correlation between Passenger Class and survival, this attribute will be used for the ML model.
* New fare values must be imputed for missing values. The average fare per class will be used to impute fare for missing values based on the record's class.


In [None]:
#Initial Formatting
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,6), constrained_layout=True, sharey=True)

#Graphs
fare_surv = sns.histplot(data=train_df[train_df['Survived']==1],x='Fare',bins=15,ax=axes[1])
fare_dead = sns.histplot(data=train_df[train_df['Survived']==0],x='Fare',bins=15,ax=axes[0])

#Additional Formatting
figtitle = fig.suptitle('Survival by Fare - Training Dataset',fontsize=24)
axeszero_title = axes[0].set_title('Did Not Survive', fontsize=14)
axeszero_ylabel = axes[0].set_ylabel('Count',size=14)
axeszero_yticklabel = axes[0].set_yticklabels([int(x) for x in fare_dead.get_yticks()],size=12)
axeszero_xticklabel = axes[0].set_xticklabels([int(x) for x in fare_dead.get_xticks()],size=12)
axeszero_xlabel = axes[0].set_xlabel('Fare',size=14)
axesone_title = axes[1].set_title('Survived', fontsize=14)
axesone_xticklabel = axes[1].set_xticklabels([int(x) for x in fare_surv.get_xticks()],size=12)
axesone_xlabel = axes[1].set_xlabel('Fare',size=14)

In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
faredist_pclass = sns.boxplot(x='Pclass',y='Fare',data=train_df,palette='colorblind',ax=axes, showfliers=False) #Omitting outliers to make the visual easier to read

#Additional Formatting
fig_title = fig.suptitle('Fare Distribution by Passenger Class',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Fare',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in faredist_pclass.get_yticks()],size=12)
axes_xlabel = axes.set_xlabel('Passenger Class',size=14)
axes_xticklabel = axes.set_xticklabels(['1st Class','2nd Class','3rd Class'],size=12)

In [None]:
train_df.pivot_table(values='Fare',index='Pclass',aggfunc=np.mean)
#We can use the mean fare per class to impute missing fare values

#### Family

The SibSp and Parch attributes contain information on the number of family members on board. While it is possible to explore the relationship between the number of family members on board and survival, an easier analysis to complete first is to look at whether having family on board affects survival at all.

In [None]:
#Instead of looking at the number of siblings/children/parents on the ship, looking at family simplifies two variables
#To investigate if family helped, create a function to create a has family attribute
def has_family(columns):
  sib = columns.iloc[0] #Positional access; columns[0] is deprecated on a label-indexed row
  par = columns.iloc[1]

  if (sib > 0) or (par > 0):
    return 1
  else:
    return 0

In [None]:
#Create the has family attribute
train_df['has_fam'] = train_df[['SibSp','Parch']].apply(has_family,axis=1)

As we can see, not having family on board was not good for a passenger's survival odds. These findings show that this 'has_fam' feature could be useful for the ML task.

The has_fam feature will be a useful feature for the ML model.
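For reference, the same family indicator can be built without a row-wise `apply`. The sketch below is a vectorized alternative on made-up values (`toy_df` is an assumption); it is meant as an equivalent formulation, not a replacement for the function used in this notebook.

```python
import pandas as pd

# Toy stand-in for the SibSp and Parch columns of train_df
toy_df = pd.DataFrame({'SibSp': [1, 0, 0, 2], 'Parch': [0, 0, 1, 1]})

# A passenger has family aboard if either count is above zero;
# the boolean result is cast to 0/1 to match the row-wise function
toy_df['has_fam'] = ((toy_df['SibSp'] > 0) | (toy_df['Parch'] > 0)).astype(int)

print(toy_df)
```

On a dataset of this size the speed difference is negligible, but the vectorized form avoids any dependence on the order of the columns passed in.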

In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
srv_fam = sns.countplot(x='Survived', hue='has_fam', data=train_df,palette='colorblind',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Survival by Family Status',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Count',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in srv_fam.get_yticks()],size=12) 
axes_xlabel = axes.set_xlabel('Survival',size=14)
axes_xticklabel = axes.set_xticklabels(['Did Not Survive','Survived'],size=14)
graph_legend = srv_fam.legend(['Traveled w/o Family','Traveled w/ Family'],fontsize=14,title='Family Status',title_fontsize=14)

In [None]:
train_df.pivot_table(values='Survived',index='has_fam',aggfunc=np.mean)
#Looking at actual numbers, not having family means the passenger was more likely to not survive, making this seem like a useful attribute

#### Cabin

This column seems tempting to drop due to the large number of missing values and the messy state of the data within it. However, that would be incorrect.

The Cabin attribute highlights the usefulness of subject matter knowledge and expertise in data analytics and data mining projects. While most 1st class passengers would have had their own cabin, lower class passengers would have stayed in communal rooms without cabin numbers. So, there is a reason there are so many empty values within the Cabin attribute.


In [None]:
#Let us just look at whether a passenger has a cabin.
def has_cabin(row):
  if pd.isnull(row):
    return 0
  else:
    return 1

In [None]:
#Create a has_cabin attribute to show whether a passenger had a cabin.
train_df['has_cabin'] = train_df['Cabin'].apply(has_cabin)

In [None]:
train_df.pivot_table(values='has_cabin',index='Pclass',aggfunc=np.sum)
#We can see that recorded cabins are concentrated in first class, while few lower class passengers have one

With this subject matter knowledge, we can analyze the Cabin attribute properly and see that it provides a good correlation to survival.

The has_cabin feature will be a useful feature for the ML model.
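A compact way to derive the cabin indicator and compare it across classes is sketched below on made-up values (`toy_df` and `cabin_share` are illustrative names, not part of the original notebook).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pclass and Cabin columns of train_df
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Cabin':  ['C85', 'C123', np.nan, np.nan, np.nan],
})

# notna() is True where a cabin is recorded; cast to 0/1 for the model
toy_df['has_cabin'] = toy_df['Cabin'].notna().astype(int)

# Share of passengers with a recorded cabin in each class
cabin_share = toy_df.groupby('Pclass')['has_cabin'].mean()

print(cabin_share)
```

Using the mean rather than the sum gives a proportion per class, which is easier to compare when the classes contain different numbers of passengers.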


In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(8,6), constrained_layout=True)

#Graph
srv_cabin = sns.countplot(x='Survived', hue='has_cabin', data=train_df,palette='colorblind',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Survival by Cabin Status',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_ylabel = axes.set_ylabel('Count',size=14)
axes_yticklabel = axes.set_yticklabels([int(x) for x in srv_cabin.get_yticks()],size=12) 
axes_xlabel = axes.set_xlabel('Survival',size=14)
axes_xticklabel = axes.set_xticklabels(['Did Not Survive','Survived'],size=14)
graph_legend = srv_cabin.legend(['No Cabin','Has Cabin'],fontsize=14,title='Cabin Status',title_fontsize=14)

In [None]:
train_df.pivot_table(values='Survived',index='has_cabin',aggfunc=np.mean)
#Having a cabin was highly correlated with survival and vice versa

#### Other Considerations

**Embarked**

The Embarked attribute could have been analyzed for its correlation with survival to determine whether it is a useful feature. Subject matter knowledge would also be helpful for understanding this attribute and any apparent correlation it may have.
* The attribute will be used after a simple function converts it to a numerical categorical attribute.

**Name**

The Name attribute could have been analyzed; however, the messiness of the data makes analysis nontrivial. While the column is messy, it does contain passengers' titles (such as Mr., Mrs., Dr., etc.). Since a prestigious title could also be correlated with wealth, and therefore with being in 1st class and having a cabin, the title could be another useful feature.
* For simplicity, the Name attribute was not used for this project.


**Ticket**

The Ticket attribute is a messy attribute and was not included. It may have potentially useful information but is an example of another column that would most likely benefit from subject matter knowledge and expertise.
* For simplicity, the Ticket attribute was not used for this project.

**Age** & **Fare**

Using the **mean** to impute missing values for these attributes may not be the best choice. Another measure of central tendency, such as the **mode**, could be a better choice for imputing values.
* Thank you [nizarh](https://www.kaggle.com/michaelwilder/titanic-survival-eda-and-randomforest/comments#1501044) for making this suggestion.
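To see why the choice of measure matters, the sketch below compares the mean, median, and mode on a small right-skewed list of made-up fares. The numbers are illustrative assumptions, and the median is included only as an extra point of comparison; it is not part of the suggestion above.

```python
import statistics

# Made-up fares with one very expensive ticket, mimicking a right-skewed attribute
fares = [7.25, 7.25, 8.05, 13.0, 26.0, 512.33]

mean_fare = statistics.mean(fares)      # pulled upward by the single large fare
median_fare = statistics.median(fares)  # middle of the sorted values
mode_fare = statistics.mode(fares)      # most frequent value

print(f'mean={mean_fare:.2f} median={median_fare:.2f} mode={mode_fare:.2f}')
```

A single extreme value moves the mean far from what a typical passenger paid, while the median and mode stay close to the bulk of the data.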


# Data Cleansing and Feature Generation

From EDA we have learned that we need to do the following steps:
* Convert Sex to a numerical categorical attribute
* Impute Age for records with missing Age values
* Impute Fare for records with missing Fare values
* Convert SibSp and Parch to a has_family attribute
* Convert Cabin to a has_cabin attribute
* Convert Embarked to a numerical categorical attribute


In [None]:
#Open the training dataframe for the ML task
#This data will be messy
train_messy = pd.read_csv(train_file, index_col='PassengerId')

In [None]:
#Function to impute age based on class
#The average age per class was calculated from the training dataset
def impute_age(columns):
  Age = columns.iloc[0] #Positional access; columns[0] is deprecated on a label-indexed row
  Pclass = columns.iloc[1]

  if pd.isnull(Age):
    if Pclass == 1:
      return 38
    elif Pclass == 2:
      return 30
    else:
      return 25
  else:
    return Age

In [None]:
#Function to impute fare based on passenger class
#The average fare per class was calculated from the training dataset
def impute_fare(columns):
  fare = columns.iloc[0] #Positional access; columns[0] is deprecated on a label-indexed row
  Pclass = columns.iloc[1]

  if pd.isnull(fare):
    if Pclass == 1:
      return 84.15
    elif Pclass == 2:
      return 20.66
    else:
      return 13.67
  else:
    return fare

In [None]:
#Function to convert embark to numerical categorical variable
def convert_embark(embark):
  if embark == 'C':
    return 0
  elif embark == 'Q':
    return 1
  else:
    return 2

In [None]:
#Function to do initial cleansing of the dataset,
#including imputing missing values.
def clean_attributes(df):
  #Work on a copy so the caller's dataframe (or slice of one) is not modified
  df = df.copy()

  #Impute values for missing ages
  df['Age'] = df[['Age','Pclass']].apply(impute_age,axis=1)

  #Impute values for missing fares
  df['Fare'] = df[['Fare','Pclass']].apply(impute_fare,axis=1)

  #Create has_cabin
  df['has_cabin'] = df['Cabin'].apply(has_cabin)

  #Convert Embarked
  df['Embarked'] = df['Embarked'].apply(convert_embark)

  #Drop unneeded columns
  df.drop(['Name','Ticket','Cabin'],axis=1,inplace=True)

  #Return the cleaned dataframe (missing values were imputed above, so no records are dropped)
  return pd.DataFrame(df)

In [None]:
#Function to process data through cleansing and feature generation
def process_data(df):

  #clean the data first
  df = clean_attributes(df)

  #Then generate features
  #Create has_family
  df['has_fam'] = df[['SibSp','Parch']].apply(has_family,axis=1)

  #Convert Sex
  psex = pd.get_dummies(df['Sex'],drop_first=True)
  df = pd.concat([df,psex],axis=1)

  #Drop columns that have been replaced by generated features (Sex, SibSp, Parch)
  df.drop(['Sex','SibSp','Parch'],axis=1,inplace=True)

  feature_list = ['has_fam','has_cabin','male','Pclass','Age','Fare','Embarked']

  feat_df = df[feature_list]

  #Return the dataframe
  return pd.DataFrame(feat_df)

With functions written to clean, process, and generate features from the dataset, the training features X and the labels y can be created.



In [None]:
#Generate X and y for training and validating the machine learning model
X, y = process_data(train_messy.iloc[ : , 1:11 ]), train_messy['Survived']

We can then visually verify that the training features form a clean dataset with no missing values.

In [None]:
#Initial Formatting
fig, axes = plt.subplots(figsize=(6,6), constrained_layout=True)

#Graph
train_cln = sns.heatmap(X.isnull(),yticklabels=False,cbar=False,cmap='cividis',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Missing Values - Cleaned Data',fontsize=24)
axes_title = axes.set_title('Training Dataset',fontsize=16)
axes_format = axes.set(ylabel=None)
axes_xlabels = axes.set_xticklabels(train_cln.get_xticklabels(),size=12)

## Normalization

Because the dataset mixes categorical attributes (such as has_fam and Pclass) with numeric attributes (Age and Fare), normalizing the data brings all features into the same range. This can help prevent a feature like Fare, which has much larger values than the rest of the dataset, from dominating the classifier. Normalization is not required by every classifier, however: tree-based models such as Random Forests split on thresholds and are largely insensitive to feature scaling, so the step matters more for distance-based or gradient-based models.
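Min-max normalization rescales each feature with x' = (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1. The sketch below applies that formula to a made-up list of ages; `min_max_scale` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up ages for illustration only
ages = [2.0, 22.0, 42.0, 82.0]
scaled_ages = min_max_scale(ages)

print(scaled_ages)
```

Scikit-learn's `MinMaxScaler` with `feature_range=(0, 1)` applies this same formula column by column, and remembers each column's minimum and maximum from `fit` so that new data can be transformed consistently.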

In [None]:
#Normalize the attributes to help the classifier
scaler = MinMaxScaler(feature_range=(0,1))
scaler.fit(X)
sX = scaler.transform(X)

## Train Test Split

Split sX into training data and set aside a validation set to evaluate the trained model's performance.

In [None]:
#Create a set for training and another for validating the model.
sX_train, sX_val, y_train, y_val = train_test_split(sX,y,test_size=0.2,random_state=101)

# Random Forest - GridSearchCV

To complete the ML task, a Random Forest classifier will be trained.

To tune the hyperparameters of the Random Forest classifier, GridSearchCV (a grid search with cross-validation) will be used to determine which set of hyperparameters provides the best model accuracy.
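A grid search tries every combination of the listed values, and each combination is trained once per cross-validation fold, so the cost grows multiplicatively. The sketch below counts the work for a small illustrative grid; `grid` and `cv_folds` are assumptions chosen for the example and are not the exact grid used in this notebook.

```python
from itertools import product

# Illustrative parameter grid (an assumption for this example)
grid = {
    'n_estimators': [300, 500],
    'max_depth': [3, 4, 5],
    'criterion': ['gini', 'entropy'],
}
cv_folds = 5

# Every combination of one value per parameter, as a list of dicts
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each combination is fit once per fold
n_fits = len(combos) * cv_folds

print(f'{len(combos)} combinations x {cv_folds} folds = {n_fits} fits')
```

With scikit-learn's default `refit=True`, one additional fit of the best combination on the full training set follows the search. Adding a single extra value to any parameter multiplies the total accordingly, which is why grids are often kept short.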

In [None]:
#Parameter grid for the gridsearch
#These are the parameters that the gridsearch will methodically work through
#to determine the best set for the task at hand.
params = {
    'n_estimators': [300, 500],
    'max_features': ['sqrt', 'log2'], #'auto' was removed in scikit-learn 1.3 (it was equivalent to 'sqrt' here)
    'max_depth' : [3, 4, 5],
    'criterion' :['gini', 'entropy']
}
#This parameter grid has been shortened to help the runtime of this notebook.

In [None]:
#Create a Random Forest classifier object
rf = RandomForestClassifier(random_state=101)

#Create a Gridsearch object
rf_gs = GridSearchCV(estimator=rf,param_grid=params,scoring='accuracy',cv=5,verbose=True)

In [None]:
#Fit the Gridsearch object using the normalized training data
#Note, this can take some time depending on the parameter grid
rf_gs.fit(sX_train,y_train)

In [None]:
#Generate predictions from the trained gridsearch object
y_pred = rf_gs.predict(sX_val)

In [None]:
#Print out performance information for the gridsearch object
print(classification_report(y_val,y_pred))
print(f'The accuracy score of the best model is: {accuracy_score(y_val,y_pred)*100:4.2f}%')
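To make the reported metrics concrete, the sketch below counts the four confusion-matrix cells by hand and derives accuracy, precision, and recall from them. The labels are made up for illustration and are not the notebook's validation results.

```python
# Made-up labels for illustration only (0 = did not survive, 1 = survived)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 1, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted non-survivors
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors predicted as survivors
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors predicted as non-survivors
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted survivors

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```

Scikit-learn's `confusion_matrix` arranges these counts with actual classes as rows and predicted classes as columns, that is `[[TN, FP], [FN, TP]]` for binary labels 0 and 1.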

In [None]:
#Create a dataframe of the confusion matrix data from the predictions of the 
#gridsearch object
cm_df = pd.DataFrame(confusion_matrix(y_val,y_pred),columns=np.unique(y_val),index=np.unique(y_val))
cm_df.index.name = "Actual"
cm_df.columns.name = "Predicted"

In [None]:
#Create confusion matrix heatmap of the gridsearch object predictions
#Initial Formatting
fig, axes = plt.subplots(figsize=(6,6), constrained_layout=True)

#Graph
cnfmtx_pipeline = sns.heatmap(data=cm_df, cmap='Blues',annot=confusion_matrix(y_val,y_pred),annot_kws={"size": 16}, fmt='d',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Confusion Matrix',fontsize=24)
axes_title = axes.set_title('GridSearchCV Random Forest',fontsize=14)
axes_Xax = axes.set_xlabel('Predicted', fontsize=16)
axes_Yax = axes.set_ylabel('Actual', fontsize=16)
axes_xtick = axes.set_xticklabels(labels=[0,1],fontsize=14)
axes_ytick = axes.set_yticklabels(labels=[0,1],fontsize=14)

# Submission

With a model trained and validated, the testing data can be prepared, fed into the ML model for predictions, and a submission created for the Kaggle competition.

In [None]:
#Open and prepare test data
test_messy = pd.read_csv(test_file)
test, p_id = process_data(test_messy.iloc[ : , 1:11 ]), test_messy['PassengerId']

In [None]:
#Visually confirm test data was properly processed and in good form
#Initial Formatting
fig, axes = plt.subplots(figsize=(6,6), constrained_layout=True)

#Graph
test_cln = sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='cividis',ax=axes)

#Additional Formatting
fig_title = fig.suptitle('Missing Values - Cleaned Data',fontsize=24)
axes_title = axes.set_title('Test Dataset',fontsize=16)
axes_format = axes.set(ylabel=None)
axes_xlabels = axes.set_xticklabels(test_cln.get_xticklabels(),size=12)

In [None]:
#Normalize the test data with the fit scaler
stest = scaler.transform(test)

#Generate the predictions based on the normalized test data
test_pred = rf_gs.predict(stest)

#Generate a dataframe for the test predictions
sub_df = pd.DataFrame()
sub_df['PassengerId'] = p_id
sub_df['Survived'] = test_pred

#Save the predictions as a csv file
sub_df.to_csv(f'Titanic_Submission__{time.strftime("%Y%m%d_%H%M%S")}.csv',index=False)