# Spaceship Titanic - A Supervised Classification Machine Learning Problem

# Background: 
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

![Planetary Systems](./Images/55_Cancri_E_from_Earth.png)

# File and Data Field Descriptions
**Personal records recovered from the ship's damaged computer system**
* **train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    - PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    - HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    - CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    - Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    - Destination - The planet the passenger will be debarking to.
    - Age - The age of the passenger.
    - VIP - Whether the passenger has paid for special VIP service during the voyage.
    - RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
    - Name - The first and last names of the passenger.
    - Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
* **test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
* **sample_submission.csv** - A submission file in the correct format.
    - PassengerId - Id for each passenger in the test set.
    - Transported - The target. For each passenger, predict either True or False.
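
The two composite fields described above (PassengerId as gggg_pp and Cabin as deck/num/side) can be taken apart with pandas string methods. The following is a minimal sketch on a few toy rows; the output column names (`group`, `position`, `deck`, `num`, `side`) are illustrative assumptions, not names required by the competition.

```python
import pandas as pd

# Toy records in the documented formats (values are illustrative only)
records = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# PassengerId has the form gggg_pp: group number, then position within the group
records[["group", "position"]] = records["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side, where side is P (Port) or S (Starboard)
records[["deck", "num", "side"]] = records["Cabin"].str.split("/", expand=True)

print(records)
```

The split pieces are still strings at this point, so numeric parts such as `num` need an explicit cast before they can be used as numbers.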

### Define the Problem: 
1. What is the problem?
    * Informal Description - I need a program that will predict which passengers were transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly using a set of personal records recovered from the ship's damaged computer system. <br><br>
    * Formalism 
        * Task(T) Classify a passenger (not seen in training) as being Transported ("True or False")
        * Experience(E) A list of passengers' personal records (the training set) indicating if they were Transported ("True or False") (Supervised Learning)
        * Performance(P) Classification accuracy, the number of passengers predicted correctly out of all passengers considered as a percentage<br><br>
2. Why does the problem need to be solved?
    * Motivation - I am solving this problem as a learning exercise in applying Machine Learning techniques.
    * Solution Use - The solution is short-lived: it will be submitted to the Kaggle competition and needs no maintenance<br><br>
3. How would I solve the problem?<br><br>
    * In a systematic fashion using a checklist<br><br>
source: https://machinelearningmastery.com/machine-learning-checklist/<br><br>
![ML Process to solve the problem](./Images/Machine-Learning-for-Programmers-Select-Tools-e1439699936331.png)
    * Define the Problem
    * Prepare the Data: Data Cleaning and Imputing missing values
    * Spot Check Algorithms: Baseline with simple model(s) to gain initial insight (i.e., feature importances) and benchmark classification accuracy (i.e., LogisticRegression and RandomForestClassifier)
    * Improve the Results: Train and evaluate more complex models (like Deep Neural Networks) to improve classification accuracy
    * Present the Results
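
The performance measure (P) in the formalism above is the share of passengers predicted correctly, expressed as a percentage. A minimal sketch of that calculation in plain Python follows; the function name `classification_accuracy` and the toy labels are assumptions for illustration.

```python
def classification_accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# Toy labels: three of the four predictions agree with the actual values
actual = [True, False, True, True]
predicted = [True, False, False, True]
print(classification_accuracy(actual, predicted))
```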


#### Assumptions:
    * A passenger's location on the ship when the collision occurred matters to the model
        Related factors include:
            - Where is the passenger's cabin on the ship? Side, Deck, Cabin_Number
            - Was the passenger in their cabin at the time of the collision? Spending records indicate activity
            - What was the passenger's Destination? Passengers about to debark were likely not in their cabin
            - What was the passenger's HomePlanet? Passengers who recently embarked were likely in their cabin
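
One way to turn the "spending records indicate activity" assumption into features is to total the amenity bills and flag passengers who spent anything at all. This is only a sketch on toy rows; `Total_Spend` and `Any_Spend` are hypothetical column names that do not appear in the dataset.

```python
import pandas as pd

# Toy passengers; column names follow train.csv, values are illustrative only
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, 0.0],
    "FoodCourt": [0.0, 9.0, 0.0],
    "ShoppingMall": [0.0, 25.0, 0.0],
    "Spa": [0.0, 549.0, 0.0],
    "VRDeck": [0.0, 44.0, 0.0],
    "CryoSleep": [True, False, False],
})

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Hypothetical features: total billed amount and a flag for any spending at all
toy["Total_Spend"] = toy[spend_cols].sum(axis=1)
toy["Any_Spend"] = toy["Total_Spend"] > 0

print(toy[["CryoSleep", "Total_Spend", "Any_Spend"]])
```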

### Prepare the Data:
#### Import Dependencies

In [1]:
# import common dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import seaborn and set_theme
import seaborn as sns
sns.set_theme(style="whitegrid")

# import regular expressions
import re

In [2]:
# sklearn dependencies
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

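from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

# imbalanced-learn dependencies (used by the ensemble classifiers later on)
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier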


In [3]:
# import dependencies from tensorflow
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint

#### Extract the Data

In [4]:
# load the training csv file from the Data folder
input_file_path = "./Data/train.csv"
training_df = pd.read_csv(input_file_path)
training_df.shape

(8693, 14)

#### Set PassengerId as the index
The PassengerId is unique for each passenger and is required in the contest submission. It is set as the index with the set_index method (keeping the column via drop=False) so that the PassengerId survives any data cleaning.

In [5]:
training_df.set_index('PassengerId' , inplace=True, drop=False)
training_df

| PassengerId (index) | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0001_01 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 0002_01 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 0003_01 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 0003_02 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 0004_01 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9276_01 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | False |
| 9278_01 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | False |
| 9279_01 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | True |
| 9280_01 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | False |


In [16]:
# drop rows where any value is NA (dropna defaults to how='any')
training_df = training_df.copy()
training_df.dropna(inplace=True)
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0001_01 to 9280_02
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   6606 non-null   object 
 1   HomePlanet    6606 non-null   object 
 2   CryoSleep     6606 non-null   object 
 3   Cabin         6606 non-null   object 
 4   Destination   6606 non-null   object 
 5   Age           6606 non-null   float64
 6   VIP           6606 non-null   object 
 7   RoomService   6606 non-null   float64
 8   FoodCourt     6606 non-null   float64
 9   ShoppingMall  6606 non-null   float64
 10  Spa           6606 non-null   float64
 11  VRDeck        6606 non-null   float64
 12  Name          6606 non-null   object 
 13  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 729.0+ KB


#### Explore the Data
##### What are the value counts for the Target (Transported)?
Is this a balanced or imbalanced problem? Balanced

In [None]:
# Examine the target variable "Transported"
Target = training_df["Transported"].value_counts()
Target

In [None]:
# visualize the value count of the Target
Target.plot(kind='bar')

##### What values are missing?

In [None]:
# visualize the missing values using seaborn heatmap
sns.heatmap(training_df.isnull(), cbar=False, cmap='viridis')

In [None]:
# get a count of missing values for each column in training_df
# isnull().sum() counts missing values per column; keep only columns with at least one
missing_counts = training_df.isnull().sum()
col_with_missing_values = missing_counts[missing_counts > 0]
col_with_missing_values


##### Strategy for missing values and combined columns:
* HomePlanet (201)
    - Initial Strategy: Impute missing values for HomePlanet with 'most_frequent' - Earth
    - Follow-on Strategy: Use Supervised Classification to predict missing HomePlanet<br><br>
    
* CryoSleep (217)
    - Initial Strategy: Impute CryoSleep with 'most_frequent' - False
    - Follow-on Strategy: Use KNN to Classify missing CryoSleep<br><br>
    
* Cabin (199)
    - Initial Strategy: Drop missing values then decompose Cabin into deck, num, side
    - Follow-on Strategy: Use KNN to Classify missing Cabin<br><br>

* Destination (182)
    - Initial Strategy: Impute Destination with 'most_frequent' - TRAPPIST-1e
    - Follow-on Strategy: Use Supervised Classification to predict missing Destination<br><br>
    
* Age (179)
    - Initial Strategy: Impute Age with the median age - 27
    - Follow-on Strategy: Use Supervised Regression to predict missing Age<br><br>

* VIP (203)
    - Initial Strategy: Impute VIP status with 'most_frequent' - False
    - Follow-on Strategy: Use KNN to Classify missing VIP status<br><br>

* RoomService (181), FoodCourt (183), ShoppingMall (208), Spa (183), VRDeck (188)
    - Initial Strategy: Impute missing values for RoomService, FoodCourt, ShoppingMall, Spa, VRDeck with 0<br><br>

* Name (200) 
    - Initial Strategy: Drop missing values then just keep last name
    - Follow-on Strategy: Use unsupervised learning to bin names and then classify the 'class' using K-Nearest Neighbor<br><br>

* PassengerId (0)
    - Initial Strategy: use str.split to split into GroupId and Count_in_Group
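
The initial strategies listed above (most frequent value, median, and a constant 0) map onto scikit-learn's `SimpleImputer`. The sketch below applies one imputer per strategy to a toy frame. The toy values are assumptions, and this notebook itself fills the columns with pandas `fillna` instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (values are illustrative only)
toy = pd.DataFrame({
    "HomePlanet": pd.Series(["Earth", np.nan, "Europa", "Earth"], dtype=object),
    "Age": [24.0, 58.0, np.nan, 16.0],
    "Spa": [549.0, np.nan, 0.0, 565.0],
})

# One imputer per strategy named in the list above
mode_imputer = SimpleImputer(strategy="most_frequent")
median_imputer = SimpleImputer(strategy="median")
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)

# fit_transform expects 2-D input and returns a 2-D array, hence [[col]] and ravel()
toy["HomePlanet"] = mode_imputer.fit_transform(toy[["HomePlanet"]]).ravel()
toy["Age"] = median_imputer.fit_transform(toy[["Age"]]).ravel()
toy["Spa"] = zero_imputer.fit_transform(toy[["Spa"]]).ravel()

print(toy)
```

Unlike a hard-coded fill value, a fitted imputer stores its statistics, so the same values can later be applied to test.csv with `transform`.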


#### The imputing strategy plans to use others in the same group if applicable
Create column Group_Size
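
A group-based fill could look roughly like the sketch below: take the most common known HomePlanet within each group and use it only for rows where the value is missing. The toy data is hypothetical, and this notebook does not apply this step.

```python
import numpy as np
import pandas as pd

# Toy passengers: group 3 has one known and one missing HomePlanet,
# group 5 has no known value at all
toy = pd.DataFrame({
    "GroupId": [3, 3, 4, 5],
    "HomePlanet": pd.Series(["Europa", np.nan, "Earth", np.nan], dtype=object),
})

# Most common known HomePlanet per group (groups with no known value are absent)
known = toy.dropna(subset=["HomePlanet"])
group_planet = known.groupby("GroupId")["HomePlanet"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing rows; existing values are left untouched
toy["HomePlanet"] = toy["HomePlanet"].fillna(toy["GroupId"].map(group_planet))

print(toy)
```

Rows whose whole group lacks a value stay missing and still need a fallback such as the most frequent planet overall.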

In [None]:
# get column names 
training_df.columns

##### PassengerId: Break out into GroupId and Count_in_Group and drop PassengerId

In [None]:
# Extract GroupId and Count_in_Group from PassengerId using str.split()
# PassengerId - A unique Id for each passenger. 
# Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. 
# People in a group are often family members, but not always.
training_df = training_df.copy()
training_df[["GroupId", "Count_in_Group"]] = training_df["PassengerId"].str.split("_", expand=True)

# drop the PassengerId column (it remains available as the index)
training_df.drop(columns=['PassengerId'], inplace=True)
training_df.columns

In [None]:
# reorder columns to put GroupId and Count_in_Group first
# extract_training_df.columns
columns = ['GroupId', 'Count_in_Group', 'Name', 'HomePlanet',  'Destination', 'Cabin',  
      'Age', 'CryoSleep','VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck', 'Transported' ]
training_df = training_df[columns]
training_df

#### Convert Count_in_Group to Group_Size

In [None]:
# check dtype of Count_in_Group
training_df['Count_in_Group'].dtypes

# recast object to int using astype()
training_df['Count_in_Group'] = training_df['Count_in_Group'].astype(int)

# check dtype
training_df['Count_in_Group'].dtypes

In [None]:
# convert GroupId to numeric (drops leading zeros)
training_df['GroupId'] = pd.to_numeric(training_df['GroupId'])
# training_df['GroupId']

In [None]:
# Group_Size is the max Count_in_Group within each GroupId,
# broadcast to every row of that group with groupby().transform()
training_df['Group_Size'] = training_df.groupby('GroupId')['Count_in_Group'].transform('max')
# training_df

In [None]:
training_df.columns

In [None]:
# reorder and drop Count_in_Group
# training_df.columns
columns = ['GroupId', 'Group_Size', 'Name', 'HomePlanet', 'Destination', 'Cabin',
       'Age', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck', 'Transported']
training_df = training_df[columns]
training_df

In [None]:
# view distribution of Group_Size
training_df['Group_Size'].value_counts().plot(kind='bar')

##### Impute HomePlanet with 'most_frequent'

In [None]:
# View passengers missing HomePlanet who travel in a group of two or more
training_df.loc[training_df['HomePlanet'].isnull() & training_df['Group_Size'].between(2, 8)]

In [None]:
# what is the most frequent value for HomePlanet
training_df['HomePlanet'].value_counts()

In [None]:
# Initial strategy
# fill missing HomePlanet values with Earth (assign back rather than using a chained inplace call)
training_df['HomePlanet'] = training_df['HomePlanet'].fillna('Earth')


##### Impute CryoSleep with 'most_frequent'

In [None]:
# find most common value for CryoSleep
CryoSleep = training_df['CryoSleep'].value_counts()
CryoSleep

In [None]:
# Impute Missing Values for CryoSleep to most common value (False)
training_df['CryoSleep'] = training_df['CryoSleep'].fillna(False)

##### Breakout Cabin into deck, cabin number and side

In [None]:
# Extract deck, cabin number and side from Cabin using str.split()
# Cabin - The cabin number where the passenger is staying. 
# Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

# drop missing values (initially)
# make a copy
extract_training_df = training_df.copy()
extract_training_df.dropna(subset=['Cabin'], inplace=True)

# split out Deck, Cabin_Number, Side
extract_training_df[["Deck", "Cabin_Number", "Side"]] = extract_training_df["Cabin"].str.split("/", expand=True)
extract_training_df

In [None]:
# view columns
extract_training_df.columns

In [None]:
# reorder columns to put Deck, Cabin_Number, Side together and drop Cabin
# extract_training_df.columns
columns = ['GroupId', 'Group_Size', 'Name', 'HomePlanet', 'Destination', 'Deck', 'Cabin_Number', 'Side',
       'Age', 'CryoSleep', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported']
extract_training_df = extract_training_df[columns]
extract_training_df

In [None]:
# Side counts
Side = extract_training_df['Side'].value_counts()
Side

In [None]:
# deck counts
Deck = extract_training_df['Deck'].value_counts()
Deck

# in the original vessel
# Deck A is a promenade deck
# Cabins are on decks B to G
# Deck T is for crew

In [None]:
# how many unique cabins were assigned?
Cabins_Assigned = len(pd.unique(extract_training_df['Cabin_Number']))
Cabins_Assigned

In [None]:
Cabin_Numbers = extract_training_df['Cabin_Number'].value_counts()
Cabin_Numbers

##### Destination: Impute with 'most_frequent' TRAPPIST-1e

In [None]:
# find most common value for Destination
Destination = extract_training_df['Destination'].value_counts()
Destination

In [None]:
# Impute Missing Values for Destination to TRAPPIST-1e

# create a copy
extract_training_copy_df = extract_training_df.copy()
extract_training_copy_df['Destination'] = extract_training_copy_df['Destination'].fillna('TRAPPIST-1e')

# check results
Destination = extract_training_copy_df['Destination'].value_counts()
Destination

##### Age: Impute with 'median age' (27)

In [None]:
# simplify name of dataframe 
training_df = extract_training_copy_df

In [None]:
# find median value for Age
Age = training_df['Age']
Age.describe()

In [None]:
# Impute Missing Values for Age with median value (27)
training_df['Age'] = training_df['Age'].fillna(27)

In [None]:
# visualize the distribution of age in a box plot
ax = sns.boxplot(x=Age)
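
The follow-on strategies listed earlier mention nearest-neighbour imputation, and `KNNImputer` is imported but not used in this notebook. As a possible alternative to the fixed fill of 27, the sketch below swaps in `KNNImputer` for Age on toy numeric data. Using Spa as the only helper feature and setting `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric data: the last passenger's Age is missing (values are illustrative only)
toy = pd.DataFrame({
    "Age": [20.0, 22.0, 60.0, np.nan],
    "Spa": [0.0, 10.0, 5000.0, 5.0],
})

# Each missing Age is replaced by the mean Age of the 2 rows whose
# observed features (here only Spa) are closest
knn_imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(knn_imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

`KNNImputer` works on numeric columns only and is distance based, so features on very different scales would normally be scaled first.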

##### VIP Status: Impute with 'most_frequent' False

In [None]:
# Use fillna with a dictionary of fill values to impute VIP with False
values = {"VIP": False}
training_df.fillna(value=values, inplace=True)


##### RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Impute with 0

In [None]:
# view the summary statistics for RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
training_df.describe()[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']]

In [None]:
# Use fillna with a dictionary of fill values to impute RoomService, FoodCourt, ShoppingMall, Spa, VRDeck with 0
values = {"RoomService": 0, "FoodCourt": 0, "ShoppingMall": 0, "Spa": 0, "VRDeck": 0}
training_df.fillna(value=values, inplace=True)


##### Name: Keep just Last_Name

In [None]:
# drop missing values for Name (initially)
# make a copy 
training_df = training_df.copy()
training_df.dropna(subset=['Name'], inplace=True)

In [None]:
# keep just last name (n=1 splits on the first space only, so the result has at most two columns)
training_df[["First_Name", "Last_Name"]] = training_df["Name"].str.split(" ", n=1, expand=True)
training_df.columns

In [None]:
# drop Name and First_Name
training_df.drop(columns=['Name', 'First_Name'], inplace=True)

In [None]:
training_df.columns

In [None]:
# get a count of missing values for each column in training_df
# isnull().sum() counts missing values per column; keep only columns with at least one
missing_counts = training_df.isnull().sum()
col_with_missing_values = missing_counts[missing_counts > 0]
col_with_missing_values

In [None]:
# visualize the missing values using seaborn heatmap
sns.heatmap(training_df.isnull(), cbar=False, cmap='viridis')

In [None]:
# examine dtypes
training_df.info()

#### Convert Columns from Booleans into Integers with astype()

In [None]:
# what are our boolean value columns
columns_boolean = training_df.select_dtypes(include='bool').columns
columns_boolean

In [None]:
# convert boolean to numeric
for col in ['CryoSleep', 'VIP']:
    training_df[col] = training_df[col].astype(int)

In [None]:
# check again for what are our boolean value columns
columns_boolean = training_df.select_dtypes(include='bool').columns
columns_boolean

### Split data into feature matrix (X) and target (y)

In [None]:
# Create feature matrix
X = training_df.drop(columns='Transported')
X.shape

In [None]:
# Create the target
y = training_df['Transported']
y.shape

In [None]:
# confirm all columns are object or numeric (for use in pd.get_dummies)
X.info()

In [None]:
# convert Cabin_Number to an int
X['Cabin_Number'] = X['Cabin_Number'].astype(int)

In [None]:
# check again for dtype for Cabin_Number
X['Cabin_Number'].dtypes

In [None]:
# convert GroupId to int
X['GroupId'] = X['GroupId'].astype(int)

In [None]:
# check again for dtype for GroupId
X['GroupId'].dtypes

In [None]:
# convert Group_Size to int
X['Group_Size'] = X['Group_Size'].astype(int)

In [None]:
# check again for dtype for Group_Size
X['Group_Size'].dtypes

### Save Clean_training_df as csv

In [None]:
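# save clean_training_df as csv file
# assumption: the cleaned frame is the feature matrix X joined with the target y
clean_training_df = pd.concat([X, y], axis=1)
clean_training_df.to_csv('./Data/clean_training.csv', index=False)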

### Preprocessing

In [None]:
# examine the shape of clean_training_df
clean_training_df.shape

In [None]:
# note the balance/imbalance of the target
y_value_counts = y.value_counts()
y_value_counts

# this is a balanced classification problem

In [None]:
# what are our categorical value columns
columns_obj = X.select_dtypes(include='object').columns
print(f"There are {len(columns_obj)} columns with a dtype of 'object'")

In [None]:
# view the object columns:
columns_obj

In [None]:
# examine all the columns and each columns value_counts
for column in X[columns_obj]:
    print(column, "\n", X[column].value_counts())
    print("-----------------------------------")

In [None]:
# these columns will be converted to numeric values using Pandas get_dummies
X_encoded = pd.get_dummies(X)
X_encoded.shape

In [None]:
# confirm all columns are now numeric
X_encoded.info()
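
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (illustrative values only). Each object column becomes one indicator column per distinct value, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with two object columns and one numeric column
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Earth"],
    "Side": ["P", "S", "S"],
    "Age": [24.0, 58.0, 16.0],
})

# Object columns are expanded into indicator columns named <column>_<value>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# Each object column contributes as many indicator columns as it has unique values
n_indicator_cols = sum(toy[col].nunique() for col in ["HomePlanet", "Side"])
print(n_indicator_cols)
```

A column with many distinct values, such as Last_Name, therefore adds one column per distinct surname, which can make the encoded matrix very wide.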

### StandardScaler: Transform the feature matrix using standard scaler 

In [None]:
# scale X_encoded using StandardScaler
data_scaler = StandardScaler()

In [None]:
# fit and transform our X_encoded
X_encoded_std_scaled = data_scaler.fit_transform(X_encoded)

# examine the first row
X_encoded_std_scaled[:1]

### Split the Data into Training and Testing Sets

In [None]:
# split the scaled feature matrix and the (unscaled) target into training and testing sets
# train_test_split defaults to a 75% / 25% split
X_train, X_test, y_train, y_test = train_test_split(X_encoded_std_scaled, y, random_state = 1)
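
The cells above scale the full feature matrix before splitting it. A common alternative ordering, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so the test rows do not influence the scaling statistics. The names `X_demo` and `y_demo` and the random data are assumptions for illustration, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary labels

# Split first so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```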

### Establish a Baseline for model performance using LogisticRegression as a classifier

In [None]:
# Instantiate a Logistic Regression Model
# increased max_iter to 1000 due to inability of 'lbfgs' solver to converge
LR_clf_baseline = LogisticRegression(solver='lbfgs', random_state=1, max_iter=1000)
LR_clf_baseline

In [None]:
# fit our model with our data (training)
LR_clf_baseline.fit(X_train, y_train)

In [None]:
# create predictions using predict() method
y_pred_baseline = LR_clf_baseline.predict(X_test)
y_pred_baseline

In [None]:
# view the data
pd.DataFrame({"Prediction": y_pred_baseline, "Actual": y_test})

In [None]:
# calculate model accuracy (the performance measure defined for this problem)
accuracy_score(y_test, y_pred_baseline)

In [None]:
# create a confusion matrix
cm_baseline = confusion_matrix(y_test, y_pred_baseline)
print(cm_baseline)

In [None]:
# confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm_baseline,display_labels= LR_clf_baseline.classes_)
disp.plot()

# save the image
plt.savefig("./Images/Baseline_confusion_matrix.png")
plt.show()

In [None]:
# view the classification report
# report = classification_report(y_test, y_pred_baseline, output_dict=True)

baseline_report = classification_report(y_test, y_pred_baseline)
print("Baseline")
print(baseline_report)
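
The accuracy, precision and recall figures in a classification report can all be derived from the four cells of the confusion matrix. The sketch below shows the arithmetic on a handful of made-up labels; `y_true` and `y_hat` are illustrative values, not results from this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for eight passengers
y_true = np.array([True, True, True, False, False, False, False, True])
y_hat = np.array([True, False, True, False, True, False, False, True])

# Rows are actual classes, columns are predicted classes, ordered [False, True]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of those predicted True, how many were True
recall = tp / (tp + fn)          # of those actually True, how many were found

print(cm)
print(accuracy, precision, recall)
```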

### Balanced Random Forest Classifier


In [None]:
# model fit predict using the BalancedRandomForestClassifier

# model
brf_model = BalancedRandomForestClassifier(n_estimators= 100,random_state=1)

# fit
brf_model = brf_model.fit(X_train, y_train)

# predict
predictions = brf_model.predict(X_test)

In [None]:
# Calculated the balanced accuracy score
print(f"The balanced accuracy score is: {balanced_accuracy_score(y_test, predictions):.3f}")

In [None]:
# Display the confusion matrix
cm_brf = confusion_matrix(y_test, predictions)
cm_brf

In [None]:
# confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm_brf,display_labels= brf_model.classes_)
disp.plot()

# save the image
plt.savefig("./Images/BalancedRandomForestClassifier_confusion_matrix.png")
plt.show()

In [None]:
# Print the classification report
BalancedRandomForestClassifier_report = classification_report(y_test, predictions)
print("BalancedRandomForestClassifier")
print(BalancedRandomForestClassifier_report)

In [None]:
# List the features sorted in descending order by feature importance
importance_features = brf_model.feature_importances_

# put this together in a dataframe
# get the column names
cols = X_encoded.columns

# create the dataframe
importance_features_df = pd.DataFrame({'feature': cols, 'importance': importance_features})
importance_features_df

In [None]:
# sort in desc order
print("Feature Importance BalancedRandomForestClassifier")
top_ten_importance_features_df = importance_features_df.sort_values('importance', ascending=False)
top_ten_importance_features_df.reset_index(drop=True, inplace=True)
top_ten_importance_features_df

### Easy Ensemble AdaBoost Classifier

In [None]:
# Train the EasyEnsembleClassifier

# model
ee_clf = EasyEnsembleClassifier(n_estimators=100, random_state=1)

# fit/train
ee_clf.fit(X_train, y_train)

# predict
y_pred = ee_clf.predict(X_test)

In [None]:
# Calculated the balanced accuracy score
balanced_accuracy_score(y_test, y_pred)

In [None]:
# Display the confusion matrix
cm_AdaBoost = confusion_matrix(y_test, y_pred)
cm_AdaBoost

In [None]:
# confusion matrix using ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm_AdaBoost,display_labels= ee_clf.classes_)
disp.plot()

# save the image
plt.savefig("./Images/EasyEnsembleClassifier_confusion_matrix.png")
plt.show()

In [None]:
# Print the classification report
EasyEnsembleClassifier_classification_report = classification_report(y_test, y_pred)
print("EasyEnsembleClassifier")
print(EasyEnsembleClassifier_classification_report)

#### What is the shape of the training_df?

In [None]:
# what is the shape of training_df
training_df.shape

#### What are the dtypes?

In [None]:
# examine missing values and dtypes using info()
training_df.info()

# note there are missing values for almost every column except the target "Transported" and PassengerId

## Neural Network

### Count the input features

In [None]:
# how many inputs in the input layer (one per column of the scaled feature matrix)
inputs = X_train.shape[1]

### Define the model
Add three hidden layers, then the output layer.

In [None]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
number_input_features = inputs
hidden_nodes_layer1 = inputs * 3
hidden_nodes_layer2 = inputs * 2
hidden_nodes_layer3 = inputs * 1

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu"))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation="sigmoid"))

# Third hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer3, activation="sigmoid"))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn.summary()
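
The parameter counts reported by `nn.summary()` follow from the layer sizes: a fully connected layer has (inputs + 1 bias) × units trainable parameters. The sketch below reproduces that arithmetic for the 3×, 2×, 1× layout used above. The helper `dense_param_count` and the input width of 10 are hypothetical; the real width depends on the encoded feature matrix.

```python
def dense_param_count(n_inputs, layer_units):
    """Trainable parameters of a stack of fully connected layers.

    Each layer has (inputs + 1 bias) * units parameters.
    """
    total = 0
    previous = n_inputs
    for units in layer_units:
        total += (previous + 1) * units
        previous = units
    return total

n_features = 10  # hypothetical input width; the notebook derives it from X_train
layers = [n_features * 3, n_features * 2, n_features * 1, 1]  # three hidden layers + output
print(dense_param_count(n_features, layers))
```

Because each hidden layer's width is a multiple of the input width, the parameter count grows roughly with the square of the number of encoded features.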

### Implement Checkpoints
Note: Create a callback that saves the model's weights every 5 epochs.

In [None]:
# Import checkpoint dependencies
# import os
# from tensorflow.keras.callbacks import ModelCheckpoint

# Define the checkpoint path and filenames
os.makedirs("Checkpoints/", exist_ok=True)
checkpoint_path = "Checkpoints/weights.{epoch:02d}.hdf5"

### Compile the Model

In [None]:
# Compile the model
nn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
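# Create a callback that saves the model's weights every 5 epochs.
# An integer save_freq counts batches, not epochs, so convert it using the
# number of batches per epoch (Keras' default batch size is 32).
batches_per_epoch = int(np.ceil(len(X_train) / 32))
cp_callback = ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=5 * batches_per_epoch)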

In [None]:
# Train the model
fit_model = nn.fit(X_train, y_train, epochs=50, callbacks=[cp_callback])

In [None]:
# Create a DataFrame containing training history
history_df = pd.DataFrame(fit_model.history, index=range(1,len(fit_model.history["loss"])+1))

# Plot the loss
history_df.plot(y="loss")

In [None]:
# Plot the accuracy
history_df.plot(y="accuracy")

In [None]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test, y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

In [None]:
# Export our model to HDF5 file
nn.save("./Trained_Models/DeepNeuralNetwork.h5")