
# Titanic Spaceship Machine Learning Project

The final project I have chosen is taken from Kaggle. It is called Spaceship Titanic, and the goal of the machine learning task is to predict whether a passenger on the spaceship has been transported to an alternate dimension. The dataset provided by Kaggle is a CSV file with the personal records of each passenger. These records can be split into features, which are used to predict whether a passenger has been transported, and a label, which is either true or false. This is a binary classification problem, and several machine learning algorithms are tried and compared in order to solve it. I have chosen to try the AdaBoost, Random Forest and Support Vector Machine algorithms. The notebook is split into parts, and each task is wrapped in a function to keep the notebook organized. Below are the steps:
1. Import all the necessary libraries
2. Load the csv file
3. Clean the data by dropping unnecessary features, converting string data to numerical codes, replacing True/False with 1 and 0, and filling NaN values with column medians.
4. Display the correlation matrix
5. Normalize the dataset using min/max normalization
6. Split the data into a Train and Test set
7. Run a grid search for 3 models (Adaboost, Random Forest and SVM) to find the best hyperparameters
8. Train 3 models (Adaboost, Random Forest and SVM)
9. Display the training and test scores for each model
10. Conclusion

Link to the project on github: https://github.com/jochen13/Data-Science-As-a-Field/blob/main/Titanic_Spaceship.ipynb

## Import Statements
Here the following libraries are imported:
1. Pandas
2. Numpy
3. Sklearn support for Adaboost, LinearSVC, RandomForestClassifier, GridSearch and train_test_split
4. Matplotlib

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

## Data loading, normalization and display of a correlation matrix
This section consists of functions that do the following tasks:
1. Load the data into a dataframe
2. Fill NaN values, factorize categorical data and normalize (min/max normalization)
3. Create correlation matrix

In [2]:
def load_data(filename: str = 'train.csv') -> pd.DataFrame:
  return pd.read_csv(filename)

In [3]:
def normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
  # Drop identifier columns; drop() returns a copy, so the caller's frame is left untouched
  df = df.drop(columns=['PassengerId', 'Name'])
  # Multiplying by 1 turns True/False into 1/0
  df = df * 1

  # Encode each categorical column as integer codes (pd.factorize marks missing values as -1)
  for column in ['HomePlanet', 'Destination', 'Cabin']:
    values, _ = pd.factorize(df[column])
    df[column] = values

  # Coerce every column to numeric, then fill the remaining NaN values with the column median
  df[df.columns] = df[df.columns].apply(pd.to_numeric, errors='coerce')
  df = df.fillna(df.median())

  return df
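One detail of `normalize_dataframe` is easy to miss: `pd.factorize` encodes missing values as `-1` instead of leaving them as NaN, so the later median fill does not touch the categorical columns. The toy column below is made up for illustration and only stands in for a feature such as `HomePlanet`; the final line is an optional alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data) standing in for a categorical feature.
planets = pd.Series(["Earth", "Europa", np.nan, "Earth"])

codes, uniques = pd.factorize(planets)
print(codes.tolist())   # the missing entry is encoded as -1
print(list(uniques))    # the distinct non-missing categories

# Optional alternative (an assumption, not the notebook's approach):
# turn the -1 sentinel back into NaN so a later fillna() can see it.
codes_with_nan = pd.Series(codes).replace(-1, np.nan)
print(codes_with_nan.isna().sum())
```

Whether `-1` should be kept as its own "unknown" category or imputed is a modelling choice; tree-based models usually cope well with the sentinel.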

In [4]:
def display_correlation_matrix(df: pd.DataFrame) -> None:
  display(df.corr().style.background_gradient(cmap='coolwarm').format(precision=2))

In [5]:
def create_labels_and_features(df: pd.DataFrame):
  y = np.array(df['Transported'])

  df_min_max_scaled = df.drop(columns=['Transported']).copy()
  for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
  X = np.array(df_min_max_scaled)

  return X, y

## Split the dataset into a training and test set

In [6]:
def split_into_test_and_train_sets(X: np.array, y: np.array):
  return train_test_split(X, y, test_size=0.2, random_state=42)
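In `create_labels_and_features` the min/max values are computed over the whole dataset before it is split, so the test rows influence the scaling. A common alternative is to learn the scaling from the training rows only. The sketch below shows that idea with scikit-learn's `MinMaxScaler` on synthetic data; the variable names and the random data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for an unscaled feature matrix and binary labels.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_demo, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # min/max learned from the training rows only
X_te_scaled = scaler.transform(X_te)      # same min/max reused; values may fall slightly outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the test set stays unseen until evaluation, at the cost of test values that are not guaranteed to lie exactly in [0, 1].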

## Adaboost Classifier
In this section of the notebook the following is done:

1. Run a grid search over 3 hyperparameters for the AdaBoost classifier
2. Use the best parameters to train an AdaBoost classifier on the Titanic Spaceship dataset


In [7]:
def adaBoostGridSearch(X_train: np.array, y_train: np.array):
  # Define the hyperparameter grid
  param_grid = {
    "learning_rate": [0.25, 0.5,1],
    "n_estimators": [100, 200, 300, 1000],
    "algorithm": ["SAMME", "SAMME.R"],
    'random_state' : [0]
  }

  # Create the AdaBoost classifier
  ada_classifier = AdaBoostClassifier()

  # Initialize GridSearchCV
  grid_search = GridSearchCV(ada_classifier, param_grid=param_grid, scoring='accuracy')

  # Fit the model to your data
  grid_search.fit(X_train, y_train)

  # Get the best hyperparameters
  best_params = grid_search.best_params_
  print(f"Best Hyperparameters: {best_params}")
  return grid_search.best_params_

In [8]:
def adaboostAlgo(best_params_ada, X_train: np.array, y_train: np.array, X_test: np.array, y_test: np.array):
  ada = AdaBoostClassifier(**best_params_ada).fit(X_train, y_train)
  print(f"Adaboost: Training score: {ada.score(X_train, y_train)} | Test score: {ada.score(X_test, y_test)}")
  return ada
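To get a feeling for how expensive the grid search in `adaBoostGridSearch` is, the number of candidate models can be counted with scikit-learn's `ParameterGrid`. The grid below is copied from the function above; the fold count of 5 is an assumption based on scikit-learn's default when `cv` is not passed to `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in adaBoostGridSearch.
param_grid = {
    "learning_rate": [0.25, 0.5, 1],
    "n_estimators": [100, 200, 300, 1000],
    "algorithm": ["SAMME", "SAMME.R"],
    "random_state": [0],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 2 * 1 combinations
n_folds = 5  # assumed: GridSearchCV's default when cv is not set
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

Counting the fits up front helps decide whether the grid is affordable, especially with settings such as 1000 estimators.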

## Random Forest Classifier
In this section of the notebook the following is done:
1. Run a GridSearch on 4 hyperparameters for the Random Forest Classifier
2. Use the best parameters to then train a Random Forest Classifier on the Titanic Spaceship dataset

In [9]:
def randomForestGridSearch(X_train: np.array, y_train: np.array):
  # Define the hyperparameter grid
  param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'random_state' : [0]
  }

  # Create the random forest classifier
  rf_classifier = RandomForestClassifier()

  # Initialize GridSearchCV
  grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

  # Fit the model to your data
  grid_search.fit(X_train, y_train)

  # Get the best hyperparameters
  best_params = grid_search.best_params_
  print(f"Best Hyperparameters: {best_params}")
  return grid_search.best_params_

In [10]:
def randomForest(best_params_rf, X_train: np.array, y_train: np.array, X_test: np.array, y_test: np.array):
  rf = RandomForestClassifier(**best_params_rf).fit(X_train, y_train)
  print(f"Random Forest: Training score: {rf.score(X_train, y_train)} | Test score: {rf.score(X_test, y_test)}")
  return rf
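The two functions above search for the best parameters and then train a fresh model with them. Since `GridSearchCV` refits the best candidate on the full training data by default (`refit=True`), the fitted model can also be taken directly from `best_estimator_`. The sketch below shows this on synthetic data with a deliberately tiny grid; the data and grid values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the Spaceship Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Deliberately tiny grid so the sketch runs quickly.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_rf = search.best_estimator_  # already refit on all of X_tr
test_acc = best_rf.score(X_te, y_te)
print(f"CV accuracy: {search.best_score_:.3f} | Test accuracy: {test_acc:.3f}")
```

Reusing `best_estimator_` avoids training the winning configuration a second time, and `best_score_` gives a cross-validated estimate to compare against the test score.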

## Support Vector Machine ML Algorithm
In this section of the notebook the following is done:

1. Run a grid search over 2 hyperparameters for the SVM classifier
2. Use the best parameters to train an SVM classifier on the Titanic Spaceship dataset


In [11]:
def svmGridSearch(X_train: np.array, y_train: np.array):
  # Define the hyperparameter grid
  param_grid = {
    'max_iter': [500, 1000, 1500],
    'tol': [1e-4, 1e-5, 1e-6],
    'random_state' : [0]
  }

  # Create the linear SVM classifier
  svm_classifier = LinearSVC()

  # Initialize GridSearchCV
  grid_search = GridSearchCV(estimator=svm_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

  # Fit the model to your data
  grid_search.fit(X_train, y_train)

  # Get the best hyperparameters
  best_params = grid_search.best_params_
  print(f"Best Hyperparameters: {best_params}")
  return grid_search.best_params_

In [14]:
def linear_svm(best_params_svm, X_train: np.array, y_train: np.array, X_test: np.array, y_test: np.array):
  svm = LinearSVC(**best_params_svm).fit(X_train, y_train)
  print(f"SVM: Training score: {svm.score(X_train, y_train)} | Test score: {svm.score(X_test, y_test)}")
  return svm
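The grid in `svmGridSearch` varies `max_iter` and `tol`, which mainly control when the solver stops. The regularization strength `C` is the `LinearSVC` parameter that typically has the larger effect on the fitted decision boundary, so it may be worth searching over as well. The sketch below is an assumption about how such a search could look, run on synthetic data; the candidate values for `C` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic data standing in for the scaled Spaceship Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

c_values = [0.01, 0.1, 1, 10]  # illustrative candidates, not tuned for the real data
search = GridSearchCV(
    LinearSVC(random_state=0, max_iter=5000),
    param_grid={"C": c_values},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Smaller values of `C` regularize more strongly, while larger values fit the training data more closely.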

In [15]:
train_data = load_data('train.csv')
train_data = normalize_dataframe(train_data)
display_correlation_matrix(train_data)

| | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HomePlanet | 1.0 | -0.04 | -0.0 | -0.21 | -0.12 | -0.06 | 0.18 | -0.28 | 0.08 | -0.2 | -0.23 | -0.09 |
| CryoSleep | -0.04 | 1.0 | -0.01 | 0.1 | -0.07 | -0.08 | -0.24 | -0.21 | -0.21 | -0.2 | -0.19 | 0.46 |
| Cabin | -0.0 | -0.01 | 1.0 | -0.0 | -0.01 | 0.01 | -0.0 | -0.01 | 0.02 | -0.01 | 0.02 | 0.02 |
| Destination | -0.21 | 0.1 | -0.0 | 1.0 | 0.0 | 0.05 | -0.04 | 0.11 | -0.02 | 0.05 | 0.07 | 0.1 |
| Age | -0.12 | -0.07 | -0.01 | 0.0 | 1.0 | 0.09 | 0.07 | 0.13 | 0.03 | 0.12 | 0.1 | -0.07 |
| VIP | -0.06 | -0.08 | 0.01 | 0.05 | 0.09 | 1.0 | 0.06 | 0.13 | 0.02 | 0.06 | 0.12 | -0.04 |
| RoomService | 0.18 | -0.24 | -0.0 | -0.04 | 0.07 | 0.06 | 1.0 | -0.02 | 0.05 | 0.01 | -0.02 | -0.24 |
| FoodCourt | -0.28 | -0.21 | -0.01 | 0.11 | 0.13 | 0.13 | -0.02 | 1.0 | -0.01 | 0.22 | 0.22 | 0.05 |
| ShoppingMall | 0.08 | -0.21 | 0.02 | -0.02 | 0.03 | 0.02 | 0.05 | -0.01 | 1.0 | 0.01 | -0.01 | 0.01 |
| Spa | -0.2 | -0.2 | -0.01 | 0.05 | 0.12 | 0.06 | 0.01 | 0.22 | 0.01 | 1.0 | 0.15 | -0.22 |


In [16]:
X, y = create_labels_and_features(train_data)
X_train, X_test, y_train, y_test  = split_into_test_and_train_sets(X, y)

In [17]:
best_params_ada = adaBoostGridSearch(X_train, y_train)
ada = adaboostAlgo(best_params_ada, X_train, y_train, X_test, y_test)

Best Hyperparameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.5, 'n_estimators': 300, 'random_state': 0}
Adaboost: Training score: 0.8048605119355766 | Test score: 0.7849338700402531


In [18]:
best_params_rf = randomForestGridSearch(X_train, y_train)
rf = randomForest(best_params_rf, X_train, y_train, X_test, y_test)

Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200, 'random_state': 0}
Random Forest: Training score: 0.8550474547023296 | Test score: 0.7855089131684876


In [19]:
best_params_svm = svmGridSearch(X_train, y_train)
svm = linear_svm(best_params_svm, X_train, y_train, X_test, y_test)

Best Hyperparameters: {'max_iter': 500, 'random_state': 0, 'tol': 0.0001}
SVM: Training score: 0.769485188380788 | Test score: 0.7665324899367453


## Conclusion

All three ML classifiers do fairly well on both the training and the test sets, but there is still room for improvement. As can be seen in the correlation matrix, some features correlate much more strongly with the Transported label than others. CryoSleep stands out with the strongest correlation (0.46), followed by RoomService (-0.24) and Spa (-0.22), whose negative sign still carries predictive information. Destination (0.10) is only weakly correlated. I think feature selection is the first place where an improvement can be achieved, by dropping features whose correlation with the label is close to zero, such as Cabin (0.02) and ShoppingMall (0.01).
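When judging which features are informative, the sign of a correlation matters less than its magnitude. As a small sketch, the values in the Transported column of the correlation matrix above can be ranked by absolute value. The numbers are copied from the displayed output; VRDeck is left out because its row is not shown there.

```python
import pandas as pd

# Correlations with the label, copied from the "Transported" column of the matrix output.
corr_with_label = pd.Series({
    "HomePlanet": -0.09, "CryoSleep": 0.46, "Cabin": 0.02, "Destination": 0.10,
    "Age": -0.07, "VIP": -0.04, "RoomService": -0.24, "FoodCourt": 0.05,
    "ShoppingMall": 0.01, "Spa": -0.22,
})

# Rank by magnitude so that strongly negative features are not mistaken for weak ones.
ranked = corr_with_label.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a ranking like this is a first screening step and not a definitive feature-selection method.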

The three ML classifiers achieve the following accuracy:

* Adaboost: 80.5% on the training set and 78.5% on the test set
* Random Forest Classifier: 85.5% on the training set and 78.6% on the test set
* SVM Classifier: 76.9% on the training set and 76.7% on the test set

The Adaboost and Random Forest classifiers reach almost identical test accuracy. On balance the Adaboost classifier looks like the best performer, because it gets there with a gap of only about 2 percentage points between training and test accuracy, while the Random Forest's gap of about 7 points suggests that it overfits the training data. The linear SVM generalizes consistently but scores lowest. To get better results, it would also be interesting to see how a deep neural network performs on the data.
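The comparison rests on simple arithmetic: the difference between training and test accuracy for each model. A minimal sketch using the scores printed by the three training cells:

```python
# Scores copied from the printed output of the three training cells.
scores = {
    "Adaboost": (0.8048605119355766, 0.7849338700402531),
    "Random Forest": (0.8550474547023296, 0.7855089131684876),
    "SVM": (0.769485188380788, 0.7665324899367453),
}

# Gap between training and test accuracy; a larger gap hints at overfitting.
gaps = {name: train_acc - test_acc for name, (train_acc, test_acc) in scores.items()}

for name, (train_acc, test_acc) in scores.items():
    print(f"{name}: train {train_acc:.1%} | test {test_acc:.1%} | gap {gaps[name]:.1%}")
```

A single train/test split gives only one estimate of each gap, so small differences in test accuracy should be read with caution.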

