This project is a data science exploration of the Spaceship Titanic dataset, a variant of the famous Titanic competition on Kaggle. The goal is to predict the survival of passengers on a spaceship that is about to collide with a 'spacetime anomaly', using machine learning techniques. The project report can be found in the html file or the Rmd file.
Of the original features, transported was the variable to be predicted.
The original dataset included features such as cabin and passenger_id that contain hidden information. For example, by splitting the cabin column on each '/', we were able to create 3 new columns. Similarly, by splitting the passenger_id column on the '_' symbol, we were able to create 2 new columns.
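The splitting described above can be sketched in a few lines. This is a minimal Python illustration, not the code from the Rmd report. The names of the new columns (deck, cabin_num, side, group, number_in_group) are assumptions taken from the Kaggle data description, and the sample row is made up.

```python
def split_cabin(cabin):
    """Split a cabin string such as 'B/0/P' into three parts."""
    if not cabin:
        # Missing cabin: return three empty slots so the row keeps its shape.
        return (None, None, None)
    deck, num, side = cabin.split("/")
    return (deck, int(num), side)


def split_passenger_id(passenger_id):
    """Split a passenger id such as '0001_01' into two parts."""
    group, number = passenger_id.split("_")
    return (group, int(number))


# Hypothetical row, for illustration only.
row = {"passenger_id": "0001_01", "cabin": "B/0/P"}

deck, cabin_num, side = split_cabin(row["cabin"])
group, number_in_group = split_passenger_id(row["passenger_id"])

features = {
    "deck": deck,
    "cabin_num": cabin_num,
    "side": side,
    "group": group,
    "number_in_group": number_in_group,
}
print(features)
```

Keeping the group as a string preserves its leading zeros, while the numeric parts are converted to integers so they can be used directly as model inputs.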
This project is a binary classification task. We tried several machine learning models, including:
- Logistic regression
- Shrinkage methods (Ridge, Lasso, Elastic net)
- Tree methods (Single tree, Random Forest, Bagging, Boosting)
- Support Vector Machines (Linear and non-linear)
- Neural networks (NN)
Our analysis showed that the best model was Boosting, with an Area Under the Curve (AUC) of 0.8824744.
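For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. It is a plain Python illustration with made-up labels and scores, not the evaluation code used in the report.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels
    scores: iterable of predicted scores (higher means more likely positive)
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.4, 0.1]
print(auc(example_labels, example_scores))
```

The pairwise form is quadratic in the number of examples, so it is only meant to make the definition concrete; library implementations use a sort-based computation instead.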
Spa, VRDeck and CryoSleep turned out to be the most relevant features in terms of Mean Decrease in Gini Index for the tree-based methods. Let's look at the charts we made ex-ante to see whether this was visually intuitive:
CryoSleep was very clear. For the numerical columns, however, the evidence is harder to see because some extremely high values skew the data to the right.
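The effect of a few extreme values can be illustrated with a small example. The spending values below are made up, and the log transform is only one common way to make a right-skewed column easier to read in a chart; the report does not state that any such transform was applied.

```python
import math
import statistics

# Made-up spending values: mostly small, with one extreme outlier.
spend = [0, 0, 0, 10, 25, 40, 60, 12000]

mean = statistics.mean(spend)
median = statistics.median(spend)

# The outlier pulls the mean far above the median, a sign of right skew.
print("mean:", mean, "median:", median)

# log1p(x) = log(1 + x) is defined at zero and compresses large values,
# so the bulk of the distribution is no longer squeezed against the axis.
log_spend = [math.log1p(x) for x in spend]
print("max on log scale:", round(max(log_spend), 2))
```

On the raw scale a histogram of such a column is dominated by the single large value; on the log scale the smaller values spread out and differences between groups become easier to see.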
This was an exciting and enlightening project that allowed us to dive deep into the dataset and uncover hidden insights. We were able to experiment with a variety of different models and understand their parameters and appropriate use cases.