Top 5% in the Kaggle Titanic competition, achieved with an ensemble of models.

Moddy2024/Titanic-Survival-prediction

Titanic top 5%

This is one of the most famous Kaggle competitions: you use machine learning to build a model that predicts which passengers survived the Titanic shipwreck. I did a lot of data engineering and feature engineering to clean the data and improve the accuracy of my models.

I trained three models:

  • Logistic Regression.
  • XGBClassifier.
  • Random Forest.

I also built two ensembles: one of all three models, and one of only Logistic Regression and XGBClassifier. Every submission file scores at least 75% accuracy on Kaggle. The best one is the ensemble of Logistic Regression and XGBClassifier, in the file named hardvoting_withoutrf, which placed me in the top 5% of participants in the Kaggle Titanic competition. Random Forest seems to overfit, probably because the dataset is small. Compared individually, Logistic Regression performs better than XGBoost and Random Forest, and ensembling the two best models performs even better.
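To make the hard-voting idea concrete, here is a minimal sketch of majority voting over binary predictions. It is not the notebook's code: the function name `hard_vote`, the prediction arrays, and the rule that ties go to class 0 are all assumptions chosen for illustration. With only two models, a passenger is predicted to survive only when both models agree on survival under this tie rule.

```python
import numpy as np

def hard_vote(*model_predictions):
    """Majority vote over 0/1 predictions from several models.

    Ties (possible with an even number of models) go to class 0.
    That tie rule is an assumption of this sketch, not something
    the notebook states.
    """
    stacked = np.vstack(model_predictions)    # shape: (n_models, n_passengers)
    votes_for_survival = stacked.sum(axis=0)  # number of models predicting 1
    n_models = stacked.shape[0]
    # Predict 1 only when strictly more than half of the models vote 1.
    return (votes_for_survival * 2 > n_models).astype(int)

# Hypothetical predictions for five passengers (illustrative values only).
logreg_pred = np.array([1, 0, 1, 0, 1])
xgb_pred = np.array([1, 0, 0, 1, 1])
rf_pred = np.array([0, 0, 1, 1, 1])

two_model_vote = hard_vote(logreg_pred, xgb_pred)
three_model_vote = hard_vote(logreg_pred, xgb_pred, rf_pred)
```

scikit-learn's `VotingClassifier` with `voting="hard"` offers the same kind of vote over fitted estimators, and it also resolves ties toward the lower class label.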

Dataset

You can download the dataset from Kaggle, or use the copy already included in this repo.

For the training file go here.

For the test file go here.

Software requirements

  • NumPy
  • Pandas
  • Seaborn
  • Matplotlib
  • scikit-learn
  • XGBoost

Key Files

  • titanic-1.ipynb - This notebook contains all the data engineering and feature engineering I performed. After that, I train the models, ensemble them, and check their cross-validation scores.
  • results - The results of the different models and ensembles, in CSV format.
  • files from kaggle - The files provided by Kaggle. There are three files: the training file, the test file, and gender_submission.csv. For the competition, only the training file may be used to train the model, and predictions for submission are made on the data in the test file. gender_submission.csv is an example of what a submission file should look like.
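As a rough sketch of the submission format that gender_submission.csv illustrates, the snippet below builds a two-column table with `PassengerId` and `Survived` and serializes it as CSV. The small `test_df` frame and the `predictions` list are hypothetical stand-ins for the real test file and model output, and the snippet writes to an in-memory buffer rather than to any file in this repo.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle test file; only PassengerId is needed here.
test_df = pd.DataFrame({"PassengerId": [892, 893, 894]})

# Illustrative values only, not real model output.
predictions = [0, 1, 0]

# A submission has one row per test passenger and exactly these two columns.
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": predictions,
})

# Serialize without the DataFrame index so that only the two columns appear.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

To produce a file instead, pass a path to `to_csv`, for example `submission.to_csv("my_submission.csv", index=False)`, where the file name is a placeholder.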