The sinking of the Titanic is one of the most infamous shipwrecks in history. The aim of this project is to predict whether a passenger survived or not.
The dataset was downloaded from Kaggle and consists of two files (train and test sets). The train dataset contains several features (PassengerId, Age, Sex, etc.) and one label, which indicates whether a passenger survived. The test dataset contains only the features and is not labeled. In this project, only the train dataset is used: its rows are split into a train set and a test set, so that the test set has labels.
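The split described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame stands in for Kaggle's train file with made-up values, the label column name `Survived` is assumed, and the 80/20 ratio, `random_state` and stratification are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the labeled Kaggle train file; all values are made up.
df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "female", "female", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # assumed label column name
})

X = df.drop(columns=["Survived"])  # features
y = df["Survived"]                 # label

# Hold out 20% of the labeled rows as a test set.
# stratify=y keeps the survived/not-survived ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```

Because both parts come from the labeled file, predictions on `X_test` can be scored against `y_test`.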
This repository contains one code file, 'code.ipynb', written as a Jupyter Notebook.
In this section, different EDA methods were used to observe and visualize the nature of the data (distributions, box plots, missing values, outliers, etc.).
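A few of these checks can be sketched with pandas. The column names and values below are made up for illustration, and the 1.5 × IQR rule is one common outlier heuristic, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Small made-up sample with missing ages and one extreme fare.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 512.33, 21.08, 11.13],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) describe the distribution.
summary = df.describe()

# Flag outliers in Fare with the 1.5 * IQR rule.
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Fare"] < q1 - 1.5 * iqr) | (df["Fare"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(missing_per_column)
print(summary)
print(outliers)
```

With matplotlib installed, `df.hist()` and `df.boxplot()` give the matching histogram and box-plot views.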
We now have a clear picture of the preprocessing the data requires: removing outliers, handling missing values, normalizing the distribution of some features, and creating new features from existing ones. Because the features are either numerical or categorical, an encoding step was required to transform the categorical features into numerical data, since most machine learning algorithms, such as Logistic Regression, SVM and KNN, accept only numerical input.
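The preprocessing steps above might look roughly like this. The concrete choices here are assumptions for illustration rather than the notebook's exact recipe: median imputation, a log transform, a hypothetical `FamilySize` feature, and the two encodings. The data values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the Kaggle Titanic files.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 512.33],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# Missing data: fill missing ages with the median age (one possible strategy).
df["Age"] = df["Age"].fillna(df["Age"].median())

# Skewed distribution: log(1 + x) compresses the long right tail of Fare.
df["Fare"] = np.log1p(df["Fare"])

# New feature built from existing ones (FamilySize is a hypothetical name).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Encoding: map a binary category to 0/1, one-hot encode a multi-valued one.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

print(df)
```

After these steps every column is numeric, which is what Logistic Regression, SVM and KNN expect as input.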
Several models were built to classify the output: Logistic Regression, SVM, KNN, Decision Tree, Random Forest and ANN. Each model then predicts on the test set so that the accuracy of the different algorithms can be compared; a cross-validation technique is used as well.
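A comparison of this kind can be sketched with scikit-learn as below. The data is synthetic (`make_classification`) rather than the Titanic features, and the hyperparameters, the 5-fold setting and the scaling pipelines are illustrative assumptions. The ANN is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the preprocessed features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale-sensitive models get a StandardScaler in front; tree models do not need it.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds of the training data.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="accuracy").mean()
    # Accuracy on the held-out test set after fitting on all training rows.
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_acc, test_acc)
    print(f"{name}: CV accuracy {cv_acc:.3f}, test accuracy {test_acc:.3f}")
```

Reporting both numbers helps spot a model whose single test-set score is unusually lucky or unlucky compared with its cross-validated average.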
Up to this stage, we have measured the accuracy of each algorithm, but we do not know which features drove its decisions. Here, SHAP values are computed to open up the black-box model and show which features contributed the most to its predictions.
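To give an intuition for what such a value measures, the sketch below computes exact Shapley values for a tiny hand-written model by enumerating every feature coalition. It is a from-scratch illustration, not a call into the `shap` library, which estimates these quantities efficiently for real models. The `toy_model` function, its coefficients, the feature order and the baseline passenger are all hypothetical, and replacing "absent" features with baseline values is one simplifying convention.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    A feature outside the coalition takes its baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    """Hypothetical survival score; features are [sex, pclass, age]."""
    sex, pclass, age = features
    return 0.5 * sex - 0.1 * pclass - 0.01 * age + 0.2 * sex * (pclass == 1)


x = [1, 1, 30]         # passenger being explained (made-up values)
baseline = [0, 3, 30]  # reference passenger (made-up values)

phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["sex", "pclass", "age"], phi):
    print(f"{name}: {contribution:+.3f}")
```

The contributions add up to the difference between the model output for `x` and for `baseline`, and a feature whose value matches the baseline (age here) gets a contribution of zero.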