- Objective
- Dataset Description
- Data Preprocessing
- Model Building
- Data Visualisations
- Best Model - Random Forest Model (with Hyperparameter Tuning)
- Conclusion
- Final Thoughts
## Objective
The objective of this project was to build a predictive model on the Titanic dataset that determines whether a given passenger survived. This dataset is a common starting point for data science and machine learning projects due to its simplicity and the availability of relevant features.
## Dataset Description
The Titanic dataset contains information about individual passengers, including the following features:
- Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: Gender of the passenger
- Age: Age of the passenger
- SibSp: Number of siblings or spouses aboard the Titanic
- Parch: Number of parents or children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- Survived: Survival status (0 = No, 1 = Yes) [Target variable]
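As a rough illustration of the schema above, the following sketch loads a CSV and counts missing values per column. The helper name `load_titanic` and the inline sample rows are hypothetical; the column names are taken from the list above, and the CSV layout is assumed to match it.

```python
import io

import pandas as pd

# Columns described above; assumed to appear under these names in the CSV.
FEATURE_COLUMNS = [
    "Pclass", "Sex", "Age", "SibSp", "Parch",
    "Ticket", "Fare", "Cabin", "Embarked",
]
TARGET_COLUMN = "Survived"


def load_titanic(source):
    """Read a CSV and keep only the described features plus the target (hypothetical helper)."""
    df = pd.read_csv(source)
    expected = FEATURE_COLUMNS + [TARGET_COLUMN]
    absent = [col for col in expected if col not in df.columns]
    if absent:
        raise ValueError(f"missing expected columns: {absent}")
    return df[expected]


# Made-up sample rows standing in for the real file.
sample_csv = io.StringIO(
    "Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived\n"
    "3,male,22,1,0,T1,7.25,,S,0\n"
    "1,female,38,1,0,T2,71.28,C85,C,1\n"
    "3,female,,0,0,T3,7.92,,,1\n"
)
df = load_titanic(sample_csv)

# Count missing values per column to see which features need imputation.
missing_counts = df.isna().sum()
print(missing_counts)
```

Passing a file path instead of the in-memory buffer would work the same way, since `pd.read_csv` accepts both.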
## Data Preprocessing
Before building the models, the following preprocessing steps were undertaken:
- Handling Missing Values: Missing values in the `Age`, `Cabin`, and `Embarked` columns were addressed.
  - `Age` was imputed using the median age.
  - `Cabin` was dropped due to its large number of missing values.
  - Missing values in `Embarked` were filled with the most common port (`S`).
- Feature Encoding: Categorical variables (`Sex`, `Embarked`) were converted into numerical values using one-hot encoding.
- Feature Scaling: Continuous variables (`Age`, `Fare`) were standardized to have a mean of 0 and a standard deviation of 1.
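The preprocessing steps above could be sketched roughly as follows with pandas and scikit-learn. This is not the project's actual code: the function name `preprocess` and the toy rows are assumptions made for illustration, and the most common port is computed with `mode()` rather than hard-coded.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the described steps: impute, drop, encode, scale (hypothetical helper)."""
    out = df.copy()

    # Impute missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())

    # Drop Cabin because most of its values are missing.
    out = out.drop(columns=["Cabin"])

    # Fill missing ports with the most common one.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # One-hot encode the categorical variables.
    out = pd.get_dummies(out, columns=["Sex", "Embarked"], dtype=int)

    # Standardize continuous variables to mean 0 and standard deviation 1.
    scaler = StandardScaler()
    out[["Age", "Fare"]] = scaler.fit_transform(out[["Age", "Fare"]])
    return out


# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})
clean = preprocess(toy)
print(clean.columns.tolist())
```

For brevity the scaler is fitted on the whole frame. When a train/test split is used, fitting it on the training portion only avoids leaking test-set statistics into the features.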
## Model Building
Two machine learning models were trained and evaluated: Logistic Regression and Random Forest. Additionally, Randomized Search Cross-Validation was used to tune the hyperparameters of the Random Forest model.
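A minimal sketch of this workflow with scikit-learn is shown below. The original search space, split ratio, and number of search iterations are not stated in this write-up, so the values here are assumptions, and synthetic data from `make_classification` stands in for the preprocessed Titanic features. The printed scores therefore will not match the figures reported in this document.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the preprocessed features and the Survived target.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline models with default hyperparameters.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Hypothetical search space; the values used in the project are not given.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 4, 8],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Randomized search samples n_iter combinations and scores each with cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

# Compare held-out accuracy for the three fitted models.
scores = {
    "logistic_regression": accuracy_score(y_test, log_reg.predict(X_test)),
    "random_forest": accuracy_score(y_test, forest.predict(X_test)),
    "random_forest_tuned": accuracy_score(
        y_test, search.best_estimator_.predict(X_test)
    ),
}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
print("best params:", search.best_params_)
```

Randomized search evaluates only a sample of the parameter combinations, which keeps tuning cheap compared with an exhaustive grid at the cost of possibly missing the best setting.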
### Logistic Regression Model
- Accuracy: 0.80
- Precision, Recall, and F1-Score:
### Random Forest Model
- Accuracy: 0.82
- Precision, Recall, and F1-Score:
### Best Model - Random Forest Model (with Hyperparameter Tuning)
- Accuracy: 0.82
- Precision, Recall, and F1-Score:
## Model Conclusion
The Random Forest model with hyperparameter tuning performed slightly better than the Logistic Regression model, achieving an accuracy of 82%. The precision, recall, and F1-score indicate that the model is reasonably good at predicting survival on the Titanic, with a higher precision for predicting non-survival (class 0) and a balanced performance for survival (class 1).
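To show how the per-class figures discussed above are typically obtained, here is a small sketch using scikit-learn's metric functions. The labels and predictions are made up for illustration and are not the project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions (0 = did not survive, 1 = survived).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Overall fraction of correct predictions.
accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, F1, and support, ordered as labels=[0, 1].
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1]
)

print(f"accuracy: {accuracy:.2f}")
print(classification_report(y_true, y_pred, target_names=["class 0", "class 1"]))
```

Reading the two classes separately matters here because a model can reach a similar accuracy while trading precision on one class for recall on the other.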
## Final Thoughts
The predictive models built using the Titanic dataset offer valuable insights into the factors that influenced survival on the Titanic. Analysis of the dataset revealed the following key points about survival:
- Gender: Women had a significantly higher survival rate compared to men. This is reflected in the model's feature importance, where gender (Sex) was one of the most influential factors.
- Passenger Class: Passengers in first class (Pclass = 1) had a higher survival rate compared to those in second and third classes. This indicates that socio-economic status played a crucial role in survival chances.
- Age: Younger passengers had a better chance of survival compared to older passengers. Children, in particular, had higher survival rates.
- Family Size: Passengers with fewer family members aboard (SibSp and Parch) tended to survive more often than those with larger families.
- Embarkation Point: Passengers who embarked from Cherbourg (Embarked = C) had a slightly higher survival rate compared to those who boarded at Queenstown or Southampton.
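Group-level comparisons like the ones above can be computed with a pandas `groupby`, as in this sketch. The rows are made up, so the resulting rates are not those of the real dataset. The helper `survival_rate_by` and the derived `FamilySize` column (taken here as `SibSp + Parch`) are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the real analysis would use the full dataset.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Sex": ["female", "female", "male", "male",
            "female", "male", "female", "male"],
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1],
    "SibSp": [0, 1, 0, 4, 0, 0, 3, 0],
    "Parch": [0, 0, 0, 2, 1, 0, 2, 0],
})

# Assumed definition of family size: siblings/spouses plus parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of Survived per group, i.e. the survival rate (hypothetical helper)."""
    return df.groupby(column)["Survived"].mean()


by_sex = survival_rate_by(toy, "Sex")
by_class = survival_rate_by(toy, "Pclass")
by_family = survival_rate_by(toy, "FamilySize")

print(by_sex)
print(by_class)
print(by_family)
```

Because `Survived` is coded 0/1, its mean within each group is exactly the fraction of passengers in that group who survived.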
In conclusion, the model successfully identifies the key determinants of survival, emphasizing the importance of gender, socio-economic status, age, and family size. These findings align with historical accounts and provide a comprehensive understanding of the factors that influenced survival during the Titanic disaster. The models developed in this project serve as an effective tool for predicting survival and demonstrate the potential of machine learning in analyzing historical datasets.