KNN Classification

This project uses the K-Nearest Neighbors (KNN) algorithm to classify passengers of the Titanic as survivors or non-survivors. The workflow covers an initial exploratory analysis, the application of KNN, validation techniques, and finally hyperparameter optimization.

1. Introduction

This project conducts an exploratory analysis of the Titanic dataset and uses the KNN (K-Nearest Neighbors) model to classify the survivors of the shipwreck. The widely used Titanic dataset is available from Kaggle, Stanford, and the Department of Biostatistics. Although it is a simple dataset, designed for study and focused on a single question, it supports advanced data analysis techniques and yields insights in much the same way as more complex and challenging datasets.

2. Data Understanding

The Titanic dataset, available on Kaggle, is divided into train and test files. The train file has 891 rows and 12 columns (11 features plus the target, Survived), while the test file has 418 rows and 11 columns. The data is a mix of numeric and categorical types. There are no duplicate rows, but in the train file Age, Cabin, and Embarked contain missing values (866 in total, see section 4).

The data dictionary for the train file is as follows:

Attribute	Definition	Data Type
PassengerId	Passenger ID	int64
Survived	Survival status (0 = No, 1 = Yes)	int64
Pclass	Ticket class	int64
Name	Passenger's name	object
Sex	Gender	object
Age	Age of the passenger	float64
SibSp	Number of siblings/spouses aboard	int64
Parch	Number of parents/children aboard	int64
Ticket	Ticket number	object
Fare	Passenger fare	float64
Cabin	Cabin number	object
Embarked	Port of embarkation (C, Q, S)	object

The data dictionary for the test file is as follows:

Attribute	Definition	Data Type
PassengerId	Passenger ID	int64
Pclass	Ticket class	int64
Name	Passenger's name	object
Sex	Gender	object
Age	Age of the passenger	float64
SibSp	Number of siblings/spouses aboard	int64
Parch	Number of parents/children aboard	int64
Ticket	Ticket number	object
Fare	Passenger fare	float64
Cabin	Cabin number	object
Embarked	Port of embarkation (C, Q, S)	object

3. Action Plan

3.1. Objective

After completing an exploratory analysis and KNN implementation on the Wine dataset, I decided to deepen my use of this model by incorporating more sophisticated techniques. This repository explores the capabilities of the KNN algorithm, focusing on optimization and tuning to solve the Titanic problem.

3.2. Tools and Frameworks

Scope of tools used in the project:

Python 3.11.5
Jupyter Notebook
Git & GitHub
Kaggle
Pandas
NumPy
SciPy
scikit-learn
Machine Learning Classification Models
- K-Nearest Neighbors
- Gradient Boosting Classifier

4. Exploratory Data Analysis

4.1. General Train Dataset

Features: 12 (including the target)
Rows: 891
Numeric data types: float64, int64
Categorical data types: object
Missing values total: 866
No duplicate values
Target contains 2 options: 0, 1 (int64)
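
The summary above can be reproduced with a few pandas calls. The sketch below is only an illustration: it runs on a tiny made-up frame with Titanic-style columns (not the real data), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with Titanic-style columns (NOT the real dataset).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 2],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", np.nan],
})

n_rows, n_cols = df.shape
# Numeric columns (int64 / float64) versus everything else (text columns).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
# Total number of missing cells and number of fully duplicated rows.
missing_total = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())
# Distinct values of the target column.
target_values = sorted(df["Survived"].unique().tolist())

print(n_rows, n_cols, numeric_cols, categorical_cols)
print(missing_total, n_duplicates, target_values)
```

On the real train file, the same calls would be applied to the frame loaded from the Kaggle CSV.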

4.2. Missing values

Age: 19.87% missing. Missing values will be filled with the median age.
Cabin: 77.11% missing, which would normally make it a strong candidate for removal. Instead, the values will be grouped and converted to numbers.
Embarked: only 0.23% missing. The affected rows will be removed.
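
A minimal sketch of these three treatments, assuming that "removed" for Embarked means dropping the affected rows. The data is made up for illustration. The "N" placeholder for Cabin comes from section 4.5.

```python
import numpy as np
import pandas as pd

# Made-up sample, not the real Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, np.nan],
    "Cabin": ["C85", np.nan, np.nan, "E46", np.nan],
    "Embarked": ["S", "C", np.nan, "Q", "S"],
})

# Percentage of missing values per column.
missing_pct = (df.isna().mean() * 100).round(2)

# Age: fill gaps with the median of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].median())
# Cabin: keep the column and mark unknown cabins with "N".
df["Cabin"] = df["Cabin"].fillna("N")
# Embarked: drop the few rows with no port of embarkation (assumption).
df = df.dropna(subset=["Embarked"]).reset_index(drop=True)

print(missing_pct)
print(df)
```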

4.3. Correlation

The highest correlation in the dataset is between the features Parch and SibSp. Both features describe family relationships, which plausibly explains the link.
The lowest (most negative) correlation in the dataset is between the features Pclass and Fare. This can look strange at first, but it follows from the encoding: Pclass uses 1 for first class and 3 for third class, so higher class numbers go with lower fares.
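
One way to find the highest and lowest correlated pairs is to compute the Pearson correlation matrix and scan each pair of columns once. The sketch below uses made-up values chosen to mimic the pattern described above. The real coefficients come from the notebook, not from this example.

```python
import pandas as pd

# Made-up values that mimic the described pattern (NOT the real data).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare": [80.0, 70.0, 26.0, 13.0, 8.0, 7.9, 7.2, 31.0],
    "SibSp": [1, 0, 1, 0, 0, 0, 1, 4],
    "Parch": [0, 0, 1, 0, 0, 0, 1, 2],
})

corr = df.corr(numeric_only=True)  # Pearson correlation by default
cols = list(corr.columns)

# Visit each unordered pair of distinct columns exactly once.
pairs = {
    (a, b): corr.loc[a, b]
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
}
highest_pair = max(pairs, key=pairs.get)
lowest_pair = min(pairs, key=pairs.get)

print("highest:", highest_pair, round(pairs[highest_pair], 3))
print("lowest:", lowest_pair, round(pairs[lowest_pair], 3))
```

The negative sign for Pclass and Fare reflects the encoding of Pclass, where 1 is the most expensive class.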

4.4. Outliers

There are outliers in 4 features of the dataset. I usually remove them, especially when the models used are sensitive to them. The percentage of outliers per feature is:
- Parch = 23.91%
- Fare = 13.02%
- SibSp = 5.16%
- Age = 1.23%
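
The text does not say how outliers were detected. A common choice is the 1.5 x IQR rule, sketched below as an assumption. The helper name `iqr_outlier_pct` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_pct(series: pd.Series, k: float = 1.5) -> float:
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return round(float(100 * is_outlier.mean()), 2)

# Made-up samples. When most values are 0 (as is typical for a count such as
# Parch), the IQR is 0 and every non-zero value is flagged.
parch = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 0])
age = pd.Series([20, 22, 24, 26, 28, 30, 32, 34, 36, 80])

parch_pct = iqr_outlier_pct(parch)
age_pct = iqr_outlier_pct(age)
print(parch_pct, age_pct)
```

This also shows why a count feature dominated by zeros can report a large share of outliers under this rule.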

4.5. Create Features

Sex: Categories will be mapped to numerical values.
Cabin: Values will be grouped by their first letter and mapped to numerical values. Missing values will be set to "N".
Name: The only apparent use is to extract the titles and map them to numerical values.
Embarked: Categories will be mapped to numerical values.
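
The transformations listed in this section could be sketched as follows. The sample rows, the new column names (`Deck`, `Title`, and the `_code` suffix), the regular expression, and the specific integer codes are all assumptions for illustration. The notebook may encode the categories differently.

```python
import numpy as np
import pandas as pd

# Made-up rows in the style of the Titanic data.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# Sex: map the two categories to integers (the 0/1 assignment is arbitrary).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Cabin: keep only the first letter; unknown cabins become "N".
df["Deck"] = df["Cabin"].fillna("N").str[0]

# Name: extract the title between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Turn each text category into an integer code (sorted alphabetically).
for col in ["Deck", "Title", "Embarked"]:
    df[col + "_code"] = pd.factorize(df[col], sort=True)[0]

print(df[["Sex", "Deck", "Deck_code", "Title", "Title_code", "Embarked_code"]])
```

Integer codes impose an artificial order on the categories, which affects distance-based models such as KNN. One-hot encoding is a common alternative.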

4.6. Drop Features

Ticket: The values are highly varied and dispersed. I don't see much use for them, although they could perhaps be grouped in some way and checked for a relationship with the fare.

4.7. Normalization and Standardization

The decisive factor in rescaling the data is the exclusive use of the KNN model. Because it is distance-based, it is extremely sensitive to features on different scales and to outliers, and it works best with rescaled data.
Rescaling is needed because the values in the dataset range from 0 to 512.33.
Most numerical features do not follow a normal distribution. Standardization (zero mean, unit variance) would also put the features on a common scale, but it does not bound them to a fixed range.
To get the most out of the KNN model, I will use normalization, which rescales each feature to a fixed range such as 0 to 1.
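
To make the difference concrete, the sketch below applies both rescalings to a small made-up Age/Fare matrix. It assumes that "normalization" means min-max scaling (scikit-learn's `MinMaxScaler`) and "standardization" means z-scoring (`StandardScaler`). The text does not name the exact classes used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up [Age, Fare] rows; Fare spans 0 to 512.33 as in the text.
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 0.0],
    [35.0, 512.33],
])

# Without rescaling, the Euclidean distance is dominated by Fare.
raw_distance = np.linalg.norm(X[0] - X[3])

# Normalization: each column is mapped to the range [0, 1].
normalized = MinMaxScaler().fit_transform(X)
# Standardization: each column gets mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

scaled_distance = np.linalg.norm(normalized[0] - normalized[3])
print(round(raw_distance, 2), round(scaled_distance, 2))
```

In a real workflow, the scaler should be fitted on the training split only and then applied to the test split, to avoid leaking information.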

5. Machine Learning Exploration

5.1. With Outliers Dataset

On the test dataset, model performance seems to decrease slightly as the number of neighbors increases.
The F1 score, which balances precision and recall, ranges from 0.774292 to 0.795999 on the train dataset and from 0.825528 to 0.866613 on the test dataset.
On the training dataset, the scores are higher, indicating potential overfitting, especially with a smaller number of neighbors.

5.2. Without Outliers Dataset

The model performances on the test dataset are generally better compared to the dataset with outliers.
The F1 score, which balances precision and recall, ranges from 0.827385 to 0.828747 on the train dataset and from 0.816357 to 0.841127 on the test dataset.
On the training dataset, the scores are still relatively high, suggesting good performance, and they are closer to the test scores compared to the dataset with outliers.

5.3. Neighbor Selection

It seems that using 5 neighbors gives good results for both datasets (e.g., 0.789238 to 0.826667 for the test dataset).
Selecting an appropriate number of neighbors is crucial; too few neighbors might lead to overfitting, while too many might result in underfitting.
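
A sweep over the number of neighbors could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features. The split ratio, the list of k values, and the use of `MinMaxScaler` are assumptions, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = {
        "train": f1_score(y_train, knn.predict(X_train_s)),
        "test": f1_score(y_test, knn.predict(X_test_s)),
    }

for k, s in scores.items():
    print(f"k={k:2d}  train F1={s['train']:.3f}  test F1={s['test']:.3f}")
```

With k = 1, every training point is its own nearest neighbor, so the training F1 is perfect. A large gap between train and test scores at small k is the overfitting signal described above.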

5.4. Effect of Outliers

With outliers removed, using 5 neighbors seems to give good results (e.g., 0.833809 to 0.838452 for the test dataset).

6. Submissions Results

After optimizing the model with GridSearchCV, the best KNN submission results were:

Model	Precision (%)	Recall (%)	F1-Score (%)	Accuracy (%)	Mean - CV (%)	Std - CV	Range - CV (%)	Kaggle Result (0-1)
KNN - Without Outliers	78.34	69.10	73.43	85.12	81.28	6.50	[74.78, 87.77]	0.76555
KNN - With Outliers	83.54	78.65	81.02	85.86	81.71	5.04	[76.67, 86.75]	0.74880
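
The hyperparameter search mentioned in this section can be sketched with scikit-learn's `GridSearchCV`. The parameter grid, the 5-fold setting, the accuracy scoring, and the synthetic data below are all assumptions. The grid actually used in the notebook is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared Titanic features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid: neighbor count, vote weighting, Minkowski power (p).
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print(best_params, round(best_cv_accuracy, 4))
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data with the best parameters, ready to predict on the test file.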

7. Model Comparison

To put the model's performance in context, I compared the KNN results with those of the Gradient Boosting Classifier (GBC) model.

Model	Precision (%)	Recall (%)	F1-Score (%)	Accuracy (%)	Mean - CV (%)	Std - CV	Range - CV (%)	Kaggle Result (0-1)
KNN - Without Outliers	78.34	69.10	73.43	85.12	81.28	6.50	[74.78, 87.77]	0.76555
KNN - With Outliers	83.54	78.65	81.02	85.86	81.71	5.04	[76.67, 86.75]	0.74880
GBC - Without Outliers	88.24	75.84	81.57	89.80	81.26	4.43	[76.83, 85.69]	0.76076
GBC - With Outliers	95.58	88.60	91.96	94.05	83.51	5.24	[78.27, 88.75]	0.72727
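
The cross-validation columns in the table (mean, standard deviation, and a range of mean minus and plus one standard deviation) could be produced along the following lines. The fold count, the model settings, the helper name `cv_summary`, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared Titanic features and target.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "GBC": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

def cv_summary(model, X, y, cv=5):
    """Mean, std, and [mean - std, mean + std] of CV accuracy, in percent."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy") * 100
    mean, std = float(scores.mean()), float(scores.std())
    return {
        "mean": round(mean, 2),
        "std": round(std, 2),
        "range": [round(mean - std, 2), round(mean + std, 2)],
    }

summaries = {name: cv_summary(model, X, y) for name, model in models.items()}
for name, s in summaries.items():
    print(name, s)
```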

8. Conclusion

Although the Gradient Boosting Classifier (GBC) is often considered more robust than K-Nearest Neighbors (KNN), in this specific context KNN scored slightly higher on Kaggle (0.76555 vs. 0.76076). This result may be attributed to carefully adjusted transformations of the original dataset, optimized specifically for the KNN algorithm.

It is worth noting that, among the evaluated models, only KNN without outliers produced a Kaggle result (0.76555) that falls inside its cross-validation range ([74.78, 87.77], in percent). This emphasizes the importance of data transformations specific to KNN. With outliers included, the KNN Kaggle result (0.74880) fell below its cross-validation range ([76.67, 86.75]), which supports the idea that outliers can significantly affect KNN performance and may explain the observed discrepancy.
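
The consistency check described here is simple arithmetic on the comparison table in section 7: convert each Kaggle result to a percentage and test whether it lies inside the cross-validation range. The numbers below are copied from that table. The variable names are illustrative.

```python
# (CV range low, CV range high, Kaggle result) taken from the table in section 7.
results = {
    "KNN - Without Outliers": (74.78, 87.77, 0.76555),
    "KNN - With Outliers": (76.67, 86.75, 0.74880),
    "GBC - Without Outliers": (76.83, 85.69, 0.76076),
    "GBC - With Outliers": (78.27, 88.75, 0.72727),
}

within_cv_range = {}
for model, (low, high, kaggle) in results.items():
    kaggle_pct = kaggle * 100  # the CV range is expressed in percent
    within_cv_range[model] = low <= kaggle_pct <= high

for model, ok in within_cv_range.items():
    print(f"{model}: {'inside' if ok else 'outside'} CV range")
```

Only the first entry passes the check. The other three Kaggle results fall below the lower bound of their ranges.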

Another possible explanation for KNN outperforming GBC on Kaggle is overfitting in GBC. Its Precision, Recall, F1-Score, and Accuracy were all higher than those of KNN, yet its Kaggle results were lower, which suggests that GBC may not have generalized well to data beyond that used in training.

In conclusion, while GBC is recognized as a powerful model, the highlighted performance of KNN on the Kaggle platform underscores the importance of careful data preprocessing and consideration of specific dataset characteristics to achieve optimized model performance.
