## Naive Bayes on the Kaggle Titanic Dataset (Jemia Johnson's Part)

Link to dataset: https://www.kaggle.com/c/titanic/data

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

In [14]:
# Load the training data
train_df = pd.read_csv('data/titanic/train.csv')

# Load the testing data
test_df = pd.read_csv('data/titanic/test.csv')

# Display the first few rows of each dataset to verify they loaded correctly
print("Training Data:")
print(train_df.head())

print("\nTesting Data:")
print(test_df.head())

Training Data:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   Na

In [15]:
# Check for missing values in the training data
print(train_df.isnull().sum())

# Check for missing values in the testing data
print(test_df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


In [16]:
# Fill missing values in the 'Age' column with the median age
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

# Fill missing values in the 'Embarked' column with the most frequent value
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])

# Fill missing values in the 'Fare' column in the test set with the median fare
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Fill missing values in the 'Cabin' column with 'Unknown'
train_df['Cabin'] = train_df['Cabin'].fillna('Unknown')
test_df['Cabin'] = test_df['Cabin'].fillna('Unknown')

# Verify there are no more missing values
print(train_df.isnull().sum())
print(test_df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64


In [17]:
# Select relevant features and the target variable
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'

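# Encode categorical variables as integers.
# The encoder is refit on each training column, and the same mapping is then
# applied to the matching test column so both sets share one encoding.
le = LabelEncoder()
for feature in ['Sex', 'Embarked']:
    train_df[feature] = le.fit_transform(train_df[feature])
    test_df[feature] = le.transform(test_df[feature])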

# Split the training data into training and validation sets
X = train_df[features]
y = train_df[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = nb.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))

# Make predictions on the test set
test_pred = nb.predict(test_df[features])

# Create a DataFrame with the results
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_pred
})

# Save the results to a CSV file
submission.to_csv('submission.csv', index=False)

Accuracy: 0.776536312849162
Precision: 0.7125
Recall: 0.7702702702702703
F1 Score: 0.7402597402597403
Confusion Matrix:
 [[82 23]
 [17 57]]
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.78      0.80       105
           1       0.71      0.77      0.74        74

    accuracy                           0.78       179
   macro avg       0.77      0.78      0.77       179
weighted avg       0.78      0.78      0.78       179



### Detailed Report

#### Introduction

This report evaluates the performance of a Gaussian Naive Bayes model trained on the Titanic dataset to predict whether a passenger survived. The model was fit on 80% of the training data and evaluated on a held-out validation set comprising the remaining 20% (179 passengers). Performance is summarized with accuracy, precision, recall, and F1-score, alongside a confusion matrix.

#### Accuracy

The accuracy of the model is 77.65%: 139 of the 179 predictions on the validation set are correct. While accuracy provides a general sense of the model's performance, it does not distinguish between false positives and false negatives.

#### Precision

Precision for the positive class (survived) is 71.25%. When the model predicts that a passenger survived, it is correct 71.25% of the time (57 of 80 such predictions). Precision matters most in scenarios where false positives are costly.

#### Recall

Recall for the positive class (survived) is 77.03%. This metric measures the model's ability to find the passengers who actually survived: it identified 57 of the 74 survivors in the validation set.

#### F1 Score

The F1 score is the harmonic mean of precision and recall. For the positive class, the F1 score is 74.03%. Because it is high only when both precision and recall are high, it is especially useful when the classes are imbalanced and accuracy alone can mislead.

#### Confusion Matrix

The confusion matrix provides a detailed breakdown of the model's predictions:
- True Negatives (TN): 82
- False Positives (FP): 23
- False Negatives (FN): 17
- True Positives (TP): 57

From the confusion matrix, we observe that:
- The model correctly identified 82 passengers who did not survive (true negatives).
- The model incorrectly predicted that 23 passengers survived when they did not (false positives).
- The model missed 17 passengers who survived (false negatives).
- The model correctly identified 57 passengers who survived (true positives).
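
The headline metrics reported earlier follow directly from these four counts. As a minimal sketch, assuming only the counts listed above (the variable names are illustrative and not taken from the notebook), they can be recomputed by hand:

```python
# Counts taken from the validation confusion matrix above.
tn, fp, fn, tp = 82, 23, 17, 57

total = tn + fp + fn + tp                 # 179 validation passengers
accuracy = (tp + tn) / total              # share of all predictions that are correct
precision = tp / (tp + fp)                # share of predicted survivors who survived
recall = tp / (tp + fn)                   # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")   # 0.7765
print(f"Precision: {precision:.4f}")  # 0.7125
print(f"Recall:    {recall:.4f}")     # 0.7703
print(f"F1 Score:  {f1:.4f}")         # 0.7403
```

The results agree with the values printed by `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` in the notebook.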

#### Classification Report

The classification report provides precision, recall, and F1-score for each class (0: Did not survive, 1: Survived):
- Class 0 (Did not survive): Precision = 83%, Recall = 78%, F1-score = 80%
- Class 1 (Survived): Precision = 71%, Recall = 77%, F1-score = 74%

The weighted averages weight each class's score by its support (105 non-survivors and 74 survivors), so they account for class imbalance, whereas the macro averages treat both classes equally.
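
To make the difference between the two averages concrete, here is a small sketch for precision only. It assumes the per-class counts from the confusion matrix and the supports shown in the classification report; the dictionary names are illustrative.

```python
# Support (number of true instances) per class, from the classification report.
support = {0: 105, 1: 74}

# Per-class precision from the confusion matrix: correct predictions of the
# class divided by all predictions of that class.
precision = {0: 82 / (82 + 17), 1: 57 / (57 + 23)}

# Macro average: plain mean over classes, ignoring class sizes.
macro = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted = sum(precision[c] * support[c] for c in support) / sum(support.values())

print(f"macro avg precision:    {macro:.2f}")     # 0.77
print(f"weighted avg precision: {weighted:.2f}")  # 0.78
```

Both values match the `macro avg` and `weighted avg` rows of the report; the same calculation applies to recall and F1-score.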

### Conclusion

The Gaussian Naive Bayes model shows reasonable performance with an overall accuracy of 77.65%. Precision (71.25%) and recall (77.03%) for the survived class are reasonably balanced, giving an F1 score of 74.03%. These figures come from a single 80/20 split, so they should be read as an estimate rather than a guarantee of performance on unseen data.

To further improve the model, additional feature engineering, hyperparameter tuning, or employing more sophisticated models could be explored. Nonetheless, the current model provides a solid foundation for predicting passenger survival on the Titanic dataset.

### Recommendations

1. **Feature Engineering**: Explore additional features or interactions between existing features to provide the model with more relevant information.
2. **Hyperparameter Tuning**: Although Naive Bayes has few hyperparameters, tuning `var_smoothing` in `GaussianNB` or trying other variants (e.g., Bernoulli Naive Bayes for binary features) may yield better results.
3. **Ensemble Methods**: Combining the predictions of multiple models (e.g., using a voting classifier) could improve overall performance.
4. **Cross-Validation**: Evaluating the model with k-fold cross-validation gives more reliable performance estimates than a single train/validation split and helps detect overfitting to one particular split.
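
As a rough sketch of how the tuning and cross-validation recommendations could look in practice, the example below runs 5-fold stratified cross-validation and a small grid search over `var_smoothing`, the smoothing parameter of `GaussianNB`. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic (generated with `make_classification` so the block runs on its own), and the fold count and parameter grid are arbitrary choices. With the real data, `X` and `y` would be the preprocessed Titanic features and the `Survived` column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed features (assumption: 7 numeric columns).
X, y = make_classification(n_samples=891, n_features=7, n_informative=4,
                           random_state=42)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated accuracy of the default model: one score per fold.
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:  ", round(scores.mean(), 3))

# Grid search over var_smoothing, scored with the same folds.
param_grid = {"var_smoothing": np.logspace(-9, -1, 5)}
search = GridSearchCV(GaussianNB(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print("Best var_smoothing:", search.best_params_["var_smoothing"])
print("Best CV accuracy:  ", round(search.best_score_, 3))
```

Reporting the mean and spread of the fold scores, rather than one validation number, makes it easier to judge whether a change to the model is a real improvement.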
