**Naive Bayes**

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# Load the dataset
file_path = '/Users/jean-paulhendriksen/Documents/Data Driven Decision Making in Business/DataMining/BankChurners.csv'
data = pd.read_csv(file_path)

# Drop irrelevant columns
data = data.drop(columns=['CLIENTNUM', 
                          'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
                          'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'])

# Define features (X) and target (y)
X = data.drop(['Attrition_Flag'], axis=1)
y = data['Attrition_Flag']

# Encode categorical features
label_encoders = {}
for column in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[column] = le.fit_transform(X[column])
    label_encoders[column] = le

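# Encode the target variable. LabelEncoder assigns codes in sorted label order;
# target_encoder.classes_ shows which label became 0 and which became 1.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)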

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to balance classes
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Create and train the Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train_balanced, y_train_balanced)

# Make predictions on the test set
y_pred = nb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)


Accuracy: 0.7655478775913129

Classification Report:
               precision    recall  f1-score   support

           0       0.38      0.72      0.50       327
           1       0.93      0.77      0.85      1699

    accuracy                           0.77      2026
   macro avg       0.66      0.75      0.67      2026
weighted avg       0.85      0.77      0.79      2026


Confusion Matrix:
 [[ 235   92]
 [ 383 1316]]




### 1. **Accuracy**:
   - **0.77** (76.55%) is the share of test cases the model classified correctly: 1,551 of 2,026 (235 + 1,316 on the diagonal of the confusion matrix).
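To put the accuracy figure in context, the sketch below recomputes it from the confusion matrix counts printed above and compares it with a hypothetical baseline that always predicts the larger class. The baseline is an assumption added for illustration; the notebook itself does not fit one.

```python
# Counts copied from the Naive Bayes confusion matrix printed above
# (rows = actual class, columns = predicted class).
conf_matrix = [[235, 92],
               [383, 1316]]

total = sum(sum(row) for row in conf_matrix)
correct = conf_matrix[0][0] + conf_matrix[1][1]
accuracy = correct / total

# Hypothetical baseline: always predict the class with the most test samples.
majority_baseline = max(sum(row) for row in conf_matrix) / total

print(f"accuracy                = {accuracy:.4f}")
print(f"majority-class baseline = {majority_baseline:.4f}")
```

On an imbalanced test set, comparing accuracy against this kind of baseline is one way to judge how informative the number is; the per-class metrics in the next section give a fuller picture.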

### 2. **Classification Report**:
   The classification report provides a more detailed look at the model's performance by evaluating precision, recall, and F1-score for each class.

   - **Class 0**:
     - **Precision**: 0.38 indicates that out of all instances the model predicted as class 0, only 38% were actually class 0.
     - **Recall**: 0.72 indicates that out of all the actual class 0 instances, 72% were correctly identified as class 0.
     - **F1-score**: 0.50 is the harmonic mean of precision and recall. It is low because precision is low: most instances predicted as class 0 (383 of 618) actually belong to class 1.

   - **Class 1**:
     - **Precision**: 0.93 shows that when the model predicts class 1, it is correct 93% of the time.
     - **Recall**: 0.77 indicates that 77% of actual class 1 instances were identified correctly.
     - **F1-score**: 0.85 is relatively high, showing the model performs better at identifying class 1 accurately.

   - **Macro Average**:
     - The macro average is the unweighted mean of the per-class precision, recall, and F1-score, so both classes count equally regardless of how many samples each has.
     - The **macro average F1-score of 0.67** sits well below the class 1 F1-score of 0.85 because the class 0 F1-score of 0.50 pulls it down.

   - **Weighted Average**:
     - The weighted average weights each class's metric by its support (327 samples for class 0, 1,699 for class 1), so it is dominated by class 1.
     - The **weighted F1-score of 0.79** is therefore closer to the class 1 F1-score (0.85) than to the class 0 F1-score (0.50).
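The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below is a minimal illustration in plain Python; `per_class_metrics` is a hypothetical helper written for this example, not part of scikit-learn.

```python
# Counts copied from the Naive Bayes confusion matrix
# (rows = actual class, columns = predicted class).
conf_matrix = [[235, 92],
               [383, 1316]]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_k

metrics = [per_class_metrics(conf_matrix, k) for k in range(2)]
total = sum(m[3] for m in metrics)

# Macro: plain mean of per-class F1. Weighted: mean weighted by support.
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
weighted_f1 = sum(m[2] * m[3] for m in metrics) / total

for k, (p, r, f1, support) in enumerate(metrics):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"macro F1={macro_f1:.2f}  weighted F1={weighted_f1:.2f}")
```

The only difference between the two averages is the weighting, which is why the weighted F1 lands closer to the majority class's score.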

### 3. **Confusion Matrix**:
   The confusion matrix breaks down actual vs. predicted values, showing where the model makes mistakes. Rows are actual classes and columns are predicted classes; the terms below treat class 1 as the positive class.

   - **True Negatives (235)**: Instances where the model correctly predicted class 0.
   - **False Positives (92)**: Instances where the model predicted class 1, but the actual class was 0.
   - **False Negatives (383)**: Instances where the model predicted class 0, but the actual class was 1.
   - **True Positives (1316)**: Instances where the model correctly predicted class 1.

   The largest error cell is the 383 false negatives: about 23% of the 1,699 actual class 1 instances are predicted as class 0, which lowers recall for class 1 to 0.77 and precision for class 0 to 0.38.
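The four cells can be unpacked by name and turned into error rates, which makes the two kinds of mistakes comparable despite the different class sizes. This is a small sketch using the counts above and treating class 1 as the positive class, as in the list; the variable names are chosen for illustration.

```python
# Naive Bayes confusion matrix (rows = actual class, columns = predicted class).
conf_matrix = [[235, 92],
               [383, 1316]]

# Row 0 holds the actual class 0 cases, row 1 the actual class 1 cases.
(tn, fp), (fn, tp) = conf_matrix

# Share of actual class 1 cases predicted as class 0 (1 - recall for class 1).
false_negative_rate = fn / (fn + tp)
# Share of actual class 0 cases predicted as class 1 (1 - recall for class 0).
false_positive_rate = fp / (fp + tn)

print(f"false negative rate = {false_negative_rate:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
```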




**Logistic Regression Model**

In [9]:
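from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test set.
# Separate variable names keep the unscaled data intact if the cell is re-run.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

# Increase max_iter and fit the Logistic Regression model
log_reg = LogisticRegression(max_iter=2000, random_state=42)
log_reg.fit(X_train_scaled, y_train_balanced)

# Make predictions on the test set
y_pred = log_reg.predict(X_test_scaled)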

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)


Accuracy: 0.8553800592300099

Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.75      0.63       327
           1       0.95      0.88      0.91      1699

    accuracy                           0.86      2026
   macro avg       0.74      0.81      0.77      2026
weighted avg       0.88      0.86      0.86      2026


Confusion Matrix:
 [[ 245   82]
 [ 211 1488]]



### 1. **Accuracy**:
   - **0.86** (85.54%) is the share of test cases the logistic regression model classified correctly: 1,733 of 2,026. This is up from 76.55% for the Naive Bayes model on the same test set.

### 2. **Classification Report**:
   The classification report evaluates the model’s performance across key metrics: precision, recall, and F1-score for each class.

   - **Class 0**:
     - **Precision**: 0.54 indicates that out of all instances predicted as class 0, only 54% were actually class 0.
     - **Recall**: 0.75 shows that out of all actual class 0 instances, the model correctly identified 75%.
     - **F1-score**: 0.63, a balance between precision and recall, suggests that while the model does fairly well at identifying class 0, it could improve its precision.

   - **Class 1**:
     - **Precision**: 0.95 means that when the model predicts class 1, it is correct 95% of the time.
     - **Recall**: 0.88 indicates that 88% of actual class 1 instances were correctly classified, a high recall that shows strong performance in identifying the majority class.
     - **F1-score**: 0.91 is high, confirming that the model performs well on class 1 with balanced precision and recall.

   - **Macro Average**:
     - The macro average gives an unweighted average of precision, recall, and F1-score across both classes.
     - The **macro F1-score of 0.77** is the mean of the class 0 (0.63) and class 1 (0.91) F1-scores. It is lower than the weighted average because class 0, where the model is weaker, counts as much as class 1.

   - **Weighted Average**:
     - The weighted average weights each class's metric by its support, so it mostly reflects class 1 (1,699 of 2,026 samples).
     - The **weighted F1-score of 0.86** indicates strong overall performance, skewed toward class 1 (the majority class).
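The gap between the macro and weighted averages follows directly from the per-class F1-scores and supports in the report. A minimal sketch, using the rounded values printed above (so the results match the report only to two decimals):

```python
# Per-class F1-scores and supports, as printed in the logistic regression report.
f1_scores = {0: 0.63, 1: 0.91}
support = {0: 327, 1: 1699}

total = sum(support.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1_scores[k] * support[k] for k in f1_scores) / total

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```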

### 3. **Confusion Matrix**:
   The confusion matrix compares the model's predictions with the actual labels. Rows are actual classes and columns are predicted classes; the terms below treat class 1 as the positive class.

   - **True Negatives (245)**: Instances where the model correctly predicted class 0.
   - **False Positives (82)**: Instances where the model predicted class 1, but the actual class was 0.
   - **False Negatives (211)**: Instances where the model predicted class 0, but the actual class was 1.
   - **True Positives (1488)**: Instances where the model correctly predicted class 1.

   Compared with the Naive Bayes model, false negatives drop from 383 to 211, so logistic regression identifies more class 1 instances correctly. False positives fall only slightly (92 to 82): about a quarter of the 327 actual class 0 instances are still predicted as class 1, leaving room for improvement in distinguishing between the two classes.
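The comparison between the two models can be laid out side by side from their confusion matrices. The sketch below copies the counts from the two outputs above; the `models` and `summary` dictionaries are illustrative names, not objects from the notebook.

```python
# Confusion matrices copied from the two model outputs
# (rows = actual class, columns = predicted class).
models = {
    "Naive Bayes": [[235, 92], [383, 1316]],
    "Logistic Regression": [[245, 82], [211, 1488]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in models.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "accuracy": (tn + tp) / total,
        "false_negatives": fn,  # actual class 1 predicted as class 0
        "false_positives": fp,  # actual class 0 predicted as class 1
    }

for name, stats in summary.items():
    print(f"{name:20s} accuracy={stats['accuracy']:.4f} "
          f"FN={stats['false_negatives']} FP={stats['false_positives']}")
```

Most of the accuracy gain comes from the drop in false negatives; the false positive count changes far less between the two models.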

