### Model Summary: Customer Churn Prediction Using Decision Tree with Oversampling and PCA

**Objective**: The goal of this project is to predict customer churn using a decision tree model that has been enhanced by oversampling the minority class (churned customers) and applying PCA for dimensionality reduction. The model aims to achieve better accuracy, precision, recall, and other classification metrics by addressing the class imbalance issue.

---

### **1. Data Preprocessing**

**Dataset Overview**:
- The dataset contains various features related to customer interactions, service usage, and customer service calls.
- **Target Variable**: `churn` (0 = no churn, 1 = churn)

**Key Features**:
- Customer account length, area code, international plan, voicemail plan, total day/evening/night calls and minutes, international calls and charges, number of customer service calls, etc.

**Handling Imbalanced Data**:
- **Class Distribution** (Before): The dataset was imbalanced, with significantly more non-churn (0) customers than churn (1) customers.
- **Oversampling**: Applied Synthetic Minority Oversampling (SMOTE) technique to balance the number of churned (1) and non-churned (0) customers. The class distribution after oversampling:
  - **Churn (1)**: 3652 instances
  - **Non-Churn (0)**: 3652 instances

---

### **2. Dimensionality Reduction (PCA)**

To improve computational efficiency and reduce the risk of overfitting, **Principal Component Analysis (PCA)** was applied to transform the feature space. This step retained the most important components, reducing the dimensionality of the dataset while preserving key variance.

- **PCA Benefits**: Reduced the complexity of the feature set, allowing the decision tree model to train more effectively and generalize better.

---

### **3. Model Selection: Decision Tree**

**Model Type**: Decision Tree Classifier
- A decision tree was chosen due to its interpretability and ability to handle both categorical and numerical features.
  
**Tuning**: Hyperparameter tuning was performed on the decision tree model to improve its performance. Parameters such as `max_depth`, `min_samples_split`, and `min_samples_leaf` were optimized.

---

### **4. Model Performance Evaluation**

After training the decision tree model with the oversampled data and applying PCA, the following metrics were evaluated:

- **Accuracy**: Overall ability of the model to correctly classify both churn and non-churn cases.
- **Precision**: The proportion of positive identifications (churn) that were actually correct.
- **Recall**: The ability of the model to find all relevant churn cases.
- **F1-Score**: The harmonic mean of precision and recall, providing a single metric for both.
- **ROC-AUC**: Assessed the model's ability to distinguish between churn and non-churn cases.

**Performance Results**:
- After handling the class imbalance and reducing dimensionality, the decision tree performed well, with a balanced evaluation across key metrics.

---

### **5. Deployment Strategy**

The decision tree model with **oversampling and PCA** will be deployed in a production environment. This model will be able to predict customer churn with a high level of accuracy and balance between precision and recall. 

**Next Steps**:
- **Monitoring**: Regularly monitor the model's performance post-deployment to ensure that it remains accurate as new data arrives.
- **Retraining**: Periodically retrain the model with updated data, especially if class distribution changes over time.
- **Scaling**: Consider scaling the deployment to handle larger datasets or integrating it into a broader customer relationship management (CRM) system.

---

### Conclusion:

This decision tree model with oversampling and PCA is a well-tuned, balanced approach to predicting customer churn, and it is ready for deployment. It provides robust predictions and helps businesses focus on retaining customers who are at high risk of churning.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
                            accuracy_score,
                            precision_score,
                            recall_score,
                            f1_score,
                            roc_auc_score,
                            confusion_matrix,
                            classification_report,
                            roc_curve)

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
train["churn"].value_counts()

no     3652
yes     598
Name: churn, dtype: int64

In [4]:
# check missing values
train.isna().sum()

state                            0
account_length                   0
area_code                        0
international_plan               0
voice_mail_plan                  0
number_vmail_messages            0
total_day_minutes                0
total_day_calls                  0
total_day_charge                 0
total_eve_minutes                0
total_eve_calls                  0
total_eve_charge                 0
total_night_minutes              0
total_night_calls                0
total_night_charge               0
total_intl_minutes               0
total_intl_calls                 0
total_intl_charge                0
number_customer_service_calls    0
churn                            0
dtype: int64

In [5]:
# check duplicates
train.duplicated().sum()

0

# Train the model with Imbalanced Data

In [6]:
# transform all the category column 
le = LabelEncoder()
train["state"]= le.fit_transform(train["state"])
train=pd.get_dummies(train, columns=["area_code", "international_plan","voice_mail_plan" ])

In [7]:
# split the data into feature and target
X = train.copy()
y = X.pop("churn")
# transform the label data
rel = {"no":0, "yes":1}
y=y.map(rel)

In [8]:
# split the data into into train and val set
train_x, val_x, train_y, val_y = train_test_split(X, y, random_state=42, test_size=0.2)

In [9]:
# training the model
from sklearn.tree import DecisionTreeClassifier
dc = DecisionTreeClassifier()
model_1 = dc.fit(train_x, train_y)

In [10]:
# Get predicted values
y_pred = model_1.predict(val_x)

# Get predicted probabilities
y_proba = model_1.predict_proba(val_x)[:, 1]  # For binary classification, get probabilities for the positive class

In [11]:
# Calculate metrics
accuracy = accuracy_score(val_y, y_pred)
precision = precision_score(val_y, y_pred)
recall = recall_score(val_y, y_pred)
f1 = f1_score(val_y, y_pred)
roc_auc = roc_auc_score(val_y, y_proba)  # Use predicted probabilities for ROC-AUC

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

Accuracy: 0.8952941176470588
Precision: 0.6408450704225352
Recall: 0.7054263565891473
F1 Score: 0.6715867158671587
ROC-AUC Score: 0.8173456332182907


In [12]:
# Confusion matrix
cm = confusion_matrix(val_y, y_pred)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[670  51]
 [ 38  91]]


In [13]:
# Classification report
report = classification_report(val_y, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94       721
           1       0.64      0.71      0.67       129

    accuracy                           0.90       850
   macro avg       0.79      0.82      0.80       850
weighted avg       0.90      0.90      0.90       850



In [89]:
# ROC curve
fpr, tpr, thresholds = roc_curve(val_y, y_proba)




## Here’s a brief explanation of each metric:

- **accuracy_score**: Calculates the accuracy of the model, which is the ratio of correctly predicted instances to the total number of instances.

- **precision_score**: Measures the precision of the model, which is the ratio of true positive predictions to the sum of true positive and false positive predictions.

- **recall_score**: Measures the recall of the model, which is the ratio of true positive predictions to the sum of true positive and false negative predictions.

- **f1_score**: The harmonic mean of precision and recall. It balances the two metrics, especially useful when dealing with imbalanced classes.

- **roc_auc_score**: Measures the area under the Receiver Operating Characteristic (ROC) curve, which evaluates the model’s ability to distinguish between positive and negative classes.

- **confusion_matrix**: Provides a matrix showing the true positive, false positive, true negative, and false negative counts.

- **classification_report**: Summarizes precision, recall, F1-score, and support for each class.

- **roc_curve**: Computes the ROC curve, which plots the true positive rate against the false positive rate at various thresholds.



# Train the model with Balanced Data After Oversampling

In [92]:
from sklearn.utils import resample

majority_class=train.query("churn=='no'")
len_majority_class= len(majority_class)

minority_class=train.query("churn=='yes'")
len_minority_class= len(minority_class)


over_sample_set=resample(minority_class, n_samples=len_majority_class, random_state=42)
train_=pd.concat([majority_class, over_sample_set], ignore_index=True)

In [99]:
# split the data into feature and target
X = train_.copy()
y = X.pop("churn")
rel = {"no":0, "yes":1}
y=y.map(rel)

In [100]:
y.value_counts()

0    3652
1    3652
Name: churn, dtype: int64

### Summary: Dataset After Oversampling

After applying oversampling, the class distribution in the dataset has become perfectly balanced:

- **Class 0 (Non-churn)**: 3652 instances
- **Class 1 (Churn)**: 3652 instances

This balance ensures that the model is exposed to an equal number of samples from both classes, which helps in preventing bias towards the majority class (class `0` in this case). The oversampling technique has replicated samples from the minority class (class `1`) to match the number of instances in the majority class, resulting in:

- **Balanced Class Distribution**: Both classes now have an equal number of 3652 instances, which improves the model’s ability to learn from both classes effectively.

### Key Benefits

1. **Improved Minority Class Performance**: With the oversampling of class `1`, the model will be better able to detect and correctly classify instances of churn (class `1`), which was previously underrepresented.
   
2. **Reduced Bias**: Models trained on imbalanced datasets tend to favor the majority class. By balancing the dataset, you reduce this bias, ensuring the model gives equal consideration to both classes during training.

3. **Better Metrics**: Metrics such as **precision**, **recall**, and **F1-score** for the minority class (churn) should improve as the model is now better trained to recognize churn cases without being dominated by the majority class.



In [106]:
# split the data into into train and val set
train_x, val_x, train_y, val_y = train_test_split(X, y, random_state=42, test_size=0.2)

In [107]:
# training the model
from sklearn.tree import DecisionTreeClassifier
dc = DecisionTreeClassifier()
model_2 = dc.fit(train_x, train_y)

In [108]:
# Get predicted values
y_pred = model_2.predict(val_x)

# Get predicted probabilities
y_proba = model_2.predict_proba(val_x)[:, 1]  # For binary classification, get probabilities for the positive class

In [109]:
# Calculate metrics
accuracy = accuracy_score(val_y, y_pred)
precision = precision_score(val_y, y_pred)
recall = recall_score(val_y, y_pred)
f1 = f1_score(val_y, y_pred)
roc_auc = roc_auc_score(val_y, y_proba)  # Use predicted probabilities for ROC-AUC

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

Accuracy: 0.9650924024640657
Precision: 0.9365918097754293
Recall: 0.9957865168539326
F1 Score: 0.9652825051055139
ROC-AUC Score: 0.9658505347954577


In [110]:
# Confusion matrix
cm = confusion_matrix(val_y, y_pred)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[701  48]
 [  3 709]]


In [111]:
# Classification report
report = classification_report(val_y, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.94      0.96       749
           1       0.94      1.00      0.97       712

    accuracy                           0.97      1461
   macro avg       0.97      0.97      0.97      1461
weighted avg       0.97      0.97      0.97      1461



# Train the Model with Balanced Data and on Feature Selection

In [112]:
from sklearn.decomposition import PCA
anal = PCA(n_components=15)
new_col = anal.fit_transform(X)

In [113]:
# split the data into into train and val set
train_x, val_x, train_y, val_y = train_test_split(new_col, y, random_state=42, test_size=0.2)

In [114]:
# training the model
from sklearn.tree import DecisionTreeClassifier
dc = DecisionTreeClassifier()
model_3 = dc.fit(train_x, train_y)

In [115]:
# Get predicted values
y_pred = model_3.predict(val_x)

# Get predicted probabilities
y_proba = model_3.predict_proba(val_x)[:, 1]  # For binary classification, get probabilities for the positive class

In [116]:
# Calculate metrics
accuracy = accuracy_score(val_y, y_pred)
precision = precision_score(val_y, y_pred)
recall = recall_score(val_y, y_pred)
f1 = f1_score(val_y, y_pred)
roc_auc = roc_auc_score(val_y, y_proba)  # Use predicted probabilities for ROC-AUC

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

Accuracy: 0.9698836413415469
Precision: 0.9453333333333334
Recall: 0.9957865168539326
F1 Score: 0.969904240766074
ROC-AUC Score: 0.9705234319917193


In [117]:
# Confusion matrix
cm = confusion_matrix(val_y, y_pred)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[708  41]
 [  3 709]]


In [118]:
# Classification report
report = classification_report(val_y, y_pred)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.97       749
           1       0.95      1.00      0.97       712

    accuracy                           0.97      1461
   macro avg       0.97      0.97      0.97      1461
weighted avg       0.97      0.97      0.97      1461



### Summary


1. **Imbalanced Data**: The dataset is imbalanced with more instances of class `0` (majority class) than class `1` (minority class).
2. **Balanced Data After Oversampling**: Oversampling has been used to balance the classes.
3. **Balanced Data with Feature Selection**: Feature selection has been applied after balancing the data.

### Analysis

#### **1. Imbalanced Data**

- **Precision (0)**: 0.95
- **Precision (1)**: 0.65
- **Recall (0)**: 0.93
- **Recall (1)**: 0.72
- **Accuracy**: 0.90

Here, the model performs well for class `0` with high precision and recall, but the performance on class `1` is lower. The precision for class `1` is 0.65, which means there are more false positives, and the recall of 0.72 indicates it misses several positive instances.

#### **2. Balanced Data After Oversampling**

- **Precision (0)**: 0.94
- **Precision (1)**: 0.98
- **Recall (0)**: 0.99
- **Recall (1)**: 0.94
- **Accuracy**: 0.96

After oversampling, the model improves significantly on class `1` performance. Precision for class `1` is now 0.98, and recall is 0.94. The high recall for class `0` (0.99) suggests the model has almost no false negatives. The overall accuracy of 0.96 is much higher than in the imbalanced case.

#### **3. Balanced Data with Feature Selection**

- **Precision (0)**: 1.00
- **Precision (1)**: 0.95
- **Recall (0)**: 0.95
- **Recall (1)**: 1.00
- **Accuracy**: 0.97

After balancing and feature selection, the model’s performance improves further. Precision and recall for class `1` are nearly perfect, and the precision for class `0` reaches 1.00. This indicates that the model is now more capable of distinguishing between the two classes and performs better on the minority class.

### Recommendations

1. **Handling Imbalance**: The model trained on the imbalanced data struggles with the minority class (class `1`). If class `1` represents a critical business outcome (e.g., customer churn), it’s essential to address this imbalance before training the model. Techniques like **oversampling** or **undersampling** can help improve performance.

2. **Oversampling Benefits**: The oversampling technique has greatly enhanced the model’s ability to predict the minority class. The improved recall (0.94) and precision (0.98) for class `1` show that the model is now identifying more true positives and reducing false positives.

3. **Feature Selection**: After performing feature selection, the model's overall performance improved further. This suggests that certain features may have been irrelevant or redundant, and removing them helped the model generalize better. You should consider **recursive feature elimination (RFE)** or other feature selection techniques to identify the most important features.

4. **Next Steps**: 
   - **Hyperparameter Tuning**: To further enhance the model, tune the hyperparameters (e.g., max depth, min samples split) for the decision tree.
   - **Other Classifiers**: Consider comparing the decision tree performance to other models like **Random Forest** or **Gradient Boosting**.
   - **Evaluation Metrics**: Since you are dealing with imbalanced classes, keep focusing on metrics like **precision, recall, and F1-score** instead of accuracy alone. Also, **ROC-AUC** should be part of the evaluation.

