# **Customer Churn Prediction**


## **Problem Statement:**

Develop a predictive model to identify customers at risk of churning from an investment bank, enabling proactive retention strategies to minimize customer loss and maximize revenue growth.


## **About the Dataset**

There are 14 columns/features and 10k rows/samples.

**RowNumber**—corresponds to the record (row) number and has no effect on the output.

**CustomerId**—contains random values and has no effect on customer leaving the bank.

**Surname**—the surname of a customer has no impact on their decision to leave the bank.

**CreditScore**—can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.

**Geography**—a customer’s location can affect their decision to leave the bank.

**Gender**—it’s interesting to explore whether gender plays a role in a customer leaving the bank.

**Age**—this is certainly relevant, since older customers are less likely to leave their bank than younger ones.

**Tenure**—refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.

**Balance**—also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.

**NumOfProducts**—refers to the number of products that a customer has purchased through the bank.

**HasCrCard**—denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.

**IsActiveMember**—active customers are less likely to leave the bank.

**EstimatedSalary**—as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.

**Exited**—whether or not the customer left the bank.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **KNN**

The K-Nearest Neighbors (KNN) algorithm is a simple and effective machine learning technique that classifies data points by finding the K most similar instances to a new input and voting for the target class or value.

### **The most commonly used hyperparameters for K-Nearest Neighbors (KNN) algorithm:**

n_neighbors: The number of nearest neighbors to consider when making a prediction. Small values fit the training data closely and are sensitive to noise, while larger values smooth the decision boundary at the cost of more computation per query.

weights: The weight function used when combining the neighbors' votes. Supported weights are 'uniform' (all neighbors count equally) and 'distance' (neighbors closer to the query point count more).

algorithm: The algorithm used to compute the nearest neighbors. Supported algorithms are 'auto' (choose based on the training data), 'brute' (exhaustive search), 'kd_tree' (k-d tree search), and 'ball_tree' (ball tree search).

leaf_size: The leaf size passed to the k-d tree or ball tree. It affects the speed of tree construction and queries and the memory needed to store the tree, not the predictions themselves.

p: The power parameter for the Minkowski metric. When p=1, it is the Manhattan distance, and when p=2, it is the Euclidean distance.

metric: The distance metric used to calculate the distance between samples. Supported metrics are 'minkowski' (Minkowski distance), 'euclidean' (Euclidean distance), 'manhattan' (Manhattan distance), and 'chebyshev' (Chebyshev distance).
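
To make `p` and `weights` concrete, here is a minimal pure-Python sketch of a brute-force nearest-neighbor vote. The helper names (`minkowski`, `knn_predict`) and the toy points are illustrative assumptions, not scikit-learn's implementation.

```python
from collections import defaultdict

def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_points, train_labels, query, n_neighbors=3, weights="uniform", p=2):
    """Classify `query` by a (possibly distance-weighted) vote of its nearest neighbors."""
    # Brute-force search: compute the distance to every training point.
    dists = sorted(
        (minkowski(pt, query, p), label)
        for pt, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for dist, label in dists[:n_neighbors]:
        if weights == "uniform":
            votes[label] += 1.0                    # every neighbor counts equally
        else:
            votes[label] += 1.0 / (dist + 1e-12)   # "distance": closer neighbors count more
    return max(votes, key=votes.get)

# Toy data (hypothetical): class 1 has one very close point, class 0 has two farther ones.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [1, 0, 0]
query = (0.1, 0.1)

uniform_pred = knn_predict(points, labels, query, n_neighbors=3, weights="uniform")
distance_pred = knn_predict(points, labels, query, n_neighbors=3, weights="distance")
print(uniform_pred, distance_pred)
```

With uniform weights the two farther class-0 points outvote the single close class-1 point; with distance weights the close point wins.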

### **Here are some common values for these hyperparameters:**

n_neighbors: 3, 5, 10, 20

weights: 'uniform', 'distance'

algorithm: 'auto', 'brute', 'kd_tree', 'ball_tree'

leaf_size: 10, 20, 30

p: 1, 2

metric: 'minkowski', 'euclidean', 'manhattan', 'chebyshev'
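
One way to choose among these values is a cross-validated grid search. The sketch below assumes scikit-learn's `GridSearchCV` and uses synthetic data from `make_classification` as a stand-in, so the selected values say nothing about the churn dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled churn features (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {
    "n_neighbors": [3, 5, 10, 20],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # with metric='minkowski': 1 = Manhattan, 2 = Euclidean
}

search = GridSearchCV(
    KNeighborsClassifier(metric="minkowski"),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="roc_auc",   # rank candidates by ROC AUC rather than accuracy
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```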


In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score,roc_auc_score
from sklearn.svm import SVC
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

#import warnings
#warnings.filterwarnings('ignore')



In [4]:

# Load data
data = pd.read_csv('/content/drive/MyDrive/churn_data/Churn_data .csv')



In [5]:
data.head()

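| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |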


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [7]:
# is null?
isnull = data.isnull().sum()
isnull

| | 0 |
|---|---|
| RowNumber | 0 |
| CustomerId | 0 |
| Surname | 0 |
| CreditScore | 0 |
| Geography | 0 |
| Gender | 0 |
| Age | 0 |
| Tenure | 0 |
| Balance | 0 |
| NumOfProducts | 0 |


In [8]:
# Preprocess data
selected_features = [
    'CreditScore', 'Geography', 'Gender', 'Age',
    'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
    'IsActiveMember', 'EstimatedSalary'
]
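# .copy() gives an independent DataFrame, so the in-place encoding and scaling below
# do not trigger SettingWithCopyWarning
X = data[selected_features].copy()
# A 1-D Series (not a one-column DataFrame) is the target shape scikit-learn expects
y = data['Exited']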



In [9]:
X['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)
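
Geography has three unordered categories. As a small sketch of how scikit-learn's `LabelEncoder` maps such values to integers, and of why each column needs its own fitted encoder, consider the following. The variable names and the short value lists are illustrative assumptions.

```python
from sklearn.preprocessing import LabelEncoder

geography_values = ["France", "Spain", "Germany", "France"]
gender_values = ["Female", "Male", "Female", "Male"]

# One encoder per column, so each keeps its own classes_.
geo_demo_encoder = LabelEncoder().fit(geography_values)
gender_demo_encoder = LabelEncoder().fit(gender_values)

geo_classes = list(geo_demo_encoder.classes_)   # classes are stored in sorted order
geo_codes = list(geo_demo_encoder.transform(["France", "Spain", "Germany"]))

# Refitting a single shared encoder overwrites what it learned before.
shared = LabelEncoder()
shared.fit(geography_values)
shared.fit(gender_values)
shared_classes = list(shared.classes_)          # only the Gender labels remain

print(geo_classes, geo_codes, shared_classes)
```

Because the classes are sorted, France, Germany and Spain become 0, 1 and 2. These codes imply an ordering that the categories do not have, which matters for a distance-based model such as KNN.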

In [10]:
# Label encoding
le = LabelEncoder()
X['Geography'] = le.fit_transform(X['Geography'])
X['Gender'] = le.fit_transform(X['Gender'])




In [11]:
# Scaling
scaler = MinMaxScaler()
X[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']] = scaler.fit_transform(X[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']])




In [13]:
# Split data
train_X, val_X, train_y, val_y = train_test_split(
    X, y, random_state=42, train_size=0.8
)
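
In the cells above, the scaler is fit on all rows before the split, so the validation rows influence the learned minimum and maximum. A hedged alternative, sketched here on synthetic stand-in data rather than the churn table, fits `MinMaxScaler` on the training split only and reuses those statistics for the validation split. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for two numeric columns (e.g. a score and a balance).
X_demo = rng.uniform(low=[350, 0], high=[850, 250000], size=(1000, 2))
y_demo = rng.integers(0, 2, size=1000)

X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

scaler_demo = MinMaxScaler()
X_train_scaled = scaler_demo.fit_transform(X_train)  # learn min/max from training rows only
X_val_scaled = scaler_demo.transform(X_val)          # reuse the training min/max

# Min-max scaling is (x - min) / (max - min), column by column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)
print(np.allclose(manual, X_train_scaled))
```

Validation values can fall slightly outside [0, 1] with this approach, which is expected because the validation rows were not used to set the range.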



In [14]:
# Train model
model = KNeighborsClassifier(n_neighbors=2, metric='euclidean', weights='uniform', algorithm='auto', leaf_size=100, p=2)
model.fit(train_X, train_y)




In [15]:
# Evaluate model
val_prediction = model.predict(val_X)
y_pred_proba = model.predict_proba(val_X)[:,1]
accuracy = accuracy_score(val_y, val_prediction)
print(f'Model accuracy: {accuracy}')



Model accuracy: 0.8235


In [16]:
print(confusion_matrix(val_y, val_prediction))
print(classification_report(val_y, val_prediction))

[[1551   56]
 [ 297   96]]
              precision    recall  f1-score   support

           0       0.84      0.97      0.90      1607
           1       0.63      0.24      0.35       393

    accuracy                           0.82      2000
   macro avg       0.74      0.60      0.63      2000
weighted avg       0.80      0.82      0.79      2000
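
The figures in the report follow directly from the confusion matrix. The short sketch below recomputes the churn-class metrics by hand from the four counts printed above; only the variable names are new.

```python
# Confusion matrix from the cell above: rows are true classes, columns are predictions.
tn, fp = 1551, 56
fn, tp = 297, 96

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_churn = tp / (tp + fp)   # of predicted churners, how many really churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.4f} precision={precision_churn:.2f} "
      f"recall={recall_churn:.2f} f1={f1_churn:.2f}")
# prints: accuracy=0.8235 precision=0.63 recall=0.24 f1=0.35
```

The accuracy of 0.8235 is driven mostly by the majority class: the model finds only 96 of the 393 customers who actually churned.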



In [17]:
auc = roc_auc_score(val_y, y_pred_proba)
print(auc)

0.6971479737978405
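
ROC AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counting half. The sketch below implements that pairwise definition in plain Python. The function name and the toy labels and scores are illustrative assumptions, not the notebook's data.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores. With n_neighbors=2 and uniform weights,
# predict_proba can only return 0.0, 0.5 or 1.0, so ties are common.
y_demo = [1, 1, 1, 0, 0, 0, 0]
scores_demo = [1.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]

auc_demo = roc_auc_pairwise(y_demo, scores_demo)
print(auc_demo)  # 0.75
```

Since a two-neighbor model produces only three distinct scores, a larger `n_neighbors` or `weights='distance'` gives a finer ranking, which may change the AUC.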


In [18]:
# Save model
joblib.dump(model, 'churn_model.pkl')

['churn_model.pkl']

In [None]:
# Fit one encoder per categorical column on the raw values in `data`, so each keeps
# its own classes_. LabelEncoder sorts its classes, so the codes match the training cell.
geo_encoder = LabelEncoder().fit(data['Geography'])
gender_encoder = LabelEncoder().fit(data['Gender'])

joblib.dump(geo_encoder, "geo_encoder.pkl")
joblib.dump(gender_encoder, "gender_encoder.pkl")
joblib.dump(scaler, "min_max_scaler.pkl")
joblib.dump(model, "churn_model.pkl")

In [None]:
##Updated app.py

In [None]:
import streamlit as st
import pandas as pd
import joblib
import numpy as np

# --------------------------------
# Load model and preprocessing objects
# --------------------------------
model = joblib.load('churn_model.pkl')

# Separate encoders for Geography and Gender (BEST PRACTICE)
geo_encoder = joblib.load('geo_encoder.pkl')
gender_encoder = joblib.load('gender_encoder.pkl')

# Load scaler
min_max_scaler = joblib.load('min_max_scaler.pkl')

# Feature order (must match training)
feature_order = [
    'CreditScore', 'Geography', 'Gender', 'Age',
    'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
    'IsActiveMember', 'EstimatedSalary'
]

# --------------------------------
# Streamlit UI
# --------------------------------
st.title('Customer Churn Prediction App')
st.write('Enter customer details to predict if they will churn.')

credit_score = st.slider('Credit Score', 350, 850, 600)
geography_options = ['France', 'Spain', 'Germany']
geography = st.selectbox('Geography', geography_options)
gender_options = ['Female', 'Male']
gender = st.selectbox('Gender', gender_options)
age = st.slider('Age', 18, 92, 35)
tenure = st.slider('Tenure (years with bank)', 0, 10, 5)
balance = st.number_input('Balance', 0.0, 250000.0, 50000.0, format="%.2f")
num_of_products = st.slider('Number of Products', 1, 4, 1)
has_cr_card = st.checkbox('Has Credit Card?')
is_active_member = st.checkbox('Is Active Member?')
estimated_salary = st.number_input('Estimated Salary', 0.0, 200000.0, 100000.0, format="%.2f")

has_cr_card_val = 1 if has_cr_card else 0
is_active_member_val = 1 if is_active_member else 0

# --------------------------------
# SAFE LABEL ENCODING HELPER
# --------------------------------
def safe_label_transform(encoder, value):
    """Safely transform label-encoded values, including unseen ones."""
    if value not in encoder.classes_:
        # Extend encoder to include new category
        encoder.classes_ = np.append(encoder.classes_, value)
    return encoder.transform([value])[0]

# --------------------------------
# Prediction
# --------------------------------
if st.button('Predict Churn'):

    # Build user input DataFrame
    input_data = pd.DataFrame([{
        'CreditScore': credit_score,
        'Geography': geography,
        'Gender': gender,
        'Age': age,
        'Tenure': tenure,
        'Balance': balance,
        'NumOfProducts': num_of_products,
        'HasCrCard': has_cr_card_val,
        'IsActiveMember': is_active_member_val,
        'EstimatedSalary': estimated_salary
    }])

    # Safe encoding (prevents "unseen label" crash)
    input_data['Geography'] = input_data['Geography'].apply(
        lambda x: safe_label_transform(geo_encoder, x)
    )
    input_data['Gender'] = input_data['Gender'].apply(
        lambda x: safe_label_transform(gender_encoder, x)
    )

    # Numerical scaling
    numerical_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
    input_data[numerical_cols] = min_max_scaler.transform(input_data[numerical_cols])

    # Reorder columns
    input_data = input_data[feature_order]

    # Predict
    prediction = model.predict(input_data)
    prediction_proba = model.predict_proba(input_data)[:, 1]

    st.subheader('Prediction Result:')
    if prediction[0] == 1:
        st.error(f'The customer is likely to churn. (Probability: {prediction_proba[0]:.2f})')
    else:
        st.success(f'The customer is not likely to churn. (Probability: {prediction_proba[0]:.2f})')

In [1]:
%%writefile requirements.txt
# updated requirements.txt
streamlit
pandas
scikit-learn
joblib

## **Decision Tree**

A Decision Tree Classifier is a supervised learning algorithm that predicts a class by following a sequence of feature-based tests from the root of a tree to a leaf. The tree is built by recursively partitioning the training data into subsets based on the values of the input features, choosing at each node the split that most reduces impurity.

### **The most commonly used hyperparameters for Decision Tree Classifier**:

criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

max_depth: The maximum depth of the tree. Increasing this number can improve the model's performance, but also increases the risk of overfitting.

min_samples_split: The minimum number of samples required to split an internal node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

min_samples_leaf: The minimum number of samples required to be at a leaf node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

max_features: The maximum number of features to consider at each split. Increasing this number can improve the model's performance, but also increases the computation time.

random_state: The seed that controls the randomness of the estimator. Features are randomly permuted at each split, so the chosen split can vary between runs when several splits are equally good. Setting this to a fixed value ensures reproducibility of the results.

class_weight: The weight assigned to each class during training. This can be useful for imbalanced datasets, where one class has a much larger number of instances than the others.
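
The two `criterion` options can be computed by hand. Here is a minimal sketch of Gini impurity, entropy, and the impurity reduction of one candidate split; the function names and class counts are made up for illustration.

```python
from math import log2

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def weighted_impurity(children, impurity=gini):
    """Average impurity of the child nodes, weighted by their sizes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * impurity(child) for child in children)

# Hypothetical node with 10 stayers and 10 churners, split into two children.
parent = [10, 10]
children = [[8, 2], [2, 8]]

gini_gain = gini(parent) - weighted_impurity(children, gini)
info_gain = entropy(parent) - weighted_impurity(children, entropy)
print(round(gini_gain, 2), round(info_gain, 3))
```

A split is preferred when it produces the largest reduction. `min_impurity_decrease`-style thresholds compare against a weighted version of this quantity.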

### **Here are some common values for these hyperparameters:**

criterion: 'gini', 'entropy'

max_depth: 3, 5, 10, None (None means no limit)

min_samples_split: 2, 5, 10

min_samples_leaf: 1, 5, 10

max_features: 'sqrt', 'log2', None (None means all features are considered; 'auto' was removed in scikit-learn 1.3)

random_state: 0, 42, 100

class_weight: 'balanced', None (None means all classes are equal), or a dict mapping each class to a weight
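
For reference, scikit-learn documents the 'balanced' mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to the validation-split class counts from the classification report earlier in this notebook (1607 stayers, 393 churners). The helper name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights proportional to the inverse class frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Support column of the classification report: class 0 = stayed, class 1 = churned.
weights = balanced_class_weights({0: 1607, 1: 393})
print({label: round(w, 2) for label, w in weights.items()})
# prints: {0: 0.62, 1: 2.54}
```

Each churner therefore counts about four times as much as each stayer when the tree evaluates a split.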


## **Random Forest**

Random Forest is a supervised learning algorithm that combines many decision trees to produce a more accurate and stable model than a single tree. Each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features at each split. For classification, the final prediction aggregates the predictions of all the trees (scikit-learn averages their predicted class probabilities).

### **The most commonly used hyperparameters for Random Forest Classifier:**

n_estimators: The number of trees in the forest. Increasing this number can improve the model's performance, but also increases the computation time.

criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

max_depth: The maximum depth of each tree. Increasing this number can improve the model's performance, but also increases the risk of overfitting.

min_samples_split: The minimum number of samples required to split an internal node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

min_samples_leaf: The minimum number of samples required to be at a leaf node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

max_features: The maximum number of features to consider at each split. Increasing this number can improve the model's performance, but also increases the computation time.

max_leaf_nodes: The maximum number of leaf nodes in each tree. Increasing this number can improve the model's performance, but also increases the computation time.

min_impurity_decrease: The minimum decrease in impurity required to split an internal node. Increasing this number can lead to underfitting, while decreasing it can lead to overfitting.

bootstrap: Whether to use bootstrap sampling to build each tree. If True, each tree is built on a sample drawn with replacement from the training data; if False, the whole dataset is used to build each tree.

oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy. Only available when bootstrap=True.

random_state: The seed that controls the bootstrap sampling and the random selection of features considered at each split. Setting this to a fixed value ensures reproducibility of the results.

class_weight: The weight assigned to each class during training. This can be useful for imbalanced datasets, where one class has a much larger number of instances than the others.

### **Here are some common values for these hyperparameters:**

n_estimators: 10, 50, 100, 200

criterion: 'gini', 'entropy'

max_depth: 3, 5, 10, None (None means no limit)

min_samples_split: 2, 5, 10

min_samples_leaf: 1, 5, 10

max_features: 'sqrt', 'log2', None (None means all features are considered; 'auto' was removed in scikit-learn 1.3)

max_leaf_nodes: 10, 50, 100, None (None means no limit)

min_impurity_decrease: 0.0, 0.1, 0.5

bootstrap: True, False

oob_score: True, False

random_state: 0, 42, 100

class_weight: 'balanced', 'balanced_subsample', None (None means all classes are equal)
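
As an illustration of how several of these settings combine, the following sketch trains a `RandomForestClassifier` with out-of-bag scoring on synthetic, imbalanced stand-in data. The dataset and the chosen values are assumptions for demonstration, not tuned settings for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (roughly 80% / 20%), not the churn dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    max_features="sqrt",
    bootstrap=True,           # required for oob_score
    oob_score=True,           # score each sample using trees that did not see it
    class_weight="balanced",
    random_state=42,
)
forest.fit(X_demo, y_demo)

oob_accuracy = forest.oob_score_
n_trees = len(forest.estimators_)
print(n_trees, round(oob_accuracy, 3))
```

The out-of-bag score gives a generalization estimate without setting aside a separate validation split.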


## **SVM**

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the largest margin, optionally after a kernel maps the data into a higher-dimensional feature space.

### **The most commonly used hyperparameters for Support Vector Machines (SVMs) are:**

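C: The regularization parameter. It controls the trade-off between a wide margin and the misclassification of training points; the strength of the regularization is inversely proportional to C.

kernel: The kernel function used to implicitly map the data into a higher-dimensional space.

gamma: The kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels. Larger values make the influence of each training point more local.

degree: The degree of the polynomial kernel. It is ignored by the other kernels.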


### **Here are some common values for these hyperparameters:**

C: 1.0, 10.0, 100.0, 1000.0

kernel: 'rbf', 'linear', 'poly', 'sigmoid'

gamma: 'scale', 'auto', 0.1, 1.0, 10.0

degree: 2, 3, 4, 5
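
To show what `gamma` and `degree` act on, here is a plain-Python sketch of the RBF and polynomial kernel formulas as given in the scikit-learn documentation. The function names and sample points are illustrative.

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)

def poly_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

a, b = (0.0, 0.0), (1.0, 1.0)
similarity_small_gamma = rbf_kernel(a, b, gamma=0.1)
similarity_large_gamma = rbf_kernel(a, b, gamma=10.0)
print(similarity_small_gamma, similarity_large_gamma)
```

With a small `gamma` the two points still look similar, while with a large `gamma` the similarity drops to nearly zero. This is why large values tend to produce very local, more complex decision boundaries.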


# Task
Save the fitted `LabelEncoder` and `MinMaxScaler` objects, generate the `app.py` file for the Streamlit application, and create the `requirements.txt` file listing all necessary Python dependencies.

## Save Preprocessing Objects

### Subtask:
Save the fitted `LabelEncoder` and `MinMaxScaler` objects to disk so they can be loaded by the Streamlit application for consistent data preprocessing.


**Reasoning**:
To ensure consistent data preprocessing in the Streamlit application, I need to save the fitted `LabelEncoder` and `MinMaxScaler` objects to disk.



In [19]:
# `le` was fit on Geography and then refit on Gender, so this pickle only holds the Gender classes
joblib.dump(le, 'label_encoder.pkl')
joblib.dump(scaler, 'min_max_scaler.pkl')
print("LabelEncoder and MinMaxScaler objects saved successfully.")

LabelEncoder and MinMaxScaler objects saved successfully.


## Generate app.py

### Subtask:
Create the `app.py` file containing the Streamlit application code. This will include loading the model and preprocessing objects, setting up the user interface for input, preprocessing new data, and displaying predictions.


**Reasoning**:
I need to generate the `app.py` file with the Streamlit application code. This code will handle loading the model and preprocessing objects, creating the user interface, processing user input, making predictions, and displaying results.



In [20]:
%%writefile app.py
import streamlit as st
import pandas as pd
import joblib
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import numpy as np

# Load the pre-trained model and preprocessing objects
model = joblib.load('churn_model.pkl')
label_encoder = joblib.load('label_encoder.pkl')
min_max_scaler = joblib.load('min_max_scaler.pkl')

# Define the order of features used during training
# This should match the 'selected_features' list used during model training
feature_order = [
    'CreditScore', 'Geography', 'Gender', 'Age',
    'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
    'IsActiveMember', 'EstimatedSalary'
]

# Streamlit app title and description
st.title('Customer Churn Prediction App')
st.write('Enter customer details to predict if they will churn.')

# Create input fields
credit_score = st.slider('Credit Score', 350, 850, 600)
geography_options = ['France', 'Spain', 'Germany']
geography = st.selectbox('Geography', geography_options)
gender_options = ['Female', 'Male']
gender = st.selectbox('Gender', gender_options)
age = st.slider('Age', 18, 92, 35)
tenure = st.slider('Tenure (years with bank)', 0, 10, 5)
balance = st.number_input('Balance', 0.0, 250000.0, 50000.0, format="%.2f")
num_of_products = st.slider('Number of Products', 1, 4, 1)
has_cr_card = st.checkbox('Has Credit Card?')
is_active_member = st.checkbox('Is Active Member?')
estimated_salary = st.number_input('Estimated Salary', 0.0, 200000.0, 100000.0, format="%.2f")

# Convert boolean checkboxes to integer (0 or 1)
has_cr_card_val = 1 if has_cr_card else 0
is_active_member_val = 1 if is_active_member else 0

# Prediction button
if st.button('Predict Churn'):
    # Create a DataFrame from user input
    input_data = pd.DataFrame([{
        'CreditScore': credit_score,
        'Geography': geography,
        'Gender': gender,
        'Age': age,
        'Tenure': tenure,
        'Balance': balance,
        'NumOfProducts': num_of_products,
        'HasCrCard': has_cr_card_val,
        'IsActiveMember': is_active_member_val,
        'EstimatedSalary': estimated_salary
    }])

    # Apply label encoding to 'Geography' and 'Gender' with the per-column encoders.
    # The shared `label_encoder` was last fit on Gender, so it cannot transform Geography.
    input_data['Geography'] = joblib.load('geo_encoder.pkl').transform(input_data['Geography'])
    input_data['Gender'] = joblib.load('gender_encoder.pkl').transform(input_data['Gender'])

    # Apply Min-Max Scaling to numerical features
    numerical_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
    input_data[numerical_cols] = min_max_scaler.transform(input_data[numerical_cols])

    # Ensure the order of columns matches the training data
    input_data = input_data[feature_order]

    # Make prediction
    prediction = model.predict(input_data)
    prediction_proba = model.predict_proba(input_data)[:, 1]

    st.subheader('Prediction Result:')
    if prediction[0] == 1:
        st.error(f'The customer is likely to churn. (Probability: {prediction_proba[0]:.2f})')
    else:
        st.success(f'The customer is not likely to churn. (Probability: {prediction_proba[0]:.2f})')


Writing app.py


**Reasoning**:
The next step is to create the `requirements.txt` file, which lists all the Python dependencies required for the Streamlit application to run. This will ensure that the environment can be set up correctly.



In [24]:
%pip install joblib



In [25]:
%%writefile requirements.txt
streamlit
pandas
scikit-learn
joblib

Overwriting requirements.txt


## Summary:

### Data Analysis Key Findings
*   The `LabelEncoder` and `MinMaxScaler` objects, essential for data preprocessing, were successfully saved to `label_encoder.pkl` and `min_max_scaler.pkl` respectively, ensuring consistent preprocessing for the Streamlit application.
*   An `app.py` file was generated, containing a Streamlit application designed to predict customer churn. This application loads a pre-trained churn prediction model, the saved `LabelEncoder`, and `MinMaxScaler` to process user-entered customer data (including Credit Score, Geography, Gender, Age, Tenure, Balance, Number of Products, Has Credit Card, Is Active Member, and Estimated Salary) and display a churn probability prediction.
*   A `requirements.txt` file was created, listing all necessary Python dependencies for the Streamlit application: `streamlit`, `pandas`, `scikit-learn`, and `joblib`.

### Insights or Next Steps
*   The completed files (`churn_model.pkl`, `label_encoder.pkl`, `min_max_scaler.pkl`, `app.py`, `requirements.txt`) are now ready for deployment. The updated `app.py` also loads `geo_encoder.pkl` and `gender_encoder.pkl`, so those files need to be deployed with it. The Streamlit application can be launched using `streamlit run app.py` in a terminal where the dependencies from `requirements.txt` have been installed.
