<a href="https://colab.research.google.com/github/i-wav/ML-Pipeline-Customer-Churn-Dataset/blob/main/ML_Pilpeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Churn: **ML Pipeline**
### This notebook contains the code to build an ML pipeline for the customer churn dataset from Kaggle.
### Link to dataset: [Click Here](https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset)

### 1.1 Loading libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

In [3]:
from google.colab import files
uploaded = files.upload()

Saving customer_churn_dataset-training-master.csv to customer_churn_dataset-training-master.csv


In [4]:
df = pd.read_csv('customer_churn_dataset-training-master.csv')

In [5]:
df.head()

|  | CustomerID | Age | Gender | Tenure | Usage Frequency | Support Calls | Payment Delay | Subscription Type | Contract Length | Total Spend | Last Interaction | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 30.0 | Female | 39.0 | 14.0 | 5.0 | 18.0 | Standard | Annual | 932.0 | 17.0 | 1.0 |
| 1 | 3.0 | 65.0 | Female | 49.0 | 1.0 | 10.0 | 8.0 | Basic | Monthly | 557.0 | 6.0 | 1.0 |
| 2 | 4.0 | 55.0 | Female | 14.0 | 4.0 | 6.0 | 18.0 | Basic | Quarterly | 185.0 | 3.0 | 1.0 |
| 3 | 5.0 | 58.0 | Male | 38.0 | 21.0 | 7.0 | 7.0 | Standard | Monthly | 396.0 | 29.0 | 1.0 |
| 4 | 6.0 | 23.0 | Male | 32.0 | 20.0 | 5.0 | 8.0 | Basic | Monthly | 617.0 | 20.0 | 1.0 |


In [6]:
df.shape

(440833, 12)

In [7]:
df.isnull().sum()

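| Column | Missing values |
|---|---|
| CustomerID | 1 |
| Age | 1 |
| Gender | 1 |
| Tenure | 1 |
| Usage Frequency | 1 |
| Support Calls | 1 |
| Payment Delay | 1 |
| Subscription Type | 1 |
| Contract Length | 1 |
| Total Spend | 1 |
| Last Interaction | 1 |
| Churn | 1 |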


In [8]:
df[df.isnull().sum(axis=1) > 0]

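|  | CustomerID | Age | Gender | Tenure | Usage Frequency | Support Calls | Payment Delay | Subscription Type | Contract Length | Total Spend | Last Interaction | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199295 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |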


Only one row has missing data, and every field in it is empty, so we drop it. Had missing values been spread across many rows, we would have imputed them instead.

In [9]:
df.dropna(inplace=True)
print(df.isnull().sum().sum())

0
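The note above mentions imputation as the alternative to dropping rows. A minimal sketch of what that could look like, assuming median imputation for numeric columns and most-frequent imputation for categorical ones. The toy frame and both strategy choices are illustrative assumptions, not part of this notebook's pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with scattered gaps (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 55.0, 58.0],
    "Gender": ["Female", "Male", np.nan, "Male"],
})

# Numeric column: fill gaps with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill gaps with the most frequent observed value.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

remaining_missing = int(toy.isnull().sum().sum())
print(toy)
print("Missing values left:", remaining_missing)
```

In a real pipeline the fill values would be learned from the training split only, so that nothing from the test split leaks into preprocessing.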


### Encoding Categorical Variables (columns that are used as features)
#### Label Encoding for binary & One-Hot Encoding for non-binary categorical variables

In [13]:
# CustomerID is an identifier rather than a predictive feature, so drop it.
df.drop('CustomerID', axis=1, inplace=True)

# Manual encoding, kept for reference only. It is left disabled here;
# the ColumnTransformer defined later handles the categorical columns.
# le = LabelEncoder()
# df['Gender'] = le.fit_transform(df['Gender'])  # Female: 0, Male: 1
# df = pd.get_dummies(df, columns=['Subscription Type', 'Contract Length'], drop_first=True)
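
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical three-row frame. The rows are made up and only the column names follow the dataset. It mirrors the commented-out calls above rather than anything the pipeline actually runs:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Contract Length": ["Annual", "Monthly", "Quarterly"],
})

# Label encoding: one integer per class, classes sorted alphabetically
# (Female -> 0, Male -> 1). Suitable for a binary column.
le = LabelEncoder()
gender_codes = le.fit_transform(toy["Gender"])

encoded = toy.copy()
encoded["Gender"] = gender_codes

# One-hot encoding: one indicator column per category. drop_first=True drops
# the first category (Annual), which becomes the implicit baseline.
encoded = pd.get_dummies(encoded, columns=["Contract Length"], drop_first=True)

print(encoded)
```

One-hot encoding avoids implying an order among categories such as Annual, Monthly and Quarterly, which a single integer column would do.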

In [14]:
X = df.drop('Churn', axis=1)
y = df['Churn']

In [15]:
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

In [17]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

In [18]:
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf_pipeline.fit(X_train, y_train)
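
Once the pipeline is fitted, the metrics imported at the top can be used to score it on the held-out split. The following is a self-contained sketch of that step on synthetic data, since the real dataset is not available here. The `*_demo` names, the two columns used, and the rule that generates the labels are all assumptions made for illustration, so the printed numbers say nothing about the actual churn model:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn data (hypothetical values).
rng = np.random.default_rng(42)
n = 400
X_demo = pd.DataFrame({
    "Support Calls": rng.integers(0, 11, size=n).astype(float),
    "Contract Length": rng.choice(["Annual", "Monthly", "Quarterly"], size=n),
})
# Assumed labelling rule: more support calls means a higher chance of churn.
y_demo = (X_demo["Support Calls"] + rng.normal(0, 2, size=n) > 5).astype(float)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Support Calls"]),
        ("cat", OneHotEncoder(drop="first"), ["Contract Length"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Same split settings as above: 80/20, stratified on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
demo_pipeline.fit(X_tr, y_tr)

# Hard predictions for accuracy and the confusion matrix,
# class-1 probabilities for ROC AUC.
y_pred = demo_pipeline.predict(X_te)
y_proba = demo_pipeline.predict_proba(X_te)[:, 1]

accuracy = accuracy_score(y_te, y_pred)
auc = roc_auc_score(y_te, y_proba)
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
print(f"Accuracy: {accuracy:.3f}  ROC AUC: {auc:.3f}")
```

Because preprocessing lives inside the pipeline, the scaler and encoder are fitted on the training split only and then reused unchanged on the test split.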