# Task 2: End-to-End ML Pipeline for Customer Churn Prediction

**Objective**  
Build a complete machine learning pipeline that predicts which customers will leave (churn) the company, using the Telco Customer Churn dataset.  
The pipeline is built with scikit-learn's Pipeline so that it is reusable and ready for production.

**Why this is useful**  
Companies lose money when customers leave. Predicting churn helps them take action early (like giving discounts).

**Dataset**  
Telco Customer Churn (from Kaggle) has info about customers: gender, a senior-citizen flag, the services they use, how long they've stayed (tenure), monthly and total charges, etc.
Target = Churn (Yes/No)

## Step 1: Import Libraries

Bringing in the tools I will need:
- pandas
- scikit-learn 
- joblib

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, classification_report
import joblib

## Step 2: Load and Explore the Data

Load the CSV file and look at it:
- See the first rows
- Check data types and missing values
- See summary statistics for the numeric columns
- Check how many people churned (Yes/No)

In [4]:
df = pd.read_csv("/kaggle/input/datasets/blastchar/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

print(df.head())


   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [6]:
print(df.isnull().sum())

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


In [7]:
print(df.describe())

       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


In [8]:
print(df['Churn'].value_counts())

Churn
No     5174
Yes    1869
Name: count, dtype: int64
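The raw counts are easier to read as shares. The short sketch below only reuses the two numbers printed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"No": 5174, "Yes": 1869}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    # Prints each class with its share of all customers as a percentage
    print(f"{label}: {share:.1%}")
```

Roughly one customer in four churned, so the classes are imbalanced, which matters later when choosing a metric.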


## Step 3: Clean the Data

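Small fixes that must be done:
- TotalCharges is stored as text, so convert it to numbers (entries that cannot be parsed become NaN)
- Remove the few rows with missing TotalCharges (11 rows: 7043 → 7032)
- Delete the useless column (customerID)
- Change Churn Yes/No → 1/0 (scikit-learn's f1_score treats 1 as the positive class by default)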
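The earlier missing-value check reported 0 for TotalCharges even though some rows are later dropped. A minimal sketch of why, using a made-up four-value column (the values and the blank entry are assumptions for illustration, not rows quoted from the dataset):

```python
import pandas as pd

# Hypothetical miniature of a text column with one unparseable entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# A string that holds only a space is still a value, so isnull() does not flag it
missing_before = int(raw.isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")
missing_after = int(numeric.isnull().sum())

# dropna() then removes the rows that became NaN
cleaned = numeric.dropna()
print(missing_before, missing_after, len(cleaned))
```

The missing values only become visible after the conversion, which is why the conversion has to come before `dropna`.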

In [9]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [11]:
df = df.dropna(subset=['TotalCharges'])

In [12]:
if 'customerID' in df.columns:
    df = df.drop('customerID', axis=1)
    print("Dropped customerID")
else:
    print("customerID already dropped, skipping it")

df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

print("Current columns:", df.columns.tolist())

Dropped customerID
Current columns: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


In [13]:
print("Shape after cleaning:", df.shape)
print("Columns now:", df.columns.tolist())

Shape after cleaning: (7032, 20)
Columns now: ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']


## Step 4: Split Data into Train and Test

Dividing data:
- 80% to teach the model (train)
- 20% to test how good it is (test)

Using `stratify=y` so the Yes/No ratio stays the same in both parts.
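To see what stratification does, here is a small sketch on synthetic labels (100 rows with a 75/25 class split, chosen only to make the arithmetic easy; it is not the Telco data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 75 "stay" (0) and 25 "churn" (1)
y_demo = np.array([0] * 75 + [1] * 25)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 25% churn share
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the churn share in a small test set can drift by chance, which makes scores on the minority class noisier.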

In [17]:
X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [18]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (5625, 19)
Test shape: (1407, 19)


## Step 5: Make Preprocessing Steps

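Preparing data automatically:
- Numbers (tenure, charges) → standardize them (zero mean, unit variance) so they are on a similar scale
- Yes/No or category columns → turn into 0/1 columns (one-hot encoding); SeniorCitizen is already 0/1 and is handled with this group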

In [19]:
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

cat_cols = [col for col in X.columns if col not in num_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols)
    ]
)

## Step 6: Build Pipelines and Train Models

Two models inside pipelines:
1. Logistic Regression (simple, fast, and easy to interpret)
2. Random Forest (an ensemble of decision trees that can capture nonlinear patterns)

Pipeline = preprocessing + model

In [24]:
# Logistic Regression pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

In [25]:
logreg_pipeline.fit(X_train, y_train)

y_pred_logreg = logreg_pipeline.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("F1 Score:", f1_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))

Logistic Regression Accuracy: 0.8045486851457001
F1 Score: 0.6088193456614509
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1033
           1       0.65      0.57      0.61       374

    accuracy                           0.80      1407
   macro avg       0.75      0.73      0.74      1407
weighted avg       0.80      0.80      0.80      1407
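The accuracy and F1 values above can be reproduced by hand. The four counts below are not printed by the notebook; they are a reconstruction, chosen as the integers that match the reported supports (374 churners, 1033 non-churners) and scores.

```python
# Confusion-matrix counts inferred from the report above (a reconstruction)
tp, fn = 214, 160   # churners caught / missed (support = 374)
fp, tn = 115, 918   # non-churners wrongly flagged / correctly passed (support = 1033)

precision = tp / (tp + fp)           # of the predicted churners, how many really churned
recall = tp / (tp + fn)              # of the real churners, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Read this way, the model finds a bit more than half of the customers who actually churn, which the 0.80 accuracy on its own does not reveal.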



In [26]:
# Random Forest pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [27]:
rf_pipeline.fit(X_train, y_train)

y_pred_rf = rf_pipeline.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("F1 Score:", f1_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.7839374555792467
F1 Score: 0.5449101796407185
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.62      0.49      0.54       374

    accuracy                           0.78      1407
   macro avg       0.72      0.69      0.70      1407
weighted avg       0.77      0.78      0.78      1407



## Step 7: Improve Models with Grid Search

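Trying different hyperparameter settings with 5-fold cross-validation to get the best version of each model.
Using F1 score because churn "Yes" is less common (imbalanced), so accuracy alone can look good even when few churners are found.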
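A quick calculation shows why accuracy is a weak guide here. It uses only the test-set class sizes from the classification reports above and a hypothetical "model" that predicts "No churn" for every customer.

```python
# Test-set class sizes taken from the classification reports above
n_stay, n_churn = 1033, 374

# Hypothetical baseline: predict "No churn" for everyone
tp, fp = 0, 0
fn, tn = n_churn, n_stay

accuracy = (tp + tn) / (n_stay + n_churn)
# F1 = 2*TP / (2*TP + FP + FN); it is 0 whenever no churner is caught
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

The baseline is right about 73% of the time yet never identifies a churner, so its F1 is zero. Optimizing F1 pushes the search toward settings that actually find churners.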

In [28]:
param_grid_logreg = {
    'classifier__C': [0.01, 0.1, 1, 10, 100]
}
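The `classifier__C` key follows scikit-learn's `<step name>__<parameter>` convention for reaching inside a pipeline. A minimal sketch with a stand-in pipeline (a plain `StandardScaler` replaces the notebook's ColumnTransformer here, purely to keep the example short):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline(steps=[
    ("preprocessor", StandardScaler()),  # stand-in for the real preprocessor
    ("classifier", LogisticRegression(random_state=42)),
])

# Every tunable setting of the "classifier" step is exposed under this prefix
tunable = [name for name in demo_pipeline.get_params() if name.startswith("classifier__")]

# GridSearchCV sets parameters the same way set_params does
demo_pipeline.set_params(classifier__C=10)
chosen_C = demo_pipeline.named_steps["classifier"].C
print(chosen_C)
```

The prefix must match the step name given when the pipeline was built, so a step called `'classifier'` is tuned with `classifier__...` keys.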

In [29]:
grid_logreg = GridSearchCV(logreg_pipeline, param_grid_logreg, cv=5, scoring='f1')
grid_logreg.fit(X_train, y_train)

In [30]:
print("Best LogReg Params:", grid_logreg.best_params_)
y_pred_best_logreg = grid_logreg.predict(X_test)
print("Tuned LogReg F1:", f1_score(y_test, y_pred_best_logreg))

Best LogReg Params: {'classifier__C': 10}
Tuned LogReg F1: 0.6036671368124118


In [31]:
param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20]
}

In [32]:
grid_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring='f1')
grid_rf.fit(X_train, y_train)

In [33]:
print("Best RF Params:", grid_rf.best_params_)
y_pred_best_rf = grid_rf.predict(X_test)
print("Tuned RF F1:", f1_score(y_test, y_pred_best_rf))

Best RF Params: {'classifier__max_depth': 10, 'classifier__n_estimators': 200}
Tuned RF F1: 0.5568862275449101


## Step 8: Save the Final Pipeline

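Saving the best model so it can be used later without training again. In this run the tuned Logistic Regression has the higher test F1 (0.60, against 0.56 for the tuned Random Forest), so that is the pipeline to save.
File name: churn_pipeline.pkl

In [34]:
# Save the pipeline with the higher F1 in this run (tuned Logistic Regression)
joblib.dump(grid_logreg.best_estimator_, 'churn_pipeline.pkl')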

['churn_pipeline.pkl']

In [35]:
print("Model saved as churn_pipeline.pkl")

Model saved as churn_pipeline.pkl
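Using a saved pipeline later is a matter of loading the file and calling `predict`. The sketch below shows the round trip on synthetic data with a stand-in pipeline and a temporary file; the data, step names, and file name are assumptions for illustration, not the notebook's objects.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (the notebook uses the Telco features instead)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
]).fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a fresh object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(demo_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline preprocesses and predicts in one call
new_rows = rng.normal(size=(5, 3))
same_predictions = bool((reloaded.predict(new_rows) == demo_pipeline.predict(new_rows)).all())
churn_probability = reloaded.predict_proba(new_rows)[:, 1]
print(same_predictions)
```

For the notebook's file, the equivalent would be `joblib.load('churn_pipeline.pkl')` followed by `predict` on a DataFrame with the same 19 feature columns as `X_train`. Pickled models are generally only safe to load with the same scikit-learn version that saved them.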


## Summary & Results

- Built a full pipeline: clean → preprocess → train → tune → save
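- In this run Logistic Regression got the better F1 score for churn on the test set (about 0.60–0.61) compared with Random Forest (about 0.54–0.56); tuning changed both by only about 0.01
- The saved pipeline can be loaded with joblib and used on new customers that have the same feature columns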
- Skills learned: Pipelines, ColumnTransformer, GridSearchCV, joblib
