<a href="https://colab.research.google.com/github/meerab-17/End-to-End-ML-Pipeline-with-Scikit-learn-Pipeline-API/blob/main/ml_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem statement: Build a reusable, production-ready machine learning pipeline with scikit-learn's Pipeline API to predict customer churn from the Telco dataset. The pipeline covers preprocessing, model training, hyperparameter tuning, and export with joblib.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib

url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
df = pd.read_csv(url)
print(df.shape)
print(df.columns)
print(df['Churn'].value_counts())

(7043, 21)
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
Churn
No     5174
Yes    1869
Name: count, dtype: int64
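
The counts above show an imbalanced target, which matters when accuracy is used as the tuning metric later on. As a rough sanity check, the sketch below uses only the two printed counts to derive the churn rate and the accuracy of a trivial predictor that always answers "No"; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
n_no, n_yes = 5174, 1869
total = n_no + n_yes

churn_rate = n_yes / total          # fraction of customers who churned
majority_baseline = n_no / total    # accuracy of always predicting "No"

print(f"churn rate: {churn_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model tuned on accuracy should be read against this baseline of roughly 0.73 rather than against 0.50.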


In [None]:
df.drop('customerID', axis=1, inplace=True)

In [None]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
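
To make the cleaning step concrete, here is a minimal sketch on a made-up column. It assumes the non-numeric entries are blank strings (the toy values are invented for illustration) and shows how `errors='coerce'` turns them into `NaN` before the median is filled in by plain assignment.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: the blank string mimics a
# non-numeric entry that to_numeric(errors="coerce") turns into NaN.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# median() skips NaN by default; assigning the filled column back avoids
# the chained inplace call.
median = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median)

print(n_missing, median)
print(toy["TotalCharges"].tolist())
```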


In [None]:
X = df.drop('Churn', axis=1)
y = df['Churn']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
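
The `stratify=y` argument keeps the churn ratio the same in both splits. The sketch below illustrates this on synthetic labels with a similar imbalance (about 26.5% positives, an assumption chosen to resemble the counts printed earlier); the placeholder features and variable names are not from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the churn target;
# the single feature column is a placeholder.
rng = np.random.default_rng(0)
y_toy = (rng.random(2000) < 0.265).astype(int)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification the positive rate is nearly identical in each split.
print(f"overall: {y_toy.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```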

In [None]:
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print("Numeric columns:", numeric_features)
print("Categorical columns:", categorical_features)

Numeric columns: ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']
Categorical columns: ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']


In [None]:
# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
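
As a small illustration of what this preprocessor produces, the sketch below fits a similar `ColumnTransformer` on a made-up four-row frame and then transforms a row whose category was never seen during fitting. The data is invented, and `sparse_threshold=0` is an extra setting used here only to force a dense array for printing; the notebook's preprocessor keeps the default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one numeric and one categorical column.
train = pd.DataFrame({
    "tenure": [1, 12, 24, 60],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
})
# "Five year" never appears in the training rows.
new = pd.DataFrame({"tenure": [6], "Contract": ["Five year"]})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

Xt_train = pre.fit_transform(train)
Xt_new = pre.transform(new)

# One scaled column plus one indicator column per category seen in training.
print(Xt_train.shape)
# The unseen category is encoded as all zeros instead of raising an error.
print(Xt_new)
```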

In [None]:
# Logistic regression pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

In [None]:
# Random forest pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier(random_state=42))
])

In [None]:
logreg_params = {
    'clf__C': [0.1, 1.0, 10]
}

logreg_grid = GridSearchCV(logreg_pipeline, logreg_params, cv=5, scoring='accuracy')
logreg_grid.fit(X_train, y_train)

print("Best Logistic Regression Params:", logreg_grid.best_params_)

Best Logistic Regression Params: {'clf__C': 10}
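
The `clf__C` key follows scikit-learn's `<step name>__<parameter>` convention for addressing parameters inside a pipeline. The sketch below shows the convention on a simplified two-step pipeline (a stand-in, not the notebook's full pipeline).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every parameter of a step is exposed as "<step name>__<parameter>".
clf_keys = sorted(k for k in pipe.get_params() if k.startswith("clf__"))
print(clf_keys[:5])

# set_params uses the same naming, which is what GridSearchCV relies on.
pipe.set_params(clf__C=10)
print(pipe.named_steps["clf"].C)
```

Since the selected value `C=10` sits at the upper edge of the grid, extending the grid with larger values may be worth trying.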


In [None]:
rf_params = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [5, 10, None]
}

rf_grid = GridSearchCV(rf_pipeline, rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_train, y_train)

print("Best Random Forest Params:", rf_grid.best_params_)

Best Random Forest Params: {'clf__max_depth': 10, 'clf__n_estimators': 100}


In [None]:
# Evaluate the tuned random forest on the held-out test set.
# To evaluate the logistic regression instead, use logreg_grid.best_estimator_.
best_model = rf_grid.best_estimator_

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1035
           1       0.66      0.53      0.59       374

    accuracy                           0.80      1409
   macro avg       0.75      0.72      0.73      1409
weighted avg       0.79      0.80      0.80      1409
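
One way to choose between the two searches without touching the test set is to compare their cross-validated scores. The following is a minimal sketch on synthetic data with reduced grids; the names `searches`, `best_name`, and `chosen_model` are hypothetical, and in the notebook the two fitted objects would be `logreg_grid` and `rf_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would use X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small grids keep the sketch fast.
searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10]}, cv=5, scoring="accuracy"),
    "rf": GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                       {"max_depth": [5, None]}, cv=5, scoring="accuracy"),
}
for search in searches.values():
    search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best parameter setting.
best_name = max(searches, key=lambda name: searches[name].best_score_)
chosen_model = searches[best_name].best_estimator_

for name, search in searches.items():
    print(f"{name}: CV accuracy = {search.best_score_:.3f}")
print("selected:", best_name)
```

The test set is then used once, on the selected model only, so the reported metrics stay an unbiased estimate.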



In [None]:
joblib.dump(best_model, 'churn_pipeline.joblib')
print("Model saved as churn_pipeline.joblib")

Model saved as churn_pipeline.joblib


In [None]:
loaded_pipeline = joblib.load('churn_pipeline.joblib')
predictions = loaded_pipeline.predict(X_test.iloc[:5])
print(predictions)

[0 1 0 0 0]
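
A quick way to gain confidence in an exported pipeline is a round-trip check: dump it, load it, and confirm the restored object predicts exactly like the original. The sketch below does this with a simplified pipeline on synthetic data and a temporary file; the file name `demo_pipeline.joblib` and the demo variables are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should give identical labels and probabilities.
same_labels = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
same_proba = np.allclose(pipe.predict_proba(X_demo), restored.predict_proba(X_demo))
print(same_labels, same_proba)
```

Because the whole pipeline is serialized, the scaler and encoder travel with the classifier, so raw feature rows can be passed straight to `predict`. Loading the file generally requires a scikit-learn version compatible with the one used to save it.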


In this task, I built a complete end-to-end machine learning pipeline to predict customer churn using the Telco Customer Churn dataset. The goal was to create a production-ready, reusable pipeline using scikit-learn.

- Data Loading & Cleaning: Dropped the customerID column, mapped Churn to 0/1, converted TotalCharges to numeric, and filled the resulting missing values with the median.

- Preprocessing with Pipelines: Scaled numeric features and one-hot encoded categorical features using ColumnTransformer.

- Model Training: Built two separate pipelines, one with Logistic Regression and one with Random Forest Classifier.

- Hyperparameter Tuning: Used GridSearchCV with 5-fold cross-validation and accuracy scoring to tune C for Logistic Regression, and n_estimators and max_depth for Random Forest.

- Model Evaluation: Evaluated the tuned Random Forest on the held-out test set with a classification report (accuracy 0.80; recall 0.53 for the churn class).

- Model Export: Saved the selected pipeline with joblib so that preprocessing and model can be reloaded together for deployment or reuse.

Conclusion:
The pipeline reaches 0.80 accuracy on the test set, although it identifies only about half of the actual churners (recall 0.53). Its modular preprocessing, model tuning, and export steps keep it reusable: the saved pipeline can be reloaded to make predictions on new, unseen customer data.