Task 2: End-to-End ML Pipeline with Scikit-learn

Objective
Build a reusable and production-ready machine learning pipeline to predict customer churn using the Telco Churn Dataset.


STEP:-
1. Load Dataset
- Import the Telco Churn data.
- Separate features (`X`) and target (`y`).
- Encode the target variable (Yes → 1, No → 0).
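`Series.map` returns NaN for any label that is not a key of the mapping, so a stray label such as `yes ` would silently become a missing target. A minimal sketch of a guard, assuming a made-up toy frame (the `toy` data and its malformed label are illustrative, not values taken from the Telco file):

```python
import pandas as pd

# Toy stand-in for the Telco data; the last label is deliberately malformed.
toy = pd.DataFrame({
    "Tenure": [1, 34, 2, 45],
    "Churn": ["No", "No", "Yes", "yes "],
})

mapping = {"Yes": 1, "No": 0}

# Mapping the raw labels leaves the malformed one as NaN.
raw_unmapped = int(toy["Churn"].map(mapping).isna().sum())

# Normalise whitespace and case first, then map and verify nothing was lost.
clean_labels = toy["Churn"].str.strip().str.capitalize()
y_toy = clean_labels.map(mapping)
clean_unmapped = int(y_toy.isna().sum())

print("Unmapped before cleaning:", raw_unmapped)
print("Unmapped after cleaning:", clean_unmapped)
```

Checking the count of unmapped labels right after encoding is cheap and catches this kind of problem before it reaches model training.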

In [1]:
%pip install pandas scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd

# Load the dataset from the notebook's working directory
data = pd.read_csv("Telco-Customer-Churn.csv")

# Separate features and target; encode Churn as 1 (Yes) / 0 (No)
X = data.drop("Churn", axis=1)
y = data["Churn"].map({"Yes": 1, "No": 0})

print("Dataset loaded successfully!")
print(X.head())
print(y.head())



Dataset loaded successfully!
   LoyaltyID Customer ID Senior Citizen Partner Dependents  Tenure  \
0     318537  7590-VHVEG             No     Yes         No       1   
1     152148  5575-GNVDE             No      No         No      34   
2     326527  3668-QPYBK             No      No         No       2   
3     845894  7795-CFOCW             No      No         No      45   
4     503388  9237-HQITU             No      No         No       2   

  Phone Service    Multiple Lines Internet Service Online Security  \
0            No  No phone service              DSL              No   
1           Yes                No              DSL             Yes   
2           Yes                No              DSL             Yes   
3            No  No phone service              DSL             Yes   
4           Yes                No      Fiber optic              No   

  Online Backup Device Protection Tech Support Streaming TV Streaming Movies  \
0           Yes                No           No   

STEP:-
2. Identify Features
- Separate **numeric** and **categorical** features for preprocessing.


In [3]:
# Numeric columns are scaled; every other column is treated as categorical.
# Selecting on 'number' rather than 'object' avoids the pandas 3 change in how
# string columns are matched by select_dtypes.
numeric_features = X.select_dtypes(include=['number']).columns.tolist()
categorical_features = X.select_dtypes(exclude=['number']).columns.tolist()
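Selecting features by dtype only works if each column has the dtype you expect. The last cell of this notebook strips whitespace from `Total Charges` and coerces it to numeric, which suggests the column can arrive as text. A minimal sketch of that situation on a made-up toy frame (the values below are assumptions for illustration):

```python
import pandas as pd

# Toy frame: one blank entry is enough for pandas to read the column as text.
toy = pd.DataFrame({
    "Tenure": [1, 34, 0],
    "Monthly Charges": [29.85, 56.95, 52.55],
    "Total Charges": ["29.85", "1889.5", " "],
})

# Before cleaning, Total Charges is not detected as numeric.
numeric_before = toy.select_dtypes(include="number").columns.tolist()

# Strip whitespace and coerce; unparseable entries become NaN.
toy["Total Charges"] = pd.to_numeric(toy["Total Charges"].str.strip(), errors="coerce")

numeric_after = toy.select_dtypes(include="number").columns.tolist()
categorical_after = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric before:", numeric_before)
print("Numeric after:", numeric_after)
print("Missing Total Charges:", int(toy["Total Charges"].isna().sum()))
```

Without the coercion, a text-typed numeric column would be one-hot encoded value by value instead of scaled.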


STEP:-
3. Preprocessing Pipelines
- Scale numeric features using `StandardScaler`.
- Encode categorical features using `OneHotEncoder`.
- Combine both using `ColumnTransformer`.


In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline([
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])


STEP:-
4. Build Full Pipelines
- Create pipelines for **Logistic Regression** and **Random Forest**.
- Include preprocessing and model training in the pipeline.


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression pipeline
logreg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500))
])

# Random Forest pipeline
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # fixed seed for reproducible results
])
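One way to choose between the two model families is to score each pipeline with the same cross-validation splits. The sketch below does this on synthetic data; the generated frame, the made-up churn rule, and the helper name `make_pipeline_for` are all assumptions for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "Tenure": rng.integers(0, 72, size=n),
    "Monthly Charges": rng.uniform(20, 120, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
# Synthetic target (made-up rule): short-tenure customers churn more often.
toy_y = (toy_X["Tenure"] + rng.normal(0, 10, size=n) < 24).astype(int)


def make_pipeline_for(classifier):
    """Hypothetical helper: build a fresh preprocessor for each model."""
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), ["Tenure", "Monthly Charges"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])
    return Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])


candidates = {
    "logreg": LogisticRegression(max_iter=500),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold accuracy per model; preprocessing is refit inside every fold.
cv_scores = {
    name: cross_val_score(make_pipeline_for(clf), toy_X, toy_y, cv=5, scoring="accuracy").mean()
    for name, clf in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Building a new preprocessor per pipeline is a design choice in this sketch: it keeps the two pipelines from sharing a fitted transformer object.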

STEP:-
5. Hyperparameter Tuning
- Use `GridSearchCV` to find the best hyperparameters for Random Forest.
- Use cross-validation to evaluate performance.


In [6]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [5, 10, None]
}

grid_search = GridSearchCV(rf_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print("Best params:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)


Best params: {'classifier__max_depth': None, 'classifier__n_estimators': 200}
Best score: 0.7928433890896186
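The best cross-validation score is computed on the same rows that were used to pick the hyperparameters, so it tends to be slightly optimistic. A common alternative is to hold out a test split before searching. The sketch below illustrates this on synthetic data; the dataset, the reduced grid, and the `roc_auc` scoring choice are assumptions for illustration and do not reproduce the run above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary problem (arbitrary 75/25 class weights).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Keep 20% of the rows untouched until the search has finished.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])
demo_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [5, None],
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="roc_auc")
demo_search.fit(X_train, y_train)

# score() applies the same metric as the search (ROC AUC) to the held-out rows.
test_auc = demo_search.score(X_test, y_test)
print("Best params:", demo_search.best_params_)
print("CV ROC AUC:", round(demo_search.best_score_, 3))
print("Held-out ROC AUC:", round(test_auc, 3))
```

When one class is much rarer than the other, accuracy can look good for a model that rarely predicts the rare class, which is why a ranking metric such as ROC AUC is used in this sketch.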


STEP:-
6. Export Pipeline
- Save the trained pipeline using `joblib` for production reuse.


In [7]:
import joblib

joblib.dump(grid_search.best_estimator_, "telco_churn_pipeline.pkl")
print("Pipeline saved as 'telco_churn_pipeline.pkl'")


Pipeline saved as 'telco_churn_pipeline.pkl'
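A quick way to gain confidence in an exported file is a round trip: dump the fitted pipeline, load it back, and check that both objects give identical predictions. A minimal sketch on synthetic data, writing to a temporary directory (the file name `demo_pipeline.pkl` and the toy model are assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

fitted = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=500)),
]).fit(X_demo, y_demo)

# Save and reload inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print("Round trip OK:", same_predictions)
```

Files written by `joblib` are pickles, so they should be loaded with the same scikit-learn version that produced them and only from trusted sources.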


STEP:-
7. Load and Use Saved Pipeline
- Load the saved pipeline for making predictions on new customer data.
- Output:
  - `prediction` → 0 (No churn) / 1 (Churn)
  - `prediction_proba` → probability of churn

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# 1. Load data
data = pd.read_csv("Telco-Customer-Churn.csv")

# 2. Set up the target (y): encode Churn as 1 (Yes) / 0 (No)
y = data["Churn"].map({"Yes": 1, "No": 0})

# 3. Set up the features (X)
# Drop the target and the ID columns; errors='ignore' skips any column that is absent.
# Note: the pipeline saved in step 6 was fit with the ID columns included, so this
# cell refits a pipeline on the cleaned features instead of loading that file.
X = data.drop(columns=['Churn', 'LoyaltyID', 'Customer ID'], errors='ignore')

# 4. Define the numeric features and coerce them to numbers
numeric_features = ['Tenure', 'Monthly Charges', 'Total Charges']

for col in numeric_features:
    if col in X.columns:
        # Convert to string first to strip spaces, then to numeric
        X[col] = pd.to_numeric(X[col].astype(str).str.strip(), errors='coerce')

# Drop rows with NaN in numeric columns (common in Total Charges)
X = X.dropna(subset=numeric_features)
y = y.loc[X.index]  # Keep y aligned with X

# 5. Define Categorical Features
categorical_features = [col for col in X.columns if col not in numeric_features]

# 6. Build the Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# 7. Fit the model
pipeline.fit(X, y)

# 8. Create new customer data (column names and value formats must match X)
new_data = pd.DataFrame([{
    'Senior Citizen': 'No',  # the dataset stores this column as Yes/No, not 0/1
    'Partner': 'Yes',
    'Dependents': 'No',
    'Tenure': 12,
    'Phone Service': 'Yes',
    'Multiple Lines': 'No',
    'Internet Service': 'Fiber optic',
    'Online Security': 'No',
    'Online Backup': 'Yes',
    'Device Protection': 'No',
    'Tech Support': 'No',
    'Streaming TV': 'Yes',
    'Streaming Movies': 'No',
    'Contract': 'Month-to-month',
    'Paperless Billing': 'Yes',
    'Payment Method': 'Electronic check',
    'Monthly Charges': 85.5,
    'Total Charges': 876.5
}])

# Reorder the columns to match the training data.
# Selecting by X.columns raises a KeyError if new_data lacks any training column.
new_data_filtered = new_data[X.columns]

# 9. Predict
prediction = pipeline.predict(new_data_filtered)
prediction_proba = pipeline.predict_proba(new_data_filtered)

print("--- Result ---")
print("Predicted Churn (0=No, 1=Yes):", prediction[0])
print("Prediction probability:", prediction_proba[0])
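For repeated use, the prediction step can be wrapped in a small function that checks the input columns first and returns the two outputs named in this step, `prediction` and `prediction_proba`. The sketch below is one possible shape for such a helper; the function name `predict_churn`, the toy training rows, and the tiny forest are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def predict_churn(fitted_pipeline, records, expected_columns):
    """Hypothetical helper: validate the columns, then predict label and churn probability."""
    frame = pd.DataFrame(records)
    missing = [col for col in expected_columns if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing columns: {missing}")
    frame = frame[list(expected_columns)]  # same column order as in training

    # Locate the probability column that belongs to the churn class (label 1).
    churn_index = list(fitted_pipeline.classes_).index(1)
    return pd.DataFrame({
        "prediction": fitted_pipeline.predict(frame),
        "prediction_proba": fitted_pipeline.predict_proba(frame)[:, churn_index],
    })


# Toy training data standing in for the cleaned Telco features.
train_X = pd.DataFrame({
    "Tenure": [1, 34, 2, 45, 8, 60],
    "Contract": ["Month-to-month", "Two year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
})
train_y = [1, 0, 1, 0, 1, 0]

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["Tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(train_X, train_y)

result = predict_churn(
    demo_pipeline,
    [{"Contract": "Month-to-month", "Tenure": 3}],
    train_X.columns,
)
print(result)
```

Checking the columns up front produces an error that names the missing fields, which is easier to act on than the generic `ValueError` a `ColumnTransformer` raises when a configured column is absent from the input frame.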

Columns in dataset: ['LoyaltyID', 'Customer ID', 'Senior Citizen', 'Partner', 'Dependents', 'Tenure', 'Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing', 'Payment Method', 'Monthly Charges', 'Total Charges']

