## **🧩 Problem Statement**

Customer churn is a critical business problem where customers discontinue a service.
The objective of this project is to predict whether a customer will churn based on their demographic information, account details, and service usage.

This is a binary classification problem, where:

*   `1` → Customer will churn
*   `0` → Customer will not churn

Accurate churn prediction helps businesses take proactive retention actions and reduce revenue loss.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


## Dataset Loading

The dataset is loaded using `pandas`. Basic inspection is performed
to understand structure, data types, and missing values.




In [None]:
df = pd.read_csv("Telco-Customer-Churn.csv")
df.head()

### Data Overview
Checking dataset shape, data types, and missing values.


In [None]:
df.info()

In [None]:
df.shape


In [None]:
df.describe(include='O')

In [None]:
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)

In [None]:
df['TotalCharges'] = df['TotalCharges'].astype('float64')

In [None]:
df.info()

In [None]:
df = df.dropna()
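
The blank-string handling above can also be done in one step. The following is a minimal sketch on a toy frame (`toy`, with made-up values) rather than the real dataset: `pd.to_numeric` with `errors="coerce"` turns unparseable strings such as `" "` into `NaN`, so the affected rows can be counted before they are dropped.

```python
import pandas as pd

# Toy frame standing in for the real dataset; the values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" converts any string that cannot be parsed (including " ") to NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the rows that would be lost before actually dropping them.
n_missing = int(toy["TotalCharges"].isna().sum())
toy_clean = toy.dropna(subset=["TotalCharges"])

print(n_missing, toy_clean.shape)
```

Restricting `dropna` to `subset=["TotalCharges"]` keeps rows that are missing values in other columns, which may or may not be what is wanted here.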

### Class imbalance

In [None]:
X = df.drop(['Churn', 'customerID'], axis=1)
y = df['Churn']

In [None]:
y.value_counts(normalize = True)

Splitting the data into training and test sets.


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.2)
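
With imbalanced classes, a plain random split can leave the train and test sets with slightly different churn rates. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels (`X_toy` and `y_toy` are made up, and the 75/25 ratio is only an assumption for illustration). It is a possible variation, not the split used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (75 "No", 25 "Yes"); not the real data.
y_toy = pd.Series(["No"] * 75 + ["Yes"] * 25)
X_toy = pd.DataFrame({"feature": range(100)})

# stratify=y_toy keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(train_rate, test_rate)
```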

In [None]:
X.describe()

In [None]:
X.columns

In [None]:
X.describe(include='O')

## Feature Engineering

Binary categorical features are encoded using `OrdinalEncoder`.
Multi-class categorical features are encoded using `OneHotEncoder`.
Numerical features are scaled using `StandardScaler`.


In [None]:
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']
multi_cat_cols = [
    'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
    'Contract', 'PaymentMethod',
]
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

In [None]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

In [None]:
binary_transformer = OrdinalEncoder()
multi_cat_transformer = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
num_transformer = StandardScaler()


## Model Pipeline
An end-to-end pipeline is created using `ColumnTransformer` and `Pipeline`
to prevent data leakage.


In [None]:
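# Columns not listed here (e.g. SeniorCitizen) are dropped,
# because ColumnTransformer defaults to remainder='drop'.
preprocessor = ColumnTransformer([
    ('binary', binary_transformer, binary_cols),
    ('multi_cat', multi_cat_transformer, multi_cat_cols),
    ('num', num_transformer, num_cols)
])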

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

In [None]:
set_config(display = 'diagram')

### Training a Logistic Regression model using the pipeline.


In [None]:
steps = [
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
]


In [None]:
pipeline = Pipeline(steps)

In [None]:
pipeline.fit(X_train, y_train)
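
Keeping preprocessing inside the pipeline matters most when the model is cross-validated: each fold then re-fits the scaler on its own training portion, so no statistics from the validation portion leak in. The sketch below illustrates this on synthetic data (`X_toy`, `y_toy`, and `toy_pipeline` are hypothetical stand-ins, not the churn pipeline itself).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# cross_val_score clones and fits the whole pipeline per fold,
# so the scaler only ever sees that fold's training rows.
scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores.mean())
```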


In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
y_pred

## Model Evaluation
Evaluating the Logistic Regression model using a confusion matrix, classification report, and accuracy score.


In [None]:
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, classification_report

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)
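
For reading the confusion matrix, it can help to unpack its four cells by name. The sketch below uses made-up labels and assumes the target is coded as `"Yes"`/`"No"` strings; passing `labels=["No", "Yes"]` fixes the row and column order so that `ravel()` yields `tn, fp, fn, tp`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels; "Yes" means the customer churned (assumed coding).
y_true_toy = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_pred_toy = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=["No", "Yes"]).ravel()

# Recall: share of actual churners that were caught; precision: share of churn alerts that were right.
recall_yes = recall_score(y_true_toy, y_pred_toy, pos_label="Yes")
precision_yes = precision_score(y_true_toy, y_pred_toy, pos_label="Yes")

print(tn, fp, fn, tp, recall_yes, precision_yes)
```

For churn, `fn` (churners the model missed) is usually the costlier error, which is why recall on the churn class gets more attention than overall accuracy.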

### Training a Random Forest classifier using the pipeline.


In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=8,
    min_samples_leaf=20,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
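
The `class_weight='balanced'` setting weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts for more during training. A small sketch of the resulting weights, using synthetic labels with an assumed 75/25 split (`y_toy` is made up):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 75 non-churners and 25 churners (illustrative ratio only).
y_toy = np.array(["No"] * 75 + ["Yes"] * 25)
classes = np.unique(y_toy)

# "balanced" gives n_samples / (n_classes * count) per class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes.tolist(), weights.tolist()))

print(weight_by_class)
```

Here the minority class gets a weight of 100 / (2 * 25) = 2.0 and the majority class 100 / (2 * 75), roughly 0.67.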

In [None]:
steps2 = [
    ('preprocessor', preprocessor),
    ('classifier', rf)
]

In [None]:
pipeline2 = Pipeline(steps2)

In [None]:
pipeline2.fit(X_train, y_train)

In [None]:
y_pred_rf = pipeline2.predict(X_test)

## Model Evaluation
Evaluating the Random Forest model using a classification report and confusion matrix.


In [None]:
print(classification_report(y_test, y_pred_rf))

In [None]:
confusion_matrix(y_test, y_pred_rf)


### Saving the trained pipeline for deployment.


In [None]:
import joblib
joblib.dump(pipeline2, "Churn_pipeline.pkl")

In [None]:
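# Reload the saved Random Forest pipeline. Note that this rebinds `pipeline`,
# which previously held the Logistic Regression pipeline.
import joblib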
pipeline = joblib.load("Churn_pipeline.pkl")
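
A quick way to gain confidence in a saved artifact is a round-trip check: dump, reload, and compare predictions. The sketch below does this with a small hypothetical pipeline (`toy_pipeline`) on synthetic data and a temporary directory, so it does not touch `Churn_pipeline.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_toy, y_toy)

# Save to a temporary location, reload, and compare predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

same_predictions = bool(np.array_equal(toy_pipeline.predict(X_toy), restored.predict(X_toy)))
print(same_predictions)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that produced them, so recording that version alongside the file is worth considering.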


In [None]:
X_new = pd.DataFrame({
    "gender": ["Male"],
    "Partner": ["Yes"],
    "Dependents": ["No"],
    "PhoneService": ["Yes"],
    "PaperlessBilling": ["Yes"],
    "tenure": [25],
    "SeniorCitizen": [0],
    "MultipleLines": ["No"],
    "InternetService": ["Fiber optic"],
    "OnlineSecurity": ["No"],
    "OnlineBackup": ["Yes"],
    "DeviceProtection": ["No"],
    "TechSupport": ["No"],
    "StreamingTV": ["Yes"],
    "StreamingMovies": ["Yes"],
    "Contract": ["Month-to-month"],
    "PaymentMethod": ["Electronic check"],
    "MonthlyCharges": [85.5],
    "TotalCharges": [900.0]
})

## Threshold Optimization
`predict_proba` returns class probabilities, to which a custom decision threshold can be applied instead of the default 0.5.


In [None]:
y_prob = pipeline.predict_proba(X_new)


In [None]:
y_prob_churn = y_prob[:, 1]
print(y_prob_churn)
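
Turning the churn probability into a decision only needs a comparison against the chosen threshold. A minimal sketch with made-up probabilities and a hypothetical threshold of 0.35 (not a value taken from this project) is shown below. The columns of `predict_proba` follow the order of the fitted classifier's `classes_`, so column 1 is the second class listed there.

```python
import numpy as np

# Made-up probabilities in the layout predict_proba returns:
# one row per customer, columns [P(no churn), P(churn)].
y_prob_toy = np.array([
    [0.80, 0.20],
    [0.62, 0.38],
    [0.45, 0.55],
])

threshold = 0.35  # hypothetical value, chosen only for illustration

churn_prob = y_prob_toy[:, 1]
# Flag a customer as churn (1) when the probability reaches the threshold.
churn_flag = (churn_prob >= threshold).astype(int)

print(churn_flag)
```

With the default 0.5 threshold only the third customer would be flagged; at 0.35 the second one is flagged as well.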


## Final Results
The optimized threshold improved recall for churned customers,
making the model more suitable for business use.
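
One way to check a recall gain like this is to score the same held-out probabilities at two thresholds. The sketch below does so on synthetic data with a plain `LogisticRegression`; the data, the model, and the thresholds 0.5 and 0.3 are all assumptions for illustration, not the values used in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (1 = churn).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] + rng.normal(size=1000) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
model = LogisticRegression().fit(X_tr, y_tr)
churn_prob = model.predict_proba(X_te)[:, 1]

# Score the same probabilities at the default and at a lower threshold.
results = {}
for threshold in (0.5, 0.3):
    y_hat = (churn_prob >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, y_hat),
        precision_score(y_te, y_hat, zero_division=0),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only add predicted churners, so recall never decreases; precision typically falls in exchange, and the right balance depends on the relative cost of a missed churner versus an unnecessary retention offer.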
