### **Business Scenario**

A telecom company is facing a serious issue with **customer churn.**
Every time a customer leaves, the company loses revenue and must
spend extra money to acquire a new customer.

Management wants to **identify customers who are likely to leave** so
that proactive retention offers can be provided.

You are hired as a **Data Analyst** to build a solution that can help the
company make this decision.

### **Tasks**
1. Load the dataset and understand the available customer attributes.
2. Identify relevant input features that may influence customer churn.
3. Build a model that can estimate the probability of a customer
   leaving the company.
4. Use the model to classify customers into:
    - Likely to churn
    - Likely to stay
5. Predict the churn outcome for unseen customer records.

In [127]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [128]:
df = pd.read_csv(r"C:\Users\DELL\Downloads\WA_Fn-UseC_-Telco-Customer-Churn (1).csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [129]:
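# Inspect column types, summary statistics, and missing-value counts
df.info()  # info() prints its report itself and returns None
print(df.describe())
print("Missing values:\n", df.isnull().sum())

# Drop rows containing NaN. Blank strings in an object-typed column
# are not NaN, so dropna() leaves such rows in place.
df = df.dropna()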


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
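
Every column reports 7043 non-null values, so `dropna()` removes nothing here. The encoded preview later in the notebook shows `TotalCharges` turned into integer codes, which suggests it was read as text, probably because some entries are blank strings. A minimal sketch of one way to handle that, using a made-up three-row frame (the blank entry is an assumption about how the gaps are stored):

```python
import pandas as pd

# Toy frame standing in for the Telco data (illustrative rows only).
# The " " entry mimics a missing TotalCharges value stored as text.
toy = pd.DataFrame({
    "tenure": [0, 12, 34],
    "MonthlyCharges": [52.55, 70.00, 56.95],
    "TotalCharges": [" ", "840.0", "1889.5"],
})

# Blank strings are not NaN, so isnull() flags nothing at this point
missing_before = int(toy["TotalCharges"].isnull().sum())

# Convert to float; entries that cannot be parsed become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isnull().sum())

# Now dropna() actually removes the incomplete row
clean = toy.dropna(subset=["TotalCharges"])
print(missing_before, missing_after, clean.shape)
```

Running the conversion before `dropna()` keeps `TotalCharges` numeric, so it is scaled as a quantity instead of being label-encoded as a category.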


In [130]:
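le = LabelEncoder()

# Encode every object column as integer codes. This also encodes
# customerID (a unique identifier) and any numeric column stored as
# text, such as TotalCharges in the preview below. The single encoder
# is refit on each column, so `le` only keeps the last column's mapping.
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col])

# Check encoding
df.head()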

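| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5375 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 2505 | 0 |
| 1 | 3962 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 2 | ... | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1466 | 0 |
| 2 | 2564 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 157 | 1 |
| 3 | 5535 | 1 | 0 | 0 | 0 | 45 | 0 | 1 | 0 | 2 | ... | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 42.3 | 1400 | 0 |
| 4 | 6511 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.7 | 925 | 1 |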
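
Label encoding assigns arbitrary integer ranks to categories, and here it also turns `customerID` into a feature. One alternative, sketched below as an assumption and not as the notebook's method, drops the identifier, one-hot encodes the categorical columns, and scales only the numeric ones inside a single scikit-learn `Pipeline`. The six rows and the reduced column set are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows with a reduced set of columns (not the real dataset)
toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "Two year", "Month-to-month", "One year"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Mailed check",
                      "Bank transfer (automatic)", "Electronic check",
                      "Mailed check"],
    "tenure": [1, 34, 2, 45, 2, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Churn": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# The identifier carries no predictive signal, so it is excluded
X = toy.drop(columns=["customerID", "Churn"])
y = (toy["Churn"] == "Yes").astype(int)

categorical = ["Contract", "PaymentMethod"]
numeric = ["tenure", "MonthlyCharges"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)

# A new customer can be passed in raw form; the pipeline encodes it
new_customer = pd.DataFrame({
    "Contract": ["Month-to-month"],
    "PaymentMethod": ["Mailed check"],
    "tenure": [12],
    "MonthlyCharges": [70.0],
})
prob = clf.predict_proba(new_customer)[0, 1]
print(f"Churn probability: {prob:.2f}")
```

Because the encoders live inside the pipeline, the same transformations fitted on the training data are reused at prediction time without any manual encoding step.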


In [131]:
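# Features: every column except the target (customerID is still included)
X = df.drop("Churn", axis=1)
# Target: encoded churn flag (1 = churned, 0 = stayed)
y = df["Churn"]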

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

Train shape: (5634, 20), Test shape: (1409, 20)


In [132]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [133]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

In [134]:
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(cm)
print(f"Correctly identified churn customers: {tp}")
print(f"Non-churn customers misclassified: {fp}")

Accuracy: 0.79
Confusion Matrix:
[[922 113]
 [181 193]]
Correctly identified churn customers: 193
Non-churn customers misclassified: 113
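
The headline metrics follow directly from the four counts in the confusion matrix above. As a worked check, this short sketch recomputes them from the printed matrix only:

```python
import numpy as np

# Confusion matrix as printed above: rows = actual, columns = predicted
cm = np.array([[922, 113],
               [181, 193]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all customers classified correctly
precision = tp / (tp + fp)           # of predicted churners, how many churned
recall = tp / (tp + fn)              # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.79 precision=0.63 recall=0.52 f1=0.57
```

A recall of 0.52 means the model misses 181 of the 374 customers who actually churned, which matters more for a retention campaign than the overall accuracy.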


In [135]:
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.89      0.86      1035
           1       0.63      0.52      0.57       374

    accuracy                           0.79      1409
   macro avg       0.73      0.70      0.72      1409
weighted avg       0.78      0.79      0.78      1409
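
Recall for the churn class is only 0.52 at the default 0.5 cut-off. Since the task asks for a churn probability, one option is to classify from `predict_proba` with a lower threshold, trading precision for recall. The sketch below illustrates the idea on synthetic data from `make_classification`; the 0.3 threshold and the class balance are assumptions chosen for illustration, not values tuned on the Telco data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for churn data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_prob = clf.predict_proba(X_te)[:, 1]   # probability of class 1

results = {}
for threshold in (0.5, 0.3):
    # Flag a customer as "likely to churn" when the probability
    # reaches the chosen threshold
    pred = (churn_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold}: "
          f"precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only add customers to the flagged group, so recall never decreases; whether the extra false alarms are acceptable depends on the cost of a retention offer relative to a lost customer.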



In [138]:
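# Example: new customer data. Values must already be label-encoded and
# the columns must appear in the same order as the training features.
new_customer = pd.DataFrame({
    'customerID': [0],      # encoded ID code; arbitrary for a new customer
    'gender': [1],          # 0 or 1 after encoding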
    'SeniorCitizen': [0],
    'Partner': [0],
    'Dependents': [0],
    'tenure': [12],
    'PhoneService': [1],
    'MultipleLines': [0],
    'InternetService': [2],
    'OnlineSecurity': [0],
    'OnlineBackup': [0],
    'DeviceProtection': [0],
    'TechSupport': [0],
    'StreamingTV': [0],
    'StreamingMovies': [0],
    'Contract': [0],
    'PaperlessBilling': [1],
    'PaymentMethod': [3],
    'MonthlyCharges': [70.0],
    'TotalCharges': [840.0]  # note: the model was trained on label-encoded codes here, not amounts
})

# Apply the scaler fitted on the training data (it scales every column)
new_customer_scaled = scaler.transform(new_customer)

# Predict the churn probability (class 1) and the class label
prob = model.predict_proba(new_customer_scaled)[:,1][0]
pred = model.predict(new_customer_scaled)[0]

label = "Likely to Churn" if pred == 1 else "Likely to Stay"
print(f"Prediction: {label}, Churn Probability: {prob:.2f}")

Prediction: Likely to Churn, Churn Probability: 0.66
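
The example above supplies hand-typed integer codes, which only works if they match the mapping learned during training. A minimal sketch of one way to avoid guessing is to keep a separate fitted `LabelEncoder` per column and reuse it on raw values. The `encoders` dictionary and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy training columns (illustrative values only)
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Bank transfer (automatic)", "Mailed check"],
})

# Fit and store one encoder per column so each mapping is preserved
encoders = {}
encoded = train.copy()
for col in train.columns:
    encoders[col] = LabelEncoder().fit(train[col])
    encoded[col] = encoders[col].transform(train[col])

# Encode a raw new customer with the stored mappings
raw_new = pd.DataFrame({
    "Contract": ["One year"],
    "PaymentMethod": ["Mailed check"],
})
for col in raw_new.columns:
    raw_new[col] = encoders[col].transform(raw_new[col])

# Show which code each category received in this toy example
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))
```

The codes depend on which categories were seen during fitting, so the numbers in this toy example will not necessarily match those of the full dataset.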
