# 🧠 Customer Churn Detection – MLOps Starter Notebook

This notebook explores a sample customer churn dataset and trains two baseline classifiers (logistic regression and a random forest) to predict whether a customer is likely to leave.

**Steps:**
- Load and explore the data
- Preprocess and clean it
- Train and evaluate churn prediction models
- Save the trained model for deployment


In [43]:
# === 📦 1. Import Libraries ===
# We'll use pandas for working with tabular data
import pandas as pd

# === 📂 2. Load the CSV file ===
# This loads the customer churn dataset into a DataFrame
df = pd.read_csv('../data/customer_churn.csv')  # Adjust path if needed

# === 🔍 3. Preview the first few rows ===
# Always start by checking what kind of data you're working with
df.head()


|   | CustomerID | Gender | Age | Tenure | Balance | NumOfProducts | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9809925 | Female | 27 | 10 | 139704.0 | 2 | 0 | 125102.33 | 1 |
| 1 | 36123788 | Female | 25 | 1 | 168941.23 | 1 | 0 | 138120.19 | 0 |
| 2 | 36489822 | Male | 36 | 4 | 115364.73 | 3 | 0 | 112234.93 | 0 |
| 3 | 24657143 | Female | 56 | 5 | 246867.28 | 3 | 0 | 99995.04 | 0 |
| 4 | 18416740 | Male | 40 | 4 | 240876.52 | 4 | 0 | 121838.49 | 0 |


In [44]:
# === 📐 4. Check data shape ===
print("Shape of dataset:", df.shape)  # rows and columns

# === 🔎 5. Data types and nulls ===
print("\nDataFrame Info:")
df.info()  # prints column types and non-null counts itself; it returns None, so it is not wrapped in print()

# === ⚠️ 6. Missing values ===
print("\nMissing values per column:")
print(df.isnull().sum())  # quick check for nulls

# === 📊 7. Churn value counts ===
print("\nChurn distribution (Exited column):")
print(df['Exited'].value_counts())  # 0 = stayed, 1 = churned


Shape of dataset: (1000, 9)

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CustomerID       1000 non-null   int64
 1   Gender           1000 non-null   object
 2   Age              1000 non-null   int64
 3   Tenure           1000 non-null   int64
 4   Balance          1000 non-null   float64
 5   NumOfProducts    1000 non-null   int64
 6   IsActiveMember   1000 non-null   int64
 7   EstimatedSalary  1000 non-null   float64
 8   Exited           1000 non-null   int64
dtypes: float64(2), int64(6), object(1)
memory usage: 70.4+ KB

Missing values per column:
CustomerID         0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Churn distribution (Exited column):
Exited
1 
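
Raw counts are often easier to judge as proportions, since the share of churned customers decides whether plain accuracy is a fair metric. The sketch below shows one way to get that share with `value_counts(normalize=True)`. The `demo` frame and its eight rows are made up for illustration and are not taken from the churn dataset.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; only the Exited column matters here
demo = pd.DataFrame({"Exited": [1, 0, 0, 1, 0, 0, 0, 1]})

# normalize=True returns fractions of the total instead of raw counts
churn_share = demo["Exited"].value_counts(normalize=True).sort_index()
print(churn_share)

# Because Exited is coded 0/1, its mean is the churn rate
churn_rate = demo["Exited"].mean()
print(f"Churn rate: {churn_rate:.1%}")
```

On the real data, the same two lines applied to `df['Exited']` would show how balanced the two classes are.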

In [45]:
from sklearn.model_selection import train_test_split

# === 🧹 1. Drop ID column ===
df = df.drop('CustomerID', axis=1)

# === 🔄 2. Convert Gender to 0/1 ===
# We'll map 'Female' to 0 and 'Male' to 1
df['Gender'] = df['Gender'].map({'Female': 0, 'Male': 1})

# === 📁 3. Split into features (X) and target (y) ===
X = df.drop('Exited', axis=1)
y = df['Exited']

# === 🧪 4. Split into training and testing sets ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preview shape of splits
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


Train shape: (800, 7)
Test shape: (200, 7)
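
The split above is purely random, so the churn rate in the train and test sets can drift apart by chance. A possible refinement, sketched here on synthetic data, is to pass `stratify=` so that both splits keep the same class ratio. The names `X_demo` and `y_demo` and the 30% positive rate are assumptions for illustration, not properties of the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 30% positives
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 300 + [0] * 700, name="Exited")

# stratify=y_demo keeps the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train churn rate:", y_tr.mean())
print("Test churn rate:", y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call. Doing so changes which rows land in each split, so the metrics reported in later cells would change as well.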


In [46]:
# === 🤖 Train First Model: Logistic Regression ===
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 📌 Create the model
model = LogisticRegression(max_iter=1000)

# 📌 Train (fit) the model to our training data
model.fit(X_train, y_train)

# 📌 Predict on the test set
y_pred = model.predict(X_test)

# 📌 Check how it did
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n🧾 Classification Report:")
print(classification_report(y_test, y_pred))

print("\n🧩 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


✅ Accuracy: 0.485

🧾 Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.35      0.41       104
           1       0.47      0.64      0.54        96

    accuracy                           0.48       200
   macro avg       0.49      0.49      0.48       200
weighted avg       0.49      0.48      0.47       200


🧩 Confusion Matrix:
[[36 68]
 [35 61]]
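
An accuracy of 0.485 is at chance level: with 104 of the 200 test rows in class 0, always predicting "stayed" would score 104/200 = 0.52. Two checks that may help put such a number in context are a majority-class baseline and a version of the model with scaled inputs, since columns like `Balance` and `EstimatedSalary` are orders of magnitude larger than the binary flags. The sketch below runs both on synthetic data from `make_classification`; the data, variable names, and parameters are illustrative assumptions, and nothing here implies that scaling would rescue the churn model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Standardize features, then fit logistic regression, as one pipeline
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
scaled_acc = accuracy_score(y_te, scaled_lr.predict(X_te))

print(f"Baseline accuracy:  {baseline_acc:.3f}")
print(f"Scaled LR accuracy: {scaled_acc:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training rows only, which avoids leaking test-set information into the model.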


In [47]:
# === 🌲 Try a second model: Random Forest ===
from sklearn.ensemble import RandomForestClassifier

# 📌 Create the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# 📌 Train the model
rf_model.fit(X_train, y_train)

# 📌 Predict on test data
rf_preds = rf_model.predict(X_test)

# 📌 Evaluate performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("✅ Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
print("\n🧾 Classification Report:")
print(classification_report(y_test, rf_preds))

print("\n🧩 Confusion Matrix:")
print(confusion_matrix(y_test, rf_preds))


✅ Random Forest Accuracy: 0.465

🧾 Classification Report:
              precision    recall  f1-score   support

           0       0.48      0.42      0.45       104
           1       0.45      0.51      0.48        96

    accuracy                           0.47       200
   macro avg       0.47      0.47      0.46       200
weighted avg       0.47      0.47      0.46       200


🧩 Confusion Matrix:
[[44 60]
 [47 49]]
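
On a 200-row test set, the gap between 0.465 here and 0.485 for logistic regression is only four predictions, so a single split cannot rank the two models reliably. One common way to get a steadier estimate is k-fold cross-validation, sketched below on synthetic data. The dataset, `rf_demo`, and the choice of five folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical), not the churn dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)

# Five stratified folds: every row is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# One accuracy score per fold
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much of a difference between two models is just noise.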


In [48]:
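# 💾 Now save the model
from pathlib import Path
import joblib

# Create the output directory first; joblib.dump fails if it does not exist
Path("../model").mkdir(parents=True, exist_ok=True)
joblib.dump(rf_model, "../model/churn_model.pkl")

print("✅ Model saved successfully!")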



✅ Model saved successfully!
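
A saved model is only useful if it can be loaded back and produces the same predictions. The following sketch shows a save-and-reload round trip with `joblib.dump` and `joblib.load`. It uses a small model trained on synthetic data and writes to a temporary directory, so the path, data, and `rf_demo` name are illustrative assumptions rather than the notebook's actual artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (hypothetical), not the churn model
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "churn_model.pkl"
    joblib.dump(rf_demo, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(rf_demo.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

When loading the real `churn_model.pkl` in a serving environment, the input must have the same feature columns in the same order as `X_train`, and the scikit-learn version should match the one used for training, since pickled estimators are not guaranteed to load across versions.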
