# Applied Machine Learning - Class 1

Welcome to the first class of Applied Machine Learning! In this session, we will get familiar with loading, cleaning, and visualizing data, and with building predictive models. We'll use a **Telco Customer Churn dataset** to predict whether a customer is likely to stop using a company's services.

**Business Use Case:** Customer churn (or attrition) is a major concern for many businesses, especially those with subscription models (telecom, streaming, SaaS, etc.). Losing customers means losing revenue. Identifying customers at risk of churning allows businesses to take proactive steps to retain them, such as offering special promotions or addressing their concerns. Our goal is to classify customers as 'Will Not Churn' (0) or 'Will Churn' (1).

Let's get started! 📉➡️📈

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import plotly.figure_factory as ff

# Load the Telco Customer Churn dataset
# Make sure 'class1_dataset.csv' is loaded in the sidebar
try:
    df = pd.read_csv('class1_dataset.csv')
except FileNotFoundError:
    print("ERROR: 'class1_dataset.csv' not found. Please download it and place it in the correct directory.")
    # df = pd.DataFrame() # Placeholder to allow notebook to run further cells, but they won't be meaningful
    raise # Re-raise the error

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()

## Basic Data Exploration 📊

Let's look at the structure of our customer data and check for any immediate issues like missing values or incorrect data types. The features describe customer demographics, account information, and services they use.

In [None]:
# Check the structure of the dataset
print("\nDataset Information:")
df.info()

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

### Data Cleaning and Initial Transformation
Some columns might need adjustments:
1.  `customerID`: This is just an identifier and won't be useful for modeling, so we can drop it.
2.  `TotalCharges`: pandas reads this column as `object` (string) dtype when some entries are blank strings, which typically belong to new customers with no charges yet. We need to convert it to a numeric type and handle the resulting missing values (e.g., by filling with 0, or with the median or mean, for those few cases).
3.  `Churn`: Our target variable is 'Yes'/'No'. We'll convert this to 1/0 for modeling.
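
To make the `TotalCharges` step concrete, here is a minimal sketch on a made-up four-value series (the values are illustrative assumptions, not rows from the dataset). It shows how `errors='coerce'` turns a blank string into `NaN`, and what the two fill strategies mentioned above produce.

```python
import pandas as pd

# Hypothetical stand-in for the raw TotalCharges column; the single blank
# string plays the role of a customer who has not been billed yet.
raw_total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' converts anything that cannot be parsed as a number into NaN.
numeric_total_charges = pd.to_numeric(raw_total_charges, errors="coerce")
n_missing = int(numeric_total_charges.isna().sum())

# Two possible fills: the median of the observed values, or 0.
filled_with_median = numeric_total_charges.fillna(numeric_total_charges.median())
filled_with_zero = numeric_total_charges.fillna(0)

print(f"dtype: {numeric_total_charges.dtype}, missing values: {n_missing}")
print("Median fill:", filled_with_median.tolist())
print("Zero fill:  ", filled_with_zero.tolist())
```

Which fill is more appropriate depends on what the blank actually means in the data; inspecting the affected rows first is usually worthwhile.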

In [None]:
# Drop customerID as it's not a predictive feature
df = df.drop('customerID', axis=1)

# Convert 'TotalCharges' to numeric. Errors='coerce' will turn non-numeric values (like spaces) into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check for NaNs created in TotalCharges
print(f"\nNaNs in TotalCharges after conversion: {df['TotalCharges'].isnull().sum()}")

# Handle missing TotalCharges. These rows are often new customers (tenure 0),
# for whom 0 would also be a sensible fill value; here we impute with the median.
# Uncomment to inspect the affected rows:
# print(df[df['TotalCharges'].isnull()][['tenure', 'MonthlyCharges', 'TotalCharges']])
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median()) # Impute with median

# Convert target variable 'Churn' to binary (0/1)
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})

print("\nCleaned dataset info:")
df.info()
df.head()

## Descriptive Statistics

Summary statistics for numerical features help us understand their distributions (e.g., average tenure, monthly charges).

In [None]:
# Get summary statistics for numerical columns
print("\nDescriptive Statistics (Numerical Features):")
df.describe()

## Data Visualizations: Understanding Customer Behavior

Visualizing data helps identify patterns related to churn. We'll look at:
* The overall churn rate.
* How churn varies with features like contract type, tenure, and monthly charges.

In [None]:
# Plot a pie chart for churn distribution
churn_counts = df['Churn'].value_counts()
fig_churn_pie = px.pie(values=churn_counts.values,
                       names=churn_counts.index.map({0: 'No Churn', 1: 'Churn'}),
                       title='Customer Churn Distribution',
                       hole=0.3)
fig_churn_pie.update_traces(textinfo='percent+label')
fig_churn_pie.show()
print(f"Customers who did not churn: {churn_counts.get(0, 0)}")
print(f"Customers who churned: {churn_counts.get(1, 0)}")
print(f"Churn rate: {churn_counts.get(1, 0) / len(df) * 100:.2f}%\n")

In [None]:
# Example: Churn by Contract type
fig_contract_churn = px.histogram(df, x='Contract', color='Churn',
                                  barmode='group', text_auto=True,
                                  title='Churn Distribution by Contract Type',
                                  labels={'Churn': 'Churn Status (0: No, 1: Yes)'})
fig_contract_churn.show()

# Example: Tenure distribution by Churn
fig_tenure_churn = px.histogram(df, x='tenure', color='Churn', marginal='box',
                                title='Tenure Distribution by Churn Status',
                                labels={'Churn': 'Churn Status (0: No, 1: Yes)'})
fig_tenure_churn.show()

# Example: MonthlyCharges distribution by Churn
fig_monthly_churn = px.histogram(df, x='MonthlyCharges', color='Churn', marginal='box',
                                 title='Monthly Charges Distribution by Churn Status',
                                 labels={'Churn': 'Churn Status (0: No, 1: Yes)'})
fig_monthly_churn.show()

print("Visualizations like these help understand factors correlated with churn. For example, customers on month-to-month contracts tend to churn more, and customers with very low or very high tenure might show different churn behaviors.")
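
A numeric summary can complement the plots. The sketch below uses a small hypothetical sample (only the column names `Contract` and `Churn` come from the notebook) to show that, once `Churn` is coded as 0/1, a grouped mean gives the churn rate per category.

```python
import pandas as pd

# Hypothetical mini-sample; the rows are made up for illustration.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year", "Two year"],
    "Churn": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Because Churn is 0/1, the mean within each group is that group's churn rate.
churn_rate_by_contract = (
    toy.groupby("Contract")["Churn"].mean().sort_values(ascending=False)
)
print(churn_rate_by_contract)
```

On the real data, the same expression with `df` in place of `toy` should give the rates behind the contract-type histogram.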

## Preparing the Data for Modeling 🛠️

Machine learning models require numerical input. We need to:
1.  **Identify Categorical and Numerical Features:** Separate columns by data type.
2.  **One-Hot Encode Categorical Features:** Convert categorical variables into a numerical format that models can understand (e.g., 'Contract' type 'Month-to-month' becomes a set of binary columns).
3.  **Scale Numerical Features:** Bring numerical features (like 'tenure', 'MonthlyCharges') to a similar scale using `StandardScaler`. This helps algorithms that are sensitive to feature magnitudes (e.g., Logistic Regression, SVC, Neural Networks).
4.  **Define Features (X) and Target (y):** 'Churn' is our target.
5.  **Split Data:** Divide into training and testing sets.
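
Before applying these steps to the full dataset, it can help to see what encoding and scaling do to a handful of rows. The following is a minimal sketch on a hypothetical four-row frame (the values are assumptions; only the column names mirror the dataset).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame for illustration.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure": [1, 12, 48, 3],
})

# drop='first' removes the first category in sorted order ('Month-to-month'),
# so three contract types become two binary columns.
encoder = OneHotEncoder(drop="first")
contract_encoded = encoder.fit_transform(toy[["Contract"]]).toarray()

# StandardScaler subtracts the column mean and divides by the standard deviation.
scaler = StandardScaler()
tenure_scaled = scaler.fit_transform(toy[["tenure"]])

print("Categories seen:", encoder.categories_[0].tolist())
print("Encoded Contract:\n", contract_encoded)
print("Scaled tenure:", tenure_scaled.ravel().round(3))
```

A row for the dropped category is encoded as all zeros, which is how the model can still tell it apart from the other two.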

In [None]:
# Define features (X) and target variable (y)
X = df.drop('Churn', axis=1)
y = df['Churn']

# Identify categorical and numerical columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

print(f"Categorical features: {list(categorical_features)}")
print(f"Numerical features: {list(numerical_features)}")

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = StandardScaler()
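# drop='first' removes one column per feature to avoid multicollinearity.
# Note: combining drop with handle_unknown='ignore' requires scikit-learn >= 1.0.
categorical_transformer = OneHotEncoder(handle_unknown='ignore', drop='first')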

# Create a column transformer to apply transformations to the correct columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough' # In case there are columns not specified (shouldn't be here)
)

# Split the dataset into training and testing sets
# We stratify on y so that the train and test sets have similar class proportions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply preprocessing: fit on training data, transform both training and testing data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Get feature names after one-hot encoding for better interpretability later (optional but good)
try:
    feature_names_out = preprocessor.get_feature_names_out()
except AttributeError:
    # For older scikit-learn versions, a bit more manual work might be needed if you want exact names
    # For now, we'll proceed without them for simplicity if get_feature_names_out is not available
    feature_names_out = None
    print("Note: preprocessor.get_feature_names_out() not available. Using processed data without explicit new feature names.")

print("\nX_train_processed shape:", X_train_processed.shape)
print("X_test_processed shape:", X_test_processed.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# If you used a sparse matrix from OneHotEncoder and want to convert to dense for some models/operations:
if hasattr(X_train_processed, "toarray"):
    X_train_processed = X_train_processed.toarray()
    X_test_processed = X_test_processed.toarray()
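
The `Pipeline` class is imported at the top of the notebook, but the cells here apply the preprocessor by hand. As an alternative, the preprocessing and a model can be bundled into one object so that fitting and predicting always apply the same transformations. The sketch below assumes a tiny hypothetical dataset and illustrative step names; it is not the notebook's actual workflow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data for illustration only.
toy_X = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month",
                 "Two year", "Month-to-month", "One year", "Two year"],
    "tenure": [1, 60, 24, 3, 72, 2, 18, 48],
})
toy_y = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# The pipeline fits the preprocessor and the model together, and re-applies
# the fitted preprocessing automatically whenever predict() is called.
churn_pipeline = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

churn_pipeline.fit(toy_X, toy_y)
toy_predictions = churn_pipeline.predict(toy_X)
print(toy_predictions)
```

With this pattern, the raw `X_train` and `X_test` frames would be passed to `fit` and `predict` directly, which also makes cross-validation less error-prone.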


## Machine Learning Models for Churn Prediction

We will now train several classification models to predict customer churn. For each model, we will:
1.  Train the model on the processed `X_train_processed` and `y_train` data.
2.  Make predictions on the processed `X_test_processed` data.
3.  Evaluate its performance using:
    * **Accuracy:** Overall correctness of predictions.
    * **Confusion Matrix:** A table showing True Positives (correctly predicted churn), True Negatives (correctly predicted no churn), False Positives (predicted churn, but the customer stayed), and False Negatives (predicted no churn, but the customer left).

For this introductory class, we are focusing on accuracy and the confusion matrix. In more advanced sessions, you'd explore other metrics like precision, recall, F1-score, and AUC, especially when dealing with datasets where one class is much rarer than the other (imbalanced datasets) or when the costs of different types of errors vary significantly.
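
To see why accuracy alone can mislead on imbalanced data, consider a baseline that always predicts the majority class. The sketch below uses hypothetical labels (90 customers who stay, 10 who churn) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 customers stay (0), 10 churn (1).
y_true = np.array([0] * 90 + [1] * 10)
X_placeholder = np.zeros((100, 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # share of churners caught

print(f"Accuracy: {baseline_accuracy:.2f}")
print(f"Recall on churners: {baseline_recall:.2f}")
```

The baseline scores 90% accuracy while identifying no churners at all, so any real model should be compared against the majority-class rate rather than against 50%.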

### 1. Logistic Regression

A good baseline linear model for binary classification. It estimates the probability of a customer churning.
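
For a binary problem, the probability comes from passing a linear score through the logistic (sigmoid) function. The sketch below checks this on synthetic data from `make_classification` (an assumption for illustration, not the churn dataset) by recomputing `predict_proba` by hand from the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only for illustration.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Linear score: w . x + b
linear_score = X_syn @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps the score to a probability between 0 and 1.
manual_probability = 1.0 / (1.0 + np.exp(-linear_score))
sklearn_probability = model.predict_proba(X_syn)[:, 1]

print("Max difference:", np.abs(manual_probability - sklearn_probability).max())
```

`predict` then labels a sample as class 1 when this probability exceeds 0.5, which is the same as the linear score being positive.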

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42, solver='liblinear') # liblinear is a reasonable solver for small binary problems
lr.fit(X_train_processed, y_train)
y_pred_lr = lr.predict(X_test_processed)

print("--- Logistic Regression --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr) * 100:.2f}%")

# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
fig_lr = ff.create_annotated_heatmap(z=cm_lr, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Viridis')
fig_lr.update_layout(title='Confusion Matrix for Logistic Regression (Actual vs. Predicted)',
                     xaxis_title="Predicted Label", yaxis_title="Actual Label",
                     yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_lr.show()

**Interpreting the Confusion Matrix for Churn:**
```
                     Predicted
                     No Churn    Churn
Actual  No Churn     TN          FP   <- FP is a Type I error (predicted churn, but customer stayed)
        Churn        FN          TP   <- FN is a Type II error (predicted no churn, but customer left)
```
* **True Negatives (TN):** Customers correctly identified as *not* churning.
* **False Positives (FP):** Customers predicted to churn, but they actually stayed. (Cost: unnecessary retention efforts).
* **False Negatives (FN):** Customers predicted to stay, but they actually churned. (Cost: lost customer and revenue, often the primary concern).
* **True Positives (TP):** Customers correctly identified as churning.

The ideal scenario is to maximize TN and TP, while minimizing FP and FN.
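
The four cells can also be pulled out of the matrix programmatically. This sketch uses ten hypothetical labels to unpack them with `ravel()` and derive accuracy, precision, and recall from the counts.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 customers stayed (0), 4 churned (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, scikit-learn orders the flattened matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were caught

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
```

Here accuracy is 0.70, yet only half of the churners are caught, which is the kind of detail the matrix exposes and a single accuracy figure hides.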

### 2. Decision Tree

A non-linear model that learns a tree of if/else rules on feature values. Can be very interpretable, but tends to overfit the training data unless its depth or leaf size is limited.
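
One way to observe the effect of limiting tree depth is to compare training and test accuracy for an unrestricted tree and a depth-limited one. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption for illustration; results on the churn data will differ).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, used only for illustration.
X_syn, y_syn = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=42, stratify=y_syn)

# An unrestricted tree keeps splitting until every training sample is classified correctly.
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_tr, y_tr)

deep_train_acc = deep_tree.score(X_tr, y_tr)
deep_test_acc = deep_tree.score(X_te, y_te)
shallow_train_acc = shallow_tree.score(X_tr, y_tr)
shallow_test_acc = shallow_tree.score(X_te, y_te)

print(f"Unrestricted (depth {deep_tree.get_depth()}): train={deep_train_acc:.3f}, test={deep_test_acc:.3f}")
print(f"max_depth=5: train={shallow_train_acc:.3f}, test={shallow_test_acc:.3f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting; the unrestricted tree memorizes the noisy training labels, so its perfect training score says little about performance on new customers.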

In [None]:
# Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42, max_depth=5) # max_depth limits tree depth to reduce overfitting
dt.fit(X_train_processed, y_train)
y_pred_dt = dt.predict(X_test_processed)

print("--- Decision Tree --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt) * 100:.2f}%")

# Confusion matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
fig_dt = ff.create_annotated_heatmap(z=cm_dt, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Blues')
fig_dt.update_layout(title='Confusion Matrix for Decision Tree (Actual vs. Predicted)',
                     xaxis_title="Predicted Label", yaxis_title="Actual Label",
                     yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_dt.show()

# Optional: Visualize the tree (might be large for many features)
# from sklearn.tree import plot_tree
# import matplotlib.pyplot as plt
# if feature_names_out is not None:
#   plt.figure(figsize=(20,10))
#   plot_tree(dt, filled=True, feature_names=list(feature_names_out), class_names=['No Churn', 'Churn'], max_depth=3, fontsize=10)
#   plt.show()

### 3. Support Vector Classifier (SVC)

Finds the hyperplane that separates the classes with the largest margin. Can be powerful, but training time grows quickly with the number of samples, so it can be slow on very large datasets.

In [None]:
# Support Vector Classifier
# Using a linear kernel for faster training in this example. Non-linear kernels (e.g., 'rbf') can be more powerful but slower.
svc = SVC(kernel='linear', random_state=42)
print("Starting SVC training... This might take a moment.")
svc.fit(X_train_processed, y_train)
print("SVC training complete.")
y_pred_svc = svc.predict(X_test_processed)

print("--- Support Vector Classifier --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svc) * 100:.2f}%")

# Confusion matrix
cm_svc = confusion_matrix(y_test, y_pred_svc)
fig_svc = ff.create_annotated_heatmap(z=cm_svc, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Greens')
fig_svc.update_layout(title='Confusion Matrix for SVC (Actual vs. Predicted)',
                      xaxis_title="Predicted Label", yaxis_title="Actual Label",
                      yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_svc.show()

### 4. Random Forest

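An ensemble model that combines many decision trees, each trained on a random sample of the rows and considering random subsets of the features. Averaging their votes reduces the overfitting of a single tree, so it is generally robust and performs well with little tuning.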

In [None]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, max_depth=10) # n_jobs=-1 uses all processors; max_depth limits each tree's depth
rf.fit(X_train_processed, y_train)
y_pred_rf = rf.predict(X_test_processed)

print("--- Random Forest --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf) * 100:.2f}%")

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
fig_rf = ff.create_annotated_heatmap(z=cm_rf, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Oranges')
fig_rf.update_layout(title='Confusion Matrix for Random Forest (Actual vs. Predicted)',
                     xaxis_title="Predicted Label", yaxis_title="Actual Label",
                     yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_rf.show()
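
A fitted Random Forest exposes impurity-based importance scores through its `feature_importances_` attribute. The sketch below shows the pattern on synthetic data with placeholder feature names (both are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, used only for illustration.
X_syn, y_syn = make_classification(n_samples=500, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=42)
feature_names = [f"feature_{i}" for i in range(6)]

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_syn, y_syn)

# Importances are non-negative and sum to 1; larger means the feature
# contributed more to the impurity reduction across the trees.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In this notebook, the same idea would presumably pair `rf.feature_importances_` with `feature_names_out`, provided the names are available and their length matches the number of processed columns.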

### 5. Gradient Boosting

Another powerful ensemble technique that builds trees sequentially, with each new tree fit to the errors of the ensemble built so far.
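
The sequential nature of boosting can be inspected after fitting. In this sketch on synthetic data (an assumption for illustration), `train_score_` holds the training loss after each added tree and `staged_predict` yields the ensemble's predictions at each stage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Synthetic data, used only for illustration.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)

gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gb_model.fit(X_syn, y_syn)

# Training accuracy after 1, 2, ..., 50 trees.
staged_accuracy = [accuracy_score(y_syn, stage_pred)
                   for stage_pred in gb_model.staged_predict(X_syn)]

print(f"Training loss after first tree: {gb_model.train_score_[0]:.4f}")
print(f"Training loss after last tree:  {gb_model.train_score_[-1]:.4f}")
print(f"Training accuracy, first vs last stage: {staged_accuracy[0]:.3f} vs {staged_accuracy[-1]:.3f}")
```

Training loss keeps falling as trees are added, so in practice the number of trees is chosen by watching performance on held-out data rather than on the training set.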

In [None]:
# Gradient Boosting Classifier
gb = GradientBoostingClassifier(random_state=42, n_estimators=100, max_depth=3) # shallow trees (max_depth=3) act as weak learners
gb.fit(X_train_processed, y_train)
y_pred_gb = gb.predict(X_test_processed)

print("--- Gradient Boosting --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_gb) * 100:.2f}%")

# Confusion matrix
cm_gb = confusion_matrix(y_test, y_pred_gb)
fig_gb = ff.create_annotated_heatmap(z=cm_gb, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Purples')
fig_gb.update_layout(title='Confusion Matrix for Gradient Boosting (Actual vs. Predicted)',
                     xaxis_title="Predicted Label", yaxis_title="Actual Label",
                     yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_gb.show()

### 6. Multi-layer Perceptron (MLP) - Neural Network

A basic neural network. Can model complex relationships but may require more data and tuning.

In [None]:
# Multi-layer Perceptron Classifier
# Scaled data is important for neural networks.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=42, early_stopping=True, n_iter_no_change=10) # two hidden layers; early stopping halts training when the validation score stops improving
mlp.fit(X_train_processed, y_train)
y_pred_mlp = mlp.predict(X_test_processed)

print("--- Multi-layer Perceptron (MLP) --- ")
print(f"Accuracy: {accuracy_score(y_test, y_pred_mlp) * 100:.2f}%")

# Confusion matrix
cm_mlp = confusion_matrix(y_test, y_pred_mlp)
fig_mlp = ff.create_annotated_heatmap(z=cm_mlp, x=['No Churn', 'Churn'], y=['No Churn', 'Churn'], colorscale='Greys')
fig_mlp.update_layout(title='Confusion Matrix for MLP (Actual vs. Predicted)',
                      xaxis_title="Predicted Label", yaxis_title="Actual Label",
                      yaxis_autorange='reversed')  # show the 'No Churn' row on top, matching confusion_matrix order
fig_mlp.show()

## Conclusion & Next Steps 🚀

In this class, we've explored a common business problem – customer churn – and applied various machine learning models to predict it. We covered:
* Loading and **cleaning real-world data**, including handling missing values and incorrect data types (like 'TotalCharges').
* The importance of **feature preprocessing**: converting categorical features to a numerical format (one-hot encoding) and scaling numerical features.
* Visualizing data to understand churn patterns.
* Training several different classification models, from simple linear models to more complex ensembles and neural networks.
* Evaluating models using **accuracy** and interpreting the **confusion matrix** to understand the types of errors our models make.

**Further Exploration:**
* **Feature Importance:** Understanding which factors are most predictive of churn (e.g., contract type, tenure). Tree-based models like Random Forest and Gradient Boosting can provide feature importance scores.
* **Cost-Benefit Analysis:** In a real business scenario, the cost of a False Negative (failing to predict a churner) is usually higher than a False Positive (incorrectly flagging a loyal customer for a retention offer). This can lead to choosing models or thresholds that specifically minimize high-cost errors.
* **Advanced Evaluation Metrics:** Exploring precision, recall, F1-score, and ROC-AUC to get a better sense of model performance, especially when the churn rate is low (an imbalanced dataset).
* **Hyperparameter Tuning:** Optimizing the settings of each model (e.g., number of trees in a Random Forest, layers in an MLP) to improve performance.
* **Interpretable AI (XAI):** Using techniques like SHAP or LIME to explain individual predictions, which is crucial for business adoption and trust.
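
As a starting point for the cost-benefit idea, the decision threshold applied to predicted probabilities can be lowered to catch more churners at the price of more false alarms. The sketch below uses synthetic imbalanced data and an arbitrary alternative threshold of 0.3 (both are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 25% positives, used only for illustration.
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          random_state=42, stratify=y_syn)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_probability = model.predict_proba(X_te)[:, 1]

# Default behaviour corresponds to a 0.5 cut-off; 0.3 is an arbitrary lower one.
pred_default = (churn_probability >= 0.5).astype(int)
pred_lenient = (churn_probability >= 0.3).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_lenient = recall_score(y_te, pred_lenient)

print(f"Threshold 0.5: flagged={pred_default.sum()}, recall={recall_default:.2f}")
print(f"Threshold 0.3: flagged={pred_lenient.sum()}, recall={recall_lenient:.2f}")
```

Lowering the threshold can only flag more customers, so recall never decreases; whether the extra retention offers are worth it depends on the relative cost of each error type.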
