# 1.4 - Your First AI Model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/madeforai/madeforai/blob/main/docs/understanding-ai/module-1/1.4-first-ai-model.ipynb)

---

**Build a complete machine learning model from scratch: from raw data to predictions on new customers in about 30 minutes.**

## 📚 What You'll Learn

- **Load and explore real data**: Understand your dataset before building models
- **Data preprocessing**: Handle missing values, encode categories, and scale features
- **Train multiple models**: Try different algorithms and compare performance
- **Evaluate properly**: Use metrics that actually matter for your problem
- **The complete ML workflow**: From raw data to trained model

## ⏱️ Estimated Time
30-35 minutes

## 📋 Prerequisites
- Basic Python and pandas knowledge
- Understanding of machine learning concepts (Chapter 1.3)
- Enthusiasm to build your first model! 🚀

## 🎯 Our Mission: Predict Customer Churn

**The Business Problem:**

You work for a telecom company. Customers are leaving (churning), and it costs 5x more to acquire new customers than to retain existing ones. Your job: **build a model that predicts which customers are likely to churn so the company can intervene early.**

**Why This Matters:**
- Churn prediction is used by Netflix, Spotify, banks, and virtually every subscription business
- It's a classic **binary classification** problem (churn: Yes/No)
- A good first project: real business value, a manageable dataset, and measurable impact

**What We'll Build:**
1. Load and explore customer data
2. Clean and prepare features
3. Train multiple classification models
4. Compare performance using proper metrics
5. Select the best model and interpret results

Let's dive in! 🏊‍♂️

In [None]:
# Setup: Install required packages
# Uncomment if running in Google Colab
# !pip install numpy pandas matplotlib seaborn scikit-learn plotly -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
warnings.filterwarnings('ignore')
np.random.seed(42)

print("✅ All libraries loaded successfully!")
print("🎯 Ready to build your first AI model!")
print("\n💼 Project: Telecom Customer Churn Prediction")

## 📊 Step 1: Load and Explore the Data

**Golden Rule:** Never start building models without understanding your data first!

We'll use a synthetic telecom churn dataset with realistic customer attributes.

In [None]:
# Create synthetic telecom customer data (in real projects, you'd load from CSV/database)
np.random.seed(42)
n_samples = 1000

# Generate realistic customer data
data = pd.DataFrame({
    'customer_id': range(1, n_samples + 1),
    'age': np.random.randint(18, 70, n_samples),
    'gender': np.random.choice(['Male', 'Female'], n_samples),
    'tenure_months': np.random.randint(1, 73, n_samples),  # 1-72 months (up to 6 years)
    'monthly_charges': np.random.uniform(20, 120, n_samples),
    'total_charges': None,  # Will calculate from tenure and monthly
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples, p=[0.5, 0.3, 0.2]),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples, p=[0.3, 0.5, 0.2]),
    'online_security': np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.3, 0.5, 0.2]),
    'tech_support': np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.25, 0.55, 0.2]),
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples, p=[0.6, 0.4]),
    'payment_method': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], 
                                       n_samples, p=[0.35, 0.2, 0.25, 0.2])
})

# Calculate total charges
data['total_charges'] = data['tenure_months'] * data['monthly_charges'] + np.random.normal(0, 100, n_samples)
data.loc[data['tenure_months'] == 0, 'total_charges'] = data.loc[data['tenure_months'] == 0, 'monthly_charges']

# Create churn label (target variable)
# Higher churn probability for: short tenure, month-to-month contracts, high charges, electronic check
churn_prob = 0.1  # Base churn rate
churn_prob += (data['tenure_months'] < 12) * 0.3  # New customers more likely to churn
churn_prob += (data['contract_type'] == 'Month-to-month') * 0.25
churn_prob += (data['monthly_charges'] > 80) * 0.15
churn_prob += (data['payment_method'] == 'Electronic check') * 0.2
churn_prob += (data['online_security'] == 'No') * 0.1
churn_prob += (data['tech_support'] == 'No') * 0.1

data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

# Introduce some missing values (realistic scenario)
missing_idx = np.random.choice(data.index, size=int(0.02 * n_samples), replace=False)
data.loc[missing_idx, 'total_charges'] = np.nan

print("📁 Dataset loaded successfully!\n")
print(f"Dataset shape: {data.shape}")
print(f"Columns: {data.shape[1]}")
print(f"Rows: {data.shape[0]}\n")
print("First few rows:")
display(data.head())

In [None]:
# Explore the dataset structure
print("📋 Dataset Information:\n")
data.info()  # info() prints its report directly and returns None
print("\n" + "="*80)
print("\n📊 Summary Statistics:\n")
display(data.describe())
print("\n" + "="*80)
print("\n🔍 Missing Values:")
print(data.isnull().sum())
print("\n" + "="*80)
print("\n🎯 Target Variable Distribution:")
churn_counts = data['churn'].value_counts()
print(f"Not Churned (0): {churn_counts[0]} ({churn_counts[0]/len(data)*100:.1f}%)")
print(f"Churned (1): {churn_counts[1]} ({churn_counts[1]/len(data)*100:.1f}%)")

In [None]:
# Visualize key relationships
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Churn distribution (sort_index puts class 0 first so the tick labels below always match)
churn_counts.sort_index().plot(kind='bar', ax=axes[0, 0], color=['#10b981', '#ef4444'], alpha=0.7, edgecolor='white', linewidth=2)
axes[0, 0].set_title('Churn Distribution', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Churn (0=No, 1=Yes)', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Number of Customers', fontsize=11, fontweight='bold')
axes[0, 0].set_xticklabels(['Not Churned', 'Churned'], rotation=0)
axes[0, 0].grid(True, alpha=0.3, axis='y')

# 2. Tenure vs Churn
data.groupby('churn')['tenure_months'].plot(kind='hist', ax=axes[0, 1], alpha=0.6, bins=20, legend=True)
axes[0, 1].set_title('Tenure Distribution by Churn', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('Tenure (months)', fontsize=11, fontweight='bold')
axes[0, 1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0, 1].legend(['No Churn', 'Churn'])
axes[0, 1].grid(True, alpha=0.3)

# 3. Monthly Charges vs Churn
data.groupby('churn')['monthly_charges'].plot(kind='hist', ax=axes[0, 2], alpha=0.6, bins=20, legend=True)
axes[0, 2].set_title('Monthly Charges by Churn', fontsize=13, fontweight='bold')
axes[0, 2].set_xlabel('Monthly Charges ($)', fontsize=11, fontweight='bold')
axes[0, 2].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0, 2].legend(['No Churn', 'Churn'])
axes[0, 2].grid(True, alpha=0.3)

# 4. Contract Type vs Churn
contract_churn = pd.crosstab(data['contract_type'], data['churn'], normalize='index') * 100
contract_churn.plot(kind='bar', ax=axes[1, 0], color=['#10b981', '#ef4444'], alpha=0.7, edgecolor='white', linewidth=2)
axes[1, 0].set_title('Churn Rate by Contract Type', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('Contract Type', fontsize=11, fontweight='bold')
axes[1, 0].set_ylabel('Percentage (%)', fontsize=11, fontweight='bold')
axes[1, 0].legend(['No Churn', 'Churn'], loc='upper right')
axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=45, ha='right')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 5. Internet Service vs Churn
internet_churn = pd.crosstab(data['internet_service'], data['churn'], normalize='index') * 100
internet_churn.plot(kind='bar', ax=axes[1, 1], color=['#10b981', '#ef4444'], alpha=0.7, edgecolor='white', linewidth=2)
axes[1, 1].set_title('Churn Rate by Internet Service', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('Internet Service', fontsize=11, fontweight='bold')
axes[1, 1].set_ylabel('Percentage (%)', fontsize=11, fontweight='bold')
axes[1, 1].legend(['No Churn', 'Churn'], loc='upper right')
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45, ha='right')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# 6. Payment Method vs Churn  
payment_churn = pd.crosstab(data['payment_method'], data['churn'], normalize='index') * 100
payment_churn.plot(kind='bar', ax=axes[1, 2], color=['#10b981', '#ef4444'], alpha=0.7, edgecolor='white', linewidth=2)
axes[1, 2].set_title('Churn Rate by Payment Method', fontsize=13, fontweight='bold')
axes[1, 2].set_xlabel('Payment Method', fontsize=11, fontweight='bold')
axes[1, 2].set_ylabel('Percentage (%)', fontsize=11, fontweight='bold')
axes[1, 2].legend(['No Churn', 'Churn'], loc='upper right')
axes[1, 2].set_xticklabels(axes[1, 2].get_xticklabels(), rotation=45, ha='right')
axes[1, 2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n🔍 Key Observations:")
print("   → Customers with shorter tenure churn more often")
print("   → Month-to-month contracts have highest churn rate")
print("   → Electronic check payments correlate with higher churn")
print("   → Customers with monthly charges above $80 churn more often")
print("\n💡 These patterns will help our model predict churn!")

## 🧹 Step 2: Data Preprocessing

Raw data is messy. Before training models, we need to:
1. **Handle missing values**
2. **Encode categorical variables** (convert text to numbers)
3. **Split into training and test sets**
4. **Scale numerical features** (standardize ranges, fitting the scaler on the training set only)

Let's do this systematically!

In [None]:
# Create a copy for preprocessing
df = data.copy()

print("🧹 Starting Data Preprocessing...\n")

# 1. Handle missing values
print("Step 1: Handling missing values")
print(f"   Missing values before: {df['total_charges'].isnull().sum()}")
# Fill missing total_charges with the median. Assign the result back rather than
# calling fillna(inplace=True) on a column selection, which newer pandas versions
# warn about and may not apply to the original DataFrame.
df['total_charges'] = df['total_charges'].fillna(df['total_charges'].median())
print(f"   Missing values after: {df['total_charges'].isnull().sum()}")
print("   ✅ Missing values handled\n")

# 2. Drop customer_id (not useful for prediction)
df = df.drop('customer_id', axis=1)
print("Step 2: Removed customer_id column\n")

# 3. Encode categorical variables
print("Step 3: Encoding categorical variables")
categorical_cols = [
    col for col in df.select_dtypes(include=['object']).columns if col != 'churn'
]

print(f"   Categorical columns: {categorical_cols}")

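# Use label encoding (one integer per category). The integer order is arbitrary:
# tree-based models tolerate this, but linear and distance-based models may read it as a ranking.
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

print("   ✅ Categorical variables encoded\n")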

print("📊 Preprocessed Data:")
display(df.head())
print(f"\n✨ Data preprocessing complete! Shape: {df.shape}")

In [None]:
# 4. Split features and target
print("\n📦 Preparing Features and Target...\n")

X = df.drop('churn', axis=1)
y = df['churn']

print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"\nFeature names: {list(X.columns)}")

# 5. Train-Test Split (80-20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class balance
)

print(f"\n✅ Data Split:")
print(f"   Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"   Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.0f}%)")

# 6. Scale numerical features
print(f"\n🔧 Scaling numerical features...")
scaler = StandardScaler()

# Fit on training data, transform both train and test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames for better readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)

print("   ✅ Features scaled (mean=0, std=1)")
print("\n🎯 Data is ready for model training!")
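
The manual steps above can also be bundled into one object. The sketch below is an alternative, not the approach this notebook uses: it assumes a small stand-in dataset (`demo`, with made-up values) and wires imputation, scaling, and one-hot encoding into a scikit-learn `Pipeline`, so every preprocessing statistic is learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data (hypothetical values, not the tutorial's dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'tenure_months': rng.integers(1, 73, n).astype(float),
    'monthly_charges': rng.uniform(20, 120, n),
    'contract_type': rng.choice(['Month-to-month', 'One year', 'Two year'], n),
})
demo.loc[rng.choice(n, 10, replace=False), 'monthly_charges'] = np.nan
target = (demo['tenure_months'] < 24).astype(int)  # toy label for illustration

numeric_features = ['tenure_months', 'monthly_charges']
categorical_features = ['contract_type']

# Numeric columns: median imputation then scaling; categorical columns: one-hot
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
clf = Pipeline([('prep', preprocess),
                ('model', LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=42, stratify=target)
clf.fit(X_tr, y_tr)  # imputer, scaler, and encoder are fit on training rows only
test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the pipeline accepts raw rows, the same `clf.predict(...)` call works on new customers without repeating each preprocessing step by hand.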

## 🤖 Step 3: Train Multiple Models

We'll train 5 different classification algorithms and compare them:

1. **Logistic Regression**: Simple, interpretable, fast (linear model)
2. **Decision Tree**: Easy to visualize, handles non-linearity
3. **Random Forest**: Ensemble of trees, less prone to overfitting than a single tree
4. **Gradient Boosting**: Powerful, often wins competitions
5. **K-Nearest Neighbors**: Simple, no explicit training phase (it stores the training data and compares at prediction time)

**Why try multiple models?**
- No single algorithm is always best
- Different models capture different patterns
- Comparing helps you understand the problem better

In [None]:
print("🤖 Training Multiple Classification Models...\n")
print("="*80)

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42, n_estimators=100),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

# Store results
results = []
trained_models = {}

# Train each model
for name, model in models.items():
    print(f"\n🔄 Training {name}...")
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions on test set
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
    
    # Store results
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc
    })
    
    trained_models[name] = {
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"   ✅ {name} trained successfully!")
    print(f"      Accuracy: {accuracy:.3f}")

print("\n" + "="*80)
print("🎉 All models trained successfully!\n")
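
A single train/test split gives one score per model, and that score can shift with the random split. As a hedged sketch of a more stable comparison, the example below runs 5-fold stratified cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the churn dataset), and keeps the scaler inside a pipeline so each fold fits its own scaler.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling lives inside the pipeline, so no fold sees statistics from its held-out rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='f1')

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds is a rough indication of how much a single-split score might vary.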

## 📊 Step 4: Evaluate and Compare Models

**Important:** Different metrics tell different stories!

- **Accuracy**: Overall correctness (can be misleading with imbalanced data)
- **Precision**: Of predicted churners, how many actually churned? (avoid false alarms)
- **Recall**: Of actual churners, how many did we catch? (don't miss churners)
- **F1-Score**: Balance between precision and recall
- **ROC-AUC**: Overall ability to discriminate between classes
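
To make these definitions concrete, here is a small worked example on ten hypothetical customers (the labels are made up for illustration). It counts the four confusion-matrix cells by hand, computes each metric from them, and checks the results against scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for 10 customers (1 = churned)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners we caught: 2
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms: 1
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners we missed: 2
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly left alone: 5

accuracy = (tp + tn) / len(y_true)                   # 7/10 = 0.700
precision = tp / (tp + fp)                           # 2/3  = 0.667
recall = tp / (tp + fn)                              # 2/4  = 0.500
f1 = 2 * precision * recall / (precision + recall)   # 4/7  = 0.571

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# The hand-computed values agree with scikit-learn
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Accuracy looks reasonable at 0.70, yet recall shows that half of the actual churners were missed.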

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(results)

print("📊 MODEL PERFORMANCE COMPARISON")
print("="*100)
display(results_df.round(3))
print("="*100)

# Visualize comparisons
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#3b82f6', '#10b981', '#f59e0b', '#8b5cf6']

for idx, (metric, color) in enumerate(zip(metrics, colors)):
    ax = axes[idx // 2, idx % 2]
    
    # Sort by metric
    sorted_df = results_df.sort_values(metric, ascending=True)
    
    bars = ax.barh(sorted_df['Model'], sorted_df[metric], 
                   color=color, alpha=0.7, edgecolor='white', linewidth=2)
    
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'{metric} Comparison', fontsize=13, fontweight='bold')
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3, axis='x')
    
    # Add value labels
    for bar in bars:
        width = bar.get_width()
        ax.text(width + 0.01, bar.get_y() + bar.get_height()/2,
               f'{width:.3f}',
               ha='left', va='center', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.show()

# Find best model for each metric
print("\n🏆 Best Models by Metric:")
print("="*60)
for metric in metrics:
    # best_name holds a string; the name best_model is reserved for the fitted model object
    best_name = results_df.loc[results_df[metric].idxmax(), 'Model']
    best_value = results_df[metric].max()
    print(f"{metric:12s}: {best_name:25s} ({best_value:.3f})")
print("="*60)

In [None]:
# ROC Curve Comparison (for models with probability predictions)
fig, ax = plt.subplots(figsize=(10, 8))

for name, model_data in trained_models.items():
    if model_data['probabilities'] is not None:
        fpr, tpr, _ = roc_curve(y_test, model_data['probabilities'])
        roc_auc = roc_auc_score(y_test, model_data['probabilities'])
        ax.plot(fpr, tpr, linewidth=2.5, label=f'{name} (AUC = {roc_auc:.3f})')

# Plot random classifier baseline
ax.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier (AUC = 0.500)')

ax.set_xlabel('False Positive Rate', fontsize=13, fontweight='bold')
ax.set_ylabel('True Positive Rate', fontsize=13, fontweight='bold')
ax.set_title('ROC Curves: Model Comparison\n(Closer to top-left = Better)', 
            fontsize=14, fontweight='bold', pad=15)
ax.legend(fontsize=11, loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📈 ROC-AUC Interpretation (rough rules of thumb):")
print("   → 0.5 = Random guessing (no better than coin flip)")
print("   → 0.7-0.8 = Acceptable")
print("   → 0.8-0.9 = Excellent")
print("   → 0.9+ = Outstanding")
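
One way to build intuition for ROC-AUC: it equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below checks this on four hypothetical customers by counting the winning (churner, non-churner) pairs and comparing the result with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted churn scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(scores, y_true) if t == 1]  # scores of actual churners
neg = [s for s, t in zip(scores, y_true) if t == 0]  # scores of non-churners

# Count pairs where the churner outranks the non-churner (ties count as half)
wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
auc_manual = wins / (len(pos) * len(neg))

print(auc_manual)                     # 0.75 (3 of the 4 pairs are ranked correctly)
print(roc_auc_score(y_true, scores))  # 0.75
```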

In [None]:
# Confusion Matrix for best model
best_f1_model = results_df.loc[results_df['F1-Score'].idxmax(), 'Model']
best_predictions = trained_models[best_f1_model]['predictions']

cm = confusion_matrix(y_test, best_predictions)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', square=True, 
           xticklabels=['Not Churn', 'Churn'],
           yticklabels=['Not Churn', 'Churn'],
           cbar_kws={'label': 'Count'},
           ax=ax, annot_kws={'size': 14, 'weight': 'bold'})

ax.set_xlabel('Predicted', fontsize=13, fontweight='bold')
ax.set_ylabel('Actual', fontsize=13, fontweight='bold')
ax.set_title(f'Confusion Matrix: {best_f1_model}\n(Best F1-Score Model)', 
            fontsize=14, fontweight='bold', pad=15)

plt.tight_layout()
plt.show()

# Calculate and display confusion matrix metrics
tn, fp, fn, tp = cm.ravel()
print(f"\n📊 Confusion Matrix Breakdown ({best_f1_model}):")
print("="*60)
print(f"True Negatives (TN):  {tn:4d} ← Correctly predicted NOT churn")
print(f"False Positives (FP): {fp:4d} ← Incorrectly predicted churn")
print(f"False Negatives (FN): {fn:4d} ← Missed churners (dangerous!)")
print(f"True Positives (TP):  {tp:4d} ← Correctly predicted churn")
print("="*60)
print(f"\n💡 Business Impact:")
print(f"   → We correctly identified {tp} churners (can intervene!)")
print(f"   → We missed {fn} churners (they'll leave unnoticed)")
print(f"   → We had {fp} false alarms (wasted retention efforts)")
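
Missed churners and false alarms rarely cost the same, so the default 0.5 decision threshold is not always the best choice. The sketch below assumes two made-up costs (`COST_FN` and `COST_FP` are hypothetical figures, not from any real business) and eight hypothetical customers, then compares the total cost at three thresholds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes and predicted churn probabilities for 8 customers
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
proba = np.array([0.9, 0.6, 0.35, 0.45, 0.2, 0.1, 0.55, 0.3])

COST_FN = 500  # assumed revenue lost per missed churner
COST_FP = 50   # assumed cost of an unnecessary retention offer

def total_cost(threshold):
    """Total cost of errors when flagging customers with proba >= threshold."""
    y_hat = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return fn * COST_FN + fp * COST_FP

costs = {t: total_cost(t) for t in [0.3, 0.5, 0.7]}
best_threshold = min(costs, key=costs.get)

for t, c in costs.items():
    print(f"threshold={t}: cost={c}")   # 0.3 -> 100, 0.5 -> 1050, 0.7 -> 1500
print(f"Lowest-cost threshold: {best_threshold}")  # 0.3
```

With missed churners assumed to be ten times more expensive than false alarms, the lower threshold wins even though it raises more false alarms.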

In [None]:
# Detailed classification report for best model
print(f"\n📋 DETAILED CLASSIFICATION REPORT: {best_f1_model}")
print("="*80)
print(classification_report(y_test, best_predictions, 
                          target_names=['Not Churn', 'Churn'],
                          digits=3))
print("="*80)

## 🎯 Step 5: Interpret Results & Feature Importance

**The Big Question:** Which factors most influence churn?

Understanding feature importance helps:
- Build trust in the model
- Guide business strategy
- Debug model behavior
- Comply with regulations (explainability)

Let's examine the impurity-based feature importances from our three tree-based models.

In [None]:
# Feature importance from tree-based models
tree_models = ['Decision Tree', 'Random Forest', 'Gradient Boosting']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, model_name in enumerate(tree_models):
    model = trained_models[model_name]['model']
    
    # Get feature importances
    importances = pd.DataFrame({
        'Feature': X.columns,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=True)
    
    # Plot
    ax = axes[idx]
    bars = ax.barh(importances['Feature'], importances['Importance'], 
                   alpha=0.7, edgecolor='white', linewidth=2)
    
    # Color bars by importance
    colors = plt.cm.RdYlGn(importances['Importance'] / importances['Importance'].max())
    for bar, color in zip(bars, colors):
        bar.set_color(color)
    
    ax.set_xlabel('Importance', fontsize=11, fontweight='bold')
    ax.set_title(f'{model_name}\nFeature Importance', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

# Get top 5 features from Random Forest (averaging over many trees tends to give steadier importances than a single tree)
rf_model = trained_models['Random Forest']['model']
rf_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n🔍 TOP 5 MOST IMPORTANT FEATURES (Random Forest):")
print("="*60)
for idx, row in rf_importances.head(5).iterrows():
    print(f"{row['Feature']:25s}: {row['Importance']:.4f}")
print("="*60)

print("\n💡 Business Insights:")
print("   → Focus retention efforts on customers with short tenure")
print("   → Month-to-month contracts are a churn risk: encourage longer terms")
print("   → High monthly charges predict churn: consider pricing strategy")
print("   → Offer online security/tech support to at-risk customers")
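
The `feature_importances_` values used above are computed from the training data and can favor features with many distinct values. A complementary check is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic stand-in data (an assumption, not the churn dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the drop in F1
result = permutation_importance(
    forest, X_te, y_te, scoring='f1', n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Features whose shuffled score barely changes contribute little to the model's test performance, whatever their impurity-based ranking says.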

## 🚀 Step 6: Making Predictions on New Customers

The whole point of building a model is to use it! Let's demonstrate how to:
1. Take a new customer's data
2. Preprocess it the same way as training data
3. Make a churn prediction
4. Get a probability score

In [None]:
# Create some hypothetical new customers
new_customers = pd.DataFrame([
    {
        'age': 25,
        'gender': 'Female',
        'tenure_months': 3,  # NEW customer
        'monthly_charges': 85.0,  # HIGH charges
        'total_charges': 255.0,
        'contract_type': 'Month-to-month',  # RISKY
        'internet_service': 'Fiber optic',
        'online_security': 'No',  # NO security
        'tech_support': 'No',  # NO support
        'paperless_billing': 'Yes',
        'payment_method': 'Electronic check'  # RISKY payment
    },
    {
        'age': 45,
        'gender': 'Male',
        'tenure_months': 48,  # LOYAL customer
        'monthly_charges': 45.0,  # LOW charges
        'total_charges': 2160.0,
        'contract_type': 'Two year',  # STABLE contract
        'internet_service': 'DSL',
        'online_security': 'Yes',  # HAS security
        'tech_support': 'Yes',  # HAS support
        'paperless_billing': 'No',
        'payment_method': 'Bank transfer'  # STABLE payment
    }
])

print("🆕 New Customers to Predict:\n")
display(new_customers)

# Preprocess new data (same as training data), enforcing the training column order
new_customers_processed = new_customers[X.columns].copy()

# Encode categorical variables using the same encoders
for col in categorical_cols:
    new_customers_processed[col] = label_encoders[col].transform(new_customers_processed[col])

# Scale features and keep the column names the models were trained with
new_customers_scaled = pd.DataFrame(
    scaler.transform(new_customers_processed), columns=X.columns
)

# Make predictions with best model
best_model = trained_models[best_f1_model]['model']
predictions = best_model.predict(new_customers_scaled)
probabilities = best_model.predict_proba(new_customers_scaled)[:, 1]

# Display results
print("\n" + "="*80)
print("🔮 CHURN PREDICTIONS:")
print("="*80)
for idx, (pred, prob) in enumerate(zip(predictions, probabilities)):
    risk_level = "HIGH RISK" if prob > 0.7 else "MEDIUM RISK" if prob > 0.4 else "LOW RISK"
    color = "🔴" if prob > 0.7 else "🟡" if prob > 0.4 else "🟢"

    print(f"\nCustomer {idx + 1}:")
    print(f"  Prediction: {'WILL CHURN' if pred == 1 else 'WILL STAY'}")
    print(f"  Churn Probability: {prob:.1%}")
    print(f"  Risk Level: {color} {risk_level}")

    if prob > 0.5:
        print(f"  ⚠️  Action: Immediate retention intervention recommended!")
        print(f"     → Offer contract incentives")
        print(f"     → Add online security/tech support")
        print(f"     → Consider pricing adjustment")
    else:
        print(f"  ✅ Action: Standard customer care, monitor monthly")

print("\n" + "="*80)
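
To reuse a trained model outside this notebook, it has to be saved together with its preprocessing. As a minimal sketch (the file name `churn_pipeline.joblib` and the synthetic data are placeholders, not part of this tutorial), the example below stores a fitted pipeline with `joblib` and confirms that the reloaded copy returns the same probabilities.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaler and model travel together, so new data is scaled exactly as in training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# Save to a temporary folder (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'churn_pipeline.joblib')
joblib.dump(pipe, path)

# Later, or in another program: load and predict
restored = joblib.load(path)
same = np.allclose(restored.predict_proba(X_demo[:5]), pipe.predict_proba(X_demo[:5]))
print(f"Reloaded model gives identical probabilities: {same}")
```

Only load model files from sources you trust, and use the same scikit-learn version for saving and loading where possible.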

## 🎯 Exercise 1: Build Your Own Customer Profile

**Objective:** Create a customer profile and predict their churn probability

**Task:**  
Modify the customer data below to create a hypothetical customer. Then run the prediction code to see if they're likely to churn.

**Experiment with:**
- Different tenure lengths (1-72 months)
- Contract types (Month-to-month, One year, Two year)
- Service combinations
- Monthly charge levels ($20-$120)

**Questions to explore:**
1. What combination creates the highest churn risk?
2. How much does contract type affect the prediction?
3. Can you create a customer with >90% churn probability?
4. Can you create a customer with <10% churn probability?

In [None]:
# Create your own customer profile here
my_customer = pd.DataFrame([{
    'age': 30,  # Modify these values!
    'gender': 'Female',
    'tenure_months': 6,
    'monthly_charges': 75.0,
    'total_charges': 450.0,
    'contract_type': 'Month-to-month',
    'internet_service': 'Fiber optic',
    'online_security': 'No',
    'tech_support': 'No',
    'paperless_billing': 'Yes',
    'payment_method': 'Electronic check'
}])

# Preprocess and predict (same steps as above, enforcing the training column order)
my_customer_processed = my_customer[X.columns].copy()
for col in categorical_cols:
    my_customer_processed[col] = label_encoders[col].transform(my_customer_processed[col])
my_customer_scaled = pd.DataFrame(scaler.transform(my_customer_processed), columns=X.columns)

my_prediction = best_model.predict(my_customer_scaled)[0]
my_probability = best_model.predict_proba(my_customer_scaled)[0, 1]

print("\n🔮 YOUR CUSTOMER PREDICTION:")
print("="*60)
print(f"Churn Prediction: {'WILL CHURN' if my_prediction == 1 else 'WILL STAY'}")
print(f"Churn Probability: {my_probability:.1%}")
print("="*60)

# Your observations here:
# What did you learn?

## 🎯 Exercise 2: Improve the Model

**Objective:** Try to beat the current best model's F1-score

**Ideas to experiment with:**
1. **Try different hyperparameters**:
   - Change `max_depth` for Decision Tree
   - Adjust `n_estimators` for Random Forest
   - Modify `n_neighbors` for KNN

2. **Feature engineering**:
   - Create new features (e.g., `charges_per_month = total_charges / tenure_months`)
   - Try polynomial features
   - Remove less important features

3. **Handle class imbalance**:
   - Try `class_weight='balanced'` in models
   - Use SMOTE (Synthetic Minority Over-sampling)

<details>
<summary>💡 Hint: Start Here</summary>

Try training a Random Forest with more trees:
```python
improved_rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
improved_rf.fit(X_train_scaled, y_train)
```
</details>
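
Instead of changing one hyperparameter at a time, you can let scikit-learn search a small grid for you. The sketch below is one possible starting point, run on synthetic imbalanced data (an assumption, not the churn dataset): `GridSearchCV` tries every combination of the listed values, including `class_weight='balanced'`, and scores each with 3-fold cross-validated F1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with a 70/30 class imbalance
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Illustrative grid: 2 x 2 x 2 = 8 combinations
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, scoring='f1', cv=3)
search.fit(X_tr, y_tr)  # tuning uses the training set only

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated F1: {search.best_score_:.3f}")
print(f"Test F1 of the best model: {search.score(X_te, y_te):.3f}")
```

Keeping the test set out of the search means the final test score remains an honest estimate.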

In [None]:
# Your experimentation here!
# Try to improve the model performance

# Example starter code:
# improved_model = RandomForestClassifier(
#     n_estimators=200,  # More trees
#     max_depth=10,      # Deeper trees
#     min_samples_split=5,
#     random_state=42
# )
# improved_model.fit(X_train_scaled, y_train)
# improved_pred = improved_model.predict(X_test_scaled)
# print(f"Improved F1-Score: {f1_score(y_test, improved_pred):.3f}")



## 🎓 Key Takeaways

Congratulations! You've built your first complete machine learning model! Let's recap:

- ✅ **Data Exploration**: Always understand your data before modeling
- ✅ **Preprocessing Pipeline**: Handle missing values, encode categories, scale features
- ✅ **Model Selection**: Try multiple algorithms; as the "no free lunch" theorem suggests, none is best for every problem
- ✅ **Proper Evaluation**: Use train-test split, multiple metrics, confusion matrix
- ✅ **Business Context**: Metrics mean nothing without business interpretation
- ✅ **Feature Importance**: Understand what drives predictions
- ✅ **Ready for New Data**: Know how to make predictions on new customers

### 💼 Real-World Impact:

The same basic workflow (explore, preprocess, train, evaluate) underlies systems such as:
- Netflix predicting if you'll cancel
- Banks detecting fraudulent transactions
- Healthcare diagnosing diseases
- E-commerce personalizing recommendations

**You just built the same foundation used by billion-dollar AI systems!** 🚀

## 📖 Further Learning

**Recommended Reading:**
- [Scikit-learn Documentation](https://scikit-learn.org/stable/user_guide.html) - Official guide to every algorithm
- [Kaggle Learn](https://www.kaggle.com/learn/intro-to-machine-learning) - Interactive ML tutorials
- [ML Mastery](https://machinelearningmastery.com/) - Practical ML tutorials

**Practice Datasets:**
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) - 500+ datasets
- [Kaggle Datasets](https://www.kaggle.com/datasets) - Real-world data
- [Scikit-learn Built-in Datasets](https://scikit-learn.org/stable/datasets.html) - Ready to use

**Competitions (Apply Your Skills):**
- [Kaggle Competitions](https://www.kaggle.com/competitions) - Compete with data scientists worldwide
- [DrivenData](https://www.drivendata.org/competitions/) - Competitions for social good

**Advanced Topics:**
- Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
- Cross-validation strategies
- Handling imbalanced datasets
- Model interpretability (SHAP, LIME)
- Feature engineering techniques

## ➡️ What's Next?

You've completed Module 1: AI Foundations! 🎉

**You've learned:**
- What AI is and the current landscape
- How machines learn (supervised, unsupervised, reinforcement)
- Core ML concepts (loss functions, gradient descent, train-test split)
- How to build a complete ML model end-to-end

**Next in Module 2: Machine Learning Fundamentals**

You'll dive deeper into:
- **2.1 - Supervised Learning Essentials**: Master regression and classification
- **2.2 - Classification vs Regression**: When to use which and why
- **2.3 - Unsupervised Learning & Clustering**: Discover hidden patterns
- **2.4 - Model Evaluation & Metrics**: Beyond accuracy to real performance

Ready to level up? Start **[Module 2: Machine Learning Fundamentals](../module-2/2.1-supervised-learning.ipynb)**! 🚀

---

### 💬 Feedback & Community

**Questions?** Join our [Discord community](https://discord.gg/madeforai)

**Found a bug?** [Open an issue on GitHub](https://github.com/madeforai/madeforai/issues)

**Want to contribute?** Check our [contribution guide](https://github.com/madeforai/madeforai/blob/main/CONTRIBUTING.md)

**Share your success!** Tweet your first model with #MadeForAI

---

### 🎉 Congratulations!

You're no longer a complete beginner: you've trained, evaluated, and used a working model end to end!

**Keep building, keep learning, keep creating!** 🌟