<div style="background-image: url('https://www.dropbox.com/scl/fi/wdrnuojbnjx6lgfekrx85/mcnair.jpg?rlkey=wcbaw5au7vh5vt1g5d5x7fw8f&dl=1'); background-size: cover; background-position: center; height: 300px; display: flex; align-items: center; justify-content: center; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7); margin-bottom: 20px; position: relative;">
  <h1 style="text-align: center; font-size: 2.5em; margin: 0;">JGSB Python Workshop <br> Part 12: Machine Learning</h1>
  <div style="position: absolute; bottom: 10px; left: 15px; font-size: 0.9em; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7);">
    Authored by Kerry Back
  </div>
  <div style="position: absolute; bottom: 10px; right: 15px; text-align: right; font-size: 0.9em; color: white; text-shadow: 2px 2px 4px rgba(0,0,0,0.7);">
    Rice University, 9/6/2025
  </div>
</div>

# Summary: Machine Learning for Business Success

Machine learning is transforming business decision-making across industries. This notebook introduced you to the fundamental concepts and practical applications that every business professional should understand.

## Key Takeaways

**1. Machine Learning Types:**
- **Supervised Learning:** Predict outcomes using labeled historical data (regression, classification)
- **Unsupervised Learning:** Discover patterns without predefined targets (clustering, dimensionality reduction)
- Each type solves different business problems and requires different evaluation approaches

**2. Business Applications:**
- **Customer Analytics:** Lifetime value prediction, churn prevention, segmentation
- **Risk Management:** Credit scoring, fraud detection, operational risk assessment
- **Operations:** Demand forecasting, inventory optimization, quality control
- **Marketing:** Targeted campaigns, recommendation systems, price optimization

**3. Model Selection Strategy:**
- Start simple with linear models for interpretability
- Use ensemble methods (Random Forest) for robust performance
- Consider neural networks for complex, nonlinear patterns
- Always balance performance with business requirements

**4. Success Factors:**
- **Data Quality:** Clean, relevant, sufficient data is crucial
- **Feature Engineering:** Domain expertise creates valuable features
- **Proper Evaluation:** Use appropriate metrics and validation techniques
- **Business Integration:** Align models with operational capabilities and constraints

## Next Steps for Business Professionals

1. **Build Domain Expertise:** Understand your business data and what drives outcomes
2. **Start Small:** Begin with simple, high-impact use cases
3. **Collaborate:** Work closely with data scientists and IT teams
4. **Measure Impact:** Track business metrics, not just model metrics
5. **Iterate:** Continuously improve models based on real-world performance

## Tools and Resources

**Python Libraries for Business ML:**
- **scikit-learn:** Comprehensive machine learning library
- **pandas:** Data manipulation and analysis
- **matplotlib/seaborn:** Data visualization
- **xgboost:** Advanced gradient boosting
- **statsmodels:** Statistical modeling

**Cloud ML Platforms:**
- AWS SageMaker, Google Cloud AI, Azure ML
- Auto-ML tools for business users
- Pre-built APIs for common tasks

## Final Advice

Machine learning is a powerful tool, but it's not magic. Success requires:
- Clear business objectives
- Quality data and domain expertise  
- Appropriate model selection and evaluation
- Thoughtful integration into business processes
- Continuous monitoring and improvement

Start with problems where you have good data and clear success metrics. Build your confidence with simpler models before tackling complex challenges. Remember: a simple model that's actually used is infinitely more valuable than a complex model that sits unused.

The future belongs to organizations that can effectively combine human insight with machine intelligence to make better decisions faster.

# Machine Learning Implementation Best Practices

Successful machine learning projects require more than just algorithms. This section covers essential practices for deploying ML solutions in business environments.

## Model Selection and Evaluation

**Cross-Validation:** Always use cross-validation to get robust performance estimates
```python
from sklearn.model_selection import cross_val_score

# `model` is any scikit-learn classifier; X and y are the features and labels
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

**Hyperparameter Tuning:** Systematically optimize model parameters
```python
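from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)  # best combination found by 5-fold CV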
```

**Feature Engineering:** Create meaningful features from raw data
- Domain knowledge is crucial for feature creation
- Consider interaction terms, polynomial features, and transformations
- Use feature selection techniques to reduce dimensionality

## Business Integration Considerations

**Model Interpretability:** Choose models based on business requirements
- **High Interpretability:** Logistic Regression, Decision Trees, Linear Models
- **Medium Interpretability:** Random Forest (feature importance), Regularized Models
- **Low Interpretability:** Neural Networks, SVM with RBF kernel

**Performance vs. Interpretability Trade-off:**
- Regulated industries (banking, insurance, healthcare) often require interpretable models
- High-stakes decisions may need explainable predictions
- Consider ensemble methods that balance both needs
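As one illustration of what "interpretable" can mean in practice, the sketch below fits a logistic regression on standardized features and reads the coefficients as odds ratios. The data is synthetic and the feature names are hypothetical labels attached for readability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names attached to synthetic data
feature_names = ["tenure", "monthly_charges", "support_tickets", "usage_hours"]
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)

# Standardizing first makes coefficients comparable across features
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# exp(coefficient) = multiplicative change in the odds of the positive class
# for a one-standard-deviation increase in that feature
coefs = pipeline.named_steps["logisticregression"].coef_[0]
odds_ratios = pd.Series(np.exp(coefs), index=feature_names).sort_values(ascending=False)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature pushes predictions toward the positive class; below 1, away from it. A Random Forest offers feature importances but no comparable per-feature direction.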

**Handling Class Imbalance:**
- Use stratified sampling for train/test splits
- Consider SMOTE to oversample the minority class with synthetic examples (apply it to the training data only)
- Adjust class weights in algorithms
- Focus on precision, recall, and F1-score rather than accuracy
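A minimal sketch of two of these techniques, stratified splitting and class weighting, on synthetic data with roughly 10% positives. The dataset and the choice of logistic regression are assumptions for illustration; SMOTE is omitted because it requires the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# stratify=y keeps the class ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Same model with and without class weighting
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority-class recall, unweighted: {recall_plain:.1%}")
print(f"Minority-class recall, balanced weights: {recall_weighted:.1%}")

# Precision, recall, and F1 per class for the weighted model
print(classification_report(y_test, weighted.predict(X_test), digits=3))
```

Class weighting typically raises recall on the minority class at the cost of precision, which is why the report shows both rather than accuracy alone.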

## Deployment and Monitoring

- **Model Versioning:** Track model versions and performance over time
- **Data Drift Detection:** Monitor for changes in input data distribution
- **Performance Monitoring:** Set up alerts for model performance degradation
- **A/B Testing:** Compare new models against current production models
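Data drift detection can be sketched with a two-sample Kolmogorov-Smirnov test that compares a feature's training distribution with recent production data. The income figures, the helper name `has_drifted`, and the 0.01 significance level are all assumptions for illustration; production systems usually monitor many features and correct for multiple testing.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature: applicant income at training time vs. in production
train_income = rng.normal(60000, 15000, 5000)
recent_same = rng.normal(60000, 15000, 1000)     # same distribution
recent_shifted = rng.normal(70000, 15000, 1000)  # mean has shifted upward

def has_drifted(reference, current, alpha=0.01):
    """Flag drift when the two-sample KS test rejects 'same distribution'."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha), statistic, p_value

for label, sample in [("unchanged", recent_same), ("shifted", recent_shifted)]:
    drifted, statistic, p_value = has_drifted(train_income, sample)
    print(f"{label}: KS statistic={statistic:.3f}, p-value={p_value:.4f}, drift={drifted}")
```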

## Ethical Considerations

- **Bias Detection:** Check for discriminatory patterns across demographic groups
- **Fairness Metrics:** Ensure equitable treatment across different populations
- **Privacy Protection:** Implement appropriate data protection measures
- **Transparency:** Document model decisions and limitations for stakeholders
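One simple fairness check is to compare approval (selection) rates across groups. The sketch below uses a tiny hypothetical table of model decisions; the group labels and outcomes are invented. The 0.8 cut-off reflects the commonly cited "four-fifths" rule of thumb and is a screening heuristic, not a legal or statistical guarantee.

```python
import pandas as pd

# Hypothetical model decisions for two demographic groups
decisions = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 4,
    "approved": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
})

# Selection rate = share of each group that received a positive decision
selection_rate = decisions.groupby("group")["approved"].mean()

# Demographic parity ratio: lowest selection rate divided by the highest
parity_ratio = selection_rate.min() / selection_rate.max()

print(selection_rate)
print(f"Parity ratio: {parity_ratio:.3f}")
if parity_ratio < 0.8:
    print("Selection rates differ substantially; investigate before deployment")
```

A low ratio does not prove discrimination on its own, but it signals that the model's decisions deserve a closer look by group.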

## Exercise: Market Basket Analysis for Cross-Selling

Analyze customer purchase patterns to identify products that are frequently bought together, enabling targeted cross-selling and store layout optimization.

**Your Task:**
1. **Generate Transaction Data:** Create a dataset with 2,000 transactions containing:
   - Transaction ID, customer ID, purchase date
   - Products purchased (from 20 different products across 4 categories)
   - Realistic shopping patterns (complementary products, seasonal effects)

2. **Association Rule Mining:**
   - Calculate support, confidence, and lift for product combinations
   - Identify strong association rules (e.g., "If customers buy X, they also buy Y")
   - Find the most profitable cross-selling opportunities

3. **Customer Clustering with Purchase Behavior:**
   - Create customer profiles based on category preferences
   - Use clustering to identify distinct shopping patterns
   - Analyze seasonal and demographic variations

4. **Business Recommendations:**
   - Suggest product placement strategies for physical stores
   - Develop targeted recommendation algorithms for e-commerce
   - Calculate potential revenue impact of cross-selling initiatives
   - Create customer journey maps based on purchase sequences

**Bonus:** Implement a collaborative filtering recommendation system to suggest products to customers based on similar customers' purchases.
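As a starting point for step 2, the three association-rule metrics can be computed directly from their definitions. This is a sketch on five hand-made transactions; the helper name `rule_metrics` and the products are hypothetical, and a full solution would loop over all product pairs (or use a dedicated library).

```python
# Five hypothetical transactions, each a set of purchased products
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    count_a = sum(antecedent in t for t in transactions)
    count_c = sum(consequent in t for t in transactions)
    count_both = sum(antecedent in t and consequent in t for t in transactions)

    support = count_both / n              # P(A and C)
    confidence = count_both / count_a     # P(C | A)
    lift = confidence / (count_c / n)     # P(C | A) / P(C)
    return support, confidence, lift

support, confidence, lift = rule_metrics(transactions, "bread", "butter")
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.60, confidence=0.75, lift=1.25
```

A lift above 1 means the two products appear together more often than they would if purchases were independent, which is the signal to look for when ranking cross-selling candidates.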

In [ ]:
# Advanced visualization and business interpretation

# PCA for dimensionality reduction and visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)

print("ADVANCED ANALYSIS AND VISUALIZATION")
print("="*50)
print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained by 2 components: {pca.explained_variance_ratio_.sum():.1%}")

# Create comprehensive visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# PCA scatter plot with predicted clusters
scatter = axes[0, 0].scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='tab10', alpha=0.7)
axes[0, 0].set_title(f'Customer Clusters (PCA Visualization)')
axes[0, 0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0, 0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.colorbar(scatter, ax=axes[0, 0])

# PCA scatter plot with true segments (for comparison)
true_segment_colors = pd.Categorical(customer_behavior['true_segment']).codes
scatter2 = axes[0, 1].scatter(X_pca[:, 0], X_pca[:, 1], c=true_segment_colors, cmap='tab10', alpha=0.7)
axes[0, 1].set_title('True Customer Segments (PCA)')
axes[0, 1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0, 1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')

# Cluster characteristics heatmap
cluster_analysis_normalized = cluster_analysis.div(cluster_analysis.max(), axis=1)
im = axes[0, 2].imshow(cluster_analysis_normalized.T, cmap='RdYlBu_r', aspect='auto')
axes[0, 2].set_title('Cluster Characteristics (Normalized)')
axes[0, 2].set_xticks(range(len(cluster_analysis.index)))
axes[0, 2].set_xticklabels([f'Cluster {i}' for i in cluster_analysis.index])
axes[0, 2].set_yticks(range(len(clustering_features)))
axes[0, 2].set_yticklabels(clustering_features, fontsize=8)
plt.colorbar(im, ax=axes[0, 2])

# Customer value by cluster
cluster_value = customer_behavior.groupby('predicted_cluster')['annual_spending'].agg(['mean', 'count'])
axes[1, 0].bar(cluster_value.index, cluster_value['mean'], alpha=0.7, color='skyblue')
axes[1, 0].set_title('Average Annual Spending by Cluster')
axes[1, 0].set_xlabel('Cluster')
axes[1, 0].set_ylabel('Annual Spending ($)')

# Add count labels on bars
for cluster, row in cluster_value.iterrows():
    # iterrows() upcasts the count to float, so convert it back for display
    axes[1, 0].text(cluster, row['mean'] + 50, f'n={int(row["count"])}', ha='center', fontsize=10)

# Purchase frequency by cluster
cluster_freq = customer_behavior.groupby('predicted_cluster')['purchase_frequency'].mean()
axes[1, 1].bar(cluster_freq.index, cluster_freq.values, alpha=0.7, color='lightgreen')
axes[1, 1].set_title('Average Purchase Frequency by Cluster')
axes[1, 1].set_xlabel('Cluster')
axes[1, 1].set_ylabel('Annual Purchases')

# Engagement metrics by cluster
engagement_metrics = customer_behavior.groupby('predicted_cluster')[['email_engagement', 'mobile_app_usage']].mean()
engagement_metrics.plot(kind='bar', ax=axes[1, 2], alpha=0.7)
axes[1, 2].set_title('Engagement Metrics by Cluster')
axes[1, 2].set_xlabel('Cluster')
axes[1, 2].set_ylabel('Engagement Score')
axes[1, 2].legend()
axes[1, 2].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

# Business interpretation and marketing strategy
print(f"\nBUSINESS INTERPRETATION AND MARKETING STRATEGY")
print("="*60)

# Name clusters based on characteristics
cluster_names = {}
cluster_strategies = {}

for cluster_id in sorted(customer_behavior['predicted_cluster'].unique()):
    cluster_data = cluster_analysis.loc[cluster_id]
    
    # Determine cluster characteristics
    high_spending = cluster_data['annual_spending'] > cluster_analysis['annual_spending'].median()
    high_frequency = cluster_data['purchase_frequency'] > cluster_analysis['purchase_frequency'].median()
    high_engagement = cluster_data['email_engagement'] > cluster_analysis['email_engagement'].median()
    high_loyalty = cluster_data['loyalty_years'] > cluster_analysis['loyalty_years'].median()
    
    # Name cluster based on characteristics
    if high_spending and not high_frequency:
        name = "Premium Customers"
        strategy = "Personalized luxury offerings, VIP service, exclusive events"
    elif high_frequency and cluster_data['annual_spending'] > 1000:
        name = "Frequent Buyers"
        strategy = "Loyalty rewards, bulk discounts, early access to new products"
    elif high_engagement and cluster_data['annual_spending'] < 800:
        name = "Deal Seekers"
        strategy = "Promotional campaigns, limited-time offers, price-based marketing"
    elif cluster_data['loyalty_years'] < 1:
        name = "New Customers"
        strategy = "Onboarding campaigns, educational content, trial offers"
    else:
        name = "At-Risk Customers"
        strategy = "Re-engagement campaigns, win-back offers, surveys for feedback"
    
    cluster_names[cluster_id] = name
    cluster_strategies[cluster_id] = strategy

# Display cluster insights
for cluster_id in sorted(customer_behavior['predicted_cluster'].unique()):
    cluster_size = (customer_behavior['predicted_cluster'] == cluster_id).sum()
    cluster_data = cluster_analysis.loc[cluster_id]
    
    print(f"\nCluster {cluster_id}: {cluster_names[cluster_id]} ({cluster_size} customers)")
    print(f"Characteristics:")
    print(f"  • Annual Spending: ${cluster_data['annual_spending']:,.0f}")
    print(f"  • Purchase Frequency: {cluster_data['purchase_frequency']:.1f} times/year")
    print(f"  • Average Order Value: ${cluster_data['avg_order_value']:.0f}")
    print(f"  • Email Engagement: {cluster_data['email_engagement']:.1%}")
    print(f"  • Loyalty: {cluster_data['loyalty_years']:.1f} years")
    print(f"Marketing Strategy: {cluster_strategies[cluster_id]}")

# Calculate business impact
total_revenue = customer_behavior['annual_spending'].sum()
cluster_revenue = customer_behavior.groupby('predicted_cluster')['annual_spending'].sum()

print(f"\nREVENUE IMPACT BY CLUSTER")
print("="*40)
for cluster_id in sorted(customer_behavior['predicted_cluster'].unique()):
    revenue = cluster_revenue[cluster_id]
    revenue_pct = revenue / total_revenue * 100
    print(f"{cluster_names[cluster_id]}: ${revenue:,.0f} ({revenue_pct:.1f}% of total)")

print(f"\nTotal Company Revenue: ${total_revenue:,.0f}")

# Recommendations
print(f"\nSTRATEGIC RECOMMENDATIONS")
print("="*40)
print(f"1. Focus retention efforts on high-value segments")
print(f"2. Develop targeted campaigns for each cluster")
print(f"3. Implement personalized product recommendations")
print(f"4. Create cluster-specific communication strategies")
print(f"5. Monitor cluster migration over time")
print(f"6. A/B test marketing messages within clusters")

In [ ]:
# Prepare data for clustering
# Remove customer_id and true_segment for unsupervised learning
clustering_features = ['annual_spending', 'purchase_frequency', 'avg_order_value', 
                      'website_visits', 'support_tickets', 'loyalty_years', 
                      'email_engagement', 'mobile_app_usage']

X_cluster = customer_behavior[clustering_features].copy()

# Standardize features for clustering
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

print("CLUSTERING DATA PREPARATION")
print("="*50)
print(f"Features for clustering: {clustering_features}")
print(f"Data shape: {X_cluster.shape}")
print(f"Data has been standardized for clustering algorithms")

# Determine optimal number of clusters using elbow method and silhouette score
def find_optimal_clusters(X, max_k=10):
    """Find optimal number of clusters using multiple methods"""
    
    inertias = []
    silhouette_scores = []
    k_range = range(2, max_k + 1)
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(X)
        
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X, cluster_labels))
    
    return k_range, inertias, silhouette_scores

# Find optimal clusters
k_range, inertias, silhouette_scores = find_optimal_clusters(X_cluster_scaled, max_k=8)

print(f"\nCluster validation metrics:")
for i, k in enumerate(k_range):
    print(f"k={k}: Inertia={inertias[i]:.2f}, Silhouette Score={silhouette_scores[i]:.3f}")

# Heuristic: pick the k with the highest silhouette score, then cross-check it against the elbow in inertia
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"\nOptimal number of clusters (highest silhouette): {optimal_k}")

# Visualize cluster validation
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Elbow plot
axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_title('Elbow Method for Optimal k')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].grid(True)

# Silhouette score plot
axes[1].plot(k_range, silhouette_scores, 'ro-')
axes[1].set_title('Silhouette Score for Different k')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].axvline(x=optimal_k, color='red', linestyle='--', alpha=0.7, label=f'Optimal k={optimal_k}')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

# Perform K-means clustering with optimal k
kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans_optimal.fit_predict(X_cluster_scaled)

# Add cluster labels to original data
customer_behavior['predicted_cluster'] = cluster_labels

print(f"\nCLUSTERING RESULTS (k={optimal_k})")
print("="*50)
print(f"Predicted cluster distribution:")
print(pd.Series(cluster_labels).value_counts().sort_index())

# Analyze cluster characteristics
cluster_analysis = customer_behavior.groupby('predicted_cluster')[clustering_features].mean()
print(f"\nCluster characteristics (mean values):")
print(cluster_analysis.round(2))

# Compare with true segments
print(f"\nComparison with true segments:")
comparison = pd.crosstab(customer_behavior['true_segment'], 
                        customer_behavior['predicted_cluster'], 
                        margins=True)
print(comparison)

In [ ]:
# Example: Customer Segmentation for Marketing Strategy
# Use clustering to identify distinct customer groups for targeted marketing

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

def generate_customer_behavior_data(n_customers=1000):
    """Generate customer behavior data for segmentation"""
    np.random.seed(42)
    
    # Create distinct customer segments with different behaviors
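    # Five natural segments in a 30/25/20/15/10 split, scaled to n_customers
    segment_shares = [0.30, 0.25, 0.20, 0.15, 0.10]
    segment_sizes = [int(round(n_customers * share)) for share in segment_shares]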
    all_customers = []
    
    # Segment 1: High-value, low-frequency (Premium customers)
    n1 = segment_sizes[0]
    premium_customers = pd.DataFrame({
        'annual_spending': np.random.normal(2500, 400, n1),
        'purchase_frequency': np.random.normal(8, 2, n1),
        'avg_order_value': np.random.normal(300, 50, n1),
        'website_visits': np.random.normal(25, 5, n1),
        'support_tickets': np.random.normal(1, 0.5, n1),
        'loyalty_years': np.random.normal(4, 1, n1),
        'email_engagement': np.random.normal(0.6, 0.1, n1),
        'mobile_app_usage': np.random.normal(15, 3, n1),
        'true_segment': 'Premium'
    })
    all_customers.append(premium_customers)
    
    # Segment 2: Frequent, moderate-value (Regular customers)
    n2 = segment_sizes[1]
    regular_customers = pd.DataFrame({
        'annual_spending': np.random.normal(1200, 200, n2),
        'purchase_frequency': np.random.normal(20, 4, n2),
        'avg_order_value': np.random.normal(60, 15, n2),
        'website_visits': np.random.normal(45, 8, n2),
        'support_tickets': np.random.normal(2, 1, n2),
        'loyalty_years': np.random.normal(2, 0.8, n2),
        'email_engagement': np.random.normal(0.4, 0.1, n2),
        'mobile_app_usage': np.random.normal(25, 5, n2),
        'true_segment': 'Regular'
    })
    all_customers.append(regular_customers)
    
    # Segment 3: Price-sensitive, deal-hunters (Bargain hunters)
    n3 = segment_sizes[2]
    bargain_customers = pd.DataFrame({
        'annual_spending': np.random.normal(600, 150, n3),
        'purchase_frequency': np.random.normal(15, 3, n3),
        'avg_order_value': np.random.normal(40, 10, n3),
        'website_visits': np.random.normal(35, 6, n3),
        'support_tickets': np.random.normal(1.5, 0.8, n3),
        'loyalty_years': np.random.normal(1.5, 0.5, n3),
        'email_engagement': np.random.normal(0.7, 0.1, n3),  # High engagement for deals
        'mobile_app_usage': np.random.normal(20, 4, n3),
        'true_segment': 'Bargain Hunter'
    })
    all_customers.append(bargain_customers)
    
    # Segment 4: New, exploring customers (Explorers)
    n4 = segment_sizes[3]
    explorer_customers = pd.DataFrame({
        'annual_spending': np.random.normal(300, 100, n4),
        'purchase_frequency': np.random.normal(5, 2, n4),
        'avg_order_value': np.random.normal(60, 20, n4),
        'website_visits': np.random.normal(15, 4, n4),
        'support_tickets': np.random.normal(3, 1, n4),  # Need more help
        'loyalty_years': np.random.normal(0.5, 0.3, n4),
        'email_engagement': np.random.normal(0.3, 0.1, n4),
        'mobile_app_usage': np.random.normal(10, 3, n4),
        'true_segment': 'Explorer'
    })
    all_customers.append(explorer_customers)
    
    # Segment 5: Dormant customers (At risk)
    n5 = segment_sizes[4]
    dormant_customers = pd.DataFrame({
        'annual_spending': np.random.normal(150, 50, n5),
        'purchase_frequency': np.random.normal(2, 1, n5),
        'avg_order_value': np.random.normal(75, 25, n5),
        'website_visits': np.random.normal(5, 2, n5),
        'support_tickets': np.random.normal(0.5, 0.3, n5),
        'loyalty_years': np.random.normal(3, 1, n5),  # Long-time but inactive
        'email_engagement': np.random.normal(0.1, 0.05, n5),
        'mobile_app_usage': np.random.normal(2, 1, n5),
        'true_segment': 'Dormant'
    })
    all_customers.append(dormant_customers)
    
    # Combine all segments
    combined_df = pd.concat(all_customers, ignore_index=True)
    
    # Clip to valid ranges: no negative values, and email engagement within [0, 1]
    numeric_columns = ['annual_spending', 'purchase_frequency', 'avg_order_value', 
                      'website_visits', 'support_tickets', 'loyalty_years', 
                      'email_engagement', 'mobile_app_usage']
    
    for col in numeric_columns:
        combined_df[col] = np.maximum(combined_df[col], 0)
        if col == 'email_engagement':
            combined_df[col] = np.clip(combined_df[col], 0, 1)
    
    # Add customer IDs
    combined_df['customer_id'] = range(1, len(combined_df) + 1)
    
    # Shuffle the data
    combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return combined_df

# Generate customer data
customer_behavior = generate_customer_behavior_data(1000)

print("CUSTOMER SEGMENTATION ANALYSIS")
print("="*50)
print(f"Dataset shape: {customer_behavior.shape}")
print(f"\nFirst few rows:")
print(customer_behavior.head())

print(f"\nTrue segment distribution:")
print(customer_behavior['true_segment'].value_counts())

print(f"\nDataset statistics:")
print(customer_behavior.describe())

# Unsupervised Learning: Finding Hidden Patterns

Unsupervised learning discovers patterns in data without labeled examples. This is valuable for exploration, segmentation, and understanding the underlying structure of business data when you don't know what you're looking for.

**Key Concepts:**
- **No Target Variable:** Algorithms find patterns without being told what to look for
- **Pattern Discovery:** Identify hidden relationships, groups, and structures
- **Dimensionality Reduction:** Simplify complex data while preserving important information
- **Anomaly Detection:** Find unusual observations that might indicate problems or opportunities

**Main Types of Unsupervised Learning:**

**1. Clustering:** Group similar observations together
- **K-Means:** Partition data into k clusters based on similarity
- **Hierarchical Clustering:** Build tree-like cluster structures
- **DBSCAN:** Find clusters of varying shapes and identify outliers

**2. Dimensionality Reduction:** Reduce the number of variables while preserving information
- **Principal Component Analysis (PCA):** Find the most important directions of variation
- **t-SNE:** Visualize high-dimensional data in 2D or 3D

**3. Association Rules:** Find relationships between different items or events
- **Market Basket Analysis:** "People who buy X also buy Y"

**Business Applications:**
- **Customer Segmentation:** Group customers by behavior, preferences, or value
- **Market Research:** Identify distinct market segments and positioning opportunities
- **Fraud Detection:** Find unusual transactions that deviate from normal patterns
- **Recommendation Systems:** Suggest products based on similar customer preferences
- **Operational Efficiency:** Identify process improvements and cost reduction opportunities
- **Risk Management:** Detect anomalous patterns that might indicate problems
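To make the difference between clustering algorithms concrete, the sketch below compares K-Means and DBSCAN on the two-moons toy dataset, where the groups are not round blobs. The dataset and the `eps`/`min_samples` values are assumptions chosen for this illustration; real data needs its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved crescent-shaped groups (toy data)
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# K-Means assumes compact, roughly spherical clusters
kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# DBSCAN grows clusters through dense regions; label -1 marks outliers
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means perfect agreement with the true groups
print(f"K-Means ARI: {adjusted_rand_score(y_true, kmeans_labels):.2f}")
print(f"DBSCAN ARI:  {adjusted_rand_score(y_true, dbscan_labels):.2f}")
print(f"Points DBSCAN flagged as outliers: {np.sum(dbscan_labels == -1)}")
```

The true labels are used here only to score the result; in a real unsupervised problem they are not available, which is why silhouette scores and business interpretation matter.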

## Exercise: Credit Risk Assessment

Build a classification model to assess loan default risk for a financial institution. This exercise combines multiple business concepts including risk management, regulatory compliance, and profitability analysis.

**Your Task:**
1. **Generate Credit Dataset:** Create a dataset with 1,500 loan applications containing:
   - Applicant demographics: age, income, employment_years, education_level
   - Financial metrics: debt_to_income_ratio, credit_score, loan_amount, existing_loans
   - Loan characteristics: loan_purpose (Auto/Home/Personal), term_months
   - Target: loan_default (binary: 0=repaid, 1=defaulted)

2. **Data Analysis:**
   - Explore default rates by different customer segments
   - Identify key risk factors through visualization
   - Handle any class imbalance in the target variable

3. **Model Development:**
   - Compare Logistic Regression, Decision Tree, Random Forest, and SVM
   - Use appropriate evaluation metrics for imbalanced classification
   - Tune hyperparameters using cross-validation

4. **Business Application:**
   - Create a risk scoring system (Low/Medium/High risk segments)
   - Calculate the cost of false positives (rejected good customers) vs false negatives (approved bad loans)
   - Recommend approval thresholds based on business objectives
   - Estimate the impact on loan portfolio profitability

**Bonus:** Implement LIME (Local Interpretable Model-agnostic Explanations) to explain individual loan decisions for regulatory compliance.
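For step 4, one way to recommend an approval threshold is to assign a cost to each error type and pick the threshold with the lowest total cost. The sketch below uses synthetic data, and the two cost figures are hypothetical placeholders to be replaced with the institution's own estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FALSE_POSITIVE = 200    # hypothetical: profit lost by rejecting a good applicant
COST_FALSE_NEGATIVE = 2000   # hypothetical: loss from approving a loan that defaults

# Synthetic stand-in for the credit dataset (about 15% defaults)
X, y = make_classification(n_samples=3000, n_features=8,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
default_prob = model.predict_proba(X_test)[:, 1]

def expected_cost(threshold):
    """Total cost if applications with default_prob >= threshold are rejected."""
    rejected = default_prob >= threshold
    false_positives = np.sum(rejected & (y_test == 0))   # good customers rejected
    false_negatives = np.sum(~rejected & (y_test == 1))  # bad loans approved
    return false_positives * COST_FALSE_POSITIVE + false_negatives * COST_FALSE_NEGATIVE

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([expected_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]

print(f"Lowest-cost threshold: {best_threshold:.2f}")
print(f"Cost at that threshold: ${costs.min():,.0f}")
```

Because a missed default is assumed to cost ten times as much as a rejected good customer, the lowest-cost threshold will generally sit below 0.5. In a real project the threshold should be chosen on a validation set, not the final test set.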

In [ ]:
# Detailed model analysis and business insights

# Best performing model: select by highest test AUC instead of assuming a fixed list position
best_classifier = max(classification_results, key=lambda result: result['auc'])
best_model = best_classifier['model']
best_predictions = best_classifier['predictions']
best_probabilities = best_classifier['probabilities']

# Confusion Matrix Analysis
cm = confusion_matrix(y_test_churn, best_predictions)
tn, fp, fn, tp = cm.ravel()

print("\nDETAILED BUSINESS ANALYSIS")
print("="*50)
print(f"Best Model: {best_classifier['model_name']}")
print(f"Test Accuracy: {best_classifier['test_accuracy']:.1%}")
print(f"AUC Score: {best_classifier['auc']:.3f}")

print(f"\nConfusion Matrix Analysis:")
print(f"True Negatives (Correctly predicted retained): {tn}")
print(f"False Positives (Incorrectly predicted churn): {fp}")
print(f"False Negatives (Missed churners): {fn}")
print(f"True Positives (Correctly predicted churn): {tp}")

# Business impact analysis
total_customers = len(y_test_churn)
actual_churners = y_test_churn.sum()
predicted_churners = best_predictions.sum()

print(f"\nBusiness Impact Analysis:")
print(f"Total test customers: {total_customers}")
print(f"Actual churners: {actual_churners}")
print(f"Predicted churners: {predicted_churners}")
print(f"Precision: {tp/(tp+fp):.1%} (of predicted churners, how many actually churned)")
print(f"Recall: {tp/(tp+fn):.1%} (of actual churners, how many we caught)")

# ROI calculation (assuming intervention costs and retention value)
intervention_cost = 50  # Cost to intervene per customer
retention_value = 500  # Value of retaining a customer

# Calculate ROI for different scenarios
print(f"\nROI Analysis (Intervention cost: ${intervention_cost}, Retention value: ${retention_value}):")

# Scenario 1: No model (untargeted intervention)
# Assumption: we contact as many customers as there are actual churners,
# and untargeted outreach retains only 30% of the churners
random_intervention_cost = actual_churners * intervention_cost
random_retention = actual_churners * 0.3
random_value = random_retention * retention_value
random_roi = (random_value - random_intervention_cost) / random_intervention_cost

print(f"Random intervention ROI: {random_roi:.1%}")

# Scenario 2: With ML model
ml_intervention_cost = predicted_churners * intervention_cost
ml_retention = tp * 0.7  # Assume 70% success rate with targeted intervention
ml_value = ml_retention * retention_value
ml_roi = (ml_value - ml_intervention_cost) / ml_intervention_cost if ml_intervention_cost > 0 else 0

print(f"ML-guided intervention ROI: {ml_roi:.1%}")
print(f"ROI improvement: {(ml_roi - random_roi) * 100:.1f} percentage points")

# Feature importance analysis
if hasattr(best_model, 'feature_importances_'):
    feature_importance_churn = pd.DataFrame({
        'Feature': churn_features,
        'Importance': best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print(f"\nFeature Importance ({best_classifier['model_name']}):")
    print("="*40)
    for i, row in feature_importance_churn.head(5).iterrows():
        print(f"{row['Feature']:<25}: {row['Importance']:.3f}")

# Visualize classification results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Model performance comparison
models_class = classification_df['Model'].tolist()
auc_scores = classification_df['AUC'].tolist()

axes[0, 0].bar(models_class, auc_scores, alpha=0.7, color=['blue', 'green', 'red', 'orange'])
axes[0, 0].set_title('Model Performance (AUC Score)')
axes[0, 0].set_ylabel('AUC Score')
axes[0, 0].tick_params(axis='x', rotation=45)
axes[0, 0].set_ylim(0, 1)

# ROC Curves
for result in classification_results:
    fpr, tpr, _ = roc_curve(y_test_churn, result['probabilities'])
    axes[0, 1].plot(fpr, tpr, label=f"{result['model_name']} (AUC={result['auc']:.3f})")

axes[0, 1].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[0, 1].set_title('ROC Curves')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].legend()

# Confusion Matrix Heatmap
cm_df = pd.DataFrame(cm, index=['Retained', 'Churned'], columns=['Predicted Retained', 'Predicted Churned'])
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0])
axes[1, 0].set_title(f'Confusion Matrix: {best_classifier["model_name"]}')

# Feature Importance Plot
if hasattr(best_model, 'feature_importances_'):
    top_features = feature_importance_churn.head(8)
    axes[1, 1].barh(top_features['Feature'], top_features['Importance'], alpha=0.7)
    axes[1, 1].set_title('Top Feature Importance')
    axes[1, 1].set_xlabel('Importance Score')

plt.tight_layout()
plt.show()

# Customer segmentation by churn probability
print(f"\nCUSTOMER RISK SEGMENTATION")
print("="*50)

# Create risk segments based on predicted probabilities
risk_thresholds = [0.3, 0.7]
risk_labels = ['Low Risk', 'Medium Risk', 'High Risk']

test_customers = pd.DataFrame({
    'actual_churn': y_test_churn,
    'predicted_prob': best_probabilities
})

test_customers['risk_segment'] = pd.cut(test_customers['predicted_prob'], 
                                       bins=[0] + risk_thresholds + [1.0], 
                                       labels=risk_labels,
                                       include_lowest=True)  # keep probabilities of exactly 0

# Analyze segments (observed=False keeps empty risk segments in the table)
segment_analysis = test_customers.groupby('risk_segment', observed=False).agg({
    'actual_churn': ['count', 'sum', 'mean']
}).round(3)

segment_analysis.columns = ['Customer_Count', 'Actual_Churners', 'Churn_Rate']
print(segment_analysis)

print(f"\nBusiness Recommendations:")
print(f"• Focus retention efforts on High Risk customers ({segment_analysis.loc['High Risk', 'Customer_Count']} customers)")
print(f"• High Risk segment has {segment_analysis.loc['High Risk', 'Churn_Rate']:.1%} actual churn rate")
print(f"• Monitor Medium Risk customers for early warning signs")
print(f"• Low Risk customers require minimal intervention")

In [ ]:
# Prepare data for classification
def prepare_classification_data(df):
    """Prepare churn data for classification"""
    
    # Create dummy variables for contract type
    df_ml = pd.get_dummies(df, columns=['contract_type'], prefix='contract')
    
    # Define features
    feature_columns = [
        'age', 'income', 'months_subscribed', 'monthly_charges',
        'monthly_usage_hours', 'support_tickets', 'has_premium',
        'auto_pay', 'paperless_billing',
        'contract_Month-to-month', 'contract_One year', 'contract_Two year'
    ]
    
    X = df_ml[feature_columns]
    y = df_ml['churned']
    
    return X, y, feature_columns

# Prepare classification data
X_churn, y_churn, churn_features = prepare_classification_data(churn_data)

# Split data
X_train_churn, X_test_churn, y_train_churn, y_test_churn = train_test_split(
    X_churn, y_churn, test_size=0.2, random_state=42, stratify=y_churn
)

# Scale features
scaler_churn = StandardScaler()
X_train_churn_scaled = scaler_churn.fit_transform(X_train_churn)
X_test_churn_scaled = scaler_churn.transform(X_test_churn)

print("CLASSIFICATION DATA PREPARATION")
print("="*50)
print(f"Features shape: {X_churn.shape}")
print(f"Training set: {X_train_churn.shape[0]} customers")
print(f"Test set: {X_test_churn.shape[0]} customers")

print(f"\nChurn rate in training set: {y_train_churn.mean():.1%}")
print(f"Churn rate in test set: {y_test_churn.mean():.1%}")

# Train classification models
def evaluate_classifier(model, X_train, X_test, y_train, y_test, model_name):
    """Train and evaluate classification model"""
    
    # Train model
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Probability predictions (for ROC curve)
    if hasattr(model, 'predict_proba'):
        y_test_proba = model.predict_proba(X_test)[:, 1]
    else:
        y_test_proba = model.decision_function(X_test)
    
    # Calculate metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    
    precision = precision_score(y_test, y_test_pred)
    recall = recall_score(y_test, y_test_pred)
    f1 = f1_score(y_test, y_test_pred)
    auc = roc_auc_score(y_test, y_test_proba)
    
    return {
        'model_name': model_name,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'auc': auc,
        'model': model,
        'predictions': y_test_pred,
        'probabilities': y_test_proba
    }

# Initialize classification models
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Train and evaluate models
classification_results = []

print("\nCLASSIFICATION MODEL TRAINING")
print("="*50)

for name, model in classifiers.items():
    print(f"Training {name}...")
    
    # Use scaled data for SVM and Logistic Regression
    if name in ['SVM', 'Logistic Regression']:
        X_train_use = X_train_churn_scaled
        X_test_use = X_test_churn_scaled
    else:
        X_train_use = X_train_churn
        X_test_use = X_test_churn
    
    result = evaluate_classifier(model, X_train_use, X_test_use, 
                                y_train_churn, y_test_churn, name)
    classification_results.append(result)

# Create results summary
classification_df = pd.DataFrame([
    {
        'Model': r['model_name'],
        'Train Accuracy': r['train_accuracy'],
        'Test Accuracy': r['test_accuracy'],
        'Precision': r['precision'],
        'Recall': r['recall'],
        'F1-Score': r['f1_score'],
        'AUC': r['auc'],
        'Overfitting': r['train_accuracy'] - r['test_accuracy']
    }
    for r in classification_results
])

print("\nCLASSIFICATION RESULTS SUMMARY")
print("="*90)
print(classification_df.round(4))

In [ ]:
# Classification Example: Customer Churn Prediction
# Predict which customers are likely to cancel their subscription

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def generate_churn_data(n_customers=2000):
    """Generate realistic customer churn dataset"""
    np.random.seed(42)
    
    # Customer demographics
    age = np.random.normal(35, 12, n_customers)
    age = np.clip(age, 18, 75)
    
    income = np.random.lognormal(10.8, 0.4, n_customers)
    income = np.clip(income, 30000, 150000)
    
    # Service characteristics
    months_subscribed = np.random.exponential(18, n_customers)
    months_subscribed = np.clip(months_subscribed, 1, 60)
    
    monthly_charges = np.random.gamma(3, 25, n_customers)
    monthly_charges = np.clip(monthly_charges, 20, 200)
    
    # Usage patterns
    monthly_usage_hours = np.random.gamma(2, 15, n_customers)
    support_tickets = np.random.poisson(2, n_customers)
    
    # Service features
    has_premium = np.random.binomial(1, 0.3, n_customers)
    auto_pay = np.random.binomial(1, 0.6, n_customers)
    paperless_billing = np.random.binomial(1, 0.7, n_customers)
    
    # Contract type
    contract_types = np.random.choice(['Month-to-month', 'One year', 'Two year'], 
                                    n_customers, p=[0.5, 0.3, 0.2])
    
    # Calculate churn probability based on business logic
    churn_prob = (
        0.1 +  # Base churn rate
        0.001 * (45 - age) +  # Younger customers churn more
        -0.000005 * income +  # Higher income customers churn less
        -0.01 * months_subscribed +  # Longer tenure = less churn
        0.002 * monthly_charges +  # Higher charges = more churn
        -0.005 * monthly_usage_hours +  # Higher usage = less churn
        0.05 * support_tickets +  # More tickets = more churn
        -0.1 * has_premium +  # Premium customers churn less
        -0.08 * auto_pay +  # Auto-pay customers churn less
        np.where(contract_types == 'Month-to-month', 0.2,
                np.where(contract_types == 'One year', 0.05, -0.05))  # Contract effect
    )
    
    # Ensure probabilities are between 0 and 1
    churn_prob = np.clip(churn_prob, 0, 1)
    
    # Generate actual churn based on probabilities
    churned = np.random.binomial(1, churn_prob, n_customers)
    
    # Create DataFrame
    df = pd.DataFrame({
        'customer_id': range(1, n_customers + 1),
        'age': age,
        'income': income,
        'months_subscribed': months_subscribed,
        'monthly_charges': monthly_charges,
        'monthly_usage_hours': monthly_usage_hours,
        'support_tickets': support_tickets,
        'has_premium': has_premium,
        'auto_pay': auto_pay,
        'paperless_billing': paperless_billing,
        'contract_type': contract_types,
        'churned': churned
    })
    
    return df

# Generate churn data
churn_data = generate_churn_data(2000)

print("CUSTOMER CHURN PREDICTION")
print("="*50)
print(f"Dataset shape: {churn_data.shape}")
print(f"\nChurn distribution:")
churn_counts = churn_data['churned'].value_counts()
print(f"Retained customers: {churn_counts[0]} ({churn_counts[0]/len(churn_data):.1%})")
print(f"Churned customers: {churn_counts[1]} ({churn_counts[1]/len(churn_data):.1%})")

print(f"\nFirst few rows:")
print(churn_data.head())

# Analyze churn patterns
print(f"\nChurn rate by contract type:")
churn_by_contract = churn_data.groupby('contract_type')['churned'].agg(['count', 'sum', 'mean'])
churn_by_contract['churn_rate'] = churn_by_contract['mean']
print(churn_by_contract[['count', 'sum', 'churn_rate']])

# Visualize churn patterns
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Churn distribution (sort by label so the bars are always in the order 0 = Retained, 1 = Churned)
churn_counts.sort_index().plot(kind='bar', ax=axes[0, 0], color=['lightblue', 'lightcoral'])
axes[0, 0].set_title('Churn Distribution')
axes[0, 0].set_ylabel('Number of Customers')
axes[0, 0].set_xticklabels(['Retained', 'Churned'], rotation=0)

# Age distribution by churn
churn_data.boxplot(column='age', by='churned', ax=axes[0, 1])
axes[0, 1].set_title('Age Distribution by Churn Status')
axes[0, 1].set_xlabel('Churned (0=No, 1=Yes)')

# Monthly charges by churn
churn_data.boxplot(column='monthly_charges', by='churned', ax=axes[0, 2])
axes[0, 2].set_title('Monthly Charges by Churn Status')
axes[0, 2].set_xlabel('Churned (0=No, 1=Yes)')

# Tenure by churn
churn_data.boxplot(column='months_subscribed', by='churned', ax=axes[1, 0])
axes[1, 0].set_title('Tenure by Churn Status')
axes[1, 0].set_xlabel('Churned (0=No, 1=Yes)')

# Contract type vs churn
contract_churn = churn_data.groupby(['contract_type', 'churned']).size().unstack()
contract_churn.plot(kind='bar', ax=axes[1, 1], color=['lightblue', 'lightcoral'])
axes[1, 1].set_title('Churn by Contract Type')
axes[1, 1].set_ylabel('Number of Customers')
axes[1, 1].legend(['Retained', 'Churned'])
axes[1, 1].tick_params(axis='x', rotation=45)

# Support tickets vs churn
support_churn = churn_data.groupby(['support_tickets', 'churned']).size().unstack(fill_value=0)
support_churn.plot(kind='bar', ax=axes[1, 2], color=['lightblue', 'lightcoral'])
axes[1, 2].set_title('Churn by Support Tickets')
axes[1, 2].set_ylabel('Number of Customers')
axes[1, 2].legend(['Retained', 'Churned'])

fig.suptitle('')  # remove the automatic "Boxplot grouped by churned" figure title added by pandas
plt.tight_layout()
plt.show()

# Supervised Learning: Classification for Business Decisions

Classification predicts **categorical outcomes** rather than continuous values. This is essential for business decisions like customer segmentation, risk assessment, quality control, and marketing targeting.

**Key Concepts:**
- **Binary Classification:** Two outcomes (Buy/Don't Buy, Approve/Reject, Pass/Fail)
- **Multi-class Classification:** Multiple outcomes (High/Medium/Low Risk, Customer Segments)
- **Probability Estimates:** Many models output a predicted probability for each class, which can be compared with a threshold to make a decision
- **Class Imbalance:** When some outcomes are much rarer than others (fraud detection)

**Business Applications:**
- **Customer Churn:** Will a customer cancel their subscription?
- **Credit Risk:** Should we approve this loan application?
- **Marketing Response:** Will a customer respond to this campaign?
- **Quality Control:** Is this product defective?
- **Employee Retention:** Is this employee likely to quit?
- **Medical Diagnosis:** Does this patient have a specific condition?

**Classification Algorithms:**
- **Logistic Regression:** Linear boundaries with probability outputs
- **Decision Trees:** Rule-based decisions that are easy to interpret
- **Random Forest:** Combines many decision trees for better accuracy
- **Support Vector Machines:** Find optimal boundaries between classes
- **Neural Networks:** Can learn complex nonlinear decision boundaries

**Evaluation Metrics:**
- **Accuracy:** Percentage of correct predictions (can be misleading with imbalanced data)
- **Precision:** Of predicted positives, how many were actually positive?
- **Recall (Sensitivity):** Of actual positives, how many did we correctly identify?
- **F1-Score:** Harmonic mean of precision and recall
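- **ROC Curve & AUC:** The ROC curve plots the true positive rate against the false positive rate across all decision thresholds; AUC summarizes it in one number (0.5 = random guessing, 1.0 = perfect ranking)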
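
To make these definitions concrete, the sketch below computes accuracy, precision, recall, and F1 from the four confusion-matrix counts. The counts are hypothetical (invented for illustration, not taken from the churn example), and the helper name `classification_metrics` is an assumption, not a scikit-learn function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical imbalanced test set: 1,000 customers, 50 of whom churned.
# This model flags 40 customers and is right about 20 of them.
model_metrics = classification_metrics(tp=20, fp=20, fn=30, tn=930)

# A "model" that never predicts churn at all
baseline_metrics = classification_metrics(tp=0, fp=0, fn=50, tn=950)

for name, m in [('Model', model_metrics), ('Always retained', baseline_metrics)]:
    print(f"{name}: accuracy={m['accuracy']:.3f}, precision={m['precision']:.3f}, "
          f"recall={m['recall']:.3f}, F1={m['f1']:.3f}")
```

Both cases reach 95% accuracy, yet the second never identifies a single churner. Precision, recall, and F1 expose the difference that accuracy hides on imbalanced data.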

## Exercise: Sales Revenue Forecasting

Create a machine learning model to predict monthly sales revenue for a retail chain based on store characteristics, marketing spend, and seasonal factors.

**Your Task:**
1. **Generate Dataset:** Create a synthetic dataset with 500 stores containing:
   - Store size (sq ft), age (years), location type (Mall/Street/Outlet)
   - Monthly marketing spend, local competition score (1-10)
   - Seasonal month (1-12), local population, median income
   - Target: Monthly revenue (based on realistic business relationships)

2. **Data Preparation:**
   - Handle categorical variables (location type)
   - Create seasonal features (sin/cos transformations for cyclical patterns)
   - Split into train/test sets (80/20)
   - Scale features appropriately

3. **Model Comparison:**
   - Train Linear, Ridge, Lasso, and Random Forest models
   - Use cross-validation to tune hyperparameters
   - Compare performance using MAE, RMSE, and R²

4. **Business Analysis:**
   - Identify the most important revenue drivers
   - Calculate the ROI of marketing spend from the model
   - Provide actionable insights for store management

**Bonus:** Add interaction terms between marketing spend and store size to capture synergy effects.
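
**Hint for the seasonal features:** a month can be mapped onto a circle so that December and January end up next to each other. The sketch below shows one common way to do this and leaves the rest of the exercise unsolved. The function names `encode_month` and `distance` are hypothetical helpers introduced only for this illustration.

```python
import math

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))

# With raw month numbers, December (12) and January (1) look 11 units apart.
# On the circle they are as close as any other pair of neighboring months.
dec_to_jan = distance(12, 1)
jun_to_jul = distance(6, 7)
jan_to_jul = distance(1, 7)

print(f"Dec -> Jan: {dec_to_jan:.3f}")
print(f"Jun -> Jul: {jun_to_jul:.3f}")
print(f"Jan -> Jul: {jan_to_jul:.3f}")
```

In a DataFrame with a month column (assumed here to be named `month`), the same idea is `np.sin(2 * np.pi * (df['month'] - 1) / 12)` together with the matching `np.cos` expression.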

In [ ]:
# Train and compare multiple regression models

def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Train model and calculate performance metrics"""
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate metrics
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    return {
        'model_name': model_name,
        'train_mae': train_mae,
        'test_mae': test_mae,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'model': model,
        'predictions': y_test_pred
    }

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train and evaluate models
results = []
predictions = {}

print("MODEL TRAINING AND EVALUATION")
print("="*50)

for name, model in models.items():
    print(f"Training {name}...")
    
    # Use scaled data for the regularized models (Ridge, Lasso);
    # Linear Regression and Random Forest use the original data
    if name in ['Ridge Regression', 'Lasso Regression']:
        X_train_use = X_train_scaled
        X_test_use = X_test_scaled
    else:
        X_train_use = X_train
        X_test_use = X_test
    
    result = evaluate_model(model, X_train_use, X_test_use, y_train, y_test, name)
    results.append(result)
    predictions[name] = result['predictions']

# Create results DataFrame
results_df = pd.DataFrame([
    {
        'Model': r['model_name'],
        'Train MAE': r['train_mae'],
        'Test MAE': r['test_mae'],
        'Train RMSE': r['train_rmse'],
        'Test RMSE': r['test_rmse'],
        'Train R²': r['train_r2'],
        'Test R²': r['test_r2'],
        'Overfitting': r['train_r2'] - r['test_r2']
    }
    for r in results
])

print("\nMODEL PERFORMANCE COMPARISON")
print("="*80)
print(results_df.round(4))

# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Test R² comparison
models_list = results_df['Model'].tolist()
test_r2_list = results_df['Test R²'].tolist()

axes[0, 0].bar(models_list, test_r2_list, alpha=0.7, color=['blue', 'green', 'red', 'orange'])
axes[0, 0].set_title('Model Performance (Test R²)')
axes[0, 0].set_ylabel('R² Score')
axes[0, 0].tick_params(axis='x', rotation=45)

# Overfitting analysis
train_r2_list = results_df['Train R²'].tolist()
overfitting = [train - test for train, test in zip(train_r2_list, test_r2_list)]

axes[0, 1].bar(models_list, overfitting, alpha=0.7, color='red')
axes[0, 1].set_title('Overfitting Analysis (Train R² - Test R²)')
axes[0, 1].set_ylabel('Overfitting Score')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].axhline(y=0, color='black', linestyle='--', alpha=0.5)

# Prediction accuracy (Random Forest)
best_model = results[3]  # Random Forest
y_pred_best = best_model['predictions']

axes[1, 0].scatter(y_test, y_pred_best, alpha=0.6)
axes[1, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1, 0].set_title(f'Prediction Accuracy: {best_model["model_name"]}')
axes[1, 0].set_xlabel('Actual CLV ($)')
axes[1, 0].set_ylabel('Predicted CLV ($)')

# Residuals plot
residuals = y_test - y_pred_best
axes[1, 1].scatter(y_pred_best, residuals, alpha=0.6)
axes[1, 1].axhline(y=0, color='red', linestyle='--')
axes[1, 1].set_title('Residuals Plot')
axes[1, 1].set_xlabel('Predicted CLV ($)')
axes[1, 1].set_ylabel('Residuals ($)')

plt.tight_layout()
plt.show()

# Best model interpretation
best_model_obj = results[3]['model']  # Random Forest
feature_importance = best_model_obj.feature_importances_

print(f"\nFEATURE IMPORTANCE (Random Forest)")
print("="*50)
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print(importance_df)

# Business insights
print(f"\nBUSINESS INSIGHTS")
print("="*50)
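# Pick the best model by test R² rather than assuming it is Random Forest
top_result = max(results, key=lambda r: r['test_r2'])
print(f"• Best model by Test R²: {top_result['model_name']} (Test R² = {top_result['test_r2']:.3f})")
print(f"• Average prediction error: ${top_result['test_mae']:.0f}")
print(f"• Most important factors for CLV (Random Forest importances):")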
for i, row in importance_df.head(3).iterrows():
    print(f"  - {row['Feature']}: {row['Importance']:.3f}")

print(f"\n• Overfitting analysis:")
for result in results:
    overfitting = result['train_r2'] - result['test_r2']
    status = "Good" if overfitting < 0.1 else "Concerning" if overfitting < 0.2 else "Severe"
    print(f"  - {result['model_name']}: {overfitting:.3f} ({status})")

In [ ]:
# Prepare data for machine learning
# Convert categorical variables and split into features/target

def prepare_ml_data(df):
    """Prepare data for machine learning"""
    
    # Create dummy variables for categorical features
    df_ml = pd.get_dummies(df, columns=['segment'], prefix='segment')
    
    # Define features (X) and target (y)
    feature_columns = [
        'age', 'income', 'months_active', 'avg_monthly_spend', 
        'purchase_frequency', 'email_open_rate', 'social_media_follower',
        'segment_Budget', 'segment_Premium', 'segment_Standard'
    ]
    
    X = df_ml[feature_columns]
    y = df_ml['customer_lifetime_value']
    
    return X, y, feature_columns

# Prepare the data
X, y, feature_names = prepare_ml_data(customer_data)

print("MACHINE LEARNING DATA PREPARATION")
print("="*50)
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeature columns:")
for i, feature in enumerate(feature_names):
    print(f"{i+1:2d}. {feature}")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"\nData splits:")
print(f"Training set: {X_train.shape[0]} customers")
print(f"Test set: {X_test.shape[0]} customers")

# Standardize features for algorithms that are sensitive to scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeature scaling applied for algorithms sensitive to scale")
print(f"Original feature ranges:")
print(X_train.describe().loc[['min', 'max']])

print(f"\nScaled feature ranges:")
scaled_df = pd.DataFrame(X_train_scaled, columns=feature_names)
print(scaled_df.describe().loc[['min', 'max']])

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Example 1: Customer Lifetime Value Prediction
# Predict CLV based on customer characteristics and behavior

def generate_clv_data(n_customers=1000):
    """Generate realistic customer lifetime value dataset"""
    np.random.seed(42)
    
    # Customer demographics
    age = np.random.normal(40, 12, n_customers)
    age = np.clip(age, 18, 80)  # Realistic age range
    
    income = np.random.lognormal(10.5, 0.5, n_customers)  # Log-normal income distribution
    income = np.clip(income, 25000, 200000)
    
    # Customer behavior metrics
    months_active = np.random.exponential(24, n_customers)  # Exponential tenure
    months_active = np.clip(months_active, 1, 60)
    
    avg_monthly_spend = np.random.gamma(2, 50, n_customers)  # Gamma spending distribution
    avg_monthly_spend = np.clip(avg_monthly_spend, 10, 500)
    
    purchase_frequency = np.random.poisson(3, n_customers) + 1  # Monthly purchase frequency
    
    # Customer segments (categorical)
    segments = np.random.choice(['Premium', 'Standard', 'Budget'], n_customers, p=[0.2, 0.5, 0.3])
    
    # Marketing engagement
    email_opens = np.random.beta(2, 5, n_customers)  # Open rate between 0-1
    social_media = np.random.binomial(1, 0.4, n_customers)  # Binary: follows on social media
    
    # Generate CLV with realistic business relationships
    clv = (
        50 +  # Base CLV
        0.002 * income +  # Higher income -> higher CLV
        15 * months_active +  # Longer tenure -> higher CLV
        8 * avg_monthly_spend +  # Higher spend -> higher CLV
        25 * purchase_frequency +  # More frequent purchases -> higher CLV
        200 * email_opens +  # Engagement -> higher CLV
        150 * social_media +  # Social media followers -> higher CLV
        np.where(segments == 'Premium', 300, 
                np.where(segments == 'Standard', 100, 0)) +  # Segment premium
        np.random.normal(0, 100, n_customers)  # Random noise
    )
    
    # Floor CLV at $50 so it stays positive
    clv = np.maximum(clv, 50)
    
    # Create DataFrame
    df = pd.DataFrame({
        'customer_id': range(1, n_customers + 1),
        'age': age,
        'income': income,
        'months_active': months_active,
        'avg_monthly_spend': avg_monthly_spend,
        'purchase_frequency': purchase_frequency,
        'segment': segments,
        'email_open_rate': email_opens,
        'social_media_follower': social_media,
        'customer_lifetime_value': clv
    })
    
    return df

# Generate customer data
customer_data = generate_clv_data(1000)

print("CUSTOMER LIFETIME VALUE PREDICTION")
print("="*50)
print(f"Dataset shape: {customer_data.shape}")
print(f"\nFirst few rows:")
print(customer_data.head())

print(f"\nDataset statistics:")
print(customer_data.describe())

print(f"\nCustomer segments:")
print(customer_data['segment'].value_counts())

# Data visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# CLV distribution
axes[0, 0].hist(customer_data['customer_lifetime_value'], bins=30, alpha=0.7, color='skyblue')
axes[0, 0].set_title('CLV Distribution')
axes[0, 0].set_xlabel('Customer Lifetime Value ($)')

# CLV by segment
customer_data.boxplot(column='customer_lifetime_value', by='segment', ax=axes[0, 1])
axes[0, 1].set_title('CLV by Customer Segment')
axes[0, 1].set_ylabel('CLV ($)')

# Correlation with spending
axes[0, 2].scatter(customer_data['avg_monthly_spend'], customer_data['customer_lifetime_value'], alpha=0.6)
axes[0, 2].set_title('CLV vs Monthly Spend')
axes[0, 2].set_xlabel('Average Monthly Spend ($)')
axes[0, 2].set_ylabel('CLV ($)')

# Correlation with tenure
axes[1, 0].scatter(customer_data['months_active'], customer_data['customer_lifetime_value'], alpha=0.6)
axes[1, 0].set_title('CLV vs Tenure')
axes[1, 0].set_xlabel('Months Active')
axes[1, 0].set_ylabel('CLV ($)')

# Income relationship
axes[1, 1].scatter(customer_data['income'], customer_data['customer_lifetime_value'], alpha=0.6)
axes[1, 1].set_title('CLV vs Income')
axes[1, 1].set_xlabel('Income ($)')
axes[1, 1].set_ylabel('CLV ($)')

# Purchase frequency
axes[1, 2].scatter(customer_data['purchase_frequency'], customer_data['customer_lifetime_value'], alpha=0.6)
axes[1, 2].set_title('CLV vs Purchase Frequency')
axes[1, 2].set_xlabel('Monthly Purchase Frequency')
axes[1, 2].set_ylabel('CLV ($)')

fig.suptitle('')  # remove the automatic "Boxplot grouped by segment" figure title added by pandas
plt.tight_layout()
plt.show()

# Supervised Learning: Regression for Business Prediction

Supervised learning uses labeled training data to learn patterns that can predict outcomes for new data. **Regression** predicts continuous numerical values, making it ideal for forecasting sales, prices, revenues, costs, and other quantitative business metrics.

**Key Concepts:**
- **Features (X):** Input variables used to make predictions (customer age, company size, market conditions)
- **Target (y):** The outcome we want to predict (sales revenue, customer lifetime value, stock price)
- **Training Data:** Historical examples with known outcomes
- **Test Data:** Held-out data, not used for training, that is used to evaluate model performance
- **Overfitting:** Model memorizes training data but fails on new data
- **Generalization:** Model's ability to perform well on unseen data
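
Overfitting and generalization are easiest to see by comparing performance on training data with performance on held-out data. The sketch below does this for two decision trees on a small synthetic dataset. The data, the tree depths, and the variable names are assumptions made for illustration and are not part of the CLV example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data: a linear trend plus substantial noise (illustrative only)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=3.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

gaps = {}
for label, depth in [('shallow tree (max_depth=2)', 2), ('unrestricted tree', None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    train_r2 = tree.score(X_train, y_train)  # R² on data the model has seen
    test_r2 = tree.score(X_test, y_test)     # R² on held-out data
    gaps[label] = train_r2 - test_r2
    print(f"{label}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}, "
          f"gap = {gaps[label]:.3f}")
```

The unrestricted tree can memorize the noise in the training set, so its gap between training and test R² is the larger of the two. That gap is the signature of overfitting.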

**Common Regression Algorithms:**
- **Linear Regression:** Assumes linear relationships between features and target
- **Polynomial Regression:** Captures nonlinear relationships with polynomial terms
- **Regularized Regression:** Prevents overfitting with Lasso (L1) and Ridge (L2) penalties
- **Random Forest:** Ensemble method that combines many decision trees
- **Neural Networks:** Flexible models that can learn complex nonlinear patterns
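
The practical difference between the L1 and L2 penalties shows up in the fitted coefficients. The sketch below uses a small synthetic dataset (invented here, not the CLV data) in which only the first two of five features affect the target. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 5 features, but only the first two drive the target
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Penalized models are sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5).fit(X_scaled, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X_scaled, y)  # L2 penalty

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Features dropped by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

Lasso typically sets the coefficients of the irrelevant features to exactly zero, which makes it a form of feature selection. Ridge shrinks every coefficient toward zero but keeps all features in the model.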

**Business Success Metrics:**
- **Mean Absolute Error (MAE):** Average prediction error in original units
- **Root Mean Square Error (RMSE):** Penalizes large errors more heavily
- **R-squared (R²):** Proportion of variance explained by the model
- **Business Impact:** Revenue gained, costs saved, or efficiency improved
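
The three statistical metrics can be computed by hand in a few lines, which helps clarify what each one measures. The actual and predicted values below are made up for illustration. In practice the `sklearn.metrics` functions `mean_absolute_error`, `mean_squared_error`, and `r2_score` do the same work.

```python
import math

# Hypothetical actual and predicted CLV values in dollars (made up for illustration)
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 360.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(actual)

# MAE: average size of the errors, in dollars
mae = sum(abs(e) for e in errors) / n

# RMSE: square the errors before averaging, so large misses count more
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 minus (unexplained variation / total variation around the mean)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"R²:   {r2:.3f}")
```

Here the MAE is $22.50, while the RMSE is about $25.98, because the single $40 miss is weighted more heavily after squaring. The R² of 0.946 says the predictions account for most of the variation in the actual values.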

# Introduction to Machine Learning for Business

Machine Learning (ML) is changing how businesses make decisions, understand customers, and optimize operations. Unlike traditional statistical analysis, which focuses on understanding relationships, machine learning emphasizes **prediction** and **pattern recognition** from data.

**What Makes Machine Learning Different:**
- **Predictive Focus:** Primary goal is accurate prediction on new, unseen data
- **Pattern Discovery:** Automatically finds complex relationships in data
- **Scalability:** Handles large datasets with many variables
- **Adaptability:** Models can learn and improve as new data becomes available

**Key Business Applications:**
- **Customer Analytics:** Predicting customer behavior, churn, and lifetime value
- **Marketing Optimization:** Targeting, personalization, and campaign effectiveness
- **Risk Management:** Credit scoring, fraud detection, and operational risk
- **Operations:** Demand forecasting, inventory optimization, and quality control
- **Human Resources:** Talent acquisition, performance prediction, and retention
- **Financial Analysis:** Algorithmic trading, portfolio optimization, and market analysis

**The Machine Learning Workflow:**
1. **Problem Definition:** Clearly define the business question and success metrics
2. **Data Collection:** Gather relevant, high-quality data
3. **Data Preparation:** Clean, transform, and engineer features
4. **Model Selection:** Choose appropriate algorithms for the problem type
5. **Training & Validation:** Fit models and tune hyperparameters
6. **Evaluation:** Assess performance on unseen data
7. **Deployment:** Implement the model in business processes
8. **Monitoring:** Track performance and retrain as needed
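
Steps 3 through 6 of the workflow can be sketched compactly with scikit-learn. The example below is a minimal illustration on synthetic data, not a recommended production setup. The dataset, the choice of Ridge regression, and the `alpha` grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 2-3: synthetic stand-in for collected and prepared data (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

# Hold out a test set that is never touched during training or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: model selection. Scaling lives inside the pipeline so that it is
# re-fit on the training portion of each cross-validation fold only.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge())
])

# Step 5: training and validation with 5-fold cross-validation over alpha
search = GridSearchCV(
    pipeline,
    param_grid={'model__alpha': [0.1, 1.0, 10.0]},
    cv=5,
    scoring='r2'
)
search.fit(X_train, y_train)

# Step 6: evaluation on the held-out test set
test_r2 = search.score(X_test, y_test)

print(f"Best alpha: {search.best_params_['model__alpha']}")
print(f"Cross-validated R²: {search.best_score_:.3f}")
print(f"Test R²: {test_r2:.3f}")
```

Putting the scaler inside the pipeline keeps information from the validation folds out of the preprocessing step, so the cross-validated score is a fairer estimate of performance on unseen data.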

**Types of Machine Learning:**
- **Supervised Learning:** Learning from labeled examples (regression, classification)
- **Unsupervised Learning:** Finding patterns in unlabeled data (clustering, segmentation)  
- **Reinforcement Learning:** Learning through trial and error (optimization, game playing)