# Week 14 Workshop: ML Models - Classification & Regression

## Water Consumption Dataset

### Objectives
1. Build and compare 3+ machine learning models
2. Create a confusion matrix (classification) or a residual analysis (regression)
3. Calculate and visualize feature importance
4. Document model selection rationale

### Duration: 2-3 hours

---

## Setup

Run this cell to load all necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import (
    # Regression metrics
    mean_squared_error, mean_absolute_error, r2_score,
    # Classification metrics
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

In [None]:
# Load Water Consumption dataset from datos.gov.co
url = "https://www.datos.gov.co/resource/k9gy-47jj.csv?$limit=10000"

df = pd.read_csv(url)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print("\nColumns:")
print(df.columns.tolist())

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types
print("Data Types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
# Numeric summary
df.describe()

---

# Part 1: Data Preparation

Prepare your data for modeling.

---

## 1.1 Identify Columns

Identify which columns are numeric and which are categorical.

In [None]:
# Identify column types
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print("Numeric columns:")
for col in numeric_cols:
    print(f"  - {col}")

print("\nCategorical columns:")
for col in categorical_cols:
    print(f"  - {col}: {df[col].nunique()} unique values")

## 1.2 Select Target Variable

Choose your target variable:
- For **regression**: Choose a continuous numeric column (e.g., consumption amount)
- For **classification**: Choose a categorical column or create categories from a numeric column
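
As one way to create categories from a numeric column, the sketch below uses `pd.qcut` to split values into three bins of roughly equal size. The values and the labels `Low`, `Medium`, and `High` are made up for illustration; the real column depends on your dataset.

```python
import pandas as pd

# Hypothetical consumption values (not from the workshop dataset)
consumption = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], name="consumption")

# qcut picks bin edges from the data's quantiles, so each bin
# receives roughly the same number of rows
consumption_cat = pd.qcut(consumption, q=3, labels=["Low", "Medium", "High"])

print(consumption_cat.value_counts().sort_index())
```

Quantile-based bins keep the classes balanced, which makes accuracy easier to interpret than it would be with very uneven classes.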

In [None]:
# YOUR CODE HERE: Select the target column
# Look for consumption-related columns

# Find potential target columns
consumption_candidates = [col for col in numeric_cols if any(x in col.lower() for x in ['consumo', 'consumption', 'valor', 'cantidad', 'total'])]
print(f"Potential target columns: {consumption_candidates}")

# Select target (UPDATE this based on your dataset)
target_col = ___  # e.g., consumption_candidates[0] or a specific column name

print(f"\nSelected target: {target_col}")
print(df[target_col].describe())

## 1.3 Handle Missing Values

Check for and handle missing values in your data.
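
The three options in the next cell behave quite differently, so it can help to try them on a tiny example first. This is a minimal sketch on a made-up frame; the column names `consumption` and `zone` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (column names are made up)
toy = pd.DataFrame({
    "consumption": [12.0, np.nan, 15.0, 11.0, 40.0],
    "zone": ["urban", "rural", None, "urban", "urban"],
})

# Option 1: drop any row that has a missing value
dropped = toy.dropna()

# Options 2 and 3: fill numeric gaps with the median and
# categorical gaps with the most frequent value
filled = toy.copy()
filled["consumption"] = filled["consumption"].fillna(filled["consumption"].median())
filled["zone"] = filled["zone"].fillna(filled["zone"].mode()[0])

print(f"Rows kept by dropna: {len(dropped)} of {len(toy)}")
print(filled)
```

Dropping rows is simplest but loses data. Filling keeps every row, and the median is less sensitive to extreme values (such as the 40.0 here) than the mean.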

In [None]:
# Check missing values
print("Missing values per column:")
missing = df.isnull().sum()
print(missing[missing > 0])

# YOUR CODE HERE: Handle missing values
# Option 1: Drop rows with missing values
# df_clean = df.dropna()

# Option 2: Fill with mean/median (for numeric)
# df['column'] = df['column'].fillna(df['column'].mean())

# Option 3: Fill with mode (for categorical)
# df['column'] = df['column'].fillna(df['column'].mode()[0])

df_clean = ___

print(f"\nRows before cleaning: {len(df)}")
print(f"Rows after cleaning: {len(df_clean)}")

## 1.4 Select Features

Choose which columns to use as features (X).
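
When filtering out ID-like columns by name, substring matching on a short pattern such as `id` can also drop legitimate features. The sketch below compares substring matching with one possible alternative that matches whole name tokens; the column names and the helper `looks_like_id` are hypothetical.

```python
import re

id_patterns = {"id", "codigo", "code", "key"}

def looks_like_id(column_name: str) -> bool:
    """Return True if any underscore-, space-, or hyphen-separated token is an ID word."""
    tokens = re.split(r"[_\s\-]+", column_name.lower())
    return any(token in id_patterns for token in tokens)

# Hypothetical column names
columns = ["id", "codigo_municipio", "humidity", "unidades", "user_key"]

substring_hits = [c for c in columns if any(p in c.lower() for p in id_patterns)]
token_hits = [c for c in columns if looks_like_id(c)]

print("Dropped by substring match:", substring_hits)
print("Dropped by token match:", token_hits)
```

Either way, it is worth reading the printed feature list and confirming that no useful column was removed.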

In [None]:
# YOUR CODE HERE: Select feature columns
# Remove target column and any ID/code columns from features

# Get numeric columns (excluding target)
feature_cols = [col for col in numeric_cols if col != target_col]

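# Remove ID-like columns
# Note: this is a substring match, so a short pattern such as 'id' also matches
# names like 'humidity'; review the printed list and adjust id_patterns if needed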
id_patterns = ['id', 'codigo', 'code', 'key']
feature_cols = [col for col in feature_cols if not any(p in col.lower() for p in id_patterns)]

print(f"Selected features: {feature_cols}")
print(f"Number of features: {len(feature_cols)}")

## 1.5 Prepare X and y

In [None]:
# Prepare features (X) and target (y)
X = df_clean[feature_cols]
y = df_clean[target_col]

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

## 1.6 Train-Test Split

In [None]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

---

# Part 2: Build and Compare 3+ Models

Implement at least 3 different models and compare their performance.

---

## Model 1: Linear Regression

Start with the simplest model as a baseline.
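
The three metrics reported for each model in this part (RMSE, MAE, and R-squared) can be computed by hand, which makes their meaning concrete. The sketch below uses made-up actual and predicted values and only the standard library.

```python
import math

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 11.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

# RMSE: square root of the mean squared error (penalizes large errors more)
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# MAE: mean of the absolute errors (same units as the target)
mae = sum(abs(e) for e in errors) / n

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}  MAE: {mae:.4f}  R-squared: {r2:.4f}")
```

Because R-squared compares the model against always predicting the mean, it can be negative on test data when the model does worse than that baseline.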

In [None]:
# Model 1: Linear Regression
# YOUR CODE HERE

# Step 1: Create the model
model_1 = ___

# Step 2: Train the model
___

# Step 3: Make predictions
y_pred_1 = ___

# Step 4: Calculate metrics
rmse_1 = np.sqrt(mean_squared_error(y_test, y_pred_1))
mae_1 = mean_absolute_error(y_test, y_pred_1)
r2_1 = r2_score(y_test, y_pred_1)

print("=== MODEL 1: LINEAR REGRESSION ===")
print(f"RMSE: {rmse_1:.4f}")
print(f"MAE: {mae_1:.4f}")
print(f"R-squared: {r2_1:.4f}")

## Model 2: Decision Tree

A tree-based model that can capture non-linear patterns.
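
To see why `max_depth` matters, the sketch below fits a shallow tree and an unrestricted tree on synthetic non-linear data and compares train and test scores. The data and the chosen depths are assumptions for illustration, not the workshop dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (an assumption, not the workshop dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scores = {}
for depth in (3, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, test R2={scores[depth][1]:.3f}")
```

An unrestricted tree typically fits the training set almost perfectly. A large gap between its train and test scores is a sign of overfitting.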

In [None]:
# Model 2: Decision Tree
# YOUR CODE HERE

# Step 1: Create the model (use max_depth to prevent overfitting)
model_2 = ___

# Step 2: Train the model
___

# Step 3: Make predictions
y_pred_2 = ___

# Step 4: Calculate metrics
rmse_2 = np.sqrt(mean_squared_error(y_test, y_pred_2))
mae_2 = mean_absolute_error(y_test, y_pred_2)
r2_2 = r2_score(y_test, y_pred_2)

print("=== MODEL 2: DECISION TREE ===")
print(f"RMSE: {rmse_2:.4f}")
print(f"MAE: {mae_2:.4f}")
print(f"R-squared: {r2_2:.4f}")

## Model 3: Random Forest

An ensemble of decision trees. Averaging the predictions of many trees usually reduces overfitting compared with a single tree.
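
The sketch below makes the "ensemble" idea concrete: for regression, the forest's prediction equals the mean of its individual trees' predictions. The data is synthetic, and `n_estimators=25` is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption, not the workshop dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in estimators_; the forest's prediction
# for regression is the mean of the individual tree predictions
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Forest prediction:", forest.predict(X_demo[:5]).round(3))
print("Mean of trees:    ", manual_average.round(3))
```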

In [None]:
# Model 3: Random Forest
# YOUR CODE HERE

# Step 1: Create the model
model_3 = ___

# Step 2: Train the model
___

# Step 3: Make predictions
y_pred_3 = ___

# Step 4: Calculate metrics
rmse_3 = np.sqrt(mean_squared_error(y_test, y_pred_3))
mae_3 = mean_absolute_error(y_test, y_pred_3)
r2_3 = r2_score(y_test, y_pred_3)

print("=== MODEL 3: RANDOM FOREST ===")
print(f"RMSE: {rmse_3:.4f}")
print(f"MAE: {mae_3:.4f}")
print(f"R-squared: {r2_3:.4f}")

## (Optional) Model 4: Gradient Boosting or KNN

Add a 4th model for extra credit.
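
If you choose KNN, keep in mind that it relies on distances, so features on large scales can dominate the result. The sketch below compares KNN with and without `StandardScaler` on synthetic data built to show the effect. The data and the `make_pipeline` approach are assumptions for illustration, not a required part of the workshop.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale and one
# pure-noise feature on a much larger scale (both are assumptions)
rng = np.random.default_rng(42)
informative = rng.uniform(0, 1, size=500)
noise_feature = rng.uniform(0, 1000, size=500)
X_demo = np.column_stack([informative, noise_feature])
y_demo = 10 * informative + rng.normal(0, 0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation before every prediction
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(X_tr, y_tr)

r2_raw = knn_raw.score(X_te, y_te)
r2_scaled = knn_scaled.score(X_te, y_te)
print(f"KNN without scaling: R2={r2_raw:.3f}")
print(f"KNN with scaling:    R2={r2_scaled:.3f}")
```

Tree-based models such as Gradient Boosting split on one feature at a time, so they generally do not need this scaling step.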

In [None]:
# Model 4: (Optional) Gradient Boosting or KNN
# YOUR CODE HERE

# model_4 = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=42)
# OR
# model_4 = KNeighborsRegressor(n_neighbors=5)



## Model Comparison Table

In [None]:
# Create comparison table
# YOUR CODE HERE: Add all models you trained

comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'RMSE': [rmse_1, rmse_2, rmse_3],
    'MAE': [mae_1, mae_2, mae_3],
    'R-squared': [r2_1, r2_2, r2_3]
})

print("=== MODEL COMPARISON ===")
print(comparison.to_string(index=False))

# Identify best model
best_idx = comparison['R-squared'].idxmax()
best_model_name = comparison.loc[best_idx, 'Model']
print(f"\nBest Model (highest R-squared): {best_model_name}")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

models = comparison['Model'].tolist()
colors = ['steelblue', 'coral', 'seagreen']

# RMSE comparison
axes[0].bar(models, comparison['RMSE'], color=colors)
axes[0].set_ylabel('RMSE (lower is better)', fontsize=11)
axes[0].set_title('RMSE Comparison', fontsize=13)
axes[0].tick_params(axis='x', rotation=20)

# MAE comparison
axes[1].bar(models, comparison['MAE'], color=colors)
axes[1].set_ylabel('MAE (lower is better)', fontsize=11)
axes[1].set_title('MAE Comparison', fontsize=13)
axes[1].tick_params(axis='x', rotation=20)

# R-squared comparison
axes[2].bar(models, comparison['R-squared'], color=colors)
axes[2].set_ylabel('R-squared (higher is better)', fontsize=11)
axes[2].set_title('R-squared Comparison', fontsize=13)
axes[2].tick_params(axis='x', rotation=20)
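# R-squared can be negative on test data, so extend the lower limit when needed
axes[2].set_ylim(min(0, comparison['R-squared'].min()), 1)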

plt.tight_layout()
plt.show()

---

# Part 3: Confusion Matrix / Residual Analysis

**For Regression:** Create residual analysis.

**For Classification:** Create confusion matrix.

---

## Option A: Residual Analysis (for Regression)

Analyze the residuals (errors) of your best model.

In [None]:
# Residual analysis for your best model (Random Forest, y_pred_3, is used as the default)
# YOUR CODE HERE: Use the predictions from your best model

# Calculate residuals
residuals = y_test - y_pred_3  # Update y_pred_3 to your best model's predictions

print("Residual Statistics:")
print(f"Mean: {residuals.mean():.4f} (should be close to 0)")
print(f"Std: {residuals.std():.4f}")
print(f"Min: {residuals.min():.4f}")
print(f"Max: {residuals.max():.4f}")

In [None]:
# Visualize residuals (swap y_pred_3 for your best model's predictions if it differs)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Residual distribution (ideally roughly symmetric and centered on 0)
axes[0].hist(residuals, bins=30, color='steelblue', edgecolor='white', alpha=0.7)
axes[0].axvline(0, color='red', linestyle='--', linewidth=2)
axes[0].set_xlabel('Residual', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Residual Distribution', fontsize=13)

# 2. Residuals vs Predicted (should show no pattern)
axes[1].scatter(y_pred_3, residuals, alpha=0.5, color='steelblue')
axes[1].axhline(0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Values', fontsize=11)
axes[1].set_ylabel('Residuals', fontsize=11)
axes[1].set_title('Residuals vs Predicted', fontsize=13)

# 3. Actual vs Predicted
axes[2].scatter(y_test, y_pred_3, alpha=0.5, color='steelblue')
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[2].set_xlabel('Actual Values', fontsize=11)
axes[2].set_ylabel('Predicted Values', fontsize=11)
axes[2].set_title('Actual vs Predicted', fontsize=13)

plt.tight_layout()
plt.show()

### Residual Interpretation

**Write your interpretation below:**

*What do the residual plots tell you about your model's performance?*

- Is the residual distribution approximately normal?
- Are there any patterns in the residuals vs predicted plot?
- Are there any obvious outliers?
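
The three questions above can also be checked numerically to support what you see in the plots. The sketch below uses synthetic residuals with two injected extreme errors. The 3-standard-deviation rule for outliers is one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Synthetic residuals and predictions (assumptions for illustration):
# mostly well-behaved noise plus two injected extreme errors
rng = np.random.default_rng(42)
predicted = pd.Series(rng.uniform(10, 100, size=502))
resid = pd.Series(np.append(rng.normal(0, 1, size=500), [10.0, -12.0]))

# 1. Shape: skewness near 0 suggests a roughly symmetric distribution
print(f"Skewness: {resid.skew():.3f}")

# 2. Pattern: correlation between |residual| and prediction; values far
#    from 0 hint that the error size changes with the predicted value
spread_corr = resid.abs().corr(predicted)
print(f"Corr(|residual|, predicted): {spread_corr:.3f}")

# 3. Outliers: flag residuals more than 3 standard deviations from the mean
z_scores = (resid - resid.mean()) / resid.std()
outliers = resid[z_scores.abs() > 3]
print(f"Flagged outliers: {len(outliers)}")
```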

*YOUR INTERPRETATION HERE*

---

## Option B: Confusion Matrix (for Classification)

If you converted this to a classification problem, create a confusion matrix.

**Example:** Convert consumption to categories (Low, Medium, High)
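
One detail to watch with string categories: `confusion_matrix` sorts class labels alphabetically unless told otherwise, so the rows and columns may not follow the Low, Medium, High order you expect. The sketch below shows both orderings on made-up labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted categories
y_true_cat = ["Low", "Low", "Medium", "Medium", "High", "High", "High"]
y_pred_cat = ["Low", "Medium", "Medium", "Medium", "High", "Medium", "High"]

# Without labels=, scikit-learn sorts string classes alphabetically:
# High, Low, Medium
cm_default = confusion_matrix(y_true_cat, y_pred_cat)

# Passing labels= fixes the row/column order to Low, Medium, High
order = ["Low", "Medium", "High"]
cm_ordered = confusion_matrix(y_true_cat, y_pred_cat, labels=order)

print("Alphabetical order (High, Low, Medium):")
print(cm_default)
print("Explicit order (Low, Medium, High):")
print(cm_ordered)
```

Passing the same `order` list to the heatmap's tick labels keeps the plot consistent with the matrix.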

In [None]:
# OPTIONAL: Convert to classification problem
# Create consumption categories based on terciles (the 33rd and 66th percentiles)

# def categorize_consumption(value, thresholds):
#     if value < thresholds[0]:
#         return 'Low'
#     elif value < thresholds[1]:
#         return 'Medium'
#     else:
#         return 'High'

# thresholds = [y.quantile(0.33), y.quantile(0.66)]
# y_cat = y.apply(lambda x: categorize_consumption(x, thresholds))

# Then train classification models and create confusion matrix...


In [None]:
# OPTIONAL: Confusion Matrix Visualization
# YOUR CODE HERE (if doing classification)

# labels = ['Low', 'Medium', 'High']  # fix the order; the default is alphabetical
# cm = confusion_matrix(y_test_cat, y_pred_cat, labels=labels)
# 
# plt.figure(figsize=(8, 6))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
#             xticklabels=['Low', 'Medium', 'High'],
#             yticklabels=['Low', 'Medium', 'High'])
# plt.xlabel('Predicted', fontsize=12)
# plt.ylabel('Actual', fontsize=12)
# plt.title('Confusion Matrix', fontsize=14)
# plt.show()


---

# Part 4: Feature Importance

Calculate and visualize which features are most important for predictions.
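
The `feature_importances_` attribute of tree-based models is computed from the training process. As a cross-check, the sketch below compares it with permutation importance measured on held-out data, using scikit-learn's `permutation_importance`. The data and feature names are synthetic assumptions, built so that only the first feature matters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature drives the target (an assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(0, 0.5, size=400)
names = ["driver", "noise_a", "noise_b"]  # hypothetical feature names

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances come from the training process and sum to 1
impurity = forest.feature_importances_

# Permutation importance shuffles one feature at a time on held-out data
# and measures how much the score drops
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

for name, imp, p in zip(names, impurity, perm.importances_mean):
    print(f"{name}: impurity={imp:.3f}, permutation={p:.3f}")
```

When the two rankings agree, that gives more confidence in the interpretation. When they disagree, the feature in question is worth a closer look.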

---

In [None]:
# Feature Importance from Random Forest (or your best tree-based model)
# YOUR CODE HERE: Use model_3 (Random Forest) or model_2 (Decision Tree)

# Get feature importances
importances = ___  # e.g., model_3.feature_importances_

# Create DataFrame
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Feature Importance Ranking:")
print(feature_importance.to_string(index=False))

In [None]:
# Visualize Feature Importance
fig, ax = plt.subplots(figsize=(10, 6))

# Horizontal bar chart
bars = ax.barh(feature_importance['feature'], feature_importance['importance'], 
               color='steelblue', edgecolor='white')

ax.set_xlabel('Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_title('Feature Importance (Random Forest)', fontsize=14)
ax.invert_yaxis()  # Highest importance at top

# Add value labels
for bar, imp in zip(bars, feature_importance['importance']):
    ax.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
            f'{imp:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

### Feature Importance Interpretation

**Write your interpretation below:**

*Which features are most important for predicting water consumption?*

*Do these results make sense from a domain perspective?*

*YOUR INTERPRETATION HERE*

---


# Part 5: Model Selection Rationale

Document your analysis and model recommendation.

---

## Summary of Results

**Complete the table below with your results:**

| Model | RMSE | MAE | R-squared | Notes |
|-------|------|-----|-----------|-------|
| Linear Regression | ___ | ___ | ___ | Baseline model |
| Decision Tree | ___ | ___ | ___ | ___ |
| Random Forest | ___ | ___ | ___ | ___ |
| (Optional 4th) | ___ | ___ | ___ | ___ |

---

## Model Selection Rationale

**Write your 1-page analysis below. Address each section:**

### 1. Models Tested and Performance

*Summarize what models you tested and how they performed.*

*YOUR ANALYSIS HERE*

### 2. Recommended Model

*Which model do you recommend and WHY?*

*Consider: accuracy, interpretability, speed, risk of overfitting*

*YOUR RECOMMENDATION HERE*

### 3. Trade-offs

*What are the trade-offs of your chosen model?*

*For example: Random Forest is often more accurate but less interpretable than Linear Regression*

*YOUR ANALYSIS HERE*

### 4. Future Improvements

*What would you try next to improve performance?*

*Ideas: more features, feature engineering, hyperparameter tuning, more data, etc.*
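
As a starting point for the hyperparameter tuning idea, the sketch below runs a small grid search with cross-validation using scikit-learn's `GridSearchCV`. The data is synthetic, and the grid values are hypothetical and deliberately small so the search runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption, not the workshop dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=200)

# A deliberately small, hypothetical grid to keep the search fast
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation on the data passed to fit
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R2: {search.best_score_:.3f}")
```

In this workshop's setup, the search would be fit on `X_train` and `y_train` only, so the test set stays untouched for the final evaluation.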

*YOUR IDEAS HERE*

---


## Final Checklist

Before submitting, verify:

- [ ] All cells have been executed (Kernel > Restart & Run All)
- [ ] Part 1: Data is properly prepared
- [ ] Part 2: At least 3 models implemented and compared
- [ ] Part 3: Confusion matrix or residual analysis completed
- [ ] Part 4: Feature importance calculated and visualized
- [ ] Part 5: Model selection rationale written

---

*Week 14 Workshop - Data Analytics Course - Universidad Cooperativa de Colombia*