# Week 14: In-Class Exercise - ML Models: Classification & Regression

## Objective
Implement and compare multiple machine learning models using the Water Consumption dataset.

## Time: ~30 minutes

## Dataset
Water Consumption data from datos.gov.co - the same dataset we prepared in Week 13.

### What You Will Do:
1. Implement Linear Regression for consumption prediction
2. Implement Decision Tree models
3. Implement Random Forest models
4. Compare model performance using appropriate metrics

---

## Setup
Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, classification_report, confusion_matrix
)

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

In [None]:
# Load Water Consumption dataset from datos.gov.co
# Dataset: Consumo de Agua - contains water consumption records
url = "https://www.datos.gov.co/resource/k9gy-47jj.csv?$limit=10000"

df = pd.read_csv(url)

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Quick data inspection
df.head()

In [None]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nNumeric columns summary:")
df.describe()

---

## Data Preparation

Before building models, we need to:
1. Identify numeric and categorical columns
2. Handle missing values
3. Select features and target variable
4. Split data into train and test sets
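
The four steps above can be sketched on a tiny, made-up table before applying them to the real data. Everything in this sketch is hypothetical (the column names `estrato`, `usuarios`, `municipio`, `consumo`, the values, and the `toy_` variable names); it only illustrates the order of operations and is not the actual dataset schema.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset (11 rows, one missing value)
toy = pd.DataFrame({
    "estrato": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5],
    "usuarios": [10, 12, np.nan, 20, 25, 30, 11, 13, 18, 22, 27],
    "municipio": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C", "B"],
    "consumo": [100.0, 130.0, 160.0, 210.0, 260.0, 310.0,
                105.0, 140.0, 175.0, 220.0, 270.0],
})

# 1. Identify numeric and categorical columns
toy_numeric = toy.select_dtypes(include=[np.number]).columns.tolist()
toy_categorical = toy.select_dtypes(exclude=[np.number]).columns.tolist()

# 2. Handle missing values (here: drop rows with any missing numeric value)
toy_clean = toy[toy_numeric].dropna()

# 3. Select features and target
toy_X = toy_clean.drop(columns="consumo")
toy_y = toy_clean["consumo"]

# 4. Split into train and test sets (80% / 20%)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print("Numeric:", toy_numeric)
print("Categorical:", toy_categorical)
print("Rows after cleaning:", len(toy_clean))
print("Train / test sizes:", len(toy_X_train), len(toy_X_test))
```

Dropping rows is the simplest way to handle missing values; filling them in (imputation) is an alternative when too many rows would be lost.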

---

In [None]:
# Identify potential target and feature columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print("Numeric columns:")
for col in numeric_cols:
    print(f"  - {col}: min={df[col].min():.2f}, max={df[col].max():.2f}, mean={df[col].mean():.2f}")

print("\nCategorical columns:")
for col in categorical_cols:
    print(f"  - {col}: {df[col].nunique()} unique values")

In [None]:
# Select target variable (consumption-related column)
# Look for columns with names like: consumo, consumption, valor, cantidad
consumption_candidates = [col for col in numeric_cols if any(x in col.lower() for x in ['consumo', 'consumption', 'valor', 'cantidad', 'total'])]
print(f"Potential target columns: {consumption_candidates}")

# Select the target column (update based on your dataset)
if consumption_candidates:
    target_col = consumption_candidates[0]
else:
    # Use the column with highest variance if no clear target
    target_col = df[numeric_cols].var().idxmax()

print(f"\nSelected target column: {target_col}")
print("Target statistics:")
print(df[target_col].describe())

In [None]:
# Prepare features (X) and target (y)
# Remove the target column and non-predictive columns from features

# Feature columns: all numeric columns except target
feature_cols = [col for col in numeric_cols if col != target_col]

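# Remove columns that are IDs or codes (usually not useful for prediction)
# Match whole name parts (split on "_") so that a column such as "unidad" is not
# dropped just because it contains the substring "id"; adjust if your column names differ
id_patterns = ['id', 'codigo', 'code', 'key']
feature_cols = [col for col in feature_cols if not any(p in col.lower().split('_') for p in id_patterns)]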

print(f"Feature columns: {feature_cols}")
print(f"Number of features: {len(feature_cols)}")

In [None]:
# Prepare the data
# Drop rows with missing values in selected columns
df_clean = df[feature_cols + [target_col]].dropna()

X = df_clean[feature_cols]
y = df_clean[target_col]

print(f"Samples after cleaning: {len(X)}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Split data into training and testing sets
# 80% training, 20% testing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

---

## Part 1: Linear Regression (10 minutes)

Linear Regression finds the best linear relationship between features and target.

**Formula:** $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$
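
To make the formula concrete, the sketch below builds noise-free synthetic data from known coefficients and checks that `LinearRegression` recovers them. The values $\beta_0 = 3$, $\beta_1 = 2$, $\beta_2 = -1$ and the names `X_demo`, `y_demo`, `demo_model` are arbitrary choices for illustration, not part of the exercise data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated exactly from y = 3 + 2*x1 - 1*x2 (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)
print("beta_0 (intercept):", demo_model.intercept_)
print("beta_1, beta_2 (coefficients):", demo_model.coef_)

# A prediction is the formula applied to a new point: 3 + 2*1 - 1*2
x_new = np.array([[1.0, 2.0]])
manual_prediction = demo_model.intercept_ + x_new @ demo_model.coef_
print("Manual prediction:", manual_prediction)
print("model.predict:    ", demo_model.predict(x_new))
```

On real data the relationship is rarely exact, so the fitted coefficients are estimates rather than the true values.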

---

### Task 1.1: Train a Linear Regression Model

Remember the sklearn pattern:
1. Import (done above)
2. Create the model
3. Fit on training data
4. Predict on test data
5. Evaluate
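
The same pattern applies to every scikit-learn estimator, which is what makes the models in this exercise easy to compare. Below is a minimal sketch on synthetic data from `make_regression`; the helper `fit_and_score` and the variable names are hypothetical and used only for illustration.

```python
# 1. Import
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the water consumption dataset)
Xs, ys = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=0
)


def fit_and_score(model):
    """Hypothetical helper: fit on train, predict on test, return test R-squared."""
    model.fit(Xs_train, ys_train)         # 3. Fit on training data
    preds = model.predict(Xs_test)        # 4. Predict on test data
    return r2_score(ys_test, preds)       # 5. Evaluate


# 2. Create the models; only this step differs between estimators
candidates = [LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)]
scores = {type(m).__name__: fit_and_score(m) for m in candidates}

for name, score in scores.items():
    print(f"{name}: test R-squared = {score:.4f}")
```

Only the line that creates the model changes from one estimator to the next; fitting, predicting, and scoring stay identical.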

In [None]:
# Task 1.1: Create and train Linear Regression model

# Step 1: Create the model
# YOUR CODE HERE
lr_model = ___

# Step 2: Fit the model on training data
# YOUR CODE HERE: lr_model.fit(X_train, y_train)
___

print("Linear Regression model trained!")

In [None]:
# Step 3: Make predictions
# YOUR CODE HERE
y_pred_lr = ___

print(f"First 5 predictions: {y_pred_lr[:5]}")
print(f"First 5 actual values: {y_test.values[:5]}")

In [None]:
# Step 4: Evaluate the model
# For regression, we use: RMSE, MAE, and R-squared

# Calculate metrics
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("=== LINEAR REGRESSION RESULTS ===")
print(f"RMSE (Root Mean Squared Error): {rmse_lr:.4f}")
print(f"MAE (Mean Absolute Error): {mae_lr:.4f}")
print(f"R-squared: {r2_lr:.4f}")
print(f"\nInterpretation: The model explains {r2_lr*100:.1f}% of the variance in the target.")

In [None]:
# Visualize: Actual vs Predicted
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(y_test, y_pred_lr, alpha=0.5, color='steelblue')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')

ax.set_xlabel('Actual Values', fontsize=12)
ax.set_ylabel('Predicted Values', fontsize=12)
ax.set_title(f'Linear Regression: Actual vs Predicted\nR-squared = {r2_lr:.4f}', fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

---

## Part 2: Decision Tree (10 minutes)

Decision Trees make predictions by learning simple decision rules from the data.

**Key parameters:**
- `max_depth`: Maximum depth of the tree (limiting it helps prevent overfitting)
- `min_samples_split`: Minimum samples required to split a node
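
To see what `max_depth` does, the sketch below fits trees of different depths to synthetic noisy data and prints training and test R-squared side by side. The data (a noisy sine curve) and all variable names are assumptions made for illustration. Training R-squared cannot decrease as the tree grows deeper; the number to watch is the gap to the test score.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data: a sine curve plus noise
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=400)

X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

depth_results = {}
for depth in [2, 5, None]:  # None lets the tree grow until it fits the training data
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_syn_train, y_syn_train)
    train_r2 = tree.score(X_syn_train, y_syn_train)  # .score() returns R-squared
    test_r2 = tree.score(X_syn_test, y_syn_test)
    depth_results[depth] = (train_r2, test_r2, tree.get_depth())
    print(f"max_depth={depth}: actual depth={tree.get_depth()}, "
          f"train R2={train_r2:.3f}, test R2={test_r2:.3f}, gap={train_r2 - test_r2:.3f}")
```

`min_samples_split` acts as a similar brake: a larger value stops the tree from splitting nodes that contain only a few samples.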

---

### Task 2.1: Train a Decision Tree Regressor

In [None]:
# Task 2.1: Create and train Decision Tree model

# Step 1: Create the model with max_depth=5 to prevent overfitting
# YOUR CODE HERE: dt_model = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_model = ___

# Step 2: Fit the model
# YOUR CODE HERE
___

print("Decision Tree model trained!")

In [None]:
# Step 3: Make predictions
# YOUR CODE HERE
y_pred_dt = ___

# Step 4: Evaluate
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

print("=== DECISION TREE RESULTS ===")
print(f"RMSE: {rmse_dt:.4f}")
print(f"MAE: {mae_dt:.4f}")
print(f"R-squared: {r2_dt:.4f}")

In [None]:
# Check for overfitting: Compare train vs test performance
y_pred_train_dt = dt_model.predict(X_train)
r2_train_dt = r2_score(y_train, y_pred_train_dt)

print(f"Training R-squared: {r2_train_dt:.4f}")
print(f"Test R-squared: {r2_dt:.4f}")
print(f"Gap: {r2_train_dt - r2_dt:.4f}")

if r2_train_dt - r2_dt > 0.1:
    print("\nWarning: Possible overfitting! Consider reducing max_depth.")
else:
    print("\nGood: No significant overfitting detected.")

---

## Part 3: Random Forest (10 minutes)

Random Forest trains many Decision Trees, each on a bootstrap sample of the training data, and averages their predictions. Averaging reduces the overfitting of any single tree.

**Key parameters:**
- `n_estimators`: Number of trees in the forest
- `max_depth`: Maximum depth of each tree
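
The sketch below shows how the two parameters map onto the fitted model: `n_estimators` sets how many trees are stored in `estimators_`, `max_depth` caps each tree, and the forest's prediction is the average of the individual tree predictions. The synthetic data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(7)
X_rf = rng.uniform(0, 10, size=(400, 3))
y_rf = np.sin(X_rf[:, 0]) + 0.5 * X_rf[:, 1] + rng.normal(scale=0.3, size=400)

X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=7
)

forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=7)
forest.fit(X_rf_train, y_rf_train)

# n_estimators -> number of fitted trees; max_depth -> cap on each tree's depth
n_trees = len(forest.estimators_)
deepest = max(t.get_depth() for t in forest.estimators_)
print(f"Trees in forest: {n_trees}, deepest tree: {deepest}")

# The forest prediction is the mean of the per-tree predictions
per_tree = np.stack([t.predict(X_rf_test) for t in forest.estimators_])
avg_of_trees = per_tree.mean(axis=0)
print("Matches forest.predict:", np.allclose(avg_of_trees, forest.predict(X_rf_test)))
```

More trees generally give more stable predictions at the cost of longer training time.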

---

### Task 3.1: Train a Random Forest Regressor

In [None]:
# Task 3.1: Create and train Random Forest model

# Step 1: Create the model
# YOUR CODE HERE: rf_model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
rf_model = ___

# Step 2: Fit the model
# YOUR CODE HERE
___

print("Random Forest model trained!")

In [None]:
# Step 3: Make predictions
# YOUR CODE HERE
y_pred_rf = ___

# Step 4: Evaluate
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("=== RANDOM FOREST RESULTS ===")
print(f"RMSE: {rmse_rf:.4f}")
print(f"MAE: {mae_rf:.4f}")
print(f"R-squared: {r2_rf:.4f}")

In [None]:
# Feature Importance from Random Forest
# This tells us which features are most important for predictions

feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(feature_importance)

In [None]:
# Visualize Feature Importance
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.barh(feature_importance['feature'], feature_importance['importance'], color='steelblue')
ax.set_xlabel('Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_title('Random Forest: Feature Importance', fontsize=14)
ax.invert_yaxis()  # Highest importance at top

plt.tight_layout()
plt.show()

---

## Part 4: Model Comparison

Compare all three models side by side.

---

In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'RMSE': [rmse_lr, rmse_dt, rmse_rf],
    'MAE': [mae_lr, mae_dt, mae_rf],
    'R-squared': [r2_lr, r2_dt, r2_rf]
})

print("=== MODEL COMPARISON ===")
print(comparison.to_string(index=False))

# Identify best model
best_model = comparison.loc[comparison['R-squared'].idxmax(), 'Model']
print(f"\nBest Model (highest R-squared): {best_model}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

models = ['Linear Regression', 'Decision Tree', 'Random Forest']
colors = ['steelblue', 'coral', 'seagreen']

# RMSE comparison
axes[0].bar(models, [rmse_lr, rmse_dt, rmse_rf], color=colors)
axes[0].set_ylabel('RMSE (lower is better)', fontsize=11)
axes[0].set_title('RMSE Comparison', fontsize=13)
axes[0].tick_params(axis='x', rotation=15)

# MAE comparison
axes[1].bar(models, [mae_lr, mae_dt, mae_rf], color=colors)
axes[1].set_ylabel('MAE (lower is better)', fontsize=11)
axes[1].set_title('MAE Comparison', fontsize=13)
axes[1].tick_params(axis='x', rotation=15)

# R-squared comparison
axes[2].bar(models, [r2_lr, r2_dt, r2_rf], color=colors)
axes[2].set_ylabel('R-squared (higher is better)', fontsize=11)
axes[2].set_title('R-squared Comparison', fontsize=13)
axes[2].tick_params(axis='x', rotation=15)
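# R-squared can be negative on test data, so extend the lower limit when needed
axes[2].set_ylim(min(0, r2_lr, r2_dt, r2_rf), 1)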

plt.tight_layout()
plt.show()

In [None]:
# Predictions comparison plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

predictions = [y_pred_lr, y_pred_dt, y_pred_rf]
r2_scores = [r2_lr, r2_dt, r2_rf]

for ax, pred, model, r2, color in zip(axes, predictions, models, r2_scores, colors):
    ax.scatter(y_test, pred, alpha=0.5, color=color, s=30)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
    ax.set_xlabel('Actual', fontsize=11)
    ax.set_ylabel('Predicted', fontsize=11)
    ax.set_title(f'{model}\nR-squared = {r2:.4f}', fontsize=12)

plt.tight_layout()
plt.show()

---

## Summary

In this exercise, you learned:

1. **Linear Regression**
   - Finds linear relationships between features and target
   - Simple and interpretable
   - Works best when relationships are truly linear

2. **Decision Tree**
   - Learns decision rules from data
   - Can capture non-linear patterns
   - Prone to overfitting without depth limits

3. **Random Forest**
   - Ensemble of many decision trees
   - Usually more robust and accurate than a single tree
   - Provides feature importance

4. **Regression Metrics**
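   - RMSE: Square root of the mean squared error (same units as target; penalizes large errors more heavily)
   - MAE: Mean absolute error (same units as target; less sensitive to outliers)
   - R-squared: Proportion of variance explained (at most 1; can be negative on test data when the model does worse than always predicting the mean)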
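
As a worked example of the three metrics, the sketch below computes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The numbers in `y_true` and `y_hat` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 11.0, 15.0, 19.0])

errors = y_true - y_hat                      # [-1, 1, -1, -3]

rmse = np.sqrt(np.mean(errors ** 2))         # sqrt((1 + 1 + 1 + 9) / 4) = sqrt(3)
mae = np.mean(np.abs(errors))                # (1 + 1 + 1 + 3) / 4 = 1.5

ss_res = np.sum(errors ** 2)                 # residual sum of squares = 12
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                     # 1 - 12/20 = 0.4

print(f"RMSE: {rmse:.4f}  (sklearn: {np.sqrt(mean_squared_error(y_true, y_hat)):.4f})")
print(f"MAE:  {mae:.4f}  (sklearn: {mean_absolute_error(y_true, y_hat):.4f})")
print(f"R2:   {r2:.4f}  (sklearn: {r2_score(y_true, y_hat):.4f})")
```

RMSE (about 1.73) is larger than MAE (1.5) here because the single error of 3 is squared before averaging, which is why RMSE is the more outlier-sensitive of the two.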

---

*End of Exercise*