# Week 13: In-Class Exercise - Introduction to Machine Learning

## Objective
Prepare data for machine learning and build your first prediction model using the Water Consumption dataset.

## Time: ~30 minutes

## Dataset
Water Consumption (HISTORICO_CONSUMO) - the same dataset we explored in Weeks 4 and 5.

### What You Will Do:
1. Define a prediction problem
2. Prepare features (X) and target (y)
3. Split data into training and testing sets
4. Scale features using StandardScaler
5. Build your first Decision Tree model

---

## Setup

Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Load the Water Consumption dataset
url = "https://www.datos.gov.co/api/views/wcpc-hgdr/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Quick data inspection
df.head()

In [None]:
# Check data types
df.info()

---

## Part 1: Define the Prediction Problem (5 minutes)

**Key Question:** What do we want to predict?

For this exercise, we will create a **classification** problem:

**Goal:** Predict whether a municipality has **HIGH** or **LOW** water consumption.

We will define:
- **HIGH consumption:** Above the median consumption
- **LOW consumption:** Below or equal to the median consumption
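
As a small illustration of this rule, the sketch below applies the same median threshold to a made-up series. The values are hypothetical and not taken from the dataset. Values equal to the median fall into LOW, which is one reason the two classes may not split exactly 50/50.

```python
import pandas as pd

# Hypothetical consumption values, for illustration only
toy = pd.Series([10, 20, 30, 30, 50, 80])

threshold = toy.median()                 # middle of the sorted values
level = (toy > threshold).astype(int)    # 1 = HIGH, 0 = LOW (ties go to LOW)

print(f"Median: {threshold}")
print(level.value_counts().sort_index())
```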

---

### Task 1.1: Create the Target Variable

First, let's look at the consumption distribution and create our target variable.

In [None]:
# Check the consumption column
consumption_col = 'CONSUMO_FACTURADO'

print(f"Consumption statistics:")
print(f"  Mean: {df[consumption_col].mean():,.2f}")
print(f"  Median: {df[consumption_col].median():,.2f}")
print(f"  Min: {df[consumption_col].min():,.2f}")
print(f"  Max: {df[consumption_col].max():,.2f}")

In [None]:
# Create the target variable: HIGH vs LOW consumption
median_consumption = df[consumption_col].median()

# Create binary target: 1 = HIGH, 0 = LOW
df['consumption_level'] = (df[consumption_col] > median_consumption).astype(int)

# Check the distribution
print("Target variable distribution:")
print(df['consumption_level'].value_counts())
print(f"\n0 = LOW (at or below {median_consumption:,.0f} m3)")
print(f"1 = HIGH (above {median_consumption:,.0f} m3)")

In [None]:
# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pie chart
labels = ['LOW', 'HIGH']
sizes = df['consumption_level'].value_counts().sort_index()
colors = ['steelblue', 'coral']
axes[0].pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Target Variable Distribution', fontsize=14)

# Bar chart
axes[1].bar(labels, sizes, color=colors, edgecolor='black')
axes[1].set_xlabel('Consumption Level', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Class Distribution', fontsize=14)

for i, v in enumerate(sizes):
    axes[1].text(i, v + 100, str(v), ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

**Question:** Why is having balanced classes (roughly 50/50) important for classification?

*Your answer here*
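
One way to explore this question is to look at accuracy when the classes are not balanced. The sketch below uses made-up labels (90% LOW, 10% HIGH) and a hypothetical "model" that always predicts LOW. It is an illustration under those assumptions, not part of the exercise data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 LOW (0) and 10 HIGH (1)
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that ignores its input and always predicts LOW
y_always_low = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_low)      # share of correct predictions
rec_high = recall_score(y_true, y_always_low)   # share of HIGH cases found

print(f"Accuracy: {acc:.2f}")
print(f"Recall for HIGH: {rec_high:.2f}")
```

Compare the two numbers before writing your answer.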

---

## Part 2: Prepare Features (X) and Target (y) (8 minutes)

Now we need to select which features (input variables) we will use to predict consumption level.

**Key concepts:**
- **Features (X):** The input variables used to make predictions
- **Target (y):** The variable we want to predict
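
To make the two terms concrete, here is a minimal sketch on a hypothetical mini-dataset. The column names `subscribers`, `year`, and `level` are made up for illustration. X is a table with one row per observation, and y is a single column with one label per row.

```python
import pandas as pd

# Hypothetical mini-dataset; column names are made up for illustration
toy = pd.DataFrame({
    "subscribers": [120, 4500, 800, 15000],
    "year": [2019, 2020, 2021, 2022],
    "level": [0, 1, 0, 1],   # what we want to predict
})

X_toy = toy[["subscribers", "year"]]   # 2-D table of inputs
y_toy = toy["level"]                   # 1-D column of labels

print(X_toy.shape, y_toy.shape)
```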

---

### Task 2.1: Select Features

Let's identify which columns can be used as features.

In [None]:
# Review available columns
print("Available columns:")
for i, col in enumerate(df.columns):
    print(f"  {i+1}. {col} ({df[col].dtype})")

In [None]:
# Check numeric columns that could be features
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns:")
for col in numeric_cols:
    print(f"  - {col}")

In [None]:
# Select features for our model
# IMPORTANT: We cannot use CONSUMO_FACTURADO as a feature (that would be cheating!)
# We also should not use VALOR_FACTURADO (highly correlated with consumption)

# Let's use these features:
feature_cols = ['NUMERO_SUSCRIPTORES', 'ANNO']  # Number of subscribers, Year

# Check if we have a usage type column (USO) - we'll need to encode it
if 'USO' in df.columns:
    print(f"\nUsage types (USO):")
    print(df['USO'].value_counts())

In [None]:
# Encode categorical variable (USO) if it exists
if 'USO' in df.columns:
    # Create a label encoder (maps each category to an integer code;
    # the order of the codes carries no meaning)
    le = LabelEncoder()
    df['USO_encoded'] = le.fit_transform(df['USO'].fillna('UNKNOWN'))
    
    print("Encoding mapping:")
    for i, label in enumerate(le.classes_):
        print(f"  {label} -> {i}")
    
    feature_cols.append('USO_encoded')

print(f"\nFinal feature columns: {feature_cols}")

In [None]:
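# Create X (features) and y (target)
# First, remove rows with missing values in our selected columns.
# Rows with missing consumption are dropped too: NaN > median is False,
# so they would otherwise be labeled LOW by mistake.
required_cols = feature_cols + [consumption_col]
df_clean = df.dropna(subset=required_cols)[feature_cols + ['consumption_level']]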

print(f"Original dataset size: {len(df)}")
print(f"After removing missing values: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

In [None]:
# Create X and y
X = df_clean[feature_cols]
y = df_clean['consumption_level']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeatures preview:")
X.head()

---

## Part 3: Train/Test Split (5 minutes)

**Why split the data?**

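Imagine studying for an exam by memorizing all the answers. You would score 100% on that exact test but fail on any new questions. A model that does the same with its training data is **overfitting**: it performs well on the data it has seen and poorly on new data.

To check how well the model generalizes, we: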
1. **Train** the model on 80% of the data
2. **Test** the model on 20% of data it has never seen
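
The memorization effect can be sketched on synthetic data. In the example below the labels are random, so there is no real pattern to learn. The data, the variable names (`X_noise`, `X_tr`, and so on), and the unrestricted tree are assumptions chosen for illustration, separate from the exercise variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: labels are random, so there is no real pattern to learn
X_noise = rng.normal(size=(500, 3))
y_noise = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize every training row
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)
test_acc = deep_tree.score(X_te, y_te)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy:     {test_acc:.2f}")
```

The gap between the two scores is only visible because part of the data was held back.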

---

In [None]:
# Task 3.1: Split the data
# YOUR CODE HERE: Use train_test_split to split X and y
# Use test_size=0.2 (20% for testing) and random_state=42 (for reproducibility)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=___,      # Fill in: What percentage for testing?
    random_state=42     # Keep this for reproducibility
)

print(f"Training set size: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

In [None]:
# Check that the class distribution is similar in both sets
# (a random split preserves it only approximately; stratify=y would match it closely)
print("Class distribution in training set:")
print(y_train.value_counts(normalize=True).round(3))

print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True).round(3))

**Question:** Why do we use `random_state=42`?

*Your answer here*
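
To experiment with this question, the sketch below splits a small made-up array three times. The array and the second seed (7) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# Same seed: the same rows end up in the test set every time
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)

# Different seed: usually a different test set
_, test_c = train_test_split(data, test_size=0.2, random_state=7)

print("Seed 42, run 1:", test_a)
print("Seed 42, run 2:", test_b)
print("Seed 7:        ", test_c)
```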

---

## Part 4: Feature Scaling with StandardScaler (5 minutes)

**Why scale features?**

Different features have different scales:
- Number of subscribers: 100 - 100,000
- Year: 2016 - 2023

Some algorithms (like Decision Trees) don't need scaling, but many others (like Neural Networks and SVM) are sensitive to feature scale and usually perform better with it.

**StandardScaler** transforms data to have:
- Mean = 0
- Standard deviation = 1
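
The transformation is z = (x - mean) / std, computed separately for each column. As a rough check, the sketch below does this by hand on made-up values and compares the result with `StandardScaler`. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values on very different scales (subscribers, year)
toy = np.array([
    [100.0, 2016.0],
    [5000.0, 2018.0],
    [20000.0, 2020.0],
    [100000.0, 2023.0],
])

# By hand: subtract each column's mean, divide by its standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler performs the same computation
scaled = StandardScaler().fit_transform(toy)

print("Same result:", np.allclose(manual, scaled))
print("Column means:", scaled.mean(axis=0).round(2))
print("Column stds: ", scaled.std(axis=0).round(2))
```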

---

In [None]:
# Before scaling - check the current ranges
print("Before scaling:")
print(X_train.describe().round(2))

In [None]:
# Task 4.1: Apply StandardScaler
# Create the scaler
scaler = StandardScaler()

# IMPORTANT: fit_transform on training data, only transform on test data
# YOUR CODE HERE:
X_train_scaled = scaler.fit_transform(___)  # Fill in: Which data to fit and transform?
X_test_scaled = scaler.transform(___)       # Fill in: Which data to only transform?

# Convert back to DataFrame for easier viewing
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_cols)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=feature_cols)

print("After scaling (training data):")
print(X_train_scaled_df.describe().round(2))

**Important:** Notice that after scaling:
- Mean is approximately 0
- Standard deviation is approximately 1

**Question:** Why do we `fit_transform` on training data but only `transform` on test data?

*Your answer here*
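
A hint for this question: the sketch below shows which statistics a fitted scaler stores and how it applies them to new data. The one-feature arrays and the name `demo_scaler` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data, split by hand into train and test
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

demo_scaler = StandardScaler()
demo_scaler.fit(train)   # learns mean and std from the training rows only

print("Stored mean:", demo_scaler.mean_)
print("Stored std: ", demo_scaler.scale_)

# The test row is scaled with the training statistics, not its own
test_scaled = demo_scaler.transform(test)
print("Scaled test value:", test_scaled)
```

Think about what the stored mean would contain if the scaler had been fitted on all rows, including the test row.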

---

## Part 5: Build Your First Decision Tree (7 minutes)

**What is a Decision Tree?**

A Decision Tree is like playing "20 Questions":
- Is the number of subscribers 1000 or fewer? Yes -> Go left, No -> Go right
- Is it a residential usage type? Yes -> Predict LOW, No -> Check another question

Each question (node) splits the data until we reach a prediction (leaf).
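
To see the "questions" written out, scikit-learn's `export_text` prints a fitted tree as indented rules. The sketch below fits a small tree on hypothetical toy data. The column names and values are made up, and `year` is deliberately unrelated to the label.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: large subscriber counts are labeled HIGH (1)
toy = pd.DataFrame({
    "subscribers": [100, 300, 800, 1500, 4000, 9000],
    "year": [2016, 2020, 2018, 2017, 2021, 2019],
})
labels = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
toy_tree.fit(toy, labels)

# Print the learned questions as indented if/else rules
rules = export_text(toy_tree, feature_names=list(toy.columns))
print(rules)
```

Each indented line is one question, and each `class:` line is a leaf.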

---

In [None]:
# Task 5.1: Create and train the Decision Tree
# Note: Decision Trees don't require scaled features, so we train on the original
# (unscaled) X_train; the split thresholds then stay in the original units

# Create the model
dt_model = DecisionTreeClassifier(
    max_depth=3,        # Limit depth to make it interpretable
    random_state=42
)

# Train the model (fit on training data)
dt_model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"\nTree depth: {dt_model.get_depth()}")
print(f"Number of leaves: {dt_model.get_n_leaves()}")

In [None]:
# Task 5.2: Make predictions on the test set
y_pred = dt_model.predict(X_test)

print(f"Predictions made: {len(y_pred)}")
print(f"\nSample predictions (first 10):")
comparison = pd.DataFrame({
    'Actual': y_test.head(10).values,
    'Predicted': y_pred[:10],
    'Correct': y_test.head(10).values == y_pred[:10]
})
print(comparison.to_string(index=False))

In [None]:
# Task 5.3: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

print("=" * 50)
print("MODEL EVALUATION")
print("=" * 50)
print(f"\nAccuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"\nThis means the model correctly predicted {accuracy*100:.1f}% of test cases.")

In [None]:
# Confusion Matrix - Shows where the model made mistakes
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['LOW', 'HIGH'], 
            yticklabels=['LOW', 'HIGH'],
            ax=ax)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)
plt.tight_layout()
plt.show()

print("\nHow to read the confusion matrix:")
print(f"  - True Negatives (LOW correctly predicted as LOW): {cm[0,0]}")
print(f"  - False Positives (LOW incorrectly predicted as HIGH): {cm[0,1]}")
print(f"  - False Negatives (HIGH incorrectly predicted as LOW): {cm[1,0]}")
print(f"  - True Positives (HIGH correctly predicted as HIGH): {cm[1,1]}")

In [None]:
# Visualize the Decision Tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    dt_model, 
    feature_names=feature_cols,
    class_names=['LOW', 'HIGH'],
    filled=True,
    rounded=True,
    fontsize=10,
    ax=ax
)
ax.set_title('Decision Tree Visualization', fontsize=16)
plt.tight_layout()
plt.show()

print("\nHow to read the tree:")
print("  - Each box shows a decision rule (e.g., 'subscribers <= 1000')")
print("  - 'gini' measures impurity (0 = node holds a single class, 0.5 = even mix of two classes)")
print("  - 'samples' shows how many training examples reached this node")
print("  - 'value' shows [LOW count, HIGH count]")
print("  - 'class' is the predicted class for that node")

In [None]:
# Feature Importance - Which features matter most?
importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(importance.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(importance['Feature'], importance['Importance'], color='steelblue', edgecolor='black')
ax.set_xlabel('Importance', fontsize=12)
ax.set_title('Feature Importance', fontsize=14)
ax.invert_yaxis()
plt.tight_layout()
plt.show()

---

## Summary

In this exercise, you learned the fundamental steps of machine learning:

1. **Define the Problem**
   - We converted consumption into a binary classification (HIGH/LOW)
   - The target variable should be balanced for best results

2. **Prepare Features (X) and Target (y)**
   - Selected meaningful features that don't leak information
   - Encoded categorical variables (USO)
   - Removed missing values

3. **Train/Test Split**
   - 80% for training, 20% for testing
   - random_state for reproducibility
   - Detects overfitting by evaluating on unseen data

4. **Feature Scaling**
   - StandardScaler standardizes features to mean=0, std=1
   - fit_transform on train, transform on test

5. **Build and Evaluate Model**
   - Decision Tree is interpretable and easy to understand
   - Confusion matrix shows where mistakes happen
   - Feature importance shows what the model learned
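
The four cells of the confusion matrix from step 5 can also be condensed into precision and recall, the same quantities that `classification_report` (imported in Setup) reports per class. The sketch below computes them by hand on hypothetical labels and predictions. The values are made up and are not results from the exercise.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions (0 = LOW, 1 = HIGH)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted HIGH, how many were HIGH?
recall = tp / (tp + fn)      # of the actual HIGH, how many did we find?

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"By hand: precision={precision:.2f}, recall={recall:.2f}")
print(f"sklearn: precision={precision_score(y_true, y_hat):.2f}, "
      f"recall={recall_score(y_true, y_hat):.2f}")
```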

---

*Next week, we will explore more models (Random Forest, Linear Regression) and learn how to compare them!*

## Reflection Questions

1. **Why didn't we use CONSUMO_FACTURADO or VALOR_FACTURADO as features?**

   *Your answer here*

2. **What does it mean if the model has 70% accuracy? Is that good or bad?**

   *Your answer here*

3. **Looking at the feature importance, which feature was most important for predicting consumption level? Does this make sense?**

   *Your answer here*
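
For the first reflection question, it can help to see what happens when the column that defines the target is also used as a feature. The sketch below uses synthetic stand-ins: `consumption` defines the label and `noise` is unrelated to it. The helper `holdout_accuracy` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'consumption' defines the label, 'noise' is unrelated
consumption = rng.uniform(0, 1000, size=400)
noise = rng.normal(size=400)
label = (consumption > np.median(consumption)).astype(int)

def holdout_accuracy(features):
    """Train a small tree on the given features and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, label, test_size=0.2, random_state=42
    )
    tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
    return tree.score(X_te, y_te)

leaky = holdout_accuracy(np.column_stack([consumption, noise]))
honest = holdout_accuracy(noise.reshape(-1, 1))

print(f"With the leaked column:    {leaky:.2f}")
print(f"Without the leaked column: {honest:.2f}")
```

A very high score obtained this way says little about how the model would perform when the consumption value is not yet known.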

---

*End of Exercise*