# Week 13 Workshop: Introduction to Machine Learning

## ML Data Preparation for Water Consumption Prediction

**Student Name:** (Your name here)

**Date:** (Today's date)

**Dataset:** Water Consumption (HISTORICO_CONSUMO) from datos.gov.co

---

### Workshop Objectives

1. Define a clear ML problem statement
2. Select and justify appropriate features
3. Implement proper data preparation for ML
4. Build and evaluate a baseline model
5. Document all decisions with rationale

### Duration: 2-3 hours

---

## Setup

Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Load the Water Consumption dataset
url = "https://www.datos.gov.co/api/views/wcpc-hgdr/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print("\nColumns:")
print(df.columns.tolist())

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types and missing values
print("Data types and missing values:")
df.info()

In [None]:
# Summary statistics for numeric columns
df.describe()

---

# Part 1: Problem Definition

Before building any model, we must clearly define what we want to predict and why.

---

## Task 1.1: Explore Potential Target Variables

What could we potentially predict from this dataset?

In [None]:
# Examine the main variables that could be targets
print("Potential target variables:")
print("\n1. CONSUMO_FACTURADO (Billed Consumption in m3):")
print(f"   Range: {df['CONSUMO_FACTURADO'].min():,.0f} to {df['CONSUMO_FACTURADO'].max():,.0f}")
print(f"   Mean: {df['CONSUMO_FACTURADO'].mean():,.2f}")
print(f"   Median: {df['CONSUMO_FACTURADO'].median():,.2f}")

print("\n2. VALOR_FACTURADO (Billed Amount in COP):")
print(f"   Range: {df['VALOR_FACTURADO'].min():,.0f} to {df['VALOR_FACTURADO'].max():,.0f}")
print(f"   Mean: {df['VALOR_FACTURADO'].mean():,.2f}")

print("\n3. USO (Usage Type - Categorical):")
print(df['USO'].value_counts())

## Task 1.2: Choose Classification vs Regression

For this workshop, we will create a **classification** problem.

**Goal:** Predict whether a municipality has HIGH or LOW water consumption.

Define the target variable below.

In [None]:
# TODO: Create the target variable
# Use the median as the threshold: above median = HIGH (1), below/equal = LOW (0)

consumption_col = 'CONSUMO_FACTURADO'
median_consumption = df[consumption_col].median()

print(f"Median consumption: {median_consumption:,.2f} m3")

# YOUR CODE HERE: Create the target variable
df['consumption_level'] = ___  # Hint: (df[consumption_col] > median_consumption).astype(int)

# Verify the target distribution
print("\nTarget variable distribution:")
print(df['consumption_level'].value_counts())

In [None]:
# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Pie chart
labels = ['LOW', 'HIGH']
sizes = df['consumption_level'].value_counts().sort_index()
colors = ['steelblue', 'coral']

# YOUR CODE HERE: Create a pie chart showing the class distribution
axes[0].pie(___, labels=___, colors=___, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Target Variable Distribution', fontsize=14)

# Bar chart
# YOUR CODE HERE: Create a bar chart
axes[1].bar(___, ___, color=colors, edgecolor='black')
axes[1].set_xlabel('Consumption Level', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Class Distribution', fontsize=14)

plt.tight_layout()
plt.show()

## Task 1.3: Write Your Problem Statement

Complete the formal problem statement below:

### Problem Statement

**Type of problem:** Classification (binary)

**Target variable:** consumption_level (0 = LOW, 1 = HIGH)

**Definition of classes:**
- LOW: Consumption at or below _____ m3 (the median)
- HIGH: Consumption above _____ m3

**Goal:** Given a municipality's characteristics, predict whether its water consumption will be classified as HIGH or LOW.

**Business context:** (Write 1-2 sentences about why this prediction would be useful)

*YOUR ANSWER HERE*

---

## Task 1.4: Justify Your Choice

**Why classification instead of regression?**

*YOUR ANSWER HERE (Consider: interpretability, business use case, data characteristics)*

**Why use the median as the threshold?**

*YOUR ANSWER HERE (Consider: class balance, robustness to outliers)*
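
As a rough illustration of the two hints above, the sketch below compares a median threshold with a mean threshold on synthetic right-skewed data. The lognormal sample and all variable names are assumptions made for this example; nothing here is taken from the workshop dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "consumption" values (illustrative only, not real data)
rng = np.random.default_rng(42)
consumption = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=1001))

# Binary targets built with two different thresholds
high_by_median = (consumption > consumption.median()).astype(int)
high_by_mean = (consumption > consumption.mean()).astype(int)

# Share of rows labelled HIGH (1) under each threshold
share_median = high_by_median.mean()
share_mean = high_by_mean.mean()

print(f"HIGH share with median threshold: {share_median:.1%}")
print(f"HIGH share with mean threshold:   {share_mean:.1%}")
```

On skewed data a few very large values pull the mean upward, so a mean threshold tends to label far fewer than half of the rows as HIGH, while the median splits the rows into two classes of nearly equal size.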

---


# Part 2: Feature Selection

Select which variables will be used as features (inputs) for prediction.

---

## Task 2.1: Identify All Potential Features

List all columns and evaluate each as a potential feature.

In [None]:
# Review all columns
print("All columns in the dataset:")
for i, col in enumerate(df.columns):
    dtype = df[col].dtype
    unique = df[col].nunique()
    missing = df[col].isna().sum()
    print(f"{i+1}. {col:25} | Type: {str(dtype):10} | Unique: {unique:6} | Missing: {missing}")

## Task 2.2: Eliminate Data Leakage Features

**Data leakage** occurs when the model is trained on information that would not be available at prediction time, such as the target itself or a value computed from it. This leads to overly optimistic evaluation results that don't generalize to new data.
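
The effect can be sketched on synthetic data. In the example below, the billed amount is computed directly from consumption using a hypothetical flat price, and the helper `holdout_accuracy` is introduced only for this illustration. All values are made up and do not come from the workshop dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only): billed amount is derived from consumption
rng = np.random.default_rng(0)
n = 2000
subscribers = rng.integers(50, 5000, size=n)
consumption = subscribers * rng.uniform(5, 40, size=n)  # use per subscriber varies
billed_amount = consumption * 2500.0                    # hypothetical flat price per m3

toy = pd.DataFrame({
    "subscribers": subscribers,
    "billed_amount": billed_amount,
    "high": (consumption > np.median(consumption)).astype(int),
})

def holdout_accuracy(features):
    """Train a shallow tree on the given columns and return its test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[features], toy["high"], test_size=0.2, random_state=42
    )
    model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_honest = holdout_accuracy(["subscribers"])
acc_leaky = holdout_accuracy(["billed_amount"])

print(f"Accuracy without the leaky feature: {acc_honest:.3f}")
print(f"Accuracy with the leaky feature:    {acc_leaky:.3f}")
```

A near-perfect score from a single feature is usually a warning sign: the model has probably been handed the answer in disguise rather than learned anything useful.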

**Question:** Which columns should we DEFINITELY NOT use as features?

### Columns to EXCLUDE (data leakage risk):

| Column | Reason for Exclusion |
|--------|---------------------|
| CONSUMO_FACTURADO | *YOUR ANSWER* (Hint: This IS what we're trying to predict) |
| VALOR_FACTURADO | *YOUR ANSWER* (Hint: How is this related to consumption?) |
| consumption_level | This is the target variable itself |
| | *Add more if applicable* |

---

## Task 2.3: Handle Categorical Variables

Most scikit-learn estimators, including the Decision Tree used in this workshop, require numeric inputs, so categorical variables must be encoded first.
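
Two common encodings can be compared on a toy column. The category values below are hypothetical (the real `USO` values may differ), and one-hot encoding via `pd.get_dummies` is shown only as an alternative to the `LabelEncoder` approach used in this workshop.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical usage categories (the real USO values may differ)
uso = pd.Series(["RESIDENCIAL", "COMERCIAL", "INDUSTRIAL", "RESIDENCIAL"], name="USO")

# Option 1: integer codes. Classes are sorted alphabetically, so the order is arbitrary.
encoder = LabelEncoder()
codes = encoder.fit_transform(uso)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(codes)

# Option 2: one-hot encoding. One 0/1 column per category, no implied order.
one_hot = pd.get_dummies(uso, prefix="USO", dtype=int)
print(one_hot)
```

Integer codes are compact and usually adequate for tree-based models. One-hot columns avoid suggesting that one category is "greater" than another, which matters more for linear or distance-based models.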

In [None]:
# Check categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:")
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts().head(10))

In [None]:
# TODO: Encode the USO (usage type) column
# We will use LabelEncoder for simplicity.
# Note: it assigns integer codes in alphabetical order, so the numeric order carries no meaning.

le_uso = LabelEncoder()

# YOUR CODE HERE: Encode the USO column
df['USO_encoded'] = le_uso.fit_transform(df['USO'].fillna('UNKNOWN'))

print("Encoding mapping for USO:")
for i, label in enumerate(le_uso.classes_):
    print(f"  {label} -> {i}")

In [None]:
# Optional: Encode DEPARTAMENTO if you want to use it
# Note: This creates many categories - consider if it's worth including

print(f"\nNumber of unique departments: {df['DEPARTAMENTO'].nunique()}")

# YOUR CODE HERE (optional): Encode DEPARTAMENTO
# le_dept = LabelEncoder()
# df['DEPARTAMENTO_encoded'] = le_dept.fit_transform(df['DEPARTAMENTO'].fillna('UNKNOWN'))

## Task 2.4: Document Feature Selection

Complete the feature selection table below.

### Feature Selection Table

| Feature | Include? | Rationale |
|---------|----------|----------|
| NUMERO_SUSCRIPTORES | Yes/No | *YOUR RATIONALE* |
| ANNO | Yes/No | *YOUR RATIONALE* |
| USO_encoded | Yes/No | *YOUR RATIONALE* |
| DEPARTAMENTO | Yes/No | *YOUR RATIONALE* |
| MUNICIPIO | Yes/No | *YOUR RATIONALE* |

---

In [None]:
# TODO: Define your final feature list
# Based on your analysis, select the features you will use

feature_cols = [
    # YOUR CODE HERE: List the features you selected
    # Example: 'NUMERO_SUSCRIPTORES', 'ANNO', 'USO_encoded'
]

print(f"Selected features ({len(feature_cols)}):")
for col in feature_cols:
    print(f"  - {col}")

---

# Part 3: Data Preparation

Prepare the data for machine learning with proper methodology.

---

## Task 3.1: Handle Missing Values

Check for and handle missing values in your selected features.
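
The two usual options, dropping rows and imputing values, can be sketched on a toy frame. The column names mirror the workshop, but the values are invented, and median imputation with `SimpleImputer` is just one possible strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps (column names mirror the workshop, values are made up)
toy = pd.DataFrame({
    "NUMERO_SUSCRIPTORES": [120.0, np.nan, 300.0, 4500.0, 80.0],
    "ANNO": [2019, 2020, 2020, np.nan, 2021],
})

# Option A: drop every row that has at least one missing value
dropped = toy.dropna()

# Option B: fill each gap with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(f"Rows after dropping: {len(dropped)} of {len(toy)}")
print(f"Rows after imputing: {len(imputed)} of {len(toy)}")
print(imputed)
```

Dropping is simplest when few rows are affected. Imputing keeps every row but introduces estimated values, and in a real workflow the imputer should be fitted on the training split only, for the same reason as the scaler in Task 3.4.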

In [None]:
# Check missing values in selected features
print("Missing values in selected features:")
for col in feature_cols:
    missing = df[col].isna().sum()
    pct = missing / len(df) * 100
    print(f"  {col}: {missing} ({pct:.2f}%)")

# Also check target
print(f"\n  consumption_level: {df['consumption_level'].isna().sum()}")

In [None]:
# TODO: Remove rows with missing values in selected columns
# Create a clean dataset with only the columns we need

cols_needed = feature_cols + ['consumption_level']
df_clean = df[cols_needed].dropna()

print(f"Original dataset size: {len(df)}")
print(f"After removing missing values: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)} ({(len(df) - len(df_clean))/len(df)*100:.2f}%)")

**Decision:** Did you drop rows or impute missing values? Why?

*YOUR ANSWER HERE*

---

## Task 3.2: Create X (Features) and y (Target)

In [None]:
# TODO: Create X and y

X = df_clean[___]  # YOUR CODE: Which columns for features?
y = df_clean[___]  # YOUR CODE: Which column for target?

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nFeatures preview:")
X.head()

## Task 3.3: Train/Test Split

In [None]:
# TODO: Split the data into training and testing sets
# Use 80% for training, 20% for testing
# Set random_state=42 for reproducibility

X_train, X_test, y_train, y_test = train_test_split(
    ___,  # YOUR CODE: features
    ___,  # YOUR CODE: target
    test_size=___,     # YOUR CODE: what percentage for testing?
    random_state=42
)

print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")

In [None]:
# Verify class distribution is preserved
print("Class distribution:")
print("\nOriginal:")
print(y.value_counts(normalize=True).round(3))
print("\nTraining set:")
print(y_train.value_counts(normalize=True).round(3))
print("\nTest set:")
print(y_test.value_counts(normalize=True).round(3))

**Question:** Is the class distribution similar across all sets? Why is this important?

*YOUR ANSWER HERE*
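
If the shares differ noticeably between sets, `train_test_split` accepts a `stratify` argument that preserves them. The sketch below uses a deliberately imbalanced toy target (10% HIGH) so the effect is visible; the arrays are placeholders, not workshop data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 10% HIGH (1), 90% LOW (0); features are placeholders
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Plain random split: class shares can drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Stratified split: each set keeps the original class shares
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"HIGH share overall:          {y_toy.mean():.3f}")
print(f"HIGH share, plain test set:  {y_te_plain.mean():.3f}")
print(f"HIGH share, stratified test: {y_te_strat.mean():.3f}")
```

With a median-based target the classes are already close to 50/50, so a plain random split is usually fine here. Stratification becomes more important as the classes grow more imbalanced or the dataset gets smaller.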

---

## Task 3.4: Feature Scaling

In [None]:
# Before scaling - check the current ranges
print("Before scaling (training data):")
print(X_train.describe().round(2))

In [None]:
# TODO: Apply StandardScaler
# IMPORTANT: fit_transform on training data, only transform on test data

scaler = StandardScaler()

# YOUR CODE HERE:
X_train_scaled = scaler.fit_transform(___)  # Which data to fit and transform?
X_test_scaled = scaler.transform(___)       # Which data to only transform?

# Convert to DataFrame for viewing
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=feature_cols)

print("After scaling (training data):")
print(X_train_scaled_df.describe().round(2))

**Question:** Why do we fit the scaler ONLY on training data and not on the full dataset?

*YOUR ANSWER HERE (Hint: Think about data leakage)*
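
A minimal sketch of the fit-on-train, transform-on-test pattern, using synthetic features on very different scales. The arrays and their interpretation as "subscriber count" and "year" are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (illustrative values only)
rng = np.random.default_rng(1)
X_toy = np.column_stack([
    rng.normal(2000, 800, size=500),  # e.g. a subscriber count
    rng.normal(2020, 2, size=500),    # e.g. a year
])
X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the same parameters for the test split
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print("Learned means:", scaler.mean_.round(2))
print("Train mean/std after scaling:",
      X_tr_scaled.mean(axis=0).round(2), X_tr_scaled.std(axis=0).round(2))
print("Test mean/std after scaling: ",
      X_te_scaled.mean(axis=0).round(2), X_te_scaled.std(axis=0).round(2))
```

The test statistics come out close to, but not exactly, 0 and 1. That is expected: the test set stands in for unseen data, so it is scaled with parameters learned from the training set alone.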

---

## Task 3.5: Verify Data Integrity

In [None]:
# Final verification
print("=" * 50)
print("DATA PREPARATION SUMMARY")
print("=" * 50)

print(f"\n1. FEATURES:")
print(f"   Selected: {feature_cols}")
print(f"   Number of features: {len(feature_cols)}")

print(f"\n2. DATASET SIZES:")
print(f"   Training: {X_train_scaled.shape}")
print(f"   Test: {X_test_scaled.shape}")

print(f"\n3. TARGET DISTRIBUTION:")
print(f"   Train - LOW: {(y_train==0).sum()}, HIGH: {(y_train==1).sum()}")
print(f"   Test  - LOW: {(y_test==0).sum()}, HIGH: {(y_test==1).sum()}")

print("\n4. SCALING:")
print(f"   Mean of scaled training data: {X_train_scaled.mean(axis=0).round(2)}  (expected ~0)")
print(f"   Std of scaled training data: {X_train_scaled.std(axis=0).round(2)}  (expected ~1)")

print("\n" + "=" * 50)

---

# Part 4: Baseline Model

Build a simple Decision Tree model as our baseline.
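
One detail worth checking before training: tree splits depend only on the ordering of values within each feature, so standardizing the inputs should not change what a Decision Tree predicts. The sketch below tests this on synthetic data; the arrays and the rule used to generate the target are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data (illustrative only): two features on different scales, binary target
rng = np.random.default_rng(7)
X_toy = np.column_stack([rng.normal(2000, 800, 400), rng.normal(10, 3, 400)])
y_toy = (X_toy[:, 0] + 100 * X_toy[:, 1] > 3000).astype(int)
X_tr, X_te, y_tr = X_toy[:300], X_toy[300:], y_toy[:300]

scaler = StandardScaler().fit(X_tr)

# Same model settings, trained once on raw and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))
agreement = (pred_raw == pred_scaled).mean()

print(f"Share of identical predictions: {agreement:.1%}")
```

The predictions should be identical or nearly so. The practical difference is interpretability: a tree trained on standardized features reports its split thresholds in standardized units rather than in the original units.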

---

## Task 4.1: Train the Decision Tree

In [None]:
# TODO: Create and train the Decision Tree model
# Note: Decision Trees don't need scaled features, but we'll use them for consistency
# (split thresholds in the plotted tree will therefore be in standardized units)

# Create the model with limited depth for interpretability
dt_model = DecisionTreeClassifier(
    max_depth=3,        # Limit tree depth
    random_state=42     # For reproducibility
)

# YOUR CODE HERE: Train the model
dt_model.fit(___, ___)  # Which data to train on?

print("Model trained successfully!")
print(f"Tree depth: {dt_model.get_depth()}")
print(f"Number of leaves: {dt_model.get_n_leaves()}")

## Task 4.2: Evaluate Model Performance

In [None]:
# TODO: Make predictions on the test set
y_pred = dt_model.predict(___)  # Which data to predict on?

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("=" * 50)
print("MODEL EVALUATION")
print("=" * 50)
print(f"\nAccuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['LOW', 'HIGH'],
            yticklabels=['LOW', 'HIGH'],
            ax=ax)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)
ax.set_title(f'Confusion Matrix (Accuracy: {accuracy:.2%})', fontsize=14)
plt.tight_layout()
plt.show()

print("\nConfusion Matrix Breakdown:")
print(f"  True Negatives (LOW -> LOW): {cm[0,0]}")
print(f"  False Positives (LOW -> HIGH): {cm[0,1]}")
print(f"  False Negatives (HIGH -> LOW): {cm[1,0]}")
print(f"  True Positives (HIGH -> HIGH): {cm[1,1]}")

In [None]:
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['LOW', 'HIGH']))

## Task 4.3: Interpret the Results

In [None]:
# Visualize the Decision Tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    dt_model,
    feature_names=feature_cols,
    class_names=['LOW', 'HIGH'],
    filled=True,
    rounded=True,
    fontsize=10,
    ax=ax
)
ax.set_title('Decision Tree Visualization', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance
importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(importance.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.barh(importance['Feature'], importance['Importance'], color='steelblue', edgecolor='black')
ax.set_xlabel('Importance', fontsize=12)
ax.set_title('Feature Importance', fontsize=14)
ax.invert_yaxis()

# Add value labels
for bar, val in zip(bars, importance['Importance']):
    ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, f'{val:.3f}', 
            va='center', fontsize=10)

plt.tight_layout()
plt.show()

## Task 4.4: Interpretation Questions

Answer these questions based on your model results:

**1. What is the overall accuracy? Is this good or bad?**

*YOUR ANSWER HERE (Consider: What would random guessing achieve? Is our model better?)*

**2. Which feature is most important for predicting consumption level?**

*YOUR ANSWER HERE (Look at the feature importance chart)*

**3. Looking at the confusion matrix, does the model make more false positives or false negatives?**

*YOUR ANSWER HERE*

**4. What are 2-3 ways we could potentially improve this model?**

*YOUR ANSWER HERE (Consider: more features, different model, more data, etc.)*
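
When answering questions 1 and 3, it can help to see how the headline metrics follow from the four confusion-matrix cells, and how they compare with a trivial baseline. The labels below are hand-made for illustration and are not model output from this workshop.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-made labels (illustrative only): 0 = LOW, 1 = HIGH
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted HIGH, how many were truly HIGH
recall = tp / (tp + fn)     # of the truly HIGH, how many were found

# Baseline: always predict the most frequent class
majority_class = max(set(y_true), key=y_true.count)
baseline_accuracy = accuracy_score(y_true, [majority_class] * len(y_true))

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
print(f"Majority-class baseline accuracy={baseline_accuracy:.2f}")
```

A model is only useful if it clearly beats the majority-class baseline. With a median-based target that baseline sits near 50%, so accuracy should be judged against that figure rather than against zero.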

---


# Part 5: Documentation

Summarize all your decisions and findings.

---

## Summary: ML Data Preparation Decisions

### Problem Definition

| Aspect | Decision | Rationale |
|--------|----------|----------|
| Problem type | Classification | *YOUR RATIONALE* |
| Target variable | consumption_level (HIGH/LOW) | *YOUR RATIONALE* |
| Threshold | Median consumption | *YOUR RATIONALE* |

### Feature Selection

| Feature | Included | Type | Rationale |
|---------|----------|------|----------|
| NUMERO_SUSCRIPTORES | Yes/No | Numeric | *YOUR RATIONALE* |
| ANNO | Yes/No | Numeric | *YOUR RATIONALE* |
| USO_encoded | Yes/No | Encoded categorical | *YOUR RATIONALE* |
| (Add others) | | | |

### Data Preparation

| Step | Decision | Details |
|------|----------|--------|
| Missing values | Drop/Impute | *How many rows affected?* |
| Train/Test split | 80/20 | *random_state=42* |
| Scaling | StandardScaler | *Mean=0, Std=1* |

### Baseline Model Performance

| Metric | Value |
|--------|-------|
| Accuracy | *YOUR VALUE* |
| Precision (HIGH) | *YOUR VALUE* |
| Recall (HIGH) | *YOUR VALUE* |
| Most important feature | *YOUR VALUE* |

---

## Next Steps for Week 14

List 3-5 improvements or explorations for next week:

1. *YOUR IDEA* (e.g., "Try Random Forest model")
2. *YOUR IDEA* (e.g., "Add more features like DEPARTAMENTO")
3. *YOUR IDEA* (e.g., "Try regression instead of classification")
4. 
5. 

---

## Reflection Questions

### 1. What was the most challenging part of preparing data for ML?

*YOUR ANSWER HERE*

### 2. Why is documentation of decisions important in ML projects?

*YOUR ANSWER HERE*

### 3. How would you explain your model to a non-technical stakeholder?

*YOUR ANSWER HERE (Write 2-3 sentences)*

### 4. What surprised you about the feature importance results?

*YOUR ANSWER HERE*

---

## Submission Checklist

Before submitting, verify that you have:

- [ ] All cells executed without errors (Kernel > Restart & Run All)
- [ ] Problem statement clearly defined
- [ ] Feature selection documented with rationale
- [ ] Train/test split correctly implemented (80/20)
- [ ] Feature scaling properly applied (fit on train only)
- [ ] Baseline model trained and evaluated
- [ ] Confusion matrix created and interpreted
- [ ] Feature importance analyzed
- [ ] All documentation sections completed
- [ ] Reflection questions answered

---

*Week 13 Workshop - Data Analytics Course - Universidad Cooperativa de Colombia*