# Lecture 8: Programming Example - Linear Models for Prediction

## Introduction: Building Your First Predictive Model for Transportation Demand

Welcome back, junior data consultant! Your client, Capital City Bikes in Washington, D.C., has been impressed with your statistical analysis and visualization work. Now they're ready for the next step: building a predictive model that forecasts hourly bike demand. The CEO has just approved your recommendation to develop a demand forecasting system that will shift their operations from reactive to predictive.

Think of predictive modeling as the ultimate consulting deliverable. While statistics reveal patterns and visualizations communicate insights, predictive models operationalize intelligence - they generate actionable forecasts that drive real-time business decisions. You'll master linear regression, the foundational algorithm that quantifies relationships between variables and produces interpretable predictions executives can trust and act upon.

Your task: build a linear regression model that predicts hourly bike demand using weather conditions and temporal patterns. You'll learn to calculate correlation coefficients that measure relationship strength, engineer time-based features that capture demand cycles, implement scikit-learn models with proper train-test splits, and apply cross-validation for robust performance estimates. Every technique ensures your forecasts are accurate, reliable, and operationally actionable.

> **🚀 Interactive Learning Alert**
>
> This is a hands-on predictive modeling tutorial with real forecasting challenges. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your predictive modeling skills
> - **Think like a consultant** - model accuracy directly impacts operational planning

---

## Step 1: Calculate Correlations and Visualize Relationships

Let's analyze which variables have linear relationships with bike demand by calculating correlations and creating visualizations.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate correlation and regression for temperature vs demand (using raw data)
slope, intercept, r_value, p_value, std_err = stats.linregress(df['temp'], df['count'])

# Chart 1: Raw scatter plot (all individual observations)
plt.figure(figsize=(10, 6))
plt.scatter(df['temp'], df['count'], alpha=0.3, s=10, label='Raw Data Points')

# Add regression line (calculated from raw data)
temp_range = np.array([df['temp'].min(), df['temp'].max()])
demand_trend = slope * temp_range + intercept
plt.plot(temp_range, demand_trend, 'r-', linewidth=3, label=f'Regression Line (r={r_value:.3f})')

plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Hourly Bike Rentals', fontsize=12)
plt.title('Raw Data: Temperature vs Demand (All Observations)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("=== VISUALIZATION NOTE ===")
print("The scatter plot above shows all 10,886 hourly observations.")
print("While this represents the complete data, the density makes it hard to see")
print("the underlying pattern clearly. Let's create a cleaner visualization...\n")

# Chart 2: Binned visualization (for pedagogical clarity)
# Create temperature bins and calculate mean demand for each bin
df['temp_bin'] = pd.cut(df['temp'], bins=20)
binned = df.groupby('temp_bin', observed=True)['count'].agg(['mean', 'std', 'count'])
binned['se'] = binned['std'] / np.sqrt(binned['count'])
binned['temp_center'] = binned.index.map(lambda x: x.mid)

plt.figure(figsize=(10, 6))
plt.errorbar(binned['temp_center'], binned['mean'],
            yerr=binned['se']*1.96,  # 95% confidence interval
            fmt='o', markersize=8, capsize=5, capthick=2,
            color='#2ECC71', ecolor='#2ECC71', alpha=0.7,
            label='Binned Mean Demand (95% CI)')

# Add the SAME regression line (still from raw data)
plt.plot(temp_range, demand_trend, 'r-', linewidth=3, label=f'Regression Line (r={r_value:.3f})')

plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Hourly Bike Rentals', fontsize=12)
plt.title('Clean Visualization: Temperature vs Demand (Binned Averages)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print correlation statistics
print("=== CORRELATION ANALYSIS ===")
print(f"Correlation (r): {r_value:.3f}")
print(f"R-squared: {r_value**2:.3f} ({(r_value**2)*100:.1f}% variance explained)")
print(f"Slope: {slope:.2f} bikes per °C")

print("\n=== IMPORTANT CLARIFICATION ===")
print("Both charts show the SAME regression line calculated from the raw data.")
print("The binned visualization groups observations into temperature ranges and shows")
print("their average demand - this is purely for pedagogical purposes to help you see")
print("the pattern more clearly. However, the regression analysis always uses the")
print("complete raw data (all 10,886 observations), not the binned averages.")
print("This is the correct approach: bin for visualization clarity, but analyze raw data.")

**What this does:**

- Loads data and calculates correlation between temperature and demand using raw data
- Shows two visualizations: raw scatter (complete but messy) and binned averages (clean and clear)
- Both charts display the same regression line calculated from the raw data
- Clarifies that binning is a visualization technique, not part of the statistical analysis
- Prints correlation statistics (r ≈ 0.39 shows moderate positive relationship)
- R² of ~15% means temperature alone explains only 15% of demand variation
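
To make the link between r and R² concrete, here is a minimal sketch on synthetic data. The arrays `toy_temp` and `toy_demand` are made up for illustration and are not the bike dataset. It computes Pearson's r from its definition, checks that it matches what `stats.linregress` reports, and shows that for a one-variable regression R² is simply r squared.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: illustrative only, NOT the bike dataset)
rng = np.random.default_rng(42)
toy_temp = rng.uniform(0, 35, size=500)
toy_demand = 50 + 6 * toy_temp + rng.normal(0, 80, size=500)

# Pearson r from its definition: co-variation divided by the two spreads
temp_dev = toy_temp - toy_temp.mean()
demand_dev = toy_demand - toy_demand.mean()
r_manual = (temp_dev * demand_dev).sum() / np.sqrt(
    (temp_dev ** 2).sum() * (demand_dev ** 2).sum()
)

# The same quantity as reported by scipy
toy_fit = stats.linregress(toy_temp, toy_demand)

# R² from the residuals of the fitted line: 1 - SS_res / SS_tot
toy_pred = toy_fit.slope * toy_temp + toy_fit.intercept
ss_res = ((toy_demand - toy_pred) ** 2).sum()
ss_tot = (demand_dev ** 2).sum()
r2_from_residuals = 1 - ss_res / ss_tot

print(f"r (manual):        {r_manual:.4f}")
print(f"r (linregress):    {toy_fit.rvalue:.4f}")
print(f"r squared:         {r_manual ** 2:.4f}")
print(f"R² from residuals: {r2_from_residuals:.4f}")
```

The last two printed values agree, which is why the tutorial can report "variance explained" directly as `r_value**2`. This shortcut holds for a single predictor with an intercept; with several features, R² has to be computed from the model's predictions.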

### Challenge 1: Explore Humidity-Demand Relationship

Your client asks: "Does humidity affect bike demand like temperature does?" Investigate the humidity-demand relationship and compare it to the temperature relationship.

**Your Task:** Calculate correlation statistics and create a clean binned visualization for humidity vs demand.

In [None]:
# Your code here - analyze humidity-demand relationship

# Step 1: Calculate regression statistics for humidity (using raw data)
slope_hum, intercept_hum, r_value_hum, _, _ = stats.linregress(df['_____'], df['_____'])

# Step 2: Create binned visualization for clarity
df['humidity_bin'] = pd.cut(df['humidity'], bins=20)
humidity_binned = df.groupby('humidity_bin', observed=True)['count'].agg(['mean', 'std', 'count'])
humidity_binned['se'] = humidity_binned['std'] / np.sqrt(humidity_binned['count'])
humidity_binned['humidity_center'] = humidity_binned.index.map(lambda x: x.mid)

plt.figure(figsize=(10, 6))
plt.errorbar(humidity_binned['_____'], humidity_binned['_____'],
            yerr=humidity_binned['se']*1.96,
            fmt='o', markersize=8, capsize=5, capthick=2,
            color='#E74C3C', ecolor='#E74C3C', alpha=0.7,
            label='Binned Mean Demand (95% CI)')

# Add regression line (from raw data)
humidity_range = np.array([df['humidity'].min(), df['humidity'].max()])
demand_trend_hum = _____ * humidity_range + _____
plt.plot(humidity_range, demand_trend_hum, 'b-', linewidth=3, label=f'Regression Line (r={r_value_hum:.3f})')

plt.xlabel('Humidity (%)', fontsize=12)
plt.ylabel('Hourly Bike Rentals', fontsize=12)
plt.title('Humidity-Demand Relationship Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Step 3: Compare correlations
print("=== CORRELATION COMPARISON ===")
print(f"Temperature correlation: r = {r_value:.3f} (explains {(r_value**2)*100:.1f}%)")
print(f"Humidity correlation: r = {r_value_hum:.3f} (explains {(r_value_hum**2)*100:.1f}%)")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Calculate the regression from the raw data: `stats.linregress(df['humidity'], df['count'])` returns five values, so use an underscore for the ones you don't need (`slope_hum, intercept_hum, r_value_hum, _, _ = ...`). For the binned visualization, pass `humidity_binned['humidity_center']` and `humidity_binned['mean']` to the errorbar plot. The regression line uses `slope_hum` and `intercept_hum`: demand = slope_hum × humidity + intercept_hum. Remember that the regression is calculated from raw data; binning is only for visualization clarity. Humidity typically shows a negative correlation with demand, which is consistent with humid conditions being uncomfortable for riding, although a correlation by itself does not prove that humidity causes the drop. Compare the R² values to see which weather factor is more strongly associated with demand; this helps Capital City Bikes plan for different weather conditions.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Step 1: Calculate regression statistics for humidity (using raw data)
slope_hum, intercept_hum, r_value_hum, _, _ = stats.linregress(df['humidity'], df['count'])

# Step 2: Create binned visualization for clarity
df['humidity_bin'] = pd.cut(df['humidity'], bins=20)
humidity_binned = df.groupby('humidity_bin', observed=True)['count'].agg(['mean', 'std', 'count'])
humidity_binned['se'] = humidity_binned['std'] / np.sqrt(humidity_binned['count'])
humidity_binned['humidity_center'] = humidity_binned.index.map(lambda x: x.mid)

plt.figure(figsize=(10, 6))
plt.errorbar(humidity_binned['humidity_center'], humidity_binned['mean'],
            yerr=humidity_binned['se']*1.96,
            fmt='o', markersize=8, capsize=5, capthick=2,
            color='#E74C3C', ecolor='#E74C3C', alpha=0.7,
            label='Binned Mean Demand (95% CI)')

# Add regression line (from raw data)
humidity_range = np.array([df['humidity'].min(), df['humidity'].max()])
demand_trend_hum = slope_hum * humidity_range + intercept_hum
plt.plot(humidity_range, demand_trend_hum, 'b-', linewidth=3, label=f'Regression Line (r={r_value_hum:.3f})')

plt.xlabel('Humidity (%)', fontsize=12)
plt.ylabel('Hourly Bike Rentals', fontsize=12)
plt.title('Humidity-Demand Relationship Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Step 3: Compare correlations
print("=== CORRELATION COMPARISON ===")
print(f"Temperature correlation: r = {r_value:.3f} (explains {(r_value**2)*100:.1f}%)")
print(f"Humidity correlation: r = {r_value_hum:.3f} (explains {(r_value_hum**2)*100:.1f}%)")
print(f"\nInsight: {'Temperature' if abs(r_value) > abs(r_value_hum) else 'Humidity'} shows stronger correlation with demand")
print(f"Remember: Regression calculated from raw data, binning used only for clear visualization")
```

</details>

---

## Step 2: Engineer Time-Based Features

Weather alone can't capture temporal patterns like rush hours and daily cycles. Let's create time-based features that capture these patterns.

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Sort by datetime for proper temporal ordering
df = df.sort_values('datetime').reset_index(drop=True)

# Extract hour and create cyclical encoding
df['hour'] = df['datetime'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Create lag features (demand from previous hours)
df['demand_lag_1h'] = df['count'].shift(1)
df['demand_lag_24h'] = df['count'].shift(24)

# Extract day of week and month
df['dayofweek'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month

# Remove rows with NaN values from lag features
df_clean = df.dropna(subset=['demand_lag_1h', 'demand_lag_24h'])

# Visualize hourly pattern
hourly_avg = df_clean.groupby('hour')['count'].mean()
plt.figure(figsize=(10, 5))
plt.plot(hourly_avg.index, hourly_avg.values, marker='o', linewidth=2)
plt.xlabel('Hour of Day', fontsize=12)
plt.ylabel('Average Bike Rentals', fontsize=12)
plt.title('Daily Demand Pattern Shows Rush Hours', fontsize=14, fontweight='bold')
plt.xticks(range(0, 24, 2))
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("=== ENGINEERED FEATURES ===")
print(f"Total features created: hour, hour_sin, hour_cos, lag_1h, lag_24h, dayofweek, month")
print(f"Dataset size after cleaning: {len(df_clean):,} observations")
print(f"Peak hours: {hourly_avg.nlargest(3).index.tolist()}")

**What this does:**
- Extracts hour from datetime and creates cyclical sin/cos encoding
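- Creates lag features with `.shift(1)` and `.shift(24)` (demand from 1 and 24 rows earlier, which equals 1 and 24 hours ago only when the hourly series has no missing rows)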
- Extracts day of week and month for weekly/seasonal patterns
- Removes rows with NaN values from lag features
- Visualizes daily demand pattern showing morning and evening peaks
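
Why encode the hour with sine and cosine instead of using the raw number? A linear model sees raw hours 23 and 0 as 23 units apart, even though they are one hour apart on the clock. The sketch below illustrates this with two small helper functions, `encode_hour` and `circle_distance`, which are hypothetical names introduced here and not part of the tutorial code.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def circle_distance(hour_a, hour_b):
    """Euclidean distance between two hours in (sin, cos) space."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

# Raw hour values treat 23:00 and 00:00 as far apart
raw_gap = abs(23 - 0)

# In the encoded space, 23:00 -> 00:00 is the same size step as 10:00 -> 11:00
d_midnight = circle_distance(23, 0)
d_one_hour = circle_distance(10, 11)

# Opposite sides of the clock are the farthest apart (diameter of the circle)
d_half_day = circle_distance(0, 12)

print(f"Raw gap between 23 and 0:     {raw_gap}")
print(f"Encoded distance 23 -> 0:     {d_midnight:.4f}")
print(f"Encoded distance 10 -> 11:    {d_one_hour:.4f}")
print(f"Encoded distance 0 -> 12:     {d_half_day:.4f}")
```

Both `hour_sin` and `hour_cos` are needed: either one alone maps two different hours to the same value (for example, sine is 0 at both 00:00 and 12:00), while the pair identifies each hour uniquely.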

### Challenge 2: Visualize Weekly Demand Patterns

Your client asks: "Do we see different demand patterns on different days of the week?" Analyze and visualize weekly patterns.

**Your Task:** Calculate and visualize average demand by day of week.

In [None]:
# Your code here - analyze weekly patterns

# Calculate average demand by day of week
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_pattern = df_clean.groupby(_____)['_____']._____()

# Visualize weekly pattern
plt.figure(figsize=(10, 6))
plt.bar(_____, weekly_pattern.values, color='teal', edgecolor='black')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Average Hourly Demand', fontsize=12)
plt.title('Weekly Demand Pattern', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print insights
print("=== WEEKLY DEMAND PATTERN ===")
for i, day in enumerate(day_names):
    print(f"{day}: {weekly_pattern.iloc[i]:.1f} bikes per hour")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


The dayofweek feature was already created in Step 2 (0=Monday through 6=Sunday). Use `.groupby('dayofweek')['count'].mean()` to calculate average demand for each day, and pass `day_names` to `plt.bar()` along with the values to get readable labels. You may find that weekdays show relatively stable commuting patterns, that Saturday peaks with leisure riding, and that Sunday drops (rest/preparation day); let the chart confirm this before you report it to the client.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Calculate average demand by day of week
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_pattern = df_clean.groupby('dayofweek')['count'].mean()

# Visualize weekly pattern
plt.figure(figsize=(10, 6))
plt.bar(day_names, weekly_pattern.values, color='teal', edgecolor='black')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Average Hourly Demand', fontsize=12)
plt.title('Weekly Demand Pattern', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# Print insights
print("=== WEEKLY DEMAND PATTERN ===")
for i, day in enumerate(day_names):
    print(f"{day}: {weekly_pattern.iloc[i]:.1f} bikes per hour")
print(f"\nBest day: {day_names[weekly_pattern.idxmax()]} ({weekly_pattern.max():.1f} bikes)")
print(f"Weakest day: {day_names[weekly_pattern.idxmin()]} ({weekly_pattern.min():.1f} bikes)")
```

</details>
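
One caveat about the lag features from this step: `.shift(1)` moves values by one row, not by one hour. If the hourly series has missing timestamps, a row-based lag silently pairs an observation with a value from further back. Whether this dataset has such gaps is worth checking rather than assuming. The sketch below uses a tiny made-up frame (`toy`, with one hour deliberately missing) to show how to flag gaps and how to build a lag that looks up the exact earlier timestamp instead.

```python
import pandas as pd

# Tiny synthetic series with the 02:00 row missing (assumption: illustrative only)
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00",
        "2011-01-01 03:00", "2011-01-01 04:00",
    ]),
    "count": [10, 20, 40, 50],
})

# Row-based lag: the 03:00 row receives the 01:00 value (two hours back)
toy["lag_rows"] = toy["count"].shift(1)

# Time-based lag: look up the value stamped exactly one hour earlier;
# timestamps that do not exist in the data come back as NaN
by_time = toy.set_index("datetime")["count"]
one_hour_back = toy["datetime"] - pd.Timedelta(hours=1)
toy["lag_1h_exact"] = by_time.reindex(one_hour_back).to_numpy()

# Flag rows whose previous row is not exactly one hour earlier
# (the first row is flagged too, because it has no previous row)
toy["gap_before"] = toy["datetime"].diff() != pd.Timedelta(hours=1)

print(toy)
print(f"Rows preceded by a gap: {int(toy['gap_before'].sum())}")
```

If the gap check on the real data reports only the first row, the row-based lags in this tutorial are equivalent to time-based ones. If it reports more, the time-based version is the safer choice, at the cost of a few extra NaN rows to drop.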

---

## Step 3: Train a Linear Regression Model

Let's train a linear regression model using scikit-learn to predict bike demand from our engineered features.

In [None]:
# Import libraries and load/prepare data
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Load and prepare data (same as Step 2)
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

# Create features
df['hour'] = df['datetime'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['demand_lag_1h'] = df['count'].shift(1)
df['demand_lag_24h'] = df['count'].shift(24)
df['dayofweek'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df_clean = df.dropna(subset=['demand_lag_1h', 'demand_lag_24h'])

# Prepare feature matrix and target
feature_columns = ['temp', 'humidity', 'windspeed', 'hour_sin', 'hour_cos',
                   'demand_lag_1h', 'demand_lag_24h', 'dayofweek', 'month']
X = df_clean[feature_columns]
y = df_clean['count']

# Create and train model
model = LinearRegression()
model.fit(X, y)

# Make predictions and evaluate
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))
mae = mean_absolute_error(y, y_pred)

print("=== MODEL TRAINING COMPLETE ===")
print(f"Features used: {len(feature_columns)}")
print(f"\nModel Coefficients:")
for feature, coef in zip(feature_columns, model.coef_):
    print(f"  {feature:20s}: {coef:+8.2f}")

print(f"\n=== PERFORMANCE (all data) ===")
print(f"R²:   {r2:.4f} ({r2*100:.1f}% variance explained)")
print(f"RMSE: {rmse:.2f} bikes per hour")
print(f"MAE:  {mae:.2f} bikes per hour")

print("\nNote: These metrics are optimistic - we need train-test split!")

**What this does:**
- Prepares feature matrix X (9 features) and target y (bike demand)
- Trains LinearRegression model with .fit(X, y)
- Makes predictions with .predict(X) and calculates performance metrics
- Shows coefficients revealing which features increase/decrease demand
- R² of ~81% means the model explains about 81% of demand variation on the same data it was trained on (an in-sample, optimistic figure)
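
To see what `.fit()` is doing under the hood, the sketch below reproduces a `LinearRegression` fit with ordinary least squares in NumPy. It uses synthetic data (`X_toy`, `y_toy`, and the "true" coefficients are invented for this example), so treat it as an illustration of the technique rather than a description of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with known coefficients (assumption: illustrative only)
rng = np.random.default_rng(0)
n_rows = 300
X_toy = rng.normal(size=(n_rows, 3))
true_coefs = np.array([4.0, -2.0, 0.5])
y_toy = 10.0 + X_toy @ true_coefs + rng.normal(scale=1.0, size=n_rows)

# Fit with scikit-learn
sk_model = LinearRegression().fit(X_toy, y_toy)

# Same fit by hand: prepend a column of ones for the intercept,
# then find the coefficients that minimize the sum of squared errors
X_design = np.column_stack([np.ones(n_rows), X_toy])
beta, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
intercept_manual, coefs_manual = beta[0], beta[1:]

# R² by definition: 1 - (unexplained variation / total variation)
y_hat = X_design @ beta
r2_manual = 1 - ((y_toy - y_hat) ** 2).sum() / ((y_toy - y_toy.mean()) ** 2).sum()

print("sklearn coefficients:", np.round(sk_model.coef_, 4))
print("manual coefficients: ", np.round(coefs_manual, 4))
print(f"sklearn intercept: {sk_model.intercept_:.4f} | manual: {intercept_manual:.4f}")
print(f"R² (manual): {r2_manual:.4f} | R² (r2_score): "
      f"{r2_score(y_toy, sk_model.predict(X_toy)):.4f}")
```

Each coefficient is the predicted change in the target for a one-unit change in that feature with the others held fixed, which is why features on very different scales (temperature in °C versus lagged demand in bikes) produce coefficients that cannot be compared by size alone.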

### Challenge 3: Train Model with Different Feature Combinations

Your client asks: "Which features matter most? Can we get similar performance with fewer features?" Compare models using different feature sets.

**Your Task:** Train three models with different feature combinations and compare their performance.

In [None]:
# Your code here - compare models with different feature sets

# Model 1: Weather-only features
weather_features = ['temp', 'humidity', 'windspeed']
X_weather = df_clean[_____]
model_weather = LinearRegression()
model_weather.fit(X_weather, y)
y_pred_weather = model_weather.predict(X_weather)
r2_weather = r2_score(y, y_pred_weather)

# Model 2: Time-only features
time_features = ['hour_sin', 'hour_cos', 'dayofweek', 'month']
X_time = df_clean[_____]
model_time = LinearRegression()
model_time.fit(X_time, y)
y_pred_time = model_time.predict(X_time)
r2_time = r2_score(y, y_pred_time)

# Model 3: Complete model (weather + time + lags)
complete_features = feature_columns  # Already defined above
X_complete = df_clean[complete_features]
model_complete = LinearRegression()
model_complete.fit(X_complete, y)
y_pred_complete = model_complete.predict(X_complete)
r2_complete = r2_score(y, y_pred_complete)

# Compare results
print("=== MODEL COMPARISON ===")
print(f"Weather-only model:  R² = {r2_weather:.4f} ({_____} features)")
print(f"Time-only model:     R² = {r2_time:.4f} ({_____} features)")
print(f"Complete model:      R² = {r2_complete:.4f} ({_____} features)")
print(f"\nBest model: {'Weather' if r2_weather > max(r2_time, r2_complete) else 'Time' if r2_time > r2_complete else 'Complete'}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Follow the same pattern for each model: create the X subset with `df_clean[feature_list]` → fit with `LinearRegression().fit(X, y)` → predict with `model.predict(X)` → calculate R² with `r2_score(y_true, y_pred)`. Weather features (temp, humidity, windspeed) capture environmental conditions, time features (hour_sin, hour_cos, dayofweek, month) capture temporal patterns, and the complete model uses all 9 features, including the lag variables. Fit each model separately (don't reuse the previous model object), and use `len(feature_list)` to count features. The weather-only model is useful when you have weather forecasts but no recent demand history, and the time-only model works for seasonal planning independent of weather. Keep in mind that these R² values are computed on the training data, where adding features to a linear regression can never lower R², so the complete model is guaranteed to score at least as well as the other two here. Read the comparison as a rough guide to which feature groups carry the most signal; confirming it requires held-out data, as in Step 4.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Model 1: Weather-only features
weather_features = ['temp', 'humidity', 'windspeed']
X_weather = df_clean[weather_features]
model_weather = LinearRegression()
model_weather.fit(X_weather, y)
y_pred_weather = model_weather.predict(X_weather)
r2_weather = r2_score(y, y_pred_weather)

# Model 2: Time-only features
time_features = ['hour_sin', 'hour_cos', 'dayofweek', 'month']
X_time = df_clean[time_features]
model_time = LinearRegression()
model_time.fit(X_time, y)
y_pred_time = model_time.predict(X_time)
r2_time = r2_score(y, y_pred_time)

# Model 3: Complete model (weather + time + lags)
complete_features = feature_columns
X_complete = df_clean[complete_features]
model_complete = LinearRegression()
model_complete.fit(X_complete, y)
y_pred_complete = model_complete.predict(X_complete)
r2_complete = r2_score(y, y_pred_complete)

# Compare results
print("=== MODEL COMPARISON ===")
print(f"Weather-only model:  R² = {r2_weather:.4f} ({len(weather_features)} features)")
print(f"Time-only model:     R² = {r2_time:.4f} ({len(time_features)} features)")
print(f"Complete model:      R² = {r2_complete:.4f} ({len(complete_features)} features)")

print(f"\nInsights:")
print(f"- Time features alone explain {r2_time*100:.1f}% of variation (temporal patterns dominate)")
print(f"- Weather + lag features add {(r2_complete-r2_time)*100:.1f}% additional explanation")
print(f"- Lag features are crucial: they capture sequential dependencies")
print(f"\nRecommendation: Use complete model for best forecasting accuracy")
```

</details>
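
A note on reading this comparison: R² measured on the training data cannot go down when you add features to a linear regression, even if the new features are pure noise. The sketch below demonstrates this on synthetic data (all names and numbers are invented for the example) by adding 20 random columns and scoring the model both on its training rows and on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 2 informative features plus 20 columns of pure noise
# (assumption: illustrative only, unrelated to the bike dataset)
rng = np.random.default_rng(1)
n_rows = 200
signal = rng.normal(size=(n_rows, 2))
noise_cols = rng.normal(size=(n_rows, 20))
y_toy = 3 * signal[:, 0] - signal[:, 1] + rng.normal(scale=2.0, size=n_rows)

X_small = signal
X_big = np.hstack([signal, noise_cols])

# Ordered split: first 70% of rows for training, the rest held out
cut = int(n_rows * 0.7)

def fit_and_score(X):
    """Fit on the first `cut` rows; return (train R², held-out R²)."""
    model = LinearRegression().fit(X[:cut], y_toy[:cut])
    train_r2 = r2_score(y_toy[:cut], model.predict(X[:cut]))
    test_r2 = r2_score(y_toy[cut:], model.predict(X[cut:]))
    return train_r2, test_r2

train_small, test_small = fit_and_score(X_small)
train_big, test_big = fit_and_score(X_big)

print(f"2 features:  train R² = {train_small:.4f} | held-out R² = {test_small:.4f}")
print(f"22 features: train R² = {train_big:.4f} | held-out R² = {test_big:.4f}")
```

The training R² of the larger model is always at least as high as that of the smaller one. The held-out R² carries no such guarantee and usually does not benefit from noise columns, which is why feature comparisons are more trustworthy when scored on data the model has not seen.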

---

## Step 4: Create Chronological Train-Test Split

Let's evaluate model performance honestly by splitting data chronologically: train on past, test on future.

In [None]:
# Import libraries and prepare data
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Load and prepare data (same as previous steps)
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

# Create features
df['hour'] = df['datetime'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['demand_lag_1h'] = df['count'].shift(1)
df['demand_lag_24h'] = df['count'].shift(24)
df['dayofweek'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df_clean = df.dropna(subset=['demand_lag_1h', 'demand_lag_24h'])

feature_columns = ['temp', 'humidity', 'windspeed', 'hour_sin', 'hour_cos',
                   'demand_lag_1h', 'demand_lag_24h', 'dayofweek', 'month']
X = df_clean[feature_columns]
y = df_clean['count']

# Create chronological split: 80% train, 20% test
split_index = int(len(df_clean) * 0.8)
X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

# Train model on training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on both training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

train_r2 = r2_score(y_train, y_train_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_r2 = r2_score(y_test, y_test_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("=== CHRONOLOGICAL TRAIN-TEST SPLIT ===")
print(f"Training set: {len(X_train)} obs ({len(X_train)/len(df_clean)*100:.1f}%)")
print(f"Testing set:  {len(X_test)} obs ({len(X_test)/len(df_clean)*100:.1f}%)")

print(f"\n=== PERFORMANCE COMPARISON ===")
print(f"Training:  R² = {train_r2:.4f}, RMSE = {train_rmse:.2f} bikes")
print(f"Testing:   R² = {test_r2:.4f}, RMSE = {test_rmse:.2f} bikes")
print(f"Gap:       {train_r2 - test_r2:.4f}")

if (train_r2 - test_r2) < 0.05:
    print("\n✓ Small gap - model generalizes well to future data")
else:
    print("\n⚠ Noticeable gap - possible overfitting or a shift between the training and test periods")

**What this does:**
- Splits data chronologically using .iloc[] (first 80% = train, last 20% = test)
- Trains model only on historical data (training set)
- Evaluates on both sets to compare training vs testing performance
- Performance gap shows how well model generalizes to unseen future data
- Test R² and RMSE reveal realistic forecasting accuracy
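
The same chronological split can also be written with scikit-learn's `train_test_split` by passing `shuffle=False`. The sketch below uses a ten-row toy frame (invented for this example) to check that it matches the manual `.iloc` split, and to show what the default shuffled split does to time order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data already sorted by time; the index stands in for the time order
toy = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) * 2})
X_toy, y_toy = toy[["feature"]], toy["target"]

# Manual chronological split, as in the cell above
toy_split = int(len(toy) * 0.8)
X_train_manual = X_toy.iloc[:toy_split]
X_test_manual = X_toy.iloc[toy_split:]

# Equivalent split with scikit-learn: shuffle=False keeps the row order
X_train_sk, X_test_sk, y_train_sk, y_test_sk = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False
)

# Default behaviour shuffles rows first, mixing "future" rows into training
X_train_rand, X_test_rand, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print("Chronological test rows:", X_test_sk.index.tolist())
print("Shuffled test rows:     ", sorted(X_test_rand.index.tolist()))
```

On other dataset sizes the two approaches can differ by one row because they round the split point differently, so it is worth printing the set sizes when switching between them. With lag features in the model, a shuffled split is especially misleading: the training set would contain rows recorded after the test rows.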

### Challenge 4: Experiment with Different Split Ratios

Your client asks: "What if we used more data for training? Would that improve test performance?" Compare 70/30, 80/20, and 90/10 splits.

**Your Task:** Train models with three different split ratios and compare their test performance.

In [None]:
# Your code here - compare different split ratios

split_ratios = [0.7, 0.8, 0.9]
results = []

for ratio in split_ratios:
    # Calculate split index for this ratio
    split_idx = int(len(df_clean) * _____)

    # Split data chronologically
    X_train_temp = X.iloc[:_____]
    X_test_temp = X.iloc[_____:]
    y_train_temp = y.iloc[:_____]
    y_test_temp = y.iloc[_____:]

    # Train model
    model_temp = LinearRegression()
    model_temp.fit(X_train_temp, y_train_temp)

    # Evaluate on test set
    y_test_pred_temp = model_temp.predict(X_test_temp)
    test_r2_temp = r2_score(y_test_temp, y_test_pred_temp)
    test_rmse_temp = np.sqrt(mean_squared_error(y_test_temp, y_test_pred_temp))

    # Store results
    results.append({
        # round() avoids float truncation (int((1 - 0.8) * 100) evaluates to 19, not 20)
        'ratio': f"{round(ratio * 100)}/{100 - round(ratio * 100)}",
        'train_size': len(X_train_temp),
        'test_size': len(X_test_temp),
        'test_r2': test_r2_temp,
        'test_rmse': test_rmse_temp
    })

# Display comparison
print("=== SPLIT RATIO COMPARISON ===")
for result in results:
    print(f"{result['ratio']} split:")
    print(f"  Training: {result['train_size']} obs  |  Testing: {result['test_size']} obs")
    print(f"  Test R²: {result['test_r2']:.4f}  |  Test RMSE: {result['test_rmse']:.2f}")
    print()

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Loop through each ratio (0.7, 0.8, 0.9) and calculate `split_idx = int(len(df_clean) * ratio)` inside the loop so that each ratio gets its own split point. Use `.iloc[:split_idx]` for training and `.iloc[split_idx:]` for testing, which keeps the split chronological (no random sampling). Train a fresh LinearRegression model for each split, then store test R² and RMSE in the results list. The trade-off: a larger training set (90/10) gives the model more data to learn from but leaves fewer observations for a reliable test estimate, while a larger test set (70/30) gives a more stable estimate from less training data; 80/20 is a common middle ground. Note also that each ratio tests on a different time window, so differences in test scores reflect the period being predicted as well as the amount of training data. Your client should understand this trade-off between learning and evaluation reliability when choosing a validation strategy.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
split_ratios = [0.7, 0.8, 0.9]
results = []

for ratio in split_ratios:
    # Calculate split index for this ratio
    split_idx = int(len(df_clean) * ratio)

    # Split data chronologically
    X_train_temp = X.iloc[:split_idx]
    X_test_temp = X.iloc[split_idx:]
    y_train_temp = y.iloc[:split_idx]
    y_test_temp = y.iloc[split_idx:]

    # Train model
    model_temp = LinearRegression()
    model_temp.fit(X_train_temp, y_train_temp)

    # Evaluate on test set
    y_test_pred_temp = model_temp.predict(X_test_temp)
    test_r2_temp = r2_score(y_test_temp, y_test_pred_temp)
    test_rmse_temp = np.sqrt(mean_squared_error(y_test_temp, y_test_pred_temp))

    # Store results
    results.append({
        # round() avoids float truncation (int((1 - 0.8) * 100) evaluates to 19, not 20)
        'ratio': f"{round(ratio * 100)}/{100 - round(ratio * 100)}",
        'train_size': len(X_train_temp),
        'test_size': len(X_test_temp),
        'test_r2': test_r2_temp,
        'test_rmse': test_rmse_temp
    })

# Display comparison
print("=== SPLIT RATIO COMPARISON ===")
for result in results:
    print(f"{result['ratio']} split:")
    print(f"  Training: {result['train_size']} obs  |  Testing: {result['test_size']} obs")
    print(f"  Test R²: {result['test_r2']:.4f}  |  Test RMSE: {result['test_rmse']:.2f}")
    print()

print("Insight: Larger training sets provide more learning data, but smaller test sets reduce evaluation reliability.")
print("Recommendation: 80/20 split balances learning and evaluation needs for most applications.")
```

</details>

---

## Step 5: Apply Cross-Validation for Robust Evaluation

A single train-test split gives one performance estimate. Cross-validation provides multiple estimates for more reliable evaluation.

In [None]:
# Import libraries and prepare data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error  # mean_squared_error is used for RMSE in Challenge 5

# Load and prepare data
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

# Create features
df['hour'] = df['datetime'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['demand_lag_1h'] = df['count'].shift(1)
df['demand_lag_24h'] = df['count'].shift(24)
df['dayofweek'] = df['datetime'].dt.dayofweek
df['month'] = df['datetime'].dt.month
df_clean = df.dropna(subset=['demand_lag_1h', 'demand_lag_24h'])

feature_columns = ['temp', 'humidity', 'windspeed', 'hour_sin', 'hour_cos',
                   'demand_lag_1h', 'demand_lag_24h', 'dayofweek', 'month']
X = df_clean[feature_columns]
y = df_clean['count']

# Create TimeSeriesSplit with 5 folds
tscv = TimeSeriesSplit(n_splits=5)

# Show fold structure
print("=== TIME SERIES CROSS-VALIDATION ===")
print("Fold structure (expanding window):")
for fold_num, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    train_dates = df_clean.iloc[train_idx]['datetime']
    test_dates = df_clean.iloc[test_idx]['datetime']
    print(f"Fold {fold_num}: Train {len(train_idx):,} obs (through {train_dates.max().date()}), "
          f"Test {len(test_idx):,} obs ({test_dates.min().date()} to {test_dates.max().date()})")

# Run cross-validation
cv_scores = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring='r2')

# Show results
print(f"\n=== CROSS-VALIDATION RESULTS ===")
for i, score in enumerate(cv_scores, 1):
    print(f"Fold {i}: R² = {score:.4f}")

cv_mean = cv_scores.mean()
cv_std = cv_scores.std()

print(f"\nMean R²: {cv_mean:.4f} (±{cv_std:.4f})")
print(f"Range:   {cv_scores.min():.4f} to {cv_scores.max():.4f}")

if cv_std < 0.03:
    print("\n✓ Low variability - consistent performance across time")
elif cv_std < 0.06:
    print("\n⚠ Moderate variability - performance varies somewhat")
else:
    print("\n✗ High variability - unstable across time periods")

**What this does:**
- Creates TimeSeriesSplit with 5 folds (expanding window approach)
- Each fold trains on past data and tests on next time period
- cross_val_score() automates training and evaluation across all folds
- Mean R² provides robust performance estimate
- Standard deviation shows performance consistency across time
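
The expanding-window structure is easier to see on a small example. The sketch below runs `TimeSeriesSplit` on 12 placeholder rows (the array `toy_X` and the splitter name `toy_cv` are invented here so they do not overwrite `X` or `tscv`) and prints the row positions in each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder rows, assumed to be in time order already
toy_X = np.arange(12).reshape(-1, 1)
toy_cv = TimeSeriesSplit(n_splits=5)

# Collect (train positions, test positions) for every fold
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in toy_cv.split(toy_X)
]

for fold_num, (train_idx, test_idx) in enumerate(folds, 1):
    print(f"Fold {fold_num}: train={train_idx} test={test_idx}")

# Every training row comes before every test row in the same fold
no_leakage = all(max(train_idx) < min(test_idx) for train_idx, test_idx in folds)
print(f"Training always precedes testing: {no_leakage}")
```

With 12 rows and 5 splits, each test fold holds 2 rows and the training window grows from 2 rows in the first fold to 10 in the last. The earliest folds therefore train on much less data than the later ones, which is one reason fold scores can differ even when the underlying demand pattern is stable.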

### Challenge 5: Analyze Performance Variability Across Folds

Your client asks: "Which time periods are hardest to predict? Why does performance vary?" Investigate which folds show best and worst performance.

**Your Task:** Calculate detailed metrics for each fold and analyze what makes some periods easier to predict.

In [None]:
# Your code here - detailed fold analysis

fold_results = []

for fold_num, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    # Split data for this fold
    X_train_fold = X.iloc[_____]
    X_test_fold = X.iloc[_____]
    y_train_fold = y.iloc[_____]
    y_test_fold = y.iloc[_____]

    # Train model on this fold
    model_fold = LinearRegression()
    model_fold.fit(X_train_fold, y_train_fold)

    # Evaluate on test set
    y_pred_fold = model_fold.predict(X_test_fold)
    r2_fold = r2_score(y_test_fold, y_pred_fold)
    rmse_fold = np.sqrt(mean_squared_error(y_test_fold, y_pred_fold))

    # Get test period dates and characteristics
    test_dates_fold = df_clean.iloc[test_idx]['datetime']
    avg_temp = df_clean.iloc[test_idx]['temp'].mean()
    avg_demand = df_clean.iloc[test_idx]['count'].mean()

    # Store results
    fold_results.append({
        'fold': fold_num,
        'test_start': test_dates_fold.min().date(),
        'test_end': test_dates_fold.max().date(),
        'r2': r2_fold,
        'rmse': rmse_fold,
        'avg_temp': avg_temp,
        'avg_demand': avg_demand
    })

# Display results
print("=== FOLD-BY-FOLD ANALYSIS ===")
for result in fold_results:
    print(f"\nFold {result['fold']}: {result['test_start']} to {result['test_end']}")
    print(f"  R²: {result['r2']:.4f}  |  RMSE: {result['rmse']:.2f}")
    print(f"  Avg temp: {result['avg_temp']:.1f}°C  |  Avg demand: {result['avg_demand']:.0f} bikes")

# Identify best and worst folds
best_fold = max(fold_results, key=lambda x: x['r2'])
worst_fold = min(fold_results, key=lambda x: x['r2'])

print("\n=== PERFORMANCE INSIGHTS ===")
print(f"Best fold: Fold {best_fold['fold']} (R² = {best_fold['r2']:.4f})")
print(f"  Period: {best_fold['test_start']} to {best_fold['test_end']}")
print(f"Worst fold: Fold {worst_fold['fold']} (R² = {worst_fold['r2']:.4f})")
print(f"  Period: {worst_fold['test_start']} to {worst_fold['test_end']}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Loop through `enumerate(tscv.split(X), 1)` to iterate over the folds with numbering, and use `train_idx` and `test_idx` with `.iloc[]` to split features and target for each fold. Don't reuse the same model across folds - create a fresh `LinearRegression` instance each time to ensure a fair comparison. Use `train_idx` for training and `test_idx` for testing (remember to calculate metrics on the test set using `y_test_fold` vs `y_pred_fold`).

Calculate both R² and RMSE to understand performance completely, and extract test period characteristics like dates (using `df_clean.iloc[test_idx]['datetime']`), average temperature, and average demand. Use `max(list, key=lambda x: x['r2'])` to find the fold with the best performance.

When interpreting the results, keep three sources of variation in mind:

- Seasonal effects mean winter folds might show different patterns than summer
- Growth trends mean later folds include newer data with potentially different patterns
- Holidays or special events in the test period affect predictability

Understanding this variation helps your client identify when models need updating.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
fold_results = []

for fold_num, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    # Split data for this fold
    X_train_fold = X.iloc[train_idx]
    X_test_fold = X.iloc[test_idx]
    y_train_fold = y.iloc[train_idx]
    y_test_fold = y.iloc[test_idx]

    # Train model on this fold
    model_fold = LinearRegression()
    model_fold.fit(X_train_fold, y_train_fold)

    # Evaluate on test set
    y_pred_fold = model_fold.predict(X_test_fold)
    r2_fold = r2_score(y_test_fold, y_pred_fold)
    rmse_fold = np.sqrt(mean_squared_error(y_test_fold, y_pred_fold))

    # Get test period dates and characteristics
    test_dates_fold = df_clean.iloc[test_idx]['datetime']
    avg_temp = df_clean.iloc[test_idx]['temp'].mean()
    avg_demand = df_clean.iloc[test_idx]['count'].mean()

    # Store results
    fold_results.append({
        'fold': fold_num,
        'test_start': test_dates_fold.min().date(),
        'test_end': test_dates_fold.max().date(),
        'r2': r2_fold,
        'rmse': rmse_fold,
        'avg_temp': avg_temp,
        'avg_demand': avg_demand
    })

# Display results
print("=== FOLD-BY-FOLD ANALYSIS ===")
for result in fold_results:
    print(f"\nFold {result['fold']}: {result['test_start']} to {result['test_end']}")
    print(f"  R²: {result['r2']:.4f}  |  RMSE: {result['rmse']:.2f}")
    print(f"  Avg temp: {result['avg_temp']:.1f}°C  |  Avg demand: {result['avg_demand']:.0f} bikes")

# Identify best and worst folds
best_fold = max(fold_results, key=lambda x: x['r2'])
worst_fold = min(fold_results, key=lambda x: x['r2'])

print("\n=== PERFORMANCE INSIGHTS ===")
print(f"Best fold: Fold {best_fold['fold']} (R² = {best_fold['r2']:.4f})")
print(f"  Period: {best_fold['test_start']} to {best_fold['test_end']}")
print(f"  Characteristics: {best_fold['avg_temp']:.1f}°C avg, {best_fold['avg_demand']:.0f} bikes avg")
print(f"\nWorst fold: Fold {worst_fold['fold']} (R² = {worst_fold['r2']:.4f})")
print(f"  Period: {worst_fold['test_start']} to {worst_fold['test_end']}")
print(f"  Characteristics: {worst_fold['avg_temp']:.1f}°C avg, {worst_fold['avg_demand']:.0f} bikes avg")
print(f"\nNote: Differences in performance across folds may indicate changes in demand patterns")
print(f"over time (business growth, operational changes) or varying seasonal predictability.")
```

</details>

---

## Summary: Professional Linear Regression Modeling for Transportation Demand

**What We've Accomplished:**
- **Analyzed linear relationships** using correlation analysis and scipy.stats.linregress to understand which variables drive bike demand
- **Engineered time-based features** including cyclical encoding (hour_sin/cos), lag features (demand history), and temporal components (day, month)
- **Trained predictive models** using scikit-learn's LinearRegression with multiple feature combinations to forecast hourly bike demand
- **Implemented chronological train-test splits** that preserve time series integrity and provide honest performance estimates for forecasting
- **Applied cross-validation** with TimeSeriesSplit to obtain robust performance estimates across multiple temporal windows
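
As a quick refresher on why cyclical encoding matters, the sketch below uses only the standard library. It assumes the usual formulation, angle = 2π · hour / 24. The helper names `encode_hour` and `distance` are illustrative and not taken from the lecture code.

```python
import math


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle (illustrative helper)."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded hours."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


h23, h0, h12 = encode_hour(23), encode_hour(0), encode_hour(12)

# As raw integers, 23 and 0 are 23 units apart; on the circle they are neighbours
print(f"23:00 vs 00:00 distance: {distance(h23, h0):.3f}")
print(f"12:00 vs 00:00 distance: {distance(h12, h0):.3f}")
```

With this encoding, 23:00 sits as close to midnight as 01:00 does, while noon lies on the opposite side of the circle. A linear model sees late night and early morning as adjacent rather than far apart.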

**Key Technical Skills Mastered:**
- **Correlation analysis**: scipy.stats.linregress for calculating slopes, intercepts, and correlation coefficients (r, which is squared to obtain R²)
- **Feature engineering**: cyclical encoding with sin/cos, lag features with .shift(), datetime extraction with .dt accessor
- **Model training**: LinearRegression.fit() for learning patterns, .predict() for generating forecasts
- **Time series splitting**: chronological .iloc[] indexing, TimeSeriesSplit for expanding window cross-validation
- **Performance evaluation**: r2_score, mean_squared_error, cross_val_score for comprehensive model assessment
- **Result interpretation**: translating R² and RMSE into business-relevant forecasting accuracy estimates
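
To ground the interpretation of R² and RMSE, here is a small worked sketch that computes both from their definitions. The helper `r2_and_rmse` is illustrative, and the demand values are made up purely for the arithmetic. They are not results from the Capital Bikes data.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R² and RMSE from their definitions (illustrative helper)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Made-up hourly demand values, for illustration only
actual = [100, 150, 200, 250, 300]
predicted = [110, 140, 210, 240, 310]

r2, rmse = r2_and_rmse(actual, predicted)
print(f"R² = {r2:.3f}, RMSE = {rmse:.1f} bikes")
```

Every prediction here is off by exactly 10 bikes, so the RMSE is 10 bikes. The R² of 0.98 says the predictions account for 98% of the variance in the made-up actuals. RMSE is usually the easier figure to hand to operations staff because it is expressed in bikes per hour.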

**Next Steps:**
In the next module, you'll advance to model evaluation techniques, learning to assess prediction quality, diagnose model limitations, and communicate performance to stakeholders. You'll also explore non-linear models that can capture more complex patterns beyond linear relationships.

Your forecasting model transforms Capital City Bikes from reactive operations to predictive planning, enabling data-driven decisions about bike distribution, staffing levels, and capacity management. You've demonstrated the systematic machine learning workflow that professional data scientists use to deliver reliable predictions for real-world business applications!