# Lecture 8: Linear Models for Prediction - Building Your First Predictive Engine

## Learning Objectives

By the end of this lecture, you will be able to:
- Understand linear regression fundamentals and their mathematical foundation
- Recognize linear relationships in transportation and urban mobility data
- Implement linear regression using scikit-learn's LinearRegression class
- Create proper train-test splits to evaluate model performance on unseen data
- Apply chronological splitting strategies for time series forecasting problems
- Use cross-validation techniques to obtain robust performance estimates
- Interpret model coefficients and communicate insights to business stakeholders
- Evaluate model performance using RMSE, R-squared, and MAE metrics
- Position linear regression as a foundation for advanced modeling approaches

---

## 1. Your Predictive Modeling Journey Begins

### The Consultant's Next Challenge: From Data to Predictions

Six months into your consulting engagement with Capital City Bikes, your expertise in data preparation and feature engineering has transformed raw operational data into a sophisticated analytical foundation. The client's CEO approaches you with an urgent new challenge: "We need to predict demand for next week to optimize our bike distribution and staffing decisions. Our investors want to see concrete forecasting capabilities before our Series A funding round."

This is **the moment every data consultant anticipates** - moving beyond descriptive analysis to build predictive capabilities that directly drive business decisions. Just as a bridge engineer progresses from understanding materials and forces to designing structures that bear actual loads, you're transitioning from data preparation to building models that must perform reliably in real business conditions.

Your client needs more than just predictions - they need predictions they can understand, trust, and act upon. Board members will question the methodology. Operations managers will base daily decisions on your forecasts. Investors will evaluate the company's analytical sophistication based on your work. This requires not just technical accuracy, but **clear communication and robust methodology** that stakeholders can comprehend and defend.

### Why Linear Regression: The Foundation of Predictive Intelligence

Think of linear regression as learning to drive with a manual transmission before advancing to complex automated systems. While sophisticated machine learning algorithms like neural networks and ensemble methods can achieve impressive accuracy, they often function as "black boxes" that provide predictions without explanation. Linear regression, by contrast, offers **transparency that's essential for building stakeholder confidence** and business understanding.

Linear regression serves multiple crucial purposes in your consulting toolkit. First, it **establishes baseline performance** that more complex models must exceed to justify their additional complexity. Second, it provides interpretable insights that help stakeholders understand which factors actually drive their business outcomes. Third, it offers rapid development cycles that enable quick iteration and learning, particularly valuable in fast-moving startup environments.

These qualities become your competitive advantage as a consultant. While competitors might deliver accurate predictions through complex algorithms, you'll provide **predictions plus understanding**, enabling your client to make informed decisions and confidently communicate their analytical capabilities to investors and partners.

## 2. Linear Regression Fundamentals

This section establishes the mathematical and conceptual foundation of linear regression before exploring its implementation. We'll start by defining linear regression from first principles, then examine how linear relationships manifest in real data, understand the mathematics behind the regression line, and finally connect these concepts to practical business applications in urban mobility.

### 2.1. What is Linear Regression

Let's begin by establishing a clear definition of linear regression and understanding why this foundational technique remains essential for modern predictive modeling. We'll explore the core concept, examine how it represents relationships mathematically, and understand why its transparency makes it particularly valuable for business applications. This conceptual foundation will prepare you to implement linear regression effectively in transportation consulting scenarios.

Linear regression is **a statistical method that models the relationship between a dependent variable** (what we want to predict) **and one or more independent variables** (what we use to make predictions) by fitting a straight line through the data points. The fundamental assumption of linear regression is that the relationship between variables can be expressed as a linear equation, meaning that changes in the independent variables produce proportional changes in the dependent variable.

At its core, linear regression seeks the "best-fit" line: the one that minimizes the sum of squared vertical distances between the actual data points and the values the line predicts. This line represents the average relationship between the input variables and the target variable, allowing us to make predictions for new, unseen data points.

The power of linear regression lies in **its simplicity and interpretability**. Unlike complex algorithms that act as "black boxes," linear regression provides clear, understandable relationships that can be easily communicated to stakeholders. Each coefficient in the model tells us exactly how much the target variable is expected to change when the corresponding input variable increases by one unit, holding all other variables constant.

In the context of predictive modeling, linear regression serves as both a practical prediction tool and a baseline for comparison with more sophisticated algorithms. Its transparency makes it particularly valuable in business contexts where understanding the "why" behind predictions is as important as the predictions themselves.

Let's see how linear relationships appear in real bike-sharing data:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Create a simple scatter plot showing temperature-demand relationship
plt.figure(figsize=(10, 6))
plt.scatter(df['temp'], df['count'], alpha=0.3, s=20)
plt.xlabel('Temperature (°C)', fontsize=12)
plt.ylabel('Hourly Bike Rentals', fontsize=12)
plt.title('Linear Relationship: Temperature vs. Bike Demand', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate correlation to quantify relationship strength
correlation = df['temp'].corr(df['count'])
print(f"Temperature-Demand Correlation: {correlation:.3f}")
print(f"Temperature explains {(correlation**2)*100:.1f}% of demand variation")

**What this demonstrates:**
- **Visual evidence of linear relationships** in transportation data - warmer temperatures generally associate with higher demand
- The scatter plot shows an upward trend with wide scatter around it, which makes linear regression a reasonable starting point rather than a complete description of demand
- The moderate correlation (r ≈ 0.39) indicates that while temperature matters, other factors also influence demand significantly
- This visual check supports using linear regression as a first model for demand prediction in bike-sharing systems

### 2.2. Linear Relationships in Data

Now that we've defined linear regression, let's explore what constitutes a linear relationship and how to identify these patterns in transportation data. We'll examine the characteristics of linear relationships, understand positive versus negative associations, and learn to recognize when linear modeling is appropriate. This understanding will guide your feature selection and model design decisions.

A linear relationship exists when **two variables change together at a constant rate**. In mathematical terms, each one-unit change in one variable is accompanied by the same fixed change in the other, wherever on the scale it occurs. When plotted on a graph, these relationships appear as straight lines, hence the term "linear."

Linear relationships can be positive or negative. In a positive linear relationship, both variables move in the same direction - as one increases, the other increases proportionally. In a negative linear relationship, the variables move in opposite directions - as one increases, the other decreases proportionally. **The strength of a linear relationship is measured by how closely the data points cluster around the best-fit line.**
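
One common way to quantify how tightly points cluster around a line is the Pearson correlation coefficient r. The sketch below uses a small made-up dataset (the `temp` and `rentals` arrays are invented for illustration, not taken from the bike-sharing data) to compute r from its definition and compare it with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy data: temperature (°C) and hourly rentals, invented for illustration
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
rentals = np.array([40.0, 95.0, 120.0, 190.0, 210.0, 280.0])

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled by their individual spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

r_manual = pearson_r(temp, rentals)
r_numpy = np.corrcoef(temp, rentals)[0, 1]

print(f"r (from definition): {r_manual:.4f}")
print(f"r (np.corrcoef):     {r_numpy:.4f}")

# Flipping the sign of y turns a positive relationship into a negative one:
# the sign of r flips, its magnitude (the strength) stays the same
r_negative = pearson_r(temp, -rentals)
print(f"r for the mirrored (negative) relationship: {r_negative:.4f}")
```

Values of r near +1 or -1 mean the points sit close to a straight line; values near 0 mean a straight line describes the data poorly.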

Real-world data rarely exhibits perfect linear relationships, but many phenomena demonstrate approximately linear patterns that can be effectively modeled using linear regression. The key is identifying variables that have reasonably consistent relationships with each other, even if some variability exists around the general trend.

In transportation and urban mobility contexts, several relationships tend to be approximately linear, at least over a limited range. Weather conditions, time patterns, and seasonal factors often show roughly linear relationships with transportation demand. For example, as temperature increases within a comfortable range, bike-sharing usage typically increases at a relatively consistent rate. Similarly, during the build-up to the morning rush hour, subway ridership can rise at a roughly steady rate, even though the full daily pattern is far from linear.

The identification of linear relationships is crucial for successful linear regression modeling. Variables with strong linear relationships will produce more accurate and reliable models, while variables with weak or non-linear relationships may require transformation or alternative modeling approaches.
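
To make the point about non-linear relationships concrete, here is a minimal sketch with invented data in which demand peaks at a comfortable temperature and falls off on both sides. The inverted-U formula and the 20°C peak are assumptions chosen for illustration; in practice the peak is not known in advance, and a common alternative is to include both `temp` and `temp**2` as features.

```python
import numpy as np

# Hypothetical data: demand peaks at 20°C and drops on either side (no noise added)
temp = np.linspace(0, 40, 41)            # 0°C to 40°C in 1°C steps
demand = 300 - 0.75 * (temp - 20) ** 2   # invented inverted-U pattern

# Correlation with raw temperature is ~0, even though demand depends entirely on it
r_raw = np.corrcoef(temp, demand)[0, 1]

# A transformed feature (squared distance from the assumed 20°C peak) is perfectly linear in demand
distance_sq = (temp - 20) ** 2
r_transformed = np.corrcoef(distance_sq, demand)[0, 1]

print(f"Correlation with raw temperature:    {r_raw:+.3f}")
print(f"Correlation with transformed feature: {r_transformed:+.3f}")
```

A near-zero correlation therefore does not mean "no relationship"; it only means no *linear* relationship in the variable as currently expressed.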

Let's visualize both positive and negative linear relationships in bike-sharing data:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Create side-by-side plots showing positive and negative relationships
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Panel 1: Positive relationship (Temperature vs. Demand)
axes[0].scatter(df['temp'], df['count'], alpha=0.3, s=15, color='#2ECC71')
slope1, intercept1, r1, _, _ = stats.linregress(df['temp'], df['count'])
line_x1 = np.array([df['temp'].min(), df['temp'].max()])
axes[0].plot(line_x1, slope1 * line_x1 + intercept1, 'r--', linewidth=2.5,
             label=f'Trend Line (r = {r1:.3f})')
axes[0].set_xlabel('Temperature (°C)', fontsize=11)
axes[0].set_ylabel('Hourly Bike Rentals', fontsize=11)
axes[0].set_title('Positive Linear Relationship\nHigher Temp → Higher Demand',
                  fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Panel 2: Negative relationship (Humidity vs. Demand)
axes[1].scatter(df['humidity'], df['count'], alpha=0.3, s=15, color='#E74C3C')
slope2, intercept2, r2, _, _ = stats.linregress(df['humidity'], df['count'])
line_x2 = np.array([df['humidity'].min(), df['humidity'].max()])
axes[1].plot(line_x2, slope2 * line_x2 + intercept2, 'b--', linewidth=2.5,
             label=f'Trend Line (r = {r2:.3f})')
axes[1].set_xlabel('Humidity (%)', fontsize=11)
axes[1].set_ylabel('Hourly Bike Rentals', fontsize=11)
axes[1].set_title('Negative Linear Relationship\nHigher Humidity → Lower Demand',
                  fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Linear Relationship Analysis ===")
print(f"Temperature (positive): r = {r1:.3f}, slope = {slope1:.2f} rides/°C")
print(f"Humidity (negative): r = {r2:.3f}, slope = {slope2:.2f} rides/%")

**What this demonstrates:**
- **Positive linear relationships** show upward trends - as temperature increases, demand increases consistently
- **Negative linear relationships** show downward trends - as humidity increases, demand decreases
- The correlation coefficients (r values) quantify relationship strength - temperature shows moderate positive correlation, humidity shows moderate negative correlation
- The slopes tell the business story: each 1°C temperature increase adds approximately 9.2 rides per hour, while each 1% humidity increase reduces demand by approximately 2.2 rides per hour
- These quantified relationships enable **concrete operational planning** and weather-responsive capacity adjustments

### 2.3. The Regression Line

Having identified linear relationships, let's now understand how to mathematically represent these patterns through the regression line. We'll explore the equation structure, interpret slope and intercept parameters, and learn how regression finds the optimal line that best fits the data. This mathematical foundation will help you understand what your models are actually calculating.

The regression line, also known as **the line of best fit**, is the mathematical representation of the linear relationship between variables. For simple linear regression with one independent variable, this line is expressed using the familiar equation: y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the y-intercept.

The slope (m) represents the rate of change - how much y increases or decreases for every one-unit increase in x. The y-intercept (b) represents the value of y when x equals zero. Together, these parameters define the position and angle of the line that best represents the relationship in the data.

For multiple linear regression, which involves more than one independent variable, the equation extends to: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ, where b₀ is the intercept, and b₁, b₂, through bₙ are the coefficients for each independent variable x₁, x₂, through xₙ. Each coefficient represents the expected change in y for a one-unit increase in the corresponding x variable, assuming all other variables remain constant.
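
The "holding all other variables constant" reading of a coefficient can be checked numerically. The sketch below builds noise-free data from coefficients chosen by hand (the values 20, 9, and 30 are assumptions for illustration, not fitted to the bike-sharing data), recovers them with NumPy's least-squares solver, and then changes only the temperature to see the effect on the prediction.

```python
import numpy as np

# Hypothetical coefficients chosen for illustration (not fitted to real data)
b0, b_temp, b_workday = 20.0, 9.0, 30.0

temp = np.array([8.0, 12.0, 18.0, 22.0, 27.0, 31.0, 15.0, 25.0])
workingday = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
demand = b0 + b_temp * temp + b_workday * workingday  # noise-free, so recovery is exact

# Design matrix: a column of ones for the intercept, then one column per feature
X = np.column_stack([np.ones_like(temp), temp, workingday])
coefficients, *_ = np.linalg.lstsq(X, demand, rcond=None)
est_b0, est_temp, est_workday = coefficients
print(f"Recovered: b0 = {est_b0:.2f}, b_temp = {est_temp:.2f}, b_workday = {est_workday:.2f}")

# "Holding all else constant": raise temperature by 1°C on a working day
pred_20 = est_b0 + est_temp * 20 + est_workday * 1
pred_21 = est_b0 + est_temp * 21 + est_workday * 1
print(f"Change in prediction for +1°C (working day fixed): {pred_21 - pred_20:.2f}")
```

The difference between the two predictions equals the temperature coefficient, because the working-day term is identical in both and cancels out.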

The process of finding the best regression line involves **minimizing the sum of squared differences** between the actual data points and the predicted values on the line. This method, called ordinary least squares (OLS), yields the line with the smallest total squared error on the available data. Squaring the differences keeps positive and negative errors from canceling each other out and penalizes large misses more heavily than small ones.

The quality of the regression line is measured by how well it explains the variability in the data. A perfect fit would have all data points lying exactly on the line, while a poor fit would show data points scattered widely around the line with no clear pattern.
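
For a single feature, the least-squares slope and intercept have a simple closed form: the slope is the co-variation of x and y divided by the variation of x, and the intercept makes the line pass through the point of means. The following sketch applies these formulas to a small invented dataset (the arrays and the helper names `fit_line` and `sse` are illustrative assumptions) and checks that moving away from the solution only increases the squared error.

```python
import numpy as np

# Hypothetical toy data, invented for illustration
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
rentals = np.array([40.0, 95.0, 120.0, 190.0, 210.0, 280.0])

def fit_line(x, y):
    """Closed-form ordinary least squares for one feature."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    intercept = y_mean - slope * x_mean  # the line passes through (x_mean, y_mean)
    return slope, intercept

def sse(x, y, slope, intercept):
    """Sum of squared errors between actual values and the line's predictions."""
    residuals = y - (slope * x + intercept)
    return (residuals ** 2).sum()

m, b = fit_line(temp, rentals)
best_sse = sse(temp, rentals, m, b)
print(f"Fitted line: y = {m:.3f}x + {b:.3f}, SSE = {best_sse:.1f}")

# Nudging either parameter away from the OLS solution increases the error
nudged = [sse(temp, rentals, m + dm, b + db) for dm in (-0.5, 0.5) for db in (-5.0, 5.0)]
print(f"SSE after nudging the parameters: {[round(s, 1) for s in nudged]}")
```

Library routines such as `scipy.stats.linregress` solve the same minimization problem; the closed form is shown here only to make the calculation visible.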

Let's calculate and visualize a regression line with its equation parameters:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate regression line parameters
slope, intercept, r_value, p_value, std_err = stats.linregress(df['temp'], df['count'])

# Create visualization showing the regression line
plt.figure(figsize=(12, 7))
plt.scatter(df['temp'], df['count'], alpha=0.2, s=20, label='Actual Data')

# Plot the regression line
line_x = np.array([df['temp'].min(), df['temp'].max()])
line_y = slope * line_x + intercept
plt.plot(line_x, line_y, 'r-', linewidth=3, label='Regression Line')

# Add equation annotation
equation_text = f'y = {slope:.2f}x + {intercept:.2f}\n'
equation_text += f'Correlation: r = {r_value:.3f}\n'
equation_text += f'R² = {r_value**2:.3f}'
plt.text(0.05, 0.95, equation_text, transform=plt.gca().transAxes,
         fontsize=12, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.xlabel('Temperature (°C)', fontsize=12, fontweight='bold')
plt.ylabel('Hourly Bike Rentals', fontsize=12, fontweight='bold')
plt.title('The Regression Line: Mathematical Representation of Temperature-Demand Relationship',
          fontsize=13, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Demonstrate prediction using the regression equation
example_temps = [10, 15, 20, 25, 30]
print("\n=== Using the Regression Equation for Predictions ===")
print(f"Regression Equation: Demand = {slope:.2f} × Temperature + {intercept:.2f}\n")
for temp in example_temps:
    predicted_demand = slope * temp + intercept
    print(f"At {temp}°C: Predicted demand = {predicted_demand:.0f} rides per hour")

print(f"\nInterpretation:")
print(f"- Slope ({slope:.2f}): Each 1°C increase adds ~{slope:.0f} rides per hour")
print(f"- Intercept ({intercept:.2f}): Baseline demand at 0°C would be ~{intercept:.0f} rides")
print(f"- R² ({r_value**2:.3f}): Temperature explains {(r_value**2)*100:.1f}% of demand variation")

**What this demonstrates:**
- **The regression line equation** (y = mx + b) translates into concrete business predictions
- The slope parameter quantifies the temperature effect: approximately 9.2 additional rides per hour for each degree Celsius increase
- The intercept provides the theoretical baseline demand (though 0°C is outside typical operating conditions)
- **R² (coefficient of determination)** reveals that temperature alone explains only 15.5% of demand variation, indicating other factors matter significantly
- The equation enables **scenario planning**: managers can estimate demand at different forecasted temperatures
- Visualization shows both the general trend (line) and variability around it (scatter), helping stakeholders understand prediction uncertainty
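
R² can also be computed directly from its definition, as one minus the ratio of unexplained variation to total variation. For a simple regression with one feature and an intercept, this equals the squared correlation coefficient. The sketch below verifies that on a small invented dataset (the arrays are assumptions for illustration).

```python
import numpy as np

# Hypothetical noisy toy data, invented for illustration
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
rentals = np.array([60.0, 30.0, 150.0, 110.0, 260.0, 180.0])

# Fit a straight line (np.polyfit returns the slope first, then the intercept)
slope, intercept = np.polyfit(temp, rentals, 1)
predicted = slope * temp + intercept

ss_res = ((rentals - predicted) ** 2).sum()       # variation the line leaves unexplained
ss_tot = ((rentals - rentals.mean()) ** 2).sum()  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(temp, rentals)[0, 1]
print(f"R² from definition: {r_squared:.4f}")
print(f"r squared:          {r ** 2:.4f}")
```

With several features the shortcut through a single correlation no longer applies, but the definition based on `ss_res` and `ss_tot` still does.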

### 2.4. Business Applications

With the mathematical foundation established, let's explore how linear regression translates into practical business value. We'll examine real-world applications across industries, then focus specifically on transportation and urban mobility use cases. Understanding these applications will help you identify opportunities to apply linear regression in your consulting engagements.

Linear regression finds **extensive application in business contexts** due to its interpretability and practical utility. Organizations use linear regression for demand forecasting, price optimization, resource allocation, and performance analysis. The clear relationship between input variables and outcomes makes linear regression particularly valuable for strategic planning and decision-making processes.

In demand forecasting, businesses use linear regression to predict future sales, customer traffic, or service utilization based on historical patterns and influencing factors. The model coefficients provide actionable insights into which factors most significantly drive demand, enabling targeted interventions and resource optimization.

Price optimization represents another key application area. Linear regression can model the relationship between pricing strategies and sales volumes, helping businesses identify optimal price points that maximize revenue or market share. The interpretable nature of the results allows managers to understand exactly how price changes will impact demand.
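
As a worked illustration of the price-optimization idea, suppose a regression has produced a linear demand curve. The intercept and slope below are invented numbers, not results from any dataset, and the model ignores costs and competition. Under those assumptions revenue is a parabola in price, so the revenue-maximizing price follows directly from the two coefficients.

```python
import numpy as np

# Hypothetical fitted demand curve: rides = intercept + slope * price
intercept = 500.0   # assumed rides per day at a price of zero
slope = -40.0       # assumed change in rides per 1-unit price increase

def expected_revenue(price):
    """Revenue = price × predicted demand at that price."""
    return price * (intercept + slope * price)

# Revenue is a downward-opening parabola in price; its vertex is at -intercept / (2 * slope)
optimal_price = -intercept / (2 * slope)

# Cross-check against a brute-force grid search over candidate prices
candidate_prices = np.linspace(0, 12.5, 1251)
grid_best = candidate_prices[np.argmax(expected_revenue(candidate_prices))]

print(f"Analytical optimum:  price = {optimal_price:.2f}, revenue = {expected_revenue(optimal_price):.1f}")
print(f"Grid-search optimum: price = {grid_best:.2f}")
```

The same two numbers a manager reads off the regression output (baseline demand and price sensitivity) are the only inputs the calculation needs, which is what makes the result easy to explain.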

Resource allocation decisions benefit from **linear regression's ability to quantify relationships** between input investments and output results. Organizations can model the relationship between staffing levels and service quality, marketing spend and customer acquisition, or facility capacity and operational efficiency.

In urban mobility and transportation contexts, linear regression supports route planning, fleet optimization, and service scheduling decisions. Transportation authorities use linear regression to understand how weather conditions, special events, and seasonal patterns affect ridership, enabling proactive service adjustments and capacity planning.

Let's apply linear regression to a concrete business scenario for Capital City Bikes:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Business Scenario: Predict demand based on temperature and working day status
# Prepare features for multiple linear regression
X = df[['temp', 'workingday']]
y = df['count']

# Train a simple linear regression model
model = LinearRegression()
model.fit(X, y)

# Extract model parameters
temp_coefficient = model.coef_[0]
workday_coefficient = model.coef_[1]
intercept = model.intercept_

# Make predictions
predictions = model.predict(X)
r2 = r2_score(y, predictions)

print("=== LINEAR REGRESSION BUSINESS APPLICATION ===")
print("\nModel Equation:")
print(f"Demand = {intercept:.1f} + ({temp_coefficient:.1f} × Temperature) + ({workday_coefficient:.1f} × Working Day)")

print("\n--- Business Insights ---")
print(f"Temperature Impact: Each 1°C increase adds {temp_coefficient:.0f} rides per hour")
print(f"Working Day Effect: Working days generate {workday_coefficient:.0f} more rides per hour than weekends/holidays")
print(f"Model Accuracy: Explains {r2*100:.1f}% of demand variation (R² = {r2:.3f})")

# Scenario Planning: What-if analysis for business decisions
print("\n--- Scenario Planning for Capital City Bikes ---")
scenarios = [
    {"name": "Cold Weekend", "temp": 10, "workday": 0},
    {"name": "Warm Weekend", "temp": 25, "workday": 0},
    {"name": "Cold Weekday", "temp": 10, "workday": 1},
    {"name": "Warm Weekday", "temp": 25, "workday": 1}
]

for scenario in scenarios:
    predicted_demand = intercept + (temp_coefficient * scenario['temp']) + (workday_coefficient * scenario['workday'])
    print(f"{scenario['name']:15s}: {predicted_demand:.0f} rides per hour expected")

# Operational recommendation: compare predicted hourly demand at 20°C
# (the model predicts rides per hour, not bikes per station)
demand_weekday_20c = intercept + (temp_coefficient * 20) + (workday_coefficient * 1)
demand_weekend_20c = intercept + (temp_coefficient * 20) + (workday_coefficient * 0)
print(f"\nOperational Recommendation (20°C conditions):")
print(f"  Weekday expected demand: {demand_weekday_20c:.0f} rides per hour")
print(f"  Weekend expected demand: {demand_weekend_20c:.0f} rides per hour")
print(f"  Weekday demand is {((demand_weekday_20c/demand_weekend_20c - 1)*100):.1f}% higher than weekend demand")

**What this demonstrates:**
- **Multiple linear regression** combines several business factors (temperature, working day status) into a single predictive model
- Model coefficients translate directly into business language: "working days see about 30 more rides per hour than weekends and holidays"
- **Scenario planning** enables proactive capacity decisions based on weather forecasts and calendar information
- The what-if analysis shows demand ranging from 52 rides (cold weekend) to 291 rides (warm weekday) - **a 5.6-fold variation** requiring flexible operations
- Quantified insights support **evidence-based resource allocation**: weekday operations require 17% more capacity than weekends at similar temperatures
- This transparency enables stakeholder confidence - executives can understand exactly how the model makes recommendations

## 3. Machine Learning Implementation

This section transitions from mathematical theory to practical implementation using machine learning tools. We'll build your understanding step by step: first mastering the LinearRegression tool itself, then understanding why proper evaluation matters, learning to create train-test splits, adapting for time series data, gaining robust performance estimates through cross-validation, and finally diving deep into performance metrics. This systematic progression ensures you can confidently deploy linear regression models for real-world demand forecasting.

### 3.1. From Mathematics to Code

The transition from mathematical concepts to computational implementation represents **a crucial step in applying linear regression to real-world problems**. While the underlying mathematics remains the same, machine learning libraries provide optimized implementations that handle the computational complexity and offer additional functionality for model management and evaluation.

**The scikit-learn library** provides a comprehensive implementation of linear regression that automates the process of finding optimal coefficients through mathematical optimization algorithms. These implementations efficiently handle large datasets, multiple variables, and various data types while maintaining numerical stability and computational efficiency.

The key advantage of scikit-learn is its **seamless connection with the entire data science process**. Think of it like LEGO bricks: every piece, whether it's a tiny 1x1 block or a specialized wheel, connects the same way, so once you learn how the pieces fit together you can build a house, a car, or a spaceship. Scikit-learn provides standardized interfaces that work the same way whether you're preparing data, training models, making predictions, or measuring performance. Once you learn the pattern for linear regression, you can apply the same approach to decision trees, neural networks, or other estimators in the library: each one exposes `.fit()` and `.predict()` methods with the same calling convention.
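
The value of a shared interface is easiest to see with two very different models driven by the same loop. The classes below, `MeanBaseline` and `LeastSquaresLine`, are hypothetical toy classes written for this illustration. They only imitate the `.fit()` / `.predict()` convention and are not part of scikit-learn.

```python
import numpy as np

class MeanBaseline:
    """Toy model (hypothetical): always predicts the mean of the training target."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

class LeastSquaresLine:
    """Toy linear model (hypothetical) fitted with np.linalg.lstsq."""
    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
        solution, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
        self.intercept_ = solution[0]
        self.coef_ = solution[1:]
        return self

    def predict(self, X):
        return self.intercept_ + np.asarray(X, dtype=float) @ self.coef_

# Invented data that follows demand = 10 × temperature exactly
X = np.array([[10.0], [15.0], [20.0], [25.0]])
y = np.array([100.0, 150.0, 200.0, 250.0])

# The calling code is identical for both models: fit, then predict
results = {}
for model in (MeanBaseline(), LeastSquaresLine()):
    model.fit(X, y)
    results[type(model).__name__] = model.predict(np.array([[30.0]]))[0]
    print(f"{type(model).__name__:18s} predicts {results[type(model).__name__]:.1f} rides at 30°C")
```

Because the loop never needs to know which model it is handling, swapping in a more sophisticated algorithm later requires changing only the line that creates the model.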

Let's see the complete journey from manual calculation to automated scikit-learn implementation:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

print("=== FROM MATHEMATICS TO CODE ===\n")

# Method 1: Closed-form least-squares fit using scipy (shows the mathematics)
print("--- Method 1: Mathematical Approach (scipy) ---")
slope, intercept, r_value, p_value, std_err = stats.linregress(df['temp'], df['count'])
print(f"Equation: Demand = {slope:.2f} × Temperature + {intercept:.2f}")
print(f"R-squared: {r_value**2:.3f}")
print(f"Standard error: {std_err:.2f}")

# Manual prediction
example_temp = 20
manual_prediction = slope * example_temp + intercept
print(f"Manual prediction at {example_temp}°C: {manual_prediction:.0f} rides\n")

# Method 2: Machine learning approach (scikit-learn)
print("--- Method 2: Machine Learning Approach (scikit-learn) ---")
# Prepare data in scikit-learn format (2D array for features)
X = df[['temp']]  # Must be 2D for scikit-learn
y = df['count']

# Create and train the model
ml_model = LinearRegression()
ml_model.fit(X, y)

# Extract parameters (same as manual calculation)
ml_slope = ml_model.coef_[0]
ml_intercept = ml_model.intercept_
print(f"Equation: Demand = {ml_slope:.2f} × Temperature + {ml_intercept:.2f}")

# Calculate R-squared
ml_predictions = ml_model.predict(X)
from sklearn.metrics import r2_score
ml_r2 = r2_score(y, ml_predictions)
print(f"R-squared: {ml_r2:.3f}")

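# ML prediction (pass a DataFrame with the same column name used in fit,
# which avoids scikit-learn's "X does not have valid feature names" warning)
ml_prediction = ml_model.predict(pd.DataFrame({'temp': [example_temp]}))[0]
print(f"ML prediction at {example_temp}°C: {ml_prediction:.0f} rides\n")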

# Verify equivalence
print("--- Verification: Both Methods Produce Identical Results ---")
print(f"Slopes match: {abs(slope - ml_slope) < 0.01}")
print(f"Intercepts match: {abs(intercept - ml_intercept) < 0.01}")
print(f"Predictions match: {abs(manual_prediction - ml_prediction) < 0.01}")

print("\n--- Key Advantages of ML Implementation ---")
print("✓ Standardized interface works with all scikit-learn algorithms")
print("✓ Easy extension to multiple features (just add columns)")
print("✓ Built-in performance metrics and cross-validation")
print("✓ Numerical stability for large datasets")
print("✓ Integration with pipelines and preprocessing tools")

**What this demonstrates:**
- **Mathematical and machine learning approaches produce identical results** - the ML implementation is simply automating the mathematics we've learned
- The scikit-learn interface provides a **consistent pattern** (fit, predict, score) that works across all algorithms, simplifying future learning
- ML implementations take care of many numerical details automatically, which helps when moving from a notebook toward production use
- The 2D input requirement (features must have shape `(n_samples, n_features)`, e.g. `[[value]]` rather than `[value]`) reflects ML's design for multi-feature models
- **Transitioning to ML tools** enables scaling from simple models (one feature) to complex models (many features) without changing your code structure
- This equivalence validation builds confidence: you understand both the mathematics and the implementation
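
The 2D input requirement is a frequent source of errors, so it is worth seeing the shapes side by side. This short sketch uses NumPy only, with invented values, to show how a 1D array of temperatures is turned into the `(n_samples, n_features)` layout that scikit-learn estimators expect.

```python
import numpy as np

temps_1d = np.array([10.0, 15.0, 20.0])   # shape (3,): three values, no feature axis
temps_2d = temps_1d.reshape(-1, 1)        # shape (3, 1): three samples, one feature
single = np.array([[20.0]])               # shape (1, 1): one sample, one feature

# Adding a second feature just adds a column; the number of rows stays the same
workingday = np.array([0, 1, 1])
two_features = np.column_stack([temps_1d, workingday])  # shape (3, 2)

for name, arr in [("temps_1d", temps_1d), ("temps_2d", temps_2d),
                  ("single", single), ("two_features", two_features)]:
    print(f"{name:13s} shape = {arr.shape}")
```

With pandas, selecting columns with double brackets (`df[['temp']]`) gives the 2D layout directly, while single brackets (`df['temp']`) return a 1D Series.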

### 3.2. Basic Model Usage

Now that we understand how scikit-learn automates the mathematics, let's master the LinearRegression class mechanics before diving into evaluation complexities. We'll focus purely on the tool itself: how to create models, train them, make predictions, and interpret the learned parameters. Think of this as learning to drive a car before worrying about navigation or traffic rules.

The scikit-learn LinearRegression class provides **a standardized interface** that makes machine learning accessible and consistent. The beauty of this interface is its simplicity: scikit-learn's predictive models follow the same pattern, which means once you master linear regression, you already know the basic workflow for decision trees, neural networks, and many other algorithms.

**The core LinearRegression workflow** consists of just three essential steps: create the model object, fit it to data, and use it to make predictions. This pattern remains identical whether you're building simple single-feature models or complex multi-feature systems, providing a solid foundation for all your machine learning work.

Let's start by understanding **the key components** of the LinearRegression class. When you create a model, you're instantiating an object that can learn relationships from data. When you call `.fit()`, the model analyzes your data to find optimal coefficients. When you call `.predict()`, it applies those learned relationships to new data. These three operations form the backbone of machine learning with scikit-learn.

**After fitting**, the model stores its learned parameters in special attributes. The `.coef_` attribute contains the coefficients (slopes) for each feature, showing how much each factor influences your predictions. The `.intercept_` attribute stores the baseline value when all features are zero. Together, these parameters define the linear equation the model discovered in your data.
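
A prediction from a fitted linear model is nothing more than the intercept plus each feature value multiplied by its coefficient. The sketch below spells this out with hand-picked numbers; the `intercept` and `coef` values are hypothetical stand-ins for what `.intercept_` and `.coef_` would hold after fitting, not results from the bike-sharing data.

```python
import numpy as np

# Hypothetical learned parameters, invented for illustration
feature_names = ["temp", "humidity", "windspeed", "workingday"]
intercept = 150.0
coef = np.array([8.0, -2.5, 0.5, 4.0])

# One scenario: 25°C, 50% humidity, wind speed 10, working day
scenario = np.array([25.0, 50.0, 10.0, 1.0])

# Each feature's contribution is its coefficient times its value
contributions = coef * scenario
prediction = intercept + contributions.sum()

print(f"Intercept: {intercept:+.1f}")
for name, value in zip(feature_names, contributions):
    print(f"  {name:11s} contributes {value:+.1f}")
print(f"Prediction: {prediction:.1f} rides per hour")
```

Breaking a prediction into per-feature contributions like this is often the clearest way to show stakeholders why the model produced a particular number.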

Let's explore the LinearRegression class step by step:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

print("=== BASIC LINEAR REGRESSION USAGE ===\n")

# Step 1: Prepare data
print("--- Step 1: Data Preparation ---")
feature_columns = ['temp', 'humidity', 'windspeed', 'workingday']
X = df[feature_columns]  # Features: 2D DataFrame (rows = observations, columns = features)
y = df['count']          # Target is a 1D Series

print(f"Dataset size: {len(df):,} hourly observations")
print(f"Feature matrix: {X.shape[0]} rows × {X.shape[1]} features")
print(f"Features: {list(X.columns)}")
print(f"Target: count (hourly bike rentals)\n")

# Step 2: Create and train the model
print("--- Step 2: Model Creation and Training ---")
model = LinearRegression()  # Create model with default parameters
print(f"Model created: {model}")

model.fit(X, y)  # Train the model on ALL available data
print("Model trained on complete dataset!")
print(f"Training completed for {X.shape[0]:,} observations\n")

# Step 3: Inspect learned parameters
print("--- Step 3: Inspect Learned Parameters ---")
print(f"Intercept (β₀): {model.intercept_:.2f} rides")
print("\nCoefficients (slopes):")
for feature, coef in zip(feature_columns, model.coef_):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {feature:12s}: {coef:+8.2f} → demand {direction} by {abs(coef):.2f} rides per unit")

print("\n--- Model Equation ---")
equation = f"Predicted Demand = {model.intercept_:.1f}"
for feature, coef in zip(feature_columns, model.coef_):
    sign = "+" if coef >= 0 else "-"
    equation += f" {sign} {abs(coef):.1f}×{feature}"
print(equation + "\n")

# Step 4: Make predictions
print("--- Step 4: Making Predictions ---")
predictions = model.predict(X)  # Predict for all observations
print(f"Generated {len(predictions):,} predictions")
print(f"First 5 predictions: {predictions[:5].round(1)}")
print(f"First 5 actual values: {y.iloc[:5].values}")

# Compare a few predictions
print("\n--- Prediction Examples ---")
for i in range(3):
    actual = y.iloc[i]
    predicted = predictions[i]
    error = actual - predicted
    print(f"Observation {i+1}: Actual = {actual:.0f}, Predicted = {predicted:.0f}, Error = {error:+.0f} rides")

# Step 5: Predict new scenarios
print("\n--- Step 5: New Scenario Prediction ---")
new_scenario = pd.DataFrame({
    'temp': [25],         # 25°C temperature
    'humidity': [50],     # 50% humidity
    'windspeed': [10],    # 10 km/h windspeed
    'workingday': [1]     # Working day
})
new_prediction = model.predict(new_scenario)[0]
print(f"Scenario: Warm working day (25°C, 50% humidity, 10 km/h wind)")
print(f"Predicted hourly demand: {new_prediction:.0f} bikes")

print("\n--- Key Takeaways ---")
print("✓ LinearRegression() creates a model object ready to learn")
print("✓ .fit(X, y) trains the model by finding optimal coefficients")
print("✓ .predict(X) applies learned relationships to make predictions")
print("✓ .coef_ and .intercept_ reveal what the model learned")
print("✓ The same simple pattern works for any number of features")

**What this demonstrates:**
- **Three-step workflow**: create model → fit to data → make predictions - this pattern applies to all scikit-learn algorithms
- The model trained on 10,886 complete observations, learning how temperature, humidity, windspeed, and workingday relate to bike demand
- **Coefficient interpretation**: temp coefficient of +7.86 means each 1°C increase adds ~8 bikes per hour; humidity coefficient of -2.25 means each 1% humidity increase reduces demand by ~2 bikes
- The intercept of 6.14 represents baseline demand when all features are at zero (though this specific value isn't meaningful since we can't have 0% humidity in reality)
- **Predictions are deterministic**: given the same input features, the model always produces the same prediction using the linear equation
- We can generate predictions for individual scenarios (like our 25°C working day example) or for thousands of observations at once
- The model handles multiple features automatically - we don't need to change our code whether we have 1 feature or 100 features
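- **Understanding coefficients enables business insights**: we can tell Capital City Bikes that warmer temperatures raise demand while humidity works against rentals. Because each coefficient is expressed per unit of its own feature (°C, % humidity, km/h), raw magnitudes alone do not rank features by importance; comparing their strength requires putting the features on a common scale first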
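
Raw coefficients are measured per unit of each feature, so their sizes are not directly comparable across features with different scales. The sketch below illustrates one common way to compare them, standardizing features with `StandardScaler` before fitting. It uses synthetic data with hypothetical names (`small_scale`, `large_scale`), not the bike-sharing dataset, and is meant as an illustration rather than the lecture's own method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption for illustration)
rng = np.random.default_rng(1)
n = 500
small_scale = rng.normal(0, 1, n)      # typically varies by about 1 unit
large_scale = rng.normal(0, 100, n)    # typically varies by about 100 units
X_raw = np.column_stack([small_scale, large_scale])
y_syn = 5 * small_scale + 0.5 * large_scale + rng.normal(0, 1, n)

# Coefficients per raw unit of each feature
raw_coefs = LinearRegression().fit(X_raw, y_syn).coef_

# Coefficients per standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_coefs = LinearRegression().fit(X_scaled, y_syn).coef_

print("Raw coefficients:         ", raw_coefs.round(2))
print("Standardized coefficients:", scaled_coefs.round(2))
```

In this synthetic case the first feature has the larger raw coefficient, yet the second feature moves the target far more across its typical range, which only the standardized coefficients reveal.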

### 3.3. Why We Need Separate Test Data

Now that we can train models and make predictions, a critical question arises: how do we know if our model is any good? Looking at our previous example, the model made predictions for all 10,886 observations - but those are the same observations it used for training. Can we trust those predictions to tell us how well the model will perform on tomorrow's data? Let's explore why this question matters and why machine learning requires a fundamentally different approach to evaluation.

**The fundamental challenge** in machine learning is distinguishing between models that have learned genuine patterns versus models that have simply memorized the training data. Imagine studying for an exam using practice problems. If the actual exam contains those exact same problems, a perfect score tells you nothing about whether you truly understand the concepts or just memorized the answers. The same principle applies to predictive models.

When we train a model on data and then evaluate it on that same data, we're essentially **giving it the exam questions during the study session**. The model can achieve high accuracy simply by memorizing the training examples, without learning the underlying patterns that would help it predict new situations. This memorization problem, called **overfitting**, produces models that look great on paper but fail in real-world deployment.

**Consider the Capital City Bikes scenario**: If our model memorizes that "on January 5th at 3pm it rained and 12 bikes were rented," that's useless for predicting February 10th's demand. What we need is for the model to learn that "rainy days typically reduce demand by 30%" - a general pattern that applies to any rainy day, not memorized facts about specific past days.

The solution is deceptively simple but absolutely crucial: **we must evaluate model performance on data the model has never seen**. By setting aside some observations during training and using them only for evaluation, we can test whether the model learned real patterns or just memorized training examples. This approach simulates the real-world scenario where the model must predict tomorrow's demand using only yesterday's data.

**How evaluation creates trust**: When a model predicts well on held-out data it's never seen, we gain confidence that it learned genuine relationships. When training performance is much better than held-out performance, we know the model memorized rather than learned. This distinction is the difference between a model that's ready for business deployment and one that will fail in production.

Let's see this problem in action and understand why separate test data is essential:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

print("=== THE TRAINING DATA EVALUATION PROBLEM ===\n")

# Prepare features and target
X = df[['temp', 'humidity', 'windspeed', 'workingday']]
y = df['count']

# Train a model on ALL data
print("--- Training model on complete dataset ---")
model = LinearRegression()
model.fit(X, y)
print(f"Model trained on {len(X):,} observations\n")

# Evaluate on the SAME data used for training
print("--- Evaluating on training data ---")
training_predictions = model.predict(X)
training_r2 = r2_score(y, training_predictions)
training_rmse = np.sqrt(mean_squared_error(y, training_predictions))

print(f"R² score: {training_r2:.4f} ({training_r2*100:.2f}% variance explained)")
print(f"RMSE: {training_rmse:.2f} rides")
print("\n❓ Question: Does this R² of {:.2f}% tell us how well the model will predict tomorrow's demand?".format(training_r2*100))
print("❓ Question: Can we trust this model for Capital City Bikes' operations?")

print("\n--- The Problem: We Can't Answer These Questions! ---")
print("\nWhy training performance is unreliable:")
print("1. Model has 'seen' all these observations during training")
print("2. High accuracy might mean genuine learning OR just memorization")
print("3. No way to distinguish between the two scenarios")
print("4. Real deployment requires predicting NEW data not in training set")

print("\n--- What We Need Instead ---")
print("✓ Evaluate on observations the model has NEVER seen")
print("✓ Simulate real-world scenario: predict future from past")
print("✓ Compare training vs testing performance to detect memorization")
print("✓ Trust metrics that reflect true predictive capability")

print("\n--- Real-World Deployment Scenario ---")
print("Capital City Bikes wants to predict next Monday's demand.")
print("The model must predict using:")
print("  • Weather features for Monday (temp, humidity, windspeed, workingday)")
print("  • Patterns learned from historical data (NOT Monday's actual demand)")
print("\nThe model will NEVER have 'seen' next Monday during training!")
print("Training data evaluation can't tell us how accurate Monday's prediction will be.")
print("\nConclusion: We need SEPARATE TEST DATA that simulates this future prediction scenario.")

**What this demonstrates:**
- **Training data evaluation is circular logic**: we're testing the model on the same data it learned from
- An R² of 34% on training data might indicate good learning, or it might hide a model that would perform far worse on new data
- **The memorization risk**: without separate test data, we can't distinguish between models that learned patterns versus models that memorized examples
- Real-world deployment always involves predicting data the model hasn't seen - our evaluation must simulate this scenario
- **The trust problem**: Capital City Bikes needs confidence that the model will actually work when deployed, not just reassurance based on training data performance

**The concept of overfitting** becomes clear through this lens. An overfitted model achieves high training performance by memorizing specific examples rather than learning general patterns. It's like a student who memorizes answers without understanding concepts - perfect scores on practice problems, but failure on new exam questions. We can only detect overfitting by comparing training performance to performance on separate test data.
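
The gap between training and held-out performance can be seen in a small sketch. It assumes synthetic data with a simple linear signal plus noise, and it uses an unrestricted `DecisionTreeRegressor` as a stand-in for a model flexible enough to memorize its training examples. Neither the data nor the tree comes from the bike-sharing example; they are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): linear signal plus substantial noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(1000, 1))
y_noisy = 3 * x[:, 0] + rng.normal(0, 5, size=1000)

x_tr, x_te, y_tr, y_te = train_test_split(
    x, y_noisy, test_size=0.2, random_state=0
)

results = {}
for name, est in [("linear", LinearRegression()),
                  ("deep tree", DecisionTreeRegressor(random_state=0))]:
    est.fit(x_tr, y_tr)
    train_r2 = r2_score(y_tr, est.predict(x_tr))
    test_r2 = r2_score(y_te, est.predict(x_te))
    results[name] = (train_r2, test_r2)
    print(f"{name:10s} train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

The unrestricted tree reproduces its training targets almost perfectly but scores lower on the held-out rows, while the linear model scores about the same on both. Only the comparison against held-out data exposes the difference.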

**This is why every machine learning project requires setting aside test data** before training begins. These held-out observations serve as a proxy for real-world deployment, giving us an honest assessment of whether the model learned patterns that generalize to new situations. In the next subsection, we'll learn exactly how to create this test data and use it to build models worthy of Capital City Bikes' trust.

### 3.4. Train-Test Split for Model Evaluation

Now that we understand why separate test data is essential, let's learn how to create it and use it properly. We'll implement the train-test split strategy, build our evaluation workflow, and learn to interpret the comparison between training and testing performance. This subsection establishes the foundational evaluation pattern you'll use throughout your machine learning career.

**The train-test split strategy** divides our dataset into two distinct subsets before any training occurs. The **training set** (typically 70-80% of data) is used to fit the model - this is the data the model learns from. The **testing set** (typically 20-30% of data) is held aside completely, never touching the training process, and used solely to evaluate the trained model's performance on unseen data.

The split must occur **before training begins** to ensure the test set truly simulates future data. If we peek at the test set during model development, adjust our approach based on test performance, or allow any information from the test set to influence training, we've violated the independence requirement and our evaluation becomes unreliable.

**Random splitting** is the standard approach for most machine learning problems. When observations are assigned to training and testing sets at random, and the dataset is reasonably large, both sets end up with similar distributions of all variables - seasons, weather conditions, days of the week, demand levels, etc. This similarity is crucial because we want both sets to represent the same underlying patterns, just split into "learning" and "testing" subsets.

**The evaluation workflow** follows a strict sequence: split data → train on training set → predict on both sets → compare metrics. By generating predictions for both training and testing data, we can compare their performance metrics directly. When both show similar performance, the model learned generalizable patterns. When training performance significantly exceeds testing performance, the model overfitted.

Let's implement train-test split and proper model evaluation:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

print("=== TRAIN-TEST SPLIT FOR MODEL EVALUATION ===\n")

# Step 1: Prepare features and target
print("--- Step 1: Data Preparation ---")
feature_columns = ['temp', 'humidity', 'windspeed', 'workingday']
X = df[feature_columns]
y = df['count']
print(f"Total observations: {len(X):,}")
print(f"Features: {feature_columns}\n")

# Step 2: Create train-test split
print("--- Step 2: Train-Test Split ---")
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% for testing, 80% for training
    random_state=42     # For reproducibility
)

print(f"Training set: {len(X_train):,} observations ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing set:  {len(X_test):,} observations ({len(X_test)/len(X)*100:.1f}%)")
print(f"\n✓ Test set is held aside - model will NEVER see it during training\n")

# Step 3: Train model on training data ONLY
print("--- Step 3: Train Model on Training Data ---")
model = LinearRegression()
model.fit(X_train, y_train)
print("Model trained successfully!")
print(f"Model learned from {len(X_train):,} training observations\n")

# Step 4: Generate predictions for BOTH sets
print("--- Step 4: Generate Predictions ---")
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
print(f"Training predictions: {len(train_predictions):,} values")
print(f"Testing predictions:  {len(test_predictions):,} values\n")

# Step 5: Evaluate performance on BOTH sets
print("--- Step 5: Compare Training vs Testing Performance ---")

# Calculate metrics for training set
train_r2 = r2_score(y_train, train_predictions)
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))

# Calculate metrics for testing set
test_r2 = r2_score(y_test, test_predictions)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))

print("Training Performance (data model learned from):")
print(f"  R²:   {train_r2:.4f} ({train_r2*100:.1f}% variance explained)")
print(f"  RMSE: {train_rmse:.2f} rides per hour")

print("\nTesting Performance (unseen data):")
print(f"  R²:   {test_r2:.4f} ({test_r2*100:.1f}% variance explained)")
print(f"  RMSE: {test_rmse:.2f} rides per hour")

# Detect overfitting by comparing performances
performance_gap = train_r2 - test_r2
print(f"\nPerformance Gap: {performance_gap:.4f}")
if performance_gap < 0.05:
    print("✓ Minimal gap - model learned generalizable patterns")
elif performance_gap < 0.15:
    print("⚠ Moderate gap - some overfitting present")
else:
    print("✗ Large gap - significant overfitting detected")

# Step 6: Visualize training vs testing performance
print("\n--- Step 6: Visual Comparison ---")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training set predictions
axes[0].scatter(y_train, train_predictions, alpha=0.4, s=15, color='blue')
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()],
             'r--', linewidth=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Demand', fontsize=12)
axes[0].set_ylabel('Predicted Demand', fontsize=12)
axes[0].set_title(f'Training Set (R² = {train_r2:.3f})', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Testing set predictions
axes[1].scatter(y_test, test_predictions, alpha=0.4, s=15, color='orange')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
             'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Demand', fontsize=12)
axes[1].set_ylabel('Predicted Demand', fontsize=12)
axes[1].set_title(f'Testing Set (R² = {test_r2:.3f})', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== KEY INSIGHTS FOR CAPITAL CITY BIKES ===")
print(f"✓ Test R² of {test_r2:.1%} represents realistic prediction capability")
print(f"✓ Test RMSE of {test_rmse:.0f} rides = typical prediction error")
print(f"✓ Train/test R² gap of {performance_gap:.3f}: a small gap suggests the estimate should hold on new data")
print(f"✓ Random splitting ensured representative train and test distributions")

**What this demonstrates:**
- **scikit-learn's `train_test_split()` function** handles the random splitting automatically, ensuring we don't accidentally introduce bias
- We set `test_size=0.2` for an 80/20 split, giving us 8,708 training observations and 2,178 testing observations
- The `random_state=42` parameter ensures reproducibility - running this code again produces the same split
- **Critical workflow**: split FIRST, then train ONLY on training set, then predict on BOTH sets to compare
- Training R² of 34.8% vs testing R² of 33.5% shows minimal overfitting - the gap of 1.3 percentage points indicates the model learned real patterns
- Both scatter plots show similar patterns around the perfect prediction line, visually confirming consistent performance
- **RMSE interpretation**: test RMSE of 142 rides means Capital City Bikes should expect typical errors of ±142 bikes per hour
- The testing performance (33.5% R²) is what matters for deployment - this represents realistic prediction capability on genuinely new data
- **Random splitting ensures fairness**: both sets contain a mix of seasons, weather conditions, and demand levels, so neither is artificially easier
- If train R² were 90% but test R² were 35%, this would signal severe overfitting requiring model changes
- This train-test evaluation pattern is universal across all machine learning - from simple linear regression to complex deep learning

**Understanding the metrics in business context**: Capital City Bikes can now confidently report that the model explains about one-third of demand variation and typically errs by ±142 bikes per hour. This level of performance enables useful operational planning while acknowledging prediction uncertainty. The similar train/test performance confirms these estimates reflect real-world capability, not training data memorization.
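
To show what these metrics measure, the sketch below computes MAE, RMSE, and R² by hand on four made-up hourly values and checks them against scikit-learn's metric functions. The numbers are hypothetical, not results from the bike-sharing model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted hourly rentals
actual = np.array([100, 150, 200, 250])
predicted = np.array([110, 140, 230, 250])
errors = actual - predicted

# MAE: average size of the error, in rides
mae = np.mean(np.abs(errors))
# RMSE: square root of the average squared error; penalizes large misses more
rmse = np.sqrt(np.mean(errors ** 2))
# R²: 1 minus (unexplained variation / total variation around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAE  = {mae:.2f} rides")
print(f"RMSE = {rmse:.2f} rides")
print(f"R²   = {r2:.3f}")
print("Matches scikit-learn:",
      np.isclose(mae, mean_absolute_error(actual, predicted)),
      np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted))),
      np.isclose(r2, r2_score(actual, predicted)))
```

RMSE is never smaller than MAE, and the difference grows when a few errors are much larger than the rest, as with the single 30-ride miss here.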

### 3.5. Time Series Considerations

The train-test split we just learned works beautifully for most machine learning problems, but bike-sharing demand presents a special challenge: **the data has a time dimension**. Each observation is tied to a specific date and hour, and demand patterns evolve over time due to seasonality, trends, and temporal dependencies. Can we still use random splitting, or does time order matter? Let's explore why time series data requires a specialized splitting approach and when to apply it.

**The temporal dependency problem** arises when observations are not independent but instead connected through time. In bike-sharing data, today's demand influences tomorrow's - a holiday weekend affects patterns for days afterward, seasonal trends build gradually, and weather patterns persist across hours. Random splitting ignores these temporal connections, potentially creating an unrealistic evaluation scenario.

**Data leakage through random splitting** occurs when we randomly mix past and future observations. Imagine our training set contains data from December while our test set includes November observations. We're now training on the "future" (December) to predict the "past" (November) - an impossible scenario in real deployment where we can only use historical data to predict upcoming demand.

Consider this concrete example: Random splitting might put December 25th (Christmas, extremely low demand) in the training set and December 24th (Christmas Eve, also unusual) in the test set. The model learns December 25th's patterns and uses them to predict December 24th - but in real deployment, we'll never have tomorrow's data to predict today. This temporal leakage inflates test performance, making the model appear better than it actually is.

**When time order matters**: Not all datasets with dates require chronological splitting. If observations are truly independent despite having timestamps - like medical diagnoses from different patients recorded on different dates - random splitting remains appropriate. The key question is: "Would future information help predict past events?" For bike-sharing, the answer is yes, making chronological splitting essential.

**Chronological train-test split** preserves temporal order by using the earliest portion of data for training and the most recent portion for testing. This mirrors real-world deployment: we train on all available history, then predict tomorrow, next week, or next month. The earliest 70-80% becomes training data, and the final 20-30% becomes test data, maintaining the timeline's integrity.

Let's implement chronological splitting and compare it to random splitting:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
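df['datetime'] = pd.to_datetime(df['datetime'])
# Position-based chronological splitting assumes rows are in time order
df = df.sort_values('datetime').reset_index(drop=True)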

print("=== CHRONOLOGICAL VS RANDOM SPLITTING ===\n")

# Prepare features and target
X = df[['temp', 'humidity', 'windspeed', 'workingday']]
y = df['count']

print(f"Dataset time range: {df['datetime'].min()} to {df['datetime'].max()}")
print(f"Total observations: {len(df):,} hours\n")

# Method 1: Random split (inappropriate for time series)
print("--- Method 1: Random Split (Standard Approach) ---")
X_train_random, X_test_random, y_train_random, y_test_random = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model_random = LinearRegression()
model_random.fit(X_train_random, y_train_random)

test_pred_random = model_random.predict(X_test_random)
test_r2_random = r2_score(y_test_random, test_pred_random)
test_rmse_random = np.sqrt(mean_squared_error(y_test_random, test_pred_random))

print(f"Test R²: {test_r2_random:.4f}")
print(f"Test RMSE: {test_rmse_random:.2f} rides")
print("⚠ Problem: Training set may contain 'future' data relative to test set")
print("⚠ Problem: Temporal patterns can leak from future to past")
print("⚠ Result: Performance estimate may be optimistically biased\n")

# Method 2: Chronological split (appropriate for time series)
print("--- Method 2: Chronological Split (Time Series Approach) ---")
split_index = int(len(df) * 0.8)  # First 80% for training

X_train_chrono = X.iloc[:split_index]
X_test_chrono = X.iloc[split_index:]
y_train_chrono = y.iloc[:split_index]
y_test_chrono = y.iloc[split_index:]

# Get corresponding dates for clarity
train_end_date = df.iloc[split_index-1]['datetime']
test_start_date = df.iloc[split_index]['datetime']

print(f"Training period: {df['datetime'].min()} to {train_end_date}")
print(f"Testing period: {test_start_date} to {df['datetime'].max()}")

model_chrono = LinearRegression()
model_chrono.fit(X_train_chrono, y_train_chrono)

test_pred_chrono = model_chrono.predict(X_test_chrono)
test_r2_chrono = r2_score(y_test_chrono, test_pred_chrono)
test_rmse_chrono = np.sqrt(mean_squared_error(y_test_chrono, test_pred_chrono))

print(f"\nTest R²: {test_r2_chrono:.4f}")
print(f"Test RMSE: {test_rmse_chrono:.2f} rides")
print("✓ Benefit: Simulates real deployment (predict future from past)")
print("✓ Benefit: No temporal data leakage")
print("✓ Result: Honest performance estimate for time-based prediction\n")

# Compare the two approaches
print("--- Comparison: Random vs Chronological ---")
print(f"Random split test R²: {test_r2_random:.4f}")
print(f"Chronological split test R²: {test_r2_chrono:.4f}")
print(f"Difference: {abs(test_r2_random - test_r2_chrono):.4f}")

if test_r2_random > test_r2_chrono + 0.02:
    print("\n⚠ Random splitting overestimated performance due to temporal leakage")
elif test_r2_chrono > test_r2_random + 0.02:
    print("\n✓ Test periods have different characteristics - chronological reveals this")
else:
    print("\n✓ Results similar - but chronological is still correct for deployment")

print("\n--- Capital City Bikes Deployment Scenario ---")
print("Operational reality: Train on Jan-Nov data, predict December demand")
print("Chronological split simulates this: Train on first 80% of timeline")
print("Random split violates this: Can train on December to predict November")
print("\n✓ Conclusion: Use chronological splitting for bike-sharing demand forecasting")

**What this demonstrates:**
- **Bike-sharing is a time series**: observations are connected temporally through seasons, trends, and weekly patterns
- Random splitting can create impossible scenarios where we train on "future" to predict "past"
- **Chronological splitting preserves temporal order**: earliest 80% trains, most recent 20% tests
- The date boundaries matter - the printed training and testing periods show exactly where the model stops learning and where it must start predicting
- **Performance might differ** between methods due to seasonal effects (if December is unusual, chronological split reveals this)
- Random split may show artificially good performance by allowing temporal information leakage
- **Chronological approach matches deployment**: in production, Capital City Bikes will use all past data to predict future demand, never the reverse
- This principle applies to all time series: stock prices, weather forecasting, sales prediction - whenever order matters, preserve it in your split
- For Capital City Bikes operational planning, chronological split provides the honest performance estimate needed for confident deployment

**When to use each approach**: Use random splitting for data where observations are independent (medical diagnoses, image classification, customer surveys). Use chronological splitting when observations have temporal dependencies and deployment involves predicting future from past (demand forecasting, anomaly detection, predictive maintenance). The key test: "Would having today's data help predict tomorrow?" If yes, use chronological splitting.
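
As an alternative to slicing with `.iloc`, a chronological split can also be sketched with `train_test_split(..., shuffle=False)`, which keeps the first rows for training and the last rows for testing. The frame below is hypothetical, and the sketch assumes the data must be sorted by time first because the split works purely on row position:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical hourly frame, deliberately shuffled to mimic unsorted input
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(100), unit="h")
demo = pd.DataFrame({
    "datetime": hours,
    "temp": np.linspace(5, 25, 100),
    "count": np.arange(100),
}).sample(frac=1, random_state=0)

# Sort by time first: shuffle=False splits by position, not by date
demo = demo.sort_values("datetime").reset_index(drop=True)

train_df, test_df = train_test_split(demo, test_size=0.2, shuffle=False)

print(f"Train: {train_df['datetime'].min()} to {train_df['datetime'].max()}")
print(f"Test:  {test_df['datetime'].min()} to {test_df['datetime'].max()}")
```

Checking that the last training timestamp precedes the first test timestamp is a cheap safeguard worth keeping in any time-based evaluation.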

### 3.6. Cross-Validation for Robust Evaluation

We've learned to evaluate models using a single train-test split, which provides one performance estimate. But here's a critical question: what if our particular test set happens to be unusually easy or difficult to predict? What if it contains only sunny weekdays, or happens to include an atypical holiday period? A single split gives us one number, but how confident should we be that this number represents true model performance? Let's explore cross-validation, the technique that provides robust, reliable performance estimates through systematic multiple evaluations.

**The single split limitation** arises from randomness in how we divide our data. Imagine Capital City Bikes gets an R² of 35% on one train-test split. Is this the model's true capability, or did the test set happen to be easy? If we created a different random split, would we get 30%? 40%? We can't know from a single evaluation. This uncertainty makes it difficult to compare models, tune parameters, or confidently report performance to stakeholders.

**Cross-validation solves this problem** by creating multiple train-test splits systematically and averaging their performance metrics. Instead of one evaluation giving one number, we get multiple evaluations giving multiple numbers - and their average provides a much more reliable estimate of model performance. The standard deviation across evaluations reveals performance consistency: low std means stable predictions regardless of data split, high std suggests the model is sensitive to training data composition.

**Standard K-Fold Cross-Validation** divides data into K equal parts (typically K=5 or K=10). In each of K iterations, one part serves as the test set while the remaining K-1 parts form the training set. Every observation gets used for testing exactly once and for training K-1 times. This systematic rotation ensures no data is wasted and every part of the dataset contributes to both training and evaluation.

Here's how 5-fold cross-validation works: divide 10,886 observations into 5 parts of ~2,177 observations each. Fold 1 uses parts 2-5 for training and part 1 for testing. Fold 2 uses parts 1, 3-5 for training and part 2 for testing. This continues until all 5 parts have served as test set once. The result: 5 different R² scores from 5 different train-test configurations, giving us a robust performance estimate.
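
The rotation can be checked on a tiny example. This sketch uses ten hypothetical observations and scikit-learn's `KFold` to count how often each observation lands in a test fold and in a training fold:

```python
import numpy as np
from sklearn.model_selection import KFold

n_obs = 10  # hypothetical tiny dataset
test_counts = np.zeros(n_obs, dtype=int)
train_counts = np.zeros(n_obs, dtype=int)

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(np.arange(n_obs)), 1):
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1
    print(f"Fold {fold}: test = {test_idx.tolist()}, train = {train_idx.tolist()}")

print("Times each observation was tested: ", test_counts.tolist())
print("Times each observation was trained:", train_counts.tolist())
```

With K=5, every observation is tested once and used for training four times, matching the K-1 rule described above.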

**Why K-Fold fails for time series data** relates directly to the temporal dependency problem we discussed in subsection 3.5. Standard k-fold ignores time order: whichever part serves as the test fold (a contiguous block, or a random subset when shuffling is enabled), the training folds usually include observations that come later in time. A single fold can train on December observations and test on November ones. We're back to the same violation: training on "future" to predict "past." The temporal order gets scrambled within each fold, creating data leakage that can produce overly optimistic performance estimates.

**TimeSeriesSplit provides the time series solution** through forward-chaining evaluation. Instead of random folds, TimeSeriesSplit creates chronological folds where each fold's training set includes all previous data and its test set contains the next time period. This respects temporal order in every fold, always predicting future from past, just like real deployment.

**How TimeSeriesSplit works**: With `n_splits=5`, scikit-learn cuts the timeline into six roughly equal chronological blocks. Fold 1 trains on the first block (about 17% of the data) and tests on the second. Fold 2 trains on the first two blocks (about 33%) and tests on the third. Fold 3 trains on the first three blocks (50%) and tests on the fourth, and so on until Fold 5 trains on the first five blocks and tests on the last. Each fold simulates a realistic deployment scenario where we use all available history to predict the next time period. The training set grows with each fold (expanding window approach), while the test set advances chronologically through time.
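
The expanding-window pattern is easiest to see on a tiny timeline. This sketch assumes 12 hypothetical, time-ordered observations and prints the indices that scikit-learn's `TimeSeriesSplit` assigns to each fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

timeline = np.arange(12)  # hypothetical observations, already in time order
tscv_demo = TimeSeriesSplit(n_splits=5)

train_sizes = []
always_past_to_future = True
for fold, (train_idx, test_idx) in enumerate(tscv_demo.split(timeline), 1):
    train_sizes.append(len(train_idx))
    # Every training index must come before every test index
    always_past_to_future = always_past_to_future and train_idx.max() < test_idx.min()
    print(f"Fold {fold}: train = {train_idx.tolist()}, test = {test_idx.tolist()}")

print("Training set sizes:", train_sizes)
print("Training always precedes testing:", always_past_to_future)
```

The training window grows by one test-sized block per fold, and no fold ever trains on observations that come after its test period.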

Let's implement both approaches and understand when to use each:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, TimeSeriesSplit, train_test_split
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
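df['datetime'] = pd.to_datetime(df['datetime'])
# TimeSeriesSplit works on row position, so rows must be in time order
df = df.sort_values('datetime').reset_index(drop=True)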

print("=== CROSS-VALIDATION FOR ROBUST EVALUATION ===\n")

# Prepare features and target
X = df[['temp', 'humidity', 'windspeed', 'workingday']]
y = df['count']

print(f"Dataset: {len(df):,} hourly observations")
print(f"Time range: {df['datetime'].min()} to {df['datetime'].max()}\n")

# BASELINE: Single Train-Test Split
print("--- BASELINE: Single Train-Test Split ---")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
single_split_r2 = model.score(X_test, y_test)

print(f"Single split R²: {single_split_r2:.4f}")
print("Question: Is this representative of true model performance?")
print("Question: Would a different random split give similar results?")
print("Answer: We need multiple evaluations to know!\n")

# METHOD 1: Standard K-Fold Cross-Validation (WRONG for time series)
print("--- METHOD 1: Standard K-Fold Cross-Validation ---")
print("⚠ Warning: This approach is INCORRECT for time series data!\n")

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_kfold = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring='r2')

print("K-Fold Results (5 folds):")
for i, score in enumerate(cv_scores_kfold, 1):
    print(f"  Fold {i}: R² = {score:.4f}")

kfold_mean = cv_scores_kfold.mean()
kfold_std = cv_scores_kfold.std()

print(f"\nMean R²: {kfold_mean:.4f} (±{kfold_std:.4f})")
print(f"\nProblems with K-Fold for time series:")
print("  ✗ Random folds violate temporal order")
print("  ✗ Training set may contain 'future' relative to test set")
print("  ✗ Temporal patterns leak from future to past")
print("  ✗ Performance estimate is optimistically biased")
print("  ✗ Does NOT simulate real deployment scenario\n")

# METHOD 2: TimeSeriesSplit (CORRECT for time series)
print("--- METHOD 2: TimeSeriesSplit Cross-Validation ---")
print("✓ This approach is CORRECT for time series data!\n")

tscv = TimeSeriesSplit(n_splits=5)
cv_scores_ts = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring='r2')

print("TimeSeriesSplit Results (5 folds):")
fold_details = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    train_dates = df.iloc[train_idx]['datetime']
    test_dates = df.iloc[test_idx]['datetime']
    print(f"  Fold {i}: R² = {cv_scores_ts[i-1]:.4f}")
    print(f"    Training: {train_dates.min()} to {train_dates.max()} ({len(train_idx):,} obs)")
    print(f"    Testing:  {test_dates.min()} to {test_dates.max()} ({len(test_idx):,} obs)")
    fold_details.append((train_dates.min(), train_dates.max(), test_dates.min(), test_dates.max()))

ts_mean = cv_scores_ts.mean()
ts_std = cv_scores_ts.std()

print(f"\nMean R²: {ts_mean:.4f} (±{ts_std:.4f})")
print(f"\nBenefits of TimeSeriesSplit:")
print("  ✓ Preserves chronological order in every fold")
print("  ✓ Always predicts future from past (realistic)")
print("  ✓ Training set grows with each fold (expanding window)")
print("  ✓ No temporal data leakage")
print("  ✓ Simulates actual deployment scenario")

# COMPARISON
print("\n--- COMPARISON: Single Split vs Cross-Validation ---")
print(f"Single split R²:        {single_split_r2:.4f}")
print(f"K-Fold CV mean R²:      {kfold_mean:.4f} (±{kfold_std:.4f})")
print(f"TimeSeriesSplit mean R²: {ts_mean:.4f} (±{ts_std:.4f})")

difference = abs(single_split_r2 - ts_mean)
print(f"\nDifference (single vs TS-CV): {difference:.4f}")

if difference > 0.02:
    print("⚠ Significant difference - single split may not be representative")
else:
    print("✓ Results similar, but CV provides confidence through multiple evaluations")

print(f"\nStandard deviation interpretation:")
print(f"  K-Fold std:      {kfold_std:.4f}")
print(f"  TimeSeriesSplit std: {ts_std:.4f}")
if ts_std < 0.05:
    print("  ✓ Low std = Stable, consistent performance across time periods")
else:
    print("  ⚠ Higher std = Performance varies across time periods")

# Visualize fold structure
print("\n--- Visualizing TimeSeriesSplit Structure ---")
fig, ax = plt.subplots(figsize=(14, 6))

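import matplotlib.dates as mdates  # converts datetimes to Matplotlib's date numbers

for i, (train_dates_min, train_dates_max, test_dates_min, test_dates_max) in enumerate(fold_details, 1):
    # Training period (bar width is in days, matching Matplotlib's date units)
    ax.barh(i, (train_dates_max - train_dates_min).days, left=mdates.date2num(train_dates_min.to_pydatetime()),
            height=0.8, color='skyblue', label='Training' if i == 1 else '')
    # Testing period
    ax.barh(i, (test_dates_max - test_dates_min).days, left=mdates.date2num(test_dates_min.to_pydatetime()),
            height=0.8, color='coral', label='Testing' if i == 1 else '')

ax.xaxis_date()  # show calendar dates on the x-axis instead of raw day numbers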

ax.set_yticks(range(1, 6))
ax.set_yticklabels([f'Fold {i}' for i in range(1, 6)])
ax.set_xlabel('Timeline', fontsize=12)
ax.set_title('TimeSeriesSplit: Expanding Window Approach', fontsize=14, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== KEY INSIGHTS FOR CAPITAL CITY BIKES ===")
print(f"✓ Cross-validation provides {len(cv_scores_ts)} performance estimates instead of 1")
print(f"✓ Mean R² of {ts_mean:.4f} is more reliable than single split")
print(f"✓ Standard deviation of {ts_std:.4f} shows performance consistency")
print(f"✓ TimeSeriesSplit respects temporal order (essential for forecasting)")
print(f"✓ Each fold tests realistic deployment: predict future from past")
print(f"\nRecommendation: Report CV mean ± std as model performance estimate")
print(f"Capital City Bikes can confidently report: R² = {ts_mean:.2f} ± {ts_std:.2f}")

**What this demonstrates:**
- **Single train-test split gives one number** - useful but limited insight into model reliability
- Standard k-fold produces 5 different R² scores through systematic rotation of test sets
- **Mean across folds** (e.g., 0.3450) provides robust performance estimate less sensitive to split luck
- Standard deviation (e.g., ±0.0120) reveals consistency - low std means stable predictions
- **K-fold violates temporal order** for time series, potentially inflating performance estimates
- TimeSeriesSplit creates 5 chronological folds, each respecting time order
- **Expanding window approach**: Fold 1 trains on about 17% of the data and tests on the next 17%; Fold 5 trains on about 83% and tests on the final 17%
- Each TimeSeriesSplit fold simulates real deployment - using all history to predict next period
- **Performance comparison**: If single split differs significantly from CV mean, single split might be lucky/unlucky
- Low TimeSeriesSplit std (like 0.01-0.02) indicates model performs consistently across different time periods
- **Business value**: CV mean ± std gives Capital City Bikes honest, robust performance estimate for stakeholder reporting

**When to use cross-validation**: Use CV whenever you need robust performance estimates, especially for model comparison, hyperparameter tuning, or reporting to stakeholders. For quick iterative development, single splits suffice. For final evaluation and deployment decisions, the added confidence from cross-validation is worth the extra computation time.

**Choosing K (number of folds)**: Common choices are K=5 (faster, each model trains on 80% of the data) or K=10 (slower, each model trains on 90%, so the estimate sits closer to what a model fit on the full dataset would achieve). For time series, K depends on dataset size and desired test set size - ensure each fold's test set represents a meaningful time period (at least several weeks for bike-sharing demand).
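One way to control the length of each test period directly is the `test_size` argument of `TimeSeriesSplit`. The sketch below is illustrative only: it assumes a hypothetical 20 weeks of gap-free hourly rows (`X_hourly` is a placeholder array), whereas a real dataset with missing hours would need the fold dates checked explicitly.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

HOURS_PER_WEEK = 24 * 7
n_hours = HOURS_PER_WEEK * 20        # hypothetical: 20 weeks of hourly rows
X_hourly = np.zeros((n_hours, 1))    # placeholder features; only the length matters

# Fix each test fold at two weeks instead of letting n_splits decide its size
tscv_weeks = TimeSeriesSplit(n_splits=5, test_size=2 * HOURS_PER_WEEK)

split_weeks = []
for fold, (train_idx, test_idx) in enumerate(tscv_weeks.split(X_hourly), 1):
    train_weeks = len(train_idx) / HOURS_PER_WEEK
    test_weeks = len(test_idx) / HOURS_PER_WEEK
    split_weeks.append((train_weeks, test_weeks))
    print(f"Fold {fold}: train on {train_weeks:.0f} weeks, test on {test_weeks:.0f} weeks")
```

Under these assumptions the first fold trains on 10 weeks and each later fold adds two more, so every test score describes a full two-week forecast horizon.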

### 3.7. Performance Metrics Deep Dive

Now that we have robust evaluation through cross-validation, let's explore the performance metrics themselves in depth. We've already seen R² and RMSE in action, but what do these numbers really mean? How should we interpret them for business decisions? And what other metrics can help us understand model quality? This subsection provides the comprehensive metric knowledge you need to evaluate models professionally.

Performance evaluation quantifies **how well a linear regression model performs** on both training and test data. Appropriate evaluation metrics provide insights into model accuracy, reliability, and suitability for the intended application. The choice of evaluation metrics depends on the specific problem context and business requirements, with each metric revealing different aspects of model behavior.

**Root Mean Square Error (RMSE)** measures the average magnitude of prediction errors in the same units as the target variable. RMSE is calculated as the square root of the mean of squared differences between actual and predicted values. Lower RMSE values indicate better model performance, with perfect predictions yielding an RMSE of zero. RMSE penalizes large errors more heavily than small errors due to the squaring operation, making it particularly sensitive to outliers.
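Written as a formula, with $y_i$ for actual demand, $\hat{y}_i$ for predicted demand, and $n$ for the number of observations (notation assumed here for illustration):

$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

For three hypothetical hourly errors of 10, 10, and 40 rides, this gives $\sqrt{(10^2 + 10^2 + 40^2)/3} = \sqrt{600} \approx 24.5$ rides. The single 40-ride miss contributes 1600 of the 1800 total squared error, which is why RMSE reacts so strongly to outliers.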

**R-squared (coefficient of determination)** measures the proportion of variance in the target variable that is explained by the model. A value of 1 indicates perfect prediction, and 0 means the model does no better than always predicting the mean of the target. On training data, a linear regression with an intercept scores between 0 and 1, but on held-out data R-squared can be negative when the model performs worse than that mean baseline. R-squared provides an intuitive measure of model quality that is easy to communicate to stakeholders - "the model explains 35% of demand variation" is immediately understandable to non-technical audiences.
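The calculation behind R-squared can be sketched by hand and checked against scikit-learn's `r2_score`. The ride counts and the helper `r2_manual` below are hypothetical, chosen so the arithmetic is easy to follow. The last case shows that predictions worse than the mean baseline produce a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0])   # hypothetical hourly ride counts

def r2_manual(y, y_hat):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

good_pred = np.array([110.0, 140.0, 210.0, 240.0])   # close to the actual values
mean_pred = np.full_like(y_true, y_true.mean())      # always predict the mean
bad_pred = y_true[::-1].copy()                       # trend predicted backwards

r2_good = r2_manual(y_true, good_pred)
r2_mean = r2_manual(y_true, mean_pred)
r2_bad = r2_manual(y_true, bad_pred)

print(f"Good predictions:  R² = {r2_good:.3f}")   # 0.968
print(f"Mean baseline:     R² = {r2_mean:.3f}")   # 0.000
print(f"Backwards trend:   R² = {r2_bad:.3f}")    # -3.000
```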

**Mean Absolute Error (MAE)** measures the average absolute difference between actual and predicted values. MAE is less sensitive to outliers than RMSE and provides a more robust measure of typical prediction accuracy. Like RMSE, MAE is expressed in the same units as the target variable, making it easy to interpret in business contexts. When RMSE is significantly higher than MAE, this indicates the presence of large prediction errors that might require investigation.
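A small sketch shows how the two metrics diverge. The numbers are made up for illustration: both sets of predictions have the same MAE, but one concentrates its error in a single large miss. `summarize` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([120.0, 80.0, 200.0, 150.0, 100.0])   # hypothetical ride counts

# Same total absolute error (100 rides), distributed differently
pred_even = actual + np.array([20.0, -20.0, 20.0, -20.0, 20.0])
pred_spiky = actual + np.array([5.0, -5.0, 5.0, -5.0, 80.0])

def summarize(y, y_hat):
    """Return MAE, RMSE, and their ratio for one set of predictions."""
    mae = mean_absolute_error(y, y_hat)
    rmse = np.sqrt(mean_squared_error(y, y_hat))
    return mae, rmse, rmse / mae

mae_even, rmse_even, ratio_even = summarize(actual, pred_even)
mae_spiky, rmse_spiky, ratio_spiky = summarize(actual, pred_spiky)

print(f"Even errors:  MAE={mae_even:.1f}  RMSE={rmse_even:.1f}  ratio={ratio_even:.2f}")
print(f"Spiky errors: MAE={mae_spiky:.1f}  RMSE={rmse_spiky:.1f}  ratio={ratio_spiky:.2f}")
```

When every error has the same size, RMSE equals MAE. The single 80-ride miss leaves MAE at 20 but pushes RMSE to about 36, which is the kind of gap that signals large errors worth investigating.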

**Residual analysis** examines the patterns in prediction errors to identify potential model improvements or violations of linear regression assumptions. Residual plots help identify non-linear relationships, heteroscedasticity (non-constant variance), and other issues that may affect model performance. Ideally, residuals should be randomly scattered around zero with no obvious patterns.
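As a rough sketch of what a "pattern" in residuals looks like, the example below fits a straight line to synthetic data with a curved temperature effect. The data-generating formula is an assumption made purely for illustration and is not derived from the bike-sharing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: demand rises with temperature, then falls when it gets too hot
temp = rng.uniform(0, 35, size=500)
demand = 50 + 20 * temp - 0.5 * temp**2 + rng.normal(0, 10, size=500)

# Fit a straight line to a curved relationship
line = LinearRegression().fit(temp.reshape(-1, 1), demand)
residuals = demand - line.predict(temp.reshape(-1, 1))

# Average residual within the coolest, middle, and warmest third of observations
edges = np.quantile(temp, [1 / 3, 2 / 3])
band = np.digitize(temp, edges)   # 0 = cool, 1 = mild, 2 = hot
band_means = [residuals[band == k].mean() for k in range(3)]

print(f"Overall mean residual: {residuals.mean():.3f}")
for name, value in zip(["cool", "mild", "hot"], band_means):
    print(f"Mean residual ({name}): {value:+.1f}")
```

The overall mean residual is essentially zero, yet the line over-predicts at both temperature extremes and under-predicts in the middle. A near-zero mean alone does not rule out a systematic pattern, which is why the residual plot matters.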

Let's explore these metrics in depth with comprehensive evaluation:

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

print("=== COMPREHENSIVE PERFORMANCE EVALUATION ===\n")

# Prepare data and train model using chronological split for time series
X = df[['temp', 'humidity', 'windspeed', 'workingday', 'season']]
y = df['count']

# Split chronologically: first 80% for training, last 20% for testing
split_index = int(len(df) * 0.8)
X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

model = LinearRegression()
model.fit(X_train, y_train)

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# METRIC 1: Root Mean Squared Error (RMSE)
print("--- METRIC 1: Root Mean Squared Error (RMSE) ---")
train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
print(f"Training RMSE: {train_rmse:.2f} rides")
print(f"Testing RMSE:  {test_rmse:.2f} rides")
print(f"\nInterpretation: Root-mean-square error of ~{test_rmse:.0f} rides (large misses weigh more than small ones)")
print(f"Business Impact: Plan for ±{test_rmse:.0f} rides capacity buffer\n")

# METRIC 2: R-squared (R²)
print("--- METRIC 2: R-squared (Coefficient of Determination) ---")
train_r2 = r2_score(y_train, train_pred)
test_r2 = r2_score(y_test, test_pred)
print(f"Training R²: {train_r2:.4f} ({train_r2*100:.2f}% variance explained)")
print(f"Testing R²:  {test_r2:.4f} ({test_r2*100:.2f}% variance explained)")
print(f"\nInterpretation: Model explains {test_r2*100:.1f}% of demand variation")
unexplained = (1 - test_r2) * 100
print(f"Business Impact: {unexplained:.1f}% of variation driven by factors not in model\n")

# METRIC 3: Mean Absolute Error (MAE)
print("--- METRIC 3: Mean Absolute Error (MAE) ---")
train_mae = mean_absolute_error(y_train, train_pred)
test_mae = mean_absolute_error(y_test, test_pred)
print(f"Training MAE: {train_mae:.2f} rides")
print(f"Testing MAE:  {test_mae:.2f} rides")
print(f"\nInterpretation: Typical prediction error is {test_mae:.0f} rides (less sensitive to outliers than RMSE)")
print(f"Business Impact: Budget for ~{test_mae:.0f} rides average forecast error")

# METRIC 4: Comparing RMSE vs MAE
print("\n--- METRIC 4: RMSE vs MAE Comparison ---")
rmse_mae_ratio = test_rmse / test_mae
print(f"RMSE: {test_rmse:.2f} rides")
print(f"MAE:  {test_mae:.2f} rides")
print(f"RMSE/MAE ratio: {rmse_mae_ratio:.2f}")
if rmse_mae_ratio > 1.5:
    print("⚠ High ratio suggests presence of large outlier errors")
else:
    print("✓ Ratio indicates errors are fairly consistent (few large outliers)")
print()

# METRIC 5: Residual Analysis
print("--- METRIC 5: Residual Analysis ---")
test_residuals = y_test - test_pred
residual_mean = test_residuals.mean()
residual_std = test_residuals.std()
print(f"Mean residual: {residual_mean:.2f} (should be near 0)")
print(f"Std residual:  {residual_std:.2f}")

# Visualize residuals
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residual plot
axes[0].scatter(test_pred, test_residuals, alpha=0.3, s=10)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Demand', fontsize=11)
axes[0].set_ylabel('Residuals (Actual - Predicted)', fontsize=11)
axes[0].set_title('Residual Plot: Checking for Patterns', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Residual distribution
axes[1].hist(test_residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals (Actual - Predicted)', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].set_title('Residual Distribution: Checking Normality', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# PERFORMANCE SUMMARY FOR STAKEHOLDERS
print("\n=== PERFORMANCE SUMMARY FOR CAPITAL CITY BIKES ===")
print(f"Model Accuracy:     {test_r2*100:.1f}% of demand variation explained")
print(f"Typical Error:      ±{test_mae:.0f} rides per hour (MAE)")
print(f"Large-Error Gauge:  ±{test_rmse:.0f} rides per hour (RMSE, weights large misses more heavily)")
print(f"Model Stability:    Not assessed in this cell - see cross-validation results (subsection 3.6)")
print(f"Residual Behavior:  {'Acceptable' if abs(residual_mean) < 5 else 'Check for bias'}")
print(f"\n✓ Model meets minimum performance requirements for operational deployment")
print(f"✓ Recommended use: Hourly demand planning with ±{test_rmse:.0f} ride buffer")
print(f"⚠ Note: {unexplained:.0f}% variance unexplained - consider additional features for improvement")

**What this demonstrates:**
- **Multiple performance metrics provide complementary insights** - RMSE, R², MAE, and residual analysis each reveal different aspects of model quality
- **RMSE of 147 rides** means predictions typically deviate by ±147 rides from actual demand - operationally significant for capacity planning
- **R² of 34%** indicates moderate predictive power - the model captures major patterns but misses substantial variation (66% unexplained)
- **MAE of 113 rides** (lower than RMSE) indicates typical errors are moderate, but some large errors inflate RMSE
- **RMSE/MAE ratio of 1.30** suggests error distribution is fairly consistent without extreme outliers (the code uses 1.5 as a rule-of-thumb threshold; normally distributed errors give a ratio of about 1.25)
- **Residual analysis** checks assumptions: mean near zero (0.49) indicates no systematic bias, though some scatter suggests room for improvement
- The residual plot reveals some patterns (not completely random), suggesting potential non-linear relationships or missing features
- **Residual distribution** should be roughly normal and centered at zero - deviations indicate model assumptions may be violated
- **Business translation**: Model provides useful demand estimates for operational planning, but hourly forecasts need roughly ±147 ride capacity buffers
- The 66% unexplained variance points to **opportunities for model enhancement** through additional features or non-linear methods
- **For more robust performance estimates**, apply cross-validation (covered in subsection 3.6) to report each metric as a mean with a standard deviation across folds
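One possible way to combine the metrics above with the chronological folds from subsection 3.6 is scikit-learn's `cross_validate`, which scores several metrics in a single pass. The sketch below uses synthetic data as a stand-in (`X_syn`, `y_syn`, and the coefficients are assumptions for illustration), so the printed numbers say nothing about the bike-sharing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate

rng = np.random.default_rng(42)

# Synthetic stand-in for a time-ordered dataset: 600 rows, 3 features
n = 600
X_syn = rng.normal(size=(n, 3))
y_syn = 100 + X_syn @ np.array([30.0, -10.0, 5.0]) + rng.normal(0, 20, size=n)

# scikit-learn reports error metrics as negatives so that higher is always better
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}

cv_results = cross_validate(
    LinearRegression(), X_syn, y_syn,
    cv=TimeSeriesSplit(n_splits=5), scoring=scoring,
)

summary = {}
for name in scoring:
    scores = cv_results[f"test_{name}"]
    if name != "r2":
        scores = -scores   # flip the sign back to a positive error
    summary[name] = (scores.mean(), scores.std())
    print(f"{name.upper():>4}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reporting each metric as mean ± standard deviation across folds gives stakeholders a sense of how much the headline numbers move from one time period to the next.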

---

## Summary and Transition to Programming Implementation

You've mastered essential linear regression foundations: **mathematical principles, scikit-learn implementation, proper evaluation workflows, and performance metrics**. These skills transform transportation data into predictive models that generate reliable demand forecasts and actionable business insights.

Your ability to create train-test splits, apply time series considerations, use cross-validation for robust evaluation, and interpret performance metrics prepares you to build production-ready models that stakeholders can trust and understand for operational decision-making.

In the programming example, you'll implement these linear regression concepts through hands-on coding exercises, building complete prediction workflows that forecast bike-sharing demand and communicate model performance to Capital City Bikes stakeholders.