# Lecture 6: Statistical Analysis & Pattern Discovery - Unveiling Transportation Intelligence

## Learning Objectives

By the end of this lecture, you will be able to:
- Define descriptive statistics and explain their importance for understanding transportation data
- Calculate and interpret central tendency measures (mean, median, mode) using concrete bike-sharing examples
- Understand variability measures (range, standard deviation, variance) and their business applications
- Explain correlation concepts and distinguish between correlation and causation
- Identify temporal patterns in transportation demand data through statistical analysis

---

## 1. Your Statistical Analysis Journey: From Data to Business Intelligence

Three weeks into your consulting engagement, your bike-sharing startup client calls an urgent meeting. "We've seen your excellent data preparation work," the CEO begins, "but now we need answers to critical business questions. Our investors are asking: What drives our demand patterns? Can we predict capacity needs? How much should weather affect our operations?"

This is the moment every transportation consultant anticipates - when clean, prepared data transforms into strategic business intelligence that drives million-dollar decisions. You've successfully navigated the technical foundations of pandas mastery and data quality assurance. Now comes the exciting challenge that separates junior data analysts from professional consultants: statistical analysis that reveals the hidden patterns driving transportation behavior.

### 1.1. The Strategic Importance of Statistical Discovery

Statistical analysis is the essential bridge between your prepared data and the business insights your client needs. It transforms historical records into predictive intelligence that drives strategic decisions.

For urban mobility companies, understanding demand patterns guides station placement and capacity planning. Analyzing weather relationships informs operational strategies that maintain service quality. Pattern recognition reveals market opportunities and competitive advantages that support business growth.

In transportation consulting, statistics convert raw data—like bike counts or weather conditions—into meaningful business intelligence. While individual data points tell you *what* happened, statistical analysis explains *why* patterns occur and predicts *what* will happen next.

For instance, knowing that 847 rides occurred on a Tuesday morning is one observation. But discovering that Tuesdays average 1,247 rides with a standard deviation of 312 reveals that 847 is unusually low—an insight worth investigating. This ability to quantify, relate, and benchmark data defines the analytical value of professional consulting.
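The Tuesday comparison above can be expressed as a z-score, the number of standard deviations an observation sits from the mean. The sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
# Figures quoted in the text above; variable names are illustrative.
observed_rides = 847   # rides on the Tuesday in question
typical_mean = 1247    # average Tuesday rides
typical_std = 312      # standard deviation of Tuesday rides

# z-score: how many standard deviations the observation sits from the mean
z_score = (observed_rides - typical_mean) / typical_std
print(f"z-score: {z_score:.2f}")  # prints "z-score: -1.28"
```

A value about 1.3 standard deviations below the average is low enough to flag for a closer look, though context decides whether it is a real anomaly.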

### 1.2. Your Client’s Million-Dollar Questions Demand Statistical Answers

Your client’s leadership team is asking the kinds of questions statistics are built to answer:

- “What’s our typical daily demand?”
- “How much should we budget for weather-related fluctuations?”
- “Which factors drive ridership growth?”

These aren’t casual inquiries—they’re multimillion-dollar business decisions. Each new station costs money, staffing adjustments impact hundreds of thousands in expenses, and weather strategies influence customer satisfaction and competitiveness.

Your statistical analyses will provide the quantitative evidence to guide these strategic choices. By identifying how weather affects demand, uncovering seasonal cycles, and setting reliable benchmarks, you’ll transform uncertainty into actionable intelligence.

Ultimately, your insights will shape critical business decisions about system expansion, resource allocation, and market positioning—directly influencing your client’s success in the fast-evolving world of urban mobility.

## 2. Descriptive Statistics Fundamentals

This section establishes comprehensive understanding of descriptive statistics and their applications in transportation data analysis. We'll explore two fundamental categories of statistical measures that answer critical business questions. First, we'll examine central tendency measures that answer "What is typical?" - essential for setting baseline expectations and operational targets. Then we'll investigate variability measures that answer "How much does it vary?" - crucial for risk assessment and capacity planning. Let's begin with central tendency measures, which provide the foundation for all statistical analysis.

### 2.1. Understanding Central Tendency Measures

Let's explore the three primary ways to measure "typical" values in your transportation data, understanding when each measure is most appropriate and what business insights each provides.

Central tendency measures answer the fundamental question: "What is typical?" in transportation demand patterns. These measures provide essential reference points that enable meaningful comparisons, operational planning, and performance evaluation in transportation systems. We'll examine three complementary measures - mean, median, and mode - each revealing different aspects of typical demand patterns.

**The Mean: Average Demand Analysis**

The arithmetic mean - commonly called the "average" - represents the central point that balances all observations in your dataset. Think of the mean as the "center of gravity" for your data: if you had to summarize all your bike demand observations with a single number representing "typical demand," the mean is that number.

In statistical terms, the mean is calculated by summing all observations and dividing by the number of observations. The result is the value that minimizes the sum of squared distances to all data points (the value that minimizes the sum of absolute distances is the median), making it a natural reference point for typical conditions. For transportation data, the mean provides a baseline reference for typical system utilization - the number you'd quote when asked "What's normal demand?"

To understand mean calculation with concrete bike-sharing examples: if hourly ridership values are 45, 67, 89, 156, 234, 445, 389, 267, 178, 123, 87, and 52 rides, the mean equals (45+67+89+156+234+445+389+267+178+123+87+52) ÷ 12 = 177.7 rides per hour. This mean value represents typical hourly demand that serves as a planning baseline.

In Python using pandas, calculating the mean is straightforward:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate mean of a single column
mean_rides = df['count'].mean()
print(f"Mean hourly rides: {mean_rides:.2f}")

# Calculate conditional means for different contexts
# Note: workingday == 0 may also include holidays, not only weekends
weekday_mean = df[df['workingday'] == 1]['count'].mean()
weekend_mean = df[df['workingday'] == 0]['count'].mean()
print(f"Weekday mean: {weekday_mean:.2f}")
print(f"Weekend mean: {weekend_mean:.2f}")
```

```
Mean hourly rides: 191.57
Weekday mean: 193.01
Weekend mean: 188.51
```


This code demonstrates calculating both overall means and conditional means for specific subsets of your data.

However, transportation demand rarely follows simple patterns, making mean interpretation more complex than in other domains. Rush hour peaks can dramatically skew daily means upward, while overnight low-demand periods pull means downward. For example, if your dataset includes extreme peak hours with 800+ rides and overnight hours with fewer than 10 rides, the overall mean may not represent the demand levels you encounter most frequently.

Professional transportation analysis calculates conditional means for specific contexts rather than relying solely on overall system means. Mean weekday rush hour demand, mean weekend recreational demand, and mean winter demand provide more actionable insights than a single overall mean that combines all conditions.
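Conditional means for several contexts at once can be computed with a `groupby`. The sketch below uses a small made-up table rather than the lecture's dataset; the column names mirror that dataset, but every value is hypothetical.

```python
import pandas as pd

# Hypothetical hourly records; column names mirror the lecture's dataset,
# but the values are made up for illustration.
sample = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "hour":       [8, 8, 14, 14, 8, 8, 14, 14],
    "count":      [420, 380, 150, 170, 90, 110, 310, 290],
})

# One overall mean hides the contrast between contexts
overall_mean = sample["count"].mean()

# Conditional means: one value per (workingday, hour) combination
conditional_means = sample.groupby(["workingday", "hour"])["count"].mean()

print(f"Overall mean: {overall_mean:.1f}")
print(conditional_means)
```

In this toy table the overall mean is 240 rides, yet none of the four contexts averages close to that figure, which is the reason for conditioning in the first place.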

**The Median: Robust Central Tendency**

The median represents the middle value when all observations are arranged in numerical order. Think of the median as the "50% point" - half of all observations fall below it, half above it. Unlike the mean, the median is largely unaffected by extreme values, providing a more robust measure of typical conditions in transportation systems that experience occasional very high or very low demand periods.

To calculate the median with bike-sharing examples: if daily ridership values arranged in order are 1,234, 2,156, 2,789, 3,445, 3,678, 4,123, 4,567, 5,234, 6,789, 7,456, and 9,234 rides, the median equals 4,123 rides (the middle value in this 11-observation dataset). This median value indicates that half of all days experience fewer than 4,123 rides while half experience more than 4,123 rides.

In Python using pandas, calculating the median is equally simple:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate median of hourly rides
median_rides = df['count'].median()
print(f"Median hourly rides: {median_rides:.2f}")

# Compare mean and median to understand distribution
mean_rides = df['count'].mean()
if mean_rides > median_rides:
    print("Distribution shows positive skew (occasional high demand periods)")
elif mean_rides < median_rides:
    print("Distribution shows negative skew (occasional low demand periods)")
else:
    print("Distribution is relatively symmetric")
```

```
Median hourly rides: 145.00
Distribution shows positive skew (occasional high demand periods)
```

This code calculates the median and compares it with the mean to understand your demand distribution characteristics.

The relationship between mean and median reveals important characteristics of demand distribution patterns. When mean exceeds median significantly, demand distribution shows positive skew with occasional very high demand days pulling the average upward. This pattern is common in transportation systems due to special events, exceptional weather, or holiday periods that create demand spikes.

For example, a few extremely busy festival weekends might inflate the average daily rentals, leading operators to overestimate regular demand if they rely solely on the mean. Recognizing this skew helps planners avoid over-allocating resources (like too many bikes or vehicles) based on inflated averages.

When median exceeds mean (like mean = 4,789 and median = 5,234), demand distribution exhibits negative skew with occasional very low demand periods pulling the average downward. This pattern might occur in systems with weather-related shutdowns or seasonal closure periods.

For instance, several days of severe snow might drastically reduce usage, lowering the mean even though most days remain near normal levels. Understanding this helps planners prevent underestimating typical demand, ensuring consistent service even after temporary dips.
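The pull that extreme days exert on the mean, but not on the median, can be checked directly. The sketch below reuses the eleven daily values from the median example and then swaps the busiest day for a hypothetical festival day of 30,000 rides (an assumed figure, chosen only to exaggerate the effect).

```python
import pandas as pd

# Daily ridership values from the median example above
daily = pd.Series([1234, 2156, 2789, 3445, 3678, 4123, 4567, 5234, 6789, 7456, 9234])

# Hypothetical variant: the busiest day becomes an extreme festival day
with_spike = daily.copy()
with_spike.iloc[-1] = 30000

for label, series in [("Original", daily), ("With spike", with_spike)]:
    print(f"{label}: mean = {series.mean():.0f}, median = {series.median():.0f}")
```

The median stays at 4,123 rides in both cases, while the mean rises by almost 1,900 rides because of a single day.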

**The Mode: Most Common Demand Levels**

The mode represents the most frequently occurring value in your dataset. Think of the mode as the "most typical" value - the demand level your operations staff encounter most often. For continuous variables like bike demand, modal analysis typically involves grouping similar values into ranges and identifying the most common range. This analysis reveals the demand patterns that occur most consistently.

In transportation systems, modal analysis often reveals operational norms. For example, if hourly demand analysis shows that 45-55 rides per hour occurs most frequently (appearing in 23% of all hours), followed by 156-166 rides per hour (18% of hours), and 234-244 rides per hour (15% of hours), these modal ranges indicate the three most common operational conditions your system experiences.

Modal analysis becomes particularly valuable for operational planning because it identifies the conditions that staff encounter most frequently. Rather than planning for average conditions that may rarely occur exactly, modal analysis reveals the specific demand levels that operations must handle most often.
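One way to find a modal range is to bin the counts and look for the most populated bin. The sketch below assumes 20-ride-wide bins and uses made-up hourly counts; both the bin width and the data are illustrative choices, not values from the lecture's dataset.

```python
import pandas as pd

# Hypothetical hourly ride counts, made up for illustration
hourly = pd.Series([48, 52, 47, 55, 51, 160, 158, 162, 240, 236, 49, 53, 410, 157])

# Group counts into 20-ride-wide, left-closed bins: [0, 20), [20, 40), ...
bin_edges = list(range(0, 501, 20))
binned = pd.cut(hourly, bins=bin_edges, right=False)

# Share of hours falling in each bin, most common first
bin_shares = binned.value_counts(normalize=True)
modal_bin = bin_shares.idxmax()

print(bin_shares.head(3))
print(f"Modal range: {modal_bin} ({bin_shares.max():.0%} of hours)")
```

The bin width matters: bins that are too wide hide distinct operating conditions, while bins that are too narrow spread observations so thinly that no range stands out.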

### 2.2. Variability and Uncertainty Measures

Understanding typical values through central tendency is only half the story - you also need to quantify how much values fluctuate around those typical points. Let's explore how variability measures provide this critical insight, enabling risk assessment, operational flexibility planning, and confidence interval construction for demand forecasting.

Variability measures quantify the uncertainty and risk inherent in transportation demand patterns. These measures are essential for operational planning because transportation systems must accommodate demand variation while maintaining service quality and cost efficiency. We'll examine four essential variability measures: range (total spread), standard deviation (typical deviation), variance (squared deviation), and coefficient of variation (relative uncertainty).

**Range Analysis for Operational Scope**

Range measures the difference between maximum and minimum values in your dataset, providing the simplest measure of demand variability. Think of range as the "full spectrum" your operations must potentially handle - from the quietest to the busiest conditions. Range indicates the total span of conditions that operational systems must potentially accommodate.

With concrete bike-sharing examples: if daily demand ranges from 445 rides (minimum winter weekday) to 8,967 rides (maximum summer weekend), the range equals 8,967 - 445 = 8,522 rides. This range indicates that operational capacity must potentially handle an 8,522-ride difference between extreme conditions.

In Python using pandas, calculating the range and identifying extremes:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate range
min_rides = df['count'].min()
max_rides = df['count'].max()
range_rides = max_rides - min_rides
print(f"Demand range: {min_rides} to {max_rides} rides ({range_rides} ride difference)")

# Identify when extremes occurred
min_hour = df[df['count'] == min_rides][['datetime', 'count', 'temp', 'weather']]
max_hour = df[df['count'] == max_rides][['datetime', 'count', 'temp', 'weather']]
print(f"\nMinimum demand hour:\n{min_hour}")
print(f"\nMaximum demand hour:\n{max_hour}")
```

```
Demand range: 1 to 977 rides (976 ride difference)

Minimum demand hour:
                 datetime  count   temp  weather
4     2011-01-01 04:00:00      1   9.84        1
5     2011-01-01 05:00:00      1   9.84        2
30    2011-01-02 07:00:00      1  16.40        2
49    2011-01-03 04:00:00      1   6.56        1
71    2011-01-04 02:00:00      1   5.74        1
...                   ...    ...    ...      ...
6884  2012-04-05 04:00:00      1  15.58        1
7051  2012-04-12 04:00:00      1  12.30        1
7433  2012-05-09 02:00:00      1  22.96        3
10288 2012-11-14 02:00:00      1   9.84        1
10672 2012-12-11 02:00:00      1  16.40        2

[105 rows x 4 columns]

Maximum demand hour:
                datetime  count   temp  weather
9345 2012-09-12 18:00:00    977  27.06        1
```

This code calculates the range and reveals the conditions when extreme demands occurred, providing context for operational planning.

Range analysis reveals the scope of operational flexibility required but provides limited insight into typical variation. Most days may fall within a much narrower demand range, with extreme values occurring infrequently. Professional analysis supplements range with more sophisticated variability measures that better represent normal operational uncertainty.
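To see how much narrower the typical spread can be than the full range, one common supplement is the interquartile range (the spread of the middle 50% of observations). The lecture does not name this measure, so it is offered here as one possible choice. The daily totals below are made up, with the minimum and maximum borrowed from the range example above.

```python
import pandas as pd

# Hypothetical daily ride totals; only the minimum (445) and maximum (8,967)
# come from the range example above, the rest are made up.
daily = pd.Series([445, 2100, 2900, 3300, 3600, 3900, 4200, 4500, 4900, 5400, 6100, 8967])

full_range = daily.max() - daily.min()

# Interquartile range: spread of the middle 50% of days
q1, q3 = daily.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"Full range: {full_range} rides")
print(f"Interquartile range: {iqr:.0f} rides (from {q1:.0f} to {q3:.0f})")
```

In this toy series the middle half of days spans 1,825 rides, roughly a fifth of the 8,522-ride full range, because the two extremes are isolated days.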

**Standard Deviation: Quantifying Normal Variation**

Imagine you're planning bike capacity for tomorrow. Knowing that average demand is 4,504 rides helps, but how confident should you be in that number? Will actual demand be 4,500 rides (very close), 3,000 rides (much lower), or 6,000 rides (much higher)? Standard deviation answers this critical question by measuring the "typical amount of wiggle" around the average.

Think of standard deviation as measuring the consistency of your data. Low standard deviation means most days cluster tightly around the average - predictable, consistent demand enabling precise operational planning. High standard deviation means days scatter widely above and below the average - volatile, unpredictable demand requiring flexible operations and larger capacity buffers.

With bike-sharing examples: if daily demand has mean = 4,504 rides and standard deviation = 1,247 rides, and demand is approximately normally distributed, then about 68% of all days will fall within one standard deviation of the mean (between 3,257 and 5,751 rides), and about 95% of days will fall within two standard deviations (between 2,010 and 6,998 rides). For strongly skewed data these percentages are only rough guides.

In Python using pandas, calculating standard deviation and variance:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate standard deviation and variance
std_rides = df['count'].std()
var_rides = df['count'].var()
mean_rides = df['count'].mean()

print(f"Mean hourly rides: {mean_rides:.2f}")
print(f"Standard deviation: {std_rides:.2f} rides")
print(f"Variance: {var_rides:.2f} rides²")

# Mean ± 1 and ± 2 standard deviations. These bands would hold about 68% and
# 95% of values if demand were normally distributed; they are descriptive
# ranges, not confidence intervals.
lower_68 = mean_rides - std_rides
upper_68 = mean_rides + std_rides
lower_95 = mean_rides - (2 * std_rides)
upper_95 = mean_rides + (2 * std_rides)

print(f"\n68% of hours fall between {lower_68:.0f} and {upper_68:.0f} rides")
print(f"95% of hours fall between {lower_95:.0f} and {upper_95:.0f} rides")
```

```
Mean hourly rides: 191.57
Standard deviation: 181.14 rides
Variance: 32813.31 rides²

68% of hours fall between 10 and 373 rides
95% of hours fall between -171 and 554 rides
```


This code calculates the standard deviation, the variance, and the one- and two-standard-deviation bands around the mean, which together quantify demand uncertainty for operational planning.

The output reveals substantial demand variability in the bike-sharing system. With a standard deviation of 181.14 rides relative to a mean of 191.57 rides, the variability is nearly as large as the typical demand itself, indicating highly volatile demand patterns that require flexible operational planning.

The one-standard-deviation band (10 to 373 rides) runs from near-zero demand to roughly double the average; under a normal distribution it would contain about two-thirds of all hours. The two-standard-deviation band runs from -171 to 554 rides. Negative demand is impossible: the result arises because the 68%/95% rule assumes roughly symmetric, normally distributed data, whereas ride counts are bounded at zero and right-skewed. A negative lower bound therefore signals that the standard deviation is very large relative to the mean and that the normal approximation fits poorly, so the printed percentages should be read as approximations. Note also that these bands describe the spread of individual hours; they are not confidence intervals, which quantify the uncertainty of an estimate such as the mean. In practice, we interpret the result as follows: the system experiences everything from nearly empty hours (late night, poor weather) to peak periods (rush hours, ideal conditions) with tremendous variation.

This level of variability has important operational implications: relying on the average demand for capacity planning would be inadequate, as actual demand frequently deviates substantially. The system requires flexible staffing, dynamic bike redistribution, and robust contingency plans to handle the wide range of demand scenarios that occur regularly.
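Because ride counts are bounded at zero and skewed, the share of observations inside mean ± 1 or 2 standard deviations need not match the 68% and 95% figures. The sketch below checks this on synthetic right-skewed data; the exponential distribution and its scale are assumptions chosen for illustration, not properties of the lecture's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "hourly demand": an assumption for illustration,
# not the lecture's dataset.
rng = np.random.default_rng(0)
demand = pd.Series(rng.exponential(scale=190, size=10_000))

mean, std = demand.mean(), demand.std()

# Observed share of values inside mean ± k standard deviations
share_1sd = demand.between(mean - std, mean + std).mean()
share_2sd = demand.between(mean - 2 * std, mean + 2 * std).mean()

print(f"Within 1 SD: {share_1sd:.1%} (normal rule of thumb: about 68%)")
print(f"Within 2 SD: {share_2sd:.1%} (normal rule of thumb: about 95%)")
print(f"Lower 2-SD bound: {mean - 2 * std:.0f} (negative, yet no value is below 0)")
```

Running the same `between(...)` check on the real `count` column would show how far the actual shares depart from the rule of thumb.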

**Coefficient of Variation: Relative Uncertainty**

The coefficient of variation equals standard deviation divided by the mean, providing a relative measure of variability that enables comparisons across different scales and systems. Think of coefficient of variation as "variability per unit of average" - it answers the question "Is this system's variation large or small relative to its typical demand level?"

This relative measure proves particularly valuable when comparing different transportation systems or time periods with different demand scales. A system averaging 10,000 rides with standard deviation of 2,000 might seem more variable than a system averaging 2,000 rides with standard deviation of 400 - but coefficient of variation reveals they have identical relative variability (both 20%).

For example, System A with mean = 2,234 rides and standard deviation = 445 rides has coefficient of variation = 445 ÷ 2,234 = 0.199 (19.9%). System B with mean = 6,789 rides and standard deviation = 1,023 rides has coefficient of variation = 1,023 ÷ 6,789 = 0.151 (15.1%). Despite System B having higher absolute variability, System A shows higher relative uncertainty requiring more flexible operational approaches.

In Python using pandas, calculating the coefficient of variation:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate coefficient of variation for overall demand
mean_rides = df['count'].mean()
std_rides = df['count'].std()
cv = std_rides / mean_rides

print(f"Mean: {mean_rides:.2f} rides")
print(f"Standard deviation: {std_rides:.2f} rides")
print(f"Coefficient of variation: {cv:.3f} ({cv*100:.1f}%)")

# Compare CV across different conditions
weekday_cv = df[df['workingday'] == 1]['count'].std() / df[df['workingday'] == 1]['count'].mean()
weekend_cv = df[df['workingday'] == 0]['count'].std() / df[df['workingday'] == 0]['count'].mean()

print(f"\nWeekday CV: {weekday_cv:.3f} ({weekday_cv*100:.1f}%)")
print(f"Weekend CV: {weekend_cv:.3f} ({weekend_cv*100:.1f}%)")

if weekday_cv < weekend_cv:
    print("Weekdays show more consistent demand patterns")
else:
    print("Weekends show more consistent demand patterns")
```

```
Mean: 191.57 rides
Standard deviation: 181.14 rides
Coefficient of variation: 0.946 (94.6%)

Weekday CV: 0.956 (95.6%)
Weekend CV: 0.922 (92.2%)
Weekends show more consistent demand patterns
```


This code calculates coefficient of variation for different conditions, revealing which operational contexts exhibit more predictable demand patterns.

The output reveals exceptionally high demand variability. A coefficient of variation of 94.6% means the standard deviation is nearly equal to the mean, so the typical deviation from average is almost as large as the average itself. As a rough rule of thumb (conventions vary by field), CVs above 30-40% are often treated as high variability; a CV near 95% points to highly variable demand that calls for flexible operations.

The comparison between weekdays (95.6% CV) and weekends (92.2% CV) shows that weekends are slightly more consistent, but the difference is modest. Both contexts exhibit extremely high variability, suggesting that neither weekday commuting patterns nor weekend recreational patterns provide substantially more predictable demand. This finding indicates that factors beyond the weekday/weekend distinction—such as weather, time of day, and seasonal effects—likely drive most of the demand variation the system experiences.
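If time of day drives much of the variation, the coefficient of variation within each time-of-day block should be far smaller than the overall figure. The sketch below illustrates the idea with made-up counts; the `period` column and the helper `coefficient_of_variation` are hypothetical names introduced for this example.

```python
import pandas as pd

# Hypothetical hourly counts tagged with a time-of-day block (made-up values)
sample = pd.DataFrame({
    "period": ["night"] * 4 + ["rush"] * 4 + ["midday"] * 4,
    "count":  [5, 8, 6, 9, 400, 450, 420, 470, 180, 200, 190, 210],
})

def coefficient_of_variation(values: pd.Series) -> float:
    """Sample standard deviation divided by the mean."""
    return values.std() / values.mean()

overall_cv = coefficient_of_variation(sample["count"])
cv_by_period = sample.groupby("period")["count"].apply(coefficient_of_variation)

print(f"Overall CV: {overall_cv:.2f}")
print(cv_by_period.round(2))
```

In this toy data every within-period CV is well below the overall CV, which is the pattern to look for when testing whether a grouping variable explains the variation.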

## 3. Correlation Analysis for Pattern Discovery

Now that you understand how to measure typical values and variability within individual variables, let's explore how to measure relationships between variables. This section develops comprehensive understanding of correlation analysis and its application to transportation demand relationships. We'll establish the mathematical foundations of correlation, explore practical interpretation guidelines, then examine weather-demand relationship applications using the Washington D.C. bike-sharing data. These correlation insights will enable you to identify the environmental and operational factors that most strongly influence demand variation.

### 3.1. Correlation Fundamentals and Interpretation

Let's explore what correlation coefficients mean, how to interpret their values, and why understanding the distinction between correlation and causation is critical for professional consulting recommendations.

Correlation analysis measures the strength and direction of linear relationships between variables, providing quantitative assessment of how changes in one factor relate to changes in another factor. Think of correlation as measuring whether two variables "move together" - when one increases, does the other tend to increase (positive correlation), decrease (negative correlation), or show no consistent pattern (zero correlation)?

In transportation consulting, correlation analysis reveals the environmental and operational factors that most strongly influence demand variation. These correlations guide which variables to include in prediction models, which operational factors to monitor most closely, and which relationships warrant deeper investigation.

**Understanding Correlation Coefficients**

Correlation coefficients, typically represented by 'r', range from -1.00 to +1.00, providing a standardized measure of relationship strength and direction. Let's understand what different correlation values mean for transportation analysis:

Values near +1.00 indicate strong positive relationships where both variables tend to increase or decrease together. For example, r = +0.85 between temperature and bike demand means that as temperature rises, bike demand consistently rises as well.

Values near -1.00 indicate strong negative relationships where one variable tends to increase while the other decreases. For example, r = -0.72 between precipitation and bike demand means that as precipitation increases, bike demand consistently decreases.

Values near 0.00 indicate weak linear relationships where the variables show no consistent pattern of moving together. For example, r = 0.08 between wind speed and bike demand suggests wind has minimal predictable impact on ridership.

The mathematical interpretation requires contextual understanding for transportation applications, and strength is judged by the absolute value of r, so the same thresholds apply to negative correlations. Coefficients with absolute value above 0.70 typically indicate strong relationships worth investigating for business applications. For example, if temperature and bike demand show correlation r = 0.78, this strong positive relationship suggests that temperature increases are associated with substantial demand increases, making temperature a valuable factor for demand forecasting.

Coefficients with absolute value between 0.30 and 0.70 indicate moderate relationships that may have practical significance when combined with other factors. If humidity and demand show correlation r = -0.45, this moderate negative relationship suggests humidity affects demand but other factors also contribute significantly to demand variation.

Coefficients with absolute value below 0.30 generally indicate weak relationships with limited standalone business value, though they might contribute to comprehensive prediction models when combined with other variables.
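For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below computes it from deviations on made-up paired observations and compares the result with pandas, whose `Series.corr` uses the Pearson method by default.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations (made-up values for illustration)
temp = pd.Series([5.0, 9.0, 14.0, 18.0, 22.0, 26.0, 30.0])
rides = pd.Series([60, 95, 150, 210, 260, 300, 280])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each variable's mean
dx = temp - temp.mean()
dy = rides - rides.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = temp.corr(rides)

print(f"Manual r: {r_manual:.3f}, pandas r: {r_pandas:.3f}")
```

Because the coefficient is built from standardized deviations, it has no units and always lies between -1 and +1, which is what makes the strength thresholds comparable across variables.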

**Correlation vs. Causation Distinction**

Correlation analysis identifies statistical associations between variables but does not establish causal relationships. This distinction is crucial for professional transportation consulting because business decisions require understanding the underlying mechanisms that create observed statistical relationships.

High correlation between temperature and bike demand (r = 0.78) doesn't prove that temperature directly causes demand changes. The observed correlation might reflect direct comfort effects (people prefer cycling in pleasant weather), indirect seasonal activity patterns (summer brings vacation time and outdoor activities), daylight hour variations (longer days enable more cycling opportunities), or complex interactions with other environmental factors that change simultaneously with temperature.

Professional consultants use correlation analysis as a screening tool to identify relationships worth investigating through deeper analysis. Strong correlations suggest hypotheses worth testing, but actionable business recommendations require understanding the causal mechanisms behind observed statistical associations. This deeper understanding comes from domain expertise, controlled experiments, or advanced causal inference techniques beyond simple correlation.
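A small simulation makes the distinction concrete. In the synthetic data below, a hidden factor (summer versus winter) raises both temperature and demand, while temperature itself has no direct effect by construction. All numbers are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: a hidden factor (summer vs. winter) drives both variables;
# temperature has NO direct effect on demand in this construction.
rng = np.random.default_rng(42)
n = 2000
summer = rng.integers(0, 2, size=n)                       # 0 = winter, 1 = summer
temp = 10 + 15 * summer + rng.normal(0, 3, size=n)        # warmer in summer
demand = 100 + 200 * summer + rng.normal(0, 30, size=n)   # busier in summer

data = pd.DataFrame({"summer": summer, "temp": temp, "demand": demand})

# Correlation across the whole dataset
overall_r = data["temp"].corr(data["demand"])

# Correlation within each season, i.e. holding the hidden factor fixed
within_season_r = {
    season: group["temp"].corr(group["demand"])
    for season, group in data.groupby("summer")
}

print(f"Overall r: {overall_r:.2f}")
for season, r in within_season_r.items():
    print(f"Season {season}: r = {r:.2f}")
```

The overall correlation is strong, yet it largely disappears within each season. A strong coefficient on its own therefore cannot tell you whether changing temperature would change demand.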

**Correlation Matrix Analysis**

Comprehensive correlation analysis examines relationships between multiple variables simultaneously through correlation matrices. Think of a correlation matrix as a "relationship map" showing how every variable relates to every other variable in your dataset. These matrices reveal the complex web of relationships within transportation systems and help identify the most influential demand drivers.

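For the Washington D.C. bike-sharing system, a correlation matrix might reveal values such as the following (illustrative figures; the hourly data analyzed below gives different values):
- Temperature vs. Count: r = 0.627 (moderate positive relationship, the strongest in this set)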
- Humidity vs. Count: r = -0.348 (moderate negative relationship)
- Windspeed vs. Count: r = -0.234 (weak negative relationship)
- Season vs. Count: r = 0.178 (weak positive relationship)
- Workingday vs. Count: r = 0.267 (weak positive relationship)

This matrix immediately identifies temperature as the strongest weather predictor, reveals that humidity has moderate negative effects, and shows that working day status has some influence on demand patterns.

In Python using pandas, calculating correlations and creating correlation matrices:

```python
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate correlation between two variables
temp_correlation = df['temp'].corr(df['count'])
print(f"Temperature-Demand correlation: {temp_correlation:.3f}")

# Create correlation matrix for multiple variables
variables = ['temp', 'humidity', 'windspeed', 'count']
correlation_matrix = df[variables].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

# Rank by absolute value so strong negative correlations are not pushed
# to the bottom of the list
demand_correlations = correlation_matrix['count'].sort_values(key=abs, ascending=False)
print("\nVariables ranked by correlation strength with demand:")
print(demand_correlations)

# Interpret correlation strength
for var, corr in demand_correlations.items():
    if var != 'count':  # Skip self-correlation
        if abs(corr) > 0.70:
            strength = "Strong"
        elif abs(corr) > 0.30:
            strength = "Moderate"
        else:
            strength = "Weak"
        direction = "positive" if corr > 0 else "negative"
        print(f"{var}: {strength} {direction} correlation (r = {corr:.3f})")
```

```
Temperature-Demand correlation: 0.394

Correlation Matrix:
               temp  humidity  windspeed     count
temp       1.000000 -0.064949  -0.017852  0.394454
humidity  -0.064949  1.000000  -0.318607 -0.317371
windspeed -0.017852 -0.318607   1.000000  0.101369
count      0.394454 -0.317371   0.101369  1.000000

Variables ranked by correlation strength with demand:
count        1.000000
temp         0.394454
humidity    -0.317371
windspeed    0.101369
Name: count, dtype: float64
temp: Moderate positive correlation (r = 0.394)
humidity: Moderate negative correlation (r = -0.317)
windspeed: Weak positive correlation (r = 0.101)
```


This code calculates individual correlations, creates a comprehensive correlation matrix, and systematically interprets correlation strengths to identify key demand drivers.

The output reveals important patterns in weather-demand relationships. Temperature emerges as the strongest weather predictor with r = 0.394, but this moderate positive correlation indicates that temperature alone explains only about 16% of demand variation (r² ≈ 0.156). While warmer temperatures are associated with higher ridership, the relationship is far from deterministic.

Humidity shows a moderate negative correlation (r = -0.317), suggesting that more humid conditions somewhat discourage cycling, though again the effect is moderate. Interestingly, windspeed shows only a weak positive correlation (r = 0.101), essentially indicating no meaningful linear relationship with demand.

Critically, no single weather variable exhibits a strong correlation (|r| > 0.70) with demand. This finding suggests that bike-sharing demand is multifactorial: it is driven by interactions between weather conditions, temporal patterns (time of day, day of week, season), and other factors rather than being dominated by any single weather variable. This insight guides modeling strategy: effective demand prediction will require considering multiple variables together rather than relying on any single predictor.
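One way to see why combining predictors helps is to compare the R² of a one-variable fit with that of a multi-variable fit. The sketch below uses synthetic data, not the lecture's dataset; the coefficients, the noise level, the `rush_hour` flag, and the `r_squared` helper are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Synthetic predictors (illustrative ranges only)
temp = rng.uniform(0, 40, n)
humidity = rng.uniform(20, 100, n)
rush_hour = rng.integers(0, 2, n)  # 1 during commuting peaks, else 0

# Synthetic demand driven by all three factors plus random noise
demand = 50 + 5 * temp - 1.5 * humidity + 150 * rush_hour + rng.normal(0, 60, n)

def r_squared(predictors, target):
    """R^2 of an ordinary least-squares fit that includes an intercept."""
    X = np.column_stack([np.ones(len(target)), *predictors])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return 1 - residuals.var() / target.var()

r2_temp_only = r_squared([temp], demand)
r2_combined = r_squared([temp, humidity, rush_hour], demand)

print(f"Temperature only: R^2 = {r2_temp_only:.3f}")
print(f"All three:        R^2 = {r2_combined:.3f}")
```

On the training data, adding predictors to a least-squares fit can never lower R², so the comparison is most informative when the gain is large.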

### 3.2. Weather-Demand Correlation Applications

Now that you understand correlation fundamentals, let's apply these concepts to one of the most critical relationships in transportation forecasting: how weather conditions affect bike-sharing demand. We'll examine temperature effects, precipitation impacts, and how correlation relationships vary across different temporal contexts.

Weather represents one of the most immediate and significant factors affecting transportation demand, making weather-demand correlation analysis essential for operational planning and strategic consulting. Understanding these relationships enables weather-based demand forecasting, operational contingency planning, and evidence-based resource allocation strategies.

**Temperature-Demand Relationship Analysis**

Temperature correlation with bike demand is typically positive within comfortable ranges, but a single correlation coefficient may not capture the full shape of the relationship.

As a hypothetical illustration (these figures are not computed from the dataset): if daily ridership averaged 3,456 rides at 15°C and 6,234 rides at 25°C, demand would rise by about 80% across that 10°C range, a strong positive response within the range.

However, temperature effects are non-linear. Moderate temperatures (15-25°C) generally produce optimal cycling conditions with strong positive demand response. Extremely high temperatures above 35°C might actually reduce demand as conditions become uncomfortable for physical activity, while very low temperatures below 0°C create strong negative demand effects.
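To see how a non-linear response can hide behind a single coefficient, the sketch below generates synthetic data in which demand peaks near 27°C and falls off on either side, then compares the correlation over all temperatures with the correlation inside narrower ranges. The peak location, curve shape, and noise level are assumptions made up for the illustration, not estimates from the lecture's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic hourly data: demand peaks near 27°C and declines on both sides
temp = rng.uniform(-5, 40, 2_000)
demand = 300 - 0.6 * (temp - 27) ** 2 + rng.normal(0, 20, temp.size)

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

mild = (temp >= 15) & (temp <= 25)   # comfortable range
hot = temp >= 30                     # beyond the assumed peak

r_all = pearson(temp, demand)
r_mild = pearson(temp[mild], demand[mild])
r_hot = pearson(temp[hot], demand[hot])

print(f"All temperatures: r = {r_all:.3f}")
print(f"15-25°C only:     r = {r_mild:.3f}")
print(f"30°C and above:   r = {r_hot:.3f}")
```

In data shaped like this, the sign of the correlation flips between the mild and hot ranges, which is why binning or plotting is worth doing before trusting one overall coefficient.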

In Python using pandas, analyzing temperature-demand relationships:

In [22]:
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate temperature-demand correlation
temp_demand_corr = df['temp'].corr(df['count'])
print(f"Temperature-Demand correlation: {temp_demand_corr:.3f}")
print(f"Temperature explains {(temp_demand_corr**2)*100:.1f}% of demand variation")

# Examine relationship at different temperature ranges
temp_bins = [0, 10, 20, 30, 50]  # Temperature in Celsius
temp_labels = ['Cold', 'Cool', 'Moderate', 'Warm']
df['temp_category'] = pd.cut(df['temp'], bins=temp_bins, labels=temp_labels)

# Calculate average demand by temperature category
temp_demand_by_category = df.groupby('temp_category', observed=False)['count'].agg(['mean', 'std', 'count'])
print("\nDemand by temperature category:")
print(temp_demand_by_category)

# Identify optimal temperature range
optimal_temp_category = temp_demand_by_category['mean'].idxmax()
print(f"\nOptimal temperature category for demand: {optimal_temp_category}")

Temperature-Demand correlation: 0.394
Temperature explains 15.6% of demand variation

Demand by temperature category:
                     mean         std  count
temp_category                               
Cold            73.185862   92.035861   1259
Cool           150.465053  145.623198   4049
Moderate       223.411398  195.357875   4334
Warm           334.274116  181.823864   1244

Optimal temperature category for demand: Warm


This code quantifies the temperature-demand relationship and reveals how demand varies across different temperature ranges, identifying optimal conditions.

The output demonstrates a clear positive relationship between temperature and demand. Average ridership rises steadily from 73 rides per hour in cold conditions (0-10°C) to 334 rides per hour in warm conditions (30-50°C), a 4.6-fold increase. This progression is consistent with warmer conditions encouraging substantially more ridership, although the broad Warm bin could mask any decline at extreme heat.

The distribution of observations reveals that the system operates primarily in cool and moderate temperature ranges (10-30°C), which account for 77% of all hours (8,383 of 10,886). Warm temperatures above 30°C occur less frequently (only 11% of hours) but generate the highest demand when they do occur. Cold temperatures below 10°C represent 12% of operating hours and experience the lowest demand.

Notably, the high standard deviations within each category (ranging from 92 to 195 rides) indicate substantial variability even after accounting for temperature. This suggests that while temperature is an important factor, other variables—such as time of day, day of week, or additional weather factors—also significantly influence demand within each temperature range.
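The shares of hours and the relative spread discussed above can be recomputed directly from the summary table. A minimal sketch, assuming the table values are copied by hand (the `categories` and `summary` names are illustrative); it uses the coefficient of variation, std divided by mean, as one simple way to compare spread across categories with very different averages:

```python
# Summary statistics copied from the temperature-category table above
categories = {
    "Cold":     {"mean": 73.185862,  "std": 92.035861,  "count": 1259},
    "Cool":     {"mean": 150.465053, "std": 145.623198, "count": 4049},
    "Moderate": {"mean": 223.411398, "std": 195.357875, "count": 4334},
    "Warm":     {"mean": 334.274116, "std": 181.823864, "count": 1244},
}

total_hours = sum(row["count"] for row in categories.values())

summary = {}
for name, row in categories.items():
    summary[name] = {
        "share_of_hours": row["count"] / total_hours,
        "cv": row["std"] / row["mean"],  # coefficient of variation
    }
    print(
        f"{name:<9} share of hours = {summary[name]['share_of_hours']:.1%}, "
        f"CV = {summary[name]['cv']:.2f}"
    )
```

A coefficient of variation near or above 1, as in the Cold category, means the typical deviation is as large as the average itself, so the category mean alone is a poor predictor of any single hour.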

**Precipitation and Weather Condition Effects**

Precipitation is generally expected to reduce bike-sharing demand as conditions become unsuitable for outdoor cycling. However, its effects vary by intensity, duration, and timing, requiring nuanced analysis.

For illustration, light precipitation might show a weak negative correlation (say, r = -0.234) as some users continue cycling in light rain, while heavy precipitation might show a strong one (say, r = -0.678) as cycling becomes impractical and unsafe. These values are hypothetical; the point is that the analysis must consider precipitation intensity rather than treating all precipitation equally.

The dataset encodes weather condition as four ordered categories (1=clear, 2=mist, 3=light rain/snow, 4=heavy rain/snow). One might expect progressively lower demand at each step, for example hypothetical correlations of r = -0.123 for mist, r = -0.445 for light precipitation, and r = -0.734 for heavy precipitation relative to a clear-weather baseline. Whether the data actually shows such a progression has to be checked by comparing demand across the categories, which quantifies the business impact of each condition.

In Python using pandas, analyzing weather condition impacts:

In [23]:
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Analyze demand by weather condition
weather_labels = {1: 'Clear', 2: 'Mist', 3: 'Light Rain/Snow', 4: 'Heavy Rain/Snow'}
df['weather_category'] = df['weather'].map(weather_labels)

weather_analysis = df.groupby('weather_category')['count'].agg(['mean', 'std', 'count'])
print("Demand by weather condition:")
print(weather_analysis)

# Calculate percentage impact relative to clear weather
clear_weather_mean = weather_analysis.loc['Clear', 'mean']
weather_analysis['pct_of_clear'] = (weather_analysis['mean'] / clear_weather_mean) * 100
weather_analysis['demand_loss'] = clear_weather_mean - weather_analysis['mean']

print("\nWeather impact analysis:")
print(weather_analysis[['mean', 'pct_of_clear', 'demand_loss']])

# Calculate correlation for numerical weather variable
weather_corr = df['weather'].corr(df['count'])
print(f"\nWeather condition correlation with demand: {weather_corr:.3f}")

Demand by weather condition:
                        mean         std  count
weather_category                               
Clear             205.236791  187.959566   7192
Heavy Rain/Snow   164.000000         NaN      1
Light Rain/Snow   118.846333  138.581297    859
Mist              178.955540  168.366413   2834

Weather impact analysis:
                        mean  pct_of_clear  demand_loss
weather_category                                       
Clear             205.236791    100.000000     0.000000
Heavy Rain/Snow   164.000000     79.907700    41.236791
Light Rain/Snow   118.846333     57.906934    86.390458
Mist              178.955540     87.194669    26.281251

Weather condition correlation with demand: -0.129


This code quantifies how different weather conditions affect demand, calculating both absolute and relative impacts for operational planning.

The output reveals substantial weather impacts on bike-sharing demand. Clear weather serves as the baseline with 205 rides per hour across 7,192 observations (66% of all hours). Among the well-sampled conditions, light rain or snow shows the strongest negative impact, reducing demand by 42% to about 119 rides per hour, a loss of 86 rides per hour compared to clear conditions. Mist shows a more modest effect, reducing demand by only 13% to 179 rides per hour.

Heavy rain/snow appears in the data with only a single observation (164 rides), which is why its standard deviation is reported as NaN and why it is statistically unreliable for interpretation. This extreme rarity suggests that either the system operates primarily in temperate weather, or that the dataset doesn't capture many severe weather events.

Interestingly, the overall correlation between the numerical weather variable and demand is weak (r = -0.129). Pearson correlation treats the codes 1 to 4 as equally spaced numbers and summarizes the whole relationship in a single coefficient, so it cannot show that the drop from clear to mist is small while the drop to light rain or snow is large. The categorical analysis reveals where the demand loss is concentrated, highlighting the importance of examining relationships through multiple analytical approaches.
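A complementary check that does not rely on the 1-to-4 coding is the correlation ratio (eta squared), which measures the share of demand variance that lies between category means. This is a swapped-in technique, not something the lecture's code computes. The sketch below rebuilds it from the per-category summary table, assuming the printed means, standard deviations, and counts are copied by hand:

```python
# (mean, std, count) per category, copied from the weather table above.
# Heavy Rain/Snow has a single observation, so its std is undefined (NaN in the
# table); 0.0 is a placeholder that the (n - 1) factor below cancels anyway.
groups = {
    "Clear":           (205.236791, 187.959566, 7192),
    "Mist":            (178.955540, 168.366413, 2834),
    "Light Rain/Snow": (118.846333, 138.581297, 859),
    "Heavy Rain/Snow": (164.000000, 0.0, 1),
}

n_total = sum(n for _, _, n in groups.values())
grand_mean = sum(mean * n for mean, _, n in groups.values()) / n_total

# Between-group variation: how far each category mean sits from the grand mean
ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups.values())
# Within-group variation, rebuilt from the sample standard deviations
ss_within = sum((n - 1) * std ** 2 for _, std, n in groups.values())

eta_squared = ss_between / (ss_between + ss_within)

print(f"Grand mean demand: {grand_mean:.1f} rides per hour")
print(f"Share of variance between weather categories: {eta_squared:.1%}")
```

With these figures the between-category share is small, which points to the large spread of demand within each weather category, rather than the encoding alone, as the main reason the coefficient is weak.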

**Temporal Correlation Patterns**

Correlation relationships may vary by season, time of day, or user type, requiring sophisticated analysis to understand how environmental relationships change under different conditions.

As a hypothetical example, temperature correlation with demand might be r = 0.789 in spring, when temperature improvements have maximum psychological impact, but only r = 0.234 in summer, when further temperature increases bring diminishing comfort benefits. Understanding these temporal correlation variations enables more precise seasonal forecasting and operational planning.

Weekend temperature correlation might differ from weekday correlation due to different user populations and trip purposes. Recreational weekend users may show stronger weather sensitivity (for instance, r = 0.823) than commuting weekday users, who have less flexibility in transportation choices (for instance, r = 0.445). These figures are illustrative; the values measured from the dataset appear in the output of the analysis that follows.

In Python using pandas, analyzing temporal correlation variations:

In [24]:
import pandas as pd
import numpy as np

# Load the bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate seasonal temperature-demand correlations
seasons = {1: 'Winter', 2: 'Spring', 3: 'Summer', 4: 'Fall'}
df['season_name'] = df['season'].map(seasons)

print("Temperature-Demand correlation by season:")
for season_name in ['Winter', 'Spring', 'Summer', 'Fall']:
    season_data = df[df['season_name'] == season_name]
    temp_corr = season_data['temp'].corr(season_data['count'])
    print(f"{season_name}: r = {temp_corr:.3f}")

# Compare working days vs non-working days (printed as weekday/weekend below)
is_working = df['workingday'] == 1
weekday_temp_corr = df.loc[is_working, 'temp'].corr(df.loc[is_working, 'count'])
weekend_temp_corr = df.loc[~is_working, 'temp'].corr(df.loc[~is_working, 'count'])

print(f"\nWeekday temperature correlation: r = {weekday_temp_corr:.3f}")
print(f"Weekend temperature correlation: r = {weekend_temp_corr:.3f}")

if weekend_temp_corr > weekday_temp_corr:
    print("Weekend users show stronger weather sensitivity")
else:
    print("Weekday users show stronger weather sensitivity")

Temperature-Demand correlation by season:
Winter: r = 0.457
Spring: r = 0.404
Summer: r = 0.366
Fall: r = 0.324

Weekday temperature correlation: r = 0.345
Weekend temperature correlation: r = 0.505
Weekend users show stronger weather sensitivity


This code reveals how correlation relationships vary across seasons and user contexts, enabling context-specific forecasting strategies.

The output demonstrates important contextual variations in temperature sensitivity. Seasonally, winter shows the highest temperature-demand correlation (r = 0.457), while fall shows the lowest (r = 0.324). A plausible interpretation is that during winter, small temperature increases significantly improve cycling comfort (moving from 5°C to 10°C matters greatly), while in fall, temperature variations within the comfortable range have less impact on riding decisions.

The weekday versus weekend comparison reveals a substantial difference in weather sensitivity. The weekend correlation (r = 0.505) is about 46% higher than the weekday correlation (r = 0.345). Because the split uses the `workingday` flag, the "weekend" group is really every non-working day, which may include holidays as well as Saturdays and Sundays. The result is consistent with the hypothesis that recreational riders have more discretion in their travel decisions and respond more strongly to weather, while weekday commuters need transportation regardless of temperature, which reduces weather's influence on their demand.

These context-specific correlations have practical forecasting implications: temperature forecasts should carry more weight when predicting weekend demand and winter demand, while weekday forecasting models should rely less heavily on temperature and consider other factors like commuting patterns and work schedules.
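Per-group correlations like the ones above can also be computed in a single `groupby` pass instead of filtering the DataFrame once per subset. A minimal sketch on synthetic data: the `demo` frame, its slopes, and its noise level are invented for illustration and are not the lecture's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000

# Synthetic stand-in for the lecture's columns; slopes and noise are made up
demo = pd.DataFrame({
    "workingday": rng.integers(0, 2, n),
    "temp": rng.uniform(0, 40, n),
})
# Assume a weaker temperature response on working days than on non-working days
slope = np.where(demo["workingday"] == 1, 3.0, 8.0)
demo["count"] = 50 + slope * demo["temp"] + rng.normal(0, 60, n)

# One temperature-demand correlation per group, computed in a single pass
corr_by_daytype = demo.groupby("workingday")[["temp", "count"]].apply(
    lambda g: g["temp"].corr(g["count"])
)

print(corr_by_daytype)
```

The same pattern works for any grouping column, for example `season`, and returns a Series indexed by group label that is easy to sort or plot.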

---

## Summary and Transition to Data Visualization

You've mastered essential exploratory data analysis techniques: descriptive statistics, correlation analysis, and temporal pattern detection. These skills transform clean transportation data into quantitative insights that reveal business opportunities and operational patterns.

Your ability to calculate meaningful statistics, identify relationships between variables, and interpret patterns prepares you to work with complex transportation datasets while extracting actionable intelligence for business decision-making.

In the next lecture, you'll learn how to visualize these statistical insights effectively, creating clear and compelling visual narratives that communicate demand patterns, relationships, and business recommendations to stakeholders.