# Lecture 6: Programming Example - Statistical Analysis & Pattern Discovery

## Introduction: Building Your Statistical Analysis Toolkit

Welcome back, junior data consultant! Your Capital Bikes Washington D.C. client was impressed with your data cleaning work, and now they need answers to critical business questions. Today, you'll move from basic data manipulation to statistical analysis, discovering the hidden patterns that drive bike-sharing demand using proven statistical techniques.

Think of statistical analysis like being a detective examining evidence. While anyone can collect data, professional consultants know how to measure patterns, calculate relationships, and quantify uncertainty. You'll learn to use numbers as proof - transforming "we think demand is higher on weekends" into "weekends show 34% higher average demand with 95% confidence, suggesting we need 50-75 additional bikes at key stations."

Your client is counting on you to turn their historical bike-sharing data into evidence-based recommendations worth millions in operational improvements. Every statistical technique you master serves this consulting purpose: providing the mathematical proof that drives confident business decisions.

> **🚀 Interactive Learning Alert**
>
> This is a hands-on statistical analysis tutorial with real transportation data challenges. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your statistical analysis skills
> - **Think like a consultant** - every insight impacts client decisions

---

## Step 1: Setting Up Your Statistical Analysis Environment

Let's begin by importing the essential tools for statistical analysis and loading our bike-sharing dataset:

In [None]:
# Import pandas library for data manipulation (as 'pd' for convenience)
import pandas as pd

# Import numpy library for numerical operations (as 'np' for convenience)
import numpy as np

# Load the Washington D.C. bike-sharing dataset from GitHub repository
# This CSV file contains hourly bike rental data with weather and temporal features
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Convert the datetime column from text strings to proper datetime objects
# This enables time-based operations like extracting hour, day, month, etc.
df['datetime'] = pd.to_datetime(df['datetime'])

# Display basic information about our dataset for context
print("=== Capital Bikes Dataset Overview ===")
print(f"Dataset shape: {df.shape}")  # Shows (rows, columns) - number of observations and features
print(f"Date range: {df['datetime'].min()} to {df['datetime'].max()}")  # Time period covered
print(f"Total observations: {len(df):,} hours of bike-sharing data")  # Total data points

**What this establishes:**
- `pandas` provides all our data manipulation capabilities
- `numpy` handles mathematical operations and statistical calculations
- Converting datetime ensures proper temporal analysis capabilities
- Understanding dataset scope helps interpret statistical results
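
To see why the datetime conversion matters, here is a minimal sketch on a hypothetical three-row sample (the `sample` frame and its timestamps are made up for illustration, not taken from the client dataset). It shows that time components only become available after `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical three-row sample; the real dataset has one row per hour
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00",
                                    "2011-01-01 01:00:00",
                                    "2011-01-02 17:00:00"]})

# Before conversion the column holds plain text, not datetimes
is_text_before = not pd.api.types.is_datetime64_any_dtype(sample["datetime"])

# After conversion the .dt accessor exposes time components
sample["datetime"] = pd.to_datetime(sample["datetime"])
hours = sample["datetime"].dt.hour.tolist()
weekdays = sample["datetime"].dt.day_name().tolist()

print(is_text_before, hours, weekdays)
```

The same `.dt` accessor is what later steps rely on when they extract hour, day of week, and month.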

**Why this matters for your client:**
Your Capital Bikes client needs to understand the scale and timeframe of their data. These basic statistics set the context for all subsequent analysis - you're working with nearly two years of hourly bike rental data.

---

### Challenge 1: Dataset Familiarization
Explore the basic structure of your dataset to understand what variables are available for statistical analysis.

In [None]:
# Your code here
print("=== Dataset Structure ===")
print(_____)  # Display column types and missing values

print("\n=== First Few Records ===")
print(_____)  # Show first few rows

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Structure your dataset exploration around these systematic discovery approaches:
- Use `df.info()` to see column data types and check for missing values comprehensively
- Use `df.head()` and `df.tail()` to preview actual data values and verify data consistency
- Pay attention to which columns are numerical vs categorical for appropriate analysis
- Examine data ranges with `df.describe()` to understand typical values and outliers
- Don't assume all numerical columns should be analyzed the same way
- Remember that some numbers might represent categories (like season or weather codes)
- Think like a consultant: What would your client want to know about their data structure before diving into statistics?
- Consider data quality issues: Are there any obvious errors or inconsistencies?
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Display dataset structure
print("=== Dataset Structure ===")
print(df.info())

print("\n=== First Few Records ===")
print(df.head())
```
</details>

---

## Step 2: Understanding Descriptive Statistics with .describe()

Let's use pandas' `.describe()` method to generate comprehensive summary statistics for our bike-sharing dataset:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Generate comprehensive descriptive statistics for all numerical variables
print("=== Descriptive Statistics for All Numerical Variables ===")
numerical_stats = df.describe()
print(numerical_stats)

# Focus on key business metrics for better clarity
print("\n=== Key Business Metrics Summary ===")
key_metrics = ['temp', 'humidity', 'windspeed', 'count']
business_summary = df[key_metrics].describe()
print(business_summary.round(2))  # Round for easier reading

# Extract specific insights for client presentation
total_demand = df['count'].sum()
avg_hourly_demand = df['count'].mean()
peak_demand = df['count'].max()
min_demand = df['count'].min()

print(f"\n=== Client-Ready Insights ===")
print(f"Total bike rentals: {total_demand:,}")
print(f"Average hourly demand: {avg_hourly_demand:.1f} bikes/hour")
print(f"Peak hourly demand: {peak_demand} bikes")
print(f"Minimum hourly demand: {min_demand} bikes")
print(f"Demand variability (std): {df['count'].std():.1f}")

**What this does:**
- `.describe()` generates eight summary statistics (count, mean, std, min, quartiles, max) for all numerical columns
- `count` shows data completeness (number of non-null observations)
- `mean` reveals average hourly demand - the baseline reference for typical system utilization
- `std` (standard deviation) quantifies demand variability around the mean
- `min/max` define the full operational range from quietest to busiest hours
- Quartiles (25%, 50%, 75%) reveal demand distribution shape and skewness patterns

**Business value:**
Your Capital Bikes client can use these statistics to set operational baselines, understand demand variability for capacity planning, and check whether the demand distribution is symmetric or skewed by comparing the mean with the median (the 50% row): a mean well above the median indicates a right-skewed distribution driven by a few very busy hours.
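
The mean-versus-median comparison can be sketched on a small, hypothetical series of hourly counts (the numbers below are invented to make the skew obvious and are not from the dataset):

```python
import pandas as pd

# Hypothetical hourly counts: mostly quiet hours plus a few very busy ones
counts = pd.Series([5, 10, 12, 15, 20, 25, 30, 40, 300, 500], name="count")

summary = counts.describe()
mean_demand = summary["mean"]
median_demand = summary["50%"]   # The 50% row of describe() is the median

# A mean well above the median points to a right-skewed distribution
is_right_skewed = mean_demand > median_demand
print(f"mean={mean_demand:.1f}, median={median_demand:.1f}, right-skewed={is_right_skewed}")
```

When the two values are far apart like this, the median is usually the safer "typical hour" figure to quote to a client, while the mean remains useful for totals.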

---

### Challenge 2: Interpret Weather Statistics
Analyze the descriptive statistics for weather variables and interpret what they mean for bike-sharing operations.

In [None]:
# Your code here
weather_cols = [_____, _____, _____, _____]  # Define weather columns
weather_stats = df[weather_cols]._____()  # Generate descriptive statistics
print("=== Weather Analysis ===")
print(weather_stats.round(2))

# Create business interpretations
print(f"\n=== Weather Insights for Operations ===")
print(f"Average temperature: {_____:.1f}°C")  # Calculate mean temperature
print(f"Temperature range: {_____:.1f}°C to {_____:.1f}°C")  # Show min to max

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Structure your weather statistics analysis around these key considerations:
- Calculate temperature thresholds: `df[df['temp'] > 25]['count'].mean()` for warm days
- Examine humidity extremes: `df['humidity'].quantile([0.1, 0.9])` for operational ranges
- Interpret standard deviations: high variability indicates unpredictable conditions
- Consider operational impact: extreme temperatures affect bike maintenance and user comfort
- Compare seasonal weather patterns using `df.groupby('season')[weather_cols].mean()`
- Focus on actionable insights: What weather thresholds should trigger operational changes?
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Analyze weather statistics
weather_cols = ['temp', 'humidity', 'windspeed', 'weather']
weather_stats = df[weather_cols].describe()
print("=== Weather Analysis ===")
print(weather_stats.round(2))

# Create business interpretations
print(f"\n=== Weather Insights for Operations ===")
print(f"Average temperature: {df['temp'].mean():.1f}°C")
print(f"Temperature range: {df['temp'].min():.1f}°C to {df['temp'].max():.1f}°C")
```
</details>
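
One caveat: `weather` holds category codes rather than measurements, so its mean and standard deviation in a `.describe()` table are hard to interpret. A frequency table may be a more natural summary. The sketch below uses ten hypothetical weather codes, not real observations:

```python
import pandas as pd

# Hypothetical weather codes for ten hours (1=clear ... 4=heavy rain/snow)
weather = pd.Series([1, 1, 1, 2, 1, 3, 2, 1, 1, 2], name="weather")

# The mean of category codes has no direct physical meaning
code_mean = weather.mean()

# The share of hours per weather category is easier to interpret
share_by_code = weather.value_counts(normalize=True).sort_index()
print(f"mean of codes: {code_mean}")
print(share_by_code)
```

Here the "average weather" of 1.5 describes no real condition, whereas the shares say directly how often each condition occurred.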

---

## Step 3: Seasonal Statistical Comparison with .groupby()

Let's use `.groupby()` to compare statistical patterns across seasons and understand how demand varies throughout the year:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Analyze demand patterns by season
print("=== Seasonal Demand Analysis ===")
seasonal_stats = df.groupby('season')['count'].describe()
print(seasonal_stats.round(1))

# Create season labels for better interpretation
season_names = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}
df['season_name'] = df['season'].map(season_names)

# Compare average demand by season with business context
print("\n=== Seasonal Business Impact ===")
seasonal_averages = df.groupby('season_name')['count'].agg(['mean', 'std', 'max']).round(1)
print(seasonal_averages)

# Calculate seasonal differences for strategic planning
spring_avg = df[df['season'] == 1]['count'].mean()
summer_avg = df[df['season'] == 2]['count'].mean()
fall_avg = df[df['season'] == 3]['count'].mean()
winter_avg = df[df['season'] == 4]['count'].mean()

print(f"\n=== Seasonal Planning Insights ===")
print(f"Spring average: {spring_avg:.1f} bikes/hour")
print(f"Summer average: {summer_avg:.1f} bikes/hour")
print(f"Fall average: {fall_avg:.1f} bikes/hour")
print(f"Winter average: {winter_avg:.1f} bikes/hour")
print(f"Summer vs Winter difference: {summer_avg - winter_avg:.1f} bikes/hour ({((summer_avg - winter_avg)/winter_avg)*100:.1f}% increase)")

**What this does:**
- `.groupby('season')` separates data into distinct seasonal groups (Spring, Summer, Fall, Winter)
- `.describe()` applies statistical functions to each group independently, revealing seasonal patterns
- `.agg(['mean', 'std', 'max'])` calculates multiple statistics simultaneously for comprehensive comparison
- Creating season labels with `.map()` makes output more readable for business stakeholders
- Calculating percentage differences quantifies the magnitude of seasonal variations
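
To make the mechanics concrete, here is a minimal sketch of the same `.map()` / `.groupby()` / `.agg()` pipeline on a hypothetical six-row frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical six-row sample using the same column names as the dataset
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3],
    "count":  [10, 20, 40, 60, 50, 70],
})

# Translate numeric codes into labels, then summarize each group
season_names = {1: "Spring", 2: "Summer", 3: "Fall"}
toy["season_name"] = toy["season"].map(season_names)
by_season = toy.groupby("season_name")["count"].agg(["mean", "max"])

# Percentage difference of Fall relative to the Spring baseline
spring_mean = by_season.loc["Spring", "mean"]
fall_mean = by_season.loc["Fall", "mean"]
pct_diff = (fall_mean - spring_mean) / spring_mean * 100

print(by_season)
print(f"Fall vs Spring: {pct_diff:.1f}%")
```

Note that grouping by the text label sorts the output alphabetically (Fall, Spring, Summer), not in calendar order, which is worth keeping in mind when reading the tables.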

**Key insights from the data:**
The statistics reveal large seasonal variations: Fall shows peak demand (234.4 bikes/hour), while Spring has the lowest demand (116.3 bikes/hour), so Fall is about 101% higher, roughly double. The code's percentage example compares Summer with Winter (an 8.2% difference), but the full seasonal pattern shows that bike-sharing demand depends strongly on the season, with Summer and Fall far ahead of Spring.
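
The Fall-versus-Spring figure can be checked by hand from the two averages quoted above. The sketch also shows that the percentage depends on which season you treat as the baseline:

```python
# Seasonal means quoted in the text above (bikes/hour)
fall_avg = 234.4
spring_avg = 116.3

# Relative increase, measured against the lower (Spring) baseline
increase_vs_spring = (fall_avg - spring_avg) / spring_avg * 100

# Relative decrease, measured against the higher (Fall) baseline
decrease_vs_fall = (fall_avg - spring_avg) / fall_avg * 100

print(f"Fall is {increase_vs_spring:.1f}% above Spring")
print(f"Spring is {decrease_vs_fall:.1f}% below Fall")
```

Both statements describe the same gap, so when presenting to a client it helps to say explicitly which baseline a percentage refers to.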

**Business value:**
Your Capital Bikes client can use these seasonal patterns to plan capacity adjustments (roughly doubling available capacity from Spring to Fall), schedule maintenance during the low-demand Spring season, and allocate resources based on expected demand fluctuations throughout the year. The modest Summer-Winter difference suggests relatively stable demand between those two seasons, while the Spring drop signals a critical period for cost optimization.

---

### Challenge 3: Workday vs Weekend Statistical Analysis
Compare bike-sharing demand patterns between workdays and weekends using groupby analysis.

In [None]:
# Your code here
# Analyze workingday patterns
workday_stats = df.groupby(_____)['count']._____()  # Group by workingday and describe
print("=== Workday vs Weekend Demand ===")
print(workday_stats.round(1))

# Create meaningful labels
workday_labels = {0: 'Weekend/Holiday', 1: 'Workday'}
df['workday_type'] = df['workingday']._____(_____) # Map labels

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Develop your workday analysis to uncover operational patterns:
- Compare peak hour differences: `df.groupby(['workingday', 'hour'])['count'].mean()` (create the `hour` column first with `df['hour'] = df['datetime'].dt.hour`)
- Analyze user type variations: `df.groupby('workingday')[['casual', 'registered']].mean()`
- Calculate demand consistency: examine standard deviations between workdays vs weekends
- Consider revenue implications: workday commuters vs weekend leisure riders have different payment patterns
- Examine time-of-day patterns: rush hours are prominent on workdays but not weekends
- Remember that workingday=0 includes holidays, which may have unique demand patterns
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Analyze workingday patterns
workday_stats = df.groupby('workingday')['count'].describe()
print("=== Workday vs Weekend Demand ===")
print(workday_stats.round(1))

# Create meaningful labels
workday_labels = {0: 'Weekend/Holiday', 1: 'Workday'}
df['workday_type'] = df['workingday'].map(workday_labels)
```
</details>

---

## Step 4: Weather Impact Thresholds and Categorical Analysis

Let's analyze how different weather conditions affect bike-sharing demand using categorical weather variables:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Analyze demand by weather categories
print("=== Weather Impact Analysis ===")
weather_demand = df.groupby('weather')['count'].agg(['count', 'mean', 'std']).round(1)
print(weather_demand)

# Create weather condition labels for business interpretation
weather_labels = {
    1: 'Clear/Partly Cloudy',
    2: 'Mist/Cloudy',
    3: 'Light Snow/Rain',
    4: 'Heavy Rain/Snow'
}
df['weather_condition'] = df['weather'].map(weather_labels)

# Business-focused weather analysis
print("\n=== Weather Condition Business Impact ===")
weather_business = df.groupby('weather_condition')['count'].agg(['mean', 'std', 'min', 'max']).round(1)
print(weather_business)

# Calculate weather impact percentages for strategic decisions
clear_weather_avg = df[df['weather'] == 1]['count'].mean()
rainy_weather_avg = df[df['weather'] == 3]['count'].mean()
weather_impact = ((clear_weather_avg - rainy_weather_avg) / clear_weather_avg) * 100

print(f"\n=== Strategic Weather Insights ===")
print(f"Clear weather average: {clear_weather_avg:.1f} bikes/hour")
print(f"Rainy weather average: {rainy_weather_avg:.1f} bikes/hour")
print(f"Rain reduces demand by: {weather_impact:.1f}%")

# Temperature threshold analysis
print(f"\n=== Temperature Threshold Analysis ===")
hot_days = df[df['temp'] > 30]['count'].mean()
cold_days = df[df['temp'] < 10]['count'].mean()
ideal_days = df[(df['temp'] >= 15) & (df['temp'] <= 25)]['count'].mean()

print(f"Hot days (>30°C) average: {hot_days:.1f} bikes/hour")
print(f"Cold days (<10°C) average: {cold_days:.1f} bikes/hour")
print(f"Ideal temps (15-25°C) average: {ideal_days:.1f} bikes/hour")

**What this does:**
- `.groupby('weather')` separates data by weather condition categories (1=Clear, 2=Mist, 3=Rain, 4=Heavy Rain)
- `.agg(['count', 'mean', 'std'])` calculates multiple statistics simultaneously to understand both demand levels and variability
- `.map(weather_labels)` transforms numeric codes into descriptive labels for business-friendly reporting
- Boolean filtering `df[df['temp'] > 30]` isolates specific temperature ranges for threshold analysis
- Percentage calculations quantify the magnitude of weather impacts on demand

**Business value:**
Your Capital Bikes client can use these weather thresholds to trigger operational adjustments (reducing bike availability during poor weather), plan marketing campaigns (promoting fair-weather cycling), and set realistic demand expectations across different weather scenarios.
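
The filtering and percentage steps can be sketched on a hypothetical seven-row sample (the `toy` values below are invented and only reuse the dataset's column names):

```python
import pandas as pd

# Hypothetical sample with the dataset's column names
toy = pd.DataFrame({
    "temp":    [5, 8, 16, 20, 24, 32, 35],
    "weather": [1, 3, 1, 1, 3, 1, 1],
    "count":   [20, 5, 120, 150, 60, 90, 80],
})

# Each comparison needs its own parentheses when combined with & or |
ideal_mask = (toy["temp"] >= 15) & (toy["temp"] <= 25)
ideal_avg = toy.loc[ideal_mask, "count"].mean()

# Percentage drop from clear (1) to rainy (3) hours
clear_avg = toy.loc[toy["weather"] == 1, "count"].mean()
rainy_avg = toy.loc[toy["weather"] == 3, "count"].mean()
rain_drop_pct = (clear_avg - rainy_avg) / clear_avg * 100

print(f"ideal={ideal_avg:.1f}, clear={clear_avg:.1f}, rainy={rainy_avg:.1f}, rain drop={rain_drop_pct:.1f}%")
```

Without the parentheses, Python would evaluate `&` before the comparisons and raise an error, which is a common stumbling block with combined filters.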

---

### Challenge 4: Humidity Threshold Impact
Investigate how different humidity levels affect bike-sharing demand by creating meaningful humidity categories.

In [None]:
# Your code here
# Create humidity categories for analysis
df['humidity_category'] = pd.cut(df[_____],
                                bins=[_____, _____, _____, _____],
                                labels=['Low (0-40%)', 'Medium (40-70%)', 'High (70-100%)'],
                                include_lowest=True)  # Keep readings of exactly 0% in the 'Low' bin

humidity_analysis = df.groupby(_____, observed=True)[_____].agg(['mean', 'std', 'count']).round(1)
print("=== Humidity Impact Analysis ===")
print(humidity_analysis)

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Build a comprehensive humidity impact analysis for operational decision-making:
- Validate category sizes: ensure each humidity category has sufficient observations for reliable statistics
- Cross-analyze with temperature: `df.groupby(['humidity_category', 'temp_range'])['count'].mean()` (where `temp_range` is a binned temperature column you create first, for example with `pd.cut`)
- Examine user type sensitivity: casual users may be more humidity-sensitive than commuters
- Calculate comfort index: combine temperature and humidity effects using heat index concepts
- Consider seasonal interactions: summer humidity impacts differ from winter humidity
- Focus on actionable thresholds: identify humidity levels that significantly reduce demand for operational planning
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Create humidity categories for analysis
df['humidity_category'] = pd.cut(df['humidity'],
                                bins=[0, 40, 70, 100],
                                labels=['Low (0-40%)', 'Medium (40-70%)', 'High (70-100%)'],
                                include_lowest=True)  # Keep readings of exactly 0% in the 'Low' bin

humidity_analysis = df.groupby('humidity_category', observed=True)['count'].agg(['mean', 'std', 'count']).round(1)
print("=== Humidity Impact Analysis ===")
print(humidity_analysis)
```
</details>
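
One detail of `pd.cut` worth knowing: by default each bin is open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin and becomes `NaN`. The sketch below demonstrates this on five hypothetical humidity readings (shortened labels are used for brevity):

```python
import pandas as pd

# Hypothetical humidity readings, including both extremes and the bin edges
humidity = pd.Series([0, 40, 41, 70, 100])
bins = [0, 40, 70, 100]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 40], (40, 70], (70, 100], so 0 is left out
default_cut = pd.cut(humidity, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 40]
inclusive_cut = pd.cut(humidity, bins=bins, labels=labels, include_lowest=True)

print("Missing with default:", default_cut.isna().sum())
print("Missing with include_lowest:", inclusive_cut.isna().sum())
```

Rows that silently turn into `NaN` are dropped from grouped statistics, so it is worth checking `.isna().sum()` after binning.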

---

## Step 5: Correlation Analysis with .corr()

Let's use correlation analysis to quantify relationships between variables and identify which factors most strongly influence bike-sharing demand:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate correlation matrix for numerical variables
print("=== Correlation Analysis ===")
numerical_vars = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']
correlation_matrix = df[numerical_vars].corr()
print(correlation_matrix.round(3))

# Focus on demand correlations for business insights
print("\n=== Variables Most Correlated with Total Demand ===")
demand_correlations = df[numerical_vars].corr()['count'].sort_values(ascending=False)
print(demand_correlations.round(3))

# Interpret key correlations for client
strongest_positive = demand_correlations.drop('count').max()
strongest_positive_var = demand_correlations.drop('count').idxmax()
strongest_negative = demand_correlations.drop('count').min()
strongest_negative_var = demand_correlations.drop('count').idxmin()

print(f"\n=== Key Correlation Insights ===")
print(f"Strongest positive correlation: {strongest_positive_var} ({strongest_positive:.3f})")
print(f"Strongest negative correlation: {strongest_negative_var} ({strongest_negative:.3f})")

# Business interpretation of correlations
print(f"\n=== Business Interpretation ===")
temp_corr = df['temp'].corr(df['count'])
humidity_corr = df['humidity'].corr(df['count'])

if temp_corr > 0:
    print(f"Temperature has a positive relationship with demand (r={temp_corr:.3f})")
    print("→ Warmer weather encourages more bike rentals")
else:
    print(f"Temperature has a negative relationship with demand (r={temp_corr:.3f})")

if humidity_corr < 0:
    print(f"Humidity has a negative relationship with demand (r={humidity_corr:.3f})")
    print("→ High humidity discourages bike rentals")

**What this does:**
- `.corr()` calculates Pearson correlation coefficients between all numerical variable pairs, measuring linear relationship strength
- Correlation values range from -1.0 (perfect negative) to +1.0 (perfect positive), with 0 indicating no linear relationship
- `.sort_values(ascending=False)` orders variables from the most positive to the most negative correlation with demand (rank by absolute value to compare strength regardless of sign)
- `.idxmax()` and `.idxmin()` identify which variables have the strongest positive and negative correlations
- Business interpretations translate statistical findings into actionable insights about weather impacts
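
To show what `.corr()` computes, here is a minimal sketch that calculates a Pearson coefficient by hand and compares it with pandas. The six paired values are hypothetical, and `rentals` is an illustrative name rather than a dataset column:

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations (temperature in °C, rentals per hour)
temp = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
rentals = pd.Series([30.0, 80.0, 120.0, 200.0, 210.0, 180.0])

# Pearson r = covariance / (std_x * std_y), all using the sample (n-1) divisor
cov_xy = ((temp - temp.mean()) * (rentals - rentals.mean())).sum() / (len(temp) - 1)
r_manual = cov_xy / (temp.std() * rentals.std())

# pandas computes the same quantity
r_pandas = temp.corr(rentals)
print(f"manual={r_manual:.4f}, pandas={r_pandas:.4f}")
```

Because the coefficient only measures linear association, a relationship that rises and then falls (as rentals do at the hottest value here) will score lower than its visual strength might suggest.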

**Business value:**
Your Capital Bikes client can identify which external factors move with demand. Temperature shows a moderate positive correlation (0.394): warmer hours tend to see more rentals. Humidity shows a somewhat weaker negative correlation (-0.317): humid conditions tend to coincide with fewer riders. Windspeed has minimal linear association (0.101). These weather correlations support demand forecasting and operational planning. The high registered/casual correlations (0.971/0.690) are a different matter: `count` is the sum of `registered` and `casual`, so these values reflect that registered users make up most total rentals rather than any causal relationship.
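
The part-of-the-whole effect can be illustrated with synthetic numbers. In this sketch, two independent random series (stand-ins, not the real `casual` and `registered` columns) are added together, and each still correlates with the total simply because it is a component of it:

```python
import numpy as np
import pandas as pd

# Synthetic, independent components (not the real casual/registered data)
rng = np.random.default_rng(0)
big_part = pd.Series(rng.normal(loc=0, scale=3, size=2000))
small_part = pd.Series(rng.normal(loc=0, scale=1, size=2000))
total = big_part + small_part

# The parts are unrelated to each other...
parts_corr = big_part.corr(small_part)

# ...yet each correlates with the total it helps make up
big_vs_total = big_part.corr(total)
small_vs_total = small_part.corr(total)

print(f"parts={parts_corr:.2f}, big~total={big_vs_total:.2f}, small~total={small_vs_total:.2f}")
```

The component with the larger spread dominates the total, so it shows the higher correlation. That is why a component's correlation with its own sum says little about what drives demand.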

---

### Challenge 5: Advanced Correlation Analysis
Explore correlations between weather variables and different user types (casual vs registered).

In [None]:
# Your code here
print("=== User Type Correlation Analysis ===")
user_weather_corr = df[[_____, _____, _____, _____, _____]].corr()
print(user_weather_corr.round(3))

# Compare how weather affects different user types
casual_temp_corr = df[_____].corr(df[_____])
registered_temp_corr = df[_____].corr(df[_____])

print(f"\n=== User Type Weather Sensitivity ===")
print(f"Casual users temperature correlation: {casual_temp_corr:.3f}")
print(f"Registered users temperature correlation: {registered_temp_corr:.3f}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Develop user segmentation insights through correlation analysis:
- Compare correlation strength differences: which user type shows stronger weather dependency?
- Analyze seasonal correlation patterns: `df[df['season'] == 2].corr(numeric_only=True)` for summer-specific relationships (`numeric_only=True` skips non-numeric columns such as `datetime`)
- Examine multiple weather factors: create correlation matrix for temp, humidity, windspeed by user type
- Consider business implications: weather-sensitive segments need different marketing strategies
- Focus on actionable insights: which weather conditions drive different user behaviors for targeted operations?
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# User type correlation analysis
print("=== User Type Correlation Analysis ===")
user_weather_corr = df[['temp', 'humidity', 'windspeed', 'casual', 'registered']].corr()
print(user_weather_corr.round(3))

# Compare how weather affects different user types
casual_temp_corr = df['temp'].corr(df['casual'])
registered_temp_corr = df['temp'].corr(df['registered'])

print(f"\n=== User Type Weather Sensitivity ===")
print(f"Casual users temperature correlation: {casual_temp_corr:.3f}")
print(f"Registered users temperature correlation: {registered_temp_corr:.3f}")
```
</details>

---

## Step 6: Time-Based Statistical Patterns

Let's extract time-based features and analyze hourly and daily demand patterns to identify operational planning opportunities:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Extract time-based features for analysis
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['datetime'].dt.month

# Hourly demand patterns
print("=== Hourly Demand Patterns ===")
hourly_stats = df.groupby('hour')['count'].agg(['mean', 'std', 'min', 'max']).round(1)
print(hourly_stats)

# Identify peak hours for operations planning
hourly_mean = df.groupby('hour')['count'].mean()  # Compute once and reuse below
peak_morning_hour = hourly_mean.iloc[6:10].idxmax()   # Window covers hours 6-9
peak_evening_hour = hourly_mean.iloc[16:20].idxmax()  # Window covers hours 16-19
lowest_hour = hourly_mean.idxmin()
highest_hour = hourly_mean.idxmax()

print(f"\n=== Operational Planning Insights ===")
# idxmax/idxmin return index labels (the hour), so look values up with .loc
print(f"Peak morning hour: {peak_morning_hour}:00 ({hourly_mean.loc[peak_morning_hour]:.1f} bikes/hour)")
print(f"Peak evening hour: {peak_evening_hour}:00 ({hourly_mean.loc[peak_evening_hour]:.1f} bikes/hour)")
print(f"Lowest demand hour: {lowest_hour}:00 ({hourly_mean.loc[lowest_hour]:.1f} bikes/hour)")
print(f"Highest demand hour: {highest_hour}:00 ({hourly_mean.loc[highest_hour]:.1f} bikes/hour)")

# Day of week analysis
print("\n=== Day of Week Analysis ===")
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day_name'] = df['day_of_week'].map(dict(zip(range(7), day_names)))

# Reindex so rows appear Monday-to-Sunday instead of alphabetically
daily_stats = df.groupby('day_name')['count'].agg(['mean', 'std']).round(1).reindex(day_names)
print(daily_stats)

# Weekend vs weekday patterns
weekend_avg = df[df['day_of_week'].isin([5, 6])]['count'].mean()
weekday_avg = df[df['day_of_week'].isin([0, 1, 2, 3, 4])]['count'].mean()

print(f"\n=== Weekly Pattern Summary ===")
print(f"Weekend average: {weekend_avg:.1f} bikes/hour")
print(f"Weekday average: {weekday_avg:.1f} bikes/hour")
print(f"Weekend vs weekday difference: {weekend_avg - weekday_avg:.1f} bikes/hour")

**What this does:**
- `.dt.hour`, `.dt.dayofweek`, and `.dt.month` extract time components from datetime objects for temporal analysis
- `.iloc[6:10].idxmax()` identifies the hour with the highest average demand within the morning rush window; because the slice end is exclusive, it covers hours 6 through 9 (6:00-9:59am)
- `.isin([5, 6])` filters for weekend days (Saturday=5, Sunday=6) to compare weekend vs weekday patterns
- `.map(dict(zip(...)))` creates a mapping dictionary that converts numeric day-of-week values (0-6) into readable day names ('Monday'-'Sunday') for professional business reporting. The `zip(range(7), day_names)` pairs each number with its corresponding day name, `dict()` converts these pairs into a lookup dictionary, and `.map()` applies this translation to transform the numeric column into human-readable labels
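
The lookup-dictionary pattern can be tried in isolation. This sketch uses two hypothetical timestamps and an illustrative `day_lookup` name to show each step of the translation:

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# zip pairs each number 0-6 with a name; dict turns the pairs into a lookup
day_lookup = dict(zip(range(7), day_names))

# Hypothetical timestamps: a Friday evening and a Sunday morning
stamps = pd.Series(pd.to_datetime(["2012-06-15 18:00", "2012-06-17 09:00"]))
day_numbers = stamps.dt.dayofweek        # Monday=0 ... Sunday=6
mapped = day_numbers.map(day_lookup)     # Numbers translated to names
is_weekend = day_numbers.isin([5, 6])    # True for Saturday and Sunday

print(mapped.tolist(), is_weekend.tolist())
```

pandas also offers `.dt.day_name()`, which returns the same labels directly. The dictionary approach is still handy when you need custom labels such as abbreviations.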

**Key insights from the data:**
The hourly pattern reveals dramatic commuter behavior with sharp peaks at 8am (362.8) and 5pm (468.8), yet the weekend/weekday difference is surprisingly small (-4.0 bikes/hour, only 2%). This suggests the system successfully serves both weekday commuters with concentrated rush-hour demand and weekend recreational riders with more distributed usage throughout the day - essentially maintaining similar total volume through different usage patterns.

**Business value:**
Your Capital Bikes client can use these temporal patterns to optimize bike redistribution schedules (moving bikes to high-demand stations before morning/evening peaks), adjust staffing levels based on hourly demand curves, and recognize that weekend operations require similar capacity but different distribution strategies compared to weekday commuter-focused positioning.

---

### Challenge 6: Monthly Seasonal Patterns
Analyze how demand varies across months and identify seasonal trends for annual planning.

In [None]:
# Your code here
print("=== Monthly Demand Analysis ===")
monthly_stats = df.groupby(_____)[_____].agg([_____, _____, _____]).round(1)
print(monthly_stats)

# Identify peak and low months
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly_avg = df.groupby('month')['count']._____()
peak_month = monthly_avg._____()
low_month = monthly_avg._____()

print(f"\n=== Annual Planning Insights ===")
print(f"Peak month: {month_names[peak_month-1]} ({monthly_avg.iloc[peak_month-1]:.1f} bikes/hour)")
print(f"Lowest month: {month_names[low_month-1]} ({monthly_avg.iloc[low_month-1]:.1f} bikes/hour)")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Build comprehensive annual planning intelligence from monthly patterns:
- Calculate seasonal capacity requirements: `(peak_month_demand / avg_demand) * 100` for scaling factors
- Analyze demand growth trends: compare year-over-year monthly patterns using `df['datetime'].dt.year`
- Examine weather correlation by month: how do temperature and humidity patterns align with demand cycles?
- Consider operational calendar: align maintenance schedules with low-demand months
- Calculate revenue implications: `monthly_demand * avg_revenue_per_ride` for financial planning
- Identify transition periods: months with highest variability often signal seasonal shifts requiring operational adjustments
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Monthly demand analysis
print("=== Monthly Demand Analysis ===")
monthly_stats = df.groupby('month')['count'].agg(['mean', 'std', 'count']).round(1)
print(monthly_stats)

# Identify peak and low months
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly_avg = df.groupby('month')['count'].mean()
peak_month = monthly_avg.idxmax()
low_month = monthly_avg.idxmin()

print(f"\n=== Annual Planning Insights ===")
print(f"Peak month: {month_names[peak_month-1]} ({monthly_avg.iloc[peak_month-1]:.1f} bikes/hour)")
print(f"Lowest month: {month_names[low_month-1]} ({monthly_avg.iloc[low_month-1]:.1f} bikes/hour)")
```
</details>
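
A note on the lookup in the solution: `idxmax()` returns an index label (the month number), while `.iloc[...]` expects a position. The `peak_month-1` arithmetic lines up only when all twelve months are present. The sketch below uses a hypothetical series covering March to June to show a label-based alternative with `.loc`:

```python
import pandas as pd

# Hypothetical monthly averages where only March-June are present
monthly_avg = pd.Series([90.0, 150.0, 210.0, 180.0], index=[3, 4, 5, 6], name="count")
monthly_avg.index.name = "month"

# idxmax returns the index label (the month number), not a position
peak_month = monthly_avg.idxmax()

# Label-based lookup stays correct even when months are missing
peak_value = monthly_avg.loc[peak_month]

print(f"Peak month number: {peak_month}, average: {peak_value:.1f} bikes/hour")
```

With this partial series, `monthly_avg.iloc[peak_month-1]` would point past the end of the data, so `.loc` is the safer habit when you filter to a subset of months.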

---

## Step 7: Comprehensive Statistical Summary and Business Insights

Let's create a professional statistical report that synthesizes all our analyses into actionable business intelligence:

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Create necessary derived features for the report
season_names = {1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'}
df['season_name'] = df['season'].map(season_names)

weather_labels = {
    1: 'Clear/Partly Cloudy',
    2: 'Mist/Cloudy',
    3: 'Light Snow/Rain',
    4: 'Heavy Rain/Snow'
}
df['weather_condition'] = df['weather'].map(weather_labels)

df['hour'] = df['datetime'].dt.hour

print("=" * 60)
print("CAPITAL BIKES WASHINGTON D.C. - STATISTICAL ANALYSIS REPORT")
print("=" * 60)

# Executive Summary Statistics
print("\n📊 EXECUTIVE SUMMARY")
print("-" * 40)
total_hours = len(df)
total_rentals = df['count'].sum()
avg_hourly_demand = df['count'].mean()
demand_variability = df['count'].std()

print(f"Analysis Period: {df['datetime'].min().strftime('%Y-%m-%d')} to {df['datetime'].max().strftime('%Y-%m-%d')}")
print(f"Total Data Points: {total_hours:,} hours")
print(f"Total Bike Rentals: {total_rentals:,} rentals")
print(f"Average Hourly Demand: {avg_hourly_demand:.1f} bikes/hour")
print(f"Demand Variability (σ): {demand_variability:.1f}")

# Key Performance Indicators
print(f"\n📈 KEY PERFORMANCE INDICATORS")
print("-" * 40)
print(f"Peak Hourly Demand: {df['count'].max()} bikes/hour")
print(f"95th Percentile Demand: {df['count'].quantile(0.95):.0f} bikes/hour")
print(f"Median Demand: {df['count'].median():.0f} bikes/hour")
print(f"Hours with Zero Demand: {(df['count'] == 0).sum()} hours ({((df['count'] == 0).sum()/len(df))*100:.1f}%)")

# Seasonal Business Intelligence
print("\n🌸 SEASONAL PATTERNS")
print("-" * 40)
seasonal_summary = df.groupby('season_name')['count'].agg(['mean', 'std']).round(1)
for season, stats in seasonal_summary.iterrows():
    print(f"{season}: {stats['mean']:.1f} ± {stats['std']:.1f} bikes/hour")

# Weather Impact Assessment
print("\n🌤️  WEATHER IMPACT ASSESSMENT")
print("-" * 40)
weather_impact = df.groupby('weather_condition')['count'].mean().round(1)
for condition, avg_demand in weather_impact.items():
    percentage_of_clear = (avg_demand / weather_impact['Clear/Partly Cloudy']) * 100
    print(f"{condition}: {avg_demand:.1f} bikes/hour ({percentage_of_clear:.0f}% of clear weather)")

# Operational Insights
print("\n⚙️  OPERATIONAL INSIGHTS")
print("-" * 40)
rush_morning = df[df['hour'].isin([7, 8, 9])]['count'].mean()
rush_evening = df[df['hour'].isin([17, 18, 19])]['count'].mean()
off_peak = df[df['hour'].isin([1, 2, 3, 4])]['count'].mean()

print(f"Morning Rush (7-9am): {rush_morning:.1f} bikes/hour")
print(f"Evening Rush (5-7pm): {rush_evening:.1f} bikes/hour")
print(f"Off-Peak (1-4am): {off_peak:.1f} bikes/hour")
print(f"Evening Rush Multiplier: {rush_evening/off_peak:.1f}x off-peak demand")

# User Behavior Analysis
print("\n👥 USER BEHAVIOR ANALYSIS")
print("-" * 40)
casual_percentage = (df['casual'].sum() / df['count'].sum()) * 100
registered_percentage = (df['registered'].sum() / df['count'].sum()) * 100
casual_weather_sensitivity = abs(df[df['weather']==1]['casual'].mean() - df[df['weather']==3]['casual'].mean())
registered_weather_sensitivity = abs(df[df['weather']==1]['registered'].mean() - df[df['weather']==3]['registered'].mean())

print(f"Casual Users: {casual_percentage:.1f}% of total demand")
print(f"Registered Users: {registered_percentage:.1f}% of total demand")
print(f"Casual Weather Sensitivity: {casual_weather_sensitivity:.1f} bikes/hour difference")
print(f"Registered Weather Sensitivity: {registered_weather_sensitivity:.1f} bikes/hour difference")

# Strategic Recommendations
print("\n💡 KEY STRATEGIC INSIGHTS")
print("-" * 40)
print("1. SEASONAL PLANNING:")
peak_season = seasonal_summary['mean'].idxmax()
low_season = seasonal_summary['mean'].idxmin()
seasonal_difference = seasonal_summary.loc[peak_season, 'mean'] - seasonal_summary.loc[low_season, 'mean']
print(f"   • {seasonal_difference:.0f} bikes/hour difference between {peak_season} and {low_season}")
print(f"   • Plan for {seasonal_difference/avg_hourly_demand*100:.0f}% seasonal capacity variation")

print("\n2. WEATHER CONTINGENCY:")
weather_loss = weather_impact['Clear/Partly Cloudy'] - weather_impact['Light Snow/Rain']
print(f"   • Poor weather reduces demand by {weather_loss:.0f} bikes/hour")
print(f"   • Represents {(weather_loss/weather_impact['Clear/Partly Cloudy'])*100:.0f}% demand loss in bad weather")

print("\n3. OPERATIONAL EFFICIENCY:")
peak_capacity_need = df['count'].quantile(0.95)
avg_capacity_need = df['count'].mean()
efficiency_ratio = peak_capacity_need / avg_capacity_need
print(f"   • Peak capacity needs are {efficiency_ratio:.1f}x average demand")
print(f"   • Consider dynamic pricing during peak hours ({df.groupby('hour')['count'].mean().idxmax()}:00)")

print("\n" + "=" * 60)
print("Report generated using pandas statistical analysis methods")
print("=" * 60)

**What this does:**
- Synthesizes multiple statistical analyses into a cohesive executive summary with clearly labeled sections
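- Uses `.quantile(0.95)` to find the demand level that 95% of hours stay at or below, a common capacity planning threshold
- Calculates the absolute difference in average hourly rentals between clear weather (code 1) and light snow/rain (code 3) for casual and registered users, quantifying how each segment responds to weather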
- Employs `.iterrows()` for formatted iteration through grouped data for professional reporting
- Combines descriptive statistics with business interpretations to translate technical findings into strategic recommendations
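
The weather sensitivity figures in the report are absolute gaps in bikes/hour, which tend to favor whichever segment is larger. A relative (percentage) drop can make the two segments easier to compare. The sketch below is a minimal illustration on made-up numbers; the `demo` frame and the `relative_weather_drop` helper are hypothetical names introduced here, not part of the lecture code.

```python
import pandas as pd

# Tiny synthetic stand-in for the bike-sharing data (values are made up)
demo = pd.DataFrame({
    "weather":    [1, 1, 1, 3, 3, 3],
    "casual":     [40, 50, 60, 10, 15, 20],
    "registered": [150, 160, 170, 100, 110, 120],
})

def relative_weather_drop(frame, column, clear_code=1, poor_code=3):
    """Percent drop in mean demand from clear weather to poor weather."""
    clear_mean = frame.loc[frame["weather"] == clear_code, column].mean()
    poor_mean = frame.loc[frame["weather"] == poor_code, column].mean()
    return (clear_mean - poor_mean) / clear_mean * 100

casual_drop = relative_weather_drop(demo, "casual")
registered_drop = relative_weather_drop(demo, "registered")
print(f"Casual: {casual_drop:.1f}% drop | Registered: {registered_drop:.1f}% drop")
```

In this made-up sample the registered gap is larger in absolute terms, while the casual drop is larger in relative terms. Whether the real dataset shows the same reversal is something to check by passing `df` to the helper.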

**Business value:**
Your Capital Bikes client receives a complete statistical intelligence report that summarizes all key findings, quantifies operational requirements (95th-percentile demand runs roughly 2-3x the average), identifies revenue optimization opportunities (dynamic pricing during peak hours), and provides evidence-based recommendations for seasonal planning, weather contingency, and user segment targeting.
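
To see what a 95th-percentile capacity threshold means in practice, here is a small sketch on synthetic, right-skewed demand values (made up for illustration, not taken from the client data). It computes the threshold, its ratio to the mean, and the share of hours that would still exceed it.

```python
import pandas as pd

# Synthetic right-skewed hourly demand (made-up values, not the real dataset)
hourly_demand = pd.Series([i ** 2 for i in range(1, 101)])

mean_demand = hourly_demand.mean()
p95 = hourly_demand.quantile(0.95)        # demand level that 95% of hours stay at or below
peak_to_average = p95 / mean_demand       # same ratio as efficiency_ratio in the report
share_above_p95 = (hourly_demand > p95).mean() * 100  # % of hours exceeding the threshold

print(f"Mean demand: {mean_demand:.1f}")
print(f"95th percentile: {p95:.1f}")
print(f"Peak-to-average ratio: {peak_to_average:.1f}x")
print(f"Hours above threshold: {share_above_p95:.1f}%")
```

Sizing capacity to this threshold is a trade-off: it avoids paying for the rare extreme hour, but it also accepts that a small share of hours will run short.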

---

### Challenge 8: Create Your Own Statistical Insight
Using the statistical techniques you've learned, investigate one additional pattern or relationship in the data that could provide business value.

In [None]:
# Your code here
print("=== CUSTOM STATISTICAL ANALYSIS ===")
# Example: Analyze the relationship between temperature and user types
# Or investigate windspeed impact on different weather conditions
# Or examine holiday patterns vs regular days

# Choose your own analysis focus:
# Option 1: Holiday impact analysis
holiday_analysis = df.groupby(_____)[['count', 'casual', 'registered']].mean().round(1)
print("Holiday Impact Analysis:")
print(holiday_analysis)

# Option 2: Temperature-humidity interaction
# Create categories and analyze their combined effect
# Your statistical investigation here...

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>

Design an original statistical analysis that delivers unique business value:
- Choose a compelling business question: "How do holidays affect different user segments?" or "What weather combinations optimize revenue?"
- Combine multiple techniques: use groupby, correlations, and descriptive statistics together for comprehensive insights
- Create interaction analyses: examine how multiple variables work together (e.g., temperature + humidity effects)
- Calculate business metrics: convert statistical findings into revenue impact, capacity requirements, or operational costs
- Provide actionable recommendations: ensure your analysis leads to specific business decisions
- Validate findings: check that your results align with domain expertise and logical expectations
</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Custom statistical analysis example: Holiday impact
print("=== CUSTOM STATISTICAL ANALYSIS ===")

# Option 1: Holiday impact analysis
holiday_analysis = df.groupby('holiday')[['count', 'casual', 'registered']].mean().round(1)
print("Holiday Impact Analysis:")
print(holiday_analysis)

# Option 2: Temperature-humidity interaction analysis
print("\n=== Temperature-Humidity Interaction ===")
# Create temperature categories
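df['temp_category'] = pd.cut(df['temp'],
                             bins=[0, 15, 25, 50],
                             labels=['Cold (≤15°C)', 'Comfortable (15-25°C)', 'Hot (>25°C)'],
                             include_lowest=True)  # keep readings equal to the lowest edge (0)

# Create humidity categories (overwrites humidity_category if it already exists)
# include_lowest=True keeps 0% humidity readings in the 'Low' bin instead of NaN
df['humidity_category'] = pd.cut(df['humidity'],
                                bins=[0, 40, 70, 100],
                                labels=['Low (0-40%)', 'Medium (40-70%)', 'High (70-100%)'],
                                include_lowest=True)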

# Analyze combined effects
interaction_analysis = df.groupby(['temp_category', 'humidity_category'], observed=True)['count'].mean().round(1)
print(interaction_analysis)
```
</details>
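
A two-level `groupby` result prints as a long stacked list, which can be hard to scan. One option is to `.unstack()` it into a grid and to report the number of observations behind each cell, since averages from sparse cells are less trustworthy. The sketch below assumes a small synthetic sample (made-up values) and shortened category labels; it is an illustration, not the lecture's reference solution.

```python
import pandas as pd

# Small synthetic sample (made-up values) standing in for the real dataset
demo = pd.DataFrame({
    "temp":     [5, 10, 18, 20, 22, 30, 32, 12],
    "humidity": [30, 80, 35, 50, 85, 45, 90, 55],
    "count":    [40, 20, 150, 180, 90, 220, 110, 60],
})

demo["temp_category"] = pd.cut(demo["temp"], bins=[0, 15, 25, 50],
                               labels=["Cold", "Comfortable", "Hot"],
                               include_lowest=True)
demo["humidity_category"] = pd.cut(demo["humidity"], bins=[0, 40, 70, 100],
                                   labels=["Low", "Medium", "High"],
                                   include_lowest=True)

grouped = demo.groupby(["temp_category", "humidity_category"], observed=True)["count"]

# Rows = temperature band, columns = humidity band
mean_table = grouped.mean().unstack("humidity_category")
# Sample size per cell; combinations with no observations become 0
size_table = grouped.size().unstack("humidity_category").fillna(0).astype(int)

print("Average demand (bikes/hour):")
print(mean_table.round(1))
print("\nObservations per cell:")
print(size_table)
```

Cells that show `NaN` in the average table and `0` in the size table are combinations that never occurred in the sample, so they say nothing about demand under those conditions.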

---

## Summary: Professional Statistical Analysis and Pattern Discovery Techniques

**What We've Accomplished**:
- Implemented descriptive statistics analysis using `.describe()` to calculate central tendency and variability measures
- Applied `.groupby()` analysis for seasonal, temporal, and weather pattern comparisons
- Executed correlation analysis with `.corr()` to quantify relationships between weather variables and bike demand
- Created categorical analysis frameworks using `pd.cut()` to analyze temperature and humidity thresholds
- Built professional statistical reports with executive summaries and operational insights

**Key Technical Skills Mastered**:
- Central tendency measurement with `.mean()` and `.median()` for demand characterization
- Variability measurement with `.std()`, `.min()`, and `.max()` for capacity planning
- Multi-dimensional aggregation using `.groupby()` with `.agg()` for seasonal and temporal analysis
- Correlation matrix creation and interpretation with `.corr()` for relationship strength assessment
- Time-based feature extraction using `.dt.hour`, `.dt.dayofweek`, and `.dt.month` for temporal patterns
- Categorical variable creation with `pd.cut()` for threshold analysis
- Boolean filtering with conditional statements for category-specific analysis
- Statistical ranking with `.sort_values()`, `.idxmax()`, and `.idxmin()` for peak identification
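
As a compact recap, the sketch below chains several of these skills on a synthetic two-day sample (made-up values; the `demo` frame is a hypothetical stand-in for the client dataset): time-based feature extraction, grouped aggregation, peak identification, and a correlation check.

```python
import pandas as pd

# Synthetic two-day sample (made-up values), four readings per day
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 02:00", "2011-01-01 08:00", "2011-01-01 14:00", "2011-01-01 18:00",
        "2011-01-02 02:00", "2011-01-02 08:00", "2011-01-02 14:00", "2011-01-02 18:00",
    ]),
    "temp":  [4, 6, 12, 10, 5, 8, 15, 12],
    "count": [5, 120, 90, 200, 7, 140, 110, 230],
})

# Time-based feature extraction
demo["hour"] = demo["datetime"].dt.hour

# Grouped aggregation: average and spread of demand per hour of day
hourly_profile = demo.groupby("hour")["count"].agg(["mean", "std"])

# Peak identification
peak_hour = hourly_profile["mean"].idxmax()

# Relationship strength between temperature and demand (Pearson correlation)
temp_corr = demo["temp"].corr(demo["count"])

print(hourly_profile.round(1))
print(f"Peak hour: {peak_hour}:00")
print(f"Temperature-demand correlation: {temp_corr:.2f}")
```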

**Next Steps**: We'll move on to professional data visualization with matplotlib, turning your statistical discoveries into clear charts and stakeholder-ready presentations.

Your bike-sharing client now has statistical evidence of actionable demand patterns, operational optimization opportunities, and user behavior differences. That quantitative rigor is what separates professional transportation consultants from basic data analysts!