# Lecture 7: Programming Example - Data Storytelling & Visualization

## Introduction: Mastering Data Storytelling Through Professional Visualization

Welcome back, junior data consultant! Your Capital Bikes Washington D.C. client now possesses cleaned data, engineered features, and statistical insights - but numbers alone don't drive decisions. Today, you'll transform from statistical analyst to visual storyteller, mastering the professional visualization techniques that convert complex transportation patterns into compelling, stakeholder-ready analytical narratives.

Think of data visualization as the consulting translator between mathematical analysis and executive action. While statistical tests quantify relationships, visualizations make those relationships instantly comprehensible to non-technical stakeholders. You'll master the fundamental chart types that professional consultants deploy daily: line plots for temporal patterns, bar charts for categorical comparisons, box plots for distribution analysis, histograms for frequency understanding, and scatter plots for relationship exploration.

Your client is counting on you to transform your statistical discoveries into visual evidence that justifies million-dollar operational decisions. Every visualization you create serves a consulting purpose: answering specific business questions with clarity and precision that enables confident strategic action.

> **🚀 Interactive Learning Alert**
>
> This is a hands-on visualization tutorial with simple examples and gradual building. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your visualization skills
> - **Think like a consultant** - clarity is more important than complexity

---

## Step 1: Creating Line Plots

Let's start with the simplest and most useful plot for time-based data: the line plot. We'll show how bike demand changes throughout a typical day.

In [None]:
# Import the libraries we need
import pandas as pd
import matplotlib.pyplot as plt

# Load the Washington D.C. bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Extract the hour from datetime
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Calculate average demand for each hour (0-23)
hourly_demand = df.groupby('hour')['count'].mean()

# Create a basic line plot
plt.plot(hourly_demand.index, hourly_demand.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Daily Bike Demand Pattern')
plt.show()

print(f"Peak hour: {hourly_demand.idxmax()}:00 with {hourly_demand.max():.0f} bikes")

**What this does:**
- **Load and prepare**: Imports data and extracts hour from the datetime
- **Calculate averages**: Groups all data by hour and calculates the mean demand
- **Create line plot**: Uses `plt.plot()` to connect the 24 hourly points with a line
- **Add labels**: Makes the plot understandable with axis labels and a title

**Why a line plot?** Time flows continuously through the day, so connecting points with a line shows that natural flow. You can see the morning and evening rush hours as peaks.
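
To make the grouping step concrete, here is a minimal sketch on a tiny made-up table. The six rows and the name `toy` are invented for illustration and are not taken from the client's data. It shows what `groupby('hour')['count'].mean()` returns and how `idxmax()` picks out the peak hour.

```python
import pandas as pd

# Toy data: two days, three hours each (values are invented for illustration)
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-03 07:00", "2011-01-03 08:00", "2011-01-03 09:00",
        "2011-01-04 07:00", "2011-01-04 08:00", "2011-01-04 09:00",
    ]),
    "count": [100, 300, 150, 120, 340, 170],
})
toy["hour"] = toy["datetime"].dt.hour

# One value per hour: the mean of all rows that share that hour
toy_hourly = toy.groupby("hour")["count"].mean()
print(toy_hourly)

# idxmax() returns the index label (here, the hour) of the largest mean
print(f"Peak hour: {toy_hourly.idxmax()}:00 with {toy_hourly.max():.0f} bikes")
```

The result is a Series indexed by hour, which is why `hourly_demand.index` and `hourly_demand.values` can be passed straight to `plt.plot()` as x and y.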

### Multi-Line Comparison: The Power of Direct Comparison

Now let's see something powerful: putting weekday and weekend patterns on the same plot. This lets you compare them directly without switching between charts.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Calculate averages for weekdays and weekends separately
# (workingday is used as a stand-in here; workingday == 0 may also include weekday holidays)
weekday_hourly = df[df['workingday'] == 1].groupby('hour')['count'].mean()
weekend_hourly = df[df['workingday'] == 0].groupby('hour')['count'].mean()

# Plot both lines on the same chart
plt.plot(weekday_hourly.index, weekday_hourly.values, label='Weekday', marker='o')
plt.plot(weekend_hourly.index, weekend_hourly.values, label='Weekend', marker='s')
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Weekday vs Weekend Demand Patterns')
plt.legend()  # Shows which line is which
plt.show()

print(f"Weekday peak: {weekday_hourly.max():.0f} bikes at hour {weekday_hourly.idxmax()}")
print(f"Weekend peak: {weekend_hourly.max():.0f} bikes at hour {weekend_hourly.idxmax()}")

**What this shows:**
- **Two lines, one chart**: You can instantly see how the patterns differ
- **Different markers**: `'o'` (circle) for weekdays and `'s'` (square) for weekends help distinguish them
- **Legend**: `plt.legend()` creates a box showing which line is which
- **label parameter**: Each `plt.plot()` call gets a `label='...'` that appears in the legend

**Why this is powerful:** Your client can immediately see that weekdays have two peaks (morning and evening rush hours) while weekends have one midday peak. This single chart tells a complete story about how usage patterns change.
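
If you would rather not filter the dataframe twice, one alternative is to group by both keys at once and reshape the result. The sketch below is an illustration on invented numbers. The names `toy` and `by_type` are hypothetical, and it assumes `workingday` is coded 1 for working days and 0 otherwise, as in the cell above.

```python
import pandas as pd

# Toy data (invented values): two hours observed on working and non-working days
toy = pd.DataFrame({
    "hour":       [8, 8, 13, 13, 8, 8, 13, 13],
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "count":      [300, 340, 150, 170, 60, 80, 250, 270],
})

# Group by both keys, then move 'workingday' from the index into columns
by_type = toy.groupby(["hour", "workingday"])["count"].mean().unstack("workingday")
by_type = by_type.rename(columns={0: "Non-working day", 1: "Working day"})
print(by_type)

# by_type.plot(marker="o") would draw one line per column, with a legend
```

Each column of `by_type` is one line on the chart, so adding a third category later would not require a third filter.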

---

### Challenge 1: Create a Line Plot for Weekend Patterns

Your client asks: "How does demand look on weekends?" Create a line plot showing average hourly demand only for weekend days (Saturday and Sunday).

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Your code here - create weekend line plot
df['dayofweek'] = df['datetime'].dt.dayofweek  # 0=Monday, 6=Sunday
weekend_data = df[df['dayofweek'].isin([_____, _____])]  # Fill in: 5, 6 for Saturday, Sunday
weekend_hourly = weekend_data.groupby('_____')['_____'].mean()  # Fill in: 'hour', 'count'

plt.plot(weekend_hourly.index, weekend_hourly.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Weekend Bike Demand Pattern')
plt.show()

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Break this into three steps. First, extract the day of week using `.dt.dayofweek`, which counts from 0 (Monday), not 1, so Saturday=5 and Sunday=6. Second, filter the dataframe to keep only weekend days using `.isin([5, 6])`. Third, group by hour and calculate the mean just like in Step 1. Note that this calendar-based filter is not necessarily identical to `workingday == 0` from the previous example, since a non-working day can also be a weekday holiday. Your client will notice that weekend patterns look different from weekday patterns - probably a single midday peak instead of two rush hour peaks.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Create weekend line plot
df['dayofweek'] = df['datetime'].dt.dayofweek
weekend_data = df[df['dayofweek'].isin([5, 6])]
weekend_hourly = weekend_data.groupby('hour')['count'].mean()

plt.plot(weekend_hourly.index, weekend_hourly.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Weekend Bike Demand Pattern')
plt.show()
```

</details>
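
A quick way to convince yourself of the `dayofweek` numbering is to try it on dates whose weekday you already know. This small check uses three hand-picked dates (1 January 2011 was a Saturday); the variable names are only for illustration.

```python
import pandas as pd

# Three consecutive dates; 1 January 2011 was a Saturday
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]))

day_numbers = dates.dt.dayofweek      # Monday=0 ... Sunday=6
day_names = dates.dt.day_name()       # human-readable cross-check
is_weekend = day_numbers.isin([5, 6])

print(pd.DataFrame({
    "date": dates,
    "dayofweek": day_numbers,
    "name": day_names,
    "is_weekend": is_weekend,
}))
```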

---

## Step 2: Creating Bar Plots

Now let's compare categories using a bar plot. We'll compare average demand between weekdays and weekends - perfect for seeing which type of day is busier.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate the average demand for weekdays vs weekends
weekday_avg = df[df['workingday'] == 1]['count'].mean()
weekend_avg = df[df['workingday'] == 0]['count'].mean()

# Create lists for the bar plot
categories = ['Weekday', 'Weekend']
values = [weekday_avg, weekend_avg]

# Create a bar plot
plt.bar(categories, values)
plt.xlabel('Day Type')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Weekday vs Weekend Demand')
plt.show()

print(f"Weekday average: {weekday_avg:.0f} bikes per hour")
print(f"Weekend average: {weekend_avg:.0f} bikes per hour")

**What this does:**
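- **Calculate category averages**: Finds mean demand for working days (`workingday == 1`) and non-working days (`workingday == 0`), used here as a proxy for weekdays and weekends; if the dataset flags holidays as non-working, they are counted in the 'Weekend' bar too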
- **Prepare data**: Creates simple lists of category names and their values
- **Create bar plot**: Uses `plt.bar()` to draw bars where height represents the average
- **Add labels**: Makes the comparison clear with axis labels

**Why a bar plot?** When comparing distinct categories (weekday vs weekend), bars make it easy to see which is higher. The height difference shows the answer instantly.
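
When two bars are close in height, printing the value on each bar saves your client from squinting at the axis. The following is a sketch only: the two averages are invented, and `bar_label` requires matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt

# Illustrative averages only (not computed from the real dataset)
categories = ["Weekday", "Weekend"]
values = [200.0, 180.0]

fig, ax = plt.subplots()
bars = ax.bar(categories, values)

# bar_label (matplotlib 3.4+) writes each bar's height just above it
labels = ax.bar_label(bars, fmt="%.0f")

ax.set_xlabel("Day Type")
ax.set_ylabel("Average Bike Rentals per Hour")
ax.set_title("Weekday vs Weekend Demand")
ax.set_ylim(0, max(values) * 1.15)  # keep the axis at zero and leave headroom for labels
# In a notebook the figure renders at the end of the cell; in a script, call plt.show()
```

This version uses the `fig, ax = plt.subplots()` style instead of the `plt.` shortcuts; both draw the same chart, but the `ax` object gives access to helpers such as `bar_label`.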

---

### Challenge 2: Create a Bar Plot Comparing All Four Seasons

Your client wants to see: "Which season has the highest bike demand?" Create a bar plot comparing average demand across all four seasons.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create seasonal comparison bar plot
seasonal_avg = df.groupby('_____')['_____'].mean()  # Fill in: 'season', 'count'

categories = ['Spring', 'Summer', 'Fall', 'Winter']
values = seasonal_avg.values

plt.bar(_____, _____)  # Fill in: categories, values
plt.xlabel('Season')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Seasonal Bike Demand Comparison')
plt.show()

# Print the results
for i, season in enumerate(categories):
    print(f"{season}: {values[i]:.0f} bikes per hour")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


This is similar to the weekday/weekend comparison, but with four categories instead of two. Use `.groupby('season')` to calculate the average for each season (the dataset has season codes 1-4). The season codes map to: 1=Spring, 2=Summer, 3=Fall, 4=Winter. Remember to use `.mean()` after grouping to get averages. Watch out: the season values in the dataframe are numbers (1,2,3,4), but you want readable labels on your chart. Because `groupby` returns its results sorted by code (1, 2, 3, 4), the `categories` list must be written in that same order for each label to sit under the right bar. Your client will likely see that summer has the highest demand (warm weather = more biking) and winter has the lowest (cold = fewer riders).

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create seasonal comparison bar plot
seasonal_avg = df.groupby('season')['count'].mean()

categories = ['Spring', 'Summer', 'Fall', 'Winter']
values = seasonal_avg.values

plt.bar(categories, values)
plt.xlabel('Season')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Seasonal Bike Demand Comparison')
plt.show()

# Print the results
for i, season in enumerate(categories):
    print(f"{season}: {values[i]:.0f} bikes per hour")
```

</details>
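
The solution relies on the hand-written `categories` list being in the same order as the grouped season codes. A slightly more defensive variation is to attach the names to the codes with a dictionary, so a label can never end up under the wrong bar. This is a sketch on invented numbers; `toy` and `season_names` are hypothetical names, and the code-to-name mapping is the one given in the tip.

```python
import pandas as pd

# Toy data (invented values) with season codes deliberately out of order
toy = pd.DataFrame({
    "season": [3, 1, 4, 2, 3, 1, 4, 2],
    "count":  [240, 110, 190, 210, 260, 130, 210, 230],
})

season_names = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}

seasonal_avg = toy.groupby("season")["count"].mean()   # index comes back sorted: 1, 2, 3, 4
seasonal_avg.index = seasonal_avg.index.map(season_names)
print(seasonal_avg)

# seasonal_avg.index and seasonal_avg.values can go straight into plt.bar()
```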

---

## Step 3: Creating Box Plots (Understanding Distributions with Confidence)

Before we dive into histograms, let's learn about box plots - a compact way to show how data is distributed across different categories. Box plots show you the "middle 50%" of your data and help you spot outliers.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create box plot comparing weekday vs weekend demand
weekday_data = df[df['workingday'] == 1]['count']
weekend_data = df[df['workingday'] == 0]['count']

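plt.boxplot([weekday_data, weekend_data])
# Boxes are drawn at x = 1 and x = 2; labeling the ticks directly avoids
# the `labels` argument, which is deprecated in matplotlib 3.9+
plt.xticks([1, 2], ['Weekday', 'Weekend'])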
plt.ylabel('Bike Rentals per Hour')
plt.title('Demand Distribution: Weekday vs Weekend')
plt.show()

# Print the quartiles
print("Weekday demand:")
print(f"  25th percentile: {weekday_data.quantile(0.25):.0f} bikes")
print(f"  50th percentile (median): {weekday_data.quantile(0.50):.0f} bikes")
print(f"  75th percentile: {weekday_data.quantile(0.75):.0f} bikes")

**What this shows:**
- **The box**: Shows the middle 50% of data (from the 25th to the 75th percentile), also called the interquartile range (IQR)
- **The line inside the box**: That's the median (50th percentile) - the middle value
- **The whiskers**: By matplotlib's default, each whisker extends to the most extreme data point within 1.5 × IQR of the box edge
- **The dots**: Points beyond the whiskers are drawn individually as potential outliers - unusually high or low values

**Why use box plots?** They give you a complete picture of your data's spread in a compact format. You can quickly see if weekdays and weekends have similar ranges or if one is more variable than the other.

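**Planning with box plots:** The box shows where the middle half of your observations fall. If 50% of hours have between 40 and 280 rentals (the box), that range describes a typical hour and is a sensible baseline for operations. The whiskers extend to the lowest and highest values that are not flagged as outliers, giving you the usual operational range. Keep in mind that this describes the observed data; it is not a statistical confidence interval, and it says nothing about how precisely an average has been estimated.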
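
To see where the box, whiskers, and dots come from, you can reproduce the arithmetic by hand. The sketch below uses ten invented values and the 1.5 × IQR rule that matplotlib applies by default (`whis=1.5`); the variable names are only for illustration.

```python
import numpy as np

# Small invented sample with one extreme value
demand = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 500])

q1, median, q3 = np.percentile(demand, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = demand[(demand >= lower_fence) & (demand <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = demand[(demand < lower_fence) | (demand > upper_fence)]

print(f"Box: {q1} to {q3} (median {median}), IQR = {iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

Notice that the whiskers end at actual data points, not at the fences themselves, which is why the two whiskers of a box are often different lengths.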

Now let's see seasonal patterns with box plots:

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create box plot for all four seasons
season_data = [df[df['season'] == i]['count'] for i in [1, 2, 3, 4]]
season_labels = ['Spring', 'Summer', 'Fall', 'Winter']

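plt.boxplot(season_data)
# Boxes are drawn at x = 1..4; labeling the ticks directly avoids
# the `labels` argument, which is deprecated in matplotlib 3.9+
plt.xticks([1, 2, 3, 4], season_labels)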
plt.ylabel('Bike Rentals per Hour')
plt.title('Demand Distribution by Season')
plt.show()

# Print insights
for i, season in enumerate(season_labels, 1):
    season_df = df[df['season'] == i]['count']
    print(f"{season}:")
    print(f"  Median: {season_df.median():.0f} bikes")
    print(f"  Middle 50% range: {season_df.quantile(0.25):.0f}-{season_df.quantile(0.75):.0f} bikes")

**What this reveals:**
- **Seasonal variation**: You can see at a glance which seasons have higher typical demand (higher boxes)
- **Variability**: Taller boxes mean more variable demand - harder to predict
- **Outliers**: More dots mean more unusual demand spikes in that season
- **Typical range**: The box marks the interquartile range, where the middle 50% of observations fall. It is a descriptive range rather than a statistical confidence interval, but it still gives your client a concrete band to plan capacity around

**Business insight:** If summer's box is much higher than winter's, your client knows they need seasonal staffing changes. If a season has many outliers (dots), they need surge capacity planning for that season.
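
A note on the word "confidence": the box in a box plot is the interquartile range, which describes where individual hours fall. A confidence interval, by contrast, describes how precisely a summary such as the mean has been estimated. The sketch below contrasts the two on simulated data. The gamma-distributed sample and the normal-approximation interval are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Simulated hourly demand (invented for illustration, not the real dataset)
demand = rng.gamma(shape=2.0, scale=90.0, size=1000)

# Interquartile range: where the middle 50% of individual hours fall
q1, q3 = np.percentile(demand, [25, 75])

# Approximate 95% confidence interval for the *mean* (normal approximation)
mean = demand.mean()
std_error = demand.std(ddof=1) / np.sqrt(len(demand))
ci_low, ci_high = mean - 1.96 * std_error, mean + 1.96 * std_error

print(f"IQR (box):       {q1:.0f} to {q3:.0f} bikes")
print(f"95% CI for mean: {ci_low:.0f} to {ci_high:.0f} bikes")
```

With a thousand observations the interval for the mean is far narrower than the box, because it answers a different question: "how well do we know the average?" rather than "what does a typical hour look like?".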

---

## Step 4: Creating Histograms (Understanding Bins!)

Histograms show how data is distributed - how often different values occur. The key concept here is **bins**: the ranges that group your data together.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a histogram with 20 bins
plt.hist(df['count'], bins=20)
plt.xlabel('Bike Rentals per Hour')
plt.ylabel('Number of Hours')
plt.title('Distribution of Hourly Bike Demand (20 bins)')
plt.show()

print("Most common demand range: look for the tallest bar")
print(f"Total hours analyzed: {len(df)}")

**What this does:**
- **Create histogram**: Uses `plt.hist()` to divide the data into bins and count how many hours fall in each bin
- **bins=20**: Divides the entire range of rentals (from min to max) into 20 equal-width bins
- **Bar height**: Shows how many hours had demand in that bin's range

**What are bins?** Think of bins like buckets. If demand ranges from 0-1000 bikes and you use 20 bins, each bin is 50 bikes wide (0-50, 50-100, 100-150, etc.). The histogram shows how many hours fall into each bucket.
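
You can check the bucket arithmetic directly with NumPy, which is what `plt.hist()` uses to do the counting. This is a sketch on invented data; `range=(0, 1000)` is passed explicitly so the edges match the round numbers above, whereas by default the bins span the observed minimum to maximum.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
demand = rng.integers(0, 1001, size=500)   # invented values between 0 and 1000

# 20 equal-width bins over the range 0-1000
counts, edges = np.histogram(demand, bins=20, range=(0, 1000))

print(f"Number of bins:  {len(counts)}")
print(f"First few edges: {edges[:4]}")
print(f"Bin width:       {edges[1] - edges[0]}")
print(f"Hours counted:   {counts.sum()}")
```

Twenty bins need twenty-one edges, and every observation lands in exactly one bin, so the counts add up to the number of rows.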

Let's see how changing bins affects the visualization:

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Compare different numbers of bins
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 10 bins - broader groups
axes[0].hist(df['count'], bins=10)
axes[0].set_title('10 Bins (Broad View)')
axes[0].set_xlabel('Bike Rentals')
axes[0].set_ylabel('Frequency')

# 20 bins - medium detail
axes[1].hist(df['count'], bins=20)
axes[1].set_title('20 Bins (Medium Detail)')
axes[1].set_xlabel('Bike Rentals')
axes[1].set_ylabel('Frequency')

# 50 bins - fine detail
axes[2].hist(df['count'], bins=50)
axes[2].set_title('50 Bins (Fine Detail)')
axes[2].set_xlabel('Bike Rentals')
axes[2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

**What this shows:**
- **Fewer bins (10)**: Broad overview, each bucket is wider, smoother pattern
- **Medium bins (20)**: Good balance of detail and clarity
- **More bins (50)**: Fine detail, each bucket is narrow, more jagged pattern

**Choosing bins:** Start with 20-30 bins. Too few bins and you lose important detail. Too many bins and the pattern becomes hard to see.
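
If you want a data-driven starting point instead of a fixed number, NumPy can suggest bin edges using standard rules, and `plt.hist()` accepts the same rule names (for example `bins='auto'`). The sketch below compares a few rules on an invented, right-skewed sample; treat the suggestions as a first guess to refine by eye, not as the correct answer.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
demand = rng.gamma(shape=2.0, scale=90.0, size=2000)   # invented, right-skewed sample

# Ask NumPy how many bins each rule would choose for this sample
bin_counts = {}
for rule in ["sturges", "fd", "auto"]:
    edges = np.histogram_bin_edges(demand, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```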

---

### Challenge 3: Create a Histogram of Temperature Distribution

Your client asks: "What temperatures do we experience in D.C.?" Create a histogram showing the distribution of temperatures, then experiment with different numbers of bins.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create temperature histogram
plt.hist(df['_____'], bins=_____)  # Fill in: 'temp', and choose number of bins (try 25)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Number of Hours')
plt.title('Distribution of D.C. Temperatures')
plt.show()

# Print some statistics
print(f"Average temperature: {df['temp'].mean():.1f}°C")
print(f"Coldest hour: {df['temp'].min():.1f}°C")
print(f"Warmest hour: {df['temp'].max():.1f}°C")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use `plt.hist()` with the 'temp' column from the dataframe. Start with bins=25 to get good detail. Temperature in D.C. ranges roughly from -5°C to 40°C, so 25 bins means each bucket represents about 2 degrees. Try changing bins to 10, then 50 to see how it affects the visualization. You'll probably see that most hours cluster around moderate temperatures (15-25°C) with fewer very cold or very hot hours. This distribution helps your client understand what weather conditions to plan for.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create temperature histogram
plt.hist(df['temp'], bins=25)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Number of Hours')
plt.title('Distribution of D.C. Temperatures')
plt.show()

# Print some statistics
print(f"Average temperature: {df['temp'].mean():.1f}°C")
print(f"Coldest hour: {df['temp'].min():.1f}°C")
print(f"Warmest hour: {df['temp'].max():.1f}°C")
```

</details>

---

## Step 5: Creating Scatter Plots (with Trendlines!)

Scatter plots show the relationship between two variables. Let's see how temperature relates to bike demand, and add a trendline to see the overall pattern.

First, let's create a basic scatter plot:

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a scatter plot
plt.scatter(df['temp'], df['count'])
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Bike Rentals per Hour')
plt.title('Temperature vs Bike Demand')
plt.show()

**What this does:**
- **Create scatter plot**: Each point represents one hour - its x-position is temperature, y-position is demand
- **Shows relationship**: You can see if higher temperatures generally have higher demand

**Why a scatter plot?** When you have two continuous measurements (temperature and demand), scatter plots let you see if they're related. Do points go up together? That's a positive relationship.

Now let's add a trendline to see the pattern more clearly:

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a scatter plot
plt.scatter(df['temp'], df['count'], alpha=0.5)

# Add a trendline
# Step 1: Calculate the line that best fits the data
coefficients = np.polyfit(df['temp'], df['count'], 1)  # 1 means a straight line
trendline = np.poly1d(coefficients)  # Creates a function for the line

# Step 2: Create points along the line to plot it
temp_range = np.linspace(df['temp'].min(), df['temp'].max(), 100)
demand_trend = trendline(temp_range)

# Step 3: Plot the trendline
plt.plot(temp_range, demand_trend, color='red', linewidth=2, label='Trendline')

plt.xlabel('Temperature (Celsius)')
plt.ylabel('Bike Rentals per Hour')
plt.title('Temperature vs Bike Demand (with Trendline)')
plt.legend()
plt.show()

# Calculate and show correlation
correlation = df['temp'].corr(df['count'])
print(f"Correlation: {correlation:.3f}")
print(f"Interpretation: {'Positive' if correlation > 0 else 'Negative'} relationship")

**What the trendline shows:**
- **np.polyfit()**: Finds the best-fitting straight line through the scattered points (a least-squares fit)
- **Red line**: Shows the general trend - as temperature increases, demand generally increases
- **Correlation**: A number between -1 and 1 showing how strong the linear relationship is
  - Close to 1 = strong positive relationship (both go up together)
  - Close to -1 = strong negative relationship (one up, other down)
  - Close to 0 = weak linear relationship (little straight-line pattern, although a curved pattern could still exist)

**What are you seeing?** If the trendline slopes upward, warmer hours tend to have more bike rentals. The correlation number confirms what your eyes see, but neither the line nor the correlation proves that temperature alone causes the change.
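
To see exactly what `np.polyfit()` and `np.poly1d()` hand back, it helps to fit points whose answer you already know. In this sketch the five points lie exactly on the line y = 3x + 2 (an invented example), so the fit should recover a slope of 3 and an intercept of 2.

```python
import numpy as np

# Points that lie exactly on the line y = 3x + 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 * x + 2

# Coefficients come back highest power first: [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
trendline = np.poly1d([slope, intercept])

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"Predicted y at x = 10: {trendline(10):.1f}")

# Correlation measures how tightly points follow a line, not how steep it is
r = np.corrcoef(x, y)[0, 1]
print(f"correlation = {r:.3f}")
```

The slope tells you how much demand changes per degree, while the correlation tells you how consistently the points follow that line; a steep line through very scattered points can still have a low correlation.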

---

### Challenge 4: Create a Scatter Plot with Trendline for Humidity

Your client wonders: "Does humidity affect bike demand?" Create a scatter plot with a trendline showing the relationship between humidity and bike demand.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create humidity vs demand scatter plot with trendline
# Step 1: Create scatter plot
plt.scatter(df['_____'], df['_____'], alpha=0.5)  # Fill in: 'humidity', 'count'

# Step 2: Calculate and plot trendline
coefficients = np.polyfit(df['_____'], df['_____'], 1)  # Fill in: 'humidity', 'count'
trendline = np.poly1d(coefficients)

humidity_range = np.linspace(df['humidity']._____, df['humidity']._____, 100)  # Fill in: min(), max()
demand_trend = trendline(humidity_range)

plt.plot(humidity_range, demand_trend, color='red', linewidth=2, label='Trendline')

plt.xlabel('Humidity (%)')
plt.ylabel('Bike Rentals per Hour')
plt.title('Humidity vs Bike Demand')
plt.legend()
plt.show()

# Calculate correlation
correlation = df['humidity'].corr(df['count'])
print(f"Correlation: {correlation:.3f}")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Follow the same pattern as the temperature example, but use 'humidity' instead of 'temp'. The steps are:

1. Create the scatter plot with `plt.scatter()`
2. Calculate the trendline coefficients with `np.polyfit()`
3. Create the trendline function with `np.poly1d()`
4. Generate points along the line with `np.linspace()`
5. Plot the line with `plt.plot()`

Use `alpha=0.5` to make points slightly transparent so you can see overlapping points better. Watch out: humidity might have a negative correlation (higher humidity = fewer riders) because humid weather is uncomfortable. The trendline will slope downward if that's the case. This helps your client understand if they should consider humidity when planning bike availability.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create humidity vs demand scatter plot with trendline
# Step 1: Create scatter plot
plt.scatter(df['humidity'], df['count'], alpha=0.5)

# Step 2: Calculate and plot trendline
coefficients = np.polyfit(df['humidity'], df['count'], 1)
trendline = np.poly1d(coefficients)

humidity_range = np.linspace(df['humidity'].min(), df['humidity'].max(), 100)
demand_trend = trendline(humidity_range)

plt.plot(humidity_range, demand_trend, color='red', linewidth=2, label='Trendline')

plt.xlabel('Humidity (%)')
plt.ylabel('Bike Rentals per Hour')
plt.title('Humidity vs Bike Demand')
plt.legend()
plt.show()

# Calculate correlation
correlation = df['humidity'].corr(df['count'])
print(f"Correlation: {correlation:.3f}")
```

</details>
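
Since the temperature and humidity examples repeat the same five trendline steps, you may find it convenient to wrap them in a small helper. The function name `add_trendline` and the simulated humidity data below are hypothetical and used only for illustration; this is one possible sketch, not part of matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt


def add_trendline(ax, x, y, **line_kwargs):
    """Fit a straight line to (x, y), draw it on ax, and return (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    x_line = np.linspace(x.min(), x.max(), 100)
    ax.plot(x_line, slope * x_line + intercept, **line_kwargs)
    return slope, intercept


# Invented data with a downward trend plus noise
rng = np.random.default_rng(seed=3)
humidity = rng.uniform(20, 100, size=200)
demand = 400 - 2.5 * humidity + rng.normal(0, 40, size=200)

fig, ax = plt.subplots()
ax.scatter(humidity, demand, alpha=0.5)
slope, intercept = add_trendline(ax, humidity, demand,
                                 color="red", linewidth=2, label="Trendline")
ax.set_xlabel("Humidity (%)")
ax.set_ylabel("Bike Rentals per Hour")
ax.legend()
print(f"Fitted slope: {slope:.2f} rentals per humidity point")
# In a notebook the figure renders at the end of the cell; in a script, call plt.show()
```

Returning the slope and intercept lets you quote the fitted numbers in your report as well as showing the line.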

---

## Summary: Professional Data Visualization and Communication Techniques

**What We've Accomplished**:
- Built line plots with matplotlib to reveal hourly demand patterns, commuter peaks, and weekday-weekend contrasts
- Compared categories with bar charts, covering working versus non-working days and the four seasons
- Used box plots to visualize demand variability, spot outliers, and read off the interquartile range for capacity planning
- Built histograms and experimented with bin counts to understand the distributions of hourly demand and temperature
- Explored relationships with scatter plots, adding NumPy-fitted trendlines and correlation coefficients
- Plotted multiple series on one chart with a legend for direct weekday-versus-weekend comparison
- Prepared data for plotting with `groupby()` aggregation and summary statistics

**Key Technical Skills Mastered**:
- Core matplotlib functions: `plt.plot()` for time series, `plt.bar()` for categorical comparisons, `plt.boxplot()` for distributions, `plt.hist()` for frequencies, and `plt.scatter()` for relationships
- Multi-line plots with `plt.legend()` for side-by-side pattern comparison
- Chart labeling with `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` so that every figure is readable on its own
- Box plot interpretation: quartiles, median, whiskers, and outliers
- Using the interquartile range of a box plot to describe the typical operating range (a descriptive range, not a confidence interval)
- Choosing histogram bin counts by balancing detail against clarity
- Trendlines via `np.polyfit()` and `np.poly1d()` to show the direction of a linear relationship
- Correlation with pandas `.corr()` to put a number on relationship strength
- Data preparation with `groupby()`, `mean()`, and hour extraction from datetimes
- `alpha` transparency to keep overlapping scatter points readable

**Next Steps**: Next, we'll advance to machine learning model development and predictive analytics implementation, leveraging your visualization capabilities to evaluate model performance through residual plots, learning curves, and prediction accuracy assessments. Your mastery of statistical visualization now enables sophisticated model diagnostics, hyperparameter optimization analysis, and stakeholder-ready performance reporting that transforms predictive modeling from black-box algorithms into transparent, explainable analytical systems.

Your bike-sharing client now possesses professional-grade data visualization capabilities that transform raw statistical analysis into compelling visual narratives driving evidence-based operational decisions. You've mastered the fundamental visualization methodologies and communication techniques that consulting firms require for stakeholder presentations, executive briefings, and analytical report generation - demonstrating the visual storytelling expertise that separates professional transportation consultants from basic data analysts!