# Lecture 7: Programming Example - Data Storytelling & Visualization

## Introduction: Mastering Data Storytelling Through Professional Visualization

Welcome back, junior data consultant! Your Capital Bikes Washington D.C. client now possesses statistical insights - but numbers alone don't drive decisions. Today, you'll transform from statistical analyst to visual storyteller, mastering the visualization techniques that convert complex transportation patterns into compelling, stakeholder-ready analytical narratives.

Think of data visualization as the consulting translator between mathematical analysis and executive action. While statistical tests quantify relationships, visualizations make those relationships instantly comprehensible to non-technical stakeholders. You'll master the fundamental chart types that professional consultants deploy in their daily work: line plots for temporal patterns, bar charts for categorical comparisons, box plots for distribution analysis, histograms for frequency understanding, and scatter plots for relationship exploration.

Your client is counting on you to transform your statistical discoveries into visual evidence that justifies million-dollar operational decisions. Every visualization you create serves a consulting purpose: answering specific business questions with clarity and precision that enables confident strategic action.

> **🚀 Interactive Learning Alert**
>
> This is a hands-on visualization tutorial with simple examples and gradual building. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your visualization skills
> - **Think like a consultant** - clarity is more important than complexity

---

## Step 1: Creating Line Plots

Let's start with the simplest and most useful plot for time-based data: the line plot. We'll show how bike demand changes throughout a typical day.

In [None]:
# Import the libraries we need
import pandas as pd
import matplotlib.pyplot as plt

# Load the Washington D.C. bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Extract the hour from datetime
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Calculate average demand for each hour (0-23)
hourly_demand = df.groupby('hour')['count'].mean()

# Create a basic line plot
plt.plot(hourly_demand.index, hourly_demand.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Daily Bike Demand Pattern')
plt.show()

print(f"Peak hour: {hourly_demand.idxmax()}:00 with {hourly_demand.max():.0f} bikes")

**What this does:**
- **Load and prepare**: Imports data and extracts hour from the datetime
- **Calculate averages**: Groups all data by hour and calculates the mean demand
- **Create line plot**: Uses `plt.plot(x_values, y_values)` where x_values are the hours (0-23) and y_values are the average bike counts. The function automatically connects these points with a line
- **Add labels**: `plt.xlabel()` and `plt.ylabel()` set the axis labels, while `plt.title()` adds the chart title. `plt.show()` displays the completed plot

**Why a line plot?** Time flows continuously through the day, so connecting points with a line shows that natural flow. You can see the morning and evening rush hours as peaks.
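To see what the `groupby('hour')['count'].mean()` step hands to `plt.plot()`, here is a minimal sketch on a tiny hand-made table. The six rows and their values are invented for illustration and do not come from the Capital Bikes dataset; only the column names `datetime` and `count` follow the lesson.

```python
import pandas as pd

# Hypothetical toy data: two days, three hourly readings each (values invented)
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-01-01 07:00", "2024-01-01 08:00", "2024-01-01 09:00",
        "2024-01-02 07:00", "2024-01-02 08:00", "2024-01-02 09:00",
    ]),
    "count": [100, 300, 150, 120, 340, 170],
})

# Same two steps as in the lesson: extract the hour, then average per hour
toy["hour"] = toy["datetime"].dt.hour
toy_hourly = toy.groupby("hour")["count"].mean()

print(toy_hourly)
print(f"Peak hour: {toy_hourly.idxmax()}:00")
```

The result is a Series indexed by hour, which is why the lesson passes `.index` as the x-values and `.values` as the y-values.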

---

### Challenge 1: Create a Line Plot for Weekend Patterns

Your client asks: "How does demand look on weekends?" Create a line plot showing average hourly demand only for weekend days (Saturday and Sunday).

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Your code here - create weekend line plot
df['dayofweek'] = df['datetime'].dt.dayofweek  # 0=Monday, 6=Sunday
weekend_data = df[df['dayofweek'].isin([_____, _____])]  # Fill in: 5, 6 for Saturday, Sunday
weekend_hourly = weekend_data.groupby('_____')['_____'].mean()  # Fill in: 'hour', 'count'

plt.plot(weekend_hourly.index, weekend_hourly.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Weekend Bike Demand Pattern')
plt.show()

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Break this into three steps: First, extract the day of week using `.dt.dayofweek` (remember: 0=Monday, so weekends are 5 and 6). Second, filter the dataframe to keep only weekend days using `.isin([5, 6])`. Third, group by hour and calculate the mean just like in Step 1. Your client will notice that weekend patterns look different from weekday patterns - probably a single midday peak instead of two rush hour peaks.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour

# Create weekend line plot
df['dayofweek'] = df['datetime'].dt.dayofweek
weekend_data = df[df['dayofweek'].isin([5, 6])]
weekend_hourly = weekend_data.groupby('hour')['count'].mean()

plt.plot(weekend_hourly.index, weekend_hourly.values)
plt.xlabel('Hour of Day')
plt.ylabel('Average Bike Rentals')
plt.title('Weekend Bike Demand Pattern')
plt.show()
```

</details>
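A natural follow-up for the client is to put both day types on one chart, so the difference in shape is visible at a glance. The sketch below is one possible way to do that, not part of the original exercise. It uses randomly generated stand-in data (one invented week of hourly rows) so it runs without the real dataset; the `is_weekend` column name is an assumption introduced here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real dataset: one week of hourly rows (Mon-Sun)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"datetime": pd.date_range("2024-01-01", periods=7 * 24, freq="h")})
toy["count"] = rng.integers(0, 500, size=len(toy))
toy["hour"] = toy["datetime"].dt.hour
toy["is_weekend"] = toy["datetime"].dt.dayofweek.isin([5, 6])  # 5=Saturday, 6=Sunday

# One average-per-hour series for each day type
weekday_hourly = toy[~toy["is_weekend"]].groupby("hour")["count"].mean()
weekend_hourly = toy[toy["is_weekend"]].groupby("hour")["count"].mean()

# Calling plot twice on the same axes overlays the two lines
fig, ax = plt.subplots()
ax.plot(weekday_hourly.index, weekday_hourly.values, label="Weekday")
ax.plot(weekend_hourly.index, weekend_hourly.values, label="Weekend")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Average Bike Rentals")
ax.set_title("Weekday vs Weekend Demand Pattern (toy data)")
ax.legend()
plt.show()
```

With the real `df` in place of `toy`, the same two `groupby` calls and two `ax.plot()` calls would apply unchanged.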

---

## Step 2: Creating Bar Plots

Now let's compare categories using a bar plot. We'll compare average demand between weekdays and weekends - perfect for seeing which type of day is busier.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# Calculate the average demand for weekdays vs weekends
weekday_avg = df[df['workingday'] == 1]['count'].mean()
weekend_avg = df[df['workingday'] == 0]['count'].mean()

# Create lists for the bar plot
categories = ['Weekday', 'Weekend']
values = [weekday_avg, weekend_avg]

# Create a bar plot
plt.bar(categories, values)
plt.xlabel('Day Type')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Weekday vs Weekend Demand')
plt.show()

print(f"Weekday average: {weekday_avg:.0f} bikes per hour")
print(f"Weekend average: {weekend_avg:.0f} bikes per hour")

**What this does:**
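- **Calculate category averages**: Finds mean demand for working days (workingday=1) and non-working days (workingday=0). The chart labels these "Weekday" and "Weekend" for simplicity, although a `workingday` value of 0 typically also includes holidays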
- **Prepare data**: Creates simple lists of category names and their values
- **Create bar plot**: Uses `plt.bar(categories, values)` to draw bars where the first argument (categories) becomes the x-axis labels and the second argument (values) becomes the bar heights representing each average
- **Add labels**: Uses `plt.xlabel()`, `plt.ylabel()`, and `plt.title()` to add descriptive text that makes the comparison clear and professional-looking

**Why a bar plot?** When comparing distinct categories (weekday vs weekend), bars make it easy to see which is higher. The height difference shows the answer instantly.
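One caveat about the category labels: a `workingday` flag of 0 usually covers holidays as well as Saturdays and Sundays, so "Weekend" is a convenient shorthand rather than an exact description. The sketch below uses three invented rows, and assumes `workingday` follows that common convention, to show how the flag can disagree with a strict calendar-based weekend test.

```python
import pandas as pd

# Hypothetical rows (values invented): a Monday holiday, a Tuesday, a Saturday
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-01-15 09:00", "2024-01-16 09:00", "2024-01-20 09:00"]),
    "workingday": [0, 1, 0],   # assumed convention: 0 = weekend OR holiday
    "count": [80, 250, 140],
})

# Strict calendar definition of a weekend: Saturday (5) or Sunday (6)
toy["is_weekend"] = toy["datetime"].dt.dayofweek >= 5

non_working_avg = toy.loc[toy["workingday"] == 0, "count"].mean()
weekend_only_avg = toy.loc[toy["is_weekend"], "count"].mean()

print(f"Non-working days: {non_working_avg:.0f} bikes per hour")
print(f"Strict weekends:  {weekend_only_avg:.0f} bikes per hour")
```

If the client specifically cares about Saturdays and Sundays, filtering on `dayofweek` is the safer choice; if the question is about staffing on days off in general, `workingday` is the better fit.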

---

### Challenge 2: Create a Bar Plot Comparing All Four Seasons

Your client wants to see: "Which season has the highest bike demand?" Create a bar plot comparing average demand across all four seasons.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create seasonal comparison bar plot
seasonal_avg = df.groupby('_____')['_____'].mean()  # Fill in: 'season', 'count'

categories = ['Spring', 'Summer', 'Fall', 'Winter']
values = seasonal_avg.values

plt.bar(_____, _____)  # Fill in: categories, values
plt.xlabel('Season')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Seasonal Bike Demand Comparison')
plt.show()

# Print the results
for i, season in enumerate(categories):
    print(f"{season}: {values[i]:.0f} bikes per hour")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


This is similar to the weekday/weekend comparison, but with four categories instead of two. Use `.groupby('season')` to calculate the average for each season (the dataset has season codes 1-4). The season codes map to: 1=Spring, 2=Summer, 3=Fall, 4=Winter. Remember to use `.mean()` after grouping to get averages. Watch out: the season values in the dataframe are numbers (1,2,3,4), but you want readable labels on your chart. The hard-coded label list lines up with the values because `groupby` returns the groups sorted by season code. You might expect summer to have the highest demand (warm weather = more biking) and winter the lowest (cold = fewer riders); let the chart confirm or challenge that expectation.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create seasonal comparison bar plot
seasonal_avg = df.groupby('season')['count'].mean()

categories = ['Spring', 'Summer', 'Fall', 'Winter']
values = seasonal_avg.values

plt.bar(categories, values)
plt.xlabel('Season')
plt.ylabel('Average Bike Rentals per Hour')
plt.title('Seasonal Bike Demand Comparison')
plt.show()

# Print the results
for i, season in enumerate(categories):
    print(f"{season}: {values[i]:.0f} bikes per hour")
```

</details>
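The solution pairs a hard-coded label list with `seasonal_avg.values`, which works only because `groupby` returns the seasons in sorted order. A slightly more defensive variant, sketched below on invented data, attaches each label to its value with a dictionary so the pairing no longer depends on position. The `season_names` dictionary is a helper introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data (values invented); codes follow the lesson: 1=Spring ... 4=Winter
toy = pd.DataFrame({
    "season": [3, 1, 4, 2, 1, 3, 2, 4],
    "count": [240, 100, 190, 210, 130, 220, 200, 210],
})

season_names = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}

# Average per season code, then rename the index so each label travels with its value
seasonal_avg = toy.groupby("season")["count"].mean().rename(index=season_names)

print(seasonal_avg)
# seasonal_avg.index and seasonal_avg.values can be passed straight to plt.bar()
```

This also protects you if a season is missing from a filtered subset of the data, where a fixed four-item label list would silently mislabel the bars or fail on a length mismatch.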

---

## Step 3: Creating Box Plots (Understanding Distributions with Confidence)

Before we dive into histograms, let's learn about box plots - a compact way to show how data is distributed across different categories. Box plots show you the "middle 50%" of your data and help you spot outliers.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create box plot comparing weekday vs weekend demand
weekday_data = df[df['workingday'] == 1]['count']
weekend_data = df[df['workingday'] == 0]['count']

plt.boxplot([weekday_data, weekend_data], labels=['Weekday', 'Weekend'])
plt.ylabel('Bike Rentals per Hour')
plt.title('Demand Distribution: Weekday vs Weekend')
plt.show()

# Print the quartiles
print("Weekday demand:")
print(f"  25th percentile: {weekday_data.quantile(0.25):.0f} bikes")
print(f"  50th percentile (median): {weekday_data.quantile(0.50):.0f} bikes")
print(f"  75th percentile: {weekday_data.quantile(0.75):.0f} bikes")

**What this does:**
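- **Load and prepare**: Imports data and creates two separate datasets - one for working days (workingday == 1, labeled "Weekday") and one for non-working days (workingday == 0, labeled "Weekend")
- **Create box plot**: Uses `plt.boxplot([data1, data2], labels=['Label1', 'Label2'])` where the first argument is a list of datasets to compare, and `labels` provides the category names. In Matplotlib 3.9 and later this parameter is named `tick_labels`, and the older `labels` spelling triggers a deprecation warning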
- **Add labels**: `plt.ylabel()` and `plt.title()` label the chart, `plt.show()` displays it
- **Calculate quartiles**: Uses `.quantile()` to find the 25th, 50th (median), and 75th percentiles

**What this shows:**
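- **The box**: Shows the middle 50% of data (from 25th to 75th percentile)
- **The line inside the box**: That's the median (50th percentile) - the middle value
- **The whiskers**: The lines extending from the box show the typical range. By default, each whisker reaches the most extreme value that lies within 1.5 times the box height (the interquartile range) of the box edge
- **The dots**: Any dots beyond the whiskers are outliers - unusually high or low values. They may be data errors, but they can just as well be genuine extremes such as an exceptionally busy hour, so investigate them before removing anything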

**Why use box plots?** They give you a complete picture of your data's spread in a compact format. You can quickly see if weekdays and weekends have similar ranges or if one is more variable than the other.

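**Understanding confidence with box plots:** The box shows you where the middle half of your data falls. If 50% of your hourly observations are between 40 and 280 bikes (the box), you can plan baseline operations around that range, keeping in mind that the other half of the hours fall outside it. This describes the spread of the data; it is not a statistical confidence interval. The whiskers extend to the lowest and highest values that are not flagged as outliers, giving you the typical operational range.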
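To make the box, whiskers, and dots concrete, the sketch below computes them by hand with NumPy for ten invented hourly counts. It follows Matplotlib's default rule of placing whiskers within 1.5 times the interquartile range; the variable names are chosen here for illustration.

```python
import numpy as np

# Hypothetical hourly rental counts (invented), including one unusually busy hour
rentals = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 400])

# The box: 25th percentile, median, and 75th percentile
q1, median, q3 = np.percentile(rentals, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Default rule: whiskers reach the most extreme points
# that still lie within 1.5 * IQR of the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = rentals[(rentals >= lower_fence) & (rentals <= upper_fence)]
outliers = rentals[(rentals < lower_fence) | (rentals > upper_fence)]

print(f"Box: {q1} to {q3}, median {median}")
print(f"Whiskers: {inside.min()} to {inside.max()}")
print(f"Outliers: {outliers}")
```

Here the single 400-bike hour lands beyond the upper fence, so it would be drawn as a dot rather than stretching the whisker.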

---

### Challenge 3: Create Box Plot Comparing Seasonal Demand

Your client asks: "How does demand variability differ across seasons?" Create a box plot comparing the distribution of bike demand across all four seasons.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create seasonal box plot
season_data = [df[df['_____'] == i]['_____'] for i in [_____, _____, _____, _____]]  # Fill in: 'season', 'count', 1, 2, 3, 4
season_labels = ['Spring', 'Summer', 'Fall', 'Winter']

plt.boxplot(_____, labels=_____)  # Fill in: season_data, season_labels
plt.ylabel('Bike Rentals per Hour')
plt.title('Demand Distribution by Season')
plt.show()

# Print insights
for i, season in enumerate(season_labels, 1):
    season_df = df[df['season'] == i]['count']
    print(f"{season}:")
    print(f"  Median: {season_df.median():.0f} bikes")
    print(f"  Middle 50% range: {season_df.quantile(0.25):.0f}-{season_df.quantile(0.75):.0f} bikes")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Box plots are perfect for comparing distributions across categories. Here's the approach: First, create a list of four datasets, one for each season (season codes are 1=Spring, 2=Summer, 3=Fall, 4=Winter). Use a list comprehension with filtering: `[df[df['season'] == i]['count'] for i in [1, 2, 3, 4]]`. This creates four separate Series of bike counts. Then pass this list to `plt.boxplot()` along with readable season labels. The plot will show you not just typical demand (the median line in each box), but also the spread of demand (box height) and outliers (dots). You might expect summer to show the highest median demand and possibly more variability, and winter the lowest and tightest distribution; check whether the boxes bear that out. Tall boxes mean unpredictable demand requiring flexible capacity planning.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create seasonal box plot
season_data = [df[df['season'] == i]['count'] for i in [1, 2, 3, 4]]
season_labels = ['Spring', 'Summer', 'Fall', 'Winter']

plt.boxplot(season_data, labels=season_labels)
plt.ylabel('Bike Rentals per Hour')
plt.title('Demand Distribution by Season')
plt.show()

# Print insights
for i, season in enumerate(season_labels, 1):
    season_df = df[df['season'] == i]['count']
    print(f"{season}:")
    print(f"  Median: {season_df.median():.0f} bikes")
    print(f"  Middle 50% range: {season_df.quantile(0.25):.0f}-{season_df.quantile(0.75):.0f} bikes")
```

</details>
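The print loop in the solution can also be condensed into a single summary table, which is handy when the client wants the numbers next to the chart. This is a sketch on invented data for two seasons only; the `iqr` column name is an assumption introduced here.

```python
import pandas as pd

# Hypothetical toy data (values invented): five hourly counts for each of two seasons
toy = pd.DataFrame({
    "season": [1] * 5 + [2] * 5,
    "count": [10, 20, 30, 40, 50, 100, 150, 200, 250, 300],
})

# One row per season, one column per quartile
summary = toy.groupby("season")["count"].quantile([0.25, 0.5, 0.75]).unstack()

# Box height (interquartile range) as a simple measure of variability
summary["iqr"] = summary[0.75] - summary[0.25]

print(summary)
```

A larger `iqr` corresponds to a taller box in the plot, so this column ranks the seasons by how variable their demand is.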

---

## Step 4: Creating Histograms (Understanding Bins!)

Histograms show how data is distributed - how often different values occur. The key concept here is **bins**: the ranges that group your data together.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a histogram with 20 bins
plt.hist(df['count'], bins=20)
plt.xlabel('Bike Rentals per Hour')
plt.ylabel('Number of Hours')
plt.title('Distribution of Hourly Bike Demand (20 bins)')
plt.show()

print("Most common demand range: Look for the tallest bar")
print(f"Total hours analyzed: {len(df)}")

**What this does:**
- **Create histogram**: Uses `plt.hist()` function which takes the data column `df['count']` and automatically divides it into bins (ranges), then counts how many data points fall into each bin and displays them as bars
- **bins=20**: Divides the entire range of rentals (from min to max) into 20 equal-width bins
- **Bar height**: Shows how many hours had demand in that bin's range

**What are bins?** Think of bins like buckets. If demand ranges from 0-1000 bikes and you use 20 bins, each bin is 50 bikes wide (0-50, 50-100, 100-150, etc.). The histogram shows how many hours fall into each bucket.
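The bucket arithmetic above can be checked directly with `np.histogram()`, which performs the same equal-width binning that `plt.hist()` relies on. The ten demand values below are invented so that they span exactly 0 to 1000.

```python
import numpy as np

# Hypothetical demand values (invented) spanning exactly 0 to 1000
demand = np.array([0, 12, 48, 75, 120, 130, 480, 510, 990, 1000])

# counts[i] is the number of values between edges[i] and edges[i + 1]
counts, edges = np.histogram(demand, bins=20)

bin_width = edges[1] - edges[0]
print(f"Number of bins: {len(counts)}, bin width: {bin_width:.0f}")
print(f"First bin edges: {edges[:4]}")
print(f"Values in first bin (0-50): {counts[0]}")
print(f"Values in last bin (950-1000): {counts[-1]}")
```

Twenty bins need twenty-one edges, and every value lands in exactly one bin, so the counts always add up to the number of rows.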

---

### Challenge 4: Create a Histogram of Temperature Distribution

Your client asks: "What temperatures do we experience in D.C.?" Create a histogram showing the distribution of temperatures, then experiment with different numbers of bins.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create temperature histogram
plt.hist(df['_____'], bins=_____)  # Fill in: 'temp', and choose number of bins (try 25)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Number of Hours')
plt.title('Distribution of D.C. Temperatures')
plt.show()

# Print some statistics
print(f"Average temperature: {df['temp'].mean():.1f}°C")
print(f"Coldest hour: {df['temp'].min():.1f}°C")
print(f"Warmest hour: {df['temp'].max():.1f}°C")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use `plt.hist()` with the 'temp' column from the dataframe. Start with bins=25 to get good detail. Temperature in D.C. ranges roughly from -5°C to 40°C, so 25 bins means each bucket represents about 2 degrees. Try changing bins to 10, then 50 to see how it affects the visualization. You'll probably see that most hours cluster around moderate temperatures (15-25°C) with fewer very cold or very hot hours. This distribution helps your client understand what weather conditions to plan for.

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create temperature histogram
plt.hist(df['temp'], bins=25)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Number of Hours')
plt.title('Distribution of D.C. Temperatures')
plt.show()

# Print some statistics
print(f"Average temperature: {df['temp'].mean():.1f}°C")
print(f"Coldest hour: {df['temp'].min():.1f}°C")
print(f"Warmest hour: {df['temp'].max():.1f}°C")
```

</details>

---

## Step 5: Creating Scatter Plots (with Trendlines!)

Scatter plots show the relationship between two variables. Let's see how temperature affects bike demand, and add a trendline to see the overall pattern.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a scatter plot
plt.scatter(df['temp'], df['count'], alpha=0.5)

# Add a trendline
# Step 1: Calculate the line that best fits the data
coefficients = np.polyfit(df['temp'], df['count'], 1)  # 1 means a straight line
trendline = np.poly1d(coefficients)  # Creates a function for the line

# Step 2: Create points along the line to plot it
temp_range = np.linspace(df['temp'].min(), df['temp'].max(), 100)
demand_trend = trendline(temp_range)

# Step 3: Plot the trendline
plt.plot(temp_range, demand_trend, color='red', linewidth=2, label='Trendline')

plt.xlabel('Temperature (Celsius)')
plt.ylabel('Bike Rentals per Hour')
plt.title('Temperature vs Bike Demand (with Trendline)')
plt.legend()
plt.show()

# Calculate and show correlation
correlation = df['temp'].corr(df['count'])
print(f"Correlation: {correlation:.3f}")
print(f"Interpretation: {'Positive' if correlation > 0 else 'Negative'} relationship")

**What this does:**
- **Create scatter plot**: `plt.scatter(x_values, y_values)` plots points where each point represents one data row. The first parameter (df['temp']) becomes the x-coordinate, the second parameter (df['count']) becomes the y-coordinate. So each hour in our dataset becomes one dot positioned at (temperature, bike_demand)
- **alpha=0.5**: Makes the points semi-transparent so you can see overlapping points better
- **Shows relationship**: You can see if higher temperatures generally have higher demand

**Why a scatter plot?** When you have two continuous measurements (temperature and demand), scatter plots let you see if they're related. Do points go up together? That's a positive relationship.

**What the trendline shows:**
- **np.polyfit()**: Finds the best straight line through the scattered points
- **Red line**: Shows the general trend - as temperature increases, demand generally increases
- **Correlation**: A number between -1 and 1 showing the direction and strength of the linear relationship
  - Close to 1 = strong positive relationship (both go up together)
  - Close to -1 = strong negative relationship (one up, other down)
  - Close to 0 = weak relationship (not much pattern)

**What are you seeing?** If the trendline slopes upward, warmer hours tend to have more bike rentals. The correlation number quantifies what your eyes see.
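Slope and correlation answer different questions, and a small example helps keep them apart. In the sketch below, the temperatures and demand figures are invented: the first series lies exactly on a line, the second adds alternating noise of plus or minus 30 rentals. The variable names are chosen here for illustration.

```python
import numpy as np

# Hypothetical data (invented): demand rises by exactly 10 rentals per degree
temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
demand = 10 * temp + 50

# Degree-1 fit returns [slope, intercept]
slope, intercept = np.polyfit(temp, demand, 1)
trendline = np.poly1d([slope, intercept])
r = np.corrcoef(temp, demand)[0, 1]

print(f"Slope: {slope:.1f} rentals per degree, intercept: {intercept:.1f}")
print(f"Predicted demand at 22 degrees: {trendline(22):.0f}")
print(f"Correlation (perfect line): {r:.3f}")

# Same underlying trend, but with noise scattered around the line
demand_noisy = demand + np.array([30, -30, 30, -30, 30, -30])
r_noisy = np.corrcoef(temp, demand_noisy)[0, 1]
print(f"Correlation (noisy data): {r_noisy:.3f}")
```

The slope tells you how much demand changes per degree; the correlation tells you how tightly the points hug the line. A steep trend with widely scattered points can still have a modest correlation.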

---

### Challenge 5: Clean Up the Scatter Plot Using Bins

Your client looks at a raw scatter plot of humidity against demand and says: "This is too messy with 17,000+ overlapping points. Can you clean this up like professional analysts do?" Let's use **binning** to aggregate the data and show mean values with confidence intervals - creating a cleaner, more professional visualization.

In [None]:
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Your code here - create a binned scatter plot with confidence intervals
# Step 1: Create humidity bins to aggregate data
df['humidity_bin'] = pd.cut(df['_____'], bins=20)  # Fill in: 'humidity'

# Step 2: Calculate mean, std, and count for each bin
binned = df.groupby('humidity_bin', observed=True)['count'].agg(['_____', '_____', '_____'])  # Fill in: 'mean', 'std', 'count'

# Step 3: Calculate standard error and get bin centers
binned['se'] = binned['std'] / np.sqrt(binned['count'])
binned['humidity_center'] = binned.index.map(lambda x: x._____)  # Fill in: mid

# Step 4: Create the plot with error bars (95% confidence interval)
plt.figure(figsize=(10, 6))
plt.errorbar(binned['humidity_center'], binned['_____'],  # Fill in: 'mean'
             yerr=binned['se']*1.96,  # 95% confidence interval
             fmt='o', markersize=8, capsize=5, capthick=2,
             color='#3498DB', ecolor='#3498DB', alpha=0.7,
             label='Mean Demand (95% CI)')

# Step 5: Add trendline using the original (unbinned) data
slope, intercept, r_value, p_value, std_err = stats.linregress(df['_____'], df['_____'])  # Fill in: 'humidity', 'count'
line_x = np.array([df['humidity'].min(), df['humidity'].max()])
line_y = slope * line_x + intercept
plt.plot(line_x, line_y, 'r--', linewidth=2.5, label=f'Trend Line (r = {r_value:.3f})')

plt.xlabel('Humidity (%)', fontsize=12, fontweight='bold')
plt.ylabel('Hourly Bike Rentals', fontsize=12, fontweight='bold')
plt.title('Humidity-Demand Relationship: Binned Analysis Reduces Visual Clutter',
         fontsize=13, fontweight='bold')
plt.legend(loc='upper right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print insights
print(f"Humidity-demand correlation: r = {r_value:.3f}")
print(f"Humidity explains {(r_value**2)*100:.1f}% of demand variation")
print(f"For each 1% increase in humidity, demand changes by {slope:.1f} rentals")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


The binning approach has several steps: (1) Create bins with `pd.cut()` to divide humidity into 20 equal-width ranges, (2) Use `groupby().agg()` to calculate mean, std (standard deviation), and count for each bin, (3) Calculate standard error (se = std / sqrt(count)), which measures uncertainty in each bin's mean, (4) Get bin centers with `.mid` to position points, (5) Plot with `errorbar()` using `yerr=se*1.96` for approximate 95% confidence intervals, (6) Add the trendline using `stats.linregress()` on the original unbinned data so the correlation reflects every observation. The confidence interval bars show: longer bars = more uncertainty (fewer observations or more variable demand in that bin), shorter bars = more certainty (many observations). This creates a professional visualization that's much cleaner than 17,000 overlapping raw points! Watch out: humidity typically has a negative correlation (higher humidity = uncomfortable = fewer riders).

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset.csv")

# Create a binned scatter plot with confidence intervals
# Step 1: Create humidity bins to aggregate data
df['humidity_bin'] = pd.cut(df['humidity'], bins=20)

# Step 2: Calculate mean, std, and count for each bin
binned = df.groupby('humidity_bin', observed=True)['count'].agg(['mean', 'std', 'count'])

# Step 3: Calculate standard error and get bin centers
binned['se'] = binned['std'] / np.sqrt(binned['count'])
binned['humidity_center'] = binned.index.map(lambda x: x.mid)

# Step 4: Create the plot with error bars (95% confidence interval)
plt.figure(figsize=(10, 6))
plt.errorbar(binned['humidity_center'], binned['mean'],
             yerr=binned['se']*1.96,  # 95% confidence interval
             fmt='o', markersize=8, capsize=5, capthick=2,
             color='#3498DB', ecolor='#3498DB', alpha=0.7,
             label='Mean Demand (95% CI)')

# Step 5: Add trendline using the original (unbinned) data
slope, intercept, r_value, p_value, std_err = stats.linregress(df['humidity'], df['count'])
line_x = np.array([df['humidity'].min(), df['humidity'].max()])
line_y = slope * line_x + intercept
plt.plot(line_x, line_y, 'r--', linewidth=2.5, label=f'Trend Line (r = {r_value:.3f})')

plt.xlabel('Humidity (%)', fontsize=12, fontweight='bold')
plt.ylabel('Hourly Bike Rentals', fontsize=12, fontweight='bold')
plt.title('Humidity-Demand Relationship: Binned Analysis Reduces Visual Clutter',
         fontsize=13, fontweight='bold')
plt.legend(loc='upper right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print insights
print(f"Humidity-demand correlation: r = {r_value:.3f}")
print(f"Humidity explains {(r_value**2)*100:.1f}% of demand variation")
print(f"For each 1% increase in humidity, demand changes by {slope:.1f} rentals")
```

</details>
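The error bars in the solution come from one small calculation per bin, worked through below for a single bin. The five rental counts are invented, and the 1.96 multiplier is the usual normal approximation, which is reasonable for bins with many observations and only rough for a bin this small.

```python
import numpy as np

# Hypothetical rental counts (invented) for the hours that fall into one humidity bin
bin_counts = np.array([100, 120, 140, 160, 180])

mean = bin_counts.mean()
std = bin_counts.std(ddof=1)           # sample standard deviation, as pandas' 'std' computes it
se = std / np.sqrt(len(bin_counts))    # standard error of the bin's mean

# Approximate 95% confidence interval around the mean
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se
print(f"Mean: {mean:.0f}, SE: {se:.1f}, 95% CI: {ci_low:.0f} to {ci_high:.0f}")

# With the same spread but four times as many observations, the error bar halves
se_larger_bin = std / np.sqrt(4 * len(bin_counts))
print(f"SE with 4x the observations: {se_larger_bin:.1f}")
```

This is why sparsely populated bins at the edges of the humidity range tend to show longer error bars than the well-populated bins in the middle.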

---

## Summary: Professional Data Visualization and Communication Techniques

**What We've Accomplished**:
- Created line plots to visualize hourly demand patterns and identify peak usage times
- Built bar charts to compare categorical data like weekday vs weekend demand and seasonal patterns
- Developed box plots to analyze demand distributions, identify outliers, and understand quartile ranges
- Constructed histograms to visualize frequency distributions with strategic bin selection
- Built scatter plots with trendlines to explore temperature-demand relationships and calculate correlations

**Key Technical Skills Mastered**:
- `plt.plot()` for temporal line plots, `plt.bar()` for categorical comparisons, `plt.boxplot()` for distributions, `plt.hist()` for frequencies, and `plt.scatter()` for relationships
- Chart customization with `xlabel()`, `ylabel()`, and `title()` for professional presentation
- Box plot interpretation: quartiles, medians, whiskers, and outliers
- Histogram bin selection to balance detail and clarity
- Trendline creation using `np.polyfit()` and `np.poly1d()` for linear regression visualization
- Correlation calculation with `.corr()` to quantify relationship strength
- Data aggregation with `groupby()` and `.mean()` for visualization preparation
- Transparency control with `alpha` parameter for overlapping points

**Next Steps**: Next, we'll advance to machine learning model development and predictive analytics implementation, leveraging your visualization capabilities to evaluate model performance.

Your bike-sharing client now possesses professional-grade data visualization capabilities that transform raw statistical analysis into compelling visual narratives driving evidence-based operational decisions. You've mastered the fundamental visualization methodologies that consulting firms require for stakeholder presentations, executive briefings, and analytical report generation - demonstrating the visual storytelling expertise that separates professional transportation consultants from basic data analysts!