# Lecture 3: Programming Example - Pandas Fundamentals with Washington D.C. Data

## Introduction: Your First Day as a Data Consultant

Welcome to your first hands-on session as a junior data consultant! Today, you'll work with real Washington D.C. bike-sharing data, learning pandas step-by-step. We'll start with absolute basics and build up to loading and exploring your client's dataset.

> **🚀 Interactive Learning Alert**
> 
> This is a hands-on programming tutorial with code examples and challenges. For the best learning experience:
> 
> - **Click "Open in Colab"** at the top of this notebook to run it in Google Colab
> - **Execute each code cell** by pressing **Shift + Enter** to see the results
> - **Complete the challenges** to practice what you learn
> 

---

## Step 1: Setting Up Your Data Analysis Environment

Let's start by importing the tools you'll need for data analysis. Think of this like setting up your workbench before starting a project:

In [None]:
# Import pandas - your primary data manipulation tool
import pandas as pd

# Print confirmation
print("Pandas imported successfully!")
print(f"Pandas version: {pd.__version__}")

> **Note:** To run a code cell in Jupyter Notebook, click inside the cell and press **Shift + Enter**. This will execute the code and show the output directly below the cell.

**What this does:**
- `import pandas as pd` makes the pandas library available with the shorthand "pd"
- The shorthand `pd` is the standard convention - you will see it in nearly all pandas code and documentation
- Printing the version lets you confirm which pandas release is installed. In real-world projects, you usually just import pandas and skip printing the version or confirmation messages. We're doing it here only for teaching purposes, so you can see exactly what's happening step by step.

---

### Challenge 1: Import Practice
Import pandas yourself, then check that it worked by running `type(pd)`. The output should be `<class 'module'>`, confirming that `pd` is a module.

In [None]:
# Your code here
import _____ as pd  # Fill in the library name
print(type(pd))

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Your code here
import pandas as pd  # Fill in the library name
print(type(pd))
```

</details>

---

## Step 2: Understanding Series - Single Column Data

Let's start with Series by creating bike rental data for a typical Monday morning. Think of Series as a single column from a spreadsheet:

In [None]:
# Create a Series with hourly bike rentals
morning_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])
print("Morning Bike Rentals:")
print(morning_rentals)

**What this does:**
- `morning_rentals = pd.Series([...])` creates a pandas Series object and saves it in the variable `morning_rentals`
- The numbers in the right column represent bike rentals for each hour
- The numbers in the left column are index numbers that pandas automatically assigns (0, 1, 2, etc.)
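
To make the two columns of the printout concrete, here is a minimal sketch that pulls the index and the values apart. It reuses the rental numbers from the cell above; the `.iloc` positional lookup is an addition that the text has not introduced, shown here only as one possible way to read a value by position.

```python
import pandas as pd

# Same hourly rental counts as in the example above
morning_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])

# Left column of the printout: the automatically assigned index
index_labels = list(morning_rentals.index)

# Right column of the printout: the rental counts themselves
rental_values = morning_rentals.tolist()

# Positional access with .iloc: position 0 is the first hour
first_hour_rentals = morning_rentals.iloc[0]

print(f"Index: {index_labels}")
print(f"Values: {rental_values}")
print(f"First hour: {first_hour_rentals}")
```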

### Challenge 2: Create Your Own Series
Create a Series representing temperature readings for the same morning hours. Use these temperatures: [42, 44, 47, 50, 53, 56, 58, 61]

In [None]:
# Import pandas for Series creation
import pandas as pd

# Your code here - create a Series called 'morning_temps'
morning_temps = pd.Series(_____)  # Fill in the temperature list
print(_____)  # Print your Series to verify it worked

> **Note:** To solve the exercise, replace each `_____` with the correct value or code.

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When creating Series objects, consider these best practices:
- Use descriptive variable names like `morning_temps` instead of generic names like `data`
- Print the Series to verify it contains the expected values and structure

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import pandas for Series creation
import pandas as pd

# Your code here - create a Series called 'morning_temps'
morning_temps = pd.Series([42, 44, 47, 50, 53, 56, 58, 61])  # Fill in the temperature list
print(morning_temps)  # Print your Series to verify it worked
```

</details>

---

## Step 3: Adding Meaningful Labels to Series

Raw index numbers (0, 1, 2) aren't very business-friendly. Let's add meaningful labels:

In [None]:
# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)
print("Labeled Morning Bike Rentals:")
print(morning_rentals_labeled)

**What this does:**
- `hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']` creates a list of time labels that represent each hour in the morning
- `index=hour_labels` replaces default numbers with business-meaningful labels created in the previous line
- Now each rental count is clearly connected to its time period

Now when you show this to your client, they immediately understand that 1 PM has the highest rentals (287), which makes perfect sense for lunch-time bike usage.
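
Rather than reading the busiest hour off the printout, you could let pandas rank the hours for you. The sketch below is one possible approach using `sort_values`; the variable names `ranked_hours`, `busiest_hour`, and `busiest_count` are illustrative and not part of the lesson's code.

```python
import pandas as pd

# Same labeled Series as in the example above
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287],
                                    index=hour_labels)

# Sort from highest to lowest rental count; the labels travel with the values
ranked_hours = morning_rentals_labeled.sort_values(ascending=False)

# The first entry after sorting is the busiest hour
busiest_hour = ranked_hours.index[0]
busiest_count = ranked_hours.iloc[0]

print(f"Busiest hour: {busiest_hour} ({busiest_count} rentals)")
```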

### Challenge 3: Access Specific Data Points
Using your labeled Series, find the bike rentals at 9 AM. Use this syntax: `morning_rentals_labeled['9 AM']`

In [None]:
# Import pandas and create the labeled Series
import pandas as pd

# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)

# Your code here - access the 9 AM value from the labeled Series
nine_am_rentals = morning_rentals_labeled[_____]  # Fill in the correct label
print(f"Bike rentals at 9 AM: {_____}")  # Print the result

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When accessing Series data by label, keep these techniques in mind:
- In this challenge, we're using square brackets with the exact label: `series_name['label']`
- Check available labels with `series_name.index` to see all options
- Access multiple values: `series_name[['9 AM', '10 AM']]` (note double brackets)
- Use `.loc[]` for explicit label-based selection: `series_name.loc['9 AM']`
- Be careful with exact spelling and spacing in labels to avoid KeyError

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import pandas and create the labeled Series
import pandas as pd

# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)

# Your code here - access the 9 AM value from the labeled Series
nine_am_rentals = morning_rentals_labeled['9 AM']  # Fill in the correct label
print(f"Bike rentals at 9 AM: {nine_am_rentals}")  # Print the result
```

</details>

---

## Step 4: Creating Your First DataFrame - Complete Business Data

Now let's combine multiple pieces of information into a DataFrame. Think of this as creating a complete spreadsheet with multiple columns:

In [None]:
# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [5.6, 6.7, 8.3, 10.0, 11.7],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

print("Complete Bike Operations Data:")
print(bike_operations_data)

**What this does:**
- `pd.DataFrame({...})` creates a DataFrame with multiple columns
- The DataFrame constructor accepts a dictionary where:
  - each key (like 'hour', 'temperature', etc.) becomes a column name
  - each list in the dictionary provides the values for that column
- All rows stay aligned (first hour corresponds to first temperature, etc.)

You can immediately see patterns - bike rentals increase with temperature and time, giving your client valuable operational insights.
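
If you want a number to back up what your eyes see, one option is a correlation coefficient. The sketch below uses `Series.corr`, which computes the Pearson correlation by default; the variable name `temp_rental_corr` is illustrative. With only five rows this is a toy calculation, not evidence about the real dataset.

```python
import pandas as pd

# Same small DataFrame as in the example above
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [5.6, 6.7, 8.3, 10.0, 11.7],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Pearson correlation between temperature and rentals (values near 1 mean
# the two columns rise together)
temp_rental_corr = bike_operations_data['temperature'].corr(
    bike_operations_data['bike_rentals']
)

print(f"Temperature vs. rentals correlation: {temp_rental_corr:.2f}")
```

A value close to 1 here only says the two columns move together in this tiny sample; time of day rises alongside temperature, so the sketch cannot separate the two effects.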

### Challenge 4: Add a New Column
Add a column called 'user_satisfaction' with values [3.2, 3.5, 3.8, 4.1, 4.3] representing customer satisfaction ratings.

In [None]:
# Import pandas and create the DataFrame
import pandas as pd

# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [5.6, 6.7, 8.3, 10.0, 11.7],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Your code here - add a new column with satisfaction ratings
bike_operations_data[_____] = [3.2, 3.5, 3.8, 4.1, 4.3]  # Fill in column name
print(_____)  # Print the updated DataFrame

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When adding new columns to DataFrames, follow these best practices:
- Ensure the new data list has the same length as existing rows: `len(new_data) == len(df)`
- Use descriptive column names that clearly indicate what the data represents
- Verify the addition worked: `df.columns` shows all column names including the new one
- Check data types: `df.dtypes` to ensure the new column has appropriate type (float64 for ratings)
- You can also add columns using `df.assign(column_name=values)` for a more functional approach

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import pandas and create the DataFrame
import pandas as pd

# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [5.6, 6.7, 8.3, 10.0, 11.7],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Your code here - add a new column with satisfaction ratings
bike_operations_data['user_satisfaction'] = [3.2, 3.5, 3.8, 4.1, 4.3]  # Fill in column name
print(bike_operations_data)  # Print the updated DataFrame
```

</details>

---

## Step 5: Loading Real Washington D.C. Dataset

Now for the real challenge - loading your client's actual historical data. This is where professional consulting begins:

In [None]:
# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Confirm successful loading
print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")

**What this does:**
- `pd.read_csv()` reads data from a CSV file into a DataFrame
- CSV (Comma-Separated Values) is a standard format for sharing tabular data
- The shape tells you how much data you have to work with

Always check the shape immediately after loading - it confirms the file loaded correctly and gives you a sense of your dataset size.
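
To see what `pd.read_csv()` does without depending on a network connection, here is a small sketch that reads CSV text held in memory. The three data rows are invented for illustration and are not taken from the client's file; `io.StringIO` simply makes a string behave like a file.

```python
import io
import pandas as pd

# Invented sample rows for illustration only (not from the real dataset)
csv_text = """datetime,temp,count
2011-01-01 00:00:00,10.0,12
2011-01-01 01:00:00,11.5,30
2011-01-01 02:00:00,12.0,25
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple; the header line is not counted as a row
n_rows, n_cols = sample_df.shape
column_names = list(sample_df.columns)

print(f"Shape: {n_rows} rows × {n_cols} columns")
print(f"Columns: {column_names}")
```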

### Challenge 5: Explore the Column Names
Print the column names using `df.columns` to see what variables are available in your dataset.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - explore the available columns
print("Available columns:")
print(list(_____.columns))  # Fill in the DataFrame name

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When exploring column names in a new dataset, use these investigation techniques:
- `df.columns` returns an Index object with all column names
- `list(df.columns)` converts to a regular Python list for easier reading
- `len(df.columns)` tells you how many variables you have to work with
- Look for patterns in naming conventions to understand data structure

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - explore the available columns
print("Available columns:")
print(list(df.columns))  # Fill in the DataFrame name
```

</details>

---

## Step 6: First Look at Real Transportation Data

Let's examine the first few rows to understand the data structure:

In [None]:
# Display the first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

**What this shows:**
- `head()` displays the first 5 rows by default (you can specify a different number like `df.head(10)` to see 10 rows)
- You'll see actual bike-sharing data with timestamps, weather, and usage counts
- Each row represents one hour of bike-sharing operations

**Understanding the real data:**
- `datetime`: When this data was recorded
- `season`, `holiday`, `workingday`: Operational context
- `weather`, `temp`, `humidity`, `windspeed`: Weather conditions
- `casual`, `registered`, `count`: Different types of users and total rentals

This is the foundation of all your future analysis for this client.
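
The row-count argument to `head()` is easiest to understand on a tiny table. This sketch uses a hypothetical ten-row DataFrame with invented rental counts; `toy_df`, `first_three`, and `default_head` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data: ten consecutive hours with invented rental counts
toy_df = pd.DataFrame({
    'hour': list(range(10)),
    'count': [5, 8, 12, 20, 35, 50, 64, 70, 66, 58]
})

# Ask for exactly three rows
first_three = toy_df.head(3)

# With no argument, head() falls back to five rows
default_head = toy_df.head()

print(first_three)
print(f"Default head() returned {len(default_head)} rows")
```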

### Challenge 6: Look at the Last Few Rows
Use `df.tail()` to see the last 5 rows of the dataset. This helps verify you have complete data coverage.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - examine the last few rows
print("Last 5 rows of the dataset:")
print(_____._____)  # Fill in the DataFrame name and method

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When examining the end of your dataset, consider these data quality checks:
- Compare last row's datetime to first row's datetime to understand time coverage
- Check if the last rows have complete data or if there are missing values
- Use `df.tail(10)` to see more rows if you want a larger sample
- Look for any unusual patterns or data entry errors in the final records
- Verify that the data collection didn't stop abruptly mid-period

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - examine the last few rows
print("Last 5 rows of the dataset:")
print(df.tail())  # Fill in the DataFrame name and method
```

</details>

---

## Step 7: Understanding Your Dataset Size and Structure

Professional data analysis requires understanding exactly what you're working with:

In [None]:
# Get detailed information about the dataset
print("Dataset Information:")
print(f"Total records: {len(df)}")
print(f"Total variables: {len(df.columns)}")

# Show data types for each column
print("\nData Types:")
print(df.dtypes)

**What this tells you:**
- **Total records**: How much historical data your client has collected
- **Total variables**: How many different factors you can analyze
- **Data types**: What kind of analysis you can perform on each variable

More historical data generally supports more reliable predictions. The variety of variables (weather, time, user types) means you can build sophisticated demand forecasting models.
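
Data types matter because they decide which operations make sense: numeric columns support arithmetic such as averages, while a date stored as text does not until it is converted. The sketch below separates the two kinds of columns with `select_dtypes` on a hypothetical two-row table with invented values.

```python
import pandas as pd

# Hypothetical toy table with invented values; 'datetime' is stored as text
toy_df = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00'],
    'temp': [10.0, 11.5],
    'count': [12, 30],
})

# Columns that hold numbers (integers or floats)
numeric_columns = toy_df.select_dtypes(include='number').columns.tolist()

# Everything else, here the text column holding dates
non_numeric_columns = toy_df.select_dtypes(exclude='number').columns.tolist()

print(f"Numeric columns: {numeric_columns}")
print(f"Non-numeric columns: {non_numeric_columns}")
```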

### Challenge 7: Calculate Time Coverage
The dataset contains hourly data. Calculate how many days of data you have by dividing the total rows by 24.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - calculate days of coverage
total_days = len(_____) / _____  # Fill in DataFrame name and divisor
print(f"Dataset covers approximately {_____:.1f} days")  # Fill in variable name

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When calculating time coverage for time series data, consider these analysis approaches:
- Use `:.1f` formatting to display days with one decimal place for readability
- Calculate weeks as well: `total_days / 7` for business planning context
- Consider if you have complete days: `len(df) % 24` shows any partial days

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - calculate days of coverage
total_days = len(df) / 24  # Fill in DataFrame name and divisor
print(f"Dataset covers approximately {total_days:.1f} days")  # Fill in variable name
```

</details>

---

## Step 8: Basic Statistical Summary

Understanding the data distribution helps identify patterns and potential issues:

In [None]:
# Generate statistical summary for numerical variables
print("Statistical Summary:")
print(df.describe())

**What this shows:**
- **count**: How many non-missing values exist for each variable
- **mean**: Average values (useful for understanding typical conditions)
- **std**: Standard deviation (shows how much values vary)
- **min/max**: Range of values (helps identify outliers or impossible values)
- **25%, 50%, 75%**: Quartiles (show data distribution)

If the minimum bike count is 1 and the maximum is 977, that's a huge range! This suggests your client experiences very different demand conditions that you'll need to understand and predict.
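
The result of `describe()` is itself a pandas object, so you can pull out individual statistics by label instead of reading them off the printout. In this sketch the endpoints 1 and 977 come from the text above, while the two middle values are invented to keep the example small.

```python
import pandas as pd

# Endpoints 1 and 977 from the text; the middle values are invented
toy_counts = pd.Series([1, 50, 200, 977], name='count')

# For a Series, describe() returns a Series indexed by statistic name
summary = toy_counts.describe()

# Look up single statistics by their label
count_range = summary['max'] - summary['min']
median_count = summary['50%']

print(summary)
print(f"Range: {count_range:.0f}")
print(f"Median: {median_count:.1f}")
```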

### Challenge 8: Focus on Key Business Metrics
Create a summary focusing only on the key business variables: temperature (`temp`), humidity (`humidity`), and bike count (`count`).

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a summary of key metrics
key_metrics = _____[[_____, _____, _____]].describe()  # Fill in DataFrame and column names
print("Key Business Metrics Summary:")
print(_____)  # Print the summary

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When focusing on specific business metrics, use these analytical techniques:
- Select columns using double brackets: `df[['col1', 'col2']]` to maintain DataFrame structure
- Compare ranges across variables to understand which have more variation
- Check for outliers: are max values realistic or potentially data entry errors?
- Consider business thresholds: what temperature ranges are most relevant for operations?

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a summary of key metrics
key_metrics = df[['temp', 'humidity', 'count']].describe()  # Fill in DataFrame and column names
print("Key Business Metrics Summary:")
print(key_metrics)  # Print the summary
```

</details>

---

## Step 9: Selecting Specific Data for Analysis

Often you need to focus on specific parts of your dataset. Let's learn several ways to select data:

In [None]:
# Select a single column (bike counts)
bike_counts = df['count']
print(f"Bike counts - Type: {type(bike_counts)}")
print(f"Average hourly rentals: {bike_counts.mean():.1f}")

# Select multiple columns for weather analysis
weather_data = df[['temp', 'humidity', 'windspeed']]
print(f"\nWeather data shape: {weather_data.shape}")
print(weather_data.head(3))

# Select first 12 rows for initial analysis
sample_data = df.head(12)
print(f"\nSample data covers first {len(sample_data)} hours")
print(sample_data)

**What this demonstrates:**
- **Single column selection**: Returns a Series (one-dimensional)
- **Multiple column selection**: Returns a DataFrame (two-dimensional)
- **Row selection**: Gets a subset of the full dataset

You might analyze just weather data to understand seasonal patterns, or focus on the first few months to understand how the bike-sharing system performed during its early operations.
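
The difference between single and double brackets is easy to miss, so here is a small sketch that selects the same column both ways and compares the results. The table is hypothetical with invented values.

```python
import pandas as pd

# Hypothetical toy table with invented values
toy_df = pd.DataFrame({
    'temp': [10.0, 11.5, 12.0],
    'humidity': [80, 75, 70],
    'count': [12, 30, 25]
})

# Single brackets with one name: a one-dimensional Series
single_column = toy_df['count']

# Double brackets (a list of names): a two-dimensional DataFrame,
# even when the list holds only one column
one_column_frame = toy_df[['count']]

print(type(single_column).__name__, single_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```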

### Challenge 9: Create a Sample Dataset
Select only the columns 'datetime', 'temp', 'count' and only the first 168 rows (first week of data).

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a focused dataset for rush hour analysis
rush_hour_analysis = _____[[_____, _____, _____]].head(_____)  # Fill in details
print(f"Rush hour dataset shape: {_____.shape}")  # Fill in variable name
print(_____)  # Print sample data

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When creating focused analysis datasets, use these selection strategies:
- Verify the subset size: 168 hours = 7 days × 24 hours = 1 week
- Combine column and row selection: `df[['col1', 'col2']].head(n)`
- Confirm time coverage: compare the first and last datetime values. As you will see, they do not span exactly one week, which may mean that some hourly records are missing

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a focused dataset for rush hour analysis
rush_hour_analysis = df[['datetime', 'temp', 'count']].head(168)  # Fill in details
print(f"Rush hour dataset shape: {rush_hour_analysis.shape}")  # Fill in variable name
print(rush_hour_analysis)  # Print sample data
```

</details>

---

## Step 10: Understanding Time-Based Data

Transportation data is inherently time-based. Let's work with the datetime information:

In [None]:
# Convert datetime column to pandas datetime format
df['datetime'] = pd.to_datetime(df['datetime'])
print(f"Datetime conversion successful. Type: {df['datetime'].dtype}")

# Ensure chronological order for time-based operations
df = df.sort_values('datetime').reset_index(drop=True)

# Extract useful time components
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Show the first few rows with time components
print("\nData with extracted time components:")
print(df[['datetime', 'hour', 'day_of_week', 'month', 'count']].head())

> Timedelta primer: A Timedelta represents a duration (difference between two timestamps). You'll use `pd.Timedelta(hours=1)` in the next step to flag gaps larger than one hour between records.
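
As a minimal illustration of the primer, subtracting one timestamp from another produces a Timedelta that you can compare against a threshold. The two timestamps below are invented for the example.

```python
import pandas as pd

# Two invented timestamps, two hours apart
earlier = pd.Timestamp('2011-01-01 04:00:00')
later = pd.Timestamp('2011-01-01 06:00:00')

# Subtracting timestamps yields a Timedelta (a duration)
elapsed = later - earlier

# Durations can be compared directly with another Timedelta
is_longer_than_one_hour = elapsed > pd.Timedelta(hours=1)

print(f"Elapsed: {elapsed}")
print(f"Longer than one hour: {is_longer_than_one_hour}")
```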

**What this accomplishes:**
- Converts text dates to actual datetime objects for analysis
- Extracts hour, day, and month for business analysis
- Enables time-based filtering and grouping

These time-based features unlock powerful insights. Hour analysis allows you to identify peak usage times for bike rebalancing, while day analysis helps compare weekday vs. weekend patterns. Additionally, month analysis reveals seasonal trends that are crucial for capacity planning.
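
The `.dt` accessor is easier to trust once you have seen it on dates you can check by hand. This sketch converts two invented timestamps and extracts the same three components as the cell above.

```python
import pandas as pd

# Two invented timestamps stored as text, as they would be after read_csv
raw_times = pd.Series(['2011-01-01 08:00:00', '2011-01-03 17:00:00'])

# Convert text to real datetime values so the .dt accessor becomes available
timestamps = pd.to_datetime(raw_times)

hours = timestamps.dt.hour.tolist()            # hour of day, 0-23
day_names = timestamps.dt.day_name().tolist()  # weekday names in English
months = timestamps.dt.month.tolist()          # month number, 1-12

print(hours, day_names, months)
```

January 1, 2011 fell on a Saturday, so the first day name is a quick sanity check that the conversion read the date correctly.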

### Challenge 10: Find Peak Hour
Use the new 'hour' column to find which hour of the day has the highest average bike rentals.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - find the peak hour for bike rentals
hourly_average = _____.groupby(_____)[_____].mean()  # Fill in DataFrame, grouping column, target column
peak_hour = hourly_average._____()  # Fill in method to find maximum index
peak_rentals = hourly_average._____()  # Fill in method to get maximum value

print(f"Peak hour: {_____}:00")  # Fill in variable name
print(f"Average rentals during peak hour: {_____:.1f}")  # Fill in variable name

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When analyzing time-based patterns, use these grouping and aggregation techniques:
- `df.groupby('hour')['count'].mean()` calculates average by hour
- Use `.idxmax()` to find the index (hour) with maximum value
- Use `.max()` to get the actual maximum value
- Explore other time patterns: `df.groupby('day_of_week')['count'].mean()`
- Consider multiple aggregations: `df.groupby('hour')['count'].agg(['mean', 'std', 'count'])`
- Sort results for easier interpretation: `hourly_average.sort_values(ascending=False)`

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - find the peak hour for bike rentals
hourly_average = df.groupby('hour')['count'].mean()  # Fill in DataFrame, grouping column, target column
peak_hour = hourly_average.idxmax()  # Fill in method to find maximum index
peak_rentals = hourly_average.max()  # Fill in method to get maximum value

print(f"Peak hour: {peak_hour}:00")  # Fill in variable name
print(f"Average rentals during peak hour: {peak_rentals:.1f}")  # Fill in variable name
```

</details>

---

## Step 11: Data Quality Assessment - Missing Data Detection (Empty Cells)

Real-world data often has missing values (empty cells). Let's check if any values are missing from our dataset. Note that we're checking for empty cells here - detecting missing time periods (like skipped hours) requires a different approach that we'll handle in Challenge 11.

In [None]:
# Check for missing data in each column
missing_data = df.isnull().sum()
print("Missing Data Summary:")
print(missing_data)

# Calculate percentage of missing data
missing_percentage = (missing_data / len(df)) * 100
print("\nMissing Data Percentages:")
for column in df.columns:
    if missing_data[column] > 0:
        print(f"{column}: {missing_percentage[column]:.1f}%")
    else:
        print(f"{column}: 0% (Complete)")

**What this code accomplishes:**
- `df.isnull()` creates a DataFrame of True/False values where True indicates a missing value
- `.sum()` counts the True values (which represent missing entries) for each column - since True is treated as 1 and False as 0 in arithmetic operations
- The percentage calculation shows the proportion of missing data relative to total records
- The loop displays results in a business-friendly format, clearly marking complete vs incomplete columns
- Converting to percentages helps understand the severity of missing data

**What this reveals:**
- By running this code, we can see that there is no missing data in the dataset - all columns are complete
- Complete columns can be trusted for all analysis without data quality concerns
- Missing data would require decisions: drop incomplete rows, exclude unreliable columns, or fill gaps using estimates (like averages or trends) - each choice impacts your analysis quality
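
Because the real dataset has no empty cells, the options above are easier to see on a small table that does. The sketch below uses a hypothetical DataFrame with two invented gaps and shows two of the choices: dropping incomplete rows with `dropna()` and filling gaps with column averages via `fillna()`.

```python
import pandas as pd

# Hypothetical toy table with two invented gaps (None becomes NaN)
toy_df = pd.DataFrame({
    'temp': [10.0, None, 12.0, 14.0],
    'count': [12, 30, None, 25]
})

# Count the empty cells per column, as in the cell above
missing_per_column = toy_df.isnull().sum()

# Option 1: drop every row that has at least one empty cell
dropped = toy_df.dropna()

# Option 2: fill each gap with the mean of its own column
filled = toy_df.fillna(toy_df.mean())

print(missing_per_column)
print(f"Rows before: {len(toy_df)}, after dropna: {len(dropped)}")
print(filled)
```

Dropping keeps only fully observed rows but shrinks the table from four rows to two; filling keeps every row but replaces the gaps with estimates.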

### Challenge 11: Identify Missing Datetime Gaps (Skipped Hours)
Find where the gap between consecutive `datetime` values is greater than 1 hour — these indicate missing hourly records in the time series. This approach works because time series data should have consistent intervals. In our case, records are generated hourly, so any gap larger than 1 hour reveals missing records in the sequence.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - detect gaps > 1 hour in the datetime sequence
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values(by='_____').reset_index(drop=True)  # Fill in the time column

time_diffs = df['_____'].diff()  # Fill in the time column
gaps = df[time_diffs > pd.Timedelta(hours=_____)]  # Fill in the gap size

print("Rows that follow a gap > 1 hour:")
print(_____)  # Fill in variable name

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When detecting missing time periods, keep these practices in mind:
- Convert the datetime column to pandas datetime format by using `pd.to_datetime()`
- Ensure the data is sorted by time before calling `.diff()`
- Remember that `.diff()` calculates the time difference between each row and the previous row
- Use `pd.Timedelta(hours=1)` for a clear 1-hour threshold

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - detect gaps > 1 hour in the datetime sequence
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values(by='datetime').reset_index(drop=True)  # Fill in the time column

time_diffs = df['datetime'].diff()  # Fill in the time column
gaps = df[time_diffs > pd.Timedelta(hours=1)]  # Fill in the gap size

print("Rows that follow a gap > 1 hour:")
print(gaps)  # Fill in variable name
```

</details>

---

## Step 12: Basic Data Filtering for Business Insights

Now we'll use data filtering to answer critical business questions that help optimize bike-sharing operations. By creating targeted subsets of our data, we can identify peak demand periods, understand weather impacts on ridership, and develop insights for operational planning:

In [None]:
# Find high-demand periods (above average usage)
average_rentals = df['count'].mean()
high_demand = df[df['count'] > average_rentals]
print(f"High-demand periods: {len(high_demand)} out of {len(df)} total hours")
print(f"That's {len(high_demand)/len(df)*100:.1f}% of all hours")

# Find cold weather operations (temperature below 8°C)
cold_weather = df[df['temp'] < 8]
print(f"\nCold weather operations: {len(cold_weather)} hours")
print(f"Average rentals in cold weather: {cold_weather['count'].mean():.1f}")

# Compare to warm weather (temperature of 26°C or higher)
warm_weather = df[df['temp'] >= 26]
print(f"Average rentals in warm weather: {warm_weather['count'].mean():.1f}")

The analysis reveals clear patterns in bike-sharing demand:

- Out of more than **10,000 recorded hours**, around **40% qualify as high-demand periods**, showing that elevated usage is a regular occurrence rather than an exception
- **Weather plays a particularly strong role**: when temperatures drop below **8°C**, bike rentals average only about **61 per hour**, reflecting how cold conditions discourage riders
- In contrast, when the temperature climbs to **26°C or higher**, average rentals surge to over **277 per hour**, more than **four times the cold-weather figure**

These findings have direct implications for operational planning and resource allocation, since demand in warm conditions is several times higher than in cold conditions.
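
The same cold-versus-warm comparison can also be made in a single pass by labeling each hour with a temperature band. The sketch below is one possible approach, not part of the original analysis: it uses `pd.cut` on a small made-up DataFrame (`toy`, with invented `temp` and `count` values), and the band names `cold`, `mild`, and `warm` are assumptions chosen for illustration. The bin edges reuse the 8°C and 26°C thresholds from the cell above.

```python
import pandas as pd

# Toy data (made up for illustration), using the same column names as the dataset
toy = pd.DataFrame({
    "temp": [5.0, 7.5, 8.0, 15.0, 26.0, 30.0],
    "count": [50, 70, 100, 200, 260, 300],
})

# right=False makes each bin include its left edge:
# cold = temp < 8, mild = 8 <= temp < 26, warm = temp >= 26
bins = [float("-inf"), 8, 26, float("inf")]
labels = ["cold", "mild", "warm"]
toy["temp_band"] = pd.cut(toy["temp"], bins=bins, labels=labels, right=False)

# Average rentals per temperature band
band_means = toy.groupby("temp_band", observed=True)["count"].mean()
print(band_means)

# How many times higher is warm-weather demand than cold-weather demand?
ratio = band_means["warm"] / band_means["cold"]
print(f"Warm-to-cold ratio: {ratio:.1f}x")
```

Compared with writing one boolean filter per condition, banding keeps the thresholds in one place and also reports the in-between hours that the two filters leave out.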

### Challenge 12: Weekend vs. Weekday Analysis
Filter the data to compare average bike rentals on weekends (Saturday, Sunday) versus weekdays.

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - compare weekend vs weekday bike usage
weekend_data = _____[_____['day_of_week'].isin([_____, _____])]  # Fill in details
weekday_data = _____[~_____['day_of_week'].isin([_____, _____])]  # Fill in details

weekend_avg = weekend_data[_____].mean()  # Fill in column name
weekday_avg = weekday_data[_____].mean()  # Fill in column name

print(f"Weekend average rentals: {_____:.1f}")  # Fill in variable name
print(f"Weekday average rentals: {_____:.1f}")  # Fill in variable name
print(f"Difference: {abs(_____ - _____):.1f} rentals per hour")  # Fill in variable names

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When comparing categorical groups like weekends vs weekdays, use these filtering strategies:
- Use `.isin(['value1', 'value2'])` to match multiple values
- Use `~` (tilde) for "not" to get the inverse: `~df['col'].isin(values)`
- Alternative approach: `df['day_of_week'].str.contains('Saturday|Sunday')`
- Consider statistical significance: do the groups have meaningful differences?
- Calculate percentage difference: `((weekend_avg - weekday_avg) / weekday_avg) * 100`
- Explore within-group variation: `weekend_data['count'].std()` vs `weekday_data['count'].std()`

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/dataset/dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Compare weekend vs weekday bike usage
weekend_data = df[df['day_of_week'].isin(['Saturday', 'Sunday'])]  # Rows that fall on Saturday or Sunday
weekday_data = df[~df['day_of_week'].isin(['Saturday', 'Sunday'])]  # ~ inverts the mask: all other days

weekend_avg = weekend_data['count'].mean()  # Average hourly rentals on weekends
weekday_avg = weekday_data['count'].mean()  # Average hourly rentals on weekdays

print(f"Weekend average rentals: {weekend_avg:.1f}")
print(f"Weekday average rentals: {weekday_avg:.1f}")
print(f"Difference: {abs(weekend_avg - weekday_avg):.1f} rentals per hour")
```

</details>
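
The follow-up ideas from the tip (percentage difference and within-group variation) can be sketched as follows. This is a minimal illustration on made-up data, not the dataset's actual results: the `toy` DataFrame, the `day_type` column, and the `summary` table are hypothetical names introduced here. It labels each row as weekend or weekday using `.dt.dayofweek` (Monday is 0, Sunday is 6) and summarizes both groups with one `groupby`.

```python
import pandas as pd

# Toy data (made up for illustration); 2011-01-01 was a Saturday
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 10:00",  # Saturday
        "2011-01-02 10:00",  # Sunday
        "2011-01-03 10:00",  # Monday
        "2011-01-04 10:00",  # Tuesday
    ]),
    "count": [90, 110, 180, 220],
})

# dayofweek: Monday=0 ... Sunday=6, so values 5 and 6 are the weekend
toy["day_type"] = toy["datetime"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Mean and standard deviation of hourly rentals for each group
summary = toy.groupby("day_type")["count"].agg(["mean", "std"])
print(summary)

weekend_avg = summary.loc["weekend", "mean"]
weekday_avg = summary.loc["weekday", "mean"]

# Percentage difference of weekends relative to weekdays
pct_diff = (weekend_avg - weekday_avg) / weekday_avg * 100
print(f"Weekend vs weekday: {pct_diff:+.1f}%")
```

A percentage is often easier to communicate to a client than a raw difference, and the `std` column hints at whether the two groups overlap heavily.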

---

## Summary: Professional Pandas Data Analysis Fundamentals

**What We've Accomplished**: 
- Established a comprehensive pandas environment and data manipulation workflows
- Implemented systematic data loading and exploration methodologies for real transportation data
- Performed data quality assessment with missing data detection protocols
- Created time-based feature extraction and business intelligence filtering frameworks

**Key Technical Skills Mastered**:
- Series and DataFrame creation with meaningful business labeling systems
- CSV data loading and basic analysis for professional client datasets
- Temporal data manipulation with datetime extraction and grouping operations
- Data filtering and aggregation techniques for business insight generation

**Next Steps**: We'll move on to professional data cleaning techniques: handling missing values, identifying outliers, and preparing data so that our transportation datasets meet the quality standards required for predictive modeling and client-ready analysis.

Your bike-sharing client now has a solid data foundation, built with pandas techniques that demonstrate systematic data exploration and business-focused analytical thinking. These are the core competencies that consulting firms expect from junior transportation data analysts!