# Lecture 3: Programming Example - Pandas Fundamentals with Washington D.C. Data

## Introduction: Your First Day as a Data Consultant

Welcome to your first hands-on session as a junior data consultant! Today, you'll work with real Washington D.C. bike-sharing data, learning pandas step-by-step. We'll start with absolute basics and build up to loading and exploring your client's dataset.

Remember: Every line of code serves a business purpose. You're not just learning programming - you're developing the skills to help your bike-sharing client make better decisions.

---

## Step 1: Setting Up Your Data Analysis Environment

Let's start by importing the tools you'll need for data analysis. Think of this like setting up your workbench before starting a project:

In [None]:
# Import pandas - your primary data manipulation tool
import pandas as pd

# Print confirmation
print("Pandas imported successfully!")
print(f"Pandas version: {pd.__version__}")

**What this does:**
- `import pandas as pd` makes the pandas library available with the shorthand "pd"
- The shorthand `pd` is the standard convention used in the pandas documentation and in most pandas code you will read
- Printing the version tells you which release you are running, which matters when behavior differs between versions

**Why this matters for your client:**
Just like a mechanic checks their tools before working on a car, you verify your data analysis tools are ready before handling valuable business data.

---

### Challenge 1: Import Practice
Try importing pandas yourself and check if it worked by running `type(pd)`. You should see that pd is now a module object.

> 💡 **Tip. Click here for a tip.**
>
> When testing if pandas imported correctly, you can use these verification approaches:
> - `type(pd)` shows the object type (should be `<class 'module'>`)
> - `pd.__version__` displays the pandas version number
> - `dir(pd)` lists all available pandas functions and attributes
> - If you get a `NameError`, the import failed and you need to run the import statement first

In [None]:
# Your code here
import pandas as pd
print(type(pd))

**Solution:**

In [None]:
# Your code here
import pandas as pd
print(type(pd))

---

## Step 2: Understanding Series - Single Column Data

Let's start with Series by creating bike rental data for a typical Monday morning. Think of Series as a single column from a spreadsheet:

In [None]:
# Create a Series with hourly bike rentals
morning_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])
print("Morning Bike Rentals:")
print(morning_rentals)

**What this does:**
- `pd.Series([...])` creates a pandas Series object
- The numbers represent bike rentals for each hour (6 AM through 1 PM)
- Pandas automatically assigns index numbers (0, 1, 2, etc.)

**Output you'll see:**
```
0     15
1     23
2     45
3     67
4     89
5    156
6    234
7    287
dtype: int64
```

**Understanding the output:**
- Left column (0, 1, 2...): Index numbers (like row numbers)
- Right column (15, 23, 45...): Actual bike rental counts
- Bottom line: Data type information (int64 means integer numbers)
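The three parts of the printed output are also available directly as attributes. The sketch below assumes only the `morning_rentals` values from this step; the helper variable names are illustrative.

```python
import pandas as pd

morning_rentals = pd.Series([15, 23, 45, 67, 89, 156, 234, 287])

# Left column of the printout: the index (default positions 0..7)
index_labels = list(morning_rentals.index)

# Right column of the printout: the rental counts as a plain Python list
rental_values = morning_rentals.tolist()

# Bottom line of the printout: the data type of the values
value_dtype = morning_rentals.dtype

print(index_labels)
print(rental_values)
print(value_dtype)
```

Inspecting the pieces separately is handy when you want to confirm what pandas inferred before moving on.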

### Challenge 2: Create Your Own Series
Create a Series representing temperature readings for the same morning hours. Use these temperatures: [42, 44, 47, 50, 53, 56, 58, 61]

<details>
<summary>💡 Tip (click to expand)</summary>

> When creating Series objects, consider these best practices:  
> 
> - Use descriptive variable names like `morning_temps` instead of generic names like `data`  
> - Print the Series to verify it contains the expected values and structure  

</details>

In [None]:
# Import pandas for Series creation
import pandas as pd

# Your code here - create a Series called 'morning_temps'
morning_temps = pd.Series(_____) # Fill in the temperature list
print(_____) # Print your Series to verify it worked

**Solution:**

<details>
<summary>💡 Solution (click to expand)</summary>

In [None]:
# Import pandas for Series creation
import pandas as pd

# Create a Series called 'morning_temps'
morning_temps = pd.Series([42, 44, 47, 50, 53, 56, 58, 61])
print(morning_temps) # Print the Series to verify it worked

</details>

---

## Step 3: Adding Meaningful Labels to Series

Raw index numbers (0, 1, 2) aren't very business-friendly. Let's add meaningful labels:

In [None]:
# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)
print("Labeled Morning Bike Rentals:")
print(morning_rentals_labeled)

**What this does:**
- `index=hour_labels` replaces default numbers with business-meaningful labels
- Now each rental count is clearly connected to its time period

**Output you'll see:**
```
6 AM      15
7 AM      23
8 AM      45
9 AM      67
10 AM     89
11 AM    156
12 PM    234
1 PM     287
dtype: int64
```

**Business value:**
Now when you show this to your client, they immediately understand that rentals climb through the morning and peak at 1 PM (287), with 12 PM close behind (234), which is consistent with lunch-time bike usage.
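Rather than reading the peak off the printout by eye, you can ask pandas for it. This is a small sketch using the same labeled Series; `busiest_hour` and `busiest_count` are illustrative names.

```python
import pandas as pd

hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287],
                                    index=hour_labels)

# idxmax() returns the label of the largest value; max() returns the value itself
busiest_hour = morning_rentals_labeled.idxmax()
busiest_count = morning_rentals_labeled.max()

print(f"Busiest hour: {busiest_hour} ({busiest_count} rentals)")
```

Because the index carries business labels, the answer comes back as a time of day rather than a row number.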

### Challenge 3: Access Specific Data Points
Using your labeled Series, find the bike rentals at 9 AM. Use this syntax: `morning_rentals_labeled['9 AM']`

> 💡 **Tip. Click here for a tip.**
> 
> When accessing Series data by label, keep these techniques in mind:
> - Use square brackets with the exact label: `series_name['label']`
> - Check available labels with `series_name.index` to see all options
> - Access multiple values: `series_name[['9 AM', '10 AM']]` (note double brackets)
> - Use `.loc[]` for explicit label-based selection: `series_name.loc['9 AM']`
> - Be careful with exact spelling and spacing in labels to avoid KeyError

In [None]:
# Import pandas and create the labeled Series
import pandas as pd

# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)

# Your code here - access the 9 AM value from the labeled Series
nine_am_rentals = morning_rentals_labeled[_____] # Fill in the correct label
print(f"Bike rentals at 9 AM: {_____}") # Print the result

**Solution:**

In [None]:
# Import pandas and create the labeled Series
import pandas as pd

# Create Series with meaningful hour labels
hour_labels = ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM', '11 AM', '12 PM', '1 PM']
morning_rentals_labeled = pd.Series([15, 23, 45, 67, 89, 156, 234, 287], 
                                   index=hour_labels)

# Your code here - access the 9 AM value from the labeled Series
nine_am_rentals = morning_rentals_labeled['9 AM'] # Fill in the correct label
print(f"Bike rentals at 9 AM: {nine_am_rentals}") # Print the result

---

## Step 4: Creating Your First DataFrame - Complete Business Data

Now let's combine multiple pieces of information into a DataFrame. Think of this as creating a complete spreadsheet with multiple columns:

In [None]:
# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [42, 44, 47, 50, 53],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

print("Complete Bike Operations Data:")
print(bike_operations_data)

**What this does:**
- `pd.DataFrame({...})` creates a DataFrame with multiple columns
- Each key ('hour', 'temperature', etc.) becomes a column name
- Each list becomes the values for that column
- All rows stay aligned (first hour corresponds to first temperature, etc.)

**Output you'll see:**
```
    hour  temperature  bike_rentals weather_condition
0   6 AM           42            15             Clear
1   7 AM           44            23             Clear
2   8 AM           47            45     Partly Cloudy
3   9 AM           50            67             Clear
4  10 AM           53            89             Clear
```

**Business insight:**
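Even in this small sample you can see a pattern: bike rentals rise as the morning goes on and the temperature climbs. Five rows are not enough to draw conclusions, but this is the kind of relationship you will test on the full dataset.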
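One way to put a number on that visual impression is a correlation coefficient. The sketch below uses only the five-row example from this step, so treat the result as an illustration of the method, not as evidence: with so few rows, and with temperature and hour rising together, it cannot separate the two effects.

```python
import pandas as pd

bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [42, 44, 47, 50, 53],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Pearson correlation between two numeric columns (ranges from -1 to 1)
temp_rental_corr = bike_operations_data['temperature'].corr(
    bike_operations_data['bike_rentals']
)
print(f"Temperature vs. rentals correlation: {temp_rental_corr:.3f}")
```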

### Challenge 4: Add a New Column
Add a column called 'user_satisfaction' with values [3.2, 3.5, 3.8, 4.1, 4.3] representing customer satisfaction ratings.

> 💡 **Tip. Click here for a tip.**
> 
> When adding new columns to DataFrames, follow these best practices:
> - Ensure the new data list has the same length as existing rows: `len(new_data) == len(df)`
> - Use descriptive column names that clearly indicate what the data represents
> - Verify the addition worked: `df.columns` shows all column names including the new one
> - Check data types: `df.dtypes` to ensure the new column has appropriate type (float64 for ratings)
> - You can also add columns using `df.assign(column_name=values)` for a more functional approach

In [None]:
# Import pandas and create the DataFrame
import pandas as pd

# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [42, 44, 47, 50, 53],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Your code here - add a new column with satisfaction ratings
bike_operations_data[_____] = [3.2, 3.5, 3.8, 4.1, 4.3] # Fill in column name
print(_____) # Print the updated DataFrame

**Solution:**

In [None]:
# Import pandas and create the DataFrame
import pandas as pd

# Create a comprehensive DataFrame with multiple variables
bike_operations_data = pd.DataFrame({
    'hour': ['6 AM', '7 AM', '8 AM', '9 AM', '10 AM'],
    'temperature': [42, 44, 47, 50, 53],
    'bike_rentals': [15, 23, 45, 67, 89],
    'weather_condition': ['Clear', 'Clear', 'Partly Cloudy', 'Clear', 'Clear']
})

# Your code here - add a new column with satisfaction ratings
bike_operations_data['user_satisfaction'] = [3.2, 3.5, 3.8, 4.1, 4.3] # Fill in column name
print(bike_operations_data) # Print the updated DataFrame

---

## Step 5: Loading Real Washington D.C. Dataset

Now for the real challenge - loading your client's actual historical data. This is where professional consulting begins:

In [None]:
# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Confirm successful loading
print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")

**What this does:**
- `pd.read_csv()` reads data from a CSV file into a DataFrame
- CSV (Comma-Separated Values) is the standard format for sharing tabular data
- The shape tells you how much data you have to work with

**Professional tip:**
Always check the shape immediately after loading - it confirms the file loaded correctly and gives you a sense of your dataset size.
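If you want to try `pd.read_csv()` without the client file on hand, you can feed it CSV text held in memory. The sample below is hypothetical: the column names follow this lesson, but the three rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
sample_csv = io.StringIO(
    "datetime,temp,humidity,count\n"
    "2011-01-01 00:00:00,50.0,80,16\n"
    "2011-01-01 01:00:00,49.0,78,40\n"
    "2011-01-01 02:00:00,48.5,79,32\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple, so it unpacks into two variables
n_rows, n_cols = sample_df.shape
print(f"Shape: {n_rows} rows × {n_cols} columns")
```

The same shape check applies unchanged when you load the real file from disk.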

### Challenge 5: Explore the Column Names
Print the column names using `df.columns` to see what variables are available in your dataset.

> 💡 **Tip. Click here for a tip.**
> 
> When exploring column names in a new dataset, use these investigation techniques:
> - `df.columns` returns an Index object with all column names
> - `list(df.columns)` converts to a regular Python list for easier reading
> - `len(df.columns)` tells you how many variables you have to work with
> - Look for patterns in naming conventions to understand data structure
> - Identify key business variables (dates, counts, categories) vs. supporting variables

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - explore the available columns
print("Available columns:")
print(list(_____.columns)) # Fill in the DataFrame name

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - explore the available columns
print("Available columns:")
print(list(df.columns)) # Fill in the DataFrame name

---

## Step 6: First Look at Real Transportation Data

Let's examine the first few rows to understand the data structure:

In [None]:
# Display the first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

**What this shows:**
- `head()` displays the first 5 rows by default
- You'll see actual bike-sharing data with timestamps, weather, and usage counts
- Each row represents one hour of bike-sharing operations

**Understanding the real data:**
- `datetime`: When this data was recorded
- `season`, `holiday`, `workingday`: Operational context
- `weather`, `temp`, `humidity`, `windspeed`: Weather conditions
- `casual`, `registered`, `count`: Different types of users and total rentals

This is the foundation of all your future analysis for this client.

### Challenge 6: Look at the Last Few Rows
Use `df.tail()` to see the last 5 rows of the dataset. This helps verify you have complete data coverage.

> 💡 **Tip. Click here for a tip.**
> 
> When examining the end of your dataset, consider these data quality checks:
> - Compare last row's datetime to first row's datetime to understand time coverage
> - Check if the last rows have complete data or if there are missing values
> - Use `df.tail(10)` to see more rows if you want a larger sample
> - Look for any unusual patterns or data entry errors in the final records
> - Verify that the data collection didn't stop abruptly mid-period

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - examine the last few rows
print("Last 5 rows of the dataset:")
print(_____._____)  # Fill in the DataFrame name and method

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - examine the last few rows
print("Last 5 rows of the dataset:")
print(df.tail())  # Fill in the DataFrame name and method

---

## Step 7: Understanding Your Dataset Size and Structure

Professional data analysis requires understanding exactly what you're working with:

In [None]:
# Get detailed information about the dataset
print("Dataset Information:")
print(f"Total records: {len(df)}")
print(f"Total variables: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# Show data types for each column
print("\nData Types:")
print(df.dtypes)

**What this tells you:**
- **Total records**: How much historical data your client has collected
- **Total variables**: How many different factors you can analyze
- **Memory usage**: Whether your computer can handle this dataset efficiently
- **Data types**: What kind of analysis you can perform on each variable

**Business implications:**
More historical data means more reliable predictions. The variety of variables (weather, time, user types) means you can build sophisticated demand forecasting models.
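pandas can also produce most of this report in one call with `DataFrame.info()`, which lists each column's non-null count and data type along with memory usage. The sketch below runs it on a small hypothetical frame (invented values) and captures the report as text so you can inspect it.

```python
import io
import pandas as pd

# Hypothetical sample with one missing temperature
sample_df = pd.DataFrame({
    'datetime': ['2011-01-01 00:00:00', '2011-01-01 01:00:00', '2011-01-01 02:00:00'],
    'temp': [50.0, None, 52.5],
    'count': [16, 40, 32],
})

# info() prints by default; passing buf captures the report as a string instead
buffer = io.StringIO()
sample_df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

In a notebook you would normally just call `df.info()` and read the printed output.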

### Challenge 7: Calculate Time Coverage
The dataset contains hourly data. Calculate how many days of data you have by dividing the total rows by 24.

> 💡 **Tip. Click here for a tip.**
> 
> When calculating time coverage for time series data, consider these analysis approaches:
> - Use `.1f` formatting to display days with one decimal place for readability
> - Verify your calculation makes sense: `total_days * 24` should equal `len(df)`
> - Calculate weeks as well: `total_days / 7` for business planning context
> - Consider if you have complete days: `len(df) % 24` shows any partial days
> - More sophisticated approach: use the actual datetime range. The column is still text at this point, so convert it first: `dates = pd.to_datetime(df['datetime'])`, then `(dates.max() - dates.min()).days`

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - calculate days of coverage
total_days = len(_____) / _____ # Fill in DataFrame name and divisor
print(f"Dataset covers approximately {_____:.1f} days") # Fill in variable name

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - calculate days of coverage
total_days = len(df) / 24 # Fill in DataFrame name and divisor
print(f"Dataset covers approximately {total_days:.1f} days") # Fill in variable name
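Dividing rows by 24 only matches the calendar when every hour is present. The sketch below builds a hypothetical set of hourly timestamps with a deliberate gap (two full days, a five-day pause, then one more full day) to show how the row-based estimate and the calendar span can disagree. The dates are invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps: hours 0-47 (two days), then hours 168-191 (one day, a week in)
start = pd.Timestamp("2011-01-01")
hour_offsets = list(range(48)) + list(range(7 * 24, 8 * 24))
sample_df = pd.DataFrame({'datetime': start + pd.to_timedelta(hour_offsets, unit="h")})

# Estimate 1: rows divided by 24 (counts recorded hours only)
days_by_row_count = len(sample_df) / 24

# Estimate 2: calendar span between first and last timestamp
calendar_span = sample_df['datetime'].max() - sample_df['datetime'].min()

# Estimate 3: number of distinct calendar dates that appear in the data
distinct_days = sample_df['datetime'].dt.normalize().nunique()

print(f"Rows / 24: {days_by_row_count:.1f} days")
print(f"Calendar span: {calendar_span.days} days")
print(f"Distinct dates: {distinct_days}")
```

When these numbers differ on real data, it is a signal that some periods were not recorded.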

---

## Step 8: Basic Statistical Summary

Understanding the data distribution helps identify patterns and potential issues:

In [None]:
# Generate statistical summary for numerical variables
print("Statistical Summary:")
print(df.describe())

**What this shows:**
- **count**: How many non-missing values exist for each variable
- **mean**: Average values (useful for understanding typical conditions)
- **std**: Standard deviation (shows how much values vary)
- **min/max**: Range of values (helps identify outliers or impossible values)
- **25%, 50%, 75%**: Quartiles (show data distribution)

**Professional insight:**
If minimum bike counts are 1 and maximum is 977, that's a huge range! This suggests your client experiences very different demand conditions that you'll need to understand and predict.
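The table returned by `describe()` is itself a DataFrame, so you can pull individual statistics out of it with `.loc`. A minimal sketch on hypothetical values (chosen so the minimum is 1 and the maximum is 977):

```python
import pandas as pd

# Hypothetical sample values for illustration
sample_df = pd.DataFrame({
    'temp': [45.0, 52.0, 68.0, 75.0],
    'count': [1, 40, 250, 977],
})

summary = sample_df.describe()

# Rows of the summary are statistic names; columns are the original variables
count_min = summary.loc['min', 'count']
count_max = summary.loc['max', 'count']
count_range = count_max - count_min

print(f"Rentals range from {count_min:.0f} to {count_max:.0f} (spread of {count_range:.0f})")
```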

### Challenge 8: Focus on Key Business Metrics
Create a summary focusing only on the key business variables: temperature, humidity, and bike count.

> 💡 **Tip. Click here for a tip.**
> 
> When focusing on specific business metrics, use these analytical techniques:
> - Select columns using double brackets: `df[['col1', 'col2']]` to maintain DataFrame structure
> - Compare ranges across variables to understand which have more variation
> - Look for correlations: do higher temperatures generally coincide with higher bike counts?
> - Check for outliers: are max values realistic or potentially data entry errors?
> - Consider business thresholds: what temperature ranges are most relevant for operations?

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a summary of key metrics
key_metrics = _____[[_____, _____, _____]].describe() # Fill in DataFrame and column names
print("Key Business Metrics Summary:")
print(_____) # Print the summary

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a summary of key metrics
key_metrics = df[['temp', 'humidity', 'count']].describe() # Fill in DataFrame and column names
print("Key Business Metrics Summary:")
print(key_metrics) # Print the summary

---

## Step 9: Selecting Specific Data for Analysis

Often you need to focus on specific parts of your dataset. Let's learn several ways to select data:

In [None]:
# Select a single column (bike counts)
bike_counts = df['count']
print(f"Bike counts - Type: {type(bike_counts)}")
print(f"Average hourly rentals: {bike_counts.mean():.1f}")

# Select multiple columns for weather analysis
weather_data = df[['temp', 'humidity', 'windspeed']]
print(f"\nWeather data shape: {weather_data.shape}")
print(weather_data.head(3))

# Select first 100 rows for initial analysis
sample_data = df.head(100)
print(f"\nSample data covers first {len(sample_data)} hours")

**What this demonstrates:**
- **Single column selection**: Returns a Series (one-dimensional)
- **Multiple column selection**: Returns a DataFrame (two-dimensional)
- **Row selection**: Gets a subset of the full dataset

**Professional application:**
You might analyze just weather data to understand seasonal patterns, or focus on the first few months to understand how the bike-sharing system performed during its early operations.
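The difference between single and double brackets is easy to miss, so here is a small sketch on hypothetical data. It also shows `.loc`, which selects rows and columns in one step.

```python
import pandas as pd

# Hypothetical sample values for illustration
sample_df = pd.DataFrame({
    'temp': [45.0, 52.0, 68.0, 75.0],
    'humidity': [80, 75, 60, 55],
    'count': [16, 40, 250, 310],
})

count_series = sample_df['count']     # single brackets: a Series
count_frame = sample_df[['count']]    # double brackets: a one-column DataFrame

# .loc[rows, columns]; label slices such as 0:1 include both endpoints
first_two = sample_df.loc[0:1, ['temp', 'count']]

print(type(count_series).__name__, type(count_frame).__name__)
print(first_two)
```

Keeping the DataFrame structure (double brackets) matters when a later step expects columns, for example when calling `describe()` on several variables.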

### Challenge 9: Create a Rush Hour Analysis Dataset
Select only the columns 'datetime', 'temp', 'count' and only the first 168 rows (first week of data).

> 💡 **Tip. Click here for a tip.**
> 
> When creating focused analysis datasets, use these selection strategies:
> - Combine column and row selection: `df[['col1', 'col2']].head(n)`
> - Verify the subset size: 168 hours = 7 days × 24 hours = 1 week
> - Check that you have the right columns: `rush_hour_analysis.columns`
> - Confirm time coverage: compare first and last datetime values
> - Consider if this sample represents typical operations or includes holidays/special events

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a focused dataset for rush hour analysis
rush_hour_analysis = _____[[_____, _____, _____]].head(_____) # Fill in details
print(f"Rush hour dataset shape: {_____.shape}") # Fill in variable name
print(_____.head()) # Print first few rows

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - create a focused dataset for rush hour analysis
rush_hour_analysis = df[['datetime', 'temp', 'count']].head(168) # Fill in details
print(f"Rush hour dataset shape: {rush_hour_analysis.shape}") # Fill in variable name
print(rush_hour_analysis.head()) # Print first few rows

---

## Step 10: Data Quality Assessment - Missing Data Detection

Real-world data is often incomplete. Let's check for missing values:

In [None]:
# Check for missing data in each column
missing_data = df.isnull().sum()
print("Missing Data Summary:")
print(missing_data)

# Calculate percentage of missing data
missing_percentage = (missing_data / len(df)) * 100
print("\nMissing Data Percentages:")
for column in df.columns:
    if missing_data[column] > 0:
        print(f"{column}: {missing_percentage[column]:.1f}%")
    else:
        print(f"{column}: 0% (Complete)")

**What this reveals:**
- `isnull().sum()` counts missing values in each column
- Converting to percentages helps understand the severity of missing data
- Complete columns need no missing-value handling, although their values can still contain errors
- Columns with missing data need special handling

**Professional implication:**
If weather data is missing for certain periods, you'll need to either exclude those periods or find ways to estimate the missing values. This affects the reliability of your predictions.
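The percentage calculation can also be written without a loop, because the mean of a True/False column is the fraction of True values. The sketch below uses a hypothetical frame with one missing temperature; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one of four temperature readings is missing
sample_df = pd.DataFrame({
    'temp': [50.0, np.nan, 55.0, 58.0],
    'humidity': [80, 75, 70, 65],
    'count': [16, 40, 32, 13],
})

# isnull() gives True/False per cell; mean() per column is the missing fraction
missing_pct = sample_df.isnull().mean() * 100

# Keep only the column names whose missing percentage is exactly zero
complete_columns = missing_pct[missing_pct == 0].index.tolist()

print(missing_pct)
print(f"Complete columns: {complete_columns}")
```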

### Challenge 10: Identify the Most Complete Variables
Find which columns have zero missing values - these are your most reliable variables for analysis.

> 💡 **Tip. Click here for a tip.**
> 
> When identifying complete variables, use these data quality approaches:
> - Alternative approach: `df.isnull().sum() == 0` returns a boolean Series that is `True` for each complete column
> - Get complete columns directly: `complete_cols = df.columns[df.isnull().sum() == 0].tolist()`
> - Count complete columns: `(df.isnull().sum() == 0).sum()`
> - Prioritize these complete columns for initial analysis and model building
> - Consider why some columns have missing data - is it systematic or random?

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - find columns with no missing values
complete_columns = []
for column in _____.columns: # Fill in DataFrame name
    if _____[column].isnull().sum() == _____: # Fill in DataFrame name and comparison value
        complete_columns.append(column)

print("Completely populated columns:")
for col in complete_columns:
    print(f"- {col}")

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Your code here - find columns with no missing values
complete_columns = []
for column in df.columns: # Fill in DataFrame name
    if df[column].isnull().sum() == 0: # Fill in DataFrame name and comparison value
        complete_columns.append(column)

print("Completely populated columns:")
for col in complete_columns:
    print(f"- {col}")

---

## Step 11: Understanding Time-Based Data

Transportation data is inherently time-based. Let's work with the datetime information:

In [None]:
# Convert datetime column to pandas datetime format
df['datetime'] = pd.to_datetime(df['datetime'])
print(f"Datetime conversion successful. Type: {df['datetime'].dtype}")

# Extract useful time components
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Show the first few rows with time components
print("\nData with extracted time components:")
print(df[['datetime', 'hour', 'day_of_week', 'month', 'count']].head())

**What this accomplishes:**
- Converts text dates to actual datetime objects for analysis
- Extracts hour, day, and month for business analysis
- Enables time-based filtering and grouping

**Business applications:**
- **Hour analysis**: Identify peak usage times for bike rebalancing
- **Day analysis**: Compare weekday vs. weekend patterns
- **Month analysis**: Understand seasonal trends for capacity planning
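Real timestamp columns sometimes contain entries that are not dates at all. One option, sketched below on hypothetical values, is `errors='coerce'`, which turns unparseable entries into `NaT` (pandas' missing-datetime marker) instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: two valid timestamps and one bad entry
raw_timestamps = pd.Series([
    '2011-01-01 08:00:00',
    '2011-01-03 17:00:00',
    'not recorded',
])

# Unparseable entries become NaT instead of stopping the conversion
parsed = pd.to_datetime(raw_timestamps, errors='coerce')

unparsed_count = parsed.isna().sum()
first_day_name = parsed.dt.day_name().iloc[0]
second_hour = parsed.dt.hour.iloc[1]

print(parsed)
print(f"Entries that could not be parsed: {unparsed_count}")
```

Counting the `NaT` values afterwards tells you how many rows need attention before any time-based analysis.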

### Challenge 11: Find Peak Hour
Use the new 'hour' column to find which hour of the day has the highest average bike rentals.

> 💡 **Tip. Click here for a tip.**
> 
> When analyzing time-based patterns, use these grouping and aggregation techniques:
> - `df.groupby('hour')['count'].mean()` calculates average by hour
> - Use `.idxmax()` to find the index (hour) with maximum value
> - Use `.max()` to get the actual maximum value
> - Explore other time patterns: `df.groupby('day_of_week')['count'].mean()`
> - Consider multiple aggregations: `df.groupby('hour')['count'].agg(['mean', 'std', 'count'])`
> - Sort results for easier interpretation: `hourly_average.sort_values(ascending=False)`

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - find the peak hour for bike rentals
hourly_average = _____.groupby(_____)[_____].mean() # Fill in DataFrame, grouping column, target column
peak_hour = hourly_average._____() # Fill in method to find maximum index
peak_rentals = hourly_average._____() # Fill in method to get maximum value

print(f"Peak hour: {_____}:00") # Fill in variable name
print(f"Average rentals during peak hour: {_____:.1f}") # Fill in variable name

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - find the peak hour for bike rentals
hourly_average = df.groupby('hour')['count'].mean() # Fill in DataFrame, grouping column, target column
peak_hour = hourly_average.idxmax() # Fill in method to find maximum index
peak_rentals = hourly_average.max() # Fill in method to get maximum value

print(f"Peak hour: {peak_hour}:00") # Fill in variable name
print(f"Average rentals during peak hour: {peak_rentals:.1f}") # Fill in variable name
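To check your understanding of `groupby`, it helps to run it on data small enough to verify by hand. The values below are hypothetical, chosen so that 17:00 is clearly the peak.

```python
import pandas as pd

# Hypothetical hourly records: two observations each for 8:00, 17:00 and 23:00
sample_df = pd.DataFrame({
    'hour': [8, 8, 17, 17, 23, 23],
    'count': [300, 340, 400, 460, 20, 40],
})

# One row per hour, with several statistics computed side by side
hourly_stats = sample_df.groupby('hour')['count'].agg(['mean', 'std', 'count'])

# idxmax() on the 'mean' column returns the hour with the highest average
peak_hour = hourly_stats['mean'].idxmax()

print(hourly_stats)
print(f"Peak hour: {peak_hour}:00")
```

Looking at `std` and `count` next to the mean shows how much each hour varies and how many records support each average.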

---

## Step 12: Basic Data Filtering for Business Insights

Let's filter the data to answer specific business questions:

In [None]:
# Find high-demand periods (above average usage)
average_rentals = df['count'].mean()
high_demand = df[df['count'] > average_rentals]
print(f"High-demand periods: {len(high_demand)} out of {len(df)} total hours")
print(f"That's {len(high_demand)/len(df)*100:.1f}% of all hours")

# Find cold weather operations (temperature below 50)
# Note: 50 assumes 'temp' is recorded in °F; if your file stores °C, use a lower cutoff such as 10
cold_weather = df[df['temp'] < 50]
print(f"\nCold weather operations: {len(cold_weather)} hours")
print(f"Average rentals in cold weather: {cold_weather['count'].mean():.1f}")

# Compare to warm weather
warm_weather = df[df['temp'] >= 50]
print(f"Average rentals in warm weather: {warm_weather['count'].mean():.1f}")

**Business insights generated:**
- **High-demand identification**: Helps predict when extra bikes will be needed
- **Weather impact analysis**: Shows how temperature affects demand
- **Operational planning**: Cold weather requires different preparation than warm weather
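Filters can be combined to answer narrower questions. The sketch below uses hypothetical values; note that each condition needs its own parentheses, and that pandas uses `&` for "and" and `|` for "or" rather than the Python keywords.

```python
import pandas as pd

# Hypothetical sample values for illustration
sample_df = pd.DataFrame({
    'temp': [40, 45, 60, 70, 75],
    'count': [20, 150, 90, 300, 310],
})

average_rentals = sample_df['count'].mean()

# Cold hours that were still busy (more than 100 rentals)
cold_but_busy = sample_df[(sample_df['temp'] < 50) & (sample_df['count'] > 100)]

# Warm hours with above-average demand
warm_high_demand = sample_df[
    (sample_df['temp'] >= 50) & (sample_df['count'] > average_rentals)
]

print(f"Cold but busy hours: {len(cold_but_busy)}")
print(f"Warm high-demand hours: {len(warm_high_demand)}")
```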

### Challenge 12: Weekend vs. Weekday Analysis
Filter the data to compare average bike rentals on weekends (Saturday, Sunday) versus weekdays.

> 💡 **Tip. Click here for a tip.**
> 
> When comparing categorical groups like weekends vs weekdays, use these filtering strategies:
> - Use `.isin(['value1', 'value2'])` to match multiple values
> - Use `~` (tilde) for "not" to get the inverse: `~df['col'].isin(values)`
> - Alternative approach: `df['day_of_week'].str.contains('Saturday|Sunday')`
> - Consider statistical significance: do the groups have meaningful differences?
> - Calculate percentage difference: `((weekend_avg - weekday_avg) / weekday_avg) * 100`
> - Explore within-group variation: `weekend_data['count'].std()` vs `weekday_data['count'].std()`

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - compare weekend vs weekday bike usage
weekend_data = _____[_____['day_of_week'].isin([_____, _____])] # Fill in details
weekday_data = _____[~_____['day_of_week'].isin([_____, _____])] # Fill in details

weekend_avg = weekend_data[_____].mean() # Fill in column name
weekday_avg = weekday_data[_____].mean() # Fill in column name

print(f"Weekend average rentals: {_____:.1f}") # Fill in variable name
print(f"Weekday average rentals: {_____:.1f}") # Fill in variable name
print(f"Difference: {abs(_____ - _____):.1f} rentals per hour") # Fill in variable names

**Solution:**

In [None]:
# Import required libraries and load data
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_file_path = "dataset.csv"
df = pd.read_csv(data_file_path)

# Convert datetime column and extract time components
df['datetime'] = pd.to_datetime(df['datetime'])
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['month'] = df['datetime'].dt.month

# Your code here - compare weekend vs weekday bike usage
weekend_data = df[df['day_of_week'].isin(['Saturday', 'Sunday'])] # Fill in details
weekday_data = df[~df['day_of_week'].isin(['Saturday', 'Sunday'])] # Fill in details

weekend_avg = weekend_data['count'].mean() # Fill in column name
weekday_avg = weekday_data['count'].mean() # Fill in column name

print(f"Weekend average rentals: {weekend_avg:.1f}") # Fill in variable name
print(f"Weekday average rentals: {weekday_avg:.1f}") # Fill in variable name
print(f"Difference: {abs(weekend_avg - weekday_avg):.1f} rentals per hour") # Fill in variable names
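An alternative to filtering twice is to label every row and let `groupby` do the comparison. This sketch uses four hypothetical records (a Saturday, a Sunday, a Monday and a Tuesday); the `day_type` column name is an assumption introduced here.

```python
import pandas as pd

# Hypothetical records: 2011-01-01 is a Saturday, 2011-01-03 is a Monday
sample_df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2011-01-01 10:00', '2011-01-02 10:00',
        '2011-01-03 10:00', '2011-01-04 10:00',
    ]),
    'count': [100, 120, 200, 220],
})

# dayofweek numbers Monday as 0 through Sunday as 6, so 5 and 6 are the weekend
is_weekend = sample_df['datetime'].dt.dayofweek >= 5
sample_df['day_type'] = is_weekend.map({True: 'weekend', False: 'weekday'})

avg_by_day_type = sample_df.groupby('day_type')['count'].mean()
weekend_avg = avg_by_day_type.loc['weekend']
weekday_avg = avg_by_day_type.loc['weekday']

print(avg_by_day_type)
```

The grouped version scales more easily if you later want to compare more than two categories, such as each day of the week.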

---

## Summary: Professional Pandas Data Analysis Fundamentals

**What We've Accomplished**: 
- Set up pandas and verified the installation
- Built Series and DataFrames by hand, then loaded and explored real transportation data from a CSV file
- Assessed data quality by counting missing values in each column
- Extracted hour, weekday, and month from timestamps and filtered rows to answer business questions

**Key Technical Skills Mastered**:
- Series and DataFrame creation with meaningful business labels
- CSV loading and inspection of shape, columns, data types, and summary statistics
- Datetime conversion, time-component extraction, and grouping
- Boolean filtering and aggregation for business insights

**Next Steps**: We'll move on to data cleaning: handling missing values, identifying outliers, and preparing the transportation dataset for predictive modeling and client-ready analysis.

Your bike-sharing client now has a solid data foundation. The systematic exploration and business-focused thinking you practiced here are the core competencies that consulting firms expect from junior transportation data analysts!