# Lecture 4: Programming Example - Data Quality Assessment and Cleaning

## Introduction: Cleaning Real Transportation Data

Welcome back, junior data consultant! Your client was impressed with your initial work and now they need you to ensure their Washington D.C. bike-sharing dataset is ready for business predictions. Today, you'll learn to be a data detective - identifying problems, applying fixes, and validating your work.

Think of data cleaning like being a mechanic inspecting a car before a long journey. You need to check every component, fix what's broken, and ensure everything runs smoothly. Your client is counting on clean, reliable data for their million-dollar bike expansion decisions.

> **🚀 Interactive Learning Alert**
> 
> This is a hands-on data cleaning tutorial with detective work and problem-solving. For the best experience:
>
> - **Click "Open in Colab"** at the bottom to run code interactively
> - **Execute each code cell** by pressing **Shift + Enter**
> - **Complete the challenges** to practice your data cleaning skills
> - **Think like a consultant** - every decision impacts client trust

---

## Step 1: Loading Your Dataset and Getting the First Look

Just like a mechanic needs the right tools before inspecting a car, you need to set up your data cleaning environment and load your client's data. Let's start by importing essential libraries and loading the bike-sharing dataset:

In [None]:
# Import pandas - your primary data manipulation tool
import pandas as pd
# Import numpy - for mathematical operations and handling special values
import numpy as np

# Load the Washington D.C. bike-sharing dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Get your first look at what you're working with
print("Dataset Overview:")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")

**What this does:**
- **Libraries**: `pandas` (as `pd`) is your Swiss Army knife for data manipulation; `numpy` (as `np`) helps with mathematical operations and missing data handling
- **Dataset loading**: Brings your client's bike-sharing data into your workspace for analysis
- **Shape overview**: Shows how many hours of data you have (rows) and how many variables you need to check (columns)

More data means more reliable insights, but also more potential quality issues to find and fix.

---

### Challenge 1: Get the Complete Data Health Report
Your client wants to know: "What exactly is in this dataset?" Use `df.info()` to create a comprehensive health report showing data types and completeness.

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - generate the complete data health report
print("Dataset Health Report:")
_____.info()  # Fill in the DataFrame name (info() prints the report itself and returns None)

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


The `df.info()` method is like getting a complete medical checkup for your data. Look for:
- **Non-null counts**: Should match total rows if data is complete
- **Data types**: Numbers should be `int64` or `float64`, not `object`
- **Memory usage**: Helps you know if your computer can handle the data

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - generate the complete data health report
print("Dataset Health Report:")
df.info()  # info() prints the report itself; wrapping it in print() would add a stray "None"
```

</details>
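
One thing the health report can reveal is a numeric column stored as text, which usually means a non-numeric entry slipped in. The sketch below uses a small made-up frame (the values and the `"n/a"` marker are assumptions for illustration, not taken from the client's dataset) to show how `pd.to_numeric` with `errors="coerce"` exposes the offending rows.

```python
import pandas as pd

# Toy data (hypothetical): one humidity reading was typed as text
toy = pd.DataFrame({
    "temp": [9.8, 10.7, 11.5],
    "humidity": ["81", "n/a", "75"],
})

# humidity is listed as a text type (object in pandas 2.x), not a number
print(toy.dtypes)

# Convert to numbers; anything unparseable becomes NaN instead of raising
humidity_numeric = pd.to_numeric(toy["humidity"], errors="coerce")

# Rows that failed conversion are the ones to investigate
bad_rows = toy[humidity_numeric.isna()]
print(bad_rows)
```

Coercing to `NaN` rather than raising an error lets you count and inspect the problem rows before deciding how to fix them.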

---

## Step 2: Checking for Out-of-Range Values

Think of this like a car inspection - you need to know if the readings make sense. A thermometer can't read 200°C in Washington D.C., and you can't have negative bike rentals! We'll define reasonable ranges that represent **physical limits** (temperature and humidity can't exceed natural boundaries), **business limits** (bike stations have capacity constraints), and **sensor limits** (equipment has measurement ranges). Values outside these ranges might indicate broken sensors, data entry errors, or extraordinary events that need investigation.

In [None]:
# First, let's see what our data actually looks like
print("Current data ranges:")
print(df[['temp', 'humidity', 'windspeed', 'count']].describe())

print("\n" + "="*50)

# Define what "normal" looks like for Washington D.C. bike data
reasonable_ranges = {
    'temp': (-20, 45),      # Temperature in Celsius (D.C. winter to summer)
    'atemp': (-20, 50),     # How temperature feels to humans
    'humidity': (0, 100),   # Humidity percentage (0% = desert, 100% = fog)
    'windspeed': (0, 50),   # Wind speed km/h (50+ is severe storm)
    'count': (0, 1000),     # Total bike rentals per hour
    'casual': (0, 500),     # Casual user rentals per hour
    'registered': (0, 900)   # Registered user rentals per hour
}

print("Reasonable Range Definitions:")
for variable, (min_val, max_val) in reasonable_ranges.items():
    print(f"{variable}: {min_val} to {max_val}")

# Now check if any actual values fall outside our reasonable ranges
print("\nRange Validation Results:")
for column, (min_val, max_val) in reasonable_ranges.items():
    if column in df.columns:
        # Get the actual data range
        actual_min = df[column].min()
        actual_max = df[column].max()
        
        # Count violations (values outside reasonable range)
        below_range = (df[column] < min_val).sum()
        above_range = (df[column] > max_val).sum()
        
        print(f"\n{column}:")
        print(f"  Expected: {min_val} to {max_val}")
        print(f"  Actual: {actual_min:.2f} to {actual_max:.2f}")
        
        # Report any problems found
        if below_range > 0:
            print(f"  ⚠️ WARNING: {below_range} values too low")
        if above_range > 0:
            print(f"  ⚠️ WARNING: {above_range} values too high")
        if below_range == 0 and above_range == 0:
            print(f"  ✅ All values look reasonable")

**What this does:**
- **Validates reality**: Data should match what's physically possible
- **Identifies errors**: Impossible values suggest data collection problems
- **Builds trust**: Your client knows the data has been thoroughly checked
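
Once you know which columns have violations, you usually want to see the offending rows too. A minimal sketch of one way to do that with `Series.between`, using made-up readings and a shortened, hypothetical `ranges` dictionary in the same `(min, max)` format as above:

```python
import pandas as pd

# Toy readings (hypothetical): one impossible humidity, one negative count
toy = pd.DataFrame({
    "humidity": [45, 130, 80, 62],
    "count": [12, 40, -3, 150],
})

# Same (min, max) convention as the reasonable_ranges dictionary
ranges = {"humidity": (0, 100), "count": (0, 1000)}

# between() is inclusive on both ends by default
violations = pd.DataFrame({
    col: ~toy[col].between(lo, hi) for col, (lo, hi) in ranges.items()
})

# Rows with at least one out-of-range value
suspect_rows = toy[violations.any(axis=1)]
print(suspect_rows)
```

Note that `between` returns `False` for missing values, so this mask would also flag `NaN` cells; handle missing data separately if you want to count only true range violations.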

---

### Challenge 2: Investigate Extreme Values
Your client asks: "Were those really busy hours normal, or were they data errors?" Find the top 5 highest bike rental hours and examine the conditions to see if they make sense.

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - find the top 5 busiest hours
top_rentals = _____.nlargest(_____, _____)  # Fill in DataFrame, number, and column
print("Top 5 busiest bike rental hours:")
print(top_rentals[['datetime', 'temp', 'humidity', 'weather', 'count']])

# Make datetime more readable
top_rentals_copy = top_rentals.copy()
top_rentals_copy['datetime'] = pd.to_datetime(top_rentals_copy['datetime'])
print("\nWith readable dates:")
print(top_rentals_copy[['datetime', 'temp', 'weather', 'count']])

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Use `df.nlargest(n, 'column')` to efficiently find the top N values in any column. Think of it as asking "show me the highest scores on the test." Look at the weather and temperature during these busy times - do they explain why so many people rented bikes?

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - find the top 5 busiest hours
top_rentals = df.nlargest(5, 'count')  # Fill in DataFrame, number, and column
print("Top 5 busiest bike rental hours:")
print(top_rentals[['datetime', 'temp', 'humidity', 'weather', 'count']])

# Make datetime more readable
top_rentals_copy = top_rentals.copy()
top_rentals_copy['datetime'] = pd.to_datetime(top_rentals_copy['datetime'])
print("\nWith readable dates:")
print(top_rentals_copy[['datetime', 'temp', 'weather', 'count']])
```

</details>

---

## Step 3: Outlier Detection and Treatment - Finding the Unusual

Now, let's identify outliers - data points that are unusually high or low compared to the rest. Think of outliers like finding a person who is 8 feet tall in a crowd - they're not necessarily wrong, but they're different enough to deserve special attention.

In this example, we will use the IQR (Interquartile Range) method to detect outliers. The IQR method is like creating a "normal zone" for your data. Here's how it works:

In [None]:
# Function to detect outliers using IQR method with detailed explanation
def detect_outliers_iqr(series, multiplier=1.5):
    """
    Detect outliers using the Interquartile Range (IQR) method.
    
    The IQR method works by:
    1. Finding Q1 (25th percentile) - 25% of data is below this value
    2. Finding Q3 (75th percentile) - 75% of data is below this value  
    3. Calculating IQR = Q3 - Q1 (the middle 50% range)
    4. Setting boundaries at Q1 - 1.5*IQR and Q3 + 1.5*IQR
    
    Data points outside these boundaries are considered outliers.
    """
    Q1 = series.quantile(0.25)  # 25th percentile
    Q3 = series.quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1               # Interquartile range (middle 50%)
    
    # Calculate outlier boundaries
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    # Identify outliers (values outside boundaries)
    outliers = (series < lower_bound) | (series > upper_bound)
    
    print(f"IQR Analysis for {series.name}:")
    print(f"  Q1 (25th percentile): {Q1:.2f}")
    print(f"  Q3 (75th percentile): {Q3:.2f}")
    print(f"  IQR (Q3 - Q1): {IQR:.2f}")
    print(f"  Lower boundary: {lower_bound:.2f}")
    print(f"  Upper boundary: {upper_bound:.2f}")
    
    return outliers, lower_bound, upper_bound

# Detect outliers in bike count
print("Analyzing bike rental outliers using IQR method:")
outliers, lower, upper = detect_outliers_iqr(df['count'])
print(f"  Number of outliers: {outliers.sum()}")
print(f"  Percentage of data: {outliers.sum()/len(df)*100:.1f}%")

# Examine extreme outliers
if outliers.sum() > 0:
    extreme_outliers = df[outliers].nlargest(5, 'count')
    print("\nTop 5 outlier periods:")
    print(extreme_outliers[['datetime', 'temp', 'weather', 'workingday', 'count']])

> **Note**: Why the IQR method is robust:
> - **Resistant to extreme values**: Unlike the mean and standard deviation, quartiles barely move when a few values are extreme
> - **Works with skewed data**: Doesn't assume a normal distribution
> - **Interpretable boundaries**: Clear definition of "normal" vs "unusual"

When you run this code, you'll see the number of outliers detected in the bike count data using the IQR method. Rather than automatically removing outliers, we should investigate them first because they might indicate:

- Special events (festivals, parades)
- Perfect weather conditions  
- System promotions or changes
- Data collection errors
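
To make the IQR arithmetic concrete, here is a small worked example on made-up hourly counts (the numbers are assumptions chosen so the quartiles are easy to verify by hand):

```python
import pandas as pd

# Toy hourly counts (hypothetical values chosen for easy arithmetic)
counts = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 500], name="count")

q1 = counts.quantile(0.25)   # 30.0
q3 = counts.quantile(0.75)   # 70.0
iqr = q3 - q1                # 40.0

lower = q1 - 1.5 * iqr       # -30.0
upper = q3 + 1.5 * iqr       # 130.0

# Only values outside [lower, upper] are flagged
outliers = counts[(counts < lower) | (counts > upper)]
print(outliers.tolist())     # [500]
```

Because rental counts cannot be negative, a lower boundary below zero simply means the method will never flag low values for this variable.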

---

### Challenge 3: Alternative Outlier Detection with Z-Score Method
Your client asks: "Can we use a different statistical approach to validate our outlier findings?" Implement the Z-score method and compare results with the IQR approach.

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - implement Z-score outlier detection
def detect_outliers_zscore(series, threshold=3):
    """
    Detect outliers using the Z-score method.
    
    Z-score measures how many standard deviations a value is from the mean.
    Values with |Z-score| > threshold are considered outliers.
    Common threshold: 3 (about 99.7% of normally distributed data falls within ±3 standard deviations)
    """
    mean = _____._____()  # Fill in series and method
    std = _____._____()   # Fill in series and method
    
    # Calculate Z-scores
    z_scores = (_____ - _____) / _____  # Fill in: (series - mean) / std
    
    # Identify outliers
    outliers = np.abs(_____) > _____  # Fill in z_scores and threshold
    
    print(f"Z-score Analysis for {series.name}:")
    print(f"  Mean: {_____:.2f}")  # Fill in variable
    print(f"  Standard deviation: {_____:.2f}")  # Fill in variable
    print(f"  Threshold: ±{_____}")  # Fill in threshold
    
    return outliers, z_scores

# Apply Z-score method to bike count
print("Analyzing bike rental outliers using Z-score method:")
zscore_outliers, z_scores = detect_outliers_zscore(df['_____'])  # Fill in column name
print(f"  Number of outliers: {zscore_outliers.sum()}")
print(f"  Percentage of data: {zscore_outliers.sum()/len(df)*100:.1f}%")

# Compare with IQR method (detect_outliers_iqr is defined in the Step 3 cell; run that cell first)
iqr_outliers, _, _ = detect_outliers_iqr(df['_____'])  # Fill in column name
print(f"\nMethod Comparison:")
print(f"  IQR outliers: {iqr_outliers.sum()}")
print(f"  Z-score outliers: {zscore_outliers.sum()}")

# Find overlapping outliers
overlap = (iqr_outliers & zscore_outliers).sum()
print(f"  Overlapping outliers: {overlap}")
print(f"  Agreement rate: {overlap/max(iqr_outliers.sum(), zscore_outliers.sum())*100:.1f}%")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


The Z-score method works best when your data is roughly normally distributed. Use `np.abs()` to get absolute values (distance from the mean regardless of direction). Compare the two methods: the mean and standard deviation behind the Z-score are themselves pulled by extreme values, while the IQR is more robust. Look for patterns in which outliers each method identifies!

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Your code here - implement Z-score outlier detection
def detect_outliers_zscore(series, threshold=3):
    """
    Detect outliers using the Z-score method.
    
    Z-score measures how many standard deviations a value is from the mean.
    Values with |Z-score| > threshold are considered outliers.
    Common threshold: 3 (about 99.7% of normally distributed data falls within ±3 standard deviations)
    """
    mean = series.mean()  # Fill in series and method
    std = series.std()   # Fill in series and method
    
    # Calculate Z-scores
    z_scores = (series - mean) / std  # Fill in: (series - mean) / std
    
    # Identify outliers
    outliers = np.abs(z_scores) > threshold  # Fill in z_scores and threshold
    
    print(f"Z-score Analysis for {series.name}:")
    print(f"  Mean: {mean:.2f}")  # Fill in variable
    print(f"  Standard deviation: {std:.2f}")  # Fill in variable
    print(f"  Threshold: ±{threshold}")  # Fill in threshold
    
    return outliers, z_scores

# Apply Z-score method to bike count
print("Analyzing bike rental outliers using Z-score method:")
zscore_outliers, z_scores = detect_outliers_zscore(df['count'])  # Fill in column name
print(f"  Number of outliers: {zscore_outliers.sum()}")
print(f"  Percentage of data: {zscore_outliers.sum()/len(df)*100:.1f}%")

# Compare with IQR method (detect_outliers_iqr is defined in the Step 3 cell; run that cell first)
iqr_outliers, _, _ = detect_outliers_iqr(df['count'])  # Fill in column name
print(f"\nMethod Comparison:")
print(f"  IQR outliers: {iqr_outliers.sum()}")
print(f"  Z-score outliers: {zscore_outliers.sum()}")

# Find overlapping outliers
overlap = (iqr_outliers & zscore_outliers).sum()
print(f"  Overlapping outliers: {overlap}")
print(f"  Agreement rate: {overlap/max(iqr_outliers.sum(), zscore_outliers.sum())*100:.1f}%")
```

</details>
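
The two methods can disagree sharply on small samples. The sketch below uses ten made-up counts (an assumption for illustration, not the client's data) where a single spike inflates the standard deviation so much that its own Z-score stays under 3, while the IQR rule still catches it. With the sample standard deviation, the largest possible |Z| in a sample of n values is (n - 1) / √n, which is about 2.85 for n = 10.

```python
import numpy as np
import pandas as pd

# Toy sample (hypothetical): nine ordinary hours and one extreme spike
counts = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 11, 500], name="count")

# Z-score flags: the spike inflates both the mean and the standard deviation
z_scores = (counts - counts.mean()) / counts.std()
z_flags = np.abs(z_scores) > 3

# IQR flags: quartiles barely move when one value is extreme
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
iqr_flags = (counts < q1 - 1.5 * iqr) | (counts > q3 + 1.5 * iqr)

print(f"Largest |z|: {z_scores.abs().max():.2f}")
print(f"Z-score outliers: {z_flags.sum()}, IQR outliers: {iqr_flags.sum()}")
```

On a dataset with thousands of hourly rows this ceiling is far above 3, so the effect is much weaker, but it explains why the two outlier counts rarely match exactly.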

---

## Step 4: Creating a Missing Data Report Card

Missing data is like having pieces of a puzzle scattered around. Before you can solve the puzzle (make predictions), you need to know which pieces are missing and how serious the problem is. Let's create a comprehensive missing data report card:

In [None]:
# Count missing values (like counting empty seats in a theater)
missing_counts = df.isnull().sum()
print("Missing Data Count:")
print(missing_counts)

# Calculate percentages (like getting a grade out of 100%)
missing_percentages = (missing_counts / len(df)) * 100

# Create a professional missing data report
missing_summary = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Missing_Percentage': missing_percentages
}).round(2)

print("\nMissing Data Report Card:")
print(missing_summary[missing_summary['Missing_Count'] > 0])

# Identify the "A+ students" - columns with perfect attendance
complete_columns = missing_summary[missing_summary['Missing_Count'] == 0].index.tolist()
print(f"\nPerfect attendance (no missing data): {len(complete_columns)} columns")
for col in complete_columns:
    print(f"  ✓ {col}")

**What `.isnull()` does:**
- Creates a True/False map of your data
- `True` means "this cell is empty"
- `.sum()` counts all the `True` values (empty cells)

**What this reveals:**
- **Missing counts**: How many empty cells each variable has
- **Missing percentages**: How complete each variable is (0% = no gaps)
- **Complete columns**: Variables you can use without any missing-data handling

Variables with 0% missing data are gold for completeness - you can use them without imputation, though they still need the range checks from Step 2. Variables with missing data need special handling before making business decisions.
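
As a compact alternative to dividing counts by the number of rows, the percentage can be computed in one step. A minimal sketch on a made-up frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): windspeed is missing in 2 of 4 rows
toy = pd.DataFrame({
    "temp": [9.8, 10.7, 11.5, 12.3],
    "windspeed": [np.nan, 6.0, np.nan, 11.0],
})

# isna() is the same check as isnull(); the mean of True/False values
# is the fraction of missing cells in each column
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```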

---

### Challenge 4: Investigate Missing Data Patterns
Be a data detective! Your client wants to know: "Are the missing windspeed readings random, or is there a pattern?" Examine missing windspeed data to see when it occurred.

In [None]:
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Ensure proper datetime handling for pattern analysis
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

# Your code here - investigate missing windspeed patterns
if 'windspeed' in _____.columns:  # Fill in DataFrame name
    missing_windspeed = _____[_____['windspeed'].isnull()]  # Fill in DataFrame names
    if len(missing_windspeed) > 0:
        print("First 10 periods with missing windspeed:")
        print(missing_windspeed[['datetime', 'temp', 'humidity', 'windspeed']].head(_____))  # Fill in number
        print("\nLast 10 periods with missing windspeed:")
        print(missing_windspeed[['datetime', 'temp', 'humidity', 'windspeed']].tail(_____))  # Fill in number
    else:
        print("Great news! No missing windspeed data found")
else:
    print("Windspeed column not found in dataset")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


When investigating missing data patterns, think like a detective:
- Use `df[df['column'].isnull()]` to filter rows where data is missing
- Look at timestamps: are missing values clustered in time periods?
- Check other variables: do missing values happen during specific weather conditions?

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Ensure proper datetime handling for pattern analysis
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.sort_values('datetime').reset_index(drop=True)

# Your code here - investigate missing windspeed patterns
if 'windspeed' in df.columns:  # Fill in DataFrame name
    missing_windspeed = df[df['windspeed'].isnull()]  # Fill in DataFrame names
    if len(missing_windspeed) > 0:
        print("First 10 periods with missing windspeed:")
        print(missing_windspeed[['datetime', 'temp', 'humidity', 'windspeed']].head(10))  # Fill in number
        print("\nLast 10 periods with missing windspeed:")
        print(missing_windspeed[['datetime', 'temp', 'humidity', 'windspeed']].tail(10))  # Fill in number
    else:
        print("Great news! No missing windspeed data found")
else:
    print("Windspeed column not found in dataset")
```

</details>
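
Looking at the first and last rows gives a quick impression, but measuring how long each gap lasts answers the clustering question more directly. A sketch under stated assumptions: the hourly series below is made up, and the run-labelling trick (`shift` plus `cumsum`) is one common idiom rather than the lecture's own method.

```python
import numpy as np
import pandas as pd

# Toy hourly series (hypothetical): one isolated gap and one 3-hour outage
hours = pd.date_range("2011-01-01 00:00", periods=10, freq="h")
windspeed = pd.Series(
    [7.0, np.nan, 6.0, 9.0, np.nan, np.nan, np.nan, 8.0, 7.0, 6.0],
    index=hours,
)

is_missing = windspeed.isna()

# A new run starts whenever the missing/present state changes
run_id = (is_missing != is_missing.shift()).cumsum()

# Keep only the missing runs and measure their length in hours
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()
print(gap_lengths.tolist())  # [1, 3]
```

Many short, scattered gaps suggest random sensor dropouts; a few long runs point to an outage that may need a different imputation strategy.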

---

## Step 5: Standardizing the Timeline - Fixing Duplicate Hours

Before we can trust our time-series data, we need to ensure each hour appears exactly once. Think of this like organizing a photo album - if you have multiple copies of the same photo, you need to decide which one to keep or how to combine them. We'll first identify duplicates, then collapse them using smart aggregation rules.

In [None]:
import numpy as np

# First, let's check how many duplicate timestamps we have
rows_per_ts = df.groupby("datetime").size().rename("n_rows_per_ts")
duplicated_ts = rows_per_ts[rows_per_ts > 1].index
n_dup_timestamps = len(duplicated_ts)
n_dup_rows = int((rows_per_ts[rows_per_ts > 1] - 1).sum())

print(f"Found {n_dup_timestamps} timestamps with duplicates")
print(f"Total extra duplicate rows to collapse: {n_dup_rows}")

# Show an example of duplicated data
if n_dup_timestamps > 0:
    sample_duplicate = duplicated_ts[0]
    print(f"\nExample - rows for {sample_duplicate}:")
    print(df[df["datetime"] == sample_duplicate][["datetime", "count", "temp", "weather"]].head())

# Build our aggregation policy - different rules for different types of data
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
agg_map = {c: "mean" for c in numeric_cols}

# For bike counts (targets), we SUM because rentals accumulate
for c in ["count", "casual", "registered"]:
    if c in agg_map:
        agg_map[c] = "sum"

# For categorical variables, we take the FIRST value
for c in ["holiday", "weather", "season"]:
    if c in df.columns:
        agg_map[c] = "first"

print("\nAggregation policy:")
print("- Bike counts (count, casual, registered): SUM")
print("- Weather variables (temp, humidity, etc.): MEAN") 
print("- Categories (holiday, weather, season): FIRST")

# Handle duplicates based on what we found
if n_dup_timestamps == 0:
    print("\n✅ No duplicate timestamps found - data is already clean!")
    df_clean = df.copy()
else:
    # Collapse the duplicates
    print(f"\nCollapsing {n_dup_timestamps} duplicate timestamps...")
    df_clean = (
        df.groupby("datetime", as_index=False)
          .agg(agg_map)
          .sort_values("datetime")
          .reset_index(drop=True)
    )
    
    # Add a flag to track which hours were collapsed
    df_clean["flag_collapsed_from_duplicates"] = df_clean["datetime"].isin(duplicated_ts)
    
    print(f"✅ Collapse complete!")
    print(f"Hours that were collapsed: {df_clean['flag_collapsed_from_duplicates'].sum()}")
    print(f"Total rows now: {len(df_clean)}")

**What this does:**
We identify and collapse duplicate timestamps using explicit aggregation rules. Bike rental counts are summed (because rentals accumulate), weather variables are averaged (representing hourly conditions), and categorical variables keep their first values (maintaining consistency). Summing assumes that duplicate rows are partial records of the same hour; if they are exact copies of each other, summing would double-count rentals, so remove those first with `df.drop_duplicates()`. We create a clean dataset (`df_clean`) and add a transparency flag to track which hours were affected by this process.

**Why this matters:**
This standardization is critical for reliable time-series analysis. Several rows for the same hour would distort hourly demand patterns and break the one-row-per-hour structure that forecasting models expect, potentially leading to poor business decisions. Now each hour has exactly one record with properly combined data, creating a trustworthy foundation for forecasting models.
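
The aggregation policy is easier to follow on a handful of rows. The sketch below applies the same three rules (sum, mean, first) to a made-up frame in which 01:00 appears twice; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy records (hypothetical): 01:00 appears twice with different partial counts
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 01:00"]
    ),
    "temp": [9.0, 10.0, 12.0],
    "weather": [1, 1, 2],
    "count": [16, 20, 15],
})

# Exact copies of a whole row would be double-counted by a sum,
# so it is worth checking for them separately
n_exact_copies = int(toy.duplicated().sum())
print(f"Exact duplicate rows: {n_exact_copies}")

# One rule per column: mean for weather readings, first for categories, sum for counts
collapsed = toy.groupby("datetime", as_index=False).agg(
    {"temp": "mean", "weather": "first", "count": "sum"}
)
print(collapsed)
```

After the collapse, 01:00 has a single row with `temp` 11.0, `weather` 1, and `count` 35.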

---

### Challenge 5: Complete Timeline - Filling the Missing Hours
Your client asks: "Are we missing any hours in our dataset? Can we have a complete timeline?" Create a continuous hourly timeline by inserting rows for missing hours.

In [None]:
# Import libraries and load data (and apply Step 5 cleaning first)
import pandas as pd
import numpy as np

# Load and prepare the dataset with collapsed duplicates
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")
df['datetime'] = pd.to_datetime(df['datetime'])

# First collapse duplicates (from Step 5)
rows_per_ts = df.groupby("datetime").size()
duplicated_ts = rows_per_ts[rows_per_ts > 1].index

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
agg_map = {c: "mean" for c in numeric_cols}
for c in ["count", "casual", "registered"]:
    if c in agg_map:
        agg_map[c] = "sum"
for c in ["holiday", "weather", "season"]:
    if c in df.columns:
        agg_map[c] = "first"

df = df.groupby("datetime", as_index=False).agg(agg_map).sort_values("datetime").reset_index(drop=True)

# Your challenge: Create a complete hourly timeline
time_min = _____["datetime"].min()  # Fill in DataFrame name
time_max = _____["datetime"].max()  # Fill in DataFrame name

# Build the full hourly index from min to max
full_hours = pd.date_range(_____, _____, freq="h")  # Fill in time_min and time_max (uppercase "H" is deprecated in recent pandas)

print(f"Original hours: {len(df)}")
print(f"Expected hours: {len(full_hours)}")
print(f"Missing hours: {len(full_hours) - len(df)}")

# Your code here - reindex to the full timeline and add a flag
original_hours = pd.Index(df["datetime"])
df = df.set_index("datetime").reindex(_____)  # Fill in the full_hours variable
df.index.name = "datetime"
df = df.sort_index()  # Ensure chronological order

# Add flag for inserted missing hours
df["flag_missing_timestamp"] = ~df.index.isin(_____)  # Fill in original_hours variable

# Check results
inserted_hours = int(df["flag_missing_timestamp"].sum())
print(f"✅ Timeline completed!")
print(f"Total hours now: {len(df)}")
print(f"Inserted missing hours: {inserted_hours}")

**What you should achieve:**
- A complete hourly timeline from first to last timestamp
- Identification of how many hours were missing
- A flag marking which rows were inserted vs. original data

**Why this matters:**
A continuous timeline is essential for time-series analysis. Missing hours can break seasonal patterns and forecasting models.
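
If the reindexing step feels abstract, the same idea on three made-up rows may help. This is a minimal sketch with hypothetical values, where the 02:00 record is absent:

```python
import pandas as pd

# Toy timeline (hypothetical): the 02:00 record is absent
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 03:00"]
    ),
    "count": [16, 40, 13],
})

full_hours = pd.date_range(toy["datetime"].min(), toy["datetime"].max(), freq="h")
original_hours = pd.Index(toy["datetime"])

# reindex() inserts a NaN row for every hour that was not in the data
complete = toy.set_index("datetime").reindex(full_hours)
complete.index.name = "datetime"

# Mark which rows were inserted rather than observed
complete["flag_missing_timestamp"] = ~complete.index.isin(original_hours)
print(complete)
```

The inserted row holds `NaN` for `count`, which also turns the column into floats; those gaps are new missing values that still need to be handled afterwards.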

---

## Step 6: Forward Fill for Known Missing Data

From our data quality assessment, we identified missing values in windspeed and potentially other variables. Before applying complex imputation strategies, let's start with forward fill for variables that change slowly over time.

This method assumes that if a value is missing, the most recent valid observation is our best estimate. For example, if temperature at 2 PM is missing but we recorded 25°C at 1 PM, forward fill uses 25°C for the 2 PM gap. This works because many weather variables have **temporal continuity** - they don't jump dramatically from hour to hour.

The beauty of this method is that it's simple, preserves actual measurements (rather than creating artificial averages), and works well for variables with natural persistence. Think of it as saying "conditions probably stayed similar to the last known measurement" rather than guessing with statistical averages.

In [None]:
# Rule #1: Never modify original data! Always work on a copy
df_clean = df.copy()
print("Created a working copy of the data (df itself stays untouched)")
print("Now we can safely apply fixes without losing the original")

# Check which variables have missing data (from our previous analysis)
missing_summary = df_clean.isnull().sum()
print("\nVariables with missing data:")
for col in missing_summary[missing_summary > 0].index:
    count = missing_summary[col]
    percentage = (count / len(df_clean)) * 100
    print(f"  {col}: {count} missing ({percentage:.1f}%)")

# Apply forward fill to slowly-changing weather variables
weather_vars_for_ffill = ['atemp', 'humidity', 'temp']
filled_count = 0

for var in weather_vars_for_ffill:
    if var in df_clean.columns and df_clean[var].isnull().sum() > 0:
        missing_before = df_clean[var].isnull().sum()
        df_clean[var] = df_clean[var].ffill()  # fillna(method='ffill') is deprecated in recent pandas
        missing_after = df_clean[var].isnull().sum()
        filled = missing_before - missing_after
        filled_count += filled
        print(f"✅ Forward-filled {filled} missing values in {var}")

if filled_count == 0:
    print("No weather variables needed forward fill")
else:
    print(f"\nTotal values filled with forward fill: {filled_count}")

# Show remaining missing data after forward fill
remaining_missing = df_clean.isnull().sum()
print("\nMissing data remaining after forward fill:")
if remaining_missing.sum() == 0:
    print("🎉 All missing data has been handled!")
else:
    for col in remaining_missing[remaining_missing > 0].index:
        print(f"  {col}: {remaining_missing[col]} still missing")

**When forward fill works well:**
- **Temperature variables**: Change gradually over hours
- **Atmospheric pressure**: Varies slowly throughout the day
- **Humidity**: Generally stable unless weather system changes

**When NOT to use forward fill:**
- **Windspeed**: Can change rapidly with weather fronts
- **Bike counts**: Demand varies dramatically by hour
- **Precipitation**: Rain can start/stop suddenly
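
Even for slowly changing variables, forward fill becomes less trustworthy the longer a gap lasts. One way to guard against that is the `limit` argument of `ffill`. The sketch below uses made-up temperatures, and the two-hour cap is an assumption chosen for illustration, not a recommendation from the lecture.

```python
import numpy as np
import pandas as pd

# Toy temperatures (hypothetical): a 1-hour gap and a 4-hour gap
temp = pd.Series([20.0, np.nan, 21.0, np.nan, np.nan, np.nan, np.nan, 25.0])

# limit=2 carries a value forward for at most two consecutive missing hours
filled = temp.ffill(limit=2)
print(filled.tolist())
# [20.0, 20.0, 21.0, 21.0, 21.0, nan, nan, 25.0]
```

The short gap is filled completely, while the last two hours of the long gap stay missing and can be handed to a different imputation method.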

---

### Challenge 6: Smart Grouping for Weather-Related Variables

Your client asks: "Can we handle the remaining missing windspeed data more intelligently?" Implement season-aware weather-based imputation for windspeed since wind patterns vary significantly between weather conditions AND seasons (winter storms vs summer breezes).

In [None]:
# Import libraries and load data (building on Step 6)
import pandas as pd
import numpy as np

# Load and apply our Step 6 cleaning first
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")
df_clean = df.copy()

# Apply forward fill to slowly-changing variables (from Step 6)
weather_vars_for_ffill = ['atemp', 'humidity', 'temp']
for var in weather_vars_for_ffill:
    if var in df_clean.columns and df_clean[var].isnull().sum() > 0:
        df_clean[var] = df_clean[var].ffill()  # fillna(method='ffill') is deprecated in recent pandas

# Your code here - implement smart grouping for windspeed
if 'windspeed' in df_clean.columns and df_clean['windspeed'].isnull().sum() > 0:
    print("Found missing windspeed data - implementing season-aware weather-based imputation")
    
    # Step 1: Calculate typical windspeed for each season-weather combination
    seasonal_weather_windspeed = df_clean.groupby(['_____', '_____'])['windspeed']._____()  # Fill in groupby columns and aggregation method
    print("Typical windspeed by season and weather condition:")
    print(seasonal_weather_windspeed)
    
    # Step 2: Fill missing values based on season-weather combinations
    missing_filled = 0
    for season in df_clean['season'].unique():
        for weather_code in df_clean['weather'].unique():
            # Find rows with this season-weather combination AND missing windspeed
            mask = (df_clean['season'] == season) & (df_clean['_____'] == weather_code) & (df_clean['windspeed']._____())  # Fill in column and missing check
            if mask.sum() > 0:
                # Get the fill value for this season-weather combination
                if (season, weather_code) in seasonal_weather_windspeed.index:
                    fill_value = seasonal_weather_windspeed[(season, weather_code)]
                    # Check if the median value is valid (not NaN)
                    if pd.isna(fill_value):
                        # Fallback to season-only median
                        fill_value = df_clean[df_clean['season'] == season]['windspeed'].median()
                else:
                    # Fallback to season-only median if combination doesn't exist
                    fill_value = df_clean[df_clean['season'] == season]['windspeed'].median()
                
                # Apply the fill value
                df_clean.loc[mask, 'windspeed'] = fill_value
                missing_filled += mask.sum()
    
    print(f"✅ Filled {missing_filled} missing windspeed values using season-aware weather logic")
    
    # Step 3: Validate our smart imputation
    print("\nValidation - windspeed distribution by season and weather:")
    validation = df_clean.groupby(['season', 'weather'])['windspeed'].agg(['count', 'mean', 'std']).round(2)
    print(_____)  # Fill in variable name
    
    # Step 4: Additional seasonal validation
    print("\nSeasonal windspeed patterns:")
    seasonal_summary = df_clean.groupby('season')['windspeed'].agg(['mean', 'std']).round(3)
    print(seasonal_summary)
else:
    print("No missing windspeed data found")

# Final check - did we handle all missing data?
remaining_missing = df_clean.isnull().sum()
print("\nFinal missing data status:")
if remaining_missing.sum() == 0:
    print("🎉 All missing data has been successfully handled!")
else:
    print("Variables still with missing data:")
    for col in remaining_missing[remaining_missing > 0].index:
        print(f"  {col}: {remaining_missing[col]} missing")

<details>
<summary>💡 <strong>Tip</strong> (click to expand)</summary>


Season-aware weather imputation is more meteorologically plausible than a single global fill because wind patterns depend on both weather conditions and the time of year. For example, winter storms typically bring higher winds than summer storms. Use `groupby(['season', 'weather'])['windspeed'].median()` to calculate the typical windspeed for each season-weather combination. The median is preferred over the mean here because it is less sensitive to extreme gusts. When you validate, check that your results show a realistic seasonal progression: winter (season=1) and fall (season=4) generally have higher wind speeds than summer (season=3).

</details>

<details>
<summary>🤫 <strong>Solution</strong> (click to expand)</summary>

```python
# Import libraries and load data (building on Step 6)
import pandas as pd
import numpy as np

# Load and apply our Step 6 cleaning first
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")
df_clean = df.copy()

# Apply forward fill to slowly-changing variables (from Step 6)
weather_vars_for_ffill = ['atemp', 'humidity', 'temp']
for var in weather_vars_for_ffill:
    if var in df_clean.columns and df_clean[var].isnull().sum() > 0:
        df_clean[var] = df_clean[var].ffill()  # fillna(method='ffill') is deprecated in recent pandas

# Implement smart grouping for windspeed
if 'windspeed' in df_clean.columns and df_clean['windspeed'].isnull().sum() > 0:
    print("Found missing windspeed data - implementing season-aware weather-based imputation")
    
    # Step 1: Calculate typical windspeed for each season-weather combination
    seasonal_weather_windspeed = df_clean.groupby(['season', 'weather'])['windspeed'].median()  # Median windspeed per season-weather combination
    print("Typical windspeed by season and weather condition:")
    print(seasonal_weather_windspeed)
    
    # Step 2: Fill missing values based on season-weather combinations
    missing_filled = 0
    for season in df_clean['season'].unique():
        for weather_code in df_clean['weather'].unique():
            # Find rows with this season-weather combination AND missing windspeed
            mask = (df_clean['season'] == season) & (df_clean['weather'] == weather_code) & (df_clean['windspeed'].isnull())
            if mask.sum() > 0:
                # Get the fill value for this season-weather combination
                if (season, weather_code) in seasonal_weather_windspeed.index:
                    fill_value = seasonal_weather_windspeed[(season, weather_code)]
                    # Check if the median value is valid (not NaN)
                    if pd.isna(fill_value):
                        # Fallback to season-only median
                        fill_value = df_clean[df_clean['season'] == season]['windspeed'].median()
                else:
                    # Fallback to season-only median if combination doesn't exist
                    fill_value = df_clean[df_clean['season'] == season]['windspeed'].median()
                
                # Apply the fill value
                df_clean.loc[mask, 'windspeed'] = fill_value
                missing_filled += mask.sum()
    
    print(f"✅ Filled {missing_filled} missing windspeed values using season-aware weather logic")
    
    # Step 3: Validate our smart imputation
    print("\nValidation - windspeed distribution by season and weather:")
    validation = df_clean.groupby(['season', 'weather'])['windspeed'].agg(['count', 'mean', 'std']).round(2)
    print(validation)
    
    # Step 4: Additional seasonal validation
    print("\nSeasonal windspeed patterns:")
    seasonal_summary = df_clean.groupby('season')['windspeed'].agg(['mean', 'std']).round(3)
    print(seasonal_summary)
else:
    print("No missing windspeed data found")

# Final check - did we handle all missing data?
remaining_missing = df_clean.isnull().sum()
print("\nFinal missing data status:")
if remaining_missing.sum() == 0:
    print("🎉 All missing data has been successfully handled!")
else:
    print("Variables still with missing data:")
    for col in remaining_missing[remaining_missing > 0].index:
        print(f"  {col}: {remaining_missing[col]} missing")
```

</details>
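
The nested loops in the solution can also be written without explicit iteration. As a minimal sketch, assuming the same `season`, `weather`, and `windspeed` column names and using made-up toy values rather than the client dataset, `groupby(...).transform('median')` returns one group median per row, so two chained `fillna` calls reproduce the "combination first, season as fallback" logic:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not taken from the client dataset)
toy = pd.DataFrame({
    "season":    [1, 1, 1, 1, 3, 3, 3, 3],
    "weather":   [1, 1, 1, 2, 1, 1, 2, 2],
    "windspeed": [10.0, 14.0, np.nan, np.nan, 6.0, np.nan, 8.0, 12.0],
})

# Median of each season-weather group, aligned row by row with the original data
combo_median = toy.groupby(["season", "weather"])["windspeed"].transform("median")

# Season-only median, used when a combination has no observed windspeed at all
season_median = toy.groupby("season")["windspeed"].transform("median")

# Fill from the most specific group first, then fall back to the season median
toy["windspeed_filled"] = toy["windspeed"].fillna(combo_median).fillna(season_median)

print(toy)
```

In this toy frame, the (season=1, weather=2) combination has no observed windspeed, so its row is filled from the season-only median, the same fallback the loop version applies.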

---

## Summary: Professional Data Quality Assessment and Cleaning

**What We've Accomplished**:
- Implemented comprehensive data quality assessment protocols for real transportation datasets
- Applied advanced statistical outlier detection methodologies using IQR and Z-score techniques
- Developed timeline standardization procedures through intelligent duplicate handling and missing timestamp resolution
- Executed sophisticated imputation strategies including forward-fill and season-aware weather-based approaches
- Established data validation frameworks ensuring cleaning operations preserved essential dataset characteristics
- Created systematic approaches to range validation and impossible value detection for business contexts

**Key Technical Skills Mastered**:
- Range validation and boundary checking for transportation data quality control
- Statistical outlier identification and investigation using multiple detection methodologies
- Time-series data standardization with intelligent aggregation rule implementation
- Missing data pattern analysis and targeted imputation strategy development
- Business logic integration for contextual data cleaning and validation protocols
- Quality assurance verification through before-and-after statistical comparison frameworks
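
The before-and-after comparison in the last bullet can be sketched as a small helper. This is an illustrative sketch only: `compare_before_after` is a hypothetical name, and the windspeed values are made up rather than taken from the client dataset.

```python
import numpy as np
import pandas as pd

def compare_before_after(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Summarize a column before and after cleaning, side by side."""
    stats = ["count", "mean", "median", "std"]
    summary = pd.DataFrame({
        "before": before.agg(stats),
        "after": after.agg(stats),
    })
    # Positive change means the statistic grew after cleaning
    summary["change"] = summary["after"] - summary["before"]
    return summary.round(3)

# Hypothetical windspeed column with two missing readings
before = pd.Series([10.0, 12.0, np.nan, 14.0, np.nan, 12.0], name="windspeed")
after = before.fillna(before.median())  # simple median imputation

report = compare_before_after(before, after)
print(report)
```

A report like this makes one side effect of imputation visible: filling gaps with a single typical value leaves the mean and median unchanged here but reduces the standard deviation, which is worth checking before the data feeds a model.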

**Next Steps**: We'll move on to feature engineering and preprocessing, transforming the cleaned dataset into the variables and formats that machine learning models need for transportation demand forecasting.

Your bike-sharing client now has a validated, analysis-ready dataset, cleaned with a systematic and repeatable quality control process. That is the foundation consulting firms need for reliable predictive modeling and sound business decisions.