# Lecture 4: Data Quality & Cleaning Essentials - Professional Data Preparation

## Learning Objectives

By the end of this lecture, you will be able to:
- Identify and categorize different types of data quality issues in transportation datasets
- Apply systematic approaches to detect missing data, outliers, and inconsistencies
- Choose appropriate strategies for handling missing data (elimination vs. imputation)
- Implement professional data cleaning workflows using pandas methods

---

## 1. The Reality of Real-World Transportation Data

As your consulting work progresses, you'll quickly discover a fundamental truth: **real-world data is messy**. The clean datasets you see in textbooks don't exist in professional practice. Your bike-sharing client's data comes from sensors that malfunction, weather stations that go offline, and databases that occasionally corrupt records.

This messiness isn't just a technical inconvenience - it's a business-critical challenge. Poor data quality can lead to incorrect demand predictions, resulting in empty bike stations when customers need them or overflow situations where returning bikes becomes impossible. These operational failures directly impact customer satisfaction and revenue.

Your role as a professional consultant is to transform messy, incomplete data into reliable foundations for business decision-making. This requires systematic approaches, clear documentation, and transparent communication about data limitations and cleaning procedures.

## 2. Common Data Quality Issues in Transportation Systems

Messiness isn’t a single problem—it comes in many forms, each affecting your analysis in different ways. As a consultant, your job is to **recognize these issues early** and decide how to handle them before they undermine your predictions.

### 2.1. Understanding Data Quality Dimensions

Data quality is not simply “good” or “bad.” Instead, it exists along multiple dimensions, each of which can influence your analysis in different ways. Let’s break down the five most important ones and see how they appear in transportation data:

1. **Completeness**
   *Definition:* Whether all expected values are present in the dataset.
   *Why it matters:* Missing values create blind spots, especially when the missingness is not random.
   *Example:* If bike-sharing sensors fail during storms, you may lose exactly the data needed to understand weather impacts on demand.

2. **Accuracy**
   *Definition:* The extent to which recorded values reflect reality.
   *Why it matters:* Inaccurate values can mislead both descriptive analysis and predictive models.
   *Example:* A temperature of -50°C recorded in Washington D.C. in July is a clear sensor error that could confuse demand models.

3. **Consistency**
   *Definition:* Whether data follows uniform formats, units, and scales.
   *Why it matters:* Inconsistent formats can corrupt calculations and comparisons.
   *Example:* Mixing Celsius and Fahrenheit in the same column, or having timestamps in multiple formats, leads to corrupted analysis.

4. **Validity**
   *Definition:* Whether values fall within logical or physically possible ranges.
   *Why it matters:* Invalid data points indicate measurement or collection errors.
   *Example:* Negative bike counts or humidity above 100% are impossible values that reveal collection problems.

5. **Uniqueness**
   *Definition:* Whether each observation is recorded only once.
   *Why it matters:* Duplicate records inflate usage counts and distort demand predictions.
   *Example:* If the same rental transaction is logged twice, it looks like demand is higher than it really was.

**Understanding the Five-Dimensional Framework**

Together, these five dimensions reveal how data quality problems can undermine predictions in different ways. Completeness affects the scope of your analysis, accuracy determines whether your insights reflect reality, consistency ensures reliable calculations, validity catches measurement errors, and uniqueness prevents false inflation of patterns. Recognizing these distinct aspects helps you target specific problems rather than applying generic "data cleaning" approaches.
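
As a rough illustration, each of the five dimensions can be turned into a one-line pandas check. The sketch below uses a small made-up table: the values, the `trip_id` and `temp_unit` columns, and the assumption that every row was recorded in July are hypothetical and are not taken from the client dataset.

```python
import pandas as pd

# Hypothetical July records; all values are invented for illustration
toy = pd.DataFrame({
    "trip_id":   [1, 2, 2, 3, 4, 5],
    "temp":      [28.0, None, None, -50.0, 88.0, 30.0],
    "temp_unit": ["C", "C", "C", "C", "F", "C"],
    "humidity":  [60.0, 55.0, 55.0, 120.0, 70.0, 65.0],
    "count":     [12, 30, 30, -4, 25, 18],
})

report = {
    # Completeness: expected values that are absent
    "completeness_missing_temp": int(toy["temp"].isna().sum()),
    # Accuracy: readings that cannot reflect reality (sub-zero in July, by assumption)
    "accuracy_subzero_july_temp": int((toy["temp"] < 0).sum()),
    # Consistency: rows recorded in a different unit than the rest
    "consistency_non_celsius": int(toy["temp_unit"].ne("C").sum()),
    # Validity: values outside logically possible ranges
    "validity_impossible_values": int(((toy["count"] < 0) | (toy["humidity"] > 100)).sum()),
    # Uniqueness: the same trip logged more than once
    "uniqueness_duplicate_trips": int(toy.duplicated(subset="trip_id").sum()),
}

print(report)
```

Keeping one named check per dimension makes it easier to report to a client which kind of problem was found, rather than a single undifferentiated "bad rows" count.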

### 2.2. Missing Data Patterns and Business Implications

Not all missing data is created equal. In transportation systems, gaps in the dataset often occur under very specific conditions—the very conditions you want to analyze. Understanding when and why data goes missing helps you choose the right handling strategies and communicate risks to clients.

**Weather-Driven Data Loss**

The most common pattern involves sensor failures during extreme weather events. Storms, heavy rain, and temperature extremes can knock out monitoring equipment precisely when weather has the strongest influence on bike usage. This creates a double problem: you lose data exactly when you need it most, and standard statistical assumptions about "random" missingness don't apply.

**Operational and Maintenance Gaps**

Planned maintenance creates predictable but significant gaps in transportation data. These periods often overlap with major infrastructure changes—like opening new stations or updating software systems—meaning that ignoring the operational context could hide important business insights about system growth and performance.

**Network Effects and Cascade Failures**

Transportation systems are interconnected. When a high-demand station goes offline, neighboring stations typically experience unusual demand spikes. Without data from the offline station, it becomes difficult to distinguish between genuine demand growth and temporary displacement of riders to nearby locations.

**Peak Period Vulnerabilities**

Finally, data failures during rush hours pose special challenges because peak-period predictions drive critical business decisions. Missing data during these high-stakes periods requires more sophisticated handling strategies than gaps occurring during quiet off-peak hours when the business impact is minimal.

**The Business Impact of Missing Data Patterns**

These four patterns demonstrate why missing data analysis goes beyond simple counts and percentages. Each pattern creates specific risks for demand forecasting and requires tailored handling strategies. By understanding when and why data disappears, you can make informed decisions about imputation, communicate limitations clearly to clients, and avoid building models on unreliable foundations.
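
One simple way to probe whether gaps are tied to specific conditions is to compare the share of missing demand values across weather categories. The following is a minimal sketch on invented data, where demand happens to be missing mostly under weather code 4; the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly records: demand is missing mostly during storms (weather = 4)
toy = pd.DataFrame({
    "weather": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4],
    "count":   [200, 180, 210, np.nan, 150, 160, 90, 80, np.nan, np.nan],
})

# Share of rows with missing demand, per weather category
missing_rate = toy["count"].isna().groupby(toy["weather"]).mean()

print(missing_rate)
```

A missing rate that is much higher in one category than in the others suggests the gaps are not random, which matters when choosing between dropping rows and imputing them.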

### 2.3. Systematic Inconsistencies

Now that we've examined missing data, let's shift to inconsistencies that are harder to detect. Beyond obvious errors like negative values, transportation datasets often contain **systematic inconsistencies** that masquerade as genuine patterns. These subtle problems are particularly dangerous because they can mislead analysis while appearing completely normal. As a consultant, actively searching for these hidden inconsistencies protects both your models and your credibility.

The most common systematic inconsistencies in transportation data fall into four categories:

* **Temporal Zone Confusion**
  Mixing standard time and daylight saving time can create sudden “demand anomalies” around clock changes that have nothing to do with rider behavior.

* **Geographic Coordinate Systems**
  If different coordinate systems are used to record station locations, spatial analysis may reveal patterns that are purely artifacts of mismatched formats.

* **Operational Status Mixing**
  True zero demand (no rides) is very different from a station being unavailable due to maintenance or full capacity. Mixing these cases together creates artificial demand patterns that models mistakenly learn.

* **Unit Inconsistencies**
  Data collected in multiple units—such as Fahrenheit vs. Celsius, or miles vs. kilometers—corrupts correlations and model training if not standardized.
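
Two of these categories lend themselves to a short sketch: standardizing units and separating true zero demand from unavailable stations. The example assumes the data carries a `temp_unit` column and a `status` column; both are hypothetical, and real feeds may encode this information differently or not at all.

```python
import numpy as np
import pandas as pd

# Hypothetical station-hour records; `temp_unit` and `status` are assumed columns
toy = pd.DataFrame({
    "temp":      [20.0, 68.0, 25.0, 77.0],
    "temp_unit": ["C", "F", "C", "F"],
    "status":    ["active", "active", "maintenance", "active"],
    "count":     [15, 0, 0, 22],
})

# Unit inconsistencies: convert Fahrenheit readings to Celsius
is_fahrenheit = toy["temp_unit"].eq("F")
toy["temp_c"] = toy["temp"].where(~is_fahrenheit, (toy["temp"] - 32) * 5 / 9)

# Operational status mixing: a zero logged while the station was unavailable
# is not zero demand, so it is marked as missing instead
toy["demand"] = toy["count"].where(toy["status"].eq("active"), np.nan)

print(toy[["temp_c", "status", "demand"]])
```

After this step, the zero in the second row still counts as genuine zero demand, while the zero recorded during maintenance no longer does.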

## 3. Data Quality Assessment Process

As a consultant, clients will expect you to follow a systematic, defensible process for data quality assessment — not ad-hoc checking. The 4-step process we'll use gives you a professional framework that you can explain and justify to any client:

1. **Quick Data Quality Checks** – your first diagnostic scan to flag obvious issues.
2. **Temporal Continuity Check** – verification that the timeline has no gaps or duplicate timestamps.
3. **Outlier Detection** – comprehensive detection of anomalies.
4. **Missing Data Detection** – detailed analysis of completeness.

This systematic approach demonstrates expertise and builds client confidence from day one. Steps 1, 3, and 4 apply universally across domains. Step 2 becomes critical for time-series data, which is the norm in transportation problems.

### 3.1. Quick Data Quality Checks

**Definition:** Quick data quality checks are rapid diagnostic scans that assess whether a dataset is fundamentally sound before investing time in detailed analysis or modeling.

**Explanation:** Think of this like a consultant's "triage" — in just a few minutes, you want to know whether the dataset looks broadly reliable, where the biggest risks lie, and which areas deserve closer investigation. When you receive a new dataset from a client, this scan is your first step, not modeling.

**Purpose:** These lightweight checks flag obvious issues across structure, value ranges, and cross-variable plausibility. We won't yet explain problems in depth or attempt fixes — that comes later. The goal is rapid risk assessment.

**Structural Snapshot**

The first thing we do is take a **structural snapshot** of the dataset: how many rows and columns it has, and whether the variables are of the expected type. This step sounds simple, but it’s one of the fastest ways to detect import errors, unexpected row counts, or inconsistencies in data types — all of which can indicate bigger problems lurking beneath the surface.

In [1]:
import pandas as pd

# Load the Washington D.C. bike-sharing dataset (intentionally messy version)
df = pd.read_csv("https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv")

# Check dataset dimensions and data types
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(df.info())

Rows: 10886, Columns: 12
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10341 non-null  float64
 3   workingday  10886 non-null  int64  
 4   weather     10815 non-null  float64
 5   temp        10814 non-null  float64
 6   atemp       10814 non-null  float64
 7   humidity    10814 non-null  float64
 8   windspeed   10814 non-null  float64
 9   casual      10740 non-null  float64
 10  registered  10740 non-null  float64
 11  count       10740 non-null  float64
dtypes: float64(9), int64(2), object(1)
memory usage: 1020.7+ KB
None


> **Note:** You may have noticed something unusual here: the results don't look exactly like what we saw in the previous lecture. That's intentional. From this point forward, we'll sometimes work with a slightly modified version of the Washington D.C. dataset. We've made it "messy" on purpose so you can practice handling real-world problems. As a consultant, you'll rarely get a dataset that's perfectly clean — each phase of the course will bring new challenges for you to detect and resolve.

When we run this quick check, a few important concerns stand out immediately:

* The **`holiday` column has missing values**. This means that not every day is properly labeled as a holiday or not — a detail that could easily distort demand forecasts, since holiday patterns differ sharply from regular weekdays.
* The **weather-related variables** such as `temp`, `humidity`, and `windspeed` also contain gaps. Because these are some of the most important explanatory variables in our forecasting model, missingness here reduces our ability to explain variation in bike rentals.
* Most critically, the **`count` column — our target variable — is missing in several rows**. This is a red flag: every missing entry in `count` means lost training data, and the reliability of our model hinges on how much usable demand history we have.

This single, simple scan already tells us that the dataset cannot be used “as is” for modeling. More importantly, it shows why **structural checks are powerful**: with just one command, we’ve uncovered problems in both our explanatory variables and our target.

This is exactly the kind of insight to highlight at the project start: *"Before we can move into forecasting, we've already identified major gaps in the dataset that could affect both explanatory power and prediction accuracy."*

**Value Sanity Checks**

After confirming the dataset’s structure, the next step is to ask: *“Do the values themselves make sense?”*

Every variable has **natural boundaries** defined by either business rules or physical limits:

* Bike rentals cannot be negative.
* Humidity must fall between 0% and 100%.
* Local temperatures should stay within climate-appropriate ranges.

Values that fall outside these ranges are not just unusual — they are almost always errors caused by faulty sensors, bad data entry, or processing mistakes.

In [2]:
# Basic range checks
print("Negative rentals:", (df["count"] < 0).sum())
print("Humidity out of range:", ((df["humidity"] < 0) | (df["humidity"] > 100)).sum())
print("Temperature extremes:", df["temp"].describe()[["min", "max"]])

Negative rentals: 6
Humidity out of range: 10
Temperature extremes: min      0.82
max    100.00
Name: temp, dtype: float64


Running these quick checks reveals three immediate red flags:

* We find **6 cases of negative rentals**, which is logically impossible — you can’t rent fewer than zero bikes. This is most likely a logging or entry error.
* We spot **10 humidity values outside the 0–100% range**, which is physically impossible. This usually points to a faulty sensor reading or an ingestion problem.
* Finally, the **temperature maximum is 100°C**. While summers in Washington D.C. can be hot, they certainly don’t reach boiling point! This extreme value is almost certainly an error that could distort averages or mislead a forecasting model.

Together, these findings show why **range validation is essential**. With just a few simple checks, we can identify values that clearly break real-world rules — and if left undetected, they could slip into analysis, biasing results and damaging credibility with clients.

This step builds trust: you demonstrate that you're not just running models blindly, but verifying whether the data itself reflects reality.
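
When the number of rules grows, it can help to keep the limits in one place and loop over them. The sketch below is one possible arrangement, not a prescribed method; the sample rows and the temperature bounds of -20°C to 45°C are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the bounds below are assumed plausibility limits
toy = pd.DataFrame({
    "count":    [16, -3, 40, 12],
    "humidity": [81.0, 75.0, 130.0, None],
    "temp":     [9.8, 100.0, 14.8, 22.1],
})

# (lower, upper) bound per column; None means "no bound on that side"
rules = {
    "count":    (0, None),
    "humidity": (0, 100),
    "temp":     (-20, 45),
}

violations = {}
for col, (lower, upper) in rules.items():
    values = toy[col]
    bad = pd.Series(False, index=toy.index)
    if lower is not None:
        bad |= values < lower
    if upper is not None:
        bad |= values > upper
    # Missing values compare as False, so they are not counted as violations
    violations[col] = int(bad.sum())

print(violations)
```

A single dictionary of limits is also easy to review with the client, who usually knows the realistic ranges better than the analyst does.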

**Cross-Variable Plausibility**

Numbers can look fine in isolation but make no sense once you compare them across variables. That’s why a good quick check also includes a **plausibility scan across related variables**. In transportation data, the most important relationship to test is usually between **demand** and **context variables** like weather.

For example, common sense (and business experience) tells us that **bike rentals should fall when weather conditions worsen**. If the dataset shows the opposite, that’s a red flag.

In [3]:
# Quick plausibility check: average rentals by weather condition
avg_rentals_by_weather = df.groupby("weather")["count"].mean()
print(avg_rentals_by_weather)

weather
1.0    208.579986
2.0    182.556386
3.0    119.877214
4.0    628.562500
Name: count, dtype: float64


At first glance, the results follow expectations:

- Rentals are **highest on clear days (weather = 1)**.
- They gradually decrease as conditions worsen to misty or light rain.

But the final category is suspicious. Under **heavy rain or storms (weather = 4)**, the dataset shows an **average of more than 600 rentals per hour** — even higher than on sunny days.

This makes little business sense: real-world demand should drop sharply in severe weather, not skyrocket. Such an inconsistency usually points to:

- **Mislabelled weather codes**, or
- A **misalignment between weather feeds and rental logs**.

Left uncorrected, such errors could lead to false conclusions like *"bike demand is resilient during storms"* — which in turn could drive poor operational decisions, such as overstocking bikes or overscheduling staff during extreme weather events.

This kind of cross-variable check is a reminder: **some errors only appear when you look at relationships, not just single columns**. That’s why consultants always test whether the data’s “story” matches real-world logic.
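
A group average can also look extreme simply because the group is tiny. Before interpreting a suspicious mean, it is worth checking how many rows stand behind it. The sketch below uses invented numbers in which only two rows carry weather code 4; whether that is the case in the client dataset would need to be checked separately.

```python
import pandas as pd

# Hypothetical hourly records: only two rows carry weather code 4
toy = pd.DataFrame({
    "weather": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    "count":   [210, 190, 220, 200, 180, 170, 190, 120, 110, 600, 650],
})

# Mean, median, and number of observations per weather category
summary = toy.groupby("weather")["count"].agg(["mean", "median", "count"])

print(summary)
```

If a category rests on a handful of rows, its average says little about typical demand, and the individual rows should be inspected directly.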

**Concluding Reflection on Quick Checks**

Our quick scan uncovered multiple issues:
- **Structural problems**: missing values in explanatory variables and in the target `count`
- **Impossible values** like negative rentals and extreme temperatures  
- **Cross-variable inconsistencies** where weather and demand defy business logic

This initial scan shows the dataset is not ready for modeling. We need deeper investigation of missing data and outliers before drawing conclusions.

### 3.2. Temporal Continuity Check

**Definition:** Temporal continuity refers to whether your time-series data has a complete, consistent timeline without gaps, duplicates, or misaligned sequences.

**Explanation:** Transportation data is inherently time-based. If the timeline itself is broken, then any further cleaning, imputation, or modeling will rest on shaky foundations. Missing hours create false patterns, duplicates skew averages, and misaligned sequences break seasonal analysis.

**Purpose:** Before we tackle outliers or missing values, we must verify the timeline's integrity to ensure reliable foundations for all subsequent work.

To do so, we run a timeline diagnostic. This tells us:

- The first and last timestamp in the dataset.
- How many hours should exist in that range.
- How many unique hours actually exist.
- How many are missing.
- How many duplicate rows we have for the same hour.

This gives us a quick sense of whether the dataset is complete and well-aligned, or whether we’re missing entire blocks of time.

In [4]:
import pandas as pd

# Ensure datetime is properly parsed
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv"
df = pd.read_csv(data_path)
df["datetime"] = pd.to_datetime(df["datetime"], errors="coerce")
df = df.sort_values("datetime").reset_index(drop=True)

# Identify coverage
t_min = df["datetime"].min()
t_max = df["datetime"].max()

# Duplicated timestamps
n_dup_rows = df.duplicated(subset=["datetime"]).sum()

# Build expected hourly range
expected = pd.date_range(t_min, t_max, freq="h")
actual = pd.Index(df["datetime"].unique())

n_expected = len(expected)
n_actual = len(actual)
n_missing_hours = len(expected.difference(actual))

print("=== Timeline quick check ===")
print("time_min:", t_min)
print("time_max:", t_max)
print("expected_hours:", n_expected)
print("present_unique_hours:", n_actual)
print("missing_hours:", n_missing_hours)
print("duplicate_rows:", n_dup_rows)

=== Timeline quick check ===
time_min: 2011-01-01 00:00:00
time_max: 2012-12-19 23:00:00
expected_hours: 17256
present_unique_hours: 10862
missing_hours: 6394
duplicate_rows: 24


Between January 2011 and December 2012, the dataset should contain **17,256 hourly rows**. In reality, it only contains **10,862 unique hours**, leaving **6,394 hours missing**. That's more than a third of the timeline absent — a major structural gap. We also see **24 duplicate rows**, meaning some hours are represented more than once.

This is a critical insight: the raw data cannot be trusted as a continuous timeline. Large gaps undermine seasonal analysis, and duplicates risk double-counting demand. Before any modeling, we need to fix both problems. We will show how to do that in Chapter 5.
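
A total count of missing hours does not show whether they form a few long blocks or many short interruptions. One way to find out is to look at the step between consecutive timestamps. The sketch below works on a handful of invented timestamps; the column names `gap_start`, `gap_end`, and `missing_hours` are illustrative choices.

```python
import pandas as pd

# Hypothetical hourly timestamps with one duplicate and two gaps
stamps = pd.to_datetime([
    "2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 01:00",
    "2011-01-01 02:00", "2011-01-01 06:00", "2011-01-01 07:00",
    "2011-01-02 07:00",
])
ts = pd.Series(stamps).drop_duplicates().sort_values().reset_index(drop=True)

# Time elapsed since the previous observed hour
step = ts.diff()
one_hour = pd.Timedelta(hours=1)

# A step longer than one hour means the hours in between are absent
gaps = pd.DataFrame({
    "gap_start": ts.shift(1) + one_hour,
    "gap_end": ts - one_hour,
    "missing_hours": step / one_hour - 1,
})[step > one_hour]

print(gaps.sort_values("missing_hours", ascending=False))
```

Sorting by `missing_hours` puts the largest blocks first, which is usually where the conversation with the data provider should start.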

### 3.3. Outlier Detection

**Definition:** Outlier detection identifies data points that deviate significantly from the expected pattern — values that are unusually high, unusually low, or inconsistent with the rest of the dataset.

**Explanation:** In transportation data, outliers can represent legitimate extreme events (like snowstorms causing demand drops), data collection errors (like negative bike rentals), or operational anomalies (like maintenance affecting normal patterns). Each type requires different treatment strategies.

**Purpose:** The goal is not just to detect outliers, but to classify them correctly so that legitimate events are preserved while errors are corrected. This ensures models learn from real patterns rather than data quality issues.

Outliers matter in urban mobility because they can carry very different meanings:

- **Legitimate Extreme Events**: Real extraordinary conditions like snowstorms or city-wide festivals that models should account for
- **Data Collection Errors**: Sensor malfunctions or impossible values that need correction or removal
- **Operational Anomalies**: Maintenance activities or system changes that create artificial patterns

Now that we understand what outliers are, let’s look at the different ways to detect them. Outlier detection methods generally fall into three categories:

1. **Statistical Methods**: Mathematical rules based on the distribution of the data.
2. **Business Logic Methods**: Checks based on domain knowledge and system constraints.
3. **Temporal Methods**: Techniques that consider the time-based nature of transportation data.

**Statistical Methods**

Statistical methods use mathematical formulas to identify unusual values. They don’t require prior knowledge of the transportation system - they just look at how far a data point is from what is “normal” in the dataset.

There are several statistical approaches, such as:

- **Z-Score Analysis**
- **Interquartile Range (IQR) Method**
- **Modified Z-Score**

In this lecture, we will focus on just one example: **Z-Score Analysis**. A **Z-score** tells us how many “standard steps” (standard deviations) a data point is away from the average.

$$
Z = \frac{x - \mu}{\sigma}
$$

where:

- $x$ = the value we’re checking
- $\mu$ = the mean (average) of the data
- $\sigma$ = the standard deviation (how spread out the data is)

Think of the average bike rentals per day like the “center of gravity” of the data. Most days will be close to that average. The Z-score is like a distance meter: it tells us how far a particular day is from the typical pattern.

- A Z-score of **0** → exactly average.
- A Z-score of **+2** → two steps above average (busier than normal).
- A Z-score of **–3** → three steps below average (quieter than normal).

When a Z-score is bigger than 3 or smaller than –3, the value is far enough from the average that we should pause and ask: *Is this a real event, or is it an error?*

We use Z-scores because they:

- **Standardize values** so we can compare across variables.
- **Give a simple rule of thumb**: beyond 3 = unusual.
- **Provide a quick first filter** before applying more advanced techniques.

Let’s see how this works in practice using the Washington D.C. bike-sharing dataset. We’ll calculate the Z-scores for daily demand (`count`) and flag potential outliers.

In [5]:
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv"
df = pd.read_csv(data_path)

# Convert datetime column to pandas datetime type
df['datetime'] = pd.to_datetime(df['datetime'])

# Aggregate rentals by day
daily_rentals = df.groupby(df['datetime'].dt.date)['count'].sum().reset_index()

# Calculate mean and std
mean = daily_rentals['count'].mean()
std = daily_rentals['count'].std()

# Compute Z-scores
daily_rentals['z_score'] = (daily_rentals['count'] - mean) / std

# Flag outliers (|Z| > 3)
outliers = daily_rentals[daily_rentals['z_score'].abs() > 3]

outliers.head()

| | datetime | count | z_score |
|---|---|---|---|
| 90 | 2011-05-15 | 21301.0 | 7.389985 |
| 191 | 2011-11-03 | 21887.0 | 7.649137 |


Both May 15 and November 3, 2011 show exceptionally high rental counts, with Z-scores above 7, far beyond typical daily demand. These are unlikely to be ordinary fluctuations. They could represent special city-wide events or anomalies in how trips were logged. As consultants, we need to cross-check these dates with event calendars and system logs to confirm whether these spikes reflect genuine demand or possible data quality issues.
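
The other two statistical approaches listed earlier, the IQR method and the Modified Z-score, are built on quartiles and medians, which extreme values barely move. The classic Z-score, by contrast, uses a mean and standard deviation that the outliers themselves inflate. The sketch below compares the three on a tiny invented series. The 1.5 × IQR fences and the 0.6745 constant with a 3.5 cutoff are the commonly cited conventions, used here as assumptions rather than as settings from this course.

```python
import pandas as pd

# Hypothetical daily totals with one extreme spike
daily = pd.Series([3100, 2900, 3300, 3000, 3200, 2800, 3400, 21000], name="count")

# IQR method: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = daily.quantile(0.25), daily.quantile(0.75)
iqr = q3 - q1
iqr_outliers = daily[(daily < q1 - 1.5 * iqr) | (daily > q3 + 1.5 * iqr)]

# Modified Z-score: median and MAD replace mean and standard deviation
median = daily.median()
mad = (daily - median).abs().median()
modified_z = 0.6745 * (daily - median) / mad
mz_outliers = daily[modified_z.abs() > 3.5]

# Classic Z-score for comparison
z = (daily - daily.mean()) / daily.std()

print("IQR outliers:", iqr_outliers.tolist())
print("Modified Z outliers:", mz_outliers.tolist())
print("Largest classic |Z|:", round(z.abs().max(), 2))
```

In this small sample the spike is caught by both robust methods, while its classic Z-score stays below 3 because the spike itself stretches the standard deviation. That is a good reason to treat the |Z| > 3 rule as a first filter rather than a final verdict.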

**Business Logic Methods**

While statistical methods rely purely on mathematical rules, **business logic methods** use knowledge of the system and its physical constraints to detect outliers. Instead of asking, “Does this number look statistically unusual?”, we ask, “Is this number even possible given how the transportation system works?”

Business logic methods build rules like these based on:

- **Physical constraints**: e.g., a bike station cannot have negative bikes, nor can it rent more bikes than its maximum capacity.
- **Historical ranges**: e.g., if hourly demand has never exceeded 1,200 rentals, a value above this threshold is suspicious.
- **Cross-variable checks**: e.g., a record that combines “heavy rain” with “record-high bike usage” is implausible and deserves investigation.

Let's see an example of this.

In [6]:
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv"
df = pd.read_csv(data_path)

# Convert datetime column to pandas datetime type
df['datetime'] = pd.to_datetime(df['datetime'])

# Check for physical constraint violations in windspeed
invalid_windspeed = df[(df['windspeed'] < 0) | (df['windspeed'] > 60)]

invalid_windspeed.head()

| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1077 | 2011-03-09 10:00:00 | 1 | 0.0 | 1 | 2.0 | 12.3 | 14.395 | 75.0 | 200.0 | 8.0 | 49.0 | 57.0 |
| 1159 | 2011-03-12 23:00:00 | 1 | 0.0 | 0 | 1.0 | 15.58 | 19.695 | 66.0 | 200.0 | 11.0 | 38.0 | 49.0 |
| 1991 | 2011-05-09 21:00:00 | 2 | 0.0 | 1 | 1.0 | 21.32 | 25.0 | 59.0 | 200.0 | 28.0 | 128.0 | 156.0 |
| 2730 | 2011-07-02 16:00:00 | 3 | 0.0 | 0 | 1.0 | 36.08 | 37.12 | 22.0 | 200.0 | 206.0 | 192.0 | 398.0 |
| 4820 | 2011-11-14 22:00:00 | 4 | 0.0 | 1 | 1.0 | 23.78 | 27.275 | 56.0 | 200.0 | 17.0 | 96.0 | 113.0 |


The flagged records show windspeed values of `200.0`, which are far beyond any physically possible measurement for this system. These values clearly indicate sensor or recording errors rather than real-world weather conditions. If left uncorrected, they could distort downstream models, for example by falsely associating extreme winds with normal rental demand.
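
Business rules can also link several columns. In the rows shown above, `casual + registered` equals `count` every time (for example, 8 + 49 = 57). Assuming that identity is meant to hold across the whole dataset, any row that breaks it is a candidate error. The sketch below uses invented rows, the last of which violates the rule.

```python
import pandas as pd

# Hypothetical rows; the last one breaks the casual + registered = count identity
toy = pd.DataFrame({
    "casual":     [8.0, 11.0, 28.0, 5.0],
    "registered": [49.0, 38.0, 128.0, 20.0],
    "count":      [57.0, 49.0, 156.0, 40.0],
})

# Only compare rows where all three values are present
complete = toy[["casual", "registered", "count"]].notna().all(axis=1)
mismatch = complete & (toy["casual"] + toy["registered"] != toy["count"])

print(toy[mismatch])
```

Rows with missing components are skipped on purpose here, since they belong to the missing-data analysis rather than to the outlier check.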

**Temporal Methods**

Transportation data is inherently tied to **time**. Unlike static datasets, values change depending on the hour, day, season, or long-term trends. **Temporal outlier detection methods** focus on identifying unusual data points that break these expected time-based patterns. For example:

- A sudden drop in rentals during a weekday morning rush hour might indicate a system outage.
- A sharp jump in rentals during winter could mean a special event.
- A long-term shift in demand may signal that the system has grown or changed in some way.

Some common temporal approaches include:

- **Change Point Detection**: Identifying sudden structural shifts in the data (e.g., a new station or policy).
- **Seasonal Anomaly Detection**: Checking if values align with expected seasonal patterns.
- **Trend Deviation Analysis**: Comparing current values to long-term growth or decline.

Let's see an example of a seasonal anomaly detection.

In [7]:
import pandas as pd

# Load the Washington D.C. bike-sharing dataset
data_path = "https://raw.githubusercontent.com/pmarcelino/predictive-modeling/main/datasets/dataset-teaching-lec-04.csv"
df = pd.read_csv(data_path)

# Convert datetime and build daily_rentals with a month column
df["datetime"] = pd.to_datetime(df["datetime"])
daily_rentals = (
    df.groupby(df["datetime"].dt.date)["count"]
      .sum()
      .reset_index()
      .rename(columns={"datetime": "date"})
)
daily_rentals["date"] = pd.to_datetime(daily_rentals["date"])
daily_rentals["month"] = daily_rentals["date"].dt.month

# Define "winter" as Dec-Feb and "summer" as Jun-Aug
winter_months = [12, 1, 2]
summer_months = [6, 7, 8]

winter_avg = daily_rentals[daily_rentals['month'].isin(winter_months)]['count'].mean()
summer_avg = daily_rentals[daily_rentals['month'].isin(summer_months)]['count'].mean()

# Flag unusual winter days (too close to summer levels)
winter_anomalies = daily_rentals[
    (daily_rentals['month'].isin(winter_months)) & 
    (daily_rentals['count'] > summer_avg * 0.8)
]

# Flag unusual summer days (too close to winter levels)
summer_anomalies = daily_rentals[
    (daily_rentals['month'].isin(summer_months)) & 
    (daily_rentals['count'] < winter_avg * 1.2)
]

seasonal_anomalies = pd.concat([winter_anomalies, summer_anomalies])
seasonal_anomalies.head()

| | date | count | month |
|---|---|---|---|
| 233 | 2012-01-07 | 4521.0 | 1 |
| 246 | 2012-02-01 | 4579.0 | 2 |
| 252 | 2012-02-07 | 4375.0 | 2 |
| 436 | 2012-12-01 | 5191.0 | 12 |
| 437 | 2012-12-02 | 4649.0 | 12 |


The detected anomalies highlight days in winter months (December, January, and February) with demand levels much closer to what we’d expect in summer. For example, January 7, 2012 shows more than double the average rentals for that month. These could indicate unusually warm days that encouraged cycling, or they might reflect special events. Because all five rows shown fall in 2012, year-over-year growth in ridership is another explanation worth ruling out. In practice, such findings should be validated with weather data or event calendars. This illustrates how seasonal anomaly detection helps identify values that break expected seasonal patterns, providing valuable clues about real-world influences on demand.
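
An alternative to comparing winter days against the summer average is to compare each day against the typical level of its own calendar month. The sketch below is a variation on the idea, not the approach used in the cell above; the data are invented, and the 2× and 0.5× thresholds are assumed rules of thumb.

```python
import pandas as pd

# Hypothetical daily totals for two months
daily = pd.DataFrame({
    "date": pd.to_datetime([
        "2012-01-05", "2012-01-06", "2012-01-07", "2012-01-08", "2012-01-09",
        "2012-07-05", "2012-07-06", "2012-07-07", "2012-07-08", "2012-07-09",
    ]),
    "count": [2000, 2100, 4500, 1900, 2000, 6000, 6200, 5900, 1500, 6100],
})
daily["month"] = daily["date"].dt.month

# Typical level for each calendar month (the median is less sensitive to spikes)
monthly_median = daily.groupby("month")["count"].transform("median")
daily["ratio_to_month"] = daily["count"] / monthly_median

# Assumed rule of thumb: flag days above 2x or below 0.5x the monthly median
flagged = daily[(daily["ratio_to_month"] > 2) | (daily["ratio_to_month"] < 0.5)]

print(flagged[["date", "count", "ratio_to_month"]])
```

Because the baseline is computed per month, this variant flags both an unusually busy January day and an unusually quiet July day without needing separate winter and summer rules.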

### 3.4. Missing Data Detection

**Definition:** Missing data detection is the systematic analysis of gaps in your dataset to understand their patterns, causes, and potential impact on modeling and analysis.

**Explanation:** Unlike outliers which are individual problematic values, missing data represents systematic gaps that can undermine entire analyses. Missingness often follows patterns — clustering in certain periods, affecting groups of variables together, or reflecting underlying system failures like sensor outages.

**Purpose:** We analyze missing data patterns to design targeted cleaning strategies rather than applying generic fixes. Understanding where, when, and why data goes missing ensures our solutions address root causes and preserve data integrity.

We'll explore missing data through three systematic analyses:
- **Quantitative assessment** — count and percentage of missing values per variable
- **Temporal pattern analysis** — check if gaps cluster in specific time periods  
- **Cross-variable analysis** — identify which variables go missing together

**Quantitative Assessment of Missing Data**

A **quantitative assessment** inventories missing values across all columns, showing both you and the client the scale and location of gaps.

In [8]:
# Count missing values and calculate percentages
missing_counts = df.isnull().sum()
missing_percentages = (missing_counts / len(df)) * 100

# Display side by side for clarity
pd.DataFrame({"Missing": missing_counts, "Percentage": missing_percentages})

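| | Missing | Percentage |
|---|---|---|
| datetime | 0 | 0.0 |
| season | 0 | 0.0 |
| holiday | 545 | 5.00643 |
| workingday | 0 | 0.0 |
| weather | 71 | 0.652214 |
| temp | 72 | 0.6614 |
| atemp | 72 | 0.6614 |
| humidity | 72 | 0.6614 |
| windspeed | 72 | 0.6614 |
| casual | 146 | 1.341172 |
| registered | 146 | 1.341172 |
| count | 146 | 1.341172 |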


Our scan shows three main areas of concern:

- First, the `holiday` column is missing in about **5% of rows**, which means we can't always tell whether a given day was a holiday — a potentially important driver of demand.
- Second, weather-related variables (`temp`, `atemp`, `humidity`, `windspeed`) are missing in around **0.6% of cases each**. That may sound small, but because they often go missing together, it likely reflects a sensor or reporting problem.
- Finally, and most importantly, the **target variables** (`casual`, `registered`, and `count`) are missing in **146 rows**, or about **1.3% of the dataset**.

A row with a missing demand value cannot be used to train or evaluate a forecasting model, so these gaps directly shrink the data available for modeling.

**Temporal Pattern Analysis**

In transportation datasets, missing data often clusters in specific periods. This makes **temporal analysis** essential: by grouping missingness across months or seasons, we can check whether the gaps are random or systematically tied to certain time periods.

This matters for forecasting because if missingness is concentrated in peak demand months, any model we build will be biased or incomplete in those periods.

In [9]:
# Extract year and month into a new column
df['year_month'] = df['datetime'].dt.strftime('%Y-%m')

# Group by the new 'year_month' column and calculate missing values
missing_by_month = df.groupby(df['year_month'])['count'].apply(lambda x: x.isnull().sum())
print(missing_by_month)

year_month
2011-01     0
2011-02     0
2011-03     0
2011-04     0
2011-05     0
2011-06     0
2011-07    74
2011-08     0
2011-09     0
2011-10     0
2011-11     0
2011-12     0
2012-01     0
2012-02     0
2012-03     0
2012-04     0
2012-05     0
2012-06     0
2012-07    72
2012-08     0
2012-09     0
2012-10     0
2012-11     0
2012-12     0
Name: count, dtype: int64


The temporal scan reveals that missing demand values are not evenly spread across the dataset. Instead, all 146 of them fall in just two months: **July 2011** (74 rows) and **July 2012** (72 rows). This points to a systematic issue, such as a recurring sensor outage or reporting gap during summer months. For the client, this has a clear implication: forecasts for peak-season demand may be less reliable unless these gaps are addressed. This is not random noise; it is a structural weakness in the dataset that could distort decision-making during the busiest time of year.
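A monthly count does not show whether the gaps form one long outage or many scattered hours. One way to check is to group consecutive missing timestamps into runs. The following is a minimal sketch on a small synthetic hourly series; the helper name `missing_runs` and the toy values are illustrative assumptions, not part of the client dataset.

```python
import numpy as np
import pandas as pd

def missing_runs(series: pd.Series) -> pd.DataFrame:
    """Return one row per contiguous run of missing values (hypothetical helper)."""
    is_missing = series.isnull()
    # A new run id starts wherever the missing / not-missing state changes
    run_id = is_missing.astype(int).diff().ne(0).cumsum()
    timestamps = series.index.to_series()[is_missing]
    return (
        timestamps.groupby(run_id[is_missing])
        .agg(start="min", end="max", n_hours="count")
        .reset_index(drop=True)
    )

# Toy hourly series with two separate outages (made-up values)
idx = pd.date_range("2011-07-01", periods=10, freq="h")
toy = pd.Series([5, np.nan, np.nan, 7, 8, np.nan, np.nan, np.nan, 3, 4],
                index=idx, dtype=float)

runs = missing_runs(toy)
print(runs)
```

On real data, a handful of long runs would support the outage hypothesis, while many one-hour runs would suggest intermittent logging failures instead.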

**Cross-Variable Analysis of Missing Data**

Sometimes, missingness in one variable aligns with gaps in others. This is a crucial diagnostic step: it helps distinguish between isolated issues (e.g., a single column not recorded) and **system-wide failures** (e.g., a weather station outage affecting several variables at once).

By checking which variables tend to go missing together, we can form a more realistic hypothesis about the underlying cause.

In [10]:
# Check rows where at least one weather-related variable is missing
weather_missing = df[df[["temp", "humidity", "windspeed"]].isnull().any(axis=1)]
print(weather_missing.head())

               datetime  season  holiday  workingday  weather  temp  atemp  \
680 2011-02-11 16:00:00       1      0.0           1      NaN   NaN    NaN   
681 2011-02-11 17:00:00       1      0.0           1      NaN   NaN    NaN   
682 2011-02-11 18:00:00       1      0.0           1      NaN   NaN    NaN   
683 2011-02-11 19:00:00       1      0.0           1      4.0   NaN    NaN   
684 2011-02-11 20:00:00       1      0.0           1      NaN   NaN    NaN   

     humidity  windspeed  casual  registered  count year_month  
680       NaN        NaN    14.0       111.0  125.0    2011-02  
681       NaN        NaN    18.0       193.0  211.0    2011-02  
682       NaN        NaN     9.0       165.0  174.0    2011-02  
683       NaN        NaN   105.0       476.0  581.0    2011-02  
684       NaN        NaN     2.0        61.0   63.0    2011-02  


The cross-variable check confirms that missingness is not isolated — entire blocks of weather data (`temp`, `atemp`, `humidity`, `windspeed`) disappear at the same time. For example, on **February 11, 2011**, several consecutive hours show all weather variables missing together. This strongly suggests a **weather station outage or reporting failure**, not random gaps. Recognizing that these variables fail together allows us to design a coordinated cleaning strategy rather than treating each column as an independent problem.
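Inspecting the first few rows only hints at the pattern. A compact way to quantify which columns go missing together is to correlate their missingness indicators. The sketch below uses a small synthetic frame (all values are made up) rather than the client data; note that a column with no missing values has zero variance and will show `NaN` in such a correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic example: temp and humidity fail together, holiday fails independently
toy = pd.DataFrame({
    "temp":     [20.0, np.nan, np.nan, 22.0, 23.0, np.nan],
    "humidity": [55.0, np.nan, np.nan, 60.0, 58.0, np.nan],
    "holiday":  [0.0, 0.0, np.nan, 0.0, np.nan, 0.0],
})

missing_flags = toy.isnull()

# Correlation of the 0/1 missingness indicators: 1.0 means "always missing together"
co_missing = missing_flags.astype(int).corr()
print(co_missing.round(2))

# Share of rows where every weather column is missing at once
all_weather_missing = missing_flags[["temp", "humidity"]].all(axis=1).mean()
print(f"Rows with every weather column missing: {all_weather_missing:.0%}")
```

A correlation near 1.0 between weather columns, combined with a near-zero correlation against unrelated columns such as `holiday`, is the numerical signature of a shared upstream failure.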

**Concluding Reflection on Missing Data**

Our analysis shows that missingness is not random noise — it's a **systematic problem** with serious implications:

- **Quantitative scan:** Notable gaps in `holiday`, weather variables, and target `count`
- **Temporal analysis:** Missing demand values fall entirely in July 2011 and July 2012
- **Cross-variable analysis:** Weather variables often fail together, indicating sensor outages

This structured diagnosis transforms vague concerns into actionable insights. Instead of simply saying *"the dataset is incomplete"*, we can now explain precisely **where, when, and why** missingness occurs.

This prepares us for the next step. In **Data Cleaning Techniques**, we'll move from diagnosis to **treatment**: applying targeted strategies to address missingness (alongside outliers and other issues), ensuring the dataset is ready for reliable modeling.

## 4. Data Cleaning Strategies and Implementation

### 4.1. From Assessment to Treatment: Where Clients See Value

We've diagnosed the problems in our bike-sharing dataset: impossible values, extreme outliers, missing data blocks, and timeline gaps. **Now comes the treatment stage** — this is where clients see the real consulting value.

Diagnosis impresses clients with your analytical rigor, but **cleaning delivers the reliable data foundation** they need for business decisions. While assessment shows what's wrong, cleaning demonstrates how you solve problems systematically and transparently.

This transition from "finding issues" to "fixing issues" represents the shift from diagnostic consultant to solution provider. Clients pay for datasets they can trust, models they can deploy, and insights they can act on.

**Data cleaning** is not cosmetic work to make datasets "look nice." It's a structured process that:

- Distinguishes **errors** from **real events**
- Applies **consistent, rule-based fixes** where possible
- Decides when to **impute or drop** values that cannot be fixed
- Keeps every change **transparent and auditable**

This systematic approach ensures that your cleaning decisions can be explained, defended, and replicated — critical requirements for professional consulting work.

### 4.2. The Unified Cleaning Workflow

Since we are working with time-series data, we will start by **standardizing the timeline** and fixing its structural problems — this is a critical first step that ensures we have a reliable temporal foundation.

From there, we will follow a standard data cleaning workflow that, while always dependent on the specific dataset and business context, can be generalized into a systematic process that learners can apply to their own projects. Once the timeline is reliable, every suspicious value — whether it’s extreme, impossible, or missing — is treated with the same **three-step decision tree**:

1. **Is this an event or an error?**

   - *Event* → keep, but **flag** (e.g., snowstorm, festival).
   - *Error* → continue.

2. **If error: Can I fix it with a rule?**

   - Examples: cap humidity to 100, relabel mis-coded weather categories, set negative rentals to `NaN`.
   - If yes → **fix and flag**.

3. **If cannot fix: Should I impute or drop?**

   - **Predictors (features):** impute if valuable, drop if not.
   - **Target (`count`):** never impute for modeling → drop missing rows.
   - Always **flag** imputations or dropped ranges.
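The three questions above can be written down as a small function, which makes the policy easy to review and test. This is only an illustrative sketch: the function name, its boolean inputs, and the action labels are assumptions introduced here, and in practice answering "event or error?" still requires human judgment and context.

```python
def decide_treatment(is_event: bool, rule_fixable: bool, is_target: bool,
                     worth_imputing: bool = True) -> str:
    """Map the three workflow questions onto one action label (hypothetical helper)."""
    if is_event:                      # Step 1: real event, keep it
        return "keep + flag"
    if rule_fixable:                  # Step 2: error with a rule-based fix
        return "fix + flag"
    if is_target:                     # Step 3: targets are never imputed for modeling
        return "drop + flag"
    return "impute + flag" if worth_imputing else "drop + flag"

# A festival-day spike: real event
print(decide_treatment(is_event=True, rule_fixable=False, is_target=True))
# An unexplained gap in a weather predictor
print(decide_treatment(is_event=False, rule_fixable=False, is_target=False))
# An unexplained gap in the target
print(decide_treatment(is_event=False, rule_fixable=False, is_target=True))
```

Every branch ends in a flag, which mirrors the rule that no treatment should happen silently.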

👉 **Why flag?**
Flagging ensures that every change is **visible, auditable, and explainable**. Flags allow you to:

- Compare model performance with and without imputed values.
- Communicate risks to clients (*“July demand is less reliable: 20% of weather values were imputed.”*).
- Keep a record of what changed and why.

This mindset — *event or error? fix, impute, or drop? always flag* — is the backbone of professional data cleaning.
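Because flags are ordinary boolean columns, they can be summarized into a short audit table for client reporting. The sketch below assumes a `flag_` naming prefix, as used for the flags in this lecture; the frame and the column `flag_temp_imputed` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical cleaned frame with two flag columns (values are illustrative)
toy = pd.DataFrame({
    "temp": [20.1, 20.4, 20.7, 21.0],
    "flag_temp_imputed": [False, True, True, False],
    "flag_missing_timestamp": [False, False, True, False],
})

flag_cols = [c for c in toy.columns if c.startswith("flag_")]

# One row per flag: how many rows it touches and what share of the data that is
audit = pd.DataFrame({
    "rows_flagged": toy[flag_cols].sum(),
    "share_pct": toy[flag_cols].mean() * 100,
})
print(audit)
```

A table like this is what turns a statement such as "some weather values were imputed" into a quantified figure that can go into a report.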

### 4.3. Standardizing the Timeline

As we saw during the assessment stage, the timeline is not reliable: we have missing hours and duplicate rows. To fix this, we will:

- Collapse duplicate rows.
- Reindex to a continuous hourly timeline and flag inserted hours.

**Collapse Duplicate Rows**

Let's start by collapsing duplicate rows. Having more than one row per hour is inconsistent with how the system should work: the bike-sharing dataset should have exactly one record for each hour.

But how do we combine duplicates? We use a **clear, rule-based aggregation policy**:

- For **targets** (`count`, `casual`, `registered`): **SUM** the values, since demand is additive.
- For **numeric predictors** (e.g., weather variables): take the **MEAN**, since conditions are measured as averages.
- For **categorical variables** (`holiday`, `weather`, `season`): take the **FIRST** value, as codes should be stable within an hour.

First, we'll identify and collapse duplicate rows using a rule-based aggregation strategy. This ensures consistent handling while preserving the underlying data meaning.

In [11]:
import numpy as np

# Identify duplicated timestamps
rows_per_ts = df.groupby("datetime").size().rename("n_rows_per_ts")
duplicated_ts = rows_per_ts[rows_per_ts > 1].index
n_dup_timestamps = len(duplicated_ts)
n_dup_rows = int((rows_per_ts[rows_per_ts > 1] - 1).sum())

print(f"Duplicated timestamps: {n_dup_timestamps} | Extra duplicate rows to collapse: {n_dup_rows}")

# Build aggregation policy
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
agg_map = {c: "mean" for c in numeric_cols}

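# SUM targets; min_count=1 keeps an hour with no recorded demand as NaN
# (the default sum would silently turn an all-missing hour into 0)
for c in ["count", "casual", "registered"]:
    if c in agg_map:
        agg_map[c] = lambda s: s.sum(min_count=1)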

for c in ["holiday", "weather", "season"]:
    if c in df.columns:
        agg_map[c] = "first"

# Aggregate to one row per hour
df = (
    df.groupby("datetime", as_index=False)
      .agg(agg_map)
      .sort_values("datetime")
      .reset_index(drop=True)
)

# Add a flag for hours that were collapsed from duplicates
df["flag_collapsed_from_duplicates"] = df["datetime"].isin(duplicated_ts)

print("Collapse complete → policy: SUM targets, MEAN numeric predictors, FIRST categoricals.")
print("Hours affected:", int(df["flag_collapsed_from_duplicates"].sum()))

Duplicated timestamps: 24 | Extra duplicate rows to collapse: 24
Collapse complete → policy: SUM targets, MEAN numeric predictors, FIRST categoricals.
Hours affected: 24


We collapsed **24 duplicated hours**, removing 24 extra rows from the dataset. The aggregation policy ensures that:

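- Demand counts are combined by summation, which is correct when duplicate rows are partial records of the same hour. Exact copies would be double-counted under this policy and should be removed with `drop_duplicates()` first.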
- Weather predictors reflect average conditions.
- Categorical codes remain stable.

The flag `flag_collapsed_from_duplicates` marks these hours so we can always trace which rows were affected. For transparency, this is important: if a client later asks why certain hours look unusual, we can point to the duplication issue.
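A collapse like this is easy to verify with a few reconciliation checks. The sketch below runs them on a three-row toy frame (made-up values), assuming the same SUM policy for the target; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00", "2011-01-01 00:00",
                                "2011-01-01 01:00"]),
    "count": [10.0, 5.0, 7.0],
})

rows_before = len(toy)
# Rows beyond the first occurrence of each timestamp
extra_rows = int(toy.duplicated(subset="datetime").sum())

collapsed = toy.groupby("datetime", as_index=False).agg({"count": "sum"})

# Check 1: exactly one row per timestamp
assert collapsed["datetime"].is_unique
# Check 2: the row count dropped by exactly the number of extra duplicates
assert rows_before - len(collapsed) == extra_rows
# Check 3: under a SUM policy, total demand is unchanged by the collapse
assert collapsed["count"].sum() == toy["count"].sum()

print("Collapse reconciled:", rows_before, "->", len(collapsed), "rows")
```

If any of these checks fails on real data, the aggregation policy or the duplicate detection deserves a second look before moving on.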

**Reindex to a Continuous Hourly Timeline**

Next, we enforce a **continuous hourly index**. Right now, the dataset simply skips missing hours: they aren't represented at all. This makes gaps invisible and impossible to handle systematically.

By reindexing:

- We insert a row for **every missing hour**.
- Those rows will contain `NaN`s for predictors and/or target.
- We add a flag to mark which rows were **inserted**.

The cell below builds the full hourly index, reindexes the data against it, and reports how many hours had to be inserted.

In [12]:
# Build the full hourly index
time_min = df["datetime"].min()
time_max = df["datetime"].max()
full_hours = pd.date_range(time_min, time_max, freq="h")

# Keep original set of hours
original_hours = pd.Index(df["datetime"])

# Reindex and flag
df = df.set_index("datetime").reindex(full_hours)
df.index.name = "datetime"
df["flag_missing_timestamp"] = ~df.index.isin(original_hours)

# Audit
inserted_hours = int(df["flag_missing_timestamp"].sum())
total_hours = len(df)
present_hours = total_hours - inserted_hours

print("=== Reindex audit ===")
print("time_min:", time_min)
print("time_max:", time_max)
print("total_hours_after_reindex:", total_hours)
print("present_hours_from_source:", present_hours)
print("inserted_missing_hours:", inserted_hours)

=== Reindex audit ===
time_min: 2011-01-01 00:00:00
time_max: 2012-12-19 23:00:00
total_hours_after_reindex: 17256
present_hours_from_source: 10862
inserted_missing_hours: 6394


After reindexing, the dataset has **17,256 rows**, one for each expected hour. Of these, **6,394 rows (about 37% of the timeline) were inserted** to represent missing hours. These rows currently contain `NaN`s, which is exactly what we want: the gaps are now explicit and can be handled in the cleaning workflow.

From here:

- For **predictors**, missing values can be imputed using interpolation or seasonal medians.
- For the **target (`count`)**, rows with missing demand must be dropped before model training.
- The flag `flag_missing_timestamp` allows us to communicate clearly to clients how much of the dataset is reconstructed rather than observed.
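The predictor and target rules above might look as follows in pandas. This is a hedged sketch on a six-hour synthetic frame: the flag name `flag_temp_imputed` and the `limit=3` cap are assumptions chosen for illustration, and a suitable cap on real data depends on how quickly the variable changes.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", periods=6, freq="h")
toy = pd.DataFrame({
    "temp":  [10.0, np.nan, np.nan, 13.0, 14.0, 15.0],
    "count": [12.0, 15.0, np.nan, 20.0, 18.0, 25.0],
}, index=idx)

# Flag first, so the record of what was missing survives the imputation
toy["flag_temp_imputed"] = toy["temp"].isnull()

# Time-aware linear interpolation; `limit` fills at most 3 consecutive hours
toy["temp"] = toy["temp"].interpolate(method="time", limit=3)

# Targets are not imputed: rows without observed demand are excluded from training
train = toy.dropna(subset=["count"])
print(train)
```

Interpolation suits smoothly varying predictors such as temperature over short gaps; for long outages, a seasonal median is usually the safer choice because a straight line across several days ignores the daily cycle.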

With these two steps, we’ve established a **reliable timeline**. The dataset now has exactly one row per hour, duplicates resolved, and missing periods made explicit. This creates the solid foundation we need before applying the unified cleaning workflow to outliers and missing data.

### 4.4. Applying the Workflow to Outliers

Earlier we learned to detect outliers; now we'll apply the unified workflow to treat them systematically. The key shift is from detection to decision-making.

We apply the three-step workflow to each outlier:

1. **Is this an event or an error?**

   - *Event* → keep, but **flag** for transparency.
   - *Error* → continue.

2. **If error: Can I fix it with a rule?**

   - If yes → **fix and flag**.
   - If no → move to step 3.

3. **If cannot fix: Should I impute or drop?**

   - **Predictors** → impute if valuable, drop if not.
   - **Target (`count`)** → drop rows (never impute for modeling).
   - Always **flag** the treatment.

In this section, we'll look at two kinds of outliers in our dataset: **impossible values** and **extreme-but-possible values**, the latter surfaced here as flagged unusual demand days.

**Impossible Values (Clear Errors)**

Some values are not just unusual — they are **physically impossible**. For example:

* Negative rentals (`count < 0`).
* Humidity above 100%.
* Temperatures above 100°C in Washington, D.C.

These values can never represent real-world events, so they are **always errors**. Our workflow says: *if error, fix if possible; if not, set to missing (`NaN`) so it can be imputed or dropped later.*

Now we apply rule-based fixes to impossible values and flag each intervention:

In [13]:
import numpy as np

# Negative rentals (cannot exist)
df["flag_negative_count"] = df["count"] < 0
df.loc[df["flag_negative_count"], "count"] = np.nan

# Humidity outside [0,100]
df["flag_humidity_invalid"] = (df["humidity"] < 0) | (df["humidity"] > 100)
df.loc[df["flag_humidity_invalid"], "humidity"] = np.nan

# Temperature above plausible range
df["flag_temp_invalid"] = df["temp"] > 60  # conservative cutoff for Washington climate
df.loc[df["flag_temp_invalid"], "temp"] = np.nan

print("Impossible values flagged and set to NaN where needed.")

Impossible values flagged and set to NaN where needed.


**Consultant reflection:**
Any impossible values (negative rentals, humidity outside 0–100%, and temperatures above the 60°C cutoff) are now set to `NaN`, so that predictors can later be imputed and targets dropped if necessary. Flags preserve full transparency, letting us show clients exactly which rows were affected.

**Extreme-but-Possible Values (Events or Errors)**

Other values may look suspiciously large or small but could still be real. For example:

- A sudden spike in rentals on a festival day.
- A sharp drop in rentals during a blizzard.

Our task is to distinguish **events** (valid, keep+flag) from **errors** (sensor glitches, logging bugs). We'll use Z-scores to flag candidates for further investigation:

In [14]:
# Calculate Z-scores for daily rentals
daily_rentals = (
    df["count"]
    .groupby(df.index.date)
    .sum()
    .rename("daily_count")
    .reset_index()
)

mean = daily_rentals["daily_count"].mean()
std = daily_rentals["daily_count"].std()
daily_rentals["z_score"] = (daily_rentals["daily_count"] - mean) / std

# Flag potential outliers (|Z| > 3)
daily_rentals["flag_daily_outlier"] = daily_rentals["z_score"].abs() > 3
outliers = daily_rentals[daily_rentals["flag_daily_outlier"]]

outliers.head()

| | index | daily_count | z_score | flag_daily_outlier |
|---|---|---|---|---|
| 134 | 2011-05-15 | 21301.0 | 6.44916 | True |
| 306 | 2011-11-03 | 21887.0 | 6.654598 | True |


The Z-score scan flags two days (2011-05-15 and 2011-11-03) whose total rentals sit more than six standard deviations above the mean. These could represent genuine events (such as public holidays or festivals) or errors (like duplicated logs). Our workflow requires further investigation before making a decision:

- If the spike matches an event calendar → keep as **event**, but flag.
- If no event explains it → treat as **error**, and decide whether to drop or impute.
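The calendar check can be done with a left join between the flagged days and an event list. In the sketch below, the two flagged dates come from the scan above, while the event calendar and its single entry are invented purely for illustration; a real project would use a calendar supplied by the client or a public holiday source.

```python
import pandas as pd

# Flagged days (dates taken from the Z-score scan)
outlier_days = pd.DataFrame({"date": pd.to_datetime(["2011-05-15", "2011-11-03"])})

# Hypothetical event calendar: this entry is made up for the example
events = pd.DataFrame({
    "date": pd.to_datetime(["2011-05-15"]),
    "event": ["example city festival (placeholder)"],
})

# Left join keeps every flagged day; unmatched days get NaN in `event`
checked = outlier_days.merge(events, on="date", how="left")
checked["decision"] = checked["event"].notna().map(
    {True: "event: keep + flag", False: "unexplained: investigate as error"}
)
print(checked)
```

The join does not make the decision by itself. It only separates days with a documented explanation from days that still need investigation.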

This step demonstrates the consultant’s value: not just deleting data, but connecting patterns to real-world context.
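One caveat applies when daily totals are computed on a reindexed timeline: with pandas' default `sum`, a day with no observed hours totals 0 rather than `NaN`, and those artificial zeros pull the mean down and inflate the standard deviation used for Z-scores. A possible variant, shown here on synthetic data only, passes `min_count=1` so that fully missing days stay missing and are skipped by `mean()` and `std()`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly demand over 4 days; the third day is entirely missing
idx = pd.date_range("2011-01-01", periods=96, freq="h")
hourly = pd.Series(np.repeat([10.0, 12.0, np.nan, 11.0], 24), index=idx)

daily_default = hourly.groupby(hourly.index.date).sum()
daily_safe = hourly.groupby(hourly.index.date).sum(min_count=1)

print(daily_default.tolist())  # [240.0, 288.0, 0.0, 264.0]
print(daily_safe.tolist())     # [240.0, 288.0, nan, 264.0]

# The artificial zero drags the baseline down; the NaN is simply ignored
print("mean with zeros:", daily_default.mean())  # 198.0
print("mean with NaN:  ", daily_safe.mean())     # 264.0
```

Which baseline is appropriate depends on whether a missing day means "no data" or "no demand"; for inserted hours on a reindexed timeline, it is the former.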

Outlier treatment is not about “removing unusual values.” It’s about applying a **consistent, auditable decision process**:

- **Impossible values** → always errors → set to missing, flag.
- **Extreme-but-possible values** → could be events or errors.

  - If event → keep, flag.
  - If error → fix if rule-based, else impute or drop.

This way, nothing disappears silently. Every change is recorded, every uncertainty is visible, and every decision can be defended to a client.

### 4.5. Professional Data Cleaning Checklist

Here's your consultant-ready workflow for any data cleaning project:

**Phase 1: Foundation**
- [ ] Standardize timeline (collapse duplicates, enforce continuity)
- [ ] Add flags for all structural changes

**Phase 2: The Three-Step Decision Process**
For every suspicious value:
- [ ] **Step 1:** Event or error? (Context check)
- [ ] **Step 2:** If error, can I fix with a rule? (Apply fix + flag)
- [ ] **Step 3:** If unfixable, impute or drop? (Predictors vs. targets)

**Phase 3: Documentation**
- [ ] Flag every intervention with clear labels
- [ ] Document aggregation policies and business logic
- [ ] Quantify impact ("15% of weather data imputed")
- [ ] Prepare client communication on data limitations

**Quality Gates:**
- ✅ No impossible values remain
- ✅ Timeline is continuous and complete
- ✅ Every change is flagged and traceable
- ✅ Client can understand what was done and why
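The first three gates can be partly automated. The sketch below is one possible implementation under stated assumptions: the function name `check_quality_gates` is hypothetical, the frame is expected to have an hourly `DatetimeIndex` plus `count` and `humidity` columns, and flags are expected to use the `flag_` prefix. The fourth gate, client understanding, remains a human judgment.

```python
import numpy as np
import pandas as pd

def check_quality_gates(df: pd.DataFrame) -> dict:
    """Return a pass/fail result for each automatable gate (hypothetical helper)."""
    expected = pd.date_range(df.index.min(), df.index.max(), freq="h")
    flag_cols = [c for c in df.columns if c.startswith("flag_")]
    # Comparisons against NaN are False, so missing values are not violations here
    impossible = (df["count"] < 0) | (df["humidity"] < 0) | (df["humidity"] > 100)
    return {
        "no_impossible_values": not impossible.any(),
        "timeline_continuous": bool(df.index.is_unique and df.index.equals(expected)),
        "flags_present": len(flag_cols) > 0,
    }

# Toy cleaned frame (made-up values)
idx = pd.date_range("2011-01-01", periods=4, freq="h")
toy = pd.DataFrame({
    "count": [5.0, np.nan, 7.0, 3.0],
    "humidity": [50.0, 60.0, np.nan, 100.0],
    "flag_missing_timestamp": [False, True, False, False],
}, index=idx)

result = check_quality_gates(toy)
print(result)
```

Running such a check at the end of every cleaning script gives a quick signal when a later code change reintroduces a problem that had already been fixed.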

This checklist ensures your cleaning is not just thorough, but **defensible to clients and auditable by colleagues**.

---

## Summary and Transition to Feature Engineering Implementation

This lecture moved from diagnosing data quality problems to treating them. You assessed missing data quantitatively, over time, and across variables; standardized the timeline by collapsing duplicates and reindexing to a continuous hourly grid; and applied a single decision process (event or error, fix, impute or drop, always flag) to impossible values and outliers.

The result is a dataset whose limitations are explicit: every inserted hour, corrected value, and dropped range is flagged and can be explained to a client. This combination of technical rigor, business judgment, and transparent documentation is what makes an analysis trustworthy in professional transportation consulting.

In our next lecture, we'll build on this cleaned foundation with preprocessing and feature engineering techniques that turn reliable data into model-ready inputs for machine learning.