# Initial data cleaning

In this notebook we clean a small portion of the POGOH dataset. This gives some ideas on how to handle the data at a larger scale.

__NOTE:__ This dataset has some NaN values in End Station Id and End Station Name. Because of this, pandas reads the End Station Id column as float instead of integer.
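
If integer IDs are preferred even with gaps, one option is pandas' nullable `Int64` dtype. The snippet below is only a sketch on a made-up toy frame (the values are not from the POGOH data), and it assumes every non-missing ID is a whole number.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({"End Station Id": [12.0, None, 7.0]})
print(toy["End Station Id"].dtype)  # float64, because of the missing value

# The nullable "Int64" dtype stores integers and marks gaps as <NA>.
# ASSUMPTION: all non-missing IDs are whole numbers, otherwise the cast fails.
toy["End Station Id"] = toy["End Station Id"].astype("Int64")
print(toy["End Station Id"].dtype)
```

The notebook itself keeps the default float column, so this is an alternative rather than what is done below.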

In [42]:
import pandas as pd

# Define the file path
file_path = "/home/manuel/Documents/AI/pogoh-ai-engineering/data/raw/april-2025.xlsx"

# Load the Excel file into a DataFrame
pogoh_df = pd.read_excel(file_path)
pogoh_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47523 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Closed Status       47523 non-null  object        
 1   Duration            47523 non-null  int64         
 2   Start Station Id    47523 non-null  int64         
 3   Start Date          47523 non-null  datetime64[ns]
 4   Start Station Name  47523 non-null  object        
 5   End Date            47523 non-null  datetime64[ns]
 6   End Station Id      47497 non-null  float64       
 7   End Station Name    47497 non-null  object        
 8   Rider Type          47523 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 3.3+ MB


I noticed that some column names in the dataset contain spaces. This makes them easy to read, but it might lead to unexpected behavior when the data is used for training models. For this reason, I convert the names to lower case and replace the spaces with underscores.

In [None]:
# Standardize column names: lowercase, replace spaces with underscores
pogoh_df.columns = (
    pogoh_df.columns.str.strip()          # remove leading/trailing spaces
                     .str.lower()         # convert to lowercase
                     .str.replace(" ", "_")  # replace spaces with underscores
)
pogoh_df.columns
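
As a quick illustration of what this kind of renaming produces, here is a small sketch on a standalone `pd.Index`. The first two names come from the `info()` output above. The third, with extra whitespace, is a made-up case added only to show that a regular expression (`\s+`) collapses runs of spaces into one underscore, which the plain `" "` replacement would not do.

```python
import pandas as pd

# Two real column names plus one invented name with irregular spacing.
cols = pd.Index(["Closed Status", "Start Station Id", " Rider  Type "])

clean = (
    cols.str.strip()                            # remove leading/trailing spaces
        .str.lower()                            # convert to lowercase
        .str.replace(r"\s+", "_", regex=True)   # collapse whitespace runs to "_"
)
print(list(clean))
```

For this dataset both approaches should give the same result, since the column names shown above only contain single spaces.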

There are several data cleaning tasks that could be done, so I'll divide them into broad categories.

## Missing & Invalid Data

- Inspect and handle rows with missing End Station Id and End Station Name.
- Drop or repair rows with Duration <= 0 or End Date < Start Date.
- Recalculate duration from timestamps and check against Duration column.
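
The last two checks in this list could be sketched as follows. This is a minimal example on invented toy trips, not a result from the POGOH data. It assumes the standardized column names (`duration`, `start_date`, `end_date`), that `duration` is recorded in seconds, and a hypothetical tolerance of one second. All three assumptions would need to be verified against the real data.

```python
import pandas as pd

# Toy trips; all values are invented for illustration.
trips = pd.DataFrame({
    "duration": [600, 0, 300, 120],
    "start_date": pd.to_datetime([
        "2025-04-01 08:00:00", "2025-04-01 09:00:00",
        "2025-04-01 10:10:00", "2025-04-01 11:00:00",
    ]),
    "end_date": pd.to_datetime([
        "2025-04-01 08:10:00", "2025-04-01 09:00:00",
        "2025-04-01 10:05:00", "2025-04-01 11:03:00",
    ]),
})

# Recompute the duration from the timestamps.
# ASSUMPTION: the duration column is in seconds.
trips["duration_calc"] = (trips["end_date"] - trips["start_date"]).dt.total_seconds()

# Rows with a non-positive duration or an end before the start.
invalid = (trips["duration"] <= 0) | (trips["end_date"] < trips["start_date"])

# Rows where the recorded and recomputed durations disagree.
tolerance_s = 1  # hypothetical tolerance, in seconds
mismatch = (trips["duration"] - trips["duration_calc"]).abs() > tolerance_s

# Keep only the rows that pass the validity check.
valid_trips = trips[~invalid].copy()
print(trips.assign(invalid=invalid, mismatch=mismatch))
```

Whether mismatched rows should be dropped or repaired (for example, by trusting the timestamps) is a separate decision that this sketch leaves open.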

### Missing Stations

We start by handling trips where either the start station or the end station is missing. In this particular set there were no trips with missing start station information, which makes sense since retrieving the bike from a station is what initiates a trip. However, this information could still be missing due to unforeseen errors.

In [None]:
# Filter rows where either Start Station Id or Start Station Name is missing
missing_start_station = pogoh_df[
    pogoh_df["start_station_id"].isnull() | pogoh_df["start_station_name"].isnull()
]

# Count how many of these rows fall into each Closed Status category
missing_start_status_counts = missing_start_station["closed_status"].value_counts()
print(missing_start_status_counts)

# Preview the first few rows with missing values
missing_start_station.head()

Regarding trips with missing end station information, there were 26 trips missing both the end station ID and name. All of them had the closed status terminated.
If only one of the two were missing, imputing it would be possible by looking up the matching ID or station name.
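
For the case where only one of the two fields is missing, the lookup could be sketched like this. The toy data is invented (it is not from the POGOH file), and the approach assumes that every station ID corresponds to exactly one name and vice versa.

```python
import pandas as pd

# Toy data; station IDs and names are invented.
toy = pd.DataFrame({
    "end_station_id": [1.0, 1.0, None, 2.0],
    "end_station_name": ["Alpha St", None, "Beta Ave", "Beta Ave"],
})

# Build lookup tables from the rows where both fields are present.
# ASSUMPTION: IDs and names pair one-to-one, so both indexes are unique.
complete = toy.dropna(subset=["end_station_id", "end_station_name"]).drop_duplicates()
id_to_name = complete.set_index("end_station_id")["end_station_name"]
name_to_id = complete.set_index("end_station_name")["end_station_id"]

# Fill a missing name from the ID, then a missing ID from the name.
toy["end_station_name"] = toy["end_station_name"].fillna(
    toy["end_station_id"].map(id_to_name)
)
toy["end_station_id"] = toy["end_station_id"].fillna(
    toy["end_station_name"].map(name_to_id)
)
print(toy)
```

Rows missing both fields, like the 26 trips mentioned above, stay missing under this approach.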

In [None]:
# Filter rows where either End Station Id or End Station Name is missing
missing_end_station = pogoh_df[
    pogoh_df["end_station_id"].isnull() | pogoh_df["end_station_name"].isnull()
]

print("=== Missing End Station ID ===")
print(pogoh_df["end_station_id"].isnull().sum())
print("\n=== Missing End Station Name ===")
print(pogoh_df["end_station_name"].isnull().sum())
missing_end_station.describe()


In [None]:
# Count how many of these rows fall into each Closed Status category
missing_end_status_counts = missing_end_station["closed_status"].value_counts()
print(missing_end_status_counts)

# Preview the first few rows with missing values
missing_end_station.head()

Trips like these could later be handled within an anomaly detection framework. For now, I'll drop any trip with a missing end station ID or name.

In [None]:
# Dropping rows with missing end station ID or name
pogoh_df_cleaned = pogoh_df.dropna(subset=["end_station_id", "end_station_name"]).copy()
print(pogoh_df.shape)
print(pogoh_df_cleaned.shape)

Now we move on to standardizing the station names and checking whether each station ID maps to a unique station.

NOTE: There have been instances of POGOH stations being relocated, which might cause some issues. For example, a station might be relocated and renamed while keeping the same ID.

For start stations, all IDs and names were paired uniquely.

In [None]:
# Check for start station ID mapping to more than one name
start_conflicts = (
    pogoh_df_cleaned.groupby("start_station_id")["start_station_name"]
    .nunique()
    .reset_index(name="name_count")
)
start_conflicts = start_conflicts[start_conflicts["name_count"] > 1]

# Display actual name mismatches (if any)
if not start_conflicts.empty:
    display(
        pogoh_df_cleaned[
            pogoh_df_cleaned["start_station_id"].isin(start_conflicts["start_station_id"])
        ][["start_station_id", "start_station_name"]].drop_duplicates().sort_values("start_station_id")
    )


For end stations, the ID and name matching also didn't have any issues.

In [None]:
# Check for end station ID mapping to more than one name
end_conflicts = (
    pogoh_df_cleaned.groupby("end_station_id")["end_station_name"]
    .nunique()
    .reset_index(name="name_count")
)
end_conflicts = end_conflicts[end_conflicts["name_count"] > 1]

if not end_conflicts.empty:
    display(
        pogoh_df_cleaned[
            pogoh_df_cleaned["end_station_id"].isin(end_conflicts["end_station_id"])
        ][["end_station_id", "end_station_name"]].drop_duplicates().sort_values("end_station_id")
    )


Just to be safe, I'm also checking that the ID/name pairs are exactly the same between the start and end stations.

In [None]:
# 1. Get unique (station_id, station_name) pairs from start and end columns
start_pairs = pogoh_df_cleaned[["start_station_id", "start_station_name"]].drop_duplicates()
end_pairs = pogoh_df_cleaned[["end_station_id", "end_station_name"]].drop_duplicates()

# 2. Rename end_pairs to align columns for comparison
end_pairs = end_pairs.rename(columns={
    "end_station_id": "start_station_id",
    "end_station_name": "start_station_name"
})


end_pairs.head()

In [None]:
# 3. Check for pairs that appear in start but not in end
start_only = pd.merge(start_pairs, end_pairs, how="left", indicator=True)
start_only = start_only[start_only["_merge"] == "left_only"].drop(columns="_merge")

# 4. Check for pairs that appear in end but not in start
end_only = pd.merge(end_pairs, start_pairs, how="left", indicator=True)
end_only = end_only[end_only["_merge"] == "left_only"].drop(columns="_merge")

# 5. Print result summary
print("Start-only station pairs (not seen in end stations):", len(start_only))
print("End-only station pairs (not seen in start stations):", len(end_only))

# 6. (Optional) Display mismatched pairs if needed
if not start_only.empty:
    print("Start-only mismatches:")
    print(start_only)

if not end_only.empty:
    print("End-only mismatches:")
    print(end_only)
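
The same comparison can be cross-checked with plain Python sets. The sketch below uses invented toy stations and hypothetical variable names (`start_set`, `end_set`). It casts the float end IDs to integers first, which assumes no missing end IDs remain at this point.

```python
import pandas as pd

# Toy data; IDs and names are invented. End IDs are floats, as in the real file.
toy = pd.DataFrame({
    "start_station_id": [1, 2, 3],
    "start_station_name": ["A", "B", "C"],
    "end_station_id": [2.0, 3.0, 4.0],
    "end_station_name": ["B", "C", "D"],
})

# Build sets of (id, name) tuples, with IDs as plain Python ints.
# ASSUMPTION: rows with missing end IDs were already dropped.
start_set = set(zip(toy["start_station_id"].astype(int).tolist(),
                    toy["start_station_name"]))
end_set = set(zip(toy["end_station_id"].astype(int).tolist(),
                  toy["end_station_name"]))

# Pairs seen on one side only.
print("Start-only pairs:", start_set - end_set)
print("End-only pairs:", end_set - start_set)
```

The merge with `indicator=True` keeps the result as a DataFrame, which is handier for further inspection. The set version is just a short sanity check.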

In [None]:
pogoh_df_cleaned.info()