# Initial data cleaning

In this notebook we clean a small portion of the POGOH dataset. This should give some ideas on how to handle the data at a larger scale.

__NOTE:__ In this dataset there were some NaN observations in End Station Id and End Station Name. Because NaN is a floating-point value, pandas reads the End Station Id column as float instead of integer. We handle this issue in this notebook.
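
To illustrate why the dtype changes, here is a minimal sketch on a toy series (the values are made up for illustration and are not taken from the file). NumPy's integer dtype cannot hold NaN, so pandas falls back to `float64`. The nullable `Int64` dtype is one possible alternative when the rows with missing IDs need to be kept rather than dropped.

```python
import numpy as np
import pandas as pd

# Toy station IDs with one missing value (hypothetical data).
ids = pd.Series([12, 7, np.nan], name="end_station_id")
print(ids.dtype)  # float64, because NaN forces a float column

# The nullable integer dtype keeps integer values and marks the gap as <NA>.
nullable_ids = ids.astype("Int64")
print(nullable_ids.dtype)  # Int64
print(nullable_ids.isna().sum())  # number of missing IDs
```

In this notebook the rows with missing IDs are dropped instead, so a plain `int64` conversion is enough.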

In [1]:
import pandas as pd
import string

In [2]:
# Define the file path
file_path = "/home/manuel/Documents/AI/pogoh-ai-engineering/data/raw/april-2025.xlsx"

# Load the Excel file into a DataFrame
pogoh_df = pd.read_excel(file_path)
pogoh_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47523 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Closed Status       47523 non-null  object        
 1   Duration            47523 non-null  int64         
 2   Start Station Id    47523 non-null  int64         
 3   Start Date          47523 non-null  datetime64[ns]
 4   Start Station Name  47523 non-null  object        
 5   End Date            47523 non-null  datetime64[ns]
 6   End Station Id      47497 non-null  float64       
 7   End Station Name    47497 non-null  object        
 8   Rider Type          47523 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 3.3+ MB


Some column names in the dataset contain spaces and capital letters. This is easy to read, but it can lead to unexpected behavior when the data is used for training models (for example, a name with a space cannot be used with attribute-style access such as `df.duration`). For this reason, I convert the names to lowercase and replace the spaces with underscores.

In [3]:
# Standardize column names: lowercase, replace spaces with underscores
pogoh_df.columns = (
    pogoh_df.columns.str.strip()          # remove leading/trailing spaces
                     .str.lower()         # convert to lowercase
                     .str.replace(" ", "_")  # replace spaces with underscores
)
pogoh_df.columns

Index(['closed_status', 'duration', 'start_station_id', 'start_date',
       'start_station_name', 'end_date', 'end_station_id', 'end_station_name',
       'rider_type'],
      dtype='object')

There are several tasks that could be done for data cleaning, so I'll divide them by broad categories.

## Missing & Invalid Data

- Inspect and handle rows with missing End Station Id and End Station Name.
- Drop or repair rows with Duration <= 0 or End Date < Start Date.
- Recalculate duration from timestamps and check against Duration column.
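
The duration checks listed above could be sketched as follows. This is only an illustration: the first two rows are copied from this dataset, the third row is made up to trigger the invalid case, and the 60-second tolerance and the new column names are assumptions. The two real rows suggest that `duration` is stored in seconds, since it matches the difference between the timestamps.

```python
import pandas as pd

# Two rows copied from this dataset plus one made-up invalid row (hypothetical).
trips = pd.DataFrame({
    "duration": [762, 45506, 300],
    "start_date": pd.to_datetime([
        "2025-04-28 17:22:18",
        "2025-04-27 01:12:34",
        "2025-04-01 10:05:00",
    ]),
    "end_date": pd.to_datetime([
        "2025-04-28 17:35:00",
        "2025-04-27 13:51:00",
        "2025-04-01 10:00:00",  # ends before it starts
    ]),
})

# Duration implied by the timestamps, in seconds.
trips["duration_from_dates"] = (
    trips["end_date"] - trips["start_date"]
).dt.total_seconds()

# Flag rows with a non-positive duration or an end date before the start date.
trips["invalid"] = (trips["duration"] <= 0) | (
    trips["end_date"] < trips["start_date"]
)

# Among valid rows, flag durations that disagree with the timestamps.
# The 60-second tolerance is an arbitrary assumption.
trips["duration_mismatch"] = ~trips["invalid"] & (
    (trips["duration_from_dates"] - trips["duration"]).abs() > 60
)

print(trips[["duration", "duration_from_dates", "invalid", "duration_mismatch"]])
```

Whether flagged rows should be dropped or repaired (for example, by overwriting `duration` with the recomputed value) is a separate decision that this sketch does not make.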

### Missing Stations

We start by handling trips where either the start station or the end station is missing. In this particular set there were no trips with missing start station information, which makes sense since retrieving the bike from a station is what initiates a trip. However, this information could still be missing due to unforeseen errors.

In [4]:
# Filter rows where either Start Station Id or Start Station Name is missing
missing_start_station = pogoh_df[
    pogoh_df["start_station_id"].isnull() | pogoh_df["start_station_name"].isnull()
]

# Count how many of these rows fall into each Closed Status category
missing_start_status_counts = missing_start_station["closed_status"].value_counts()
print(missing_start_status_counts)

# Preview the first few rows with missing values
missing_start_station.head()

Series([], Name: count, dtype: int64)


Unnamed: 0,closed_status,duration,start_station_id,start_date,start_station_name,end_date,end_station_id,end_station_name,rider_type

Regarding trips with missing end station information, there were 26 trips missing both the end station ID and name. All of them had the closed status TERMINATED.
If only one of the two fields were missing, it could be imputed by looking up the other: the name from the ID, or the ID from the name.
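
That case does not occur in this file, but the lookup could be sketched as below. The toy frame reuses two station ID/name pairs from this dataset and artificially blanks out two names; the variable names are assumptions made for the example.

```python
import pandas as pd

# Toy trips where the end station ID is known but the name is sometimes missing.
trips = pd.DataFrame({
    "end_station_id": [15, 21, 15, 21],
    "end_station_name": [
        "Ivy St & Walnut St",
        "Liberty Ave & Stanwix St",
        None,
        None,
    ],
})

# Build an ID -> name lookup from the rows where both fields are present.
id_to_name = (
    trips.dropna(subset=["end_station_id", "end_station_name"])
    .drop_duplicates("end_station_id")
    .set_index("end_station_id")["end_station_name"]
)

# Fill only the missing names; existing names are left untouched.
trips["end_station_name"] = trips["end_station_name"].fillna(
    trips["end_station_id"].map(id_to_name)
)
print(trips)
```

The reverse direction (filling a missing ID from a known name) would follow the same pattern with a name-to-ID lookup. Both directions rely on the ID/name pairing being unique.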

In [5]:
# Filter rows where either End Station Id or End Station Name is missing
missing_end_station = pogoh_df[
    pogoh_df["end_station_id"].isnull() | pogoh_df["end_station_name"].isnull()
]

print("=== Missing End Station ID ===")
print(pogoh_df["end_station_id"].isnull().sum())
print("\n=== Missing End Station Name ===")
print(pogoh_df["end_station_name"].isnull().sum())
missing_end_station.describe()


=== Missing End Station ID ===
26

=== Missing End Station Name ===
26


Unnamed: 0,duration,start_station_id,start_date,end_date,end_station_id
count,26.0,26.0,26,26,0.0
mean,18604.961538,32.615385,2025-04-15 08:35:43.500000,2025-04-15 13:45:48.461538304,
min,157.0,10.0,2025-04-03 06:26:13,2025-04-03 06:37:00,
25%,480.0,21.0,2025-04-05 17:44:42.500000,2025-04-05 18:21:00,
50%,733.5,29.0,2025-04-14 21:45:04,2025-04-14 22:18:00,
75%,3437.5,49.25,2025-04-21 19:44:15.750000128,2025-04-21 19:47:00,
max,175669.0,58.0,2025-04-28 17:22:18,2025-04-28 17:35:00,
std,41207.256464,16.613433,,,


In [6]:
# Count how many of these rows fall into each Closed Status category
missing_end_status_counts = missing_end_station["closed_status"].value_counts()
print(missing_end_status_counts)

# Preview the first few rows with missing values
missing_end_station.head()

closed_status
TERMINATED    26
Name: count, dtype: int64


Unnamed: 0,closed_status,duration,start_station_id,start_date,start_station_name,end_date,end_station_id,end_station_name,rider_type
4603,TERMINATED,762,15,2025-04-28 17:22:18,Ivy St & Walnut St,2025-04-28 17:35:00,,,MEMBER
7352,TERMINATED,45506,21,2025-04-27 01:12:34,Liberty Ave & Stanwix St,2025-04-27 13:51:00,,,CASUAL
9308,TERMINATED,78292,21,2025-04-25 17:37:08,Liberty Ave & Stanwix St,2025-04-26 15:22:00,,,CASUAL
9310,TERMINATED,78398,21,2025-04-25 17:36:22,Liberty Ave & Stanwix St,2025-04-26 15:23:00,,,CASUAL
9313,TERMINATED,78427,21,2025-04-25 17:35:53,Liberty Ave & Stanwix St,2025-04-26 15:23:00,,,CASUAL


These kinds of trips could be handled within an anomaly detection framework. For now, I'll drop any trips that exhibit this behavior.
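
As a possible starting point for that later work, the rows could be flagged rather than removed, so that they remain available for inspection. The sketch below is an assumption, not part of this notebook's pipeline: the flag column name `missing_end_station`, the `NORMAL` status value, and the first row are hypothetical, while the two TERMINATED rows reuse values seen above.

```python
import numpy as np
import pandas as pd

# Toy trips: one complete row (hypothetical) and two rows without an end station.
trips = pd.DataFrame({
    "closed_status": ["NORMAL", "TERMINATED", "TERMINATED"],
    "duration": [600, 762, 45506],
    "end_station_id": [12.0, np.nan, np.nan],
    "end_station_name": ["Example St & Sample Ave", None, None],
})

# Mark rows where either end station field is missing instead of dropping them.
trips["missing_end_station"] = (
    trips[["end_station_id", "end_station_name"]].isna().any(axis=1)
)

# Compare a simple statistic between flagged and unflagged trips.
summary = trips.groupby("missing_end_station")["duration"].median()
print(summary)
```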

In [7]:
# Drop rows with a missing end station ID or name
pogoh_df_cleaned = pogoh_df.dropna(subset=["end_station_id", "end_station_name"]).copy()
print(pogoh_df.shape)
print(pogoh_df_cleaned.shape)
pogoh_df_cleaned.info()

(47523, 9)
(47497, 9)
<class 'pandas.core.frame.DataFrame'>
Index: 47497 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   closed_status       47497 non-null  object        
 1   duration            47497 non-null  int64         
 2   start_station_id    47497 non-null  int64         
 3   start_date          47497 non-null  datetime64[ns]
 4   start_station_name  47497 non-null  object        
 5   end_date            47497 non-null  datetime64[ns]
 6   end_station_id      47497 non-null  float64       
 7   end_station_name    47497 non-null  object        
 8   rider_type          47497 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 3.6+ MB


Now that we've dealt with the observations with missing station IDs, we can convert the ID columns to integer type.

In [8]:
# Ensure both start and end station IDs are integers
pogoh_df_cleaned["start_station_id"] = pogoh_df_cleaned["start_station_id"].astype("int64")
pogoh_df_cleaned["end_station_id"] = pogoh_df_cleaned["end_station_id"].astype("int64")
pogoh_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 47497 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   closed_status       47497 non-null  object        
 1   duration            47497 non-null  int64         
 2   start_station_id    47497 non-null  int64         
 3   start_date          47497 non-null  datetime64[ns]
 4   start_station_name  47497 non-null  object        
 5   end_date            47497 non-null  datetime64[ns]
 6   end_station_id      47497 non-null  int64         
 7   end_station_name    47497 non-null  object        
 8   rider_type          47497 non-null  object        
dtypes: datetime64[ns](2), int64(3), object(4)
memory usage: 3.6+ MB


Now we move on to standardizing the station names and checking whether each station ID maps to a unique station.

NOTE: There have been instances of POGOH stations being relocated, which might cause issues. For example, a station could be relocated and renamed while keeping the same ID.
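
If such a rename did occur, one way to spot it would be to record when each ID/name pair was first and last seen. The sketch below uses made-up station names and dates, and the `first_seen` and `last_seen` column names are assumptions.

```python
import pandas as pd

# Toy trips: station 1 changes its name partway through (hypothetical data).
trips = pd.DataFrame({
    "start_station_id": [1, 1, 2, 2],
    "start_station_name": ["Old Name", "New Name", "Same Name", "Same Name"],
    "start_date": pd.to_datetime([
        "2025-04-01 08:00:00",
        "2025-04-20 09:00:00",
        "2025-04-02 10:00:00",
        "2025-04-25 11:00:00",
    ]),
})

# First and last appearance of every (ID, name) pair.
history = (
    trips.groupby(["start_station_id", "start_station_name"])["start_date"]
    .agg(first_seen="min", last_seen="max")
    .reset_index()
)

# IDs that appear with more than one name are candidates for a rename.
renamed = history[history.duplicated("start_station_id", keep=False)]
print(renamed.sort_values(["start_station_id", "first_seen"]))
```

Non-overlapping date ranges for the same ID would point to a rename, while overlapping ranges would more likely indicate a data entry problem.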

For start stations, every ID was paired with exactly one name.

In [9]:
# Check for start station ID mapping to more than one name
start_conflicts = (
    pogoh_df_cleaned.groupby("start_station_id")["start_station_name"]
    .nunique()
    .reset_index(name="name_count")
)
start_conflicts = start_conflicts[start_conflicts["name_count"] > 1]

# Display actual name mismatches (if any)
if not start_conflicts.empty:
    display(
        pogoh_df_cleaned[
            pogoh_df_cleaned["start_station_id"].isin(start_conflicts["start_station_id"])
        ][["start_station_id", "start_station_name"]].drop_duplicates().sort_values("start_station_id")
    )


For end stations, the ID and name matching also didn't have any issues.

In [10]:
# Check for end station IDs mapping to more than one name
end_conflicts = (
    pogoh_df_cleaned.groupby("end_station_id")["end_station_name"]
    .nunique()
    .reset_index(name="name_count")
)
end_conflicts = end_conflicts[end_conflicts["name_count"] > 1]

if not end_conflicts.empty:
    display(
        pogoh_df_cleaned[
            pogoh_df_cleaned["end_station_id"].isin(end_conflicts["end_station_id"])
        ][["end_station_id", "end_station_name"]].drop_duplicates().sort_values("end_station_id")
    )


Just to be safe, I'm also checking that the ID/name pairs are exactly the same between the start and end stations.

In [11]:
# 1. Get unique (station_id, station_name) pairs from start and end columns
start_pairs = pogoh_df_cleaned[["start_station_id", "start_station_name"]].drop_duplicates()
end_pairs = pogoh_df_cleaned[["end_station_id", "end_station_name"]].drop_duplicates()

# 2. Rename end_pairs to align columns for comparison
end_pairs = end_pairs.rename(columns={
    "end_station_id": "start_station_id",
    "end_station_name": "start_station_name"
})

# 3. Check for pairs that appear in start but not in end
start_only = pd.merge(start_pairs, end_pairs, how="left", indicator=True)
start_only = start_only[start_only["_merge"] == "left_only"].drop(columns="_merge")

# 4. Check for pairs that appear in end but not in start
end_only = pd.merge(end_pairs, start_pairs, how="left", indicator=True)
end_only = end_only[end_only["_merge"] == "left_only"].drop(columns="_merge")

# 5. Print result summary
print("Start-only station pairs (not seen in end stations):", len(start_only))
print("End-only station pairs (not seen in start stations):", len(end_only))

# 6. (Optional) Display mismatched pairs if needed
if not start_only.empty:
    print("Start-only mismatches:")
    print(start_only)

if not end_only.empty:
    print("End-only mismatches:")
    print(end_only)

Start-only station pairs (not seen in end stations): 0
End-only station pairs (not seen in start stations): 0


All ID/name pairs match, as expected. It might still be worth repeating this check on the full dataset as a sanity check. Counting the unique station names gives 60 for both start and end stations. Together with the ID/name pairs being identical across start and end stations, this suggests that in this sample there are no cases where names were misspelled or IDs were incorrect.
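
For the full dataset, the checks above could be wrapped in a small helper that also tests the reverse direction (one name mapping to several IDs), which the cells in this notebook do not cover. The function name `one_to_one_violations` and the toy data are hypothetical.

```python
import pandas as pd


def one_to_one_violations(df, id_col, name_col):
    """Return (IDs with several names, names with several IDs) as DataFrames."""
    pairs = df[[id_col, name_col]].drop_duplicates()
    # After de-duplicating pairs, a repeated ID means it has more than one name.
    ids_with_many_names = pairs[pairs.duplicated(id_col, keep=False)]
    # Likewise, a repeated name means it is attached to more than one ID.
    names_with_many_ids = pairs[pairs.duplicated(name_col, keep=False)]
    return ids_with_many_names, names_with_many_ids


# Toy example: ID 1 has two spellings, and "C Blvd" is shared by IDs 3 and 4.
sample = pd.DataFrame({
    "station_id": [1, 1, 2, 3, 4],
    "station_name": ["A St", "A Street", "B Ave", "C Blvd", "C Blvd"],
})
many_names, many_ids = one_to_one_violations(sample, "station_id", "station_name")
print(many_names)
print(many_ids)
```

Both returned frames being empty would indicate a one-to-one mapping between IDs and names.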

In [12]:


# Check the number of unique end station names
print(pogoh_df_cleaned['end_station_name'].nunique())
# Check the number of unique start station names
print(pogoh_df_cleaned['start_station_name'].nunique())


60
60


Next, we format the station names: we remove leading and trailing spaces, remove or replace symbols, and expand abbreviations into full words.

In [13]:
# Strip whitespace at the start and end
pogoh_df_cleaned['start_station_name_clean'] = pogoh_df_cleaned['start_station_name'].str.strip()
# Make all characters lowercase
pogoh_df_cleaned['start_station_name_clean'] = pogoh_df_cleaned['start_station_name_clean'].str.lower()
# Normalize inner spaces to just one space
pogoh_df_cleaned['start_station_name_clean'] = (
    pogoh_df_cleaned['start_station_name_clean'].str.replace(r"\s+", " ", regex=True)
)
# Normalize common abbreviations
replace_dict = {
    r"\bst\b": "street",
    r"\bave\b": "avenue",
    r"\bblvd\b": "boulevard",
    r"\bdr\b": "drive",
    r"\bext\b": "extension",
    r"\bn\b": "north",
    r"\bs\b": "south",
    r"&": "and"
}
for key, val in replace_dict.items():
    pogoh_df_cleaned['start_station_name_clean'] = pogoh_df_cleaned['start_station_name_clean'].str.replace(key, val, regex=True)
# Remove punctuation symbols in the strings
# NOTE: If done earlier, this would erase symbols like & from the names before they are replaced
pogoh_df_cleaned['start_station_name_clean'] = (
    pogoh_df_cleaned['start_station_name_clean'].str.translate(str.maketrans("","",string.punctuation))
)


pogoh_df_cleaned['start_station_name_clean'].head(15)

0           schenley drive and schenley drive extension
1              north dithridge street and centre avenue
2               south bouquet avenue and sennott street
3               south bouquet avenue and sennott street
4     south 27th street and sidney street southside ...
5                  allequippa street and darragh street
6                  allequippa street and darragh street
7                         42nd street and butler street
8                        atwood street and bates street
9                     ohara street and university place
10              south bouquet avenue and sennott street
11                     coltart avenue and forbes avenue
12                       atwood street and bates street
13              south bouquet avenue and sennott street
14                 allequippa street and darragh street
Name: start_station_name_clean, dtype: object

The above procedures will be used repeatedly with other datasets from the POGOH database, so I'll implement them as functions.
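
The contents of `shared_utils/pogoh_cleaning.py` are not shown in this notebook, so the following is only a sketch of what a `clean_station_names` helper might look like, assuming it takes a DataFrame and a `col_name` argument and writes the result to a `<col_name>_clean` column. Unlike the cell above, this version replaces `&` and strips punctuation before expanding abbreviations, which is a design choice of the sketch and not necessarily what the real module does.

```python
import string

import pandas as pd

# Abbreviation patterns taken from the cell above.
ABBREVIATIONS = {
    r"\bst\b": "street",
    r"\bave\b": "avenue",
    r"\bblvd\b": "boulevard",
    r"\bdr\b": "drive",
    r"\bext\b": "extension",
    r"\bn\b": "north",
    r"\bs\b": "south",
}


def clean_station_names(df, col_name):
    """Return a copy of df with a normalized '<col_name>_clean' column."""
    out = df.copy()
    cleaned = out[col_name].str.lower()
    # Replace & first so the punctuation step does not silently drop it.
    cleaned = cleaned.str.replace("&", " and ", regex=False)
    cleaned = cleaned.str.translate(str.maketrans("", "", string.punctuation))
    for pattern, full_word in ABBREVIATIONS.items():
        cleaned = cleaned.str.replace(pattern, full_word, regex=True)
    # Collapse repeated whitespace and trim the ends last.
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()
    out[f"{col_name}_clean"] = cleaned
    return out


demo = pd.DataFrame({"start_station_name": [
    "  S Bouquet Ave & Sennott St ",
    "Schenley Dr & Schenley Dr Ext",
]})
result = clean_station_names(demo, col_name="start_station_name")
print(result["start_station_name_clean"].tolist())
```

Stripping punctuation before the abbreviation step avoids one edge case of the `\bs\b` pattern, which would otherwise match the `s` in a possessive such as `'s`.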

In [15]:
import sys
sys.path.append('/home/manuel/Documents/AI/pogoh-ai-engineering')
from shared_utils.pogoh_cleaning import clean_station_names, clean_column_names

df_test = pogoh_df.copy()
df_test_cleaned =  clean_column_names(df_test)

ImportError: cannot import name 'clean_column_names' from 'shared_utils.pogoh_cleaning' (/home/manuel/Documents/AI/pogoh-ai-engineering/shared_utils/pogoh_cleaning.py)
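
The error above indicates that the module does not define `clean_column_names` at this point. A minimal sketch of such a helper, assuming it should simply mirror the column standardization from cell In [3] and return a new DataFrame, could look like this (the toy column names are for illustration only):

```python
import pandas as pd


def clean_column_names(df):
    """Return a copy of df with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                       # remove leading/trailing spaces
        .str.lower()                                  # convert to lowercase
        .str.replace(" ", "_", regex=False)           # replace spaces with underscores
    )
    return out


raw = pd.DataFrame({" Closed Status": ["TERMINATED"], "End Station Id": [21]})
renamed_df = clean_column_names(raw)
print(list(renamed_df.columns))
```

Returning a copy leaves the input frame untouched, which makes the helper safe to call on data that other cells still use.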

In [16]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [47]:
import os
os.getcwd()

'/home/manuel/Documents/AI/pogoh-ai-engineering/week-0-setup'

In [None]:
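# Import only the helper that exists; the combined import in In [15] failed
from shared_utils.pogoh_cleaning import clean_station_names

df_test = pogoh_df.copy()
df_test_cleaned = clean_station_names(df_test, col_name='start_station_name')
df_test_cleaned['start_station_name_clean'].head(15)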