# Initial data cleaning (Part 2)

In this notebook we clean a small portion of the POGOH dataset. The goal is to get some ideas on how to handle the data at a larger scale.

__NOTE:__ This dataset has some missing (NaN) values in End Station Id and End Station Name. Because of this, pandas reads the End Station Id column as float instead of integer. We handle this issue in this notebook.

In [1]:
import pandas as pd
import string
import sys
sys.path.append('/home/manuel/Documents/AI/pogoh-ai-engineering')
import shared_utils.pogoh_cleaning as pc

# Define the file path
file_path = "/home/manuel/Documents/AI/pogoh-ai-engineering/data/raw/april-2025.xlsx"
# Load the Excel file into a DataFrame
pogoh_df = pd.read_excel(file_path)
pogoh_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47523 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Closed Status       47523 non-null  object        
 1   Duration            47523 non-null  int64         
 2   Start Station Id    47523 non-null  int64         
 3   Start Date          47523 non-null  datetime64[ns]
 4   Start Station Name  47523 non-null  object        
 5   End Date            47523 non-null  datetime64[ns]
 6   End Station Id      47497 non-null  float64       
 7   End Station Name    47497 non-null  object        
 8   Rider Type          47523 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 3.3+ MB
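
As a minimal sketch of one way to address the float ID column noted above, pandas' nullable integer dtype `Int64` can hold whole numbers together with missing values. The toy frame below is an assumption made for illustration; it is not the POGOH data, and this is not necessarily how `shared_utils.pogoh_cleaning` deals with the issue.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one missing ID forces the column to float64.
toy = pd.DataFrame({"End Station Id": [12.0, np.nan, 7.0]})
print(toy["End Station Id"].dtype)  # float64

# Cast to the nullable integer dtype; the missing entry becomes <NA>.
toy["End Station Id"] = toy["End Station Id"].astype("Int64")
print(toy["End Station Id"].dtype)  # Int64
```

The cast only succeeds when every non-missing value is a whole number, which should hold for station IDs.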


In [2]:
pogoh_df_cleaned = pc.clean_column_names(pogoh_df)
pogoh_df_cleaned.columns

Index(['closed_status', 'duration', 'start_station_id', 'start_date',
       'start_station_name', 'end_date', 'end_station_id', 'end_station_name',
       'rider_type'],
      dtype='object')
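
The implementation of `pc.clean_column_names` is not shown in this notebook. A rough sketch of a helper that would produce the same snake_case names might look like the following; the function name `to_snake_case_columns` and its exact rules are assumptions.

```python
import pandas as pd

def to_snake_case_columns(df):
    """Hypothetical helper: return a copy of df with snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                # drop surrounding whitespace
        .str.lower()                           # lowercase everything
        .str.replace(r"\s+", "_", regex=True)  # whitespace runs -> underscore
    )
    return out

# Toy frame with a few of the original column names and no rows.
toy = pd.DataFrame(columns=["Closed Status", "Start Station Id", "Rider Type"])
cleaned = to_snake_case_columns(toy)
print(list(cleaned.columns))
```

Returning a copy leaves the original DataFrame and its column names untouched.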

First we'll create a function that checks whether the information in *duration*, *start date* and *end date* is consistent. The duration of the trip is measured in seconds, so we compare it with the difference between the end and start times.
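
The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the name `find_duration_mismatches` is hypothetical, and the real `pc.check_duration_mismatch` may be implemented differently. The only details taken from this notebook are the `tolerance_sec` parameter and the `computed_duration` column name.

```python
import pandas as pd

def find_duration_mismatches(df, tolerance_sec=0):
    """Hypothetical helper: return the rows whose recorded duration differs
    from end_date - start_date by more than tolerance_sec seconds."""
    start = pd.to_datetime(df["start_date"])
    end = pd.to_datetime(df["end_date"])
    out = df.copy()
    # Elapsed time between start and end, in seconds.
    out["computed_duration"] = (end - start).dt.total_seconds()
    mismatch = (out["computed_duration"] - out["duration"]).abs() > tolerance_sec
    return out[mismatch]

# Toy data (made up): the second trip lasts 1200 s but records 900 s.
toy = pd.DataFrame({
    "start_date": ["2025-06-17 08:00:00", "2025-06-17 10:00:00"],
    "end_date": ["2025-06-17 08:30:00", "2025-06-17 10:20:00"],
    "duration": [1800, 900],
})
result = find_duration_mismatches(toy)
print(result)
```

With `tolerance_sec=0` any difference counts as a mismatch. A larger tolerance can absorb small discrepancies such as rounding.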

In [9]:
pogoh_df_cleaned['duration'].describe()

count     47523.000000
mean        790.384277
std        2694.794272
min           0.000000
25%         222.000000
50%         381.000000
75%         821.000000
max      200129.000000
Name: duration, dtype: float64

In [10]:
pogoh_df_cleaned[['duration','start_date','end_date']].head()

|   | duration | start_date | end_date |
|---|----------|------------|----------|
| 0 | 412 | 2025-04-30 23:58:19 | 2025-05-01 00:05:11 |
| 1 | 179 | 2025-04-30 23:58:06 | 2025-05-01 00:01:05 |
| 2 | 1060 | 2025-04-30 23:48:29 | 2025-05-01 00:06:09 |
| 3 | 1173 | 2025-04-30 23:46:30 | 2025-05-01 00:06:03 |
| 4 | 394 | 2025-04-30 23:45:03 | 2025-04-30 23:51:37 |


In [5]:
pc.check_duration_mismatch(pogoh_df_cleaned, tolerance_sec=0)

No duration mismatches found.


|   | closed_status | duration | start_station_id | start_date | start_station_name | end_date | end_station_id | end_station_name | rider_type | computed_duration |
|---|---|---|---|---|---|---|---|---|---|---|


In [6]:
# Sample data with one mismatch in row 2
data = {
    "start_date": [
        "2025-06-17 08:00:00",
        "2025-06-17 09:15:00",
        "2025-06-17 10:00:00",
        "2025-06-17 11:45:00"
    ],
    "end_date": [
        "2025-06-17 08:30:00",  # 1800 seconds
        "2025-06-17 09:45:00",  # 1800 seconds
        "2025-06-17 10:20:00",  # 1200 seconds, but duration says 900 (mismatch)
        "2025-06-17 12:30:00"   # 2700 seconds
    ],
    "duration": [
        1800,  # match
        1800,  # match
        900,   # mismatch!
        2700   # match
    ]
}

df_test = pd.DataFrame(data)
pc.check_duration_mismatch(df_test)


Duration mismatch found in 1 row(s).


|   | start_date | end_date | duration | computed_duration |
|---|------------|----------|----------|-------------------|
| 0 | 2025-06-17 10:00:00 | 2025-06-17 10:20:00 | 900 | 1200.0 |


In [15]:
pogoh_df[pogoh_df_cleaned['end_date'].isna()]

|   | Closed Status | Duration | Start Station Id | Start Date | Start Station Name | End Date | End Station Id | End Station Name | Rider Type |
|---|---|---|---|---|---|---|---|---|---|
