In this notebook we carry out a brief exploratory analysis of the POGOH trip data from April 2025 and report our findings.

In [1]:
import pandas as pd

# Define the file path
file_path = "/home/manuel/Documents/AI/pogoh-ai-engineering/data/raw/april-2025.xlsx"

# Load the Excel file into a DataFrame
pogoh_df = pd.read_excel(file_path)

We display some basic information about the dataset. The April dataset has 47,523 observations and includes the following columns:

* Closed Status
* Duration
* Start Station Id
* Start Date
* Start Station Name
* End Date
* End Station Id
* End Station Name
* Rider Type

In [3]:
# Display basic info and the first few rows
# info() prints its report directly and returns None, so it is not wrapped in print()
pogoh_df.info()
print(pogoh_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47523 entries, 0 to 47522
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Closed Status       47523 non-null  object        
 1   Duration            47523 non-null  int64         
 2   Start Station Id    47523 non-null  int64         
 3   Start Date          47523 non-null  datetime64[ns]
 4   Start Station Name  47523 non-null  object        
 5   End Date            47523 non-null  datetime64[ns]
 6   End Station Id      47497 non-null  float64       
 7   End Station Name    47497 non-null  object        
 8   Rider Type          47523 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 3.3+ MB
  Closed Status  Duration  Start Station Id          Start Date  \
0        NORMAL       412                33 2025-04-30 23:58:19   
1        NORMAL       179                34 2025-04-30 2

According to the output above, the variables have the following types:

* Closed Status: String
* Duration: Integer
* Start Station Id: Integer
* Start Date: Datetime
* Start Station Name: String
* End Date: Datetime
* End Station Id: Float (integer IDs stored as float64 because 26 values are missing)
* End Station Name: String (26 values missing)
* Rider Type: String
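
`End Station Id` is loaded as float64 only because some of its values are missing. The following is a minimal sketch, on a small made-up frame rather than the April data, of how pandas' nullable `Int64` dtype could keep the IDs as integers while preserving the missing entries. The name `sample_df` and its rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (not taken from the April file), mirroring two real columns.
sample_df = pd.DataFrame({
    "Start Station Id": [33, 34, 12],
    "End Station Id": [21.0, np.nan, 7.0],  # a NaN forces float64 on load
})

# The nullable integer dtype stores missing values as <NA> instead of NaN.
sample_df["End Station Id"] = sample_df["End Station Id"].astype("Int64")

print(sample_df.dtypes)
# Number of trips without a recorded end station
print(sample_df["End Station Id"].isna().sum())
```

Applied to `pogoh_df`, the same `astype("Int64")` call should work as long as every non-missing ID is a whole number.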

In [4]:
# Summary statistics for numeric columns
summary_stats = pogoh_df.describe()

# Count of unique values per column
unique_counts = pogoh_df.nunique()

# Count of missing values per column
missing_values = pogoh_df.isnull().sum()

# Print outputs
print("=== Summary Statistics ===")
print(summary_stats)

print("\n=== Unique Value Counts ===")
print(unique_counts)

print("\n=== Missing Values ===")
print(missing_values)

=== Summary Statistics ===
            Duration  Start Station Id                     Start Date  \
count   47523.000000      47523.000000                          47523   
mean      790.384277         27.519054  2025-04-17 09:21:29.784861952   
min         0.000000          1.000000            2025-04-01 00:02:47   
25%       222.000000         13.000000     2025-04-10 11:12:25.500000   
50%       381.000000         27.000000            2025-04-18 12:28:52   
75%       821.000000         38.000000     2025-04-24 15:46:15.500000   
max    200129.000000         60.000000            2025-04-30 23:58:19   
std      2694.794272         15.400083                            NaN   

                            End Date  End Station Id  
count                          47523    47497.000000  
mean   2025-04-17 09:34:40.169139200       27.335411  
min              2025-04-01 00:11:08        1.000000  
25%              2025-04-10 11:22:42       13.000000  
50%              2025-04-18 12:49:28    

Here are some observations from the initial look at the data:

- Trip duration appears to be measured in seconds: the gap between the mean End Date and the mean Start Date is about 790 seconds, which matches the mean Duration of 790.38.
- From the unique IDs, there were 60 POGOH stations (at least in this timeframe).
- Start Station Name and End Station Name give the name of each POGOH station as a string, usually the streets where it is located.
- Closed Status has 4 possible values: NORMAL, GRACE_PERIOD, TERMINATED and FORCED_CLOSED.
- Rider Type has only two possible values: MEMBER and CASUAL.
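
The reading of Duration as seconds can be checked row by row by comparing it with the elapsed time between Start Date and End Date. Below is a minimal sketch on made-up sample trips. The frame `trips`, its values, and the helper column `Duration Gap (s)` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up sample trips (illustrative only), using the dataset's column names.
trips = pd.DataFrame({
    "Start Date": pd.to_datetime(["2025-04-30 23:50:00", "2025-04-30 22:00:00"]),
    "End Date": pd.to_datetime(["2025-04-30 23:56:52", "2025-04-30 22:02:59"]),
    "Duration": [412, 179],
})

# Elapsed time between start and end, expressed in seconds.
elapsed_s = (trips["End Date"] - trips["Start Date"]).dt.total_seconds()

# If Duration is in seconds, this gap should be close to zero for every trip.
trips["Duration Gap (s)"] = trips["Duration"] - elapsed_s

print(trips[["Duration", "Duration Gap (s)"]])
```

Running the same comparison on `pogoh_df` would show whether any trips deviate from this relationship.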

In [7]:
# Value counts for key categorical variables
closed_status_counts = pogoh_df["Closed Status"].value_counts()
rider_type_counts = pogoh_df["Rider Type"].value_counts()


print("\n=== Closed Status Distribution ===")
print(closed_status_counts)

print("\n=== Rider Type Distribution ===")
print(rider_type_counts)


=== Closed Status Distribution ===
Closed Status
NORMAL           46703
GRACE_PERIOD       720
TERMINATED          62
FORCED_CLOSED       38
Name: count, dtype: int64

=== Rider Type Distribution ===
Rider Type
MEMBER    44399
CASUAL     3124
Name: count, dtype: int64
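
To put these counts in proportion, they can be converted to percentages. The sketch below re-enters the counts printed above by hand so that it runs on its own. The names `closed_status_share` and `rider_type_share` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the printed distributions above.
closed_status_counts = pd.Series(
    {"NORMAL": 46703, "GRACE_PERIOD": 720, "TERMINATED": 62, "FORCED_CLOSED": 38}
)
rider_type_counts = pd.Series({"MEMBER": 44399, "CASUAL": 3124})

# Share of each category as a percentage of all trips.
closed_status_share = (closed_status_counts / closed_status_counts.sum() * 100).round(2)
rider_type_share = (rider_type_counts / rider_type_counts.sum() * 100).round(2)

print(closed_status_share)
print(rider_type_share)
```

On the full DataFrame, `pogoh_df["Rider Type"].value_counts(normalize=True)` should give the same proportions directly.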


It would be worth learning how POGOH assigns the closed status to each trip and how it distinguishes between rider types.
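
Until that documentation is available, one way to look for clues in the data itself is to cross-tabulate the two variables and compare trip durations across closed statuses. This is a minimal sketch on made-up rows. The frame `sample_df` and its values are illustrative assumptions, and only the category labels come from the outputs above.

```python
import pandas as pd

# Made-up sample rows (illustrative only); labels match those in the dataset.
sample_df = pd.DataFrame({
    "Closed Status": ["NORMAL", "NORMAL", "GRACE_PERIOD", "TERMINATED", "NORMAL", "FORCED_CLOSED"],
    "Rider Type": ["MEMBER", "CASUAL", "MEMBER", "CASUAL", "MEMBER", "MEMBER"],
    "Duration": [412, 179, 45, 5400, 600, 9000],
})

# Trip counts for each combination of closed status and rider type.
status_by_rider = pd.crosstab(sample_df["Closed Status"], sample_df["Rider Type"])

# Median duration per closed status, to see how the categories differ.
median_duration = sample_df.groupby("Closed Status")["Duration"].median()

print(status_by_rider)
print(median_duration)
```

Replacing `sample_df` with `pogoh_df` would show, for example, whether particular statuses are concentrated among very short or very long trips.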