# Data Processing




### Libraries

In [5]:
import pandas as pd

### Day of Week and Is Weekend Data Preprocessing

The goal of this code is to preprocess the **ridership data** by adding two new columns:
1. **Day of Week**: A column that maps each date to the corresponding day of the week, where Monday is `1` and Sunday is `7`.
2. **Is Weekend**: A binary column indicating whether the day is a weekend (Saturday or Sunday).



In [10]:
# Load the dataset
df = pd.read_csv('./data/raw_data.csv')

# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])

# Extract 'Day of Week': pandas uses 0 = Monday, so add 1 to get 1 = Monday, 7 = Sunday
df['day_of_week'] = df['date'].dt.dayofweek + 1

# Add 'is_weekend' column: 1 for Saturday (6) and Sunday (7), 0 otherwise
df['is_weekend'] = (df['day_of_week'] >= 6).astype(int)

# Save the cleaned data to a new CSV file
df.to_csv('./data/cleaned_data.csv', index=False)

# Print the first few rows to verify
print(df.head())


        date   time          origin       destination  ridership  day_of_week  \
0 2025-01-01  00:00  Abdullah Hukum             Klang          1            3   
1 2025-01-01  00:00  Abdullah Hukum       Telok Pulai          1            3   
2 2025-01-01  00:00           Bangi        Batu Caves          1            3   
3 2025-01-01  00:00     Bank Negara      Sungai Gadut          1            3   
4 2025-01-01  00:00       Batu Tiga  Kampung Raja Uda          1            3   

   is_weekend  
0           0  
1           0  
2           0  
3           0  
4           0  
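
As a quick sanity check on the mapping described in this section, the same two steps can be run on a few hand-picked dates. This is only a minimal sketch: the `sample` frame and its four dates are made up for illustration and are not part of the ridership data.

```python
import pandas as pd

# Hypothetical sample: a Wednesday, a Saturday, a Sunday and a Monday
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-01-01", "2025-01-04", "2025-01-05", "2025-01-06"]
    )
})

# Monday = 1 ... Sunday = 7 (pandas' dayofweek starts at 0 for Monday)
sample["day_of_week"] = sample["date"].dt.dayofweek + 1

# Weekend flag: 1 for Saturday (6) and Sunday (7), 0 otherwise
sample["is_weekend"] = (sample["day_of_week"] >= 6).astype(int)

print(sample)
```

Only the Saturday and Sunday rows should end up with `is_weekend` set to 1.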


### Public Holidays (2025) Data Processing

The goal of this code is to flag public holidays in the ridership data by adding a new column:

**Is Holiday**: A binary column indicating whether the date is a public holiday (`1`) or not (`0`), based on a predefined list of national holidays and the Sultan of Selangor's Birthday.

In [16]:
# Load the cleaned data
df = pd.read_csv('./data/cleaned_data.csv')

# List of public holidays (National and Selangor)
holidays_data = [
    ("2025-01-01", "New Year's Day"),
    ("2025-01-29", "Chinese New Year"),
    ("2025-01-30", "Chinese New Year Holiday"),
    ("2025-02-11", "Thaipusam"),
    ("2025-03-31", "Hari Raya Aidilfitri"),
    ("2025-04-01", "Hari Raya Aidilfitri Holiday"),
    ("2025-05-01", "Labour Day"),
    ("2025-05-12", "Wesak Day"),
    ("2025-06-02", "Agong's Birthday"),
    ("2025-06-07", "Hari Raya Haji"),
    ("2025-06-27", "Awal Muharram"),
    ("2025-08-31", "Merdeka Day"),
    ("2025-09-01", "Merdeka Day Holiday"),
    ("2025-09-05", "Prophet Muhammad's Birthday"),
    ("2025-09-16", "Malaysia Day"),
    ("2025-10-20", "Deepavali"),
    ("2025-12-11", "Sultan of Selangor's Birthday"),
    ("2025-12-25", "Christmas Day")
]

# Convert holiday data to a DataFrame
holidays_df = pd.DataFrame(holidays_data, columns=["Date", "Holiday"])

# Ensure the date columns are in datetime format for comparison
df['date'] = pd.to_datetime(df['date'])
holidays_df['Date'] = pd.to_datetime(holidays_df['Date'])

# Add 'is_holiday' column based on whether the date matches any holiday
df['is_holiday'] = df['date'].isin(holidays_df['Date']).astype(int)

# Save the updated cleaned data back to the same file
df.to_csv('./data/cleaned_data.csv', index=False)

print(df.head())


        date   time          origin       destination  ridership  day_of_week  \
0 2025-01-01  00:00  Abdullah Hukum             Klang          1            3   
1 2025-01-01  00:00  Abdullah Hukum       Telok Pulai          1            3   
2 2025-01-01  00:00           Bangi        Batu Caves          1            3   
3 2025-01-01  00:00     Bank Negara      Sungai Gadut          1            3   
4 2025-01-01  00:00       Batu Tiga  Kampung Raja Uda          1            3   

   is_weekend  is_holiday  
0           0           1  
1           0           1  
2           0           1  
3           0           1  
4           0           1  
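
One detail worth keeping in mind: `isin` compares exact timestamps, so a date that carries a time-of-day component would not match a holiday stored at midnight. The sketch below illustrates this on made-up data and uses `.dt.normalize()` as a guard. The `sample` frame and the two-entry `holidays` list are hypothetical; in the output above the `date` column appears to hold dates only (time is a separate column), so the guard is likely optional for this dataset.

```python
import pandas as pd

# Hypothetical holiday subset and sample timestamps (not the real dataset)
holidays = pd.to_datetime(["2025-01-01", "2025-01-29"])
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["2025-01-01 08:30", "2025-01-02 00:00", "2025-01-29 00:00"]
    )
})

# Exact-timestamp match: 08:30 on New Year's Day is not flagged
exact = sample["date"].isin(holidays).astype(int)

# Resetting each timestamp to midnight first makes the match date-based
sample["is_holiday"] = (
    sample["date"].dt.normalize().isin(holidays).astype(int)
)

print(exact.tolist())
print(sample)
```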
