# Data Cleansing for Transit Stations Entrances Database Preparation

In this notebook, we clean the entrances dataset and the station-entrance relation dataset in preparation for loading them into the SQLite database.

In [1]:
import pandas as pd
import os



In [2]:
data_directory = '../data'
entrances_file = 'klang_valley_entrances.csv'
station_entrances_file = 'klang_valley_stations_entrances_relation.csv'

entrances_data = pd.read_csv(os.path.join(data_directory, entrances_file))
station_entrances_data = pd.read_csv(os.path.join(data_directory, station_entrances_file))

## Data Exploration
Before we start cleaning, let's explore the datasets to understand their structure and the cleaning tasks that might be necessary.

In [3]:
default_row  = pd.options.display.max_rows
default_col  = pd.options.display.max_columns


pd.set_option('display.max_rows', None)  # Set to None to display all rows
pd.set_option('display.max_columns', None)  # Set to None to display all columns

entrances_data

Unnamed: 0,Entrance ID,Longitude,Latitude,Entrance Destination,Entrance Name
0,1544031348,101.711374,3.145929,,B
1,1631412559,101.604933,3.113208,,
2,1632120095,101.69434,3.14232,,E
3,2278515570,101.644077,3.05065,,
4,2686635178,101.699182,3.138565,,C
5,2688004520,101.716143,3.128749,,
6,3308608988,101.712717,3.158762,,
7,3308608989,101.712507,3.158809,,
8,3948655246,101.721651,3.219798,,
9,4092013971,101.61413,3.022231,,


In [4]:
print("Station Entrances Dataset")
station_entrances_data

Station Entrances Dataset


Unnamed: 0,Relationship ID,Entrance ID,Station Code,Station Name
0,0,11435336038,AG8;SP8,Plaza Rakyat
1,1,10796851698,AG10;SP10,Pudu
2,2,5485710279,KJ11,Kampung Baru
3,3,5485710278,KJ11,Kampung Baru
4,4,11052165913,KJ12,Dang Wangi
5,5,9740843587,KJ13,Masjid Jamek (KJ)
6,6,9983121350,KJ13,Masjid Jamek (KJ)
7,7,11257782484,KJ13,Masjid Jamek (KJ)
8,8,11039725697,KJ13,Masjid Jamek (KJ)
9,9,10036593582,KJ13,Masjid Jamek (KJ)


In [5]:
pd.set_option('display.max_rows', default_row)  # Restore the default row limit
pd.set_option('display.max_columns', default_col)  # Restore the default column limit
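
As an alternative to saving and restoring the display options by hand, the same effect can be sketched with `pd.option_context`, which restores the previous values automatically when the block exits. The `sample` frame below is a made-up stand-in for the real datasets.

```python
import pandas as pd

# Hypothetical stand-in for the real datasets
sample = pd.DataFrame({'a': range(3)})

rows_before = pd.get_option('display.max_rows')

# Inside the block, all rows and columns are displayed
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    inside = pd.get_option('display.max_rows')
    print(sample)

# After the block, the previous settings are back in effect
rows_after = pd.get_option('display.max_rows')
```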

## Data Cleansing
The goal of the data cleansing is to ensure that each dataset has a consistent structure and no missing data. The exact steps will depend on the initial state of the datasets.
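
One way to see where data is missing is to count null values per column. This is a minimal sketch on a made-up frame whose column names mirror the entrances dataset; the values are illustrative only and are not taken from the real file.

```python
import pandas as pd

# Toy frame; column names follow the entrances dataset, values are illustrative
sample = pd.DataFrame({
    'Entrance ID': [1, 2, 3],
    'Entrance Destination': [None, None, None],
    'Entrance Name': ['B', None, 'E'],
})

# Count missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running the same call on `entrances_data` would show how sparse the `Entrance Destination` and `Entrance Name` columns are before deciding how to handle them.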

### Station Entrances Dataset

#### AG SP Station split
Some stations are shared by the AG and SP lines, so OSM tags them with a semicolon-separated list of station codes (for example, `AG8;SP8`). We split these entries so that each station code gets its own row.

In [6]:
# Split the semicolon-separated Station Code column into one row per code
station_entrances_data = (station_entrances_data.set_index(['Relationship ID', 'Entrance ID', 'Station Name'])
        .apply(lambda x: x.str.split(';').explode())
        .reset_index())

# reset the Relationship ID
station_entrances_data['Relationship ID'] = range(len(station_entrances_data))

print(station_entrances_data)

     Relationship ID  Entrance ID             Station Name Station Code
0                  0  11435336038             Plaza Rakyat          AG8
1                  1  11435336038             Plaza Rakyat          SP8
2                  2  10796851698                     Pudu         AG10
3                  3  10796851698                     Pudu         SP10
4                  4   5485710279             Kampung Baru         KJ11
..               ...          ...                      ...          ...
305              305  10839997852          Cyberjaya Utara         PY39
306              306  10658294223    Cyberjaya City Centre         PY30
307              307  10722980582  Putrajaya Sentral (MRT)         PY41
308              308   5044809585  Tun Razak Exchange (PY)         PY23
309              309   5044809586  Tun Razak Exchange (PY)         PY23

[310 rows x 4 columns]
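
The same split can also be sketched by exploding only the `Station Code` column, which avoids moving the other columns into the index. This is an alternative illustration, not the approach used in the cell above; the `toy` frame reuses two rows shown earlier, and `ignore_index` assumes pandas 1.1 or later.

```python
import pandas as pd

# Two rows in the same shape as the station entrances dataset
toy = pd.DataFrame({
    'Relationship ID': [0, 1],
    'Entrance ID': [11435336038, 5485710279],
    'Station Code': ['AG8;SP8', 'KJ11'],
    'Station Name': ['Plaza Rakyat', 'Kampung Baru'],
})

# Turn 'AG8;SP8' into ['AG8', 'SP8'], then emit one row per list element
split = (
    toy.assign(**{'Station Code': toy['Station Code'].str.split(';')})
       .explode('Station Code', ignore_index=True)
)

# Renumber so every row has a unique Relationship ID again
split['Relationship ID'] = range(len(split))
print(split)
```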


### Standardizing column names
We standardize the column names by converting them to lowercase and replacing spaces with underscores, so that naming is consistent across our database.

In [7]:
# Rename the station code column explicitly.
# The column is named 'Station Code' (there is no 'Station ID' column),
# and only the station-entrance relation dataset contains it.
station_entrances_data = station_entrances_data.rename(columns={
    'Station Code': 'station_code',
})


# Normalize the column names of both dataframes: lowercase, with spaces replaced by underscores
entrances_data.columns = entrances_data.columns.str.lower().str.replace(' ', '_')
station_entrances_data.columns = station_entrances_data.columns.str.lower().str.replace(' ', '_')
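
If the same normalization is needed for several dataframes, it could be wrapped in a small helper. The function name `normalize_columns` is hypothetical, and the extra `strip()` is an assumption that stray whitespace around column names should also be removed.

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
                   .str.lower()
                   .str.replace(' ', '_', regex=False)
    )
    return out

# Empty frame with column names taken from the station entrances dataset
toy = pd.DataFrame(columns=['Entrance ID', 'Station Code', 'Station Name'])
normalized = normalize_columns(toy)
print(list(normalized.columns))
```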



## Save Cleaned Datasets
Finally, we save the cleaned datasets to new CSV files for use in creating our database.

In [9]:
# Define the directory and file names for the cleaned data
cleansed_data_directory = '../data_cleansed'
cleansed_entrances_file = 'klang_valley_entrances_cleansed.csv'
cleansed_station_entrances_file = 'klang_valley_stations_entrances_relation_cleansed.csv'

# Create the output directory if it does not exist yet
os.makedirs(cleansed_data_directory, exist_ok=True)

# Save the cleaned dataframes without the pandas index column
entrances_data.to_csv(os.path.join(cleansed_data_directory, cleansed_entrances_file), index=False)
station_entrances_data.to_csv(os.path.join(cleansed_data_directory, cleansed_station_entrances_file), index=False)
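
As a rough illustration of the later database step, a cleaned dataframe could be loaded into SQLite with `DataFrame.to_sql`. This is only a sketch: the table name `station_entrances` and the in-memory database are assumptions, not the schema or file used by this project.

```python
import sqlite3
import pandas as pd

# Two rows in the shape of the cleaned station entrances dataset
cleaned = pd.DataFrame({
    'relationship_id': [0, 1],
    'entrance_id': [11435336038, 11435336038],
    'station_name': ['Plaza Rakyat', 'Plaza Rakyat'],
    'station_code': ['AG8', 'SP8'],
})

# In-memory database for illustration; a real run would pass a file path
conn = sqlite3.connect(':memory:')

# Hypothetical table name; replace the table if it already exists
cleaned.to_sql('station_entrances', conn, index=False, if_exists='replace')

# Read the data back to confirm the load
row_count = conn.execute('SELECT COUNT(*) FROM station_entrances').fetchone()[0]
codes = [
    row[0]
    for row in conn.execute(
        'SELECT station_code FROM station_entrances ORDER BY relationship_id'
    )
]
conn.close()
print(row_count, codes)
```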

