# Data Cleansing for Transit Stations Database Preparation

This notebook cleans and standardizes the station datasets for three transit systems: Klang Valley, Montreal, and Singapore. The cleaned data will be used to build our database.

In [1]:
import pandas as pd
import os

In [2]:
data_directory = 'data'
kl_file = 'klang_valley_stations.csv'
montreal_file = 'montreal_metro.csv'
singapore_file = 'mrtsg.csv'

kl_data = pd.read_csv(os.path.join(data_directory, kl_file))
montreal_data = pd.read_csv(os.path.join(data_directory, montreal_file))
singapore_data = pd.read_csv(os.path.join(data_directory, singapore_file))



## Data Exploration
Before we start cleaning, let's explore the datasets to understand their structure and the cleaning tasks that might be necessary.

In [3]:
print("Klang Valley Dataset")
print(kl_data.head())
print("\nMontreal Dataset")
print(montreal_data.head())
print("\nSingapore Dataset")
print(singapore_data.head())


Klang Valley Dataset
           Name Stop ID  Service Provider Name  Latitude   Longitude ROUTE ID  \
0    KL SENTRAL    KA01  Keretapi Tanah Melayu  3.134603  101.686567       KA   
1  KUALA LUMPUR    KA02  Keretapi Tanah Melayu  3.139513  101.693789       KA   
2   BANK NEGARA    KA03  Keretapi Tanah Melayu  3.154542  101.693010       KA   
3         PUTRA    KA04  Keretapi Tanah Melayu  3.165005  101.691234       KA   
4    MID VALLEY    KB01  Keretapi Tanah Melayu  3.118528  101.678985       KB   

      Route Name Line Number Line Colour Colour Hex Code          City  
0  Seremban Line           1        Blue         #0000FF  Klang Valley  
1  Seremban Line           1        Blue         #0000FF  Klang Valley  
2  Seremban Line           1        Blue         #0000FF  Klang Valley  
3  Seremban Line           1        Blue         #0000FF  Klang Valley  
4  Seremban Line           1        Blue         #0000FF  Klang Valley  

Montreal Dataset
   Index         Name               
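Beyond `head()`, a per-column summary can make the cleaning tasks easier to spot. The following is a minimal sketch, not part of the original notebook: `summarize` is a hypothetical helper, and the sample frame is made-up data shaped like the Klang Valley file.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Hypothetical sample; in the notebook you would pass kl_data, montreal_data, etc.
sample = pd.DataFrame({
    "Name": ["KL SENTRAL", "KUALA LUMPUR", None],
    "Stop ID": ["KA01", "KA02", "KA02"],
    "Latitude": [3.134603, 3.139513, 3.139513],
})

summary = summarize(sample)
print(summary)
print("fully duplicated rows:", sample.duplicated().sum())
```

Calling `summarize(kl_data)` on each loaded frame would show missing values and repeated station codes before any cleaning is applied.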

## Data Cleansing
The goal of data cleansing is to give each dataset a consistent structure: lowercase snake_case column names, shared `station_code` and `region` columns, and consistently capitalized values. The exact steps depend on each dataset's initial state.
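Since the same column-name normalization is repeated for every dataset, it could be factored into one function. This is a sketch under that assumption; `standardize_columns` is a hypothetical helper, not something the notebook defines. It also strips surrounding whitespace first, so a raw header such as `' Stop ID'` does not end up with a leading underscore.

```python
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lowercase, snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                    # drop leading/trailing spaces
        .str.lower()                               # lowercase everything
        .str.replace(r"\s+", "_", regex=True)      # inner whitespace -> underscore
    )
    return out

# Headers taken from the raw files shown in this notebook
raw = pd.DataFrame(columns=["Name", " Stop ID", "ROUTE ID", "Colour Hex Code"])
clean = standardize_columns(raw)
print(list(clean.columns))
```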

### Klang Valley Dataset

In [4]:
print(kl_data.columns)
# Normalize the column names to lowercase snake_case for consistency
kl_data.columns = kl_data.columns.str.lower().str.replace(' ', '_')

# Title-case station names so only the first letter of each word is capitalized
kl_data['name'] = kl_data['name'].str.title()


# Rename columns
kl_data = kl_data.rename(columns={
    'city': 'region',
    'stop_id':'station_code'
})

# Restore acronyms and mixed-case names that title-casing lowercased
kl_data['name'] = kl_data['name'].replace({
    "Kl Sentral": "KL Sentral",
    "Ukm": "UKM",
    "Pwtc": "PWTC",
    "Taman Perindustrian Puchong (Tpp)": "Taman Perindustrian Puchong (TPP)",
    "Klcc": "KLCC",
    "Ss15": "SS15",
    "Ss18": "SS18",
    "Usj7": "USJ7",
    "Usj21": "USJ21",
    "Klia2": "KLIA2",
    "Ttdi": "TTDI",
    "Sunu Monash": "SunU Monash",
    "South Quay-Usj1": "South Quay-USJ1",
    "Persiaran Klcc": "Persiaran KLCC",
    "Upm": "UPM",
    "Bu11": "BU11",
    "Ss7": "SS7",
    "Uitm": "UiTM"
})


print(kl_data.columns)
kl_data.head()

Index(['Name', 'Stop ID', 'Service Provider Name', 'Latitude', 'Longitude',
       'ROUTE ID', 'Route Name', 'Line Number', 'Line Colour',
       'Colour Hex Code', 'City'],
      dtype='object')
Index(['name', 'station_code', 'service_provider_name', 'latitude',
       'longitude', 'route_id', 'route_name', 'line_number', 'line_colour',
       'colour_hex_code', 'region'],
      dtype='object')


Unnamed: 0,name,station_code,service_provider_name,latitude,longitude,route_id,route_name,line_number,line_colour,colour_hex_code,region
0,KL Sentral,KA01,Keretapi Tanah Melayu,3.134603,101.686567,KA,Seremban Line,1,Blue,#0000FF,Klang Valley
1,Kuala Lumpur,KA02,Keretapi Tanah Melayu,3.139513,101.693789,KA,Seremban Line,1,Blue,#0000FF,Klang Valley
2,Bank Negara,KA03,Keretapi Tanah Melayu,3.154542,101.69301,KA,Seremban Line,1,Blue,#0000FF,Klang Valley
3,Putra,KA04,Keretapi Tanah Melayu,3.165005,101.691234,KA,Seremban Line,1,Blue,#0000FF,Klang Valley
4,Mid Valley,KB01,Keretapi Tanah Melayu,3.118528,101.678985,KB,Seremban Line,1,Blue,#0000FF,Klang Valley
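`Series.replace` silently ignores dictionary keys that never occur in the data, so a typo in the override list goes unnoticed. The sketch below, using a made-up handful of names and a shortened, hypothetical `ACRONYM_FIXES` mapping, shows one way to list overrides that matched nothing.

```python
import pandas as pd

# Shortened, illustrative version of the override mapping used above
ACRONYM_FIXES = {
    "Kl Sentral": "KL Sentral",
    "Klcc": "KLCC",
    "Uitm": "UiTM",
    "Ukm": "UKM",  # not present in the sample below, so it should be reported
}

# Hypothetical raw names in the upper-case style of the source file
names = pd.Series(["KL SENTRAL", "KLCC", "MID VALLEY", "UITM"])

titled = names.str.title()            # "KL SENTRAL" -> "Kl Sentral"
fixed = titled.replace(ACRONYM_FIXES)  # whole-value replacement, no regex

# Overrides whose key never appeared after title-casing
unused = sorted(set(ACRONYM_FIXES) - set(titled))

print(fixed.tolist())
print("unused overrides:", unused)
```

An empty `unused` list on the full dataset would suggest every override key matches at least one station name.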


### Montreal Dataset

In [5]:
print(montreal_data.columns)
# Normalize the column names to lowercase snake_case for consistency;
# strip first because the raw header ' Stop ID' has a leading space
montreal_data.columns = montreal_data.columns.str.strip().str.lower().str.replace(' ', '_')

# Drop the redundant index column
montreal_data = montreal_data.drop(columns=['index'])


# Rename columns
montreal_data = montreal_data.rename(columns={
    'stop_id': 'station_code',
    'city': 'region'
})



print(montreal_data.columns)
montreal_data.head()



Index(['Index', 'Name', 'Odonym', 'Namesake', 'Opened', ' Stop ID', 'Latitude',
       'Longitude', 'Route Name', 'Line Colour', 'Colour Hex Code', 'City'],
      dtype='object')
Index(['name', 'odonym', 'namesake', 'opened', 'station_code', 'latitude',
       'longitude', 'route_name', 'line_colour', 'colour_hex_code', 'region'],
      dtype='object')


Unnamed: 0,name,odonym,namesake,opened,station_code,latitude,longitude,route_name,line_colour,colour_hex_code,region
0,Angrignon,Boulevard Angrignon; Parc Angrignon,"Jean-Baptiste Angrignon, city councillor",03-09-1978,G1,45.446238,-73.60362,Green Line,Green,#5F8C55,Montreal
1,Monk,Boulevard Monk,"James Monk, Quebec Attorney-General",03-09-1978,G2,45.451051,-73.5932,Green Line,Green,#5F8C55,Montreal
2,Jolicoeur,Rue Jolicoeur,"Jean-Moïse Jolicoeur, parish priest",03-09-1978,G3,45.456914,-73.58206,Green Line,Green,#5F8C55,Montreal
3,Verdun,Rue de Verdun; borough of Verdun,"Notre-Dame-de-Saverdun, France, hometown of Se...",03-09-1978,G4,45.459359,-73.57183,Green Line,Green,#5F8C55,Montreal
4,De L'Église,Avenue de l'Église,Église Saint-Paul,03-09-1978,G5,45.462874,-73.56696,Green Line,Green,#5F8C55,Montreal
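The `opened` column is still stored as text. If it is needed as a date in the database, it could be parsed explicitly. This sketch assumes the values are in DD-MM-YYYY order, which the notebook does not state; the sample values other than `03-09-1978` are hypothetical.

```python
import pandas as pd

# Sample values; "not available" stands in for any unparseable entry
opened = pd.Series(["03-09-1978", "14-10-1966", "not available"])

# Assumption: day-month-year. Unparseable entries become NaT instead of raising.
parsed = pd.to_datetime(opened, format="%d-%m-%Y", errors="coerce")

print(parsed)
print("unparseable entries:", parsed.isna().sum())
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous values such as `03-09-1978`.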


### Singapore Dataset

In [6]:
print(singapore_data.columns)
# Normalize the column names to lowercase snake_case for consistency
singapore_data.columns = singapore_data.columns.str.lower().str.replace(' ', '_')

# Drop irrelevant columns
singapore_data = singapore_data.drop(columns=['objectid', 'x', 'y'])

# Title-case the line colour values so only the first letter of each word is capitalized
singapore_data['line_colour'] = singapore_data['line_colour'].str.title()

# Rename columns
singapore_data = singapore_data.rename(columns={
    'stn_no': 'station_code',
    'route_code':'route_id',
    'city':'region'
})



print(singapore_data.columns)
singapore_data.head()

Index(['OBJECTID', 'Name', 'STN_NO', 'X', 'Y', 'Latitude', 'Longitude',
       'Line Colour', 'Colour Hex Code', 'Route Name', 'Route Code', 'City'],
      dtype='object')
Index(['name', 'station_code', 'latitude', 'longitude', 'line_colour',
       'colour_hex_code', 'route_name', 'route_id', 'region'],
      dtype='object')


Unnamed: 0,name,station_code,latitude,longitude,line_colour,colour_hex_code,route_name,route_id,region
0,Eunos MRT Station,EW7,1.319779,103.903234,Green,#009530,East West Line,EWL,Singapore
1,Chinese Garden MRT Station,EW25,1.342353,103.732624,Green,#009530,East West Line,EWL,Singapore
2,Khatib MRT Station,NS14,1.417383,103.83298,Red,#dc241f,North South Line,NSL,Singapore
3,Kranji MRT Station,NS7,1.425178,103.762187,Red,#dc241f,North South Line,NSL,Singapore
4,Redhill MRT Station,EW18,1.289563,103.816821,Green,#009530,East West Line,EWL,Singapore
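The hex codes differ in letter case across datasets (`#dc241f` here, `#0000FF` for Klang Valley). If a single convention is wanted, a sketch like the following could normalize and validate them; choosing upper case is an assumption, and the last sample value is a deliberately malformed, made-up entry.

```python
import pandas as pd

# Values seen in the previews above, plus one malformed example (missing '#')
hex_codes = pd.Series(["#009530", "#dc241f", "#0000FF", "dc241f"])

# Normalize to upper case, then check for '#' followed by six hex digits
normalized = hex_codes.str.strip().str.upper()
is_valid = normalized.str.fullmatch(r"#[0-9A-F]{6}")

print(normalized.tolist())
print("invalid entries:", normalized[~is_valid].tolist())
```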


## Save Cleaned Datasets
Finally, we will save the cleaned datasets to new CSV files for use in creating our database.

In [7]:
# Define the directory where you want to save the cleaned data
cleansed_data_directory = 'data_cleansed'
cleansed_kl_file = 'klang_valley_stations_cleansed.csv'
cleansed_montreal_file = 'montreal_stations_cleansed.csv'
cleansed_singapore_file = 'singapore_stations_cleansed.csv'

# Create the output directory if it does not exist, then save the cleaned dataframes
os.makedirs(cleansed_data_directory, exist_ok=True)
kl_data.to_csv(os.path.join(cleansed_data_directory, cleansed_kl_file), index=False)
montreal_data.to_csv(os.path.join(cleansed_data_directory, cleansed_montreal_file), index=False)
singapore_data.to_csv(os.path.join(cleansed_data_directory, cleansed_singapore_file), index=False)
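
Before loading the files into a database, it may help to check which columns the three cleaned frames have in common. The sketch below rebuilds one row per city from the previews above (the variable names `kl`, `mtl`, and `sg` are stand-ins for the cleaned dataframes) and stacks only the shared columns. Whether the database should use a single combined table is an assumption, not something this notebook decides.

```python
import pandas as pd

# One row per city, copied from the cleaned previews in this notebook
kl = pd.DataFrame([{
    "name": "KL Sentral", "station_code": "KA01",
    "service_provider_name": "Keretapi Tanah Melayu",
    "latitude": 3.134603, "longitude": 101.686567,
    "route_id": "KA", "route_name": "Seremban Line", "line_number": 1,
    "line_colour": "Blue", "colour_hex_code": "#0000FF", "region": "Klang Valley",
}])
mtl = pd.DataFrame([{
    "name": "Angrignon", "odonym": "Boulevard Angrignon; Parc Angrignon",
    "namesake": "Jean-Baptiste Angrignon, city councillor", "opened": "03-09-1978",
    "station_code": "G1", "latitude": 45.446238, "longitude": -73.60362,
    "route_name": "Green Line", "line_colour": "Green",
    "colour_hex_code": "#5F8C55", "region": "Montreal",
}])
sg = pd.DataFrame([{
    "name": "Eunos MRT Station", "station_code": "EW7",
    "latitude": 1.319779, "longitude": 103.903234,
    "line_colour": "Green", "colour_hex_code": "#009530",
    "route_name": "East West Line", "route_id": "EWL", "region": "Singapore",
}])

frames = [kl, mtl, sg]

# Columns present in every frame, kept in the Klang Valley column order
shared = set.intersection(*(set(df.columns) for df in frames))
shared_cols = [c for c in kl.columns if c in shared]

# Stack the frames on the shared columns only
combined = pd.concat([df[shared_cols] for df in frames], ignore_index=True)

print(shared_cols)
print(combined[["name", "station_code", "region"]])
```

Columns outside the shared set, such as `odonym` or `line_number`, would need either nullable columns or separate tables in the database.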
