# Data Cleansing for Transit Stations Database Preparation

In this notebook, we clean and standardize the station datasets for three transit systems: Klang Valley, Montreal, and Singapore. The cleaned data will be used to build our database.

In [2]:
import pandas as pd
import os



In [3]:
data_directory = 'data'
kl_file = 'klang_valley_stations.csv'
montreal_file = 'montreal_metro.csv'
singapore_file = 'mrtsg.csv'

kl_data = pd.read_csv(os.path.join(data_directory, kl_file))
montreal_data = pd.read_csv(os.path.join(data_directory, montreal_file))
singapore_data = pd.read_csv(os.path.join(data_directory, singapore_file))



## Data Exploration
Before we start cleaning, let's explore the datasets to understand their structure and the cleaning tasks that might be necessary.

In [4]:
print("Klang Valley Dataset")
print(kl_data.head())
print("\nMontreal Dataset")
print(montreal_data.head())
print("\nSingapore Dataset")
print(singapore_data.head())


Klang Valley Dataset
           Name Stop ID  Service Provider Name  Latitude   Longitude ROUTE ID  \
0    KL SENTRAL    KA01  Keretapi Tanah Melayu  3.134603  101.686567       KA   
1  KUALA LUMPUR    KA02  Keretapi Tanah Melayu  3.139513  101.693789       KA   
2   BANK NEGARA    KA03  Keretapi Tanah Melayu  3.154542  101.693010       KA   
3         PUTRA    KA04  Keretapi Tanah Melayu  3.165005  101.691234       KA   
4    MID VALLEY    KB01  Keretapi Tanah Melayu  3.118528  101.678985       KB   

      Route Name Line Number Line Colour Colour Hex Code          City  
0  Seremban Line           1        Blue         #0000FF  Klang Valley  
1  Seremban Line           1        Blue         #0000FF  Klang Valley  
2  Seremban Line           1        Blue         #0000FF  Klang Valley  
3  Seremban Line           1        Blue         #0000FF  Klang Valley  
4  Seremban Line           1        Blue         #0000FF  Klang Valley  

Montreal Dataset
   Index         Name               
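Beyond `head()`, a few structural checks help decide which cleaning tasks are needed. The following is a minimal sketch, not part of the original notebook: the `summarize` helper and the `sample` dataframe are hypothetical stand-ins for the loaded datasets.

```python
import pandas as pd

def summarize(df: pd.DataFrame, label: str) -> dict:
    """Return a few structural facts about a dataframe (hypothetical helper)."""
    return {
        "label": label,
        "rows": len(df),
        "columns": list(df.columns),
        # Total number of missing cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical stand-in for one of the loaded datasets.
sample = pd.DataFrame({
    "Name": ["KL SENTRAL", "KUALA LUMPUR", "KUALA LUMPUR"],
    "Stop ID": ["KA01", "KA02", "KA02"],
    "Latitude": [3.134603, 3.139513, 3.139513],
})

summary = summarize(sample, "sample")
print(summary)
```

In the notebook, the same helper could be called on `kl_data`, `montreal_data`, and `singapore_data` in turn.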

## Data Cleansing
The goal of data cleansing is to give each dataset a consistent structure and standardized column names. The exact steps depend on the initial state of each dataset.
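Since the same column normalization is applied to every dataset, it can be wrapped in one function. This is a sketch under assumptions: `standardize_columns` is a hypothetical helper, and unlike the cells in this notebook it also strips surrounding whitespace from the names before lowercasing.

```python
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lowercase, underscore-separated column names."""
    cleaned = df.copy()
    cleaned.columns = (
        cleaned.columns
        .str.strip()                          # drop leading/trailing whitespace
        .str.lower()                          # 'ROUTE ID' -> 'route id'
        .str.replace(" ", "_", regex=False)   # 'route id' -> 'route_id'
    )
    return cleaned

# Hypothetical empty frame that only carries column names.
sample = pd.DataFrame(columns=["Name", " Stop ID", "ROUTE ID", "Colour Hex Code"])
result = standardize_columns(sample)
print(list(result.columns))
```

Returning a copy leaves the raw dataframe untouched, which makes it easy to compare column names before and after.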

### Klang Valley Dataset

In [5]:
print(kl_data.columns)
# Normalize the column names: lowercase, with spaces replaced by underscores
kl_data.columns = kl_data.columns.str.lower().str.replace(' ', '_')
print(kl_data.columns)
kl_data.head()

Index(['Name', 'Stop ID', 'Service Provider Name', 'Latitude', 'Longitude',
       'ROUTE ID', 'Route Name', 'Line Number', 'Line Colour',
       'Colour Hex Code', 'City'],
      dtype='object')
Index(['name', 'stop_id', 'service_provider_name', 'latitude', 'longitude',
       'route_id', 'route_name', 'line_number', 'line_colour',
       'colour_hex_code', 'city'],
      dtype='object')


| | name | stop_id | service_provider_name | latitude | longitude | route_id | route_name | line_number | line_colour | colour_hex_code | city |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KL SENTRAL | KA01 | Keretapi Tanah Melayu | 3.134603 | 101.686567 | KA | Seremban Line | 1 | Blue | #0000FF | Klang Valley |
| 1 | KUALA LUMPUR | KA02 | Keretapi Tanah Melayu | 3.139513 | 101.693789 | KA | Seremban Line | 1 | Blue | #0000FF | Klang Valley |
| 2 | BANK NEGARA | KA03 | Keretapi Tanah Melayu | 3.154542 | 101.69301 | KA | Seremban Line | 1 | Blue | #0000FF | Klang Valley |
| 3 | PUTRA | KA04 | Keretapi Tanah Melayu | 3.165005 | 101.691234 | KA | Seremban Line | 1 | Blue | #0000FF | Klang Valley |
| 4 | MID VALLEY | KB01 | Keretapi Tanah Melayu | 3.118528 | 101.678985 | KB | Seremban Line | 1 | Blue | #0000FF | Klang Valley |


### Montreal Dataset

In [6]:
print(montreal_data.columns)
# Normalize the column names: lowercase, with spaces replaced by underscores
# Note: ' Stop ID' has a leading space in the source file, so it becomes '_stop_id'
montreal_data.columns = montreal_data.columns.str.lower().str.replace(' ', '_')
print(montreal_data.columns)
montreal_data.head()

Index(['Index', 'Name', 'Odonym', 'Namesake', 'Opened', ' Stop ID', 'Latitude',
       'Longitude', 'Route Name', 'Line Colour', 'Colour Hex Code', 'City'],
      dtype='object')
Index(['index', 'name', 'odonym', 'namesake', 'opened', '_stop_id', 'latitude',
       'longitude', 'route_name', 'line_colour', 'colour_hex_code', 'city'],
      dtype='object')


| | index | name | odonym | namesake | opened | _stop_id | latitude | longitude | route_name | line_colour | colour_hex_code | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Angrignon | Boulevard Angrignon; Parc Angrignon | Jean-Baptiste Angrignon, city councillor | 03-09-1978 | G1 | 45.446238 | -73.60362 | Green Line | Green | #5F8C55 | Montreal |
| 1 | 2 | Monk | Boulevard Monk | James Monk, Quebec Attorney-General | 03-09-1978 | G2 | 45.451051 | -73.5932 | Green Line | Green | #5F8C55 | Montreal |
| 2 | 3 | Jolicoeur | Rue Jolicoeur | Jean-Moïse Jolicoeur, parish priest | 03-09-1978 | G3 | 45.456914 | -73.58206 | Green Line | Green | #5F8C55 | Montreal |
| 3 | 4 | Verdun | Rue de Verdun; borough of Verdun | Notre-Dame-de-Saverdun, France, hometown of Se... | 03-09-1978 | G4 | 45.459359 | -73.57183 | Green Line | Green | #5F8C55 | Montreal |
| 4 | 5 | De L'Église | Avenue de l'Église | Église Saint-Paul | 03-09-1978 | G5 | 45.462874 | -73.56696 | Green Line | Green | #5F8C55 | Montreal |
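The Montreal output shows two follow-up issues: the source header `' Stop ID'` carries a leading space, so it ends up as `_stop_id`, and `opened` is stored as text. One possible fix is sketched below. It assumes the dates are day-first (`DD-MM-YYYY`), which the notebook does not state, and it uses a small hypothetical sample in place of `montreal_data`.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the Montreal columns after normalization.
sample = pd.DataFrame({
    "index": [1, 2],
    "name": ["Angrignon", "Monk"],
    "opened": ["03-09-1978", "03-09-1978"],
    "_stop_id": ["G1", "G2"],
})

# Rename the column produced by the leading space in ' Stop ID'.
fixed = sample.rename(columns={"_stop_id": "stop_id"})

# Assumption: dates are day-first (DD-MM-YYYY); verify against the data source.
fixed["opened"] = pd.to_datetime(fixed["opened"], format="%d-%m-%Y")

print(fixed.dtypes)
```

An explicit `format` makes the parse fail loudly on values that do not match, rather than silently guessing the day/month order.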


### Singapore Dataset

In [7]:
print(singapore_data.columns)
# Normalize the column names: lowercase, with spaces replaced by underscores
singapore_data.columns = singapore_data.columns.str.lower().str.replace(' ', '_')
print(singapore_data.columns)
singapore_data.head()

Index(['OBJECTID', 'Name', 'STN_NO', 'X', 'Y', 'Latitude', 'Longitude',
       'Line Colour', 'Colour Hex Code', 'Route Name', 'Route Code', 'City'],
      dtype='object')
Index(['objectid', 'name', 'stn_no', 'x', 'y', 'latitude', 'longitude',
       'line_colour', 'colour_hex_code', 'route_name', 'route_code', 'city'],
      dtype='object')


| | objectid | name | stn_no | x | y | latitude | longitude | line_colour | colour_hex_code | route_name | route_code | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Eunos MRT Station | EW7 | 35782.9553 | 33560.0776 | 1.319779 | 103.903234 | GREEN | #009530 | East West Line | EWL | Singapore |
| 1 | 2 | Chinese Garden MRT Station | EW25 | 16790.7466 | 36056.3019 | 1.342353 | 103.732624 | GREEN | #009530 | East West Line | EWL | Singapore |
| 2 | 3 | Khatib MRT Station | NS14 | 27962.3108 | 44352.568 | 1.417383 | 103.83298 | RED | #dc241f | North South Line | NSL | Singapore |
| 3 | 4 | Kranji MRT Station | NS7 | 20081.6974 | 45214.5479 | 1.425178 | 103.762187 | RED | #dc241f | North South Line | NSL | Singapore |
| 4 | 5 | Redhill MRT Station | EW18 | 26163.478 | 30218.8196 | 1.289563 | 103.816821 | GREEN | #009530 | East West Line | EWL | Singapore |
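After normalization the three datasets still differ: Singapore names its station identifier `stn_no`, and its colour values use different casing (`GREEN`, `#dc241f`) from the other two. One way to bring a dataset onto a shared schema is sketched below. The `SHARED_COLUMNS` list, the `to_shared_schema` helper, and the choice to treat `stn_no` as equivalent to `stop_id` are assumptions for illustration, not steps taken in the notebook.

```python
import pandas as pd

# Columns that appear, under some name, in all three datasets (assumed target schema).
SHARED_COLUMNS = [
    "name", "stop_id", "latitude", "longitude",
    "route_name", "line_colour", "colour_hex_code", "city",
]

def to_shared_schema(df: pd.DataFrame, renames: dict) -> pd.DataFrame:
    """Rename dataset-specific columns, keep the shared ones, and unify colour casing."""
    aligned = df.rename(columns=renames)[SHARED_COLUMNS].copy()
    aligned["line_colour"] = aligned["line_colour"].str.title()          # 'GREEN' -> 'Green'
    aligned["colour_hex_code"] = aligned["colour_hex_code"].str.upper()  # '#dc241f' -> '#DC241F'
    return aligned

# Two rows copied from the Singapore output above.
singapore_sample = pd.DataFrame({
    "objectid": [1, 3],
    "name": ["Eunos MRT Station", "Khatib MRT Station"],
    "stn_no": ["EW7", "NS14"],
    "x": [35782.9553, 27962.3108],
    "y": [33560.0776, 44352.568],
    "latitude": [1.319779, 1.417383],
    "longitude": [103.903234, 103.83298],
    "line_colour": ["GREEN", "RED"],
    "colour_hex_code": ["#009530", "#dc241f"],
    "route_name": ["East West Line", "North South Line"],
    "route_code": ["EWL", "NSL"],
    "city": ["Singapore", "Singapore"],
})

# Assumption: 'stn_no' plays the same role as 'stop_id' in the other datasets.
aligned = to_shared_schema(singapore_sample, {"stn_no": "stop_id"})
print(aligned)
```

Selecting with `[SHARED_COLUMNS]` raises a `KeyError` if a dataset lacks an expected column, which surfaces schema mismatches early.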


## Save Cleaned Datasets
Finally, we will save the cleaned datasets to new CSV files for use in creating our database.
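The saving step could look like the sketch below. The `save_cleaned` helper and the `_cleaned` output filename are hypothetical, and a temporary directory stands in for the notebook's `data` directory so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

def save_cleaned(df: pd.DataFrame, directory: str, filename: str) -> str:
    """Write a dataframe to CSV without the index column and return the file path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    # index=False avoids an extra 'Unnamed: 0' column when the file is read back.
    df.to_csv(path, index=False)
    return path

# Hypothetical one-row sample standing in for a cleaned dataset.
sample = pd.DataFrame({"name": ["KL SENTRAL"], "stop_id": ["KA01"]})

output_dir = tempfile.mkdtemp()  # stand-in for the notebook's 'data' directory
saved_path = save_cleaned(sample, output_dir, "klang_valley_stations_cleaned.csv")

# Read the file back to confirm the round trip preserves the columns.
reloaded = pd.read_csv(saved_path)
print(reloaded)
```

Writing to new filenames rather than overwriting the source files keeps the raw data available for re-running the notebook.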