# *Exploring TTC Streetcar Delays and Forecasting Delays*

- Created on: October 2023
- Created by: Jessica Seo

## 🚂 Loading Weather Data

### Notebook Contents

- Introduction
- Data Loading 
- Data Saving

-----
### Introduction

This notebook is dedicated solely to scraping hourly weather data for January 1, 2021, through September 30, 2023. A 'for' loop over the built-in 'enumerate' function fetches roughly 1,000 pages, one per day, from Environment Canada. The site's */robots.txt* was checked to determine whether these pages are allowed to be scraped. It indicated that scraping was disallowed; however, the fetched data is identical to the CSV file shared on the page.

Disclaimer: The loaded dataset is not used to create another web page.
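
The robots.txt check described above can be reproduced in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS_TXT` and the helper `is_allowed` are hypothetical and only illustrate the mechanics; they are not a copy of Environment Canada's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
# They are NOT copied from the real Environment Canada file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /climate_data/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page_url = "https://climate.weather.gc.ca/climate_data/hourly_data_e.html"
allowed = is_allowed(EXAMPLE_ROBOTS_TXT, page_url)
print("Allowed to scrape:", allowed)
```

To check the live file instead of a hard-coded string, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads and parses the site's robots.txt before `can_fetch` is called.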

------

### Data Loading

In [1]:
#importing necessary python libraries
import pandas as pd
import time

In [3]:
#Reading Toronto City weather data from Environment Canada for a single day (January 13, 2021)
#pd.read_html returns a list of tables; the first one holds the hourly observations
web = pd.read_html('https://climate.weather.gc.ca/climate_data/hourly_data_e.html?hlyRange=2009-12-10%7C2023-10-03&dlyRange=2010-02-02%7C2023-10-02&mlyRange=%7C&StationID=48549&Prov=ON&urlExtension=_e.html&searchType=stnProv&optLimit=yearRange&StartYear=2022&EndYear=2023&selRowPerPage=25&Line=179&lstProvince=ON&timeframe=1&time=LST&time=LST&Year=2021&Month=1&Day=13#')[0]
web

Unnamed: 0,TIME LST,Temp Definition °C,Dew Point Definition °C,Rel Hum Definition %,Precip. Amount Definition mm,Wind Dir Definition 10's deg,Wind Spd Definition km/h,Visibility Definition km,Stn Press Definition kPa,Hmdx Definition,Wind Chill Definition,Weather Definition
0,00:00,-0.4,-2.3,87,0.0,24.0,17,9.7,100.75,,-5.0,Fog
1,01:00,-0.5,-2.3,88,0.0,24.0,18,8.1,100.72,,-6.0,Fog
2,02:00,-0.4,-2.2,88,0.0,24.0,18,8.1,100.68,,-5.0,Fog
3,03:00,-0.1,-1.9,88,0.0,24.0,21,9.7,100.67,,-6.0,Fog
4,04:00,0.1,-2.3,84,0.0,27.0,18,9.7,100.59,,,Fog
5,05:00,0.2,-1.9,86,0.0,22.0,21,8.1,100.52,,,Fog
6,06:00,0.5,-0.8,91,0.0,23.0,21,6.4,100.53,,,"Rain, Fog"
7,07:00,0.9,-0.1,93,0.0,22.0,21,4.0,100.37,,,Fog
8,08:00,1.2,-0.9,86,0.0,23.0,24,8.1,100.31,,,Fog
9,09:00,1.7,-0.4,86,0.0,22.0,32,6.4,100.28,,,Fog


In [4]:
#Building the list of dates from Jan 1, 2021 to Sept 30, 2023 (inclusive)
import datetime

dates = []

start_date = datetime.date(2021, 1, 1)
end_date = datetime.date(2023, 9, 30)
delta = datetime.timedelta(days=1)

#Using a separate cursor so start_date keeps its original value
current_date = start_date
while current_date <= end_date:
    dates.append(current_date)
    print(current_date)
    current_date += delta

2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05
2021-01-06
2021-01-07
2021-01-08
2021-01-09
2021-01-10
2021-01-11
2021-01-12
2021-01-13
2021-01-14
2021-01-15
2021-01-16
2021-01-17
2021-01-18
2021-01-19
2021-01-20
2021-01-21
2021-01-22
2021-01-23
2021-01-24
2021-01-25
2021-01-26
2021-01-27
2021-01-28
2021-01-29
2021-01-30
2021-01-31
2021-02-01
2021-02-02
2021-02-03
2021-02-04
2021-02-05
2021-02-06
2021-02-07
2021-02-08
2021-02-09
2021-02-10
2021-02-11
2021-02-12
2021-02-13
2021-02-14
2021-02-15
2021-02-16
2021-02-17
2021-02-18
2021-02-19
2021-02-20
2021-02-21
2021-02-22
2021-02-23
2021-02-24
2021-02-25
2021-02-26
2021-02-27
2021-02-28
2021-03-01
2021-03-02
2021-03-03
2021-03-04
2021-03-05
2021-03-06
2021-03-07
2021-03-08
2021-03-09
2021-03-10
2021-03-11
2021-03-12
2021-03-13
2021-03-14
2021-03-15
2021-03-16
2021-03-17
2021-03-18
2021-03-19
2021-03-20
2021-03-21
2021-03-22
2021-03-23
2021-03-24
2021-03-25
2021-03-26
2021-03-27
2021-03-28
2021-03-29
2021-03-30
2021-03-31
2021-04-01
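
The same list of dates can also be built without an explicit loop. This is a sketch of an alternative, not the approach used in this notebook: it relies on `pd.date_range`, and the name `dates_alt` is introduced here only for illustration.

```python
import pandas as pd

# One timestamp per calendar day, both endpoints included
date_index = pd.date_range(start="2021-01-01", end="2023-09-30", freq="D")

# Converting to plain datetime.date objects, matching the loop's output
dates_alt = [timestamp.date() for timestamp in date_index]

print(len(dates_alt), dates_alt[0], dates_alt[-1])
```

The range covers 365 + 365 + 273 = 1,003 days, which is where the "approximately 1000 URLs" figure comes from.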

In [5]:
dates[0].year

2021

In [6]:
dates[0].month

1

In [7]:
#Testing a short pause with time.sleep (time is already imported above)
#The scraping loop below pauses between requests so that ~1,000 requests do not overwhelm the server
time.sleep(1)

In [8]:
#Looping over the dates with enumerate to scrape one URL per day
scraped = {}

for index, date in enumerate(dates):
    print("Scraping this date:", date, f"-- {round(index/len(dates)*100, 2)}% done", end="\r")
    try:
        new_df = pd.read_html(f"https://climate.weather.gc.ca/climate_data/hourly_data_e.html?hlyRange=2009-12-10%7C2023-10-03&dlyRange=2010-02-02%7C2023-10-02&mlyRange=%7C&StationID=48549&Prov=ON&urlExtension=_e.html&searchType=stnProv&optLimit=yearRange&StartYear=2022&EndYear=2023&selRowPerPage=25&Line=179&lstProvince=ON&timeframe=1&time=LST&time=LST&Year={date.year}&Month={date.month}&Day={date.day}#")[0]
        new_df["date"] = date
        scraped[date] = new_df
    except Exception as error:
        #Recording the failure so the missing date can be retried later
        print("We failed to scrape", date, "-", error)
        scraped[date] = None
    #Pausing after every request, successful or not
    time.sleep(0.3)

Scraping this date: 2023-09-30 -- 99.9% done
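
A failed request in the loop above is recorded as `None` and never retried. The sketch below shows one way retries could be added, assuming a few attempts with a growing pause are acceptable. The names `BASE_URL`, `build_url`, and `fetch_with_retries` are hypothetical helpers, not part of the original notebook; the query string is copied from the URL used above.

```python
import datetime
import time

# Query string copied from the notebook's URL; only Year, Month, and Day vary.
BASE_URL = (
    "https://climate.weather.gc.ca/climate_data/hourly_data_e.html"
    "?hlyRange=2009-12-10%7C2023-10-03&dlyRange=2010-02-02%7C2023-10-02"
    "&mlyRange=%7C&StationID=48549&Prov=ON&urlExtension=_e.html"
    "&searchType=stnProv&optLimit=yearRange&StartYear=2022&EndYear=2023"
    "&selRowPerPage=25&Line=179&lstProvince=ON&timeframe=1&time=LST&time=LST"
)


def build_url(date):
    """Return the hourly-data page URL for one calendar day."""
    return f"{BASE_URL}&Year={date.year}&Month={date.month}&Day={date.day}#"


def fetch_with_retries(date, fetch, attempts=3, pause=0.3):
    """Call fetch(url) up to `attempts` times; return None if every try fails.

    `fetch` is any callable that takes a URL and returns a table,
    for example: lambda url: pd.read_html(url)[0]
    """
    url = build_url(date)
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            print(f"Attempt {attempt} failed for {date}: {error}")
            # Waiting a little longer after each failed attempt
            time.sleep(pause * attempt)
    return None


print(build_url(datetime.date(2021, 1, 13)))
```

Passing the fetching function in as an argument keeps the retry logic testable without network access. Inside the scraping loop, the call would look like `scraped[date] = fetch_with_retries(date, lambda url: pd.read_html(url)[0])`.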

In [10]:
#Concatenating all the scraped data (pd.concat silently drops the None entries left by failed dates)
weather_df = pd.concat(scraped.values())
weather_df

Unnamed: 0,TIME LST,Temp Definition °C,Dew Point Definition °C,Rel Hum Definition %,Precip. Amount Definition mm,Wind Dir Definition 10's deg,Wind Spd Definition km/h,Visibility Definition km,Stn Press Definition kPa,Hmdx Definition,Wind Chill Definition,Weather Definition,date
0,00:00,-1.3,-4.7,78,0.0,25.0,4,16.1,102.17,,-3.0,LegendNANA,2021-01-01
1,01:00,-1.2,-3.6,84,0.0,24.0,4,16.1,102.16,,-3.0,LegendNANA,2021-01-01
2,02:00,-1.8,-3.2,90,0.0,27.0,4,16.1,102.19,,-3.0,LegendNANA,2021-01-01
3,03:00,-2.0,-3.3,91,0.0,30.0,5,16.1,102.26,,-4.0,LegendNANA,2021-01-01
4,04:00,-1.4,-3.5,85,0.0,27.0,5,16.1,102.24,,-3.0,LegendNANA,2021-01-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...
19,19:00,17.6,15.7,89,0.0,24,4,16.1,101.52,,,LegendNANA,2023-09-30
20,20:00,17.0,15.7,92,0.0,25,5,16.1,101.53,,,LegendNANA,2023-09-30
21,21:00,16.6,15.1,91,0.0,26,8,16.1,101.53,,,LegendNANA,2023-09-30
22,22:00,16.6,15.1,91,0.0,LegendMM,4,16.1,101.57,,,LegendNANA,2023-09-30


The data has 24,072 rows and 13 columns.
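
The row count can serve as a completeness check: 1,003 days × 24 hourly rows = 24,072, so no day appears to be missing. The sketch below shows one possible way to combine the daily tables while keeping track of failed dates. `combine_days` and the small `example` dictionary are hypothetical, and `ignore_index=True` is an optional choice that replaces the repeating 0 to 23 row labels with a single running index.

```python
import pandas as pd


def combine_days(scraped):
    """Concatenate the daily tables and report which dates failed to scrape."""
    frames = [frame for frame in scraped.values() if frame is not None]
    failed = [date for date, frame in scraped.items() if frame is None]
    combined = pd.concat(frames, ignore_index=True)
    return combined, failed


# Tiny made-up stand-in for the real `scraped` dictionary
example = {
    "2021-01-01": pd.DataFrame({"TIME LST": ["00:00", "01:00"], "Temp": [-1.3, -1.2]}),
    "2021-01-02": None,  # simulating a failed request
    "2021-01-03": pd.DataFrame({"TIME LST": ["00:00", "01:00"], "Temp": [0.4, 0.1]}),
}

combined, failed = combine_days(example)
print(combined.shape, failed)
```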

In [11]:
#Checking to see if it loaded correctly
weather_df.sample(20)

Unnamed: 0,TIME LST,Temp Definition °C,Dew Point Definition °C,Rel Hum Definition %,Precip. Amount Definition mm,Wind Dir Definition 10's deg,Wind Spd Definition km/h,Visibility Definition km,Stn Press Definition kPa,Hmdx Definition,Wind Chill Definition,Weather Definition,date
0,00:00,5.4,0.3,70,0.0,6.0,28,16.1,101.49,,,LegendNANA,2022-04-21
20,20:00,4.0,2.0,87,0.0,26.0,21,16.1,100.14,,,Rain,2022-11-12
10,10:00,25.3,23.5,90,0.0,17.0,11,16.1,100.82,36.0,,LegendNANA,2021-08-29
8,08:00,19.5,18.6,95,2.5,12.0,5,16.1,100.43,,,Rain,2023-09-12
23,23:00,21.6,13.9,61,0.0,33.0,24,16.1,100.26,25.0,,LegendNANA,2022-06-26
2,02:00,16.5,10.4,67,0.0,11.0,18,16.1,101.78,,,LegendNANA,2023-09-27
4,04:00,-9.0,-15.8,58,0.0,4.0,21,16.1,102.5,,-17.0,LegendNANA,2022-02-24
23,23:00,-2.3,-8.7,62,0.0,30.0,28,16.1,100.76,,-9.0,LegendNANA,2021-12-22
3,03:00,16.0,5.4,49,0.0,6.0,22,16.1,100.06,,,LegendNANA,2023-04-15
11,11:00,21.4,14.7,65,0.0,18.0,13,16.1,101.4,25.0,,LegendNANA,2023-08-01
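
The sampled rows contain placeholder strings such as `LegendNANA` and `LegendMM`. These look like residue from the flag legend on the web page (an assumption; the notebook does not say so). A minimal sketch of how such values could be turned into proper missing values follows. `clean_legend_flags` is a hypothetical helper, and the `sample` frame only mimics two of the real columns.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the placeholder strings seen in the scraped table
sample = pd.DataFrame({
    "Wind Dir Definition 10's deg": ["24", "LegendMM", "26"],
    "Weather Definition": ["LegendNANA", "Rain", "LegendNANA"],
})


def clean_legend_flags(df):
    """Replace any text value starting with 'Legend' by NaN (assumed to be a flag)."""
    cleaned = df.copy()
    for col in cleaned.columns:
        # Numeric columns cannot contain the placeholder strings
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            continue
        is_flag = cleaned[col].astype(str).str.startswith("Legend")
        cleaned[col] = cleaned[col].mask(is_flag, np.nan)
    return cleaned


cleaned = clean_legend_flags(sample)

# Once the flags are gone, the wind direction column can become numeric
wind_dir = pd.to_numeric(cleaned["Wind Dir Definition 10's deg"])
print(cleaned)
print(wind_dir.dtype)
```

Whether a flagged weather value should become missing or a label such as "Clear" is a modelling decision this sketch does not make.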


In [12]:
#Adding datetime column
weather_df["datetime"] = pd.to_datetime(weather_df["date"].astype(str) + " " + weather_df["TIME LST"])
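
The cell above joins the date and the hour into one string and lets pandas infer the layout. As a sketch, the same step with an explicit `format` makes the expected layout unambiguous and raises an error on malformed values. The `example` frame is made up for illustration.

```python
import datetime

import pandas as pd

# Made-up rows with the same two columns the notebook combines
example = pd.DataFrame({
    "date": [datetime.date(2021, 1, 1), datetime.date(2021, 1, 1)],
    "TIME LST": ["00:00", "01:00"],
})

# "2021-01-01" + " " + "01:00" parsed with an explicit layout
example["datetime"] = pd.to_datetime(
    example["date"].astype(str) + " " + example["TIME LST"],
    format="%Y-%m-%d %H:%M",
)

print(example["datetime"])
```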

In [27]:
#Sanity check (cell re-run after the drop in In [16], so date and TIME LST no longer appear)
weather_df.head(3)

Unnamed: 0,Temp Definition °C,Dew Point Definition °C,Rel Hum Definition %,Precip. Amount Definition mm,Wind Dir Definition 10's deg,Wind Spd Definition km/h,Visibility Definition km,Stn Press Definition kPa,Hmdx Definition,Wind Chill Definition,Weather Definition,datetime
0,-1.3,-4.7,78,0.0,25.0,4,16.1,102.17,,-3.0,LegendNANA,2021-01-01 00:00:00
1,-1.2,-3.6,84,0.0,24.0,4,16.1,102.16,,-3.0,LegendNANA,2021-01-01 01:00:00
2,-1.8,-3.2,90,0.0,27.0,4,16.1,102.19,,-3.0,LegendNANA,2021-01-01 02:00:00


In [16]:
#Dropping the date and TIME LST columns, which are now combined in datetime
weather_df.drop(['date', 'TIME LST'],axis=1, inplace=True)

In [28]:
#Sanity Check
weather_df

Unnamed: 0,Temp Definition °C,Dew Point Definition °C,Rel Hum Definition %,Precip. Amount Definition mm,Wind Dir Definition 10's deg,Wind Spd Definition km/h,Visibility Definition km,Stn Press Definition kPa,Hmdx Definition,Wind Chill Definition,Weather Definition,datetime
0,-1.3,-4.7,78,0.0,25.0,4,16.1,102.17,,-3.0,LegendNANA,2021-01-01 00:00:00
1,-1.2,-3.6,84,0.0,24.0,4,16.1,102.16,,-3.0,LegendNANA,2021-01-01 01:00:00
2,-1.8,-3.2,90,0.0,27.0,4,16.1,102.19,,-3.0,LegendNANA,2021-01-01 02:00:00
3,-2.0,-3.3,91,0.0,30.0,5,16.1,102.26,,-4.0,LegendNANA,2021-01-01 03:00:00
4,-1.4,-3.5,85,0.0,27.0,5,16.1,102.24,,-3.0,LegendNANA,2021-01-01 04:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
19,17.6,15.7,89,0.0,24,4,16.1,101.52,,,LegendNANA,2023-09-30 19:00:00
20,17.0,15.7,92,0.0,25,5,16.1,101.53,,,LegendNANA,2023-09-30 20:00:00
21,16.6,15.1,91,0.0,26,8,16.1,101.53,,,LegendNANA,2023-09-30 21:00:00
22,16.6,15.1,91,0.0,LegendMM,4,16.1,101.57,,,LegendNANA,2023-09-30 22:00:00


----
### Data Saving

In [19]:
#saving the dataframe to a new csv file! 
weather_df.to_csv('final_weather.csv', index=False)
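
CSV files store timestamps as plain text, so the `datetime` column needs to be parsed again when the file is read back. The sketch below demonstrates the round trip on a made-up two-row frame, using an in-memory buffer in place of `final_weather.csv`.

```python
import io

import pandas as pd

# Made-up stand-in for weather_df
example = pd.DataFrame({
    "Temp Definition °C": [-1.3, -1.2],
    "datetime": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
})

# Writing to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
example.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load
reloaded = pd.read_csv(buffer, parse_dates=["datetime"])
print(reloaded.dtypes)
```

With the real file, the equivalent call would be `pd.read_csv('final_weather.csv', parse_dates=['datetime'])`.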