# Downloading Weather Data

Weather conditions can affect bike usage. For example, people are probably less inclined to ride a bike in heavy rain than on a sunny day. Having access to weather data such as temperature, rain amount and visibility could therefore prove useful.

In this notebook, we will download weather data from https://climate.weather.gc.ca/. Two forms of data are available, distinguished by time interval: daily weather data and hourly weather data. We will download, examine and clean them separately.

In [14]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

Daily data contains potentially useful information such as the max/min temperature and the total rain amount for each day.

Compared to daily data, hourly data has the advantage that our bike usage data is also recorded on an hourly basis, so it might reveal a relation between weather conditions and bike usage. One drawback of hourly data is a significantly larger number of missing values. Another is that "Total Rain (mm)" is not available in hourly data.

## Download Daily Data

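The daily data can be downloaded from a URL of the following form, where timeframe=2 selects daily data (the download function below substitutes its own station_id for the stationID value shown here):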
https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=1706&Year=${year}&Month=${month}&Day=14&timeframe=2&submit= Download+Data
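
As a minimal sketch, the same kind of URL can be assembled from a parameter dictionary instead of a hand-written string. The names `BASE_URL` and `build_weather_url` are hypothetical and not part of this notebook. The sketch also assumes that the server accepts `submit=Download+Data` without the leading space that appears in the template above.

```python
from urllib.parse import urlencode

BASE_URL = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html"


def build_weather_url(station_id, year, month=None):
    """Return the bulk-download URL for one year of daily data,
    or for one month of hourly data when `month` is given."""
    params = {
        "format": "csv",
        "stationID": station_id,
        "Year": year,
    }
    if month is not None:
        params["Month"] = month
    params["Day"] = 14
    # timeframe=1 requests hourly data, timeframe=2 requests daily data
    params["timeframe"] = 1 if month is not None else 2
    # urlencode writes the space in "Download Data" as "+"
    params["submit"] = "Download Data"
    return BASE_URL + "?" + urlencode(params)


print(build_weather_url("51442", "2017"))        # daily data for 2017
print(build_weather_url("51442", "2017", "01"))  # hourly data for 2017-01
```

Building the query string this way keeps the daily and hourly cases in one place, so a typo in one of the two long URL strings cannot make them drift apart.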

In [27]:
#Define a function which downloads weather data for the specified years and months.
#years must be a list of strings of years, e.g. ['2023','2024']
#months must be a list of strings of months, e.g. ['01','02','03']
#If months is omitted or empty, daily data is downloaded (one file per year);
#otherwise hourly data is downloaded (one file per year and month).
#station_id is the Station ID of the weather observer. Vancouver Intl A has station ID 51442.

def download_weather(years,months=None,station_id='51442'):

    for year in years:
        if not months:
            filename = year + '_Daily.csv'
            if os.path.exists(filename):
                print(filename+' already exists. No download initiated.')
            else:
                url = f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={station_id}&Year={year}&Day=14&timeframe=2&submit= Download+Data"
                response = requests.get(url)
                if response.status_code != 200:
                    print("unable to download daily data for "+year+f". Error code: {response.status_code}")
                else:
                    with open(filename,'wb') as f:
                        for chunk in response.iter_content(chunk_size=1024):
                            if chunk:
                                f.write(chunk)
                    print(filename + ' downloaded.')
                    response.close()
        else:
            for month in months:
                filename = year + '-' + month +'_Hourly.csv'
                if os.path.exists(filename):
                    print(filename + ' already exists. No download initiated.')
                else:
                    url = f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={station_id}&Year={year}&Month={month}&Day=14&timeframe=1&submit= Download+Data"
                    response = requests.get(url)
                    if response.status_code != 200:
                        print("unable to download hourly data for " + "-".join([year,month]) + f". Error code: {response.status_code}")
                    else:
                        with open(filename,'wb') as f:
                            for chunk in response.iter_content(chunk_size=1024):
                                if chunk:
                                    f.write(chunk)
                        print(filename + ' downloaded.')
                        response.close()

In [23]:
#Set the range of dates
years = ['2017','2018','2019','2020','2021','2022','2023','2024']
months = ['01','02','03','04','05','06','07','08','09','10','11','12']

In [24]:
#Download the daily weather data 
download_weather(years)

Start downloading daily weather data.
2017_Daily.csv downloaded.
Start downloading daily weather data.
2018_Daily.csv downloaded.
Start downloading daily weather data.
2019_Daily.csv downloaded.
Start downloading daily weather data.
2020_Daily.csv downloaded.
Start downloading daily weather data.
2021_Daily.csv downloaded.
Start downloading daily weather data.
2022_Daily.csv downloaded.
Start downloading daily weather data.
2023_Daily.csv downloaded.
Start downloading daily weather data.
2024_Daily.csv downloaded.
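
With one file per year on disk, a natural next step is to stack them into a single DataFrame. The sketch below assumes the `{year}_Daily.csv` naming used by `download_weather`. The helper `load_daily` and its `folder` parameter are hypothetical and do not appear elsewhere in this notebook.

```python
import os

import pandas as pd


def load_daily(years, folder="."):
    """Read the {year}_Daily.csv files found in `folder` and stack them."""
    frames = []
    for year in years:
        path = os.path.join(folder, f"{year}_Daily.csv")
        if os.path.exists(path):
            frames.append(pd.read_csv(path))
        else:
            print(f"{path} not found, skipping.")
    if not frames:
        # Nothing was read, so return an empty frame rather than failing
        return pd.DataFrame()
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Calling `load_daily(years)` with the list of years defined earlier would then return all daily observations in one frame, in the order the years are listed.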


## Download Hourly Data 

In [28]:
download_weather(years,months)

2017-01_Hourly.csv downloaded.
2017-02_Hourly.csv downloaded.
2017-03_Hourly.csv downloaded.
2017-04_Hourly.csv downloaded.
2017-05_Hourly.csv downloaded.
2017-06_Hourly.csv downloaded.
2017-07_Hourly.csv downloaded.
2017-08_Hourly.csv downloaded.
2017-09_Hourly.csv downloaded.
2017-10_Hourly.csv downloaded.
2017-11_Hourly.csv downloaded.
2017-12_Hourly.csv downloaded.
2018-01_Hourly.csv downloaded.
2018-02_Hourly.csv downloaded.
2018-03_Hourly.csv downloaded.
2018-04_Hourly.csv downloaded.
2018-05_Hourly.csv downloaded.
2018-06_Hourly.csv downloaded.
2018-07_Hourly.csv downloaded.
2018-08_Hourly.csv downloaded.
2018-09_Hourly.csv downloaded.
2018-10_Hourly.csv downloaded.
2018-11_Hourly.csv downloaded.
2018-12_Hourly.csv downloaded.
2019-01_Hourly.csv downloaded.
2019-02_Hourly.csv downloaded.
2019-03_Hourly.csv downloaded.
2019-04_Hourly.csv downloaded.
2019-05_Hourly.csv downloaded.
2019-06_Hourly.csv downloaded.
2019-07_Hourly.csv downloaded.
2019-08_Hourly.csv downloaded.
2019-09_

## Data Cleaning

### Daily Data

We will first look at a sample from daily data.

In [43]:
#import the 2017 daily data into pandas dataframe
df_d_2017 = pd.read_csv("2017_Daily.csv")

In [44]:
#First five rows
df_d_2017.head()

Unnamed: 0,Longitude (x),Latitude (y),Station Name,Climate ID,Date/Time,Year,Month,Day,Data Quality,Max Temp (°C),...,Total Snow (cm),Total Snow Flag,Total Precip (mm),Total Precip Flag,Snow on Grnd (cm),Snow on Grnd Flag,Dir of Max Gust (10s deg),Dir of Max Gust Flag,Spd of Max Gust (km/h),Spd of Max Gust Flag
0,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01,2017,1,1,,2.2,...,0.0,,0.0,,3.0,,34.0,,41,
1,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-02,2017,1,2,,1.4,...,0.0,,0.0,,2.0,,,,<31,
2,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-03,2017,1,3,,0.4,...,0.0,,0.0,,1.0,,,,<31,
3,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-04,2017,1,4,,2.2,...,0.0,,0.0,,1.0,,8.0,,32,
4,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-05,2017,1,5,,0.7,...,0.0,,0.0,,1.0,,,,<31,


We are probably only interested in the temperature columns, "Total Rain (mm)", "Total Snow (cm)", "Total Precip (mm)" and perhaps "Spd of Max Gust (km/h)".

In [45]:
#Drop the columns we do not need: station metadata, quality flags,
#degree days, snow on ground and gust direction
cols_to_drop = [
    'Longitude (x)','Latitude (y)','Station Name','Climate ID','Data Quality',
    'Max Temp Flag','Min Temp Flag','Mean Temp Flag',
    'Heat Deg Days (°C)','Heat Deg Days Flag','Cool Deg Days (°C)','Cool Deg Days Flag',
    'Total Rain Flag','Total Snow Flag','Total Precip Flag',
    'Snow on Grnd (cm)','Snow on Grnd Flag',
    'Dir of Max Gust (10s deg)','Dir of Max Gust Flag','Spd of Max Gust Flag',
]
df_d_2017.drop(columns=cols_to_drop, inplace=True)

In [46]:
df_d_2017.head()

Unnamed: 0,Date/Time,Year,Month,Day,Max Temp (°C),Min Temp (°C),Mean Temp (°C),Total Rain (mm),Total Snow (cm),Total Precip (mm),Spd of Max Gust (km/h)
0,2017-01-01,2017,1,1,2.2,-2.3,-0.1,0.0,0.0,0.0,41
1,2017-01-02,2017,1,2,1.4,-6.0,-2.3,0.0,0.0,0.0,<31
2,2017-01-03,2017,1,3,0.4,-7.8,-3.7,0.0,0.0,0.0,<31
3,2017-01-04,2017,1,4,2.2,-8.4,-3.1,0.0,0.0,0.0,32
4,2017-01-05,2017,1,5,0.7,-6.6,-3.0,0.0,0.0,0.0,<31


In [47]:
#check missing values
df_d_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date/Time               365 non-null    object 
 1   Year                    365 non-null    int64  
 2   Month                   365 non-null    int64  
 3   Day                     365 non-null    int64  
 4   Max Temp (°C)           361 non-null    float64
 5   Min Temp (°C)           365 non-null    float64
 6   Mean Temp (°C)          361 non-null    float64
 7   Total Rain (mm)         360 non-null    float64
 8   Total Snow (cm)         360 non-null    float64
 9   Total Precip (mm)       362 non-null    float64
 10  Spd of Max Gust (km/h)  360 non-null    object 
dtypes: float64(6), int64(3), object(2)
memory usage: 31.5+ KB
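
The `object` dtype of "Spd of Max Gust (km/h)" in the output above comes from entries such as `<31`, which presumably mark days on which the strongest gust stayed below 31 km/h. One possible way to make the column numeric is sketched below on toy values. The names `gust`, `gust_below_31` and `gust_kmh` are hypothetical, and treating `<31` as missing rather than as some fixed value is an assumption, not something the data source prescribes.

```python
import pandas as pd

# Toy values mirroring the first rows shown above; None stands for a missing entry
gust = pd.Series(["41", "<31", "<31", "32", None], name="Spd of Max Gust (km/h)")

# Flag the "<31" readings first so that this information survives the conversion
gust_below_31 = gust.eq("<31")

# errors="coerce" turns anything that is not a number (here "<31") into NaN
gust_kmh = pd.to_numeric(gust, errors="coerce")

print(pd.DataFrame({"gust_kmh": gust_kmh, "gust_below_31": gust_below_31}))
```

Keeping the flag as a separate column makes it possible to tell a calm day apart from a day with no reading at all.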


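There are a few missing values in the "Max Temp (°C)", "Mean Temp (°C)", "Total Rain (mm)", "Total Snow (cm)", "Total Precip (mm)" and "Spd of Max Gust (km/h)" columns. In addition, "Spd of Max Gust (km/h)" is stored as object rather than as a number because it contains entries such as "<31".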

It seems reasonable to fill the missing values by interpolating each column between the adjacent dates. For example, if Max Temp (°C) is missing on Day 5, we can set it to the average of the Max Temp (°C) values on Day 4 and Day 6.
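
The idea can be sketched with pandas' built-in linear interpolation. The frame below is a toy example with made-up values, not the downloaded data, and the name `df` is hypothetical. Whether linear interpolation is also appropriate for rain and snow amounts is left open here.

```python
import numpy as np
import pandas as pd

# Toy frame: the third day has no Max Temp reading
df = pd.DataFrame({
    "Date/Time": pd.date_range("2017-01-01", periods=5, freq="D"),
    "Max Temp (°C)": [2.2, 1.4, np.nan, 2.2, 0.7],
})

# On evenly spaced rows, linear interpolation fills a single gap with the
# average of its two neighbours
df["Max Temp (°C)"] = df["Max Temp (°C)"].interpolate(method="linear")

print(df)
```

With the default settings, a run of several missing days is filled along a straight line between the nearest valid readings, while a gap at the very start of the series is left as NaN.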

In [None]:
#Clean data here




### Hourly Data

We will now check the hourly data. Take the 2017-01 hourly data as a sample.

In [33]:
df_h_2017_01 = pd.read_csv("2017-01_Hourly.csv")

In [34]:
#Look at first five rows
df_h_2017_01.head()

Unnamed: 0,Longitude (x),Latitude (y),Station Name,Climate ID,Date/Time (LST),Year,Month,Day,Time (LST),Temp (°C),...,Wind Spd Flag,Visibility (km),Visibility Flag,Stn Press (kPa),Stn Press Flag,Hmdx,Hmdx Flag,Wind Chill,Wind Chill Flag,Weather
0,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01 00:00,2017,1,1,00:00,1.2,...,,19.3,,100.54,,,,,,
1,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01 01:00,2017,1,1,01:00,0.9,...,,24.1,,100.55,,,,,,Cloudy
2,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01 02:00,2017,1,1,02:00,1.2,...,,19.3,,100.61,,,,,,
3,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01 03:00,2017,1,1,03:00,0.6,...,,19.3,,100.65,,,,,,
4,-123.18,49.19,VANCOUVER INTL A,1108395,2017-01-01 04:00,2017,1,1,04:00,0.6,...,,19.3,,100.65,,,,,,Cloudy


For hourly data, we are probably only interested in the "Temp (°C)", "Visibility (km)" and "Weather" columns. However, the "Weather" column is missing a large number of values, as we can see below.

In [35]:
df_h_2017_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 744 entries, 0 to 743
Data columns (total 30 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Longitude (x)        744 non-null    float64
 1   Latitude (y)         744 non-null    float64
 2   Station Name         744 non-null    object 
 3   Climate ID           744 non-null    int64  
 4   Date/Time (LST)      744 non-null    object 
 5   Year                 744 non-null    int64  
 6   Month                744 non-null    int64  
 7   Day                  744 non-null    int64  
 8   Time (LST)           744 non-null    object 
 9   Temp (°C)            744 non-null    float64
 10  Temp Flag            0 non-null      float64
 11  Dew Point Temp (°C)  744 non-null    float64
 12  Dew Point Temp Flag  0 non-null      float64
 13  Rel Hum (%)          744 non-null    int64  
 14  Rel Hum Flag         0 non-null      float64
 15  Precip. Amount (mm)  0 non-null      flo

Only 309 of the 744 observations (about 42%) have a non-null weather condition.
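
A possible first cleaning step for the hourly data is to keep only the columns of interest, parse the timestamp and measure how much of each column is missing. The sketch below uses a toy frame with made-up values and the column names shown above. The names `hourly`, `keep` and `missing_share` are hypothetical.

```python
import pandas as pd

# Toy hourly frame; "Temp Flag" stands in for the columns we do not need
hourly = pd.DataFrame({
    "Date/Time (LST)": ["2017-01-01 00:00", "2017-01-01 01:00",
                        "2017-01-01 02:00", "2017-01-01 03:00"],
    "Temp (°C)": [1.2, 0.9, 1.2, 0.6],
    "Visibility (km)": [19.3, 24.1, 19.3, 19.3],
    "Weather": [None, "Cloudy", None, None],
    "Temp Flag": [None, None, None, None],
})

# Keep only the columns of interest and parse the timestamp strings
keep = ["Date/Time (LST)", "Temp (°C)", "Visibility (km)", "Weather"]
hourly = hourly[keep].copy()
hourly["Date/Time (LST)"] = pd.to_datetime(hourly["Date/Time (LST)"])

# Share of missing values per column, between 0 and 1
missing_share = hourly.isna().mean()
print(missing_share)
```

Applied to the real 2017-01 file, the same `isna().mean()` call would report the share of missing "Weather" entries directly, without reading it off the `info()` output.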