# Introduction to Data Science

### Project topic: Weather forecasting in Toronto City

### Description
- _This project applies our knowledge of Data Science to the problem of weather forecasting for the city of Toronto._

- _The dataset comes from the Government of Canada's historical climate data website: [climate](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html)_

- _We use the Python libraries `requests` and `BeautifulSoup` to collect the data automatically._

- _Using the collected dataset, we will apply Data Science techniques to forecast the hourly weather in Toronto._

### Group - 07:
- **21127072 - Nguyễn Hữu Khánh**
- 21127160 - Nguyễn Thanh Sơn
- 21127246 - Lê Minh Đức

### Prepare

- Install libraries

In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install lxml

- Import libraries

In [35]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

from datetime import datetime
from bs4 import BeautifulSoup
from io import StringIO

- Declare important variables

In [26]:
# Data URL
url = 'https://climate.weather.gc.ca/climate_data/hourly_data_e.html'

# DataFrame to store data
datas = pd.DataFrame()

### Support Function

In [27]:
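def is_valid_date(year, month, day):
    """Return True if (year, month, day) is a real calendar date."""
    try:
        datetime(year, month, day)
        return True
    except ValueError:
        return False


def fixDataFramTimeUTC(df, year, month, day):
    """Prefix each 'TIME UTC' value (e.g. '00:00') with the ISO date, in place."""
    date_prefix = datetime(year, month, day).strftime("%Y-%m-%d")
    df['TIME UTC'] = date_prefix + "T" + df['TIME UTC'].astype(str)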
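
The effect of `fixDataFramTimeUTC` can be checked on a small table. The sketch below is only an illustration: `sample` is a hypothetical two-row frame standing in for one scraped day, and the extra `timestamp` column is an assumption that is not part of the original notebook. It shows that the prefixed strings can be parsed into real timestamps with `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical table shaped like one scraped day (the real table has more columns)
sample = pd.DataFrame({"TIME UTC": ["00:00", "01:00"]})

# Prefix each hour with the ISO date, as fixDataFramTimeUTC does
date_prefix = pd.Timestamp(year=2020, month=1, day=1).strftime("%Y-%m-%d")
sample["TIME UTC"] = date_prefix + "T" + sample["TIME UTC"]

# Parse the strings into timestamps so they can be sorted or resampled later
sample["timestamp"] = pd.to_datetime(sample["TIME UTC"], format="%Y-%m-%dT%H:%M")

print(sample)
```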

### Get data

In [28]:
def getDataFromWeb(m_start, y_start, m_end, y_end, url_data):
    """Download hourly data for every day from m_start/y_start to m_end/y_end (inclusive)."""
    frames = []
    for year in range(y_start, y_end + 1):
        for month in range(1, 13):
            # Skip months outside the requested range
            if (year == y_start and month < m_start) or (year == y_end and month > m_end):
                continue
            for day in range(1, 32):
                # Skip impossible dates such as February 30
                if not is_valid_date(year, month, day):
                    continue
                print(f'{year}-{month}-{day}')
                params = {'StationID': '31688', 'selRowPerPage': 25, 'timeframe': 1,
                          'time': 'UTC', 'Year': year, 'Month': month, 'Day': day}
                response = requests.get(url_data, params=params, timeout=30)
                response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
                soup = BeautifulSoup(response.text, 'html.parser')
                daily_table = soup.find('table')
                df = pd.read_html(StringIO(str(daily_table)))[0]
                fixDataFramTimeUTC(df, year, month, day)
                frames.append(df)
    # Concatenate once at the end; return None if nothing was downloaded
    return pd.concat(frames, ignore_index=True) if frames else None
    

In [None]:
datas = getDataFromWeb(1,2020,2,2020,url)
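
The nested year/month/day loops in `getDataFromWeb` enumerate every calendar day between the start month and the end month. As a possible alternative, the sketch below builds the same sequence with `pd.date_range`, which removes the need for a separate date-validity check. The helper name `iter_days` is hypothetical and not part of the original notebook; no network request is made here.

```python
import pandas as pd

def iter_days(m_start, y_start, m_end, y_end):
    """Yield (year, month, day) from the first day of the start month
    to the last day of the end month, inclusive."""
    first = pd.Timestamp(year=y_start, month=m_start, day=1)
    last = pd.Timestamp(year=y_end, month=m_end, day=1)
    last = last.replace(day=last.days_in_month)  # move to the end of that month
    for ts in pd.date_range(first, last, freq="D"):
        yield ts.year, ts.month, ts.day

# Same range as the call above: January 2020 through February 2020
days = list(iter_days(1, 2020, 2, 2020))
print(len(days), days[0], days[-1])
```

Each tuple could then be passed to the request parameters `Year`, `Month`, and `Day` in place of the loop variables.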

### Data preprocessing

#### Remove unnecessary or unpredictable columns

In [34]:
datas = datas[["TIME UTC",
               "Temp Definition °C",
               "Dew Point Definition °C", 
               "Rel Hum Definition %", 
               "Precip. Amount Definition mm", 
               "Stn Press Definition kPa", 
               "Hmdx Definition", 
               "Weather Definition"]]
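
Before handling missing values, it can help to count them per column. The sketch below uses a small hypothetical frame with two of the selected columns. It assumes that the string `LegendNANA`, which appears in the `Weather Definition` column of the scraped output, is the site's placeholder for an unavailable observation; that interpretation is an assumption, not something the source confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names are taken from the notebook
sample = pd.DataFrame({
    "Temp Definition °C": [0.7, None, 0.2],
    "Weather Definition": ["LegendNANA", "Snow", "LegendNANA"],
})

# Assumption: "LegendNANA" marks a missing value, so convert it to NaN
cleaned = sample.replace("LegendNANA", np.nan)

# Number of missing entries in each column
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```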

#### Handle missing values

##### Fill in missing data rows

##### Fill in the humidex column

- Convert `datas` to a _list_

In [37]:
data_list = datas.values.tolist()

- Set up the humidex calculation function

$$H = T_{air} + \frac{5}{9} \left[ 6.11 \times \exp\left( 5417.7530 \left( \frac{1}{273.15} - \frac{1}{273.15 + T_{dew}} \right) \right) - 10 \right]$$

  - **Where:**
    - **$H$:** is the humidex
    - **$T_{air}$:** is the air temperature in °C
    - **$T_{dew}$:** is the dew point temperature in °C
    - **$\exp$:** is the exponential function

  - _Reference: [Wikipedia](https://en.wikipedia.org/wiki/Humidex)_

In [38]:
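def hmdx_cal(temp, dew):
    """Compute the humidex from air temperature and dew point (both in °C)."""
    try:
        temp = float(temp)
        dew = float(dew)
    except (ValueError, TypeError):
        return -1  # Sentinel for values that cannot be parsed as numbers

    result = temp + (5 / 9) * (6.11 * math.exp(5417.7530 * ((1 / 273.15) - (1 / (273.15 + dew)))) - 10)

    return max(result, 0)  # A negative humidex is not needed here, so clamp it to 0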
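
The same formula can also be applied to whole columns at once instead of row by row. The following is a minimal sketch under stated assumptions: `hmdx_vectorized` and the `sample` frame are hypothetical names, and unlike `hmdx_cal` it leaves unparseable inputs as `NaN` rather than returning `-1`.

```python
import numpy as np
import pandas as pd

def hmdx_vectorized(temp, dew):
    """Humidex for whole columns; non-numeric inputs become NaN."""
    temp = pd.to_numeric(temp, errors="coerce")
    dew = pd.to_numeric(dew, errors="coerce")
    # Same constants as hmdx_cal
    vapour = 6.11 * np.exp(5417.7530 * (1 / 273.15 - 1 / (273.15 + dew)))
    humidex = temp + (5 / 9) * (vapour - 10)
    # Clamp negative values to 0 and round to whole numbers
    return humidex.clip(lower=0).round(0)

# Hypothetical rows: a warm humid hour, a cold hour, and an unparseable reading
sample = pd.DataFrame({"temp": [30.0, 0.7, "n/a"], "dew": [20.0, -4.5, 5.0]})
sample["hmdx"] = hmdx_vectorized(sample["temp"], sample["dew"])
print(sample)
```

With a dew point of 20 °C the humidex is well above the air temperature, while the cold row is clamped to 0, which agrees with the values produced by `hmdx_cal`.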

- Use `hmdx_cal` to calculate humidex

In [39]:
for item in data_list:
    # Positions in each row: 1 = temperature, 2 = dew point, 6 = humidex
    item[6] = round(hmdx_cal(item[1], item[2]), 0)

[['2020-01-01T00:00', 0.7, -4.5, 68.0, 0.0, 98.7, 0, 'LegendNANA'],
 ['2020-01-01T01:00', 0.4, -4.6, 69.0, 0.0, 98.73, 0, 'LegendNANA'],
 ['2020-01-01T02:00', 0.2, -4.9, 68.0, 0.0, 98.75, 0, 'LegendNANA'],
 ['2020-01-01T03:00', -0.1, -5.0, 69.0, 0.0, 98.77, 0, 'LegendNANA'],
 ['2020-01-01T04:00', -0.2, -4.6, 72.0, 0.0, 98.81, 0, 'LegendNANA'],
 ['2020-01-01T05:00', -0.1, -5.2, 69.0, 0.0, 98.82, 0, 'LegendNANA'],
 ['2020-01-01T06:00', -0.1, -4.9, 70.0, 0.0, 98.84, 0, 'LegendNANA'],
 ['2020-01-01T07:00', -0.1, -5.5, 67.0, 0.0, 98.87, 0, 'LegendNANA'],
 ['2020-01-01T08:00', -0.4, -5.3, 69.0, 0.0, 98.91, 0, 'LegendNANA'],
 ['2020-01-01T09:00', -0.7, -4.5, 75.0, 0.0, 98.96, 0, 'LegendNANA'],
 ['2020-01-01T10:00', -0.5, -4.9, 72.0, 0.2, 98.96, 0, 'LegendNANA'],
 ['2020-01-01T11:00', -0.6, -4.8, 73.0, 0.0, 98.99, 0, 'LegendNANA'],
 ['2020-01-01T12:00', -0.5, -4.8, 72.0, 0.0, 99.05, 0, 'LegendNANA'],
 ['2020-01-01T13:00', -0.4, -5.6, 68.0, 0.0, 99.11, 0, 'LegendNANA'],
 ['2020-01-01T14:00', -0

##### Forecast weather via available data