# Introduction to Data Science

### Project topic: Weather forecasting in Toronto City

### Description
- _This project applies our knowledge of Data Science to the weather forecasting problem in Toronto City._

- _The dataset comes from the Government of Canada's historical climate data website: [climate](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html)_

- _We use Python libraries such as `requests` and `BeautifulSoup` to collect the data automatically._

- _Based on the collected dataset, we apply data science techniques to forecast the hourly weather in Toronto City._

### Group - 07:
- **21127072 - Nguyễn Hữu Khánh**
- 21127160 - Nguyễn Thanh Sơn
- 21127246 - Lê Minh Đức

### Prepare

- Install libraries

In [108]:
!pip install requests
!pip install beautifulsoup4
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install lxml
!pip install scikit-learn



- Import libraries

In [129]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

from datetime import datetime
from bs4 import BeautifulSoup
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

- Declare important variables

In [130]:
# Data URL
url = 'https://climate.weather.gc.ca/climate_data/hourly_data_e.html'

# DataFrame to store data
datas = pd.DataFrame()

### Support Function

In [131]:
def is_valid_date(year, month, day):
    try:
        datetime(year, month, day)
        return True
    except ValueError:
        return False
    
def fixDataFramTimeUTC(df, year, month, day):
    nump = np.asarray(df['TIME UTC'])
    nump = datetime(year,month,day).strftime("%Y-%m-%d") + "T" + nump
    df['TIME UTC'] = nump
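
As a quick illustration of what these helpers do, the sketch below validates a calendar date and builds timestamps in the same `YYYY-MM-DDTHH:MM` shape that `fixDataFramTimeUTC` produces, but on a plain list instead of a DataFrame column. The names `date_is_valid` and `prefix_times` are hypothetical and exist only for this example, and the `HH:MM` format of the `TIME UTC` values is an assumption.

```python
from datetime import datetime

def date_is_valid(year, month, day):
    """Return True if the date exists in the calendar (hypothetical helper)."""
    try:
        datetime(year, month, day)
        return True
    except ValueError:
        return False

def prefix_times(times, year, month, day):
    """Prefix each time string with the date, assuming 'HH:MM' inputs."""
    prefix = datetime(year, month, day).strftime("%Y-%m-%d")
    return [f"{prefix}T{t}" for t in times]

valid_leap_day = date_is_valid(2020, 2, 29)   # 2020 is a leap year
invalid_day = date_is_valid(2021, 2, 30)      # February never has 30 days
stamps = prefix_times(["00:00", "01:00"], 2020, 1, 5)
print(valid_leap_day, invalid_day, stamps)
```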

### Get data

#### Use one of the two methods to get data

- **Method 1:** Scrape data from the web

In [132]:
def getDataFromWeb(m_start, y_start, m_end, y_end, url_data):
    result = None
    for year in range(y_start, y_end + 1):
        for month in range(1, 13):
            for day in range(1, 32):
                if not is_valid_date(year, month, day) \
                    or ((year == y_start and month < m_start) or (year == y_end and month > m_end)):
                    continue 
                print(f'{year}-{month}-{day}')   
                params = {'StationID': '31688', 'selRowPerPage': 25, 'timeframe': 1, 'time': 'UTC', 'Year': year, 'Month': month, 'Day': day}
                response = requests.get(url_data, params=params)
                soup = BeautifulSoup(response.text,'html.parser')
                daily_table = soup.find('table')
                df = pd.read_html(StringIO(str(daily_table)))[0]
                fixDataFramTimeUTC(df, year, month, day)
                result = pd.concat([result, df], ignore_index=True)
    return result
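
The scraping loop tries every day from 1 to 31 and skips the combinations that are invalid or out of range. One possible alternative, sketched here with the standard library only, is to walk the calendar directly and build the query string for each day. The names `iter_days` and `build_query` are hypothetical; the parameter values are copied from the `params` dictionary above.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def iter_days(m_start, y_start, m_end, y_end):
    """Yield every calendar day from the first start month to the last end month."""
    current = date(y_start, m_start, 1)
    # First day of the month that follows the end month
    stop = date(y_end + 1, 1, 1) if m_end == 12 else date(y_end, m_end + 1, 1)
    while current < stop:
        yield current
        current += timedelta(days=1)

def build_query(day):
    """Build the query string for one day (same parameters as in the loop above)."""
    params = {'StationID': '31688', 'selRowPerPage': 25, 'timeframe': 1,
              'time': 'UTC', 'Year': day.year, 'Month': day.month, 'Day': day.day}
    return urlencode(params)

days = list(iter_days(1, 2020, 2, 2020))
print(len(days), days[0], days[-1])
print(build_query(days[0]))
```

This sketch only prepares dates and query strings; it does not send any request.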
    

- **Method 2:** Get data from available csv file (_use when you need to save time_)

In [133]:
def getDataFromCSV(filename):
    return pd.read_csv(filename)

#### Choose method to get data

In [134]:
# datas = getDataFromWeb(1,2020,2,2020,url)
datas = getDataFromCSV("weather_data.csv")

### Data preprocessing

#### Remove unnecessary or unpredictable columns

In [135]:
datas = datas[["TIME UTC",
               "Temp Definition °C",
               "Dew Point Definition °C", 
               "Rel Hum Definition %", 
               "Precip. Amount Definition mm", 
               "Stn Press Definition kPa", 
               "Hmdx Definition", 
               "Weather Definition"]]

#### Handle missing value

##### Fill in missing data rows

##### Fill in the humidex index columns

- Convert `datas` to _list_

In [136]:
data_list = datas.values.tolist()

- Set up the humidex calculation function

$$H = T_{air} + \frac{5}{9} \left( 6.11 \times \exp\left( 5417.7530 \left( \frac{1}{273.15} - \frac{1}{273.15 + T_{dew}} \right) \right) - 10 \right)$$

  - **Where:**
    - **$H$:** denotes the Humidex
    - **$T_{air}$:** is the air temperature in °C
    - **$T_{dew}$:** is the dewpoint temperature in °C
    - **$\exp$:** is the exponential function

  - _Reference: [Wikipedia](https://en.wikipedia.org/wiki/Humidex)_
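
To make the formula concrete, here is a small worked example that evaluates it for an air temperature of 30 °C and a dew point of 15 °C, which gives a humidex of roughly 34. The function name `humidex` is used only for this illustration; it omits the error handling and clamping applied in the notebook's own function.

```python
import math

def humidex(t_air, t_dew):
    """Humidex from air temperature and dew point, both in °C."""
    # Vapour pressure term: 6.11 * exp(5417.7530 * (1/273.15 - 1/(273.15 + T_dew)))
    vapour = 6.11 * math.exp(5417.7530 * (1 / 273.15 - 1 / (273.15 + t_dew)))
    return t_air + (5 / 9) * (vapour - 10)

h = humidex(30.0, 15.0)
print(round(h, 1))
```

With a dew point of 0 °C the exponential equals 1, so the correction term reduces to $\frac{5}{9}(6.11 - 10)$, which is negative. A cool, dry hour can therefore have a humidex below the air temperature.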

In [137]:
def hmdx_cal(temp, dew):
    try:
        temp = float(temp)
        dew = float(dew)
    except(ValueError, TypeError):
        return -1
    
    result = temp + (5/9) * (6.11 * math.exp(5417.7530 * ((1 / 273.15)-(1 / (273.15 + dew))))-10)

    return max(result, 0) # Clamp negative humidex values to 0; they are not needed here

- Use `hmdx_cal` to calculate humidex

In [138]:
for item in data_list:
    item[6] = round(hmdx_cal(item[1], item[2]),0)

##### Forecast weather via available data

- Convert `list` to `DataFrame`

In [139]:
columns = ["TIME UTC", "Temp Definition °C", "Dew Point Definition °C", "Rel Hum Definition %", 
           "Precip. Amount Definition mm", "Stn Press Definition kPa", "Hmdx Definition", "Weather Definition"]
datas = pd.DataFrame(data_list, columns=columns)

- Assign assumed weather conditions using threshold rules (this overwrites the original `Weather Definition` values)

In [140]:
datas['Weather Definition'] = datas.apply(
    lambda row: 'Sunny' if row['Temp Definition °C'] > 25 and row['Rel Hum Definition %'] < 50 else 
                'Rainy' if row['Precip. Amount Definition mm'] > 5 else 
                'Foggy' if row['Rel Hum Definition %'] > 90 and row['Temp Definition °C'] < 5 else 
                'Cloudy',
    axis=1
)
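
The chained conditional above is evaluated top to bottom, so the first matching rule wins. The sketch below restates the same thresholds as a plain function to make that precedence easier to check; `label_weather` is a hypothetical name used only here.

```python
def label_weather(temp, rel_hum, precip):
    """Return the label chosen by the threshold rules, checked in order."""
    if temp > 25 and rel_hum < 50:
        return 'Sunny'
    if precip > 5:
        return 'Rainy'
    if rel_hum > 90 and temp < 5:
        return 'Foggy'
    return 'Cloudy'

examples = [
    label_weather(30, 40, 0),     # hot and dry
    label_weather(10, 60, 6),     # more than 5 mm of precipitation
    label_weather(2, 95, 0),      # cold and very humid
    label_weather(30, 40, 10),    # both the first and second rule match
    label_weather(float('nan'), float('nan'), float('nan')),  # missing values
]
print(examples)
```

Because every comparison with `NaN` is false, rows with missing measurements fall through to the default label `'Cloudy'`.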

##### Build a machine learning model to forecast weather

- Set `X`, `y`

In [141]:
X = datas[["Temp Definition °C", "Dew Point Definition °C", "Rel Hum Definition %", "Precip. Amount Definition mm", "Stn Press Definition kPa", "Hmdx Definition"]]
y = datas["Weather Definition"]  # 1-D Series, the target shape scikit-learn expects

- Use `train_test_split` to split into training set and test set (80% training, 20% test)

In [142]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- Create a random forest classifier

In [143]:
model = RandomForestClassifier(n_estimators=100, random_state=42)

- Train model

In [144]:
model.fit(X_train, y_train)


- Prediction on test set

In [146]:
y_pred = model.predict(X_test)

- Model evaluation

In [149]:
accuracy = accuracy_score(y_test, y_pred)
# Rows are reported in sorted label order, so let scikit-learn name them from the data
report = classification_report(y_test, y_pred)

- Print result

In [150]:
print(f"Model accuracy: {accuracy * 100:.2f}%")
print("Classification report:")
print(report)

Model accuracy: 99.99%
Classification report:
              precision    recall  f1-score   support

      Cloudy       1.00      1.00      1.00     15686
       Foggy       0.99      1.00      1.00       355
       Rainy       1.00      0.96      0.98        53
       Sunny       1.00      1.00      1.00       414

    accuracy                           1.00     16508
   macro avg       1.00      0.99      0.99     16508
weighted avg       1.00      1.00      1.00     16508
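
One way to put the accuracy in context is to compare it with a majority-class baseline, that is, the accuracy of always predicting the most frequent class. The short sketch below computes it from the support column of the report above and gives a baseline of about 95%; the variable names are chosen for this example only.

```python
# Support counts copied from the classification report above
support = [15686, 355, 53, 414]

total = sum(support)
majority_baseline = max(support) / total
print(f"Majority-class baseline accuracy: {majority_baseline * 100:.2f}%")
```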



##### Write `datas` into csv file

- Export to csv file

In [151]:
datas.to_csv("test.csv")