
# 01 Data Ingestion & Exploratory Data Analysis

This notebook programmatically downloads daily weather data for Toronto (station ID **31688**, reported in the data as `TORONTO CITY`) from Environment and Climate Change Canada's Historical Climate Data service.

We fetch data for a specified range of years using the Government of Canada's [bulk data API](https://climate.weather.gc.ca/climate_data/bulk_data_e.html). The resulting dataset includes daily measurements such as maximum/minimum/mean temperature, precipitation, snow on ground and wind gusts.
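The request for one year of daily data can be sketched as a query string built from named parameters. This is a minimal sketch that only constructs the URL and makes no network call. The helper name `build_bulk_url` is hypothetical; the parameter names and values mirror the ones used in the download cell of this notebook, where `timeframe=2` selects daily data.

```python
from urllib.parse import urlencode

BASE_URL = 'https://climate.weather.gc.ca/climate_data/bulk_data_e.html'

def build_bulk_url(station_id, year):
    """Return the bulk-download URL for one year of daily data (hypothetical helper)."""
    params = {
        'format': 'csv',
        'stationID': station_id,
        'Year': year,
        'Month': 1,
        'Day': 1,
        'timeframe': 2,             # daily data, as in the download cell
        'submit': 'Download Data',  # urlencode turns the space into '+'
    }
    return f'{BASE_URL}?{urlencode(params)}'

url_2019 = build_bulk_url(31688, 2019)
print(url_2019)
```

Building the query from a dictionary keeps each parameter visible and leaves the escaping to `urlencode`, which is easier to audit than a hand-assembled string.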

After downloading the dataset, we run an exploratory data analysis (EDA) to check the structure, completeness and basic statistics of each field. The EDA writes a data audit report that summarises the dataset shape, column types, missing-value counts and descriptive statistics for the numeric columns.

The raw data is saved into the `data/raw/` directory and the data audit report is saved to `reports/data_audit_report.md` for future reference.


In [1]:

import io
import os

import pandas as pd
import requests

# Ensure directories exist
os.makedirs('data/raw', exist_ok=True)
os.makedirs('reports', exist_ok=True)


In [2]:

# Function to download daily weather data
def download_daily_weather(station_id, years):
    """Download one CSV of daily data per year and return them as a single DataFrame."""
    dfs = []
    for year in years:
        # timeframe=2 requests daily data; one request is made per year
        url = (
            'https://climate.weather.gc.ca/climate_data/bulk_data_e.html?'
            f'format=csv&stationID={station_id}&Year={year}&Month=1&Day=1&'
            'timeframe=2&submit=Download+Data'
        )
        print(f'Downloading {year} data from {url}')
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        df = pd.read_csv(io.BytesIO(resp.content))
        df['DownloadYear'] = year  # record which request each row came from
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

# Parameters
station_id = 31688
years_to_fetch = [2019, 2020]

# Download data
weather_df = download_daily_weather(station_id, years_to_fetch)

# Save raw data
raw_path = 'data/raw/toronto_daily_weather_2019_2020.csv'
weather_df.to_csv(raw_path, index=False)
print(f'Saved raw data to {raw_path}')

# Display head
weather_df.head()


Downloading 2019 data from https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=31688&Year=2019&Month=1&Day=1&timeframe=2&submit=Download+Data


Downloading 2020 data from https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=31688&Year=2020&Month=1&Day=1&timeframe=2&submit=Download+Data


Saved raw data to data/raw/toronto_daily_weather_2019_2020.csv




|  | Longitude (x) | Latitude (y) | Station Name | Climate ID | Date/Time | Year | Month | Day | Data Quality | Max Temp (°C) | ... | Total Snow Flag | Total Precip (mm) | Total Precip Flag | Snow on Grnd (cm) | Snow on Grnd Flag | Dir of Max Gust (10s deg) | Dir of Max Gust Flag | Spd of Max Gust (km/h) | Spd of Max Gust Flag | DownloadYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -79.4 | 43.67 | TORONTO CITY | 6158355 | 2019-01-01 | 2019 | 1 | 1 |  | 6.3 | ... |  | 0.4 |  | 0.0 |  |  | M |  | M | 2019 |
| 1 | -79.4 | 43.67 | TORONTO CITY | 6158355 | 2019-01-02 | 2019 | 1 | 2 |  | 0.6 | ... |  | 2.9 |  | 0.0 |  |  | M |  | M | 2019 |
| 2 | -79.4 | 43.67 | TORONTO CITY | 6158355 | 2019-01-03 | 2019 | 1 | 3 |  | 1.5 | ... |  | 0.0 |  | 5.0 |  |  | M |  | M | 2019 |
| 3 | -79.4 | 43.67 | TORONTO CITY | 6158355 | 2019-01-04 | 2019 | 1 | 4 |  | 7.8 | ... |  | 0.0 |  | 4.0 |  |  | M |  | M | 2019 |
| 4 | -79.4 | 43.67 | TORONTO CITY | 6158355 | 2019-01-05 | 2019 | 1 | 5 |  | 3.9 | ... |  | 0.0 |  | 1.0 |  |  | M |  | M | 2019 |
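One completeness check that follows from the preview above is confirming that the calendar has no gaps. The sketch below runs on a small hand-made frame rather than the downloaded data, so its values are illustrative only. The helper name `find_missing_dates` is hypothetical, and it assumes the `Date/Time` column holds ISO-formatted dates as in the preview.

```python
import pandas as pd

def find_missing_dates(df, date_col='Date/Time'):
    """Return the calendar days between the first and last date that have no row."""
    dates = pd.to_datetime(df[date_col])
    # Every day the range should contain, from the earliest to the latest date seen
    expected = pd.date_range(dates.min(), dates.max(), freq='D')
    return expected.difference(pd.DatetimeIndex(dates))

# Illustrative frame with 2019-01-03 deliberately left out
sample = pd.DataFrame({'Date/Time': ['2019-01-01', '2019-01-02', '2019-01-04']})
missing_dates = find_missing_dates(sample)
print(missing_dates)
```

On the downloaded data the call would be `find_missing_dates(weather_df)`; an empty result means every day between the first and last date has a row.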


In [3]:

# Data audit function
def generate_data_audit_report(df):
    report_lines = []
    report_lines.append('# Data Audit Report\n')
    # Shape
    report_lines.append('## Dataset Shape\n')
    report_lines.append(f'Total rows: {len(df)}\n')
    report_lines.append(f'Total columns: {df.shape[1]}\n')
    # Column types
    report_lines.append('## Column Types\n')
    for col, dtype in df.dtypes.items():
        report_lines.append(f'- **{col}**: {dtype}')
    report_lines.append('')
    # Missing values
    report_lines.append('## Missing Values\n')
    missing = df.isnull().sum()
    for col, count in missing.items():
        pct = count / len(df) * 100
        report_lines.append(f'- **{col}**: {count} missing ({pct:.2f}%)')
    report_lines.append('')
    # Descriptive statistics
    report_lines.append('## Descriptive Statistics (Numeric Columns)\n')
    numeric_df = df.select_dtypes(include=['number'])
    desc = numeric_df.describe().transpose()
    report_lines.append(desc.to_string())
    return '\n'.join(report_lines)

# Generate report and save
report_md = generate_data_audit_report(weather_df)
report_path = 'reports/data_audit_report.md'
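# Column names contain non-ASCII characters such as '°', so set the encoding explicitly
with open(report_path, 'w', encoding='utf-8') as f: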
    f.write(report_md)
print(f'Wrote data audit report to {report_path}')

# Show DataFrame info
weather_df.info()


Wrote data audit report to reports/data_audit_report.md
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 32 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Longitude (x)              731 non-null    float64
 1   Latitude (y)               731 non-null    float64
 2   Station Name               731 non-null    object 
 3   Climate ID                 731 non-null    int64  
 4   Date/Time                  731 non-null    object 
 5   Year                       731 non-null    int64  
 6   Month                      731 non-null    int64  
 7   Day                        731 non-null    int64  
 8   Data Quality               0 non-null      float64
 9   Max Temp (°C)              724 non-null    float64
 10  Max Temp Flag              7 non-null      object 
 11  Min Temp (°C)              724 non-null    float64
 12  Min Temp Flag              7 non-null      object 