# Santiago Air Quality Predictions: PM2.5

<div class="alert">
<h5>Download, Explore and Preprocess Sensor Data:</h5>

The main goal of this notebook is to download data from the air quality monitoring stations located throughout Santiago, Chile. 

We will also attach GPS coordinates to each station. 
This step is essential because we intend to use the KNN algorithm, which depends on spatial information.

</div> 

In [None]:
from datetime import datetime

import itables

import src.eda_utils as eda_utils

In [None]:
# Python caches imported modules, so edits to our scripts would normally
# go unnoticed until the kernel is restarted. The autoreload extension
# reloads updated modules automatically before each cell runs.
%load_ext autoreload
%autoreload 2

<div class="alert">

First, let's visit the Chilean government's website, https://sinca.mma.gob.cl, to explore the available data.

Upon initial inspection, we observe various stations, noting that some are offline. 
These correspond to stations that no longer report data (but did so in the past).<br>
As we are interested in the most recent data, we will only work with the stations that are online.

Upon reviewing the files, we find that in addition to the offline stations, the 'Independencia' station has data only until 2022. 
For this reason, we will exclude it. 

</div> 

In [None]:
stations = {
    "Cerrillos II":     ('RM', 'D35'),
    "Cerro Navia":      ('RM', 'D18'),
    "El Bosque":        ('RM', 'D17'),
    # "Independencia":  ('RM', 'D11'),  # excluded: data only until 2022
    "La Florida":       ('RM', 'D12'),
    "Las Condes":       ('RM', 'D13'),
    "Pudahuel":         ('RM', 'D15'),
    "Puente Alto":      ('RM', 'D27'),
    "Quilicura":        ('RM', 'D30'),
    "Parque O'Higgins": ('RM', 'D14'),
    "Talagante":        ('RM', 'D28'),
}


<div class="alert">
Next, we download the data and build one dataframe per pollutant. 

The core logic lives in the `eda_utils` module. This keeps the notebook focused on the analysis and lets the same utility functions be reused in automated workflows.
</div> 


In [None]:
df25 = eda_utils.get_pollutant_df(stations, 'PM25')
df10 = eda_utils.get_pollutant_df(stations, 'PM10')

In [None]:
# Explore the data in interactive tables
itables.show(df25)
itables.show(df10)

In [None]:
# Explore missing values (N/As) per station
eda_utils.create_station_na_heatmap(df25, "PM2.5")
eda_utils.create_station_na_heatmap(df10, "PM10")

<div class="alert">
We can see that the Cerrillos II station has the highest number of NA values, concentrated in its oldest samples. 

Other stations have null values in the validated records for some periods, but preliminary or unvalidated data is available for them. 

</div> 
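
The heatmaps show the gaps graphically. As a rough numeric complement, the sketch below computes the share of missing values per station and month. The toy data and the monthly buckets are assumptions made for illustration, and this is not how `create_station_na_heatmap` is implemented.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df25: two stations, four hourly readings each.
toy = pd.DataFrame({
    "Station": ["A"] * 4 + ["B"] * 4,
    "DateTime": pd.to_datetime(
        ["2023-01-31 22:00", "2023-01-31 23:00",
         "2023-02-01 00:00", "2023-02-01 01:00"] * 2
    ),
    "Validated Records": [np.nan, np.nan, 12.0, 15.0, 8.0, np.nan, 9.0, 11.0],
})

# Share of missing values per station and calendar month
# (0.0 = complete, 1.0 = everything missing).
month = toy["DateTime"].dt.to_period("M").rename("Month")
na_share = (
    toy["Validated Records"].isna()
    .groupby([toy["Station"], month])
    .mean()
    .unstack("Month")
)
print(na_share)
# Station A, 2023-01 -> 1.0 (both readings missing)
# Station B, 2023-01 -> 0.5 (one of two readings missing)
```

The same pattern should apply to the preliminary and unvalidated columns, which makes it easier to see where one source can fill the gaps of another.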

In [None]:
# After a quick look at the heatmaps, and for simplicity, we take each pollutant
# value as the first non-null entry, in this order of preference:
# validated, then preliminary, then unvalidated records.

df25['PM2.5'] = df25[['Validated Records', 'Preliminary Records', 'Unvalidated Records']].bfill(axis=1).iloc[:, 0]
df10['PM10'] = df10[['Validated Records', 'Preliminary Records', 'Unvalidated Records']].bfill(axis=1).iloc[:, 0]
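
To make the `bfill(axis=1).iloc[:, 0]` idiom concrete, here is a minimal example on a synthetic frame (the values are made up). Back-filling along the columns pulls the first available value leftwards, so the first column ends up holding the first non-null entry of each row.

```python
import numpy as np
import pandas as pd

cols = ["Validated Records", "Preliminary Records", "Unvalidated Records"]

# Synthetic rows covering each case: validated present, only preliminary
# and unvalidated, only unvalidated, and nothing reported at all.
toy = pd.DataFrame(
    [
        [10.0, np.nan, np.nan],
        [np.nan, 20.0, 25.0],
        [np.nan, np.nan, 30.0],
        [np.nan, np.nan, np.nan],
    ],
    columns=cols,
)

# Back-fill across columns, then keep the first column.
toy["PM2.5"] = toy[cols].bfill(axis=1).iloc[:, 0]
print(toy["PM2.5"].tolist())  # [10.0, 20.0, 30.0, nan]
```

The column order in the list sets the priority, so validated records win whenever they exist, and a row stays NaN only when all three sources are empty.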

In [None]:
# Drop the source columns now that they are combined into a single value
df25 = df25.drop(columns=['Validated Records', 'Preliminary Records', 'Unvalidated Records'])
df10 = df10.drop(columns=['Validated Records', 'Preliminary Records', 'Unvalidated Records'])

In [None]:
# Now we create a single dataframe with both pollutants
df = df25.merge(df10, on=['Station','DateTime'], how='outer')
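
As a small illustration of what the outer merge does, the sketch below joins two synthetic frames whose timestamps only partly overlap. The `indicator=True` argument is added here only to show where each row came from; it is not part of the notebook's own call.

```python
import pandas as pd

# Synthetic PM2.5 and PM10 readings for one station, overlapping at 01:00 only.
pm25 = pd.DataFrame({
    "Station": ["A", "A"],
    "DateTime": pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"]),
    "PM2.5": [12.0, 14.0],
})
pm10 = pd.DataFrame({
    "Station": ["A", "A"],
    "DateTime": pd.to_datetime(["2023-01-01 01:00", "2023-01-01 02:00"]),
    "PM10": [30.0, 28.0],
})

# An outer merge keeps every (Station, DateTime) pair found in either frame;
# the pollutant missing on one side becomes NaN.
merged = pm25.merge(pm10, on=["Station", "DateTime"], how="outer", indicator=True)
print(merged)
```

The result has three rows: one with PM2.5 only, one with both pollutants, and one with PM10 only. An inner merge would have kept just the middle row.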

<div class="alert">

The data is reported with a delay: at times there is a lag of one or two weeks before new measurements become available.

The missing values at the end of the series therefore come from delayed reporting rather than sensor errors. To keep the analysis simple, we trim the affected period.

</div>
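
One possible way to trim such a trailing gap is sketched below. The function `trim_trailing_unreported` is a hypothetical name, and the rule it applies (cut everything after the last timestamp with at least one reported value) is an assumption. The actual `eda_utils.trim_unreported_data` may work differently.

```python
import numpy as np
import pandas as pd


def trim_trailing_unreported(df, value_cols, time_col="DateTime"):
    """Drop rows after the last timestamp with at least one reported value."""
    reported = df[value_cols].notna().any(axis=1)
    if not reported.any():
        return df.iloc[0:0]  # nothing was reported at all
    cutoff = df.loc[reported, time_col].max()
    return df[df[time_col] <= cutoff]


# Synthetic data: the last two days have no values for either pollutant.
toy = pd.DataFrame({
    "Station": ["A"] * 4,
    "DateTime": pd.to_datetime(
        ["2023-11-01", "2023-11-02", "2023-11-03", "2023-11-04"]
    ),
    "PM2.5": [10.0, np.nan, np.nan, np.nan],
    "PM10": [25.0, 27.0, np.nan, np.nan],
})

trimmed = trim_trailing_unreported(toy, ["PM2.5", "PM10"])
print(trimmed)
```

Only the trailing rows are removed. The missing PM2.5 value on 2023-11-02 is kept, because a gap inside the reported period may be a genuine sensor issue.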

In [None]:
df = eda_utils.trim_unreported_data(df, ['PM2.5', 'PM10'])

<div class="alert">
We will explore the nature of the data a bit further, both by station and by pollutant.
</div> 

In [None]:
eda_utils.create_histogram_plot(df, 64)

In [None]:
# Generate boxplots of pollutant values for each sensor station
eda_utils.create_boxplot(df)

<div class="alert">
We will examine the potential correlation between `PM2.5` and `PM10`.
</div> 

In [None]:
eda_utils.create_scatterplot(df) 
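
A scatterplot shows the relationship visually. To put a number on it, one option is the Pearson correlation, overall and per station. The sketch below uses synthetic values chosen to be perfectly linear, so it only demonstrates the mechanics and says nothing about the real data.

```python
import pandas as pd

# Synthetic readings: station A rises together, station B moves in opposition.
toy = pd.DataFrame({
    "Station": ["A"] * 4 + ["B"] * 4,
    "PM2.5": [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    "PM10": [25.0, 45.0, 65.0, 85.0, 80.0, 60.0, 40.0, 20.0],
})

# Pearson correlation over all rows (pairs with NaN are ignored by pandas).
overall = toy["PM2.5"].corr(toy["PM10"])

# The same measure computed separately for each station.
per_station = pd.Series(
    {name: group["PM2.5"].corr(group["PM10"])
     for name, group in toy.groupby("Station")}
)
print(overall)
print(per_station)
```

Comparing the overall figure with the per-station figures helps to spot stations where the two pollutants behave differently from the rest.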

<div class="alert">
Now, we will investigate the changes in pollutants over a specified time range.
</div> 

In [None]:
start_date = datetime(2021, 1, 1)
end_date = datetime(2023, 11, 30)
# Generate a time series plot of pollutant data for a particular station
eda_utils.create_time_series_plot(df, start_date, end_date)
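
For reference, restricting a frame to a date range and smoothing it to daily means can be sketched as follows. The toy data and the daily resampling are assumptions; the plotting helper may aggregate differently or not at all.

```python
from datetime import datetime

import pandas as pd

# Synthetic readings; the last one falls outside the chosen range.
toy = pd.DataFrame({
    "Station": ["A"] * 4,
    "DateTime": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 12:00",
         "2021-01-02 00:00", "2023-12-01 00:00"]
    ),
    "PM2.5": [10.0, 20.0, 30.0, 40.0],
})

start_date = datetime(2021, 1, 1)
end_date = datetime(2023, 11, 30)

# Keep only rows inside the range (both ends inclusive).
in_range = toy[toy["DateTime"].between(start_date, end_date)]

# Average the readings of each station per calendar day.
daily = (
    in_range.set_index("DateTime")
    .groupby("Station")["PM2.5"]
    .resample("D")
    .mean()
)
print(daily)
```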

<div class="alert">
<h5>GPS Coordinates</h5>

In this step, we add the geographic coordinates of each station to the dataset, so that every measurement carries GPS information.
</div> 
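
To illustrate why coordinates matter for a neighbour-based method such as KNN, the sketch below ranks a few points by great-circle distance to a target using the haversine formula. The coordinates are hypothetical placeholders, not real station positions, and the choice of haversine distance is an assumption rather than the method used later in the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical points (NOT real station coordinates), as (lat, lon) in degrees.
points = {
    "P1": (-33.40, -70.60),
    "P2": (-33.50, -70.70),
    "P3": (-33.45, -70.65),
}
target = (-33.44, -70.64)

# Rank the points from nearest to farthest from the target.
ranked = sorted(points, key=lambda name: haversine_km(*target, *points[name]))
print(ranked)  # ['P3', 'P1', 'P2']
```

With real coordinates attached to each station, the same ranking gives the nearest stations to any location, which is the information a KNN model needs.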

In [None]:
station_data = eda_utils.get_coordinates_df()
station_data

In [None]:
df = eda_utils.merge_gps_data(df, station_data)

# Check the result
df.head(5)
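
A merge of this kind can be sketched as a left join on the station name. The column names `Latitude` and `Longitude` and the coordinate values are assumptions for illustration, and `eda_utils.merge_gps_data` may be implemented differently.

```python
import pandas as pd

# Synthetic readings: several rows per station.
readings = pd.DataFrame({
    "Station": ["A", "A", "B"],
    "PM2.5": [10.0, 12.0, 8.0],
})

# One row per station with hypothetical coordinates.
coords = pd.DataFrame({
    "Station": ["A", "B"],
    "Latitude": [-33.40, -33.50],
    "Longitude": [-70.60, -70.70],
})

# Left join keeps every reading; validate raises an error if a station
# appears more than once in the coordinates table.
with_gps = readings.merge(coords, on="Station", how="left", validate="many_to_one")

# A station name that does not match would show up as missing coordinates.
unmatched = with_gps.loc[with_gps["Latitude"].isna(), "Station"].unique()
print(with_gps)
print(unmatched)
```

Checking for unmatched names after the join is a cheap safeguard, since a single spelling difference between the two tables silently produces NaN coordinates.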

<div class="alert">
Finally, export the processed data!
</div> 

In [None]:
eda_utils.save_interim_data(df, 'stations_data')