# Santiago Air Quality Predictions: PM 2.5

<div class="alert">
<h5>Data Prediction for Areas Adjacent to Monitoring Stations:</h5>

In this third notebook, we aim to predict air quality (PM 2.5) for areas near monitoring stations. We will perform data imputation for these neighboring regions and visualize the results on a map of the Region Metropolitana de Santiago.

We also create an animation to explore how air quality in that area evolves over time.
</div>

In [None]:
import os
import warnings
from datetime import timedelta

import folium
from folium import plugins
import pandas as pd

import src.predict_utils as predict_utils

In [None]:
# Jupyter caches imported modules. The autoreload extension re-imports
# modules that have changed on disk before each cell runs,
# so there is no need to reload them by hand.
%load_ext autoreload
%autoreload 2

In [None]:
# Suppress all warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

In [None]:
# Load the dataset with missing values filled in.
destination_path = os.path.join('./data/interim/', 'stations_data_with_imputed_values.feather')

full_dataset = pd.read_feather(destination_path)
full_dataset['DateTime'] = pd.to_datetime(full_dataset['DateTime'], dayfirst=True)

# Quick sanity check: show the last five rows.
full_dataset.tail(5)

<div class="alert">
In the previous notebook, we selected KNN as the method for imputing our values.<p>

We now need to choose the number of neighbors, k.
A larger k can improve predictions, but it also increases the computational cost.<p>

</div>
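
To make the trade-off concrete, here is a minimal sketch of how an MAE could be estimated for each k. It is not the implementation behind `predict_utils.calculate_mae_for_k`, which is not shown in this notebook. It assumes a leave-one-station-out scheme, inverse-distance weighting, and plain Euclidean distance on latitude/longitude; the function names and the sample readings are hypothetical.

```python
import math


def knn_predict(known, query, k):
    """Inverse-distance-weighted mean of the k nearest known points.

    known: list of (lat, lon, value) tuples; query: (lat, lon).
    """
    nearest = sorted(
        (math.hypot(lat - query[0], lon - query[1]), value)
        for lat, lon, value in known
    )[:k]
    # An exact coordinate match returns that point's value directly
    # (this also avoids a division by zero below).
    if nearest[0][0] == 0:
        return nearest[0][1]
    weights = [1.0 / dist for dist, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


def leave_one_out_mae(stations, k):
    """Hide each station in turn, predict it from the others, average |error|."""
    errors = []
    for i, (lat, lon, value) in enumerate(stations):
        others = stations[:i] + stations[i + 1:]
        errors.append(abs(knn_predict(others, (lat, lon), k) - value))
    return sum(errors) / len(errors)


# Hypothetical readings for one hour: (lat, lon, PM2.5 in ug/m3).
stations = [
    (-33.45, -70.66, 30.0),
    (-33.42, -70.60, 26.0),
    (-33.50, -70.70, 35.0),
    (-33.38, -70.55, 22.0),
    (-33.55, -70.62, 33.0),
]

for k in range(1, 5):
    print(f"k = {k}, MAE = {leave_one_out_mae(stations, k):.2f}")
```

Each extra neighbor adds work to every prediction, which is why the MAE gain has to justify a larger k.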


In [None]:
# Make an estimate of mean absolute error (MAE) for a range of k values.
kmin = 1
kmax = 10

# For this evaluation we will use a subset of the data to speed up the process...
for kneighbors in range(kmin, kmax + 1):
    mae = predict_utils.calculate_mae_for_k(full_dataset, k=kneighbors, target_pollutant="PM2.5")
    print(f'k = {kneighbors}, MAE = {mae}')

<div class="alert">
The MAE values suggest that k = 3 or k = 4 is a good choice: larger values of k improve the error only marginally.

We have decided to proceed with k=4.
</div>


In [None]:
k = 4
target = 'PM2.5'

# Select the last 24 hours of data: start 24 h before the latest timestamp and end at the latest timestamp.
start_date = full_dataset['DateTime'].max() - timedelta(hours=24)
end_date = start_date + timedelta(hours=24)

predict_utils.create_heat_map_with_date_range(full_dataset, start_date, end_date, k, target)

<div class="alert">
After examining the heatmap, we proceed to generate an animation that visually illustrates the evolution of air quality.<p>

Since we are using the whole dataset, this could take some time...
</div>
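
The animation predicts values on a regular grid of points around the stations. The sketch below shows one way such a grid could be built. It assumes that `n_points_grid` is the number of points per side, which this notebook does not confirm. The helper `make_grid` and the bounding box are hypothetical and are not part of `predict_utils`.

```python
def make_grid(lat_min, lat_max, lon_min, lon_max, n_points):
    """Return n_points * n_points evenly spaced (lat, lon) pairs, edges included."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    lat_step = (lat_max - lat_min) / (n_points - 1)
    lon_step = (lon_max - lon_min) / (n_points - 1)
    return [
        (lat_min + i * lat_step, lon_min + j * lon_step)
        for i in range(n_points)
        for j in range(n_points)
    ]


# Hypothetical bounding box roughly around Santiago.
grid = make_grid(-33.65, -33.30, -70.85, -70.45, n_points=4)
print(f"{len(grid)} grid points, first: {grid[0]}")
```

Under this assumption the number of predictions grows with the square of the grid size, so a finer grid quickly becomes more expensive for every time step in the animation.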


In [None]:
# Choose parameters for the animation
k = 4
n_points_grid = 128

# Filter a date range: in this case, the last 12 hours.
delta_range = timedelta(hours=12)
start_date = full_dataset['DateTime'].max() - delta_range
end_date = start_date + delta_range

# Create the features for the animation (these are the shapes that will appear on the map)
features = predict_utils.create_animation_features(full_dataset, start_date, end_date, k, n_points_grid, target)
print('Features for the animation created successfully! Run the next cell to see the result!')

<div class="alert">
Finally, we render the animation for the areas near the monitoring stations.
</div>
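
For reference, `TimestampedGeoJson` expects a list of GeoJSON features that carry a timestamp in their properties. The sketch below builds a few such features by hand. It is an illustration only: the features returned by `predict_utils.create_animation_features` may use different geometries and properties. The helper names, the sample values, and the colour thresholds are hypothetical, and the thresholds do not follow any official air-quality standard.

```python
from datetime import datetime, timedelta


def color_for_pm25(value):
    """Map a PM2.5 value to a colour (arbitrary, illustrative thresholds)."""
    if value < 25:
        return "green"
    if value < 50:
        return "orange"
    return "red"


def make_feature(lat, lon, when, value):
    """Build one timestamped GeoJSON point feature."""
    return {
        "type": "Feature",
        # GeoJSON coordinates are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "time": when.isoformat(),
            "popup": f"PM2.5: {value:.1f}",
            "icon": "circle",
            "iconstyle": {
                "fillColor": color_for_pm25(value),
                "fillOpacity": 0.6,
                "stroke": False,
                "radius": 8,
            },
        },
    }


# Hypothetical hourly values for a single location.
first_hour = datetime(2024, 1, 1, 0, 0)
sample_values = [18.0, 32.5, 57.0]
sample_features = [
    make_feature(-33.45694, -70.64827, first_hour + timedelta(hours=i), value)
    for i, value in enumerate(sample_values)
]
print(sample_features[0]["properties"]["time"])
```

With one feature per hour, a `period` of `"PT1H"` advances the animation one feature at a time.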


In [None]:
scl_location = (-33.45694, -70.64827)

# Create the map animation using the folium library
map_animation = folium.Map(location=scl_location, zoom_start=11)
# Add the features to the animation
plugins.TimestampedGeoJson(
    {"type": "FeatureCollection", "features": features},
    period="PT1H",
    duration="PT1H",
    add_last_point=True
).add_to(map_animation)

# Run the animation
map_animation