<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Daily Feature Pipeline for Bike Availability (Bicing) and Weather (Open-Meteo)</span>

## 🗒️ This notebook is divided into the following sections:
1. Download and Parse Data
2. Feature Group Insertion


__This notebook should be scheduled to run daily__

In the book, we use a GitHub Action stored here:
[.github/workflows/air-quality-daily.yml](https://github.com/featurestorebook/mlfs-book/blob/main/.github/workflows/air-quality-daily.yml)

However, you are free to use any Python orchestration tool to schedule this program to run daily.
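If no orchestrator is at hand, the daily cadence can be approximated with the standard library. The sketch below only computes how long to wait until the next run; the helper name `seconds_until_next_run` and the 06:00 run time are illustrative assumptions, and this is not a substitute for the GitHub Action mentioned above.

```python
import datetime


def seconds_until_next_run(now: datetime.datetime, run_at: datetime.time) -> float:
    """Return the seconds from `now` until the next occurrence of `run_at`."""
    candidate = datetime.datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so use the same time tomorrow.
        candidate += datetime.timedelta(days=1)
    return (candidate - now).total_seconds()


# Hypothetical run time of 06:00, evaluated at 13:27 on 2025-01-03.
now = datetime.datetime(2025, 1, 3, 13, 27)
print(seconds_until_next_run(now, datetime.time(6, 0)))
```

A long-running process could sleep for the returned number of seconds and then execute the pipeline, although a proper scheduler also provides retries and logs.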

### <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
import datetime
import time
import requests
import pandas as pd
import hopsworks
from functions import util
import json
import os
import warnings
import holidays
warnings.filterwarnings("ignore")


## <span style='color:#ff5f27'> 🌍 Get the Station ID, City, and Coordinates from Hopsworks </span>

__Update the values in the cell below.__

__These should be the same values as in notebook 1 - the feature backfill notebook__


In [25]:
# If you haven't set the env variable 'HOPSWORKS_API_KEY', then uncomment the next line and enter your API key
# os.environ["HOPSWORKS_API_KEY"] = ""
# Fall back to a local key file when the environment variable is missing or empty
if not os.environ.get("HOPSWORKS_API_KEY"):
    with open('../../data/hopsworks-api-key.txt', 'r') as file:
        os.environ["HOPSWORKS_API_KEY"] = file.read().rstrip()

project = hopsworks.login(project="juls_first_project")
fs = project.get_feature_store() 
secrets = hopsworks.get_secrets_api()


BICING_API_KEY = secrets.get_secret("BICING_API_KEY").value
station_str = secrets.get_secret("STATION_PARAMS_JSON").value
station = json.loads(station_str)

city = station['city']
station_id = station['station_id']
bikes_url = station['bikes_url']
latitude = station['latitude']
longitude = station['longitude']

today = datetime.date.today()

station_str

2025-01-03 13:27:04,180 INFO: Closing external client and cleaning up certificates.
Connection closed.
2025-01-03 13:27:04,189 INFO: Initializing external client
2025-01-03 13:27:04,190 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-03 13:27:05,315 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1164440


'{"city": "Barcelona", "station_id": 42, "bikes_url": "https://opendata-ajuntament.barcelona.cat/data/dataset/estat-estacions-bicing/resource/1b215493-9e63-4a12-8980-2d7e0fa19f85/download/recurs.json", "latitude": "41.404511", "longitude": "2.189881"}'
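Note that the secret stores `latitude` and `longitude` as strings. Whether `util.get_hourly_weather_forecast` needs numeric values is not shown in this notebook, so the following is only a sketch of how the parsed parameters could be normalized; `parse_station` is a hypothetical helper, and `bikes_url` is left out of the sample string for brevity.

```python
import json

# Same shape as the STATION_PARAMS_JSON secret printed above (bikes_url omitted).
station_str = (
    '{"city": "Barcelona", "station_id": 42, '
    '"latitude": "41.404511", "longitude": "2.189881"}'
)


def parse_station(raw: str) -> dict:
    """Parse the station JSON and convert the coordinates to floats."""
    station = json.loads(raw)
    station["latitude"] = float(station["latitude"])
    station["longitude"] = float(station["longitude"])
    return station


station = parse_station(station_str)
print(station["city"], station["latitude"], station["longitude"])
```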

### <span style="color:#ff5f27;"> 🔮 Get references to the Feature Groups </span>

In [26]:
# Retrieve feature groups
bikes_fg = fs.get_feature_group(
    name='bikes',
    version=2,
)
weather_fg = fs.get_feature_group(
    name='weather',
    version=2,
)

---

## <span style='color:#ff5f27'> 🚲 Retrieve Today's Bike Availability Data from the Bicing API</span>


In [27]:

bike_today_df = util.fetch_station_data(bikes_url, BICING_API_KEY, station_id)


bike_today_df


Unnamed: 0,station_id,num_bikes_available,last_reported
0,42,3,2025-01-03 12:25:57+00:00


In [28]:
# Keep only the columns needed for the feature group (copy to avoid modifying a view)
bike_today_df = bike_today_df[['last_reported', 'num_bikes_available']].copy()

# Create column 'day' with the date
bike_today_df['day'] = bike_today_df['last_reported'].dt.strftime('%Y-%m-%d')
# Create column 'time' with the hour (a zero-padded string for now)
bike_today_df['time'] = bike_today_df['last_reported'].dt.strftime('%H')
bike_today_df = bike_today_df.rename(columns={"last_reported": "date"})

# Rebuild 'date' as a timezone-naive datetime truncated to the hour, from 'day' and 'time'
bike_today_df['date'] = bike_today_df['day'] + ' ' + bike_today_df['time'] + ':00:00'
bike_today_df['date'] = pd.to_datetime(bike_today_df['date'], format='%Y-%m-%d %H:%M:%S')

# Add a boolean column: True on Saturday (5) and Sunday (6)
bike_today_df['is_weekend'] = bike_today_df['date'].dt.dayofweek > 4
# Add a boolean column: True if the date is a public holiday in Spain
holidays_es = holidays.Spain()
bike_today_df['is_holiday'] = bike_today_df['date'].dt.date.astype(str).map(lambda x: x in holidays_es)

# Cast num_bikes_available to float (stored as double)
bike_today_df['num_bikes_available'] = bike_today_df['num_bikes_available'].astype(float)
# Cast time to int
bike_today_df['time'] = bike_today_df['time'].astype(int)


bike_today_df

Unnamed: 0,date,num_bikes_available,day,time,is_weekend,is_holiday
0,2025-01-03 12:00:00,3.0,2025-01-03,12,False,False
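The string round-trip above (format to text, concatenate, parse again) can also be expressed with pandas datetime accessors. The sketch below is an alternative under assumptions, not the notebook's own method: `add_time_features` is a hypothetical helper, the sample row mimics the output of `util.fetch_station_data`, and the `is_holiday` column is left out to keep the example free of extra dependencies.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive hourly 'date', 'day', 'time' and 'is_weekend' from 'last_reported'."""
    out = df[["last_reported", "num_bikes_available"]].copy()
    # Truncate to the hour and drop the timezone to get a naive datetime column.
    out["date"] = out["last_reported"].dt.floor("h").dt.tz_localize(None)
    out["day"] = out["date"].dt.strftime("%Y-%m-%d")
    # Cast explicitly so the dtype is int64 regardless of the pandas version.
    out["time"] = out["date"].dt.hour.astype("int64")
    # dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
    out["is_weekend"] = out["date"].dt.dayofweek > 4
    out["num_bikes_available"] = out["num_bikes_available"].astype(float)
    return out.drop(columns="last_reported")


# Sample row shaped like the station data shown earlier.
sample = pd.DataFrame(
    {
        "station_id": [42],
        "num_bikes_available": [3],
        "last_reported": pd.to_datetime(["2025-01-03 12:25:57+00:00"]),
    }
)
print(add_time_features(sample))
```

Working on datetime values directly avoids format-string mismatches, at the cost of having to handle the timezone explicitly.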


In [29]:
bike_today_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 1 non-null      datetime64[ns]
 1   num_bikes_available  1 non-null      float64       
 2   day                  1 non-null      object        
 3   time                 1 non-null      int64         
 4   is_weekend           1 non-null      bool          
 5   is_holiday           1 non-null      bool          
dtypes: bool(2), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 166.0+ bytes


## <span style='color:#ff5f27'> 🌦 Get Weather Forecast data</span>

In [30]:
hourly_df = util.get_hourly_weather_forecast(city, latitude, longitude)

# Create column 'day' with the date
hourly_df['day'] = hourly_df['date'].dt.strftime('%Y-%m-%d')
# Create column 'time' with the hour
hourly_df['time'] = hourly_df['date'].dt.strftime('%H')
hourly_df['city'] = city
# cast time to int
hourly_df['time'] = hourly_df['time'].astype(int)
hourly_df

Coordinates 41.5°N 2.25°E
Elevation 13.0 m asl
Timezone None None
Timezone difference to GMT+0 0 s


Unnamed: 0,date,temperature,precipitation,day,time,city
0,2025-01-03 00:00:00,5.55,0.0,2025-01-03,0,Barcelona
1,2025-01-03 01:00:00,5.95,0.0,2025-01-03,1,Barcelona
2,2025-01-03 02:00:00,6.55,0.0,2025-01-03,2,Barcelona
3,2025-01-03 03:00:00,6.85,0.0,2025-01-03,3,Barcelona
4,2025-01-03 04:00:00,6.50,0.0,2025-01-03,4,Barcelona
...,...,...,...,...,...,...
235,2025-01-12 19:00:00,6.30,0.0,2025-01-12,19,Barcelona
236,2025-01-12 20:00:00,5.60,0.0,2025-01-12,20,Barcelona
237,2025-01-12 21:00:00,4.95,0.0,2025-01-12,21,Barcelona
238,2025-01-12 22:00:00,4.25,0.0,2025-01-12,22,Barcelona


In [31]:
hourly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           240 non-null    datetime64[ns]
 1   temperature    240 non-null    float32       
 2   precipitation  240 non-null    float32       
 3   day            240 non-null    object        
 4   time           240 non-null    int64         
 5   city           240 non-null    object        
dtypes: datetime64[ns](1), float32(2), int64(1), object(2)
memory usage: 9.5+ KB


## <span style="color:#ff5f27;">⬆️ Uploading new data to the Feature Store</span>

In [32]:
# Insert new data
bikes_fg.insert(bike_today_df)

Uploading Dataframe: 100.00% |██████████| Rows 1/1 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: bikes_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1164440/jobs/named/bikes_2_offline_fg_materialization/executions


(Job('bikes_2_offline_fg_materialization', 'SPARK'), None)

In [33]:
# Insert new data
weather_fg.insert(hourly_df)

Uploading Dataframe: 100.00% |██████████| Rows 240/240 | Elapsed Time: 00:01 | Remaining Time: 00:00


(Job('weather_2_offline_fg_materialization', 'SPARK'), None)
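Because this pipeline runs unattended every day, a light check before calling `insert` can catch missing columns, nulls, or duplicated keys early. The sketch below is one possible approach: `check_before_insert` is a hypothetical helper, and the key columns `["city", "date"]` are an assumption, since the feature groups' primary keys are defined in the backfill notebook and not shown here.

```python
import pandas as pd


def check_before_insert(df: pd.DataFrame, required: list, keys: list) -> list:
    """Return a list of problems found; an empty list means the frame looks OK.

    `keys` is assumed to be a subset of `required`.
    """
    problems = []
    missing = [c for c in required if c not in df.columns]
    if missing:
        # Without the required columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]
    null_cols = [c for c in required if df[c].isna().any()]
    if null_cols:
        problems.append(f"null values in: {null_cols}")
    if df.duplicated(subset=keys).any():
        problems.append(f"duplicate rows for key {keys}")
    return problems


# Two rows shaped like the hourly weather forecast shown earlier.
weather = pd.DataFrame(
    {
        "date": pd.to_datetime(["2025-01-03 00:00:00", "2025-01-03 01:00:00"]),
        "temperature": [5.55, 5.95],
        "precipitation": [0.0, 0.0],
        "city": ["Barcelona", "Barcelona"],
    }
)
required_cols = ["date", "temperature", "precipitation", "city"]
print(check_before_insert(weather, required_cols, ["city", "date"]))
```

A scheduled run could raise an exception when the returned list is non-empty, so that the job fails visibly instead of writing bad rows.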

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03: Training Pipeline </span>

In the following notebook, you will read from a feature group and create a training dataset within the feature store.
