# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/2_feature_pipeline.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Parse Data
2. Feature Group Insertion

### <span style='color:#ff5f27'> 📝 Imports</span>

In [1]:
import datetime
import time
import requests
import pandas as pd
import json

from functions import *
import features.air_quality

import warnings
warnings.filterwarnings("ignore")

In [2]:
with open('target_cities.json') as json_file:
    target_cities = json.load(json_file)

In [3]:
today = datetime.date.today()

In [4]:
today, str(today)

(datetime.date(2023, 4, 17), '2023-04-17')

---

## <span style='color:#ff5f27'> 🌫 Filling gaps in Air Quality data (PM2.5)</span>

### On the first run, the 'last update date' for each city is taken from the backfill data
#### On later runs, it can be read from the Hopsworks Feature Store through a `feature view`

In [5]:
df_air_quality = pd.read_csv("data/backfill_pm2_5.csv")
df_weather = pd.read_csv("data/backfill_weather.csv")

In [6]:
last_dates_aq = df_air_quality[["date", "city_name"]].groupby("city_name").max()
last_dates_aq.date = last_dates_aq.date.astype(str)

# here is a dictionary with city names as keys and last updated date as values
last_dates_aq = last_dates_aq.to_dict()["date"]

In [7]:
last_dates_aq["Berlin"], last_dates_aq["Columbus"]

('2023-04-17', '2023-04-17')
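
As a rough illustration of what the `groupby("city_name").max()` step above computes, here is a plain-Python sketch on made-up rows. The function name `last_date_per_city` and the sample data are hypothetical; the notebook itself uses pandas.

```python
from typing import Dict, Iterable, Tuple


def last_date_per_city(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return the latest ISO date string seen for each city."""
    latest: Dict[str, str] = {}
    for city, date in rows:
        # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings,
        # so a string comparison is enough to find the most recent one.
        if city not in latest or date > latest[city]:
            latest[city] = date
    return latest


# Made-up rows of (city_name, date), for illustration only.
rows = [
    ("Berlin", "2023-04-15"),
    ("Berlin", "2023-04-17"),
    ("Columbus", "2023-04-16"),
]
print(last_date_per_city(rows))
```

The resulting dictionary plays the same role as `last_dates_aq`: it gives the `start_date` from which each city's data needs to be refreshed.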

### <span style='color:#ff5f27'>  🧙🏼‍♂️ Parsing PM2.5 data</span>

In [8]:
start_of_cell = time.time()

df_aq_update = pd.DataFrame()

for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        df_ = get_aqi_data_from_open_meteo(city_name=city_name,
                                           coordinates=coords,
                                           start_date=last_dates_aq[city_name],
                                           end_date=str(today))
        df_aq_update = pd.concat([df_aq_update, df_]).reset_index(drop=True)
    
end_of_cell = time.time()
print("-" * 64)
print(f"Parsed new PM2.5 data for ALL locations up to {str(today)}.")
print(f"Took {round(end_of_cell - start_of_cell, 2)} sec.\n")

Processed PM2_5 for NORTH BEND - NORTH BEND WAY since 2023-04-17 till 2023-04-17.
Took 0.15 sec.

Processed PM2_5 for LAKE FOREST PARK TOWNE CENTER since 2023-04-17 till 2023-04-17.
Took 0.14 sec.

Processed PM2_5 for SEATTLE - DUWAMISH since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for SEATTLE - BEACON HILL since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for SEATTLE - SOUTH PARK #2 since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for KENT - JAMES & CENTRAL since 2023-04-17 till 2023-04-17.
Took 0.14 sec.

Processed PM2_5 for TACOMA - L STREET since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for TACOMA - ALEXANDER AVE since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for DARRINGTON - FIR ST (Darrington High School) since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for MARYSVILLE - 7TH AVE (Marysville Junior High) since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Processed PM2_5 for Se

In [15]:
# calculate the date 30 days before today
date_threshold = today - datetime.timedelta(days=30)

df_air_quality.date = (df_air_quality.date).astype(str)
# filter rows based on date threshold
df_air_quality = df_air_quality[df_air_quality['date'] > str(date_threshold)]

df_air_quality

Unnamed: 0,city_name,date,pm2_5
3729,Amsterdam,2023-03-19,15.0
3730,Amsterdam,2023-03-20,9.0
3731,Amsterdam,2023-03-21,8.0
3732,Amsterdam,2023-03-22,6.0
3733,Amsterdam,2023-03-23,8.0
...,...,...,...
158060,Tampa,2023-04-13,6.1
158061,Tampa,2023-04-14,7.4
158062,Tampa,2023-04-15,12.1
158063,Tampa,2023-04-16,12.2


In [16]:
# we need the previous data to calculate aggregation functions
df_air_quality_new = pd.concat([df_air_quality, df_aq_update]).reset_index(drop=True)
df_air_quality_new = df_air_quality_new.drop_duplicates(subset=['city_name', 'date'])
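
By default, `drop_duplicates` keeps the first occurrence of each key, so when the backfill rows come first in the `concat`, a backfill row wins over an update row for the same city and date. The sketch below mimics that behavior in plain Python; `drop_duplicate_keys` and the sample rows are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Sequence, Set, Tuple


def drop_duplicate_keys(
    rows: List[Dict[str, object]],
    key_fields: Sequence[str] = ("city_name", "date"),
) -> List[Dict[str, object]]:
    """Keep only the first row seen for each combination of key fields."""
    seen: Set[Tuple[object, ...]] = set()
    kept: List[Dict[str, object]] = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up data: the last backfill day overlaps with the first update day.
backfill = [{"city_name": "Tampa", "date": "2023-04-16", "pm2_5": 12.2}]
update = [
    {"city_name": "Tampa", "date": "2023-04-16", "pm2_5": 12.5},
    {"city_name": "Tampa", "date": "2023-04-17", "pm2_5": 9.8},
]
combined = drop_duplicate_keys(backfill + update)
print(combined)
```

If the freshly fetched value should win instead, pandas accepts `keep="last"` in `drop_duplicates`.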

### <span style="color:#ff5f27;">🛠 Feature Engineering</span>

In [17]:
df_air_quality_new['date'] = pd.to_datetime(df_air_quality_new['date'])

In [18]:
features.air_quality.shift_pm_2_5(df_air_quality_new, days=7) # add features about 7 previous PM2.5 values

features.air_quality.moving_average(df_air_quality_new, 7)
features.air_quality.moving_average(df_air_quality_new, 14)
features.air_quality.moving_average(df_air_quality_new, 28)

for i in [7, 14, 28]:
    for func in [features.air_quality.moving_std,
                 features.air_quality.exponential_moving_average,
                 features.air_quality.exponential_moving_std
                 ]:
        func(df_air_quality_new, i)
        

df_air_quality_new = df_air_quality_new.sort_values(by=["date", "pm2_5"]).dropna()
df_air_quality_new = df_air_quality_new.reset_index(drop=True)
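
The helpers in `features.air_quality` are not shown in this notebook. As a minimal sketch of what lag and moving-average features typically look like for a single city's series, assuming a trailing window that includes the current day (the names `lag` and `trailing_mean` are hypothetical):

```python
from typing import List, Optional


def lag(values: List[float], days: int) -> List[Optional[float]]:
    """Value observed `days` rows earlier; None where no history exists."""
    return [values[i - days] if i >= days else None for i in range(len(values))]


def trailing_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Mean of the current row and the `window - 1` rows before it.

    Rows without a full window of history get None.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up daily PM2.5 values for one city, oldest first.
pm2_5 = [15.0, 9.0, 8.0, 6.0, 8.0]
lag_1 = lag(pm2_5, 1)
mean_3 = trailing_mean(pm2_5, 3)
print(lag_1)
print(mean_3)
```

Rows near the start of the series have no full history and carry missing values, which is presumably why the cell above ends with `.dropna()` and why roughly a month of earlier data is kept before computing 28-day aggregates.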

In [19]:
features.air_quality.year(df_air_quality_new)
features.air_quality.day_of_month(df_air_quality_new)
features.air_quality.month(df_air_quality_new)
features.air_quality.day_of_week(df_air_quality_new)
features.air_quality.is_weekend(df_air_quality_new)
features.air_quality.sin_day_of_year(df_air_quality_new)
features.air_quality.cos_day_of_year(df_air_quality_new)
features.air_quality.sin_day_of_week(df_air_quality_new)
features.air_quality.cos_day_of_week(df_air_quality_new)
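
The sine and cosine features are most likely a cyclical encoding, which places the last day of a cycle next to the first one. The exact definitions live in `features.air_quality` and are not shown here, so the sketch below assumes the standard formula; the `cyclical` helper and the period of 365 days are assumptions.

```python
import datetime
import math
from typing import Tuple


def cyclical(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


day = datetime.date(2023, 4, 17)

# weekday() counts from Monday == 0 to Sunday == 6.
sin_dow, cos_dow = cyclical(day.weekday(), 7)

# tm_yday is the day of the year, starting at 1 for January 1st.
sin_doy, cos_doy = cyclical(day.timetuple().tm_yday, 365)

print(sin_dow, cos_dow)
print(sin_doy, cos_doy)
```

Using both components matters: a sine value alone is shared by two different days in each cycle, while the (sin, cos) pair is unique.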

In [20]:
df_air_quality_new.isna().sum().sum()

0

In [21]:
df_air_quality_new.shape

(135, 31)

In [22]:
df_air_quality_new.columns

Index(['city_name', 'date', 'pm2_5', 'pm_2_5_previous_1_day',
       'pm_2_5_previous_2_day', 'pm_2_5_previous_3_day',
       'pm_2_5_previous_4_day', 'pm_2_5_previous_5_day',
       'pm_2_5_previous_6_day', 'pm_2_5_previous_7_day', 'mean_7_days',
       'mean_14_days', 'mean_28_days', 'std_7_days', 'exp_mean_7_days',
       'exp_std_7_days', 'std_14_days', 'exp_mean_14_days', 'exp_std_14_days',
       'std_28_days', 'exp_mean_28_days', 'exp_std_28_days', 'year',
       'day_of_month', 'month', 'day_of_week', 'is_weekend', 'sin_day_of_year',
       'cos_day_of_year', 'sin_day_of_week', 'cos_day_of_week'],
      dtype='object')

---

## <span style='color:#ff5f27'> 🌦 Filling gaps in Weather data</span>

In [23]:
last_dates_weather = df_weather[["date", "city_name"]].groupby("city_name").max()
last_dates_weather.date = last_dates_weather.date.astype(str)
last_dates_weather = last_dates_weather.to_dict()["date"]

### <span style='color:#ff5f27'>  🧙🏼‍♂️ Parsing Weather data</span>

In [24]:
start_of_cell = time.time()

df_weather_update = pd.DataFrame()

for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        df_ = get_weather_data_from_open_meteo(city_name=city_name,
                                               coordinates=coords,
                                               start_date=last_dates_weather[city_name],
                                               end_date=str(today),
                                               forecast=True)
        df_weather_update = pd.concat([df_weather_update, df_]).reset_index(drop=True)
    
end_of_cell = time.time()
print("-" * 64)
print(f"Parsed new weather data for ALL cities up to {str(today)}.")
print(f"Took {round(end_of_cell - start_of_cell, 2)} sec.\n")

Parsed weather for NORTH BEND - NORTH BEND WAY since 2023-04-17 till 2023-04-17.
Took 0.14 sec.

Parsed weather for LAKE FOREST PARK TOWNE CENTER since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for SEATTLE - DUWAMISH since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for SEATTLE - BEACON HILL since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for SEATTLE - SOUTH PARK #2 since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for KENT - JAMES & CENTRAL since 2023-04-17 till 2023-04-17.
Took 0.12 sec.

Parsed weather for TACOMA - L STREET since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for TACOMA - ALEXANDER AVE since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for DARRINGTON - FIR ST (Darrington High School) since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for MARYSVILLE - 7TH AVE (Marysville Junior High) since 2023-04-17 till 2023-04-17.
Took 0.13 sec.

Parsed weather for Seattle-10th 

In [26]:
df_air_quality_new.date = pd.to_datetime(df_air_quality_new.date)
df_weather_update.date = pd.to_datetime(df_weather_update.date)

df_air_quality_new["unix_time"] = df_air_quality_new["date"].apply(convert_date_to_unix)
df_weather_update["unix_time"] = df_weather_update["date"].apply(convert_date_to_unix)
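
`convert_date_to_unix` comes from the local `functions` module and its body is not shown here. A minimal sketch of such a conversion, assuming millisecond precision and UTC midnight (both are assumptions, as is the name `date_to_unix_ms`), could look like this:

```python
import datetime


def date_to_unix_ms(date_str: str) -> int:
    """Convert a YYYY-MM-DD string to milliseconds since the Unix epoch (UTC)."""
    dt = datetime.datetime.strptime(date_str, "%Y-%m-%d")
    # Attach UTC explicitly so the result does not depend on the local timezone.
    dt = dt.replace(tzinfo=datetime.timezone.utc)
    return int(dt.timestamp() * 1000)


one_day_ms = date_to_unix_ms("1970-01-02")
print(one_day_ms)
```

An integer timestamp column like `unix_time` is convenient as an event time or as a join key between the air quality and weather data.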

In [27]:
df_air_quality_new.date = df_air_quality_new.date.astype(str)
df_weather_update.date = df_weather_update.date.astype(str)

---

## <span style="color:#ff5f27;">⬆️ Uploading new data to the Feature Store</span>

### <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [28]:
import hopsworks


project = hopsworks.login()
fs = project.get_feature_store() 

air_quality_fg = fs.get_or_create_feature_group(
    name = 'air_quality',
    version = 1
)
weather_fg = fs.get_or_create_feature_group(
    name = 'weather',
    version = 1
)

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/14502
Connected. Call `.close()` to terminate connection gracefully.


In [29]:
air_quality_fg.insert(df_air_quality_new, write_options={"wait_for_job": False})

Uploading Dataframe: 0.00% |          | Rows 0/135 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14502/jobs/named/air_quality_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f87bd562eb0>, None)

In [31]:
weather_fg.insert(df_weather_update, write_options={"wait_for_job": True})

Uploading Dataframe: 0.00% |          | Rows 0/45 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14502/jobs/named/weather_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7f87bd581c10>, None)