# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Backfill Features to the Feature Store</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/1_backfill_feature_groups.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Fetch historical data
2. Connect to the Hopsworks feature store
3. Create feature groups and insert them into the feature store

![tutorial-flow](../../images/01_featuregroups.png)

### <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
!pip install folium streamlit-folium --quiet

In [2]:
import os
import datetime
import time
import requests
import pandas as pd
import json
import folium

from functions import *  # provides convert_date_to_unix, used further down
import features.air_quality

---

## <span style='color:#ff5f27'> 🌍 Representing the Target cities </span>

In [3]:
with open('target_cities.json') as json_file:
    target_cities = json.load(json_file)

In [4]:
# Create a folium map centered on a fixed mid-Atlantic point so that both Europe and North America are in view
my_map = folium.Map(location=[42.57, -44.092], zoom_start=3)

# Add one circle marker per location, labelled with its name
for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        folium.CircleMarker(
            location=coords,
            popup=city_name
        ).add_to(my_map)

my_map

In [5]:
# # Save the map to an HTML file
# my_map.save("map_all_target_cities.html")

## <span style='color:#ff5f27'> 🌫 Processing Air Quality data</span>

### [🇪🇺 EEA](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm)
#### EEA stands for the European Environment Agency

In [6]:
target_cities["EU"]

{'Amsterdam': [52.37, 4.89],
 'Athina': [37.98, 23.73],
 'Berlin': [52.52, 13.39],
 'Gdansk': [54.37, 18.61],
 'Kraków': [50.06, 19.94],
 'London': [51.51, -0.13],
 'Madrid': [40.42, -3.7],
 'Marseille': [43.3, 5.37],
 'Milano': [45.46, 9.19],
 'München': [48.14, 11.58],
 'Napoli': [40.84, 14.25],
 'Paris': [48.85, 2.35],
 'Sevilla': [37.39, -6.0],
 'Stockholm': [59.33, 18.07],
 'Tallinn': [59.44, 24.75],
 'Varna': [43.21, 27.92],
 'Wien': [48.21, 16.37]}

In [7]:
df_eu = pd.read_csv("data/backfill_pm2_5_eu.csv")

In [8]:
df_eu.isna().sum().sum()

0

In [9]:
print("Size of this dataframe:", df_eu.shape)

df_eu.sample(3)

Size of this dataframe: (63548, 3)


| | city_name | date | pm2_5 |
|---|---|---|---|
| 48746 | Sevilla | 2023-03-01 | 12.0 |
| 37485 | München | 2023-02-27 | 11.0 |
| 31674 | Milano | 2017-07-10 | 11.0 |


### [🇺🇸 USEPA](https://aqs.epa.gov/aqsweb/documents/data_api.html#daily)
#### USEPA stands for the United States Environmental Protection Agency
[Manual download page](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)



In [10]:
target_cities["US"]

{'Albuquerque': [35.08, -106.65],
 'Atlanta': [33.75, -84.39],
 'Chicago': [41.88, -87.62],
 'Columbus': [39.96, -83.0],
 'Dallas': [32.78, -96.8],
 'Denver': [39.74, -104.98],
 'Houston': [29.76, -95.37],
 'Los Angeles': [34.05, -118.24],
 'New York': [40.71, -74.01],
 'Phoenix-Mesa': [33.66, -112.04],
 'Salt Lake City': [40.76, -111.89],
 'San Francisco': [37.78, -122.42],
 'Tampa': [27.95, -82.46]}

In [11]:
df_us = pd.read_csv("data/backfill_pm2_5_us.csv")

In [12]:
df_us.isna().sum().sum()

0

In [13]:
print("Size of this dataframe:", df_us.shape)

df_us.sample(3)

Size of this dataframe: (46037, 3)


| | date | city_name | pm2_5 |
|---|---|---|---|
| 35335 | 2013-08-04 | Salt Lake City | 3.8 |
| 36800 | 2017-08-09 | Salt Lake City | 7.6 |
| 602 | 2014-08-26 | Albuquerque | 3.8 |


### <span style="color:#ff5f27;">🏢 Processing special city - `Seattle`</span>
#### For Seattle we use data from several monitoring stations across the area.
Daily `PM2.5` data for these stations was downloaded manually from the [EPA daily data download page](https://www.epa.gov/outdoor-air-quality-data/download-daily-data).

In [14]:
target_cities["Seattle"]

{'NORTH BEND - NORTH BEND WAY': [47.49022, -121.77278],
 'LAKE FOREST PARK TOWNE CENTER': [47.755, -122.2806],
 'SEATTLE - DUWAMISH': [47.55975, -122.33827],
 'SEATTLE - BEACON HILL': [47.56824, -122.30863],
 'SEATTLE - SOUTH PARK #2': [47.53091, -122.3208],
 'KENT - JAMES & CENTRAL': [47.38611, -122.23028],
 'TACOMA - L STREET': [47.1864, -122.4517],
 'TACOMA - ALEXANDER AVE': [47.2656, -122.3858],
 'DARRINGTON - FIR ST (Darrington High School)': [48.2469, -121.6031],
 'MARYSVILLE - 7TH AVE (Marysville Junior High)': [48.05432, -122.17153],
 'Seattle-10th & Weller': [47.59722, -122.31972],
 'Bellevue-SE 12th St': [47.60086, -122.1484],
 'Tacoma-S 36th St': [47.22634, -122.46256],
 'Tukwila Allentown': [47.49854, -122.27839],
 'Tulalip-Totem Beach Rd': [48.06534, -122.28519]}

In [15]:
df_seattle = pd.read_csv("data/backfill_pm2_5_seattle.csv")

In [16]:
df_seattle.isna().sum().sum()

0

In [17]:
print("Size of this dataframe:", df_seattle.shape)

df_seattle.sample(3)

Size of this dataframe: (46479, 3)


| | city_name | date | pm2_5 |
|---|---|---|---|
| 25749 | SEATTLE - DUWAMISH | 2019-10-01 | 7.6 |
| 41303 | SEATTLE - DUWAMISH | 2022-05-18 | 6.4 |
| 15893 | Bellevue-SE 12th St | 2017-11-16 | 0.8 |


In [18]:
df_seattle.city_name.value_counts()

NORTH BEND - NORTH BEND WAY                      3705
TACOMA - L STREET                                3696
SEATTLE - BEACON HILL                            3691
MARYSVILLE - 7TH AVE (Marysville Junior High)    3648
DARRINGTON - FIR ST (Darrington High School)     3614
SEATTLE - SOUTH PARK #2                          3577
TACOMA - ALEXANDER AVE                           3569
KENT - JAMES & CENTRAL                           3556
SEATTLE - DUWAMISH                               3439
Seattle-10th & Weller                            3097
LAKE FOREST PARK TOWNE CENTER                    2999
Tacoma-S 36th St                                 2574
Bellevue-SE 12th St                              2172
Tukwila Allentown                                2074
Tulalip-Totem Beach Rd                           1068
Name: city_name, dtype: int64

### <span style="color:#ff5f27;">🌟 All together</span>

In [19]:
df_air_quality = pd.concat([df_eu, df_us, df_seattle]).reset_index(drop=True)
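
`pd.concat` aligns frames by column name rather than by position, which matters here because the US file lists `date` before `city_name` while the other two files list `city_name` first. A minimal sketch of that behaviour follows; the toy frames are made up for illustration and are not taken from the CSV files:

```python
import pandas as pd

# Two toy frames with the same columns in a different order
eu = pd.DataFrame({"city_name": ["Paris"], "date": ["2023-01-01"], "pm2_5": [9.0]})
us = pd.DataFrame({"date": ["2023-01-01"], "city_name": ["Denver"], "pm2_5": [4.2]})

# Columns are matched by name; the result keeps the column order of the first frame
combined = pd.concat([eu, us]).reset_index(drop=True)

print(list(combined.columns))        # ['city_name', 'date', 'pm2_5']
print(combined.loc[1, "city_name"])  # Denver
```

The same reasoning gives a quick sanity check on the real data: 63548 + 46037 + 46479 = 156064, which matches the shape reported for `df_air_quality`.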

In [20]:
df_air_quality.sample(5)

| | city_name | date | pm2_5 |
|---|---|---|---|
| 133255 | DARRINGTON - FIR ST (Darrington High School) | 2018-10-28 | 1.7 |
| 51034 | Stockholm | 2019-02-25 | 8.0 |
| 45736 | Sevilla | 2014-12-03 | 20.0 |
| 46175 | Sevilla | 2016-02-15 | 13.0 |
| 121584 | SEATTLE - DUWAMISH | 2016-08-25 | 9.3 |


In [21]:
df_air_quality.shape

(156064, 3)

### <span style="color:#ff5f27;">🛠 Feature Engineering</span>

In [22]:
df_air_quality['date'] = pd.to_datetime(df_air_quality['date'])

In [23]:
df_air_quality.head(2)

| | city_name | date | pm2_5 |
|---|---|---|---|
| 0 | Amsterdam | 2013-01-01 | 14.0 |
| 1 | Amsterdam | 2013-01-02 | 8.0 |


In [24]:
features.air_quality.shift_pm_2_5(df_air_quality, days=7)  # add lag features for the PM2.5 values of the previous 7 days

features.air_quality.moving_average(df_air_quality, 7)
features.air_quality.moving_average(df_air_quality, 14)
features.air_quality.moving_average(df_air_quality, 28)


for i in [7, 14, 28]:
    for func in [features.air_quality.moving_std,
                 features.air_quality.exponential_moving_average,
                 features.air_quality.exponential_moving_std
                 ]:
        func(df_air_quality, i)
        

df_air_quality = df_air_quality.sort_values(by=["date", "pm2_5"]).dropna()
df_air_quality = df_air_quality.reset_index(drop=True)
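
The helpers in `features.air_quality` are not shown in this notebook. As a rough sketch of what per-city lag and moving-average features could look like in plain pandas, consider the function below. `add_lag_and_rolling_features` is a hypothetical name, and applying `shift(1)` before the rolling window is an assumption, not the tutorial's confirmed implementation:

```python
import pandas as pd

def add_lag_and_rolling_features(df, lags=7, windows=(7, 14, 28)):
    """Hypothetical stand-in for the lag and moving-average helpers."""
    out = df.sort_values(["city_name", "date"]).copy()
    grouped = out.groupby("city_name")["pm2_5"]

    # Lag features: the PM2.5 value observed 1..lags rows earlier in the same city
    for lag in range(1, lags + 1):
        out[f"pm_2_5_previous_{lag}_day"] = grouped.shift(lag)

    # Moving averages over the preceding rows only (assumption: current day excluded)
    for window in windows:
        out[f"mean_{window}_days"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w).mean()
        )
    return out

toy = pd.DataFrame({
    "city_name": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2023-01-01", periods=5)) * 2,
    "pm2_5": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
toy_features = add_lag_and_rolling_features(toy, lags=2, windows=(3,))
print(toy_features[["city_name", "pm2_5", "pm_2_5_previous_1_day", "mean_3_days"]])
```

In this sketch each city's first rows are `NaN` until a full window of earlier days is available, which is the kind of gap the `dropna()` call above removes. For what it is worth, the row count in this notebook falls from 156064 to 154804, a difference of 1260 = 45 locations × 28 rows, which is consistent with (though does not prove) a 28-day window over previous days computed per location.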

In [25]:
features.air_quality.year(df_air_quality)
features.air_quality.day_of_month(df_air_quality)
features.air_quality.month(df_air_quality)
features.air_quality.day_of_week(df_air_quality)
features.air_quality.is_weekend(df_air_quality)
features.air_quality.sin_day_of_year(df_air_quality)
features.air_quality.cos_day_of_year(df_air_quality)
features.air_quality.sin_day_of_week(df_air_quality)
features.air_quality.cos_day_of_week(df_air_quality)
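
The sine and cosine columns encode calendar position on a circle, so that day 365 ends up next to day 1 and Sunday next to Monday. A minimal sketch of that encoding, assuming a datetime `date` column, is shown below. `add_cyclical_calendar_features` is a hypothetical helper, and dividing by 365 (ignoring leap years) is a simplifying assumption that may differ from the tutorial's own functions:

```python
import numpy as np
import pandas as pd

def add_cyclical_calendar_features(df):
    """Hypothetical sketch of cyclical calendar encodings."""
    out = df.copy()
    day_of_year = out["date"].dt.dayofyear
    day_of_week = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6

    # Map each position onto the unit circle (365-day year assumed)
    out["sin_day_of_year"] = np.sin(2 * np.pi * day_of_year / 365)
    out["cos_day_of_year"] = np.cos(2 * np.pi * day_of_year / 365)
    out["sin_day_of_week"] = np.sin(2 * np.pi * day_of_week / 7)
    out["cos_day_of_week"] = np.cos(2 * np.pi * day_of_week / 7)

    # Saturday (5) and Sunday (6) count as weekend
    out["is_weekend"] = (day_of_week >= 5).astype(int)
    return out

calendar_toy = pd.DataFrame(
    {"date": pd.to_datetime(["2023-01-02", "2023-01-07", "2023-12-31"])}
)
calendar_features = add_cyclical_calendar_features(calendar_toy)
print(calendar_features["is_weekend"].tolist())  # [0, 1, 1]
```

Using both the sine and the cosine is what makes the encoding unambiguous: either one alone maps two different days to the same value.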

In [26]:
df_air_quality.isna().sum().sum()

0

In [27]:
df_air_quality.shape

(154804, 31)

In [28]:
df_air_quality.columns

Index(['city_name', 'date', 'pm2_5', 'pm_2_5_previous_1_day',
       'pm_2_5_previous_2_day', 'pm_2_5_previous_3_day',
       'pm_2_5_previous_4_day', 'pm_2_5_previous_5_day',
       'pm_2_5_previous_6_day', 'pm_2_5_previous_7_day', 'mean_7_days',
       'mean_14_days', 'mean_28_days', 'std_7_days', 'exp_mean_7_days',
       'exp_std_7_days', 'std_14_days', 'exp_mean_14_days', 'exp_std_14_days',
       'std_28_days', 'exp_mean_28_days', 'exp_std_28_days', 'year',
       'day_of_month', 'month', 'day_of_week', 'is_weekend', 'sin_day_of_year',
       'cos_day_of_year', 'sin_day_of_week', 'cos_day_of_week'],
      dtype='object')

---

## <span style='color:#ff5f27'> 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs) </span>

In [29]:
df_weather = pd.read_csv("data/backfill_weather.csv")

In [30]:
df_weather.sample(3)

| | city_name | date | temperature_max | temperature_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | wind_speed_max | wind_gusts_max | wind_direction_dominant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 31007 | Milano | 2015-10-21 | 14.7 | 3.9 | 0.0 | 0.0 | 0.0 | 0.0 | 7.2 | 16.6 | 25 |
| 87181 | Houston | 2015-09-05 | 32.9 | 24.5 | 5.3 | 5.3 | 0.0 | 4.0 | 11.1 | 29.2 | 176 |
| 163502 | Seattle - Tukwila Allentown | 2018-10-30 | 12.8 | 7.3 | 1.8 | 1.8 | 0.0 | 9.0 | 10.5 | 28.8 | 172 |


---

In [31]:
df_air_quality.date = pd.to_datetime(df_air_quality.date)
df_weather.date = pd.to_datetime(df_weather.date)

df_air_quality["unix_time"] = df_air_quality["date"].apply(convert_date_to_unix)
df_weather["unix_time"] = df_weather["date"].apply(convert_date_to_unix)
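
`convert_date_to_unix` is imported from `functions` and its body is not shown here. A minimal sketch of one way to turn a date into an integer Unix timestamp is given below. `date_to_unix_ms` is a hypothetical name, and both the millisecond unit and the UTC interpretation of naive dates are assumptions that may not match the real helper:

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")
ONE_MS = pd.Timedelta(milliseconds=1)

def date_to_unix_ms(date):
    """Hypothetical: milliseconds since the epoch, treating naive dates as UTC."""
    return (pd.Timestamp(date) - EPOCH) // ONE_MS

print(date_to_unix_ms("1970-01-02"))  # 86400000
print(date_to_unix_ms("2023-01-01"))  # 1672531200000

# The same arithmetic works on a whole datetime column without .apply
dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))
unix_ms = (dates - EPOCH) // ONE_MS
print(unix_ms.tolist())  # [1672531200000, 1672617600000]
```

An integer timestamp like this gives both feature groups a key column with identical values for the same calendar day.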

In [44]:
df_air_quality.date = df_air_quality.date.astype(str)
df_weather.date = df_weather.date.astype(str)

### <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [33]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/14502
Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

### <span style='color:#ff5f27'> 🌫 Air Quality Data </span>

In [39]:
air_quality_fg = fs.get_or_create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    online_enabled=False,
    # partition_key=["city_name"],
    event_time='unix_time'  # event_time takes a single column name
)    
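
A primary key is meant to identify each row uniquely, so it can be worth checking the DataFrame for duplicate (`unix_time`, `city_name`) pairs before inserting. The snippet below is an optional pre-flight check in plain pandas, not part of the Hopsworks API; `find_duplicate_keys` is a hypothetical helper and the toy rows are made up:

```python
import pandas as pd

def find_duplicate_keys(df, key_columns):
    """Return every row whose key combination appears more than once."""
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask]

key_toy = pd.DataFrame({
    "unix_time": [1672531200000, 1672531200000, 1672617600000, 1672531200000],
    "city_name": ["Paris", "Denver", "Paris", "Paris"],
    "pm2_5": [9.0, 4.2, 11.0, 9.5],
})

duplicates = find_duplicate_keys(key_toy, ["unix_time", "city_name"])
print(len(duplicates))  # 2 (the two Paris rows sharing a timestamp)
```

On the real data, `find_duplicate_keys(df_air_quality, ["unix_time", "city_name"])` returning an empty frame would indicate that the chosen key is unique.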

In [40]:
air_quality_fg.insert(df_air_quality, write_options={"wait_for_job": False})

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/14502/fs/14422/fg/36275


Uploading Dataframe: 0.00% |          | Rows 0/154804 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14502/jobs/named/air_quality_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fdc5b1fc7c0>, None)

### <span style='color:#ff5f27'> 🌦 Weather Data </span>

In [42]:
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    description='Weather characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    online_enabled=False,
    # partition_key=["city_name"],
    event_time='unix_time'  # event_time takes a single column name
) 

In [43]:
weather_fg.insert(df_weather, write_options={"wait_for_job": True})

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/14502/fs/14422/fg/36277


Uploading Dataframe: 0.00% |          | Rows 0/168975 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/14502/jobs/named/weather_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fdc6abc2b50>, None)