# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

**Note**: This tutorial does not support Google Colab.

## 🗒️ This notebook is divided into the following sections:

1. Fetch historical data.
2. Connect to the Hopsworks feature store.
3. Create feature groups and insert the data into the feature store.

![tutorial-flow](../../images/01_featuregroups.png)

## <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
!pip install -U hopsworks --quiet
!pip install geopy folium streamlit-folium --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.11.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.25.3 which is incompatible.
tensorboard 2.11.2 requires protobuf<4,>=3.9.2, but you have protobuf 4.25.3 which is incompatible.
ray 2.0.0 requires protobuf<4.0.0,>=3.15.3, but you have protobuf 4.25.3 which is incompatible.

In [2]:
import json

import pandas as pd
import folium

from features import air_quality
from functions.common_functions import convert_date_to_unix

import warnings
warnings.filterwarnings("ignore")

## <span style='color:#ff5f27'> 🌍 Representing the Target cities </span>

In [3]:
# Open the 'target_cities.json' file in read mode
with open('target_cities.json') as json_file:
    # Load the JSON data from the file into a Python dictionary
    target_cities = json.load(json_file)

# Now, 'target_cities' contains the data from the JSON file
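The later cells index `target_cities` by region key (`"EU"`, `"US"`, `"Seattle"`) and expect each entry to map a location name to a `[latitude, longitude]` pair. A minimal sketch of that layout with a basic sanity check follows. The three sample entries are copied from the outputs in this notebook, and the `validate_cities` helper is hypothetical, not part of the tutorial code.

```python
import json

# Illustrative sample mirroring the layout of 'target_cities.json':
# region -> {location name -> [latitude, longitude]}
sample_json = """
{
  "EU": {"Amsterdam": [52.37, 4.89]},
  "US": {"Tampa": [27.95, -82.46]},
  "Seattle": {"SEATTLE - BEACON HILL": [47.56824, -122.30863]}
}
"""

def validate_cities(cities):
    """Return the number of locations; raise ValueError on malformed coordinates."""
    count = 0
    for region, locations in cities.items():
        for name, coords in locations.items():
            if len(coords) != 2:
                raise ValueError(f"{region}/{name}: expected [lat, lon], got {coords}")
            lat, lon = coords
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                raise ValueError(f"{region}/{name}: coordinates out of range: {coords}")
            count += 1
    return count

sample_cities = json.loads(sample_json)
n_locations = validate_cities(sample_cities)
print(n_locations)  # 3
```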

In [5]:
# Create a folium map centered on a fixed point in the North Atlantic,
# between the European and US cities
my_map = folium.Map(location=[42.57, -44.092], zoom_start=3)

# Add a circle marker for every location in every region
for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        folium.CircleMarker(
            location=coords,
            popup=city_name,
        ).add_to(my_map)
# my_map

In [None]:
# # Save the map to an HTML file
# my_map.save("map_all_target_cities.html")
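The map above is centered on a hard-coded coordinate. If the list of cities changes, one alternative is to center on the mean of all coordinates. The sketch below assumes the `region -> {name: [lat, lon]}` layout of `target_cities`. `map_center` is a hypothetical helper, and a plain arithmetic mean is only a rough choice that ignores the wrap-around at ±180° longitude.

```python
def map_center(cities):
    """Arithmetic mean of all [lat, lon] pairs across every region."""
    coords = [c for locations in cities.values() for c in locations.values()]
    mean_lat = sum(lat for lat, _ in coords) / len(coords)
    mean_lon = sum(lon for _, lon in coords) / len(coords)
    return [round(mean_lat, 2), round(mean_lon, 2)]

# Two entries copied from the notebook outputs, for illustration only
sample = {
    "EU": {"Paris": [48.85, 2.35]},
    "US": {"New York": [40.71, -74.01]},
}
center = map_center(sample)
print(center)
```

The result could then be passed as `folium.Map(location=map_center(target_cities), zoom_start=3)`.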

## <span style='color:#ff5f27'> 🌫 Processing Air Quality data</span>

### [🇪🇺 EEA](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm)
#### EEA stands for European Environment Agency

In [6]:
# EU Cities 
target_cities["EU"]

{'Amsterdam': [52.37, 4.89],
 'Athina': [37.98, 23.73],
 'Berlin': [52.52, 13.39],
 'Gdansk': [54.37, 18.61],
 'Kraków': [50.06, 19.94],
 'London': [51.51, -0.13],
 'Madrid': [40.42, -3.7],
 'Marseille': [43.3, 5.37],
 'Milano': [45.46, 9.19],
 'München': [48.14, 11.58],
 'Napoli': [40.84, 14.25],
 'Paris': [48.85, 2.35],
 'Sevilla': [37.39, -6.0],
 'Stockholm': [59.33, 18.07],
 'Tallinn': [59.44, 24.75],
 'Varna': [43.21, 27.92],
 'Wien': [48.21, 16.37]}

In [12]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_eu = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")

# Print the size of the 'df_eu' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_eu.shape)

# Check for missing values in the 'df_eu' DataFrame
print(f'⛳️ Missing Values: {df_eu.isna().sum().sum()}')

# Display a random sample of three rows from the 'df_eu' DataFrame
df_eu.sample(3)

⛳️ Size of this dataframe: (63548, 3)
⛳️ Missing Values: 0


|  | city_name | date | pm2_5 |
|---|---|---|---|
| 11887 | Gdansk | 2014-09-21 | 23.0 |
| 17498 | Kraków | 2019-10-23 | 56.0 |
| 42593 | Paris | 2016-08-04 | 7.0 |


### [🇺🇸 USEPA](https://aqs.epa.gov/aqsweb/documents/data_api.html#daily)
#### USEPA means United States Environmental Protection Agency
[Manual downloading](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)



In [13]:
# US Cities 
target_cities["US"]

{'Albuquerque': [35.08, -106.65],
 'Atlanta': [33.75, -84.39],
 'Chicago': [41.88, -87.62],
 'Columbus': [39.96, -83.0],
 'Dallas': [32.78, -96.8],
 'Denver': [39.74, -104.98],
 'Houston': [29.76, -95.37],
 'Los Angeles': [34.05, -118.24],
 'New York': [40.71, -74.01],
 'Phoenix-Mesa': [33.66, -112.04],
 'Salt Lake City': [40.76, -111.89],
 'San Francisco': [37.78, -122.42],
 'Tampa': [27.95, -82.46]}

In [16]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_us = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_us.csv")

# Print the size of the 'df_us' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_us.shape)

# Check for missing values in the 'df_us' DataFrame
print(f'⛳️ Missing Values: {df_us.isna().sum().sum()}')

# Display a random sample of three rows from the 'df_us' DataFrame
df_us.sample(3)

⛳️ Size of this dataframe: (46037, 3)
⛳️ Missing Values: 0


|  | date | city_name | pm2_5 |
|---|---|---|---|
| 21476 | 2015-01-14 | Houston | 11.3 |
| 26321 | 2018-11-28 | Los Angeles | 7.8 |
| 43002 | 2014-09-01 | Tampa | 11.8 |


### <span style="color:#ff5f27;">🏢 Processing special city - `Seattle`</span>
#### Seattle is covered by several monitoring stations, each treated as a separate location.
The daily `PM2.5` data for these stations was downloaded manually from the [EPA download page](https://www.epa.gov/outdoor-air-quality-data/download-daily-data).

In [15]:
target_cities["Seattle"]

{'Bellevue-SE 12th St': [47.60086, -122.1484],
 'DARRINGTON - FIR ST (Darrington High School)': [48.2469, -121.6031],
 'KENT - JAMES & CENTRAL': [47.38611, -122.23028],
 'LAKE FOREST PARK TOWNE CENTER': [47.755, -122.2806],
 'MARYSVILLE - 7TH AVE (Marysville Junior High)': [48.05432, -122.17153],
 'NORTH BEND - NORTH BEND WAY': [47.49022, -121.77278],
 'SEATTLE - BEACON HILL': [47.56824, -122.30863],
 'SEATTLE - DUWAMISH': [47.55975, -122.33827],
 'SEATTLE - SOUTH PARK #2': [47.53091, -122.3208],
 'Seattle-10th & Weller': [47.59722, -122.31972],
 'TACOMA - ALEXANDER AVE': [47.2656, -122.3858],
 'TACOMA - L STREET': [47.1864, -122.4517],
 'Tacoma-S 36th St': [47.22634, -122.46256],
 'Tukwila Allentown': [47.49854, -122.27839],
 'Tulalip-Totem Beach Rd': [48.06534, -122.28519]}

In [17]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_seattle = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_seattle.csv")

# Print the size of the 'df_seattle' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_seattle.shape)

# Check for missing values in the 'df_seattle' DataFrame
print(f'⛳️ Missing Values: {df_seattle.isna().sum().sum()}')

# Display a random sample of three rows
df_seattle.sample(3)

⛳️ Size of this dataframe: (46479, 3)
⛳️ Missing Values: 0


|  | city_name | date | pm2_5 |
|---|---|---|---|
| 8709 | SEATTLE - BEACON HILL | 2015-11-05 | 9.5 |
| 6634 | DARRINGTON - FIR ST (Darrington High School) | 2014-06-24 | 1.7 |
| 45134 | NORTH BEND - NORTH BEND WAY | 2023-01-12 | 0.3 |


### <span style="color:#ff5f27;">🌟 All together</span>

In [19]:
# Concatenate the DataFrames df_eu, df_us, and df_seattle along the rows and reset the index
df_air_quality = pd.concat(
    [df_eu, df_us, df_seattle],
).reset_index(drop=True)

# Print the shape of the df_air_quality DataFrame
print(f'⛳️ DF shape: {df_air_quality.shape}')

# Display a random sample of five rows from the df_air_quality DataFrame
df_air_quality.sample(5)

⛳️ DF shape: (156064, 3)


|  | city_name | date | pm2_5 |
|---|---|---|---|
| 106487 | Tampa | 2014-06-30 | 9.3 |
| 12453 | Gdansk | 2016-04-09 | 9.0 |
| 101342 | Salt Lake City | 2020-04-29 | 5.8 |
| 46538 | Sevilla | 2017-02-12 | 8.0 |
| 117821 | Seattle-10th & Weller | 2015-12-12 | 5.7 |


## <span style="color:#ff5f27;">🛠 Feature Engineering</span>

In [20]:
# Convert the 'date' column in the df_air_quality DataFrame to datetime format
df_air_quality['date'] = pd.to_datetime(df_air_quality['date'])
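With `date` parsed as datetimes, calendar coverage per city is easy to inspect. This matters if any later feature relies on consecutive daily rows, which is an assumption here because the feature engineering code is not shown in this notebook. The sketch below runs on a toy frame with made-up values, and `count_missing_days` is a hypothetical helper.

```python
import pandas as pd

def count_missing_days(df):
    """Per city: days between the first and last date that have no row."""
    def missing(dates):
        span_days = (dates.max() - dates.min()).days + 1
        return span_days - dates.nunique()
    return df.groupby("city_name")["date"].apply(missing)

# Toy data: Paris is missing 2020-01-03 and 2020-01-04
toy = pd.DataFrame({
    "city_name": ["Paris", "Paris", "Paris", "Tampa", "Tampa"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-01", "2020-01-02"]
    ),
    "pm2_5": [7.0, 8.0, 6.5, 11.8, 10.1],
})
gaps = count_missing_days(toy)
print(gaps.to_dict())
```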

In [21]:
# Apply feature engineering to the df_air_quality DataFrame using the air_quality.feature_engineer_aq() function
df_air_quality = air_quality.feature_engineer_aq(df_air_quality)

# Drop rows with missing values in the df_air_quality DataFrame
df_air_quality = df_air_quality.dropna()

# Check and print the total number of missing values in the df_air_quality DataFrame
df_air_quality.isna().sum().sum()

0
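The `dropna()` call above is needed whenever features look backwards in time: the first rows of each city have no history, so lagged and rolling values are undefined there. The body of `air_quality.feature_engineer_aq` is not shown in this notebook, so the following is only a sketch of how such features are commonly built with pandas. The helper name `add_history_features`, the window sizes, and the toy values are assumptions, not the tutorial's implementation.

```python
import pandas as pd

def add_history_features(df, n_lags=2, window=3):
    """Add per-city lag and rolling-mean columns (illustrative only)."""
    df = df.sort_values(["city_name", "date"]).copy()
    for lag in range(1, n_lags + 1):
        # Value observed `lag` rows earlier within the same city
        df[f"pm_2_5_previous_{lag}_day"] = df.groupby("city_name")["pm2_5"].shift(lag)
    # Rolling mean over the current row and the previous (window - 1) rows, per city
    df[f"mean_{window}_days"] = df.groupby("city_name")["pm2_5"].transform(
        lambda s: s.rolling(window).mean()
    )
    return df

toy = pd.DataFrame({
    "city_name": ["Paris"] * 4 + ["Tampa"] * 4,
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2
    ),
    "pm2_5": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
features = add_history_features(toy)
clean = features.dropna()  # drops the first two rows of each city
print(len(features), len(clean))  # 8 4
```

Note that `shift` moves by rows, not by calendar days, so a lag column only means "previous day" when each city has one row per consecutive day.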

In [22]:
# Print the shape (number of rows and columns) of the df_air_quality DataFrame
df_air_quality.shape

(154533, 31)

In [23]:
# Retrieve and display the column names of the df_air_quality DataFrame
df_air_quality.columns

Index(['city_name', 'date', 'pm2_5', 'pm_2_5_previous_1_day',
       'pm_2_5_previous_2_day', 'pm_2_5_previous_3_day',
       'pm_2_5_previous_4_day', 'pm_2_5_previous_5_day',
       'pm_2_5_previous_6_day', 'pm_2_5_previous_7_day', 'mean_7_days',
       'mean_14_days', 'mean_28_days', 'std_7_days', 'exp_mean_7_days',
       'exp_std_7_days', 'std_14_days', 'exp_mean_14_days', 'exp_std_14_days',
       'std_28_days', 'exp_mean_28_days', 'exp_std_28_days', 'year',
       'day_of_month', 'month', 'day_of_week', 'is_weekend', 'sin_day_of_year',
       'cos_day_of_year', 'sin_day_of_week', 'cos_day_of_week'],
      dtype='object')
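Column names such as `sin_day_of_week` and `cos_day_of_week` suggest a cyclical encoding, where a periodic value is mapped onto the unit circle so that the end of a cycle sits next to its start (Sunday is adjacent to Monday). The sketch below shows the standard formula. Whether `feature_engineer_aq` uses exactly this period and offset is an assumption, and `cyclical` and `dist` are hypothetical helper names.

```python
import math
from datetime import date

def cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def dist(a, b):
    """Euclidean distance between two (sin, cos) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

monday = cyclical(date(2023, 1, 9).weekday(), 7)    # weekday() == 0
tuesday = cyclical(date(2023, 1, 10).weekday(), 7)  # weekday() == 1
sunday = cyclical(date(2023, 1, 8).weekday(), 7)    # weekday() == 6

# Sunday is as close to Monday as Tuesday is, unlike the raw integers 6 and 0
print(round(dist(sunday, monday), 3), round(dist(tuesday, monday), 3))
```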

## <span style='color:#ff5f27'> 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs) </span>

In [27]:
# Read the CSV file from the specified URL into a pandas DataFrame for weather data
df_weather = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_weather.csv")

# Display the first three rows of the df_weather DataFrame
df_weather.head(3)

|  | city_name | date | temperature_max | temperature_min | precipitation_sum | rain_sum | snowfall_sum | precipitation_hours | wind_speed_max | wind_gusts_max | wind_direction_dominant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Amsterdam | 2013-01-01 | 9.2 | 5.5 | 10.2 | 10.2 | 0.0 | 14.0 | 32.0 | 62.6 | 255 |
| 1 | Amsterdam | 2013-01-02 | 7.8 | 5.6 | 0.5 | 0.5 | 0.0 | 2.0 | 22.9 | 39.6 | 251 |
| 2 | Amsterdam | 2013-01-03 | 10.3 | 8.2 | 2.0 | 2.0 | 0.0 | 6.0 | 22.2 | 39.2 | 255 |


---

In [28]:
# Apply the 'convert_date_to_unix' function to create a new 'unix_time' column in df_air_quality
df_air_quality["unix_time"] = pd.to_datetime(df_air_quality.date).apply(convert_date_to_unix)

# Apply the 'convert_date_to_unix' function to create a new 'unix_time' column in df_weather
df_weather["unix_time"] = pd.to_datetime(df_weather.date).apply(convert_date_to_unix)

# Convert the 'date' column in the df_air_quality DataFrame back to string format
df_air_quality.date = df_air_quality.date.astype(str)

# Convert the 'date' column in the df_weather DataFrame back to string format
df_weather.date = df_weather.date.astype(str)
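`convert_date_to_unix` is imported from `functions.common_functions`, and its body is not shown here. As a rough sketch of what such a conversion can look like, the following assumes the target is milliseconds since the Unix epoch, with dates interpreted as midnight UTC. The helper name `date_to_unix_ms` and both of those choices are assumptions, not the tutorial's implementation.

```python
from datetime import datetime, timezone

def date_to_unix_ms(value):
    """Convert a 'YYYY-MM-DD' string or datetime to epoch milliseconds (UTC)."""
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d")
    if value.tzinfo is None:
        # Treat naive values as UTC so the result does not depend on the local timezone
        value = value.replace(tzinfo=timezone.utc)
    return int(value.timestamp() * 1000)

ts = date_to_unix_ms("2013-01-01")
print(ts)  # 1356998400000
```

An integer timestamp kept alongside the string `date` column gives each row a numeric value that can serve as a key and as an event time.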

## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [29]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/5242
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

### <span style='color:#ff5f27'> 🌫 Air Quality Data </span>

In [30]:
# Get or create feature group
air_quality_fg = fs.get_or_create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    event_time="unix_time",
)
# Insert data
air_quality_fg.insert(df_air_quality)



Feature Group created successfully, explore it at 
https://snurran.hops.works/p/5242/fs/5190/fg/5194


Uploading Dataframe: 0.00% |          | Rows 0/154533 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: air_quality_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/5242/jobs/named/air_quality_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f9a243e3f70>, None)
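The feature group uses `['unix_time', 'city_name']` as its primary key, so each (timestamp, city) pair should identify exactly one row. A quick check that could be run before inserting is sketched below on a toy frame. `find_duplicate_keys` is a hypothetical helper, and the sample values are made up.

```python
import pandas as pd

def find_duplicate_keys(df, keys):
    """Return all rows whose key combination occurs more than once."""
    return df[df.duplicated(subset=keys, keep=False)]

toy = pd.DataFrame({
    "unix_time": [1356998400000, 1356998400000, 1357084800000],
    "city_name": ["Paris", "Paris", "Paris"],
    "pm2_5": [7.0, 7.5, 8.0],
})
dupes = find_duplicate_keys(toy, ["unix_time", "city_name"])
print(len(dupes))  # 2

# One possible resolution: keep the last row for each key
deduped = toy.drop_duplicates(subset=["unix_time", "city_name"], keep="last")
```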

### <span style='color:#ff5f27'> 🌦 Weather Data </span>

In [31]:
# Get or create feature group
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    description='Weather characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    event_time="unix_time",
)
# Insert data
weather_fg.insert(df_weather)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/5242/fs/5190/fg/5195


Uploading Dataframe: 0.00% |          | Rows 0/168975 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: weather_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/5242/jobs/named/weather_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f9a243e0190>, None)
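Both feature groups share the key `(unix_time, city_name)`, so it can be useful to confirm that every air quality row has a matching weather row before the two are combined downstream. Combining the groups is not part of this notebook. The sketch below uses toy data with made-up values, and `rows_without_weather` is a hypothetical helper.

```python
import pandas as pd

KEYS = ["unix_time", "city_name"]

def rows_without_weather(aq, weather):
    """Return the air quality keys that have no matching weather row."""
    merged = aq[KEYS].merge(
        weather[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(drop=True)

aq = pd.DataFrame({
    "unix_time": [1356998400000, 1357084800000],
    "city_name": ["Amsterdam", "Amsterdam"],
    "pm2_5": [12.0, 9.0],
})
weather = pd.DataFrame({
    "unix_time": [1356998400000],
    "city_name": ["Amsterdam"],
    "temperature_max": [9.2],
})
unmatched = rows_without_weather(aq, weather)
print(unmatched.to_dict("records"))
```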

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Feature Pipeline</span>

In the following notebook you will parse data and insert it into Feature Groups.