# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Backfill Features to the Feature Store</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/1_backfill_feature_groups.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Fetch historical data
2. Connect to the Hopsworks feature store
3. Create feature groups and insert the data into the feature store

![tutorial-flow](../../images/01_featuregroups.png)

### <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install geopy folium streamlit-folium --quiet

In [None]:
import datetime
import time
import requests
import json

import pandas as pd
import folium

from functions import *

import warnings
warnings.filterwarnings("ignore")

---

## <span style='color:#ff5f27'> 🌍 Representing the Target cities </span>

In [None]:
with open('target_cities.json') as json_file:
    target_cities = json.load(json_file)
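
The contents of `target_cities.json` are not shown here. Judging from how the map cell iterates over it, the file appears to map a region key to a dictionary of city names and `[latitude, longitude]` pairs. A minimal sketch of that assumed shape, using a made-up excerpt with approximate coordinates:

```python
import json

# Hypothetical excerpt of target_cities.json; the real file is not shown in this notebook.
# Coordinates are approximate and only illustrate the assumed structure.
sample_json = '{"EU": {"Paris": [48.85, 2.35]}, "US": {"Boston": [42.36, -71.06]}}'
sample_cities = json.loads(sample_json)

# Walk the nested structure: region -> city name -> [latitude, longitude]
for region, cities in sample_cities.items():
    for city_name, (lat, lon) in cities.items():
        print(region, city_name, lat, lon)
```

If the real file uses a different layout, the loops in the map cell would need to change accordingly.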

In [None]:
# Create a folium map centered over the North Atlantic so that both European and US cities are visible
my_map = folium.Map(location=[42.57, -44.092], zoom_start=3)

# Add one marker per city; each entry maps a city name to its [latitude, longitude]
for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        folium.CircleMarker(
            location=coords,
            popup=city_name
        ).add_to(my_map)

my_map

In [None]:
# Save the map to an HTML file (uncomment to run)
# my_map.save("map_all_target_cities.html")

## <span style='color:#ff5f27'> 🌫 Processing Air Quality data</span>

### [🇪🇺 EEA](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm)
#### EEA stands for European Environment Agency

In [None]:
target_cities["EU"]

In [None]:
df_eu = pd.read_csv("data/backfill_pm2_5_eu.csv")

In [None]:
df_eu.isna().sum().sum()

In [None]:
print("Size of this dataframe:", df_eu.shape)

df_eu.sample(3)

### [🇺🇸 USEPA](https://aqs.epa.gov/aqsweb/documents/data_api.html#daily)
#### USEPA stands for United States Environmental Protection Agency
[Manual download page](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)



In [None]:
target_cities["US"]

In [None]:
df_us = pd.read_csv("data/backfill_pm2_5_us.csv")

In [None]:
df_us.isna().sum().sum()

In [None]:
print("Size of this dataframe:", df_us.shape)

df_us.sample(3)

### <span style="color:#ff5f27;">🏢 Processing a special case: `Seattle`</span>
#### We need data from several stations across Seattle.
The daily `PM2.5` data was downloaded manually from [the EPA daily data page](https://www.epa.gov/outdoor-air-quality-data/download-daily-data).

In [None]:
target_cities["Seattle"]

In [None]:
df_seattle = pd.read_csv("data/backfill_pm2_5_seattle.csv")

In [None]:
df_seattle.isna().sum().sum()

In [None]:
print("Size of this dataframe:", df_seattle.shape)

df_seattle.sample(3)

In [None]:
df_seattle.city_name.value_counts()

### <span style="color:#ff5f27;">🌟 All together</span>

In [None]:
df_air_quality = pd.concat([df_eu, df_us, df_seattle]).reset_index(drop=True)
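
Stacking the three frames row-wise only gives a clean result if they share the same columns. The sketch below illustrates this with toy frames; the column names `city_name`, `date` and `pm2_5` are assumptions for illustration, since the CSV schema is not shown here.

```python
import pandas as pd

# Toy stand-ins for df_eu and df_us; column names and values are assumptions.
toy_eu = pd.DataFrame({"city_name": ["Paris"], "date": ["2023-01-01"], "pm2_5": [12.0]})
toy_us = pd.DataFrame({"city_name": ["Boston"], "date": ["2023-01-01"], "pm2_5": [8.5]})

# Same pattern as in the notebook: stack the rows and rebuild a clean 0..n-1 index.
toy_all = pd.concat([toy_eu, toy_us]).reset_index(drop=True)

# pd.concat does not fail on mismatched schemas; it fills the gaps with NaN,
# so it is worth checking that the column sets agree.
same_columns = set(toy_eu.columns) == set(toy_us.columns)
print(toy_all.shape, same_columns)
```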

In [None]:
df_air_quality.sample(5)

In [None]:
df_air_quality.shape

### <span style="color:#ff5f27;">🛠 Feature Engineering</span>

In [None]:
df_air_quality.head(2)

In [None]:
df_air_quality['date'] = pd.to_datetime(df_air_quality['date'])

In [None]:
df_air_quality = feature_engineer_aq(df_air_quality)
df_air_quality = df_air_quality.dropna()
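
`feature_engineer_aq` is imported from `functions` and its body is not shown in this notebook. The sketch below is a hypothetical stand-in, not the actual helper: it assumes a `pm2_5` column and adds a one-day lag and a three-day rolling mean per city. It also shows why a `dropna()` afterwards is natural, since window features have no value for the first rows of each city.

```python
import pandas as pd

def sketch_feature_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for feature_engineer_aq: lag and rolling mean per city."""
    df = df.sort_values(["city_name", "date"]).copy()
    grouped = df.groupby("city_name")["pm2_5"]
    # Previous day's value; the first row of each city has no previous day, so it is NaN.
    df["pm2_5_lag_1"] = grouped.shift(1)
    # 3-day rolling mean; the first two rows of each city are NaN.
    df["pm2_5_mean_3d"] = grouped.transform(lambda s: s.rolling(3).mean())
    return df

# Toy data: two cities, four consecutive days each (values are made up).
toy = pd.DataFrame({
    "city_name": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"] * 2),
    "pm2_5": [10.0, 12.0, 14.0, 16.0, 5.0, 5.0, 8.0, 11.0],
})

engineered = sketch_feature_engineer(toy)
rows_before = len(engineered)
engineered = engineered.dropna()
print(rows_before, len(engineered))
```

With a window of three days, each city loses its first two rows, so longer windows cost more history per city.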

In [None]:
df_air_quality.isna().sum().sum()

In [None]:
df_air_quality.shape

In [None]:
df_air_quality.columns

---

## <span style='color:#ff5f27'> 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs) </span>

In [None]:
df_weather = pd.read_csv("data/backfill_weather.csv")

In [None]:
df_weather.city_name.value_counts()

In [None]:
df_weather.sample(3)

---

In [None]:
df_air_quality.date = pd.to_datetime(df_air_quality.date)
df_weather.date = pd.to_datetime(df_weather.date)

df_air_quality["unix_time"] = df_air_quality["date"].apply(convert_date_to_unix)
df_weather["unix_time"] = df_weather["date"].apply(convert_date_to_unix)
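
`convert_date_to_unix` also comes from `functions` and is not shown here. As a rough idea of what such a helper might do, the sketch below converts a date to a Unix timestamp in milliseconds. Both the function name and the choice of milliseconds (rather than seconds) are assumptions.

```python
from datetime import datetime, timezone

def sketch_date_to_unix_ms(date_value) -> int:
    """Hypothetical stand-in for convert_date_to_unix: date -> Unix time in ms (UTC)."""
    if isinstance(date_value, str):
        date_value = datetime.strptime(date_value, "%Y-%m-%d")
    # Treat naive datetimes as UTC so the result does not depend on the local time zone.
    if date_value.tzinfo is None:
        date_value = date_value.replace(tzinfo=timezone.utc)
    return int(date_value.timestamp() * 1000)

print(sketch_date_to_unix_ms("2023-01-01"))  # 1672531200000
```

Because `pd.Timestamp` is a subclass of `datetime`, a function like this can also be passed to `Series.apply` on a datetime column.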

In [None]:
df_air_quality.date = df_air_quality.date.astype(str)
df_weather.date = df_weather.date.astype(str)

---

### <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

### <span style='color:#ff5f27'> 🌫 Air Quality Data </span>

In [None]:
air_quality_fg = fs.get_or_create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    online_enabled=False,
    # partition_key=["city_name"],
    event_time="unix_time",  # name of the event-time column, passed as a string
)
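
Since the feature group declares `unix_time` and `city_name` as its primary key, each (time, city) pair should appear only once in the DataFrame. A quick pandas check before inserting could look like the sketch below; the toy frame and its `pm2_5` column are assumptions, and in the notebook the check would run on `df_air_quality`.

```python
import pandas as pd

# Toy frame standing in for df_air_quality; values are made up.
toy = pd.DataFrame({
    "unix_time": [1672531200000, 1672531200000, 1672617600000],
    "city_name": ["Paris", "Boston", "Paris"],
    "pm2_5": [12.0, 8.5, 9.1],
})

primary_key = ["unix_time", "city_name"]

# keep=False marks every row that shares its key with another row.
duplicate_rows = toy[toy.duplicated(subset=primary_key, keep=False)]
print(len(duplicate_rows))

# Appending a copy of the first row creates one repeated key, which the check catches.
toy_with_dup = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
dup_count = int(toy_with_dup.duplicated(subset=primary_key).sum())
print(dup_count)
```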

In [None]:
air_quality_fg.insert(df_air_quality, write_options={"wait_for_job": False})

### <span style='color:#ff5f27'> 🌦 Weather Data </span>

In [None]:
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    description='Weather characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    online_enabled=False,
    # partition_key=["city_name"],
    event_time="unix_time",  # name of the event-time column, passed as a string
)

In [None]:
weather_fg.insert(df_weather, write_options={"wait_for_job": True})