# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/2_feature_pipeline.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Parse Data
2. Feature Group Insertion

### <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
import datetime
import time
import requests
import pandas as pd
import json

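# Project helpers used below: get_aqi_data_from_open_meteo,
# get_weather_data_from_open_meteo, feature_engineer_aq, convert_date_to_unix
from functions import *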

import warnings
warnings.filterwarnings("ignore")

In [None]:
with open('target_cities.json') as json_file:
    target_cities = json.load(json_file)

In [None]:
today = datetime.date.today()

In [None]:
today, str(today)

---

## <span style='color:#ff5f27'> 🌫 Filling gaps in Air Quality data (PM2.5)</span>

### On the first run, we determine the 'last update date' from our backfill data
#### On later runs, we will use the `feature view` method from the Hopsworks Feature Store

In [None]:
df_air_quality = pd.read_csv("data/backfill_pm2_5.csv")
df_weather = pd.read_csv("data/backfill_weather.csv")

In [None]:
last_dates_aq = df_air_quality[["date", "city_name"]].groupby("city_name").max()
last_dates_aq.date = last_dates_aq.date.astype(str)

# dictionary mapping each city name to its last update date
last_dates_aq = last_dates_aq.to_dict()["date"]

In [None]:
last_dates_aq["Berlin"], last_dates_aq["Columbus"]
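
The "last update date" lookup built above can be tried out on a toy frame. This is only a minimal sketch: the rows and the names `toy` and `last_dates` are made up for illustration and are not part of the backfill data.

```python
import pandas as pd

# Toy stand-in for the backfill data (values are invented for illustration).
toy = pd.DataFrame({
    "city_name": ["Berlin", "Berlin", "Columbus"],
    "date": ["2023-01-01", "2023-01-03", "2023-01-02"],
    "pm2_5": [10.0, 12.5, 8.1],
})

# Latest date per city, as a plain {city_name: date_string} dictionary.
last_dates = toy.groupby("city_name")["date"].max().astype(str).to_dict()
print(last_dates)  # {'Berlin': '2023-01-03', 'Columbus': '2023-01-02'}
```

Each value can then be passed as the `start_date` of the next request for that city.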

### <span style='color:#ff5f27'>  🧙🏼‍♂️ Parsing PM2.5 data</span>

In [None]:
start_of_cell = time.time()

# Collect one DataFrame per city and concatenate once at the end
# (cheaper than growing a DataFrame inside the loop).
aq_frames = []

for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        df_ = get_aqi_data_from_open_meteo(city_name=city_name,
                                           coordinates=coords,
                                           start_date=last_dates_aq[city_name],
                                           end_date=str(today))
        aq_frames.append(df_)

df_aq_raw = pd.concat(aq_frames).reset_index(drop=True)

end_of_cell = time.time()
print("-" * 64)
print(f"Parsed new PM2.5 data for ALL locations up to {str(today)}.")
print(f"Took {round(end_of_cell - start_of_cell, 2)} sec.\n")

In [None]:
# calculate 60 days ago from today
date_threshold = today - datetime.timedelta(days=60)

df_air_quality.date = (df_air_quality.date).astype(str)
# filter rows based on date threshold
df_air_quality = df_air_quality[df_air_quality['date'] > str(date_threshold)]

df_air_quality
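
The filter above compares dates as strings. That matches chronological order only because the dates are zero-padded `YYYY-MM-DD` strings. A minimal sketch on invented data (the fixed `today` and the frame `toy` are assumptions for reproducibility) shows that the string comparison and a datetime comparison select the same rows:

```python
import datetime
import pandas as pd

# Fixed "today" so the example is reproducible (an assumption for this sketch).
today = datetime.date(2024, 3, 31)
date_threshold = today - datetime.timedelta(days=60)  # 2024-01-31

toy = pd.DataFrame({
    "city_name": ["Berlin"] * 3,
    "date": ["2024-01-15", "2024-01-31", "2024-03-01"],
})

# String comparison, as in the cell above.
by_string = toy[toy["date"] > str(date_threshold)]

# Equivalent comparison on real datetimes.
by_datetime = toy[pd.to_datetime(toy["date"]) > pd.Timestamp(date_threshold.isoformat())]

print(by_string["date"].tolist())  # ['2024-03-01']
```

Note that the comparison is strict, so a row dated exactly 60 days ago is dropped.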

In [None]:
# we need the previous data to calculate aggregation functions
df_aq_update = pd.concat([df_air_quality, df_aq_raw]).reset_index(drop=True)
df_aq_update = df_aq_update.drop_duplicates(subset=['city_name', 'date'])

In [None]:
df_aq_update.shape
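
When the same `(city_name, date)` pair appears in both frames, `drop_duplicates` keeps the first occurrence by default, which here is the row from the older frame because it comes first in the `concat`. A minimal sketch on invented values (the frames `old` and `new` are hypothetical) makes the difference to `keep="last"` visible:

```python
import pandas as pd

# Hypothetical "previous" and "freshly parsed" rows that overlap on 2024-03-02.
old = pd.DataFrame({
    "city_name": ["Berlin", "Berlin"],
    "date": ["2024-03-01", "2024-03-02"],
    "pm2_5": [10.0, 11.0],
})
new = pd.DataFrame({
    "city_name": ["Berlin", "Berlin"],
    "date": ["2024-03-02", "2024-03-03"],
    "pm2_5": [99.0, 12.0],
})

combined = pd.concat([old, new]).reset_index(drop=True)

# Default keep="first": the older value (11.0) survives for 2024-03-02.
keep_first = combined.drop_duplicates(subset=["city_name", "date"])

# keep="last": the newly parsed value (99.0) survives instead.
keep_last = combined.drop_duplicates(subset=["city_name", "date"], keep="last")

print(keep_first["pm2_5"].tolist())  # [10.0, 11.0, 12.0]
print(keep_last["pm2_5"].tolist())   # [10.0, 99.0, 12.0]
```

Which variant is appropriate depends on whether the newly parsed values should override the stored ones.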

### <span style="color:#ff5f27;">🛠 Feature Engineering PM2.5</span>

In [None]:
df_aq_update['date'] = pd.to_datetime(df_aq_update['date'])

In [None]:
df_aq_update = feature_engineer_aq(df_aq_update)
df_aq_update = df_aq_update.dropna()

In [None]:
df_aq_update.isna().sum().sum()

In [None]:
df_aq_update.shape

In [None]:
df_aq_update.columns

---

## <span style='color:#ff5f27'> 🌦 Filling gaps in Weather data</span>

In [None]:
last_dates_weather = df_weather[["date", "city_name"]].groupby("city_name").max()
last_dates_weather.date = last_dates_weather.date.astype(str)
last_dates_weather = last_dates_weather.to_dict()["date"]

### <span style='color:#ff5f27'>  🧙🏼‍♂️ Parsing Weather data</span>

In [None]:
start_of_cell = time.time()

# Collect one DataFrame per city and concatenate once at the end
# (cheaper than growing a DataFrame inside the loop).
weather_frames = []

for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        df_ = get_weather_data_from_open_meteo(city_name=city_name,
                                               coordinates=coords,
                                               start_date=last_dates_weather[city_name],
                                               end_date=str(today),
                                               forecast=True)
        weather_frames.append(df_)

df_weather_update = pd.concat(weather_frames).reset_index(drop=True)

end_of_cell = time.time()
print("-" * 64)
print(f"Parsed new weather data for ALL cities up to {str(today)}.")
print(f"Took {round(end_of_cell - start_of_cell, 2)} sec.\n")

In [None]:
df_aq_update.date = pd.to_datetime(df_aq_update.date)
df_weather_update.date = pd.to_datetime(df_weather_update.date)

df_aq_update["unix_time"] = df_aq_update["date"].apply(convert_date_to_unix)
df_weather_update["unix_time"] = df_weather_update["date"].apply(convert_date_to_unix)
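
`convert_date_to_unix` comes from the `functions` module and is not shown in this notebook. As a rough sketch of what such a conversion can look like, the hypothetical helper `date_to_unix_ms` below assumes that dates are interpreted as UTC and that the result is in milliseconds; the real helper may use a different time zone or unit.

```python
import pandas as pd

def date_to_unix_ms(ts: pd.Timestamp) -> int:
    """Hypothetical helper: naive timestamp -> Unix time in milliseconds (UTC assumed)."""
    return int(ts.tz_localize("UTC").timestamp() * 1000)

dates = pd.to_datetime(pd.Series(["1970-01-02", "2024-01-01"]))

# Row-by-row, mirroring the .apply(...) call in the cell above.
unix_ms_apply = dates.apply(date_to_unix_ms)

# Vectorized alternative: distance from the epoch in whole milliseconds.
unix_ms = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)

print(unix_ms.tolist())  # [86400000, 1704067200000]
```

The vectorized form avoids a Python-level call per row, which matters mostly for large frames.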

In [None]:
df_aq_update.date = df_aq_update.date.astype(str)
df_weather_update.date = df_weather_update.date.astype(str)

---

## <span style="color:#ff5f27;">⬆️ Uploading new data to the Feature Store</span>

### <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
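import hopsworks

# Log in to Hopsworks and get a handle to the project's feature store
project = hopsworks.login()
fs = project.get_feature_store()

# Retrieve (or create) the feature groups that receive the new data
air_quality_fg = fs.get_or_create_feature_group(
    name='air_quality',
    version=1,
)
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    version=1,
)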

In [None]:
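# Start the air quality insert without waiting for its job to finish
air_quality_fg.insert(df_aq_update, write_options={"wait_for_job": False})

In [None]:
# Wait for the weather insert job to finish before the notebook ends
weather_fg.insert(df_weather_update, write_options={"wait_for_job": True})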