# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Backfill Features to the Feature Store</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/citibike/1_backfill_feature_groups.ipynb)

**Note**: You may see an error when installing hopsworks on Colab; it is safe to ignore.

This is the first part of the advanced tutorial series on the Hopsworks Feature Store. In this first module, you will work with Citi Bike usage data and meteorological observations for NYC.

The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and deploying a model that can predict Citi Bike usage per station in the future.

## 🗒️ This notebook is divided into 3 sections:
1. **Load the data and perform feature engineering**.
2. **Connect to the Hopsworks feature store**.
3. **Create feature groups and upload them to the feature store**.

### <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
!pip install -U hopsworks --quiet

In [None]:
from datetime import timedelta, datetime
import pandas as pd

from pandas.tseries.holiday import USFederalHolidayCalendar

from functions import *

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 💽 Load the historical data and 🛠️ Perform Feature Engineering</span>

The data you will use comes from three different sources:

- Citi Bike [Trip Histories](https://s3.amazonaws.com/tripdata/index.html);
- US national holidays from `USFederalHolidayCalendar` (`pandas.tseries.holiday` module);
- Different meteorological observations from [VisualCrossing](https://www.visualcrossing.com/).

### <span style="color:#ff5f27;"> 🚲 Citibike usage info</span>

Downloadable files of Citi Bike trip data are located [here](https://s3.amazonaws.com/tripdata/index.html). The original data includes:

    Ride ID
    Rideable type
    Started at
    Ended at
    Start station name
    Start station ID
    End station name
    End station ID
    Start latitude
    Start longitude
    End latitude
    End longitude
    Member or casual ride



Let's download some data [from here](https://s3.amazonaws.com/tripdata/index.html) and preprocess it (remove redundant columns and group the data).

In [None]:
# Download Citi Bike trip data for the 01/2022 to 04/2022 range
df_raw = get_citibike_data("01/2022", "04/2022")

In [None]:
df_raw

In [None]:
df_raw.station_id = df_raw.station_id.astype(str)

In [None]:
df_enhanced = engineer_citibike_features(df_raw)
df_enhanced = df_enhanced.dropna()
df_enhanced.station_id = df_enhanced.station_id.astype(str)

In [None]:
df_enhanced
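
The `engineer_citibike_features` helper lives in `functions.py` and is not shown in this notebook. As a rough illustration of the "data grouping" step mentioned above, the sketch below counts rides per station per day on a tiny hand-made frame. The column names `started_at`, `start_station_id`, and `users_count` are assumptions made for this example and may not match what the helper actually produces.

```python
import pandas as pd

# Tiny hand-made sample; real trip files have many more columns.
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2022-01-01 08:00", "2022-01-01 09:30",
        "2022-01-01 10:15", "2022-01-02 07:45",
    ]),
    "start_station_id": ["A1", "A1", "B2", "A1"],
})

# Keep only the calendar date, then count rides per (date, station).
trips["date"] = trips["started_at"].dt.strftime("%Y-%m-%d")
daily_usage = (
    trips.groupby(["date", "start_station_id"])
    .size()
    .reset_index(name="users_count")  # hypothetical column name
    .rename(columns={"start_station_id": "station_id"})
)
print(daily_usage)
```

Grouping on both `date` and `station_id` yields one row per station per day, which is the same pair of columns used later as the primary key of the usage feature group.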

In [None]:
random_station_id = df_enhanced.station_id.sample(1).values[0]

df_enhanced[df_enhanced.station_id == random_station_id]

In [None]:
df_enhanced.info()

### <span style="color:#ff5f27;">📒 Citibike stations info</span>

In [None]:
df_stations_info = pd.read_csv("data/stations_info.csv")

In [None]:
df_stations_info[df_stations_info["station_id"] == '7976.08']

In [None]:
df_stations_info = df_stations_info.drop_duplicates(subset=["station_id"]) 
df_stations_info = df_stations_info.reset_index(drop=True)

In [None]:
df_stations_info.head(3)

In [None]:
import plotly.express as px


fig = px.scatter_mapbox(df_stations_info, 
                        lat="lat", 
                        lon="long",
                        zoom=9.5,
                        hover_name="station_name",
                        height=400,
                        width=600)

fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### <span style="color:#ff5f27;"> 📅 US holidays</span>

In [None]:
cal = USFederalHolidayCalendar()

# Build a holiday feature covering roughly 10 years of US federal holidays, starting in 2017.
start_date_for_cal = datetime.strptime('2017-01-01', '%Y-%m-%d')
end_date_for_cal = start_date_for_cal + timedelta(days=365*10)

holidays = pd.DataFrame(cal.holidays(start=start_date_for_cal, end=end_date_for_cal),
                        columns=['date'])
holidays['date'] = holidays['date'].dt.strftime('%Y-%m-%d')
holidays['holiday'] = 1

In [None]:
df_holidays = pd.DataFrame(pd.date_range(start_date_for_cal, end_date_for_cal),
                           columns=["date"])
df_holidays['date'] = df_holidays['date'].dt.strftime('%Y-%m-%d')

df_holidays

In [None]:
df_holidays = df_holidays.set_index("date").join(holidays.set_index("date"), how="left").fillna(0)

In [None]:
df_holidays['holiday'] = df_holidays['holiday'].astype(int)

In [None]:
df_holidays = df_holidays.reset_index(drop=False)

In [None]:
df_holidays.head(3)

In [None]:
df_holidays.tail(3)
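
The cells above build the holiday flag with a left join followed by `fillna(0)`. For comparison, here is a more compact sketch that produces the same kind of frame using `isin`. The function name `build_holiday_flags` is made up for this example; it is an alternative formulation, not the code the notebook relies on.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def build_holiday_flags(start, end):
    """Return one row per day in [start, end] with a 0/1 `holiday` column."""
    days = pd.date_range(start, end)
    federal = USFederalHolidayCalendar().holidays(start=start, end=end)
    return pd.DataFrame({
        "date": days.strftime("%Y-%m-%d"),
        # True where the day is a federal holiday, cast to 0/1.
        "holiday": days.isin(federal).astype(int),
    })


july = build_holiday_flags("2022-07-01", "2022-07-31")
print(july[july["holiday"] == 1])
```

In this one-month window the only flagged day should be Independence Day, which makes the output easy to check by eye.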

### <span style="color:#ff5f27;"> 🌤 Meteorological measurements from VisualCrossing</span>

### You will fetch weather data, so you need an API key from [VisualCrossing](https://www.visualcrossing.com/). You can use [this link](https://www.visualcrossing.com/account).

### Don't forget to create a `.env` configuration file to store all the necessary environment variables (API keys):
![](images/api_keys_env_file.png)
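
A minimal sketch of reading such a key at runtime follows, assuming the variables from `.env` have already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`). The variable name `VISUALCROSSING_API_KEY` and the helper `read_api_key` are hypothetical; use whatever name your `.env` file and `functions.py` actually expect.

```python
import os


def read_api_key(var_name):
    """Fetch an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file or export it first."
        )
    return key


# Hypothetical variable name, set here only so the example runs on its own.
os.environ.setdefault("VISUALCROSSING_API_KEY", "demo-key")
print(read_api_key("VISUALCROSSING_API_KEY"))
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the weather API later on.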

In [None]:
df_enhanced.date = df_enhanced.date.astype(str)

start_date, end_date = df_enhanced.date.min(), df_enhanced.date.max()

In [None]:
df_weather = get_weather_data(city="nyc",
                              start_date=str(start_date).split()[0],
                              end_date=str(end_date).split()[0])

In [None]:
df_weather.tail(5)

In [None]:
# Add a Unix timestamp column to each DataFrame (used later as the event time)

df_enhanced["timestamp"] = df_enhanced["date"].apply(convert_date_to_unix)
df_holidays["timestamp"] = df_holidays["date"].apply(convert_date_to_unix)
df_weather["timestamp"] = df_weather["date"].apply(convert_date_to_unix)
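
`convert_date_to_unix` is imported from `functions.py`, so its exact behavior is not visible here. The sketch below shows one plausible version, assuming the input is a `YYYY-MM-DD` string interpreted as midnight UTC and the output is in milliseconds; the real helper may use a different unit or time zone. The name `date_to_unix_ms` is made up for this example.

```python
from datetime import datetime, timezone


def date_to_unix_ms(date_str):
    """Convert a 'YYYY-MM-DD' string to milliseconds since the Unix epoch (UTC midnight)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


print(date_to_unix_ms("2022-01-01"))  # 1640995200000
```

Pinning the time zone explicitly keeps the result identical across machines, which matters when the same dates are used to join feature groups.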

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In this case, you will create the following feature groups: Citi Bike usage per station, station information, meteorological measurements in NYC, and US holidays.
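
Each feature group in this notebook declares a primary key, so it can help to confirm that the key columns really identify each row before uploading. The following is an optional pandas-only sanity check, not part of the Hopsworks API; the helper name `check_primary_key` and the sample frame are made up for illustration.

```python
import pandas as pd


def check_primary_key(df, primary_key):
    """Raise ValueError if the key columns contain nulls or duplicate combinations."""
    if df[primary_key].isnull().any().any():
        raise ValueError(f"Null values found in key columns {primary_key}")
    duplicates = int(df.duplicated(subset=primary_key).sum())
    if duplicates:
        raise ValueError(f"{duplicates} duplicate rows for key {primary_key}")


usage = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    "station_id": ["A1", "B2", "A1"],
    "users_count": [2, 1, 1],
})
check_primary_key(usage, ["date", "station_id"])  # passes silently
```

In this notebook the same check could be run as `check_primary_key(df_enhanced, ["date", "station_id"])` before inserting into the usage feature group.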

In [None]:
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1,
    description="Citibike stations usage data.",
    primary_key=["date", "station_id"],
    event_time="timestamp"
)

In [None]:
citibike_usage_fg.insert(df_enhanced, write_options={"wait_for_job": False})

In [None]:
citibike_stations_info_fg = fs.get_or_create_feature_group(
    name="citibike_stations_info",
    version=1,
    description="Citibike stations information.",
    primary_key=["station_id"],
    online_enabled=True
)

In [None]:
citibike_stations_info_fg.insert(df_stations_info, write_options={"wait_for_job": False})

In [None]:
# Read the feature group back to verify the upload. Because the insert above does not
# wait for the ingestion job, the data may not be readable until that job has finished.
df = citibike_stations_info_fg.read()

In [None]:
df[df["station_id"] == '7976.08']
In [None]:
us_holidays_fg = fs.get_or_create_feature_group(
    name="us_holidays",
    version=1,
    description="US holidays calendar.",
    primary_key=["date"],
    event_time="timestamp"
)

In [None]:
us_holidays_fg.insert(df_holidays, write_options={"wait_for_job": False})

In [None]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1,
    description="Meteorological measurements for NYC.",
    primary_key=["date"],
    event_time="timestamp"
)

In [None]:
meteorological_measurements_fg.insert(df_weather, write_options={"wait_for_job": False})

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the next notebook, you will be parsing new monthly data for the Feature Groups.