# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Backfill Features to the Feature Store</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/electricity/1_backfill_feature_groups.ipynb)

**Note**: Installing `hopsworks` on Colab may print an error. It is safe to ignore.

This is the first part of the advanced series of tutorials about the Hopsworks Feature Store. In this first module, you will work with Citi Bike trip data from New York City.

The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and deploying a model that can predict future Citi Bike station usage.

## 🗒️ This notebook is divided into 3 sections:
1. **Load the data and engineer features**,
2. **Connect to the Hopsworks feature store**,
3. **Create feature groups and upload them to the feature store**.

### <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
!pip install -U hopsworks --quiet




In [2]:
from datetime import timedelta, datetime
import pandas as pd

from pandas.tseries.holiday import USFederalHolidayCalendar

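# Helpers from the tutorial's local functions.py module
from functions import get_citibike_data, engineer_citibike_features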

---

## <span style="color:#ff5f27;"> 🚲 Load the Citi Bike Trip Histories</span>

Downloadable files of Citi Bike trip data are located [here](https://s3.amazonaws.com/tripdata/index.html). The data includes:

- Ride ID
- Rideable type
- Started at
- Ended at
- Start station name
- Start station ID
- End station name
- End station ID
- Start latitude
- Start longitude
- End latitude
- End longitude
- Member or casual ride



Let's download some data and preprocess it: redundant columns are removed and trips are grouped into daily user counts per station.

In [3]:
df = get_citibike_data("12/2021", "01/2022")

_____ Processing 12/2021... _____
Retrieving DataFrame from the existing csv file...💿
_____ Processing 01/2022... _____
Retrieving DataFrame from the existing csv file...💿
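
The helper `get_citibike_data` comes from the tutorial's `functions` module, which is not shown here. As a rough sketch of the kind of grouping described above, and assuming raw trips with `started_at` and `start_station_id` columns (names guessed from the field list, not confirmed by the source), daily counts per station could be computed like this. The function name `daily_station_counts` is hypothetical.

```python
import pandas as pd

# Hypothetical raw trips; column names are assumptions based on the field list above.
trips = pd.DataFrame({
    "ride_id": ["a", "b", "c", "d"],
    "started_at": ["2022-01-01 08:00:00", "2022-01-01 09:30:00",
                   "2022-01-01 10:15:00", "2022-01-02 07:45:00"],
    "start_station_id": ["4513.09", "4513.09", "4517.03", "4513.09"],
})

def daily_station_counts(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep only the needed columns and count rides per day and start station."""
    slim = pd.DataFrame({
        "date": pd.to_datetime(trips["started_at"]).dt.strftime("%Y-%m-%d"),
        "station_id": trips["start_station_id"],
    })
    return (
        slim.groupby(["date", "station_id"])
        .size()
        .reset_index(name="users_count")
    )

daily = daily_station_counts(trips)
print(daily)
```

The result has the same three columns (`date`, `station_id`, `users_count`) as the DataFrame displayed in the next cell.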


In [4]:
df

Unnamed: 0,date,station_id,users_count
0,2022-01-01,2782.02,2
1,2022-01-01,2832.03,3
2,2022-01-01,2912.08,1
3,2022-01-01,2932.01,2
4,2022-01-01,2961.05,1
...,...,...,...
150779,2021-12-31,8778.01,1
150780,2021-12-31,8782.01,2
150781,2021-12-31,8795.01,2
150782,2021-12-31,8795.03,2


In [12]:
df_enhanced = engineer_citibike_features(df)

In [14]:
df_enhanced.tail(3)

Unnamed: 0,date,station_id,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,rate_of_change_56_days
150556,2021-12-31,8795.01,2,2.0,2.0,2.196429,0.816497,1.878734,0.748158,-33.333333,1.037749,1.962195,1.009614,100.0,1.565766,2.147267,1.518377,0.0
150557,2021-12-31,8795.03,2,1.857143,2.071429,2.196429,0.690066,1.90905,0.650401,-33.333333,0.997249,1.967236,0.939994,-50.0,1.565766,2.1421,1.491751,-33.333333
150558,2021-12-31,8841.03,3,1.857143,2.0,2.196429,0.690066,2.181788,0.760011,200.0,0.877058,2.104937,0.947898,200.0,1.565766,2.172201,1.473978,0.0
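
The implementation of `engineer_citibike_features` is not shown in this notebook. Judging only from the column names above, it adds rolling and exponentially weighted statistics over 7, 14 and 56 days. The sketch below shows one way such columns could be built with pandas. The function name `add_window_features`, the per-station grouping, and the definition of "rate of change" as a percentage change over `window` rows are all assumptions, not the tutorial's actual logic.

```python
import pandas as pd

def add_window_features(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Add rolling and exponentially weighted statistics of users_count per station."""
    out = df.sort_values(["station_id", "date"]).reset_index(drop=True)
    counts = out.groupby("station_id")["users_count"]
    suffix = f"{window}_days"
    out[f"mean_{suffix}"] = counts.transform(lambda s: s.rolling(window, min_periods=1).mean())
    out[f"std_{suffix}"] = counts.transform(lambda s: s.rolling(window, min_periods=2).std())
    out[f"exp_mean_{suffix}"] = counts.transform(lambda s: s.ewm(span=window).mean())
    out[f"exp_std_{suffix}"] = counts.transform(lambda s: s.ewm(span=window).std())
    # Assumption: "rate of change" is the percentage change versus `window` rows earlier.
    out[f"rate_of_change_{suffix}"] = counts.transform(
        lambda s: s.pct_change(periods=window) * 100
    )
    return out

# Tiny illustrative sample (made-up values) with a 2-day window.
sample = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04",
             "2022-01-01", "2022-01-02"],
    "station_id": ["A", "A", "A", "A", "B", "B"],
    "users_count": [2, 4, 6, 3, 1, 1],
})
features = add_window_features(sample, window=2)
print(features)
```

Calling the function once per window size (7, 14, 56) would produce a column layout similar to the one shown above.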


In [10]:
cal = USFederalHolidayCalendar()

# Generate roughly 10 years' worth of US federal holiday dates.
start_date = datetime.strptime('2017-01-01', '%Y-%m-%d')
end_date = start_date + timedelta(days=365*10)

holiday_df = pd.DataFrame(cal.holidays(start=start_date, end=end_date), columns=['date'])
holiday_df['date'] = holiday_df['date'].dt.strftime('%Y-%m-%d')

In [None]:
holiday_df.iloc[40:45]

In [39]:
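# Build the lookup set once instead of converting the column to a list for every row.
holiday_dates = set(holiday_df["date"])
df_enhanced["holiday"] = df_enhanced["date"].apply(lambda x: int(str(x) in holiday_dates))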

In [46]:
df_enhanced.head(3)

Unnamed: 0,date,station_id,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,rate_of_change_56_days,holiday
0,2022-01-01,4513.09,6,8.285714,9.071429,10.482143,4.270608,7.978453,4.519115,-50.0,5.511726,9.066067,5.38047,-45.454545,6.367506,10.388788,6.06354,-45.454545,0
1,2022-01-01,4517.03,5,7.285714,8.642857,10.375,4.070802,7.23384,4.154199,-64.285714,5.583039,8.523924,5.210274,-50.0,6.408978,10.172454,6.035566,-72.222222,0
2,2022-01-01,4519.02,12,7.0,8.785714,10.267857,3.559026,8.42538,4.232282,140.0,5.645673,8.987401,5.003103,1100.0,6.32884,10.245452,5.924248,71.428571,0
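
Note that `USFederalHolidayCalendar` reports holidays on their observed dates, so a holiday that falls on a weekend is shifted to the nearest workday. This would explain why the rows dated 2022-01-01 (a Saturday) carry `holiday = 0` above. A small check, using only the pandas calendar already imported in this notebook (the variable names are illustrative):

```python
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()

# Observed federal holidays within the two months loaded in this notebook.
observed = (
    cal.holidays(start="2021-12-01", end="2022-01-31")
    .strftime("%Y-%m-%d")
    .tolist()
)
holiday_dates = set(observed)

# Flag a few dates of interest the same way the "holiday" column is built.
days = ["2021-12-31", "2022-01-01", "2022-01-17"]
is_holiday = {day: int(day in holiday_dates) for day in days}
print(observed)
print(is_holiday)
```

If calendar dates rather than observed dates matter for the model, the flag would need a different holiday source or custom rules.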


---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features. In this case, you will create a single feature group holding the daily Citi Bike user counts per station, keyed by `date` and `station_id`.

In [None]:
citibike_stations_fg = fs.get_or_create_feature_group(
    name="citibike_stations",
    version=1,
    description="Daily Citi Bike user counts per station across NYC.",
    primary_key=["date", "station_id"],
    online_enabled=True
)

In [None]:
# Cast each column to the type the feature group should store.
df["date"] = df["date"].astype(str)
df["station_id"] = df["station_id"].astype(str)
df["users_count"] = df["users_count"].astype(int)
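
The feature group above declares `primary_key=["date", "station_id"]`, so each combination of those two columns is expected to identify one row. Before inserting, it can be worth checking this with plain pandas. The helper below is a hypothetical convenience, not part of the Hopsworks API, and the sample values are made up.

```python
import pandas as pd

def duplicated_keys(df: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Return the rows whose primary-key combination appears more than once."""
    return df[df.duplicated(subset=keys, keep=False)]

# Illustrative data: rows 0 and 2 share the same (date, station_id) pair.
sample = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-01"],
    "station_id": ["4513.09", "4517.03", "4513.09"],
    "users_count": [6, 5, 2],
})
clashes = duplicated_keys(sample, ["date", "station_id"])
print(clashes)
```

On the real data, `duplicated_keys(df, ["date", "station_id"]).empty` should be `True` before `insert` is called.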

In [None]:
df

In [None]:
citibike_stations_fg.insert(df)

---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02 </span>

In the next notebook, you will generate new data for the feature groups.