# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/electricity/2_feature_pipeline.ipynb)

## 🗒️ This notebook is divided into 2 sections:
1. **Parse Data**,
2. **Insert new data into the Feature Store**.

### <span style='color:#ff5f27'> 📝 Imports</span>

In [2]:
!pip install -U hopsworks --quiet




In [3]:
from datetime import timedelta, datetime
import pandas as pd

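# Import only the helpers this notebook uses from the local functions module
from functions import (
    convert_date_to_unix,
    engineer_citibike_features,
    get_citibike_data,
    get_weather_data,
)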

import warnings

# Mute warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [4]:
import hopsworks
project = hopsworks.login()
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3348






In [5]:
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1
)

In [6]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1
)

### <span style="color:#ff5f27;">📅 Getting the last date</span>


In [7]:
last_date = citibike_usage_fg.read()["date"].max()

2022-11-18 22:53:23,988 INFO: USE `test3_featurestore`
2022-11-18 22:53:32,029 INFO: SELECT `fg0`.`date` `date`, `fg0`.`station_id` `station_id`, `fg0`.`users_count` `users_count`, `fg0`.`prev_users_count` `prev_users_count`, `fg0`.`mean_7_days` `mean_7_days`, `fg0`.`mean_14_days` `mean_14_days`, `fg0`.`std_7_days` `std_7_days`, `fg0`.`exp_mean_7_days` `exp_mean_7_days`, `fg0`.`exp_std_7_days` `exp_std_7_days`, `fg0`.`rate_of_change_7_days` `rate_of_change_7_days`, `fg0`.`std_14_days` `std_14_days`, `fg0`.`exp_mean_14_days` `exp_mean_14_days`, `fg0`.`exp_std_14_days` `exp_std_14_days`, `fg0`.`rate_of_change_14_days` `rate_of_change_14_days`, `fg0`.`timestamp` `timestamp`
FROM `test3_featurestore`.`citibike_usage_1` `fg0`


In [8]:
last_date = last_date.split("-")

In [9]:
last_date

['2022', '01', '31']

In [10]:
target_year, target_month = int(last_date[0]), int(last_date[1])

if target_month == 12:
    target_month = 1
    target_year += 1
else:
    target_month += 1

In [12]:
print(f"So, now let's download citibike data for {target_month}/{target_year}")

So, now let's download citibike data for 2/2022
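
The rollover logic above can be wrapped in a small helper so that the December-to-January case lives in one place. This is only a minimal sketch: `next_month` is a hypothetical name and is not part of the tutorial's `functions` module.

```python
def next_month(year: int, month: int) -> tuple[int, int]:
    """Return the (year, month) pair that follows the given month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if month == 12:
        # December rolls over to January of the next year
        return year + 1, 1
    return year, month + 1


# Same input as in the notebook: the last stored date is 2022-01-31
last_date = "2022-01-31"
year, month, _ = (int(part) for part in last_date.split("-"))
target_year, target_month = next_month(year, month)
print(f"{target_month}/{target_year}")  # prints 2/2022
```

Returning a tuple instead of mutating variables in place makes the step easy to test on its own.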


---

## <span style="color:#ff5f27;"> 🪄 Parsing new data</span>

### <span style="color:#ff5f27;"> 🚲 Citibike usage info</span>

In [13]:
# Download the Citibike data for the target month (start and end month are the same)
df_raw_batch = get_citibike_data(f"{target_month}/{target_year}", f"{target_month}/{target_year}")

_____ Processing 02/2022... _____
https://s3.amazonaws.com/tripdata/202202-citibike-tripdata.csv.zip
Retrieving DataFrame from the existing csv file...💿

✅ Done ✅


In [14]:
df_raw_batch

| | date | station_id | users_count |
|---|---|---|---|
| 0 | 2022-02-01 | 2951.05 | 1 |
| 1 | 2022-02-01 | 3169.07 | 2 |
| 2 | 2022-02-01 | 3192.05 | 2 |
| 3 | 2022-02-01 | 3199.01 | 6 |
| 4 | 2022-02-01 | 3208.07 | 3 |
| ... | ... | ... | ... |
| 42051 | 2022-02-28 | 7855.03 | 6 |
| 42052 | 2022-02-28 | 7879.01 | 4 |
| 42053 | 2022-02-28 | 7893.05 | 8 |
| 42054 | 2022-02-28 | 7915.11 | 7 |


In [15]:
df_enhanced_batch = engineer_citibike_features(df_raw_batch)

In [16]:
df_enhanced_batch = df_enhanced_batch.dropna()

In [17]:
df_enhanced_batch.station_id = df_enhanced_batch.station_id.astype(str)

In [18]:
df_enhanced_batch.tail(3)

| | date | station_id | users_count | prev_users_count | mean_7_days | mean_14_days | std_7_days | exp_mean_7_days | exp_std_7_days | rate_of_change_7_days | std_14_days | exp_mean_14_days | exp_std_14_days | rate_of_change_14_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40497 | 2022-02-28 | 8582.09 | 9 | 16.0 | 8.428571 | 6.928571 | 5.940178 | 7.344737 | 5.705006 | 80.0 | 4.937522 | 7.386146 | 5.160604 | 125.0 |
| 40498 | 2022-02-28 | 8616.06 | 3 | 1.0 | 8.142857 | 6.857143 | 6.17599 | 6.258553 | 5.342248 | -75.0 | 4.9901 | 6.801327 | 5.047278 | 50.0 |
| 40499 | 2022-02-28 | 8665.09 | 5 | 4.0 | 7.142857 | 7.071429 | 6.011893 | 5.943914 | 4.663818 | -44.444444 | 4.827235 | 6.56115 | 4.741534 | -54.545455 |
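
The implementation of `engineer_citibike_features` lives in `functions.py` and is not shown in this notebook. The sketch below illustrates, under that caveat, how lagged and rolling columns such as `prev_users_count` and `mean_7_days` could be derived per station with pandas. `add_rolling_features` is a hypothetical helper, and the tutorial's actual formulas may differ.

```python
import pandas as pd


def add_rolling_features(df: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Add a lag column and rolling mean/std of users_count for each station."""
    # ISO date strings sort chronologically, so a plain sort is enough here
    df = df.sort_values(["station_id", "date"]).copy()
    grouped = df.groupby("station_id")["users_count"]
    # Previous observation of the same station
    df["prev_users_count"] = grouped.shift(1)
    # Rolling statistics over the last `window` observations of the same station
    df[f"mean_{window}_days"] = grouped.transform(lambda s: s.rolling(window).mean())
    df[f"std_{window}_days"] = grouped.transform(lambda s: s.rolling(window).std())
    return df


# Tiny example with a 2-row window so the result is easy to check by hand
sample = pd.DataFrame(
    {
        "date": ["2022-02-01", "2022-02-02", "2022-02-03"],
        "station_id": ["A", "A", "A"],
        "users_count": [1, 3, 5],
    }
)
features = add_rolling_features(sample, window=2)
print(features[["date", "prev_users_count", "mean_2_days"]])
```

The first rows of every station have no full window, so the rolling columns start with `NaN`. That is why a `dropna()` step is useful before inserting the data.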


### <span style="color:#ff5f27;"> 🌤 Meteorological measurements from VisualCrossing</span>

### To fetch weather data, you need an API key from [VisualCrossing](https://www.visualcrossing.com/). You can get one on the [weather API page](https://www.visualcrossing.com/weather-api).

### Store the API key in a `.env` configuration file, which holds all the environment variables the notebook needs:
![](images/api_keys_env_file.png)
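
How the tutorial's helpers read the `.env` file is not shown here. As an illustration only, the sketch below parses simple `KEY=VALUE` lines with the standard library and exports them to the process environment. `load_env_file` and the variable name `WEATHER_API_KEY` are placeholders, not names taken from the tutorial; a dedicated package such as `python-dotenv` handles more edge cases.

```python
import os
import tempfile


def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file and export them to os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value
            loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return loaded


# Demo with a throwaway file; WEATHER_API_KEY is a placeholder name
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as env_file:
        env_file.write("# API keys\nWEATHER_API_KEY='demo-key'\n")
    settings = load_env_file(env_path)

print(settings["WEATHER_API_KEY"])  # prints demo-key
```

Keeping keys in a `.env` file that is excluded from version control avoids hard-coding secrets in the notebook.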

In [19]:
df_enhanced_batch.date = df_enhanced_batch.date.astype(str)

start_date, end_date_ = df_enhanced_batch.date.min(), df_enhanced_batch.date.max()

# Also fetch weather data for the next 5 days (the app will predict users_count for those days)
end_date = datetime.strptime(end_date_, "%Y-%m-%d") + timedelta(days=5)
end_date = datetime.strftime(end_date, "%Y-%m-%d")
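
The end-date arithmetic above can be expressed as one reusable function. This is a minimal sketch: `extend_end_date` and `DATE_FORMAT` are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"


def extend_end_date(end_date: str, horizon_days: int = 5) -> str:
    """Shift a YYYY-MM-DD date string forward by `horizon_days` days."""
    shifted = datetime.strptime(end_date, DATE_FORMAT) + timedelta(days=horizon_days)
    return shifted.strftime(DATE_FORMAT)


# The February batch ends on 2022-02-28, so the weather window ends 5 days later
print(extend_end_date("2022-02-28"))  # prints 2022-03-05
```

Because `YYYY-MM-DD` strings sort in chronological order, calling `min()` and `max()` on the string-typed `date` column returns the earliest and latest dates.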

In [20]:
df_weather_batch = get_weather_data(city="nyc", start_date=start_date, end_date=end_date)

In [21]:
df_weather_batch.tail(5)

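| | date | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | snow | snowdepth | windspeed | visibility |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 2022-03-01 | 7.9 | 0.6 | 4.4 | 6.7 | -2.8 | 2.0 | -3.4 | 58.2 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 20.2 | 15.9 |
| 29 | 2022-03-02 | 12.1 | 5.0 | 7.9 | 12.1 | 3.1 | 6.8 | -1.2 | 55.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 16.3 | 16.0 |
| 30 | 2022-03-03 | 7.1 | -3.2 | 4.6 | 5.5 | -9.6 | 2.2 | -6.8 | 46.5 | 0.501 | 100 | 12.5 | 0.0 | 0.0 | 22.5 | 16.0 |
| 31 | 2022-03-04 | 4.2 | -5.1 | -0.4 | 0.9 | -10.5 | -4.1 | -12.0 | 42.2 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 16.5 | 15.9 |
| 32 | 2022-03-05 | 7.8 | 0.7 | 4.1 | 7.0 | -2.3 | 2.3 | -5.2 | 51.7 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 16.4 | 16.0 |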


In [22]:
# Cast the snow columns to float so their types stay consistent across batches
for column in ["snowdepth", "snow"]:
    df_weather_batch[column] = df_weather_batch[column].astype("double")

In [23]:
# Create Unix timestamp columns from the date strings

df_enhanced_batch["timestamp"] = df_enhanced_batch["date"].apply(convert_date_to_unix)
df_weather_batch["timestamp"] = df_weather_batch["date"].apply(convert_date_to_unix)
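
`convert_date_to_unix` is defined in `functions.py` and is not shown in this notebook. The sketch below shows one way such a conversion could work. It assumes midnight UTC and millisecond resolution, both of which are assumptions rather than confirmed behavior of the tutorial's helper, and `date_to_unix_ms` is a hypothetical name.

```python
from datetime import datetime, timezone


def date_to_unix_ms(date_str: str) -> int:
    """Convert a YYYY-MM-DD string to milliseconds since the Unix epoch (midnight UTC)."""
    # Attach UTC explicitly so the result does not depend on the local time zone
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


print(date_to_unix_ms("2022-02-28"))  # prints 1646006400000
```

Pinning the time zone keeps the timestamps identical no matter which machine runs the pipeline.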

---

## <span style="color:#ff5f27;">⬆️ Uploading new data to the Feature Store</span>

In [24]:
citibike_usage_fg.insert(df_enhanced_batch, write_options={"wait_for_job": False})

Uploading Dataframe: 0.00% |          | Rows 0/41056 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/3348/jobs/named/citibike_usage_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x2b1de244f10>, None)

In [25]:
meteorological_measurements_fg.insert(df_weather_batch, write_options={"wait_for_job": False})

Uploading Dataframe: 0.00% |          | Rows 0/33 | Elapsed Time: 00:00 | Remaining Time: ?

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/3348/jobs/named/meteorological_measurements_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x2b1de113b50>, None)

---

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03 </span>

In the next notebook, you will create a feature view and training dataset.