# <span style="font-width:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-width:bold; font-size: 3rem; color:#333;">- Part 04: Model training & UI Exploration</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/electricity/4_model_training_and_registration.ipynb)


## 🗒️ This notebook is divided into 3 main sections:
1. **Loading the training data**
2. **Train the model**
3. **Register model to Hopsworks model registry**.

![tutorial-flow](https://github.com/logicalclocks/hopsworks-tutorials/blob/master/images/03_model.png?raw=1)

### <span style="color:#ff5f27;">📝 Importing Libraries</span>

In [6]:
!pip uninstall hopsworks -y

Found existing installation: hopsworks 3.1.0.dev1
Uninstalling hopsworks-3.1.0.dev1:
  Successfully uninstalled hopsworks-3.1.0.dev1


In [7]:
!pip install -U hopsworks --quiet


[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [10]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings('ignore')

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [3]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/3348




Connected. Call `.close()` to terminate connection gracefully.


---

## <span style="color:#ff5f27;">🪝 Feature View and Training Dataset Retrieval</span>

In [3]:
feature_view = fs.get_feature_view(
    name = 'citibike_feature_view',
    version = 1
)

In [54]:
X_train, _ = feature_view.get_training_data(1)
X_test, _ = feature_view.get_training_data(2)

In [55]:
X_train.head(3)

Unnamed: 0,date,station_id,users_count,users_count_next_day,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,...,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,moonphase
0,2022-01-01,3978.13,0.007463,0.093567,0.021918,0.018256,0.011468,0.017237,0.016381,0.013195,...,,0.063291,0.378763,0.430913,1.0,0.0,0.068585,0.033333,0.0,0.99
1,2022-01-01,5145.02,0.0,0.163743,0.023288,0.028398,0.023144,0.026667,0.01587,0.022766,...,,0.063291,0.378763,0.430913,1.0,0.0,0.068585,0.033333,0.0,0.99
2,2022-01-01,5335.07,0.003731,0.078947,0.073973,0.072346,0.058382,0.078278,0.050418,0.05401,...,,0.063291,0.378763,0.430913,1.0,0.0,0.068585,0.033333,0.0,0.99


In [56]:
X_train = X_train.drop(columns=["station_id", "timestamp"])
X_test = X_test.drop(columns=["station_id", "timestamp"])

In [57]:
X_train.shape, X_test.shape

((27403, 42), (13655, 42))

In [58]:
X_train = X_train.set_index("date")
X_test = X_test.set_index("date")

In [59]:
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

In [60]:
y_train, y_test = X_train.pop("users_count_next_day"), X_test.pop("users_count_next_day")

---

## <span style="color:#ff5f27;">🧬 Modeling</span>

In [61]:
X_train

Unnamed: 0_level_0,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,...,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,moonphase
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2022-01-01,0.007463,0.021918,0.018256,0.011468,0.017237,0.016381,0.013195,0.003882,0.026747,0.017730,...,0.000000,0.063291,0.378763,0.430913,1.000000,0.000000,0.068585,0.033333,0.00,0.99
2022-01-01,0.000000,0.023288,0.028398,0.023144,0.026667,0.015870,0.022766,0.000929,0.040107,0.021810,...,0.000000,0.063291,0.378763,0.430913,1.000000,0.000000,0.068585,0.033333,0.00,0.99
2022-01-01,0.003731,0.073973,0.072346,0.058382,0.078278,0.050418,0.054010,0.000601,0.079996,0.060020,...,0.000000,0.063291,0.378763,0.430913,1.000000,0.000000,0.068585,0.033333,0.00,0.99
2022-01-01,0.011194,0.050685,0.056795,0.070892,0.041171,0.042281,0.033153,0.010444,0.059107,0.049132,...,0.000000,0.063291,0.378763,0.430913,1.000000,0.000000,0.068585,0.033333,0.00,0.99
2022-01-01,0.011194,0.049315,0.051386,0.067556,0.050375,0.038621,0.041405,0.003444,0.053226,0.048102,...,0.000000,0.063291,0.378763,0.430913,1.000000,0.000000,0.068585,0.033333,0.00,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-20,0.014925,0.043836,0.067613,0.052335,0.041641,0.053694,0.040957,0.019631,0.089942,0.059687,...,0.601671,0.451477,0.900502,0.711944,0.672096,0.534247,0.094005,0.233333,0.25,0.53
2022-01-20,0.160448,0.250685,0.216362,0.453920,0.316037,0.243502,0.270388,0.069240,0.318177,0.256733,...,0.601671,0.451477,0.900502,0.711944,0.672096,0.534247,0.094005,0.233333,0.25,0.53
2022-01-20,0.018657,0.076712,0.062880,0.117807,0.071207,0.070506,0.065018,0.004240,0.086922,0.073096,...,0.601671,0.451477,0.900502,0.711944,0.672096,0.534247,0.094005,0.233333,0.25,0.53
2022-01-20,0.011194,0.083562,0.077755,0.112594,0.078928,0.077789,0.066680,0.015693,0.091510,0.080744,...,0.601671,0.451477,0.900502,0.711944,0.672096,0.534247,0.094005,0.233333,0.25,0.53


In [88]:
y_train

date
2022-01-01    0.093567
2022-01-01    0.163743
2022-01-01    0.078947
2022-01-01    0.175439
2022-01-01    0.289474
                ...   
2022-01-20    0.125731
2022-01-20    0.105263
2022-01-20    0.055556
2022-01-20    0.017544
2022-01-20    0.008772
Name: users_count_next_day, Length: 27403, dtype: float64

In [448]:
df_raw = feature_view.query.read()

RestAPIError: Metadata operation error: (url: https://c.app.hopsworks.ai/hopsworks-api/api/project/3348/featurestores/query). Server response: 
HTTP code: 404, HTTP reason: Not Found, body: b'{"type":"restApiJsonResponse","errorCode":270009,"errorMsg":"Featuregroup wasn\'t found.","usrMsg":"Could not find feature group with ID 3747"}', error code: 270009, error msg: Featuregroup wasn't found., user msg: Could not find feature group with ID 3747

In [272]:
df_raw = df_raw[['date', 'station_id', 'users_count', 'users_count_next_day',
       'mean_7_days', 'mean_14_days', 'mean_56_days', 'std_7_days',
       'exp_mean_7_days', 'exp_std_7_days', 'rate_of_change_7_days',
       'std_14_days', 'exp_mean_14_days', 'exp_std_14_days',
       'rate_of_change_14_days', 'std_56_days', 'exp_mean_56_days',
       'exp_std_56_days', 'rate_of_change_56_days', 'timestamp', 'holiday',]]

In [273]:
df_raw = df_raw.drop("users_count_next_day", axis=1)

In [435]:
df_raw.station_id

0        6955.05
1        4522.07
2        7323.09
3        4157.10
4        7340.07
          ...   
41827    4565.04
41828    7567.06
41829    7652.04
41830    7727.08
41831    8016.07
Name: station_id, Length: 41832, dtype: object

In [440]:
df_raw

Unnamed: 0,date,station_id,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,rate_of_change_56_days,timestamp,holiday
0,2022-01-01,6955.05,15,6.285714,8.071429,5.053571,4.498677,7.678035,5.735364,87.500000,7.151808,6.923428,5.874008,50.000000,5.306667,5.673582,5.652998,1400.000000,1640991600000,0
1,2022-01-01,4522.07,3,2.714286,3.500000,3.107143,1.976047,2.826627,2.162340,50.000000,3.006403,3.221213,2.554362,200.000000,2.371640,3.234615,2.505084,200.000000,1640991600000,0
2,2022-01-01,7323.09,13,6.571429,6.000000,7.160714,4.540820,7.189504,5.231968,333.333333,4.819831,6.724708,5.295011,-7.142857,7.717222,6.799611,6.742447,550.000000,1640991600000,0
3,2022-01-01,4157.10,2,1.857143,2.071429,3.053571,1.069045,1.981891,1.265917,100.000000,1.774360,2.222955,1.740199,100.000000,2.407712,2.688131,2.208589,100.000000,1640991600000,0
4,2022-01-01,7340.07,41,30.428571,25.142857,22.053571,24.979039,30.220189,23.033836,156.250000,19.759544,26.967963,21.591744,17.142857,21.328956,21.638513,20.146736,-49.382716,1640991600000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41827,2022-01-31,4565.04,11,9.428571,13.142857,10.375000,7.546680,9.029527,7.331266,-52.173913,10.189415,10.480827,8.421714,-52.173913,6.984561,10.601686,7.635587,450.000000,1643583600000,0
41828,2022-01-31,7567.06,14,13.428571,18.928571,20.910714,9.744351,16.646423,9.978066,16.666667,10.702008,17.823263,10.441576,-6.666667,12.164444,20.367135,13.035362,-68.181818,1643583600000,0
41829,2022-01-31,7652.04,8,10.714286,13.785714,18.089286,8.323804,11.886505,7.068570,-65.217391,8.275809,13.441627,8.157040,-11.111111,9.607340,17.450666,11.284154,-63.636364,1643583600000,0
41830,2022-01-31,7727.08,13,15.714286,17.000000,17.089286,9.673233,17.085410,9.653280,550.000000,12.076678,16.690244,10.651417,-13.333333,10.267826,17.307993,11.385533,-27.777778,1643583600000,0


In [441]:
df_raw.station_id.value_counts()

6955.05    31
7365.08    31
7009.02    31
6170.02    31
4651.02    31
           ..
8167.04     4
8539.02     3
7365.13     3
7132.08     2
8358.10     1
Name: station_id, Length: 1589, dtype: int64

In [463]:
10168 / 328

31.0

In [462]:
len(popular_stations)

328

In [460]:
popular_stations = df_raw.station_id.value_counts()[df_raw.station_id.value_counts() == 31].index
df_raw[df_raw.station_id.isin(popular_stations)]

In [461]:
df_raw[df_raw.station_id.isin(popular_stations)]

Unnamed: 0,date,station_id,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,rate_of_change_56_days,timestamp,holiday
0,2022-01-01,6955.05,15,6.285714,8.071429,5.053571,4.498677,7.678035,5.735364,87.500000,7.151808,6.923428,5.874008,50.000000,5.306667,5.673582,5.652998,1400.000000,1640991600000,0
5,2022-01-01,5696.03,17,8.714286,7.714286,7.875000,6.575568,8.271906,7.272677,41.666667,6.568322,8.145047,6.605256,-15.000000,6.132069,7.805060,6.132698,112.500000,1640991600000,0
12,2022-01-01,5282.02,2,17.142857,16.214286,15.142857,11.922368,13.142150,11.202670,-88.888889,9.022865,15.339972,10.820457,-88.888889,11.168834,15.232542,10.743395,100.000000,1640991600000,0
23,2022-01-01,6920.03,21,9.857143,6.428571,4.607143,8.706866,11.050084,9.578961,250.000000,7.002354,8.248478,8.310821,250.000000,5.062082,5.584623,6.125254,425.000000,1640991600000,0
24,2022-01-01,6206.08,10,8.285714,7.785714,6.267857,6.701102,7.827800,6.682053,-37.500000,6.530150,7.322278,6.445616,0.000000,5.086492,6.453206,5.573984,-16.666667,1640991600000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41824,2022-01-31,5886.02,27,23.571429,28.214286,28.107143,26.181600,30.138393,18.652345,107.692308,19.407289,29.902179,20.102704,42.105263,17.683766,28.531301,18.742729,12.500000,1643583600000,0
41825,2022-01-31,6925.09,54,23.857143,27.071429,27.017857,15.704109,35.732708,18.177360,10.204082,19.699529,32.073136,19.572537,170.000000,20.539533,29.260565,21.389082,74.193548,1643583600000,0
41827,2022-01-31,4565.04,11,9.428571,13.142857,10.375000,7.546680,9.029527,7.331266,-52.173913,10.189415,10.480827,8.421714,-52.173913,6.984561,10.601686,7.635587,450.000000,1643583600000,0
41828,2022-01-31,7567.06,14,13.428571,18.928571,20.910714,9.744351,16.646423,9.978066,16.666667,10.702008,17.823263,10.441576,-6.666667,12.164444,20.367135,13.035362,-68.181818,1643583600000,0


In [469]:
from calendar import monthrange
monthrange(2022, 1)

(5, 31)

df = df_raw.copy()

In [402]:
from functions import *
from datetime import datetime, timedelta

In [403]:
df = df.reset_index()

In [456]:
end_date_, end_date

('2022-01-31', '2022-02-01')

In [404]:
start_date, end_date_ = df.date.min(), df.date.max()

end_date = datetime.strptime(end_date_, "%Y-%m-%d") + timedelta(days=1)
end_date = datetime.strftime(end_date, "%Y-%m-%d")

In [405]:
df_weather_raw = get_weather_data(city="nyc",
                              start_date=start_date,
                              end_date=end_date)

In [406]:
station_ids = df.station_id.unique()

In [407]:
df_weather = df_weather_raw.loc[df_weather_raw.index.repeat(len(station_ids))]

In [408]:
df_weather["station_id"] = [0] * df_weather.shape[0]

In [409]:
for i in range(df_weather_raw.shape[0]):
    df_weather.loc[i, "station_id"] = station_ids

In [410]:
df = df.set_index(["date", "station_id"])
df_weather = df_weather.set_index(["date", "station_id"])

In [411]:
df_joined = df.join(df_weather, how="right")

In [412]:
df_joined = df_joined.reset_index()

In [413]:
df_joined = df_joined.fillna(-1)

In [414]:
df_joined = df_joined.sort_values(by=["date", "station_id"])

In [415]:
df_joined = df_joined.drop("index", axis=1)

In [416]:
cols_to_duplicate = ['mean_7_days', 'mean_14_days',
                     'mean_56_days', 'std_7_days', 'exp_mean_7_days', 'exp_std_7_days',
                     'rate_of_change_7_days', 'std_14_days', 'exp_mean_14_days',
                     'exp_std_14_days', 'rate_of_change_14_days', 'std_56_days',
                     'exp_mean_56_days', 'exp_std_56_days', 'rate_of_change_56_days']

In [418]:
df_joined

Unnamed: 0,date,station_id,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,...,windgust,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,moonphase
1471,2022-01-01,2733.03,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,13.1,136.1,1008.3,100.0,8.7,33.1,1.4,1,0.99
412,2022-01-01,2782.02,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,13.1,136.1,1008.3,100.0,8.7,33.1,1.4,1,0.99
743,2022-01-01,2832.03,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,13.1,136.1,1008.3,100.0,8.7,33.1,1.4,1,0.99
402,2022-01-01,2872.02,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,13.1,136.1,1008.3,100.0,8.7,33.1,1.4,1,0.99
311,2022-01-01,2883.03,3.0,1.428571,1.357143,2.285714,0.786796,1.649831,1.018762,200.0,...,0.0,13.1,136.1,1008.3,100.0,8.7,33.1,1.4,1,0.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50317,2022-02-01,8795.01,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,33.5,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.00
49785,2022-02-01,8795.03,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,33.5,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.00
49323,2022-02-01,8799.01,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,33.5,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.00
50463,2022-02-01,8811.01,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,33.5,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.00


In [419]:
previous_date = datetime.strptime(end_date, "%Y-%m-%d") - timedelta(days=1)
previous_date = datetime.strftime(previous_date, "%Y-%m-%d")

In [420]:
previous_date

'2022-01-31'

In [421]:
end_date

'2022-02-01'

In [422]:
df_joined[df_joined.date == previous_date].shape

(1589, 43)

In [423]:
df_joined[df_joined.date == end_date][cols_to_duplicate] = df_joined[df_joined.date == previous_date][cols_to_duplicate]

In [424]:
df_joined = df_joined.dropna(0.).reset_index(drop=True)

In [425]:
df_joined.shape

(50848, 43)

In [389]:
X = df_joined.drop(columns=["station_id", 'timestamp'])

In [390]:
X = X.set_index("date")

In [457]:
X

Unnamed: 0_level_0,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,...,windspeed,winddir,sealevelpressure,cloudcover,visibility,solarradiation,solarenergy,uvindex,moonphase,prev_users_count
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2022-01-02,1.571429,1.357143,3.196429,0.786796,1.586753,0.803304,-33.333333,0.633324,1.759843,1.458078,...,22.9,308.9,1004.7,91.5,11.8,53.5,2.2,2,1.0,0.0
2022-01-02,1.428571,1.428571,3.178571,0.534522,1.690064,0.722031,100.000000,0.646206,1.791864,1.360038,...,22.9,308.9,1004.7,91.5,11.8,53.5,2.2,2,1.0,0.0
2022-01-02,1.428571,1.428571,3.017857,0.534522,1.517548,0.703678,0.000000,0.646206,1.686282,1.296576,...,22.9,308.9,1004.7,91.5,11.8,53.5,2.2,2,1.0,0.0
2022-01-02,1.571429,1.428571,3.000000,0.534522,1.638161,0.649837,100.000000,0.646206,1.728111,1.212109,...,22.9,308.9,1004.7,91.5,11.8,53.5,2.2,2,1.0,0.0
2022-01-02,3.142857,2.071429,1.803571,2.968084,2.733510,3.015849,0.000000,2.302650,2.500863,2.590257,...,22.9,308.9,1004.7,91.5,11.8,53.5,2.2,2,1.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-02-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.0,0.0
2022-02-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.0,0.0
2022-02-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.0,0.0
2022-02-01,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,18.1,41.6,1033.7,31.4,16.0,139.1,12.1,6,0.0,0.0


In [391]:
import numpy as np
from sklearn.model_selection import train_test_split

In [392]:
y = X.pop("users_count")

In [393]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [394]:
from scipy.stats import pearsonr
from pprint import pprint

feature_target_corr = list()
for col in X_train.columns:
    feature_target_corr.append([col, abs(pearsonr(X_train[col], y_train)[0])])
        
pprint("Feature-Target Correlations")
pprint(sorted(feature_target_corr, key=lambda x: x[1], reverse=True))

'Feature-Target Correlations'
[['exp_mean_7_days', 0.7471939602425101],
 ['exp_mean_14_days', 0.679958328614775],
 ['mean_7_days', 0.6573064922838624],
 ['mean_14_days', 0.637096492477201],
 ['exp_std_7_days', 0.6268562777526233],
 ['exp_std_14_days', 0.6073197442537865],
 ['exp_mean_56_days', 0.6059751087966002],
 ['std_7_days', 0.5944700907472203],
 ['mean_56_days', 0.5887600322646905],
 ['std_14_days', 0.5871950824732397],
 ['prev_users_count', 0.5783492347014236],
 ['exp_std_56_days', 0.5663302267463501],
 ['std_56_days', 0.5613342807769244],
 ['rate_of_change_56_days', 0.41532455885769826],
 ['rate_of_change_7_days', 0.4092688835804321],
 ['rate_of_change_14_days', 0.4072993984383524],
 ['snowdepth', 0.1249591279400063],
 ['feelslike', 0.09260346587406912],
 ['feelslikemin', 0.0895750291978298],
 ['precipcover', 0.08465069513588683],
 ['temp', 0.08246706319972597],
 ['tempmin', 0.07952466594982291],
 ['feelslikemax', 0.0787832371804175],
 ['visibility', 0.07033674306447546],
 ['pr

In [395]:
X_train = X_train.drop(columns=['moonphase', 'cloudcover', 'sealevelpressure', 'solarradiation',
                                'winddir', 'uvindex', 'solarenergy'])

X_test = X_test.drop(columns=['moonphase', 'cloudcover', 'sealevelpressure', 'solarradiation',
                              'winddir', 'uvindex', 'solarenergy'])

In [396]:
X_train

Unnamed: 0_level_0,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,...,humidity,precip,precipprob,precipcover,snow,snowdepth,windgust,windspeed,visibility,prev_users_count
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2022-01-10,5.428571,4.571429,8.857143,4.117327,6.247909,4.684291,233.333333,3.994502,6.289701,5.963252,...,47.0,0.000,0,0.0,0.0,0.6,49.8,33.6,15.9,23.0
2022-01-10,12.714286,11.928571,9.125000,9.961832,14.096159,10.528376,525.000000,8.713070,12.545448,9.552091,...,47.0,0.000,0,0.0,0.0,0.6,49.8,33.6,15.9,8.0
2022-01-18,4.000000,2.785714,4.375000,3.511885,3.103944,3.163213,-40.000000,2.805998,3.272898,3.133178,...,46.6,0.000,0,0.0,0.0,0.0,53.0,28.8,16.0,2.0
2022-01-15,6.571429,6.428571,7.303571,3.952094,7.905576,4.834522,225.000000,5.666559,7.197726,5.357901,...,45.3,0.000,0,0.0,0.0,0.0,41.8,22.4,16.0,14.0
2022-01-21,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,42.7,0.000,0,0.0,0.0,0.0,42.5,21.9,16.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-09,2.571429,3.357143,3.678571,1.812654,3.261938,2.857009,0.000000,3.894911,3.627884,3.414258,...,66.8,2.765,100,25.0,0.0,7.1,42.5,22.3,14.8,2.0
2022-01-30,5.714286,5.857143,4.446429,4.070802,6.248159,4.154539,9.090909,3.460928,5.629488,3.633455,...,45.5,0.000,0,0.0,1.8,21.0,53.6,29.7,16.0,1.0
2022-01-26,1.714286,1.857143,2.482143,1.253566,1.773689,1.111619,-25.000000,1.027105,1.880985,1.228503,...,37.9,0.000,0,0.0,0.0,0.0,40.7,24.1,16.0,5.0
2022-01-02,13.285714,10.428571,9.428571,10.827654,10.276723,9.147150,-57.692308,9.740591,10.450185,9.571550,...,83.8,2.318,100,12.5,0.0,0.0,49.5,22.9,11.8,8.0


In [397]:
y_train

date
2022-01-10    10.0
2022-01-10    25.0
2022-01-18     3.0
2022-01-15    13.0
2022-01-21     0.0
              ... 
2022-01-09     2.0
2022-01-30    12.0
2022-01-26     3.0
2022-01-02    11.0
2022-01-11     0.0
Name: users_count, Length: 39407, dtype: float64

In [398]:
# NEW DATASET, NEW IDEA!!!!
import xgboost as xg

xgb = xg.XGBRegressor()
 
# Fitting the model
xgb.fit(X_train, y_train)
 
# Predict the model
pred_xgb = xgb.predict(X_test)
 
r2_xgb = r2_score(pred_xgb, y_test.values)
print("R2 score for XGBoost model:", r2_xgb)

R2 score for XGBoost model: 0.9261811172522584


In [82]:
X_train_v2 = X_train.copy()
X_test_v2 = X_test.copy()

cols = ['solarradiation',
        'snowdepth',
        'tempmin',
        'uvindex',
        'cloudcover',
        'feelslikemin',
        'windgust',
        'precip',
        'feelslikemax',
        'precipcover',
        'feelslike', 
        'temp',
        'tempmax', 
        'dew',
        'solarenergy', 
        'holiday',
        'sealevelpressure',
        'humidity',
        'precipprob',
        'moonphase',
        'visibility',
        'winddir',
        'windspeed',
        ]


X_train_v2 = X_train_v2.drop(columns=cols)
X_test_v2 = X_test_v2.drop(columns=cols)

In [83]:
X_train_v2["_new_feature_1"] = X_train_v2["users_count"] * X_train_v2["exp_mean_7_days"] * X_train_v2["exp_mean_14_days"] * 10000
X_train_v2["_new_feature_2"] = X_train_v2["users_count"] * X_train_v2["std_7_days"] * X_train_v2["exp_std_7_days"] * 10000
X_train_v2["_new_feature_3"] = X_train_v2["users_count"] ** 2

X_test_v2["_new_feature_1"] = X_test_v2["users_count"] * X_test_v2["exp_mean_7_days"] * X_test_v2["exp_mean_14_days"] * 10000
X_test_v2["_new_feature_2"] = X_test_v2["users_count"] * X_test_v2["std_7_days"] * X_test_v2["exp_std_7_days"] * 10000
X_test_v2["_new_feature_3"] = X_test_v2["users_count"] ** 2

In [84]:
X_train_v2

Unnamed: 0_level_0,users_count,mean_7_days,mean_14_days,mean_56_days,std_7_days,exp_mean_7_days,exp_std_7_days,rate_of_change_7_days,std_14_days,exp_mean_14_days,exp_std_14_days,rate_of_change_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,rate_of_change_56_days,snow,_new_feature_1,_new_feature_2,_new_feature_3
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
2022-01-01,0.007463,0.021918,0.018256,0.011468,0.017237,0.016381,0.013195,0.003882,0.026747,0.017730,0.017575,0.018157,0.021896,0.013221,0.019436,0.007187,0.0,0.021674,0.016974,0.000056
2022-01-01,0.000000,0.023288,0.028398,0.023144,0.026667,0.015870,0.022766,0.000929,0.040107,0.021810,0.028996,0.006036,0.033018,0.022436,0.031044,0.003564,0.0,0.000000,0.000000,0.000000
2022-01-01,0.003731,0.073973,0.072346,0.058382,0.078278,0.050418,0.054010,0.000601,0.079996,0.060020,0.069805,0.000907,0.092783,0.056472,0.090466,0.014434,0.0,0.112914,0.157752,0.000014
2022-01-01,0.011194,0.050685,0.056795,0.070892,0.041171,0.042281,0.033153,0.010444,0.059107,0.049132,0.046289,0.002399,0.086950,0.062278,0.074802,0.009603,0.0,0.232541,0.152792,0.000125
2022-01-01,0.011194,0.049315,0.051386,0.067556,0.050375,0.038621,0.041405,0.003444,0.053226,0.048102,0.049775,0.008056,0.077023,0.061523,0.070161,0.002356,0.0,0.207954,0.233482,0.000125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-01-20,0.014925,0.043836,0.067613,0.052335,0.041641,0.053694,0.040957,0.019631,0.089942,0.059687,0.061372,0.001352,0.076151,0.054082,0.077164,0.036175,0.0,0.478331,0.254551,0.000223
2022-01-20,0.160448,0.250685,0.216362,0.453920,0.316037,0.243502,0.270388,0.069240,0.318177,0.256733,0.337064,0.005204,0.502298,0.406609,0.522237,0.007532,0.0,100.303796,137.106480,0.025743
2022-01-20,0.018657,0.076712,0.062880,0.117807,0.071207,0.070506,0.065018,0.004240,0.086922,0.073096,0.089140,0.009066,0.153360,0.105410,0.151138,0.014434,0.0,0.961511,0.863761,0.000348
2022-01-20,0.011194,0.083562,0.077755,0.112594,0.078928,0.077789,0.066680,0.015693,0.091510,0.080744,0.085697,0.002179,0.147081,0.103122,0.142590,0.005738,0.0,0.703095,0.589136,0.000125


In [92]:
from scipy.stats import pearsonr
from pprint import pprint

feature_target_corr = list()
for col in X_train.columns:
    feature_target_corr.append([col, abs(pearsonr(X_train[col], y_train)[0])])
        
pprint("Feature-Target Correlations")
pprint(sorted(feature_target_corr, key=lambda x: x[1], reverse=True))

'Feature-Target Correlations'
[['users_count', 0.484838755546809],
 ['exp_mean_7_days', 0.33090971525663027],
 ['exp_mean_14_days', 0.2889110790436145],
 ['exp_std_7_days', 0.28750609835650315],
 ['mean_7_days', 0.2770132803618143],
 ['exp_std_14_days', 0.2705103810170293],
 ['std_7_days', 0.26939764933580657],
 ['mean_14_days', 0.26316106022079644],
 ['rate_of_change_56_days', 0.2604939940793951],
 ['rate_of_change_7_days', 0.25871175968001636],
 ['std_14_days', 0.25720450204289036],
 ['rate_of_change_14_days', 0.24758005895029708],
 ['exp_mean_56_days', 0.23400861456968924],
 ['std_56_days', 0.22857572148252858],
 ['mean_56_days', 0.2261502002454918],
 ['exp_std_56_days', 0.22169212090935506],
 ['snow', 0.09158764113593844],
 ['solarradiation', 0.07650508203675369],
 ['snowdepth', 0.06781787165734736],
 ['tempmin', 0.056931257408618756],
 ['uvindex', 0.05542335572244451],
 ['cloudcover', 0.054962644625736495],
 ['feelslikemin', 0.05292671608100678],
 ['windgust', 0.04956708453901877]

In [86]:
# Modified datafiles
import xgboost as xg

xgb = xg.XGBRegressor()
 
# Fitting the model
xgb.fit(X_train_v2, y_train)
 
# Predict the model
pred_xgb = xgb.predict(X_test_v2)
 
r2_xgb = r2_score(pred_xgb, y_test.values)
print("R2 score for XGBoost model:", r2_xgb)

R2 score for XGBoost model: -0.6847509298367318


In [15]:
# Original datafiles
import xgboost as xg

xgb = xg.XGBRegressor()
 
# Fitting the model
xgb.fit(X_train, y_train)
 
# Predict the model
pred_xgb = xgb.predict(X_test)
 
r2_xgb = r2_score(pred_xgb, y_test.values)
print("R2 score for XGBoost model:", r2_xgb)

R2 score for XGBoost model: -1.2494950375659957


In [16]:
pred_xgb

array([0.04905562, 0.0651885 , 0.08789065, ..., 0.01189198, 0.01578065,
       0.01672888], dtype=float32)

In [17]:
y_test.values

array([0.10507246, 0.03985507, 0.09782609, ..., 0.07971014, 0.00724638,
       0.04347826])

In [13]:
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(random_state=1, max_iter=100, activation="logistic", solver="adam",
                    learning_rate_init=0.01).fit(X_train, y_train)
pred_mlp = mlp.predict(X_test)

r2_mlp = r2_score(pred_mlp, y_test.values)
print("R2 score for SVR model:", r2_mlp)

R2 score for SVR model: -0.4536569879282468


### Remember, the purpose of this tutorial is to show your Hopsworks Feature Store functionality, so the perfect model is not our goal.

---

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/machine-learning-api/latest/generated/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

In [None]:
import joblib

joblib.dump(mlp, 'model.pkl')

In [None]:
model = mr.sklearn.create_model(
    name="citibike_mlp_model",
    version=1,
    metrics={"r2_score": r2_mlp},
    description="MLPRegressor. Citibike Project.",
    input_example=X_train.sample(),
    model_schema=model_schema
)

model.save('model.pkl')