<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill for Bikes Prediction</span>


### <span style='color:#ff5f27'> 📝 Imports </span>

In [1]:
import datetime
import requests
import pandas as pd
import hopsworks
from pathlib import Path
from functions import util
import json
import re
import os
import warnings
import holidays
warnings.filterwarnings("ignore")



---

In [2]:
csv_file="../../data/bikes_oct.csv"
util.check_file_path(csv_file)

File successfully found at the path: ../../data/bikes_oct.csv


In [3]:
station_id =42

bikes_url = 'https://opendata-ajuntament.barcelona.cat/data/dataset/estat-estacions-bicing/resource/1b215493-9e63-4a12-8980-2d7e0fa19f85/download/recurs.json'

# Station 42 latitude and longitude
latitude = "41.404511"
longitude = "2.189881"
city = "Barcelona"

today = datetime.date.today()

In [4]:
bicing_api_key_file = '../../data/bicing-api-key.txt'
util.check_file_path(bicing_api_key_file)

with open(bicing_api_key_file, 'r') as file:
    BICING_API_KEY = file.read().rstrip()

File successfully found at the path: ../../data/bicing-api-key.txt


## Hopsworks API Key
You need a registered account on app.hopsworks.ai.
You will be prompted to enter your API key here unless it is already set as the environment variable HOPSWORKS_API_KEY (my preferred approach).

In [5]:
with open('../../data/hopsworks-api-key.txt', 'r') as file:
    os.environ["HOPSWORKS_API_KEY"] = file.read().rstrip()
    
project = hopsworks.login(project="juls_first_project")

2025-01-07 09:59:33,128 INFO: Initializing external client
2025-01-07 09:59:33,129 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-07 09:59:34,372 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1164440
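
As a minimal sketch of the environment-variable approach described above, the helper below prefers a key that is already exported and only falls back to a local file when it is missing. The function name `load_api_key` and its arguments are hypothetical and not part of the notebook or of the Hopsworks library.

```python
import os
from pathlib import Path


def load_api_key(env_var: str, fallback_file: str) -> str:
    """Return the key from the environment, else read it from a local file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to the file and export the value so later calls can see it.
    key = Path(fallback_file).read_text().strip()
    os.environ[env_var] = key
    return key
```

Calling `load_api_key("HOPSWORKS_API_KEY", "../../data/hopsworks-api-key.txt")` before `hopsworks.login(...)` would have the same effect as the cell above, while leaving an already exported key untouched.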


In [6]:
secrets = hopsworks.get_secrets_api()
try:
    secrets.create_secret("BICING_API_KEY", BICING_API_KEY)
except hopsworks.RestAPIError:
    BICING_API_KEY = secrets.get_secret("BICING_API_KEY").value

In [7]:
try:
    bikes_now_df = util.fetch_station_data(bikes_url, BICING_API_KEY, station_id)
except hopsworks.RestAPIError:
    print("It looks like the BICING_API_KEY doesn't work. Is the API key correct? Is the URL correct?")
bikes_now_df.head()

Unnamed: 0,station_id,num_bikes_available,last_reported
0,42,15,2025-01-07 08:58:33+00:00


## <span style='color:#ff5f27'> Read your CSV file into a DataFrame </span>

The cell below reads the historical bike availability data from a CSV file into a Pandas DataFrame and keeps only the rows for the selected station.

In [8]:
# Keep only the rows for station_id=42 from the CSV file
df = pd.read_csv(csv_file, parse_dates=['last_updated'], skipinitialspace=True)
df = df[df['station_id'] == station_id]
df.head()

Unnamed: 0,station_id,num_bikes_available,num_bikes_available_types.mechanical,num_bikes_available_types.ebike,num_docks_available,last_reported,is_charging_station,status,is_installed,is_renting,is_returning,traffic,last_updated,ttl,V1
39,42.0,1.0,0.0,1.0,21.0,1727733000.0,True,IN_SERVICE,1.0,1.0,1.0,,1727733600,0.0,
552,42.0,0.0,0.0,0.0,22.0,1727734000.0,True,IN_SERVICE,1.0,1.0,1.0,,1727733901,0.0,
1064,42.0,0.0,0.0,0.0,22.0,1727734000.0,True,IN_SERVICE,1.0,1.0,1.0,,1727734198,0.0,
1576,42.0,0.0,0.0,0.0,22.0,1727734000.0,True,IN_SERVICE,1.0,1.0,1.0,,1727734501,0.0,
2088,42.0,0.0,0.0,0.0,22.0,1727735000.0,True,IN_SERVICE,1.0,1.0,1.0,,1727734804,0.0,


## <span style='color:#ff5f27'> Data cleaning</span>

## Build the hourly features and check the data types of the columns in the DataFrame

In [None]:
# Copy the slice so the assignments below do not raise SettingWithCopyWarning
station_df = df[['last_updated', 'num_bikes_available']].copy()
station_df['last_updated'] = pd.to_datetime(
    station_df['last_updated'], unit='s', utc=True, errors='coerce'
)

# Create the 'day' column with the date (YYYY-MM-DD)
station_df['day'] = station_df['last_updated'].dt.strftime('%Y-%m-%d')
# Create the 'time' column with the hour of the day (00-23)
station_df['time'] = station_df['last_updated'].dt.strftime('%H')
station_df = station_df.rename(columns={"last_updated": "date"})
# Average the readings that fall within each hour
hourly_avg_df = station_df.groupby(['day', 'time'])['num_bikes_available'].mean().reset_index()

# Add a datetime column 'date' built from the 'day' and 'time' columns
hourly_avg_df['date'] = hourly_avg_df['day'] + ' ' + hourly_avg_df['time'] + ':00:00'
hourly_avg_df['date'] = pd.to_datetime(hourly_avg_df['date'], format='%Y-%m-%d %H:%M:%S')

# Add a boolean column: is the date on a weekend (Saturday or Sunday)?
hourly_avg_df['is_weekend'] = hourly_avg_df['date'].dt.dayofweek > 4
# Add a boolean column: is the date a public holiday in Spain?
holidays_es = holidays.Spain()
hourly_avg_df['is_holiday'] = hourly_avg_df['date'].dt.date.astype(str).map(lambda x: x in holidays_es)

# Convert the hour to int
hourly_avg_df['time'] = hourly_avg_df['time'].astype(int)
# Add a column with the number of bikes available during the previous hour
hourly_avg_df['prev_num_bikes_available'] = hourly_avg_df['num_bikes_available'].shift(1)
hourly_avg_df

Unnamed: 0,day,time,num_bikes_available,date,is_weekend,is_holiday,prev_num_bikes_available
0,2024-09-30,22,0.250000,2024-09-30 22:00:00,False,False,
1,2024-09-30,23,0.333333,2024-09-30 23:00:00,False,False,0.250000
2,2024-10-01,0,1.230769,2024-10-01 00:00:00,False,False,0.333333
3,2024-10-01,1,1.333333,2024-10-01 01:00:00,False,False,1.230769
4,2024-10-01,2,1.166667,2024-10-01 02:00:00,False,False,1.333333
...,...,...,...,...,...,...,...
718,2024-10-31,19,19.818182,2024-10-31 19:00:00,False,False,18.750000
719,2024-10-31,20,18.583333,2024-10-31 20:00:00,False,False,19.818182
720,2024-10-31,21,15.833333,2024-10-31 21:00:00,False,False,18.583333
721,2024-10-31,22,18.166667,2024-10-31 22:00:00,False,False,15.833333
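
To make the hourly averaging and the lag feature from the cleaning cell easier to follow, here is a small sketch on synthetic readings. The five rows in `raw` are made up for illustration (only the first epoch value matches the CSV); the column names follow the notebook.

```python
import pandas as pd

# Synthetic readings every 30 minutes, starting at 2024-09-30 22:00 UTC.
raw = pd.DataFrame({
    "last_updated": [1727733600, 1727735400, 1727737200, 1727739000, 1727740800],
    "num_bikes_available": [0, 2, 4, 6, 10],
})

# Epoch seconds -> timezone-aware UTC timestamps, then split into day and hour.
raw["date"] = pd.to_datetime(raw["last_updated"], unit="s", utc=True)
raw["day"] = raw["date"].dt.strftime("%Y-%m-%d")
raw["time"] = raw["date"].dt.hour

# One row per (day, hour) holding the mean of the readings in that hour.
hourly = raw.groupby(["day", "time"], as_index=False)["num_bikes_available"].mean()

# Lag feature: the previous hour's average; the first row has no predecessor.
hourly["prev_num_bikes_available"] = hourly["num_bikes_available"].shift(1)
print(hourly)
```

The three resulting rows cover 22:00 and 23:00 on 2024-09-30 and 00:00 on 2024-10-01, with averages 1.0, 5.0 and 10.0. Note that `shift(1)` relies on the rows being sorted chronologically and on no hour being missing; if an hour were absent, the "previous" value would come from an earlier hour.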


In [None]:
hourly_avg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 723 entries, 0 to 722
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   day                       723 non-null    object        
 1   time                      723 non-null    int64         
 2   num_bikes_available       723 non-null    float64       
 3   date                      723 non-null    datetime64[ns]
 4   is_weekend                723 non-null    bool          
 5   is_holiday                723 non-null    bool          
 6   prev_num_bikes_available  722 non-null    float64       
dtypes: bool(2), datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 29.8+ KB
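
The `is_weekend` and `is_holiday` columns listed above can be illustrated with the standard library alone. This is only a sketch: `EXAMPLE_HOLIDAYS` is a hand-written, hypothetical stand-in for the `holidays.Spain()` lookup used in the notebook, and `calendar_flags` is not a function from the project.

```python
from datetime import date, datetime

# Hypothetical holiday set for illustration; the notebook uses holidays.Spain().
EXAMPLE_HOLIDAYS = {date(2024, 10, 12), date(2024, 11, 1)}


def calendar_flags(ts: datetime) -> dict:
    """Return the weekend and holiday flags for a timestamp."""
    return {
        # weekday() is 0 for Monday ... 6 for Sunday, so > 4 means Sat/Sun.
        "is_weekend": ts.weekday() > 4,
        "is_holiday": ts.date() in EXAMPLE_HOLIDAYS,
    }


print(calendar_flags(datetime(2024, 10, 1, 8)))   # a Tuesday
print(calendar_flags(datetime(2024, 10, 12, 8)))  # a Saturday in the example set
```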


## <span style='color:#ff5f27'> Drop any rows with missing data </span>
Model training is easier when no row has missing values, so we drop any row that contains one. Here that is only the first row, whose `prev_num_bikes_available` is empty because there is no previous hour.

In [11]:
hourly_avg_df.dropna(inplace=True)
hourly_avg_df

Unnamed: 0,day,time,num_bikes_available,date,is_weekend,is_holiday,prev_num_bikes_available
1,2024-09-30,23,0.333333,2024-09-30 23:00:00,False,False,0.250000
2,2024-10-01,0,1.230769,2024-10-01 00:00:00,False,False,0.333333
3,2024-10-01,1,1.333333,2024-10-01 01:00:00,False,False,1.230769
4,2024-10-01,2,1.166667,2024-10-01 02:00:00,False,False,1.333333
5,2024-10-01,3,4.181818,2024-10-01 03:00:00,False,False,1.166667
...,...,...,...,...,...,...,...
718,2024-10-31,19,19.818182,2024-10-31 19:00:00,False,False,18.750000
719,2024-10-31,20,18.583333,2024-10-31 20:00:00,False,False,19.818182
720,2024-10-31,21,15.833333,2024-10-31 21:00:00,False,False,18.583333
721,2024-10-31,22,18.166667,2024-10-31 22:00:00,False,False,15.833333


---

## <span style='color:#ff5f27'> 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs) </span>

## <span style='color:#ff5f27'> Download the Historical Weather Data </span>

https://open-meteo.com/en/docs/historical-weather-api#hourly=&daily=temperature_2m_mean,precipitation_sum,wind_speed_10m_max,wind_direction_10m_dominant

We will download the historical weather data for your `city` from the Open Meteo API.
The weather features we will download are:

 * `temperature (average over the hour)`
 * `precipitation (the total over the hour)`
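
The notebook delegates the download to `util.get_historical_weather`, whose implementation is not shown here. As a rough sketch of what such a request could look like, the snippet below only builds a URL for the Open-Meteo historical archive endpoint with the hourly variables `temperature_2m` and `precipitation`. The endpoint and parameter names reflect my understanding of the public Open-Meteo documentation and should be checked against it; `build_archive_url` is a hypothetical helper, and the real utility may use a client library instead.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Open-Meteo historical weather API.
ARCHIVE_ENDPOINT = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude: str, longitude: str,
                      start_date: str, end_date: str) -> str:
    """Build a request URL for hourly temperature and precipitation."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,   # YYYY-MM-DD
        "end_date": end_date,       # YYYY-MM-DD
        "hourly": "temperature_2m,precipitation",
    }
    return f"{ARCHIVE_ENDPOINT}?{urlencode(params)}"


print(build_archive_url("41.404511", "2.189881", "2024-09-30", "2025-01-07"))
```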


In [12]:
earliest_bikes_date = hourly_avg_df['day'].min()

weather_df = util.get_historical_weather(city, earliest_bikes_date, str(today), latitude, longitude)


weather_df['time'] = weather_df['date'].dt.strftime('%H')
weather_df['day'] = weather_df['date'].dt.strftime('%Y-%m-%d')
#convert the time to int
weather_df['time'] = weather_df['time'].astype(int)
weather_df

Coordinates 41.37082290649414°N 2.068965435028076°E
Elevation 13.0 m asl
Timezone None None
Timezone difference to GMT+0 0 s


Unnamed: 0,date,precipitation,temperature,city,time,day
0,2024-09-30 00:00:00+00:00,0.0,15.441501,Barcelona,0,2024-09-30
1,2024-09-30 01:00:00+00:00,0.0,15.791500,Barcelona,1,2024-09-30
2,2024-09-30 02:00:00+00:00,0.0,15.791500,Barcelona,2,2024-09-30
3,2024-09-30 03:00:00+00:00,0.0,15.641500,Barcelona,3,2024-09-30
4,2024-09-30 04:00:00+00:00,0.0,15.291500,Barcelona,4,2024-09-30
...,...,...,...,...,...,...
2348,2025-01-05 20:00:00+00:00,0.0,9.741500,Barcelona,20,2025-01-05
2349,2025-01-05 21:00:00+00:00,0.0,8.941501,Barcelona,21,2025-01-05
2350,2025-01-05 22:00:00+00:00,0.0,8.341500,Barcelona,22,2025-01-05
2351,2025-01-05 23:00:00+00:00,0.0,7.991500,Barcelona,23,2025-01-05


In [13]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2353 entries, 0 to 2352
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   date           2353 non-null   datetime64[ns, UTC]
 1   precipitation  2353 non-null   float32            
 2   temperature    2353 non-null   float32            
 3   city           2353 non-null   object             
 4   time           2353 non-null   int64              
 5   day            2353 non-null   object             
dtypes: datetime64[ns, UTC](1), float32(2), int64(1), object(2)
memory usage: 110.3+ KB
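
One detail worth noting from the two `info()` outputs: the bikes `date` column is timezone-naive (`datetime64[ns]`), while the weather `date` column is timezone-aware (`datetime64[ns, UTC]`). Because the bikes timestamps were derived from UTC values, a possible way to align them, if the two columns ever need to be compared or joined directly, is to localize the naive column to UTC. The sketch below uses two tiny hypothetical series rather than the real DataFrames.

```python
import pandas as pd

# Naive timestamps, as in the bikes DataFrame (wall-clock values are UTC).
bikes_date = pd.Series(pd.to_datetime(["2024-09-30 23:00:00", "2024-10-01 00:00:00"]))
# Timezone-aware timestamps, as in the weather DataFrame.
weather_date = pd.Series(
    pd.to_datetime(["2024-09-30 23:00:00", "2024-10-01 00:00:00"], utc=True)
)

# Attach the UTC timezone without shifting the clock time.
bikes_date_utc = bikes_date.dt.tz_localize("UTC")

print(bikes_date.dt.tz, bikes_date_utc.dt.tz)
print((bikes_date_utc == weather_date).all())
```

Joining on the `day` and `time` columns instead sidesteps the issue, since both DataFrames store them as plain strings and integers.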


---

### <span style="color:#ff5f27;"> Connect to Hopsworks and save the metadata</span>

In [14]:
fs = project.get_feature_store() 

#### Save city, station_id, bikes_url, latitude and longitude as a secret

These will be downloaded from Hopsworks later in (1) the daily feature pipeline and (2) the daily batch inference pipeline.

In [15]:
dict_obj = {
    "city": city,
    "station_id": station_id,
    "bikes_url": bikes_url,
    "latitude": latitude,
    "longitude": longitude
}

# Convert the dictionary to a JSON string
str_dict = json.dumps(dict_obj)

try:
    secrets.create_secret("STATION_PARAMS_JSON", str_dict)
except hopsworks.RestAPIError:
    print("STATION_PARAMS_JSON already exists. To update, delete the secret in the UI (https://c.app.hopsworks.ai/account/secrets) and re-run this cell.")
    existing_key = secrets.get_secret("STATION_PARAMS_JSON").value
    print(f"{existing_key}")

STATION_PARAMS_JSON already exists. To update, delete the secret in the UI (https://c.app.hopsworks.ai/account/secrets) and re-run this cell.
{"city": "Barcelona", "station_id": 42, "bikes_url": "https://opendata-ajuntament.barcelona.cat/data/dataset/estat-estacions-bicing/resource/1b215493-9e63-4a12-8980-2d7e0fa19f85/download/recurs.json", "latitude": "41.404511", "longitude": "2.189881"}
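
For the downstream pipelines, the stored string has to be turned back into Python values. A minimal sketch of that round trip follows; `stored_value` is an illustrative stand-in for `secrets.get_secret("STATION_PARAMS_JSON").value`, the URL in it is a placeholder, and `parse_station_params` is a hypothetical helper.

```python
import json

# Illustrative stand-in for the secret's value; the URL is a placeholder.
stored_value = json.dumps({
    "city": "Barcelona",
    "station_id": 42,
    "bikes_url": "https://example.org/recurs.json",
    "latitude": "41.404511",
    "longitude": "2.189881",
})


def parse_station_params(raw: str) -> dict:
    """Decode the JSON string and convert the coordinates to floats."""
    params = json.loads(raw)
    # The coordinates were saved as strings, so convert them for numeric use.
    params["latitude"] = float(params["latitude"])
    params["longitude"] = float(params["longitude"])
    return params


station_params = parse_station_params(stored_value)
print(station_params["station_id"], station_params["latitude"])
```

JSON preserves the integer type of `station_id`, whereas `latitude` and `longitude` come back as strings because that is how they were defined at the top of the notebook.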


### <span style="color:#ff5f27;"> Create the Feature Groups and insert the DataFrames in them </span>

### <span style='color:#ff5f27'> 🌫 Bikes </span>
    
 1. Provide a name, description, and version for the feature group.
 2. Define the `primary_key`: the columns that together uniquely identify each row in the DataFrame.
 3. Define the `event_time`: the column that stores the timestamp or date for each row, here `date`.
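
Since the primary key is supposed to identify each row uniquely, it can be worth checking that assumption before inserting. The sketch below is one possible check with plain pandas; `find_duplicate_keys` and the `example` DataFrame are hypothetical and not part of the Hopsworks API.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
    """Return every row whose key_columns combination occurs more than once."""
    return df[df.duplicated(subset=key_columns, keep=False)]


# Made-up data: the (day, time) pair ("2024-10-01", 1) appears twice.
example = pd.DataFrame({
    "day": ["2024-10-01", "2024-10-01", "2024-10-01"],
    "time": [0, 1, 1],
    "num_bikes_available": [1.2, 1.3, 1.4],
})

duplicates = find_duplicate_keys(example, ["day", "time"])
print(duplicates)
```

An empty result for `find_duplicate_keys(hourly_avg_df, ['time', 'date', 'day'])` would indicate that the chosen key columns are unique in the DataFrame being inserted.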

In [24]:
bikes_fg = fs.get_or_create_feature_group(
    name='bikes',
    description='Bikes available at a station every hour',
    version=1,
    primary_key=['time', 'date', 'day'],
    event_time="date",
)

#### Insert the DataFrame into the Feature Group

In [25]:
bikes_fg.insert(hourly_avg_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1164440/fs/1155143/fg/1393514


Uploading Dataframe: 100.00% |██████████| Rows 722/722 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: bikes_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1164440/jobs/named/bikes_1_offline_fg_materialization/executions


(Job('bikes_1_offline_fg_materialization', 'SPARK'), None)

#### Enter a description for each feature in the Feature Group

In [26]:
bikes_fg.update_feature_description("day", "Day of measurement of bikes availability")
bikes_fg.update_feature_description("time", "Hour of measurement of bikes availability")
bikes_fg.update_feature_description("num_bikes_available", "Available bikes at the station")
bikes_fg.update_feature_description("prev_num_bikes_available", "Bikes available at the station during the previous hour")
bikes_fg.update_feature_description("date", "Timestamp (start of the hour) of the measurement of bikes availability")
bikes_fg.update_feature_description("is_weekend", "Boolean if the date is weekend or not")
bikes_fg.update_feature_description("is_holiday", "Boolean if the date is holiday or not")

<hsfs.feature_group.FeatureGroup at 0x17a9ea290>

### <span style='color:#ff5f27'> 🌦 Weather Data </span>
    
 1. Provide a name, description, and version for the feature group.
 2. Define the `primary_key`: the columns that together uniquely identify each row in the DataFrame.
 3. Define the `event_time`: the column that stores the timestamp or date for each row, here `date`.

In [17]:
# Get or create feature group 
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    description='Weather characteristics of each hour',
    version=1,
    primary_key=['date', 'time', 'day'],
    event_time="date",
) 

#### Insert the DataFrame into the Feature Group

In [18]:
# Insert data
weather_fg.insert(weather_df)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1164440/fs/1155143/fg/1393675


Uploading Dataframe: 100.00% |██████████| Rows 2353/2353 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: weather_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1164440/jobs/named/weather_1_offline_fg_materialization/executions


(Job('weather_1_offline_fg_materialization', 'SPARK'), None)

#### Enter a description for each feature in the Feature Group

In [19]:
weather_fg.update_feature_description("date", "Date of measurement of weather")
weather_fg.update_feature_description("day", "Day of measurement of weather")
weather_fg.update_feature_description("time", "Time of measurement of weather")
weather_fg.update_feature_description("city", "City where weather is measured/forecast for")
weather_fg.update_feature_description("precipitation", "Precipitation (rain/snow) in mm")
weather_fg.update_feature_description("temperature", "Temperature in Celsius 2m above ground")

<hsfs.feature_group.FeatureGroup at 0x17e99c850>

---