# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;">**Hopsworks Feature Store** </span> <span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

**Note**: This tutorial does not support Google Colab.

This is the first part of the advanced series of tutorials about the Hopsworks Feature Store. In this first module, you will work with Citi Bike usage data and meteorological observations for NYC.

The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and deploying a model that predicts future Citi Bike usage per station.

## 🗒️ This notebook is divided into 3 sections:
1. Load the data and perform feature engineering.
2. Connect to the Hopsworks feature store.
3. Create feature groups and upload them to the feature store.

### <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
!pip install -U hopsworks --quiet
!pip install python-dotenv

In [None]:
from datetime import timedelta, datetime
import pandas as pd
import plotly.express as px
import os

from pandas.tseries.holiday import USFederalHolidayCalendar

from features import (
    citibike, 
    meteorological_measurements,
)

# Mute warnings
import warnings
warnings.filterwarnings("ignore")

---

## <span style="color:#ff5f27;"> 💽 Load the historical data and 🛠️ Perform Feature Engineering</span>

The data you will use comes from three different sources:

- Citi Bike [Trip Histories](https://s3.amazonaws.com/tripdata/index.html);
- US national holidays from `USFederalHolidayCalendar` (`pandas.tseries.holiday` module);
- Different meteorological observations from [VisualCrossing](https://www.visualcrossing.com/).

### <span style="color:#ff5f27;"> 🚲 Citibike usage info</span>

Downloadable Citi Bike trip data files are located [here](https://s3.amazonaws.com/tripdata/index.html). The raw data includes:

    Ride ID
    Rideable type
    Started at
    Ended at
    Start station name
    Start station ID
    End station name
    End station ID
    Start latitude
    Start longitude
    End latitude
    End longitude
    Member or casual ride



Let's download some data [from here](https://s3.amazonaws.com/tripdata/index.html) and preprocess it (removing redundant columns and grouping the data).
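
The preprocessing described above (dropping redundant columns and grouping the data) can be sketched roughly as follows. This is only an illustration and not the implementation of `citibike.engineer_citibike_features`: the column names, the station IDs, and the `users_count` feature name are assumptions modelled on the field list above.

```python
import pandas as pd

# Hypothetical raw trips; column names and station IDs are made up for illustration.
trips = pd.DataFrame(
    {
        "ride_id": ["a1", "a2", "a3", "a4"],
        "rideable_type": ["classic_bike", "electric_bike", "classic_bike", "classic_bike"],
        "started_at": pd.to_datetime(
            ["2023-01-01 08:00", "2023-01-01 09:30", "2023-01-02 10:15", "2023-01-01 11:00"]
        ),
        "start_station_id": ["5329.03", "5329.03", "5329.03", "6140.05"],
    }
)

# Keep only the columns needed for a usage count and drop the rest.
trips = trips[["started_at", "start_station_id"]].rename(
    columns={"start_station_id": "station_id"}
)

# Reduce each start timestamp to its calendar day.
trips["date"] = trips["started_at"].dt.strftime("%Y-%m-%d")

# Group by day and station, counting the trips in each group.
daily_usage = (
    trips.groupby(["date", "station_id"])
    .size()
    .reset_index(name="users_count")
)
print(daily_usage)
```

The result has one row per (date, station) pair, which is the granularity the feature group later in this notebook uses as its primary key.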

In [None]:
# Get data for the 01/2023 to 04/2023 range
df_raw = citibike.get_citibike_data("01/2023", "04/2023")
df_raw.head(3)

In [None]:
# Convert the 'station_id' column to string type for categorical representation
df_raw.station_id = df_raw.station_id.astype(str)

In [None]:
# Engineer Citibike features
df_enhanced = citibike.engineer_citibike_features(df_raw)

# Drop rows with missing values in the enhanced DataFrame
df_enhanced = df_enhanced.dropna()

# Convert 'station_id' to string type for categorical representation
df_enhanced.station_id = df_enhanced.station_id.astype(str)

# Display the first three rows of the enhanced DataFrame
df_enhanced.head(3)

In [None]:
# Sample a random 'station_id' from the enhanced DataFrame
random_station_id = df_enhanced.station_id.sample(1).values[0]

# Display the first three rows of the enhanced DataFrame for the randomly selected 'station_id'
df_enhanced[df_enhanced.station_id == random_station_id].head(3)

In [None]:
# Display information about the DataFrame, including data types, non-null counts, and memory usage
df_enhanced.info()

### <span style="color:#ff5f27;">📒 Citibike stations info</span>

In [None]:
# Read the CSV file containing station information into a DataFrame
df_stations_info = pd.read_csv("data/stations_info.csv")

In [None]:
# Remove duplicate rows based on the 'station_id' column in the station information DataFrame
df_stations_info = df_stations_info.drop_duplicates(subset=["station_id"])

# Reset the index of the DataFrame and drop any rows with missing values
df_stations_info = df_stations_info.reset_index(drop=True).dropna()

# Convert 'station_id' to string type for categorical representation
df_stations_info.station_id = df_stations_info.station_id.astype(str)

In [None]:
# Display the first three rows of the station information DataFrame
df_stations_info.head(3)

In [None]:
# Create a scatter map using Plotly Express with station information
fig = px.scatter_mapbox(
    df_stations_info, 
    lat="lat", 
    lon="long",
    zoom=9.5,
    hover_name="station_name",
    height=400,
    width=600,
)

# Set the map style to 'open-street-map'
fig.update_layout(mapbox_style="open-street-map")

# Adjust layout margins to remove unnecessary space
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})

# Display the map
fig.show()

### <span style="color:#ff5f27;"> 📅 US holidays</span>

In [None]:
# Create a US Federal Holiday calendar
cal = USFederalHolidayCalendar()

# Generate a feature for roughly 10 years' worth of US holidays
start_date_for_cal = datetime.strptime('2017-01-01', '%Y-%m-%d')
end_date_for_cal = start_date_for_cal + timedelta(days=365*10)

# Create a DataFrame with holiday dates and a corresponding 'holiday' column
holidays = pd.DataFrame(
    cal.holidays(start=start_date_for_cal, end=end_date_for_cal),
    columns=['date'],
)
holidays['date'] = holidays['date'].dt.strftime('%Y-%m-%d')
holidays['holiday'] = 1

In [None]:
# Create a DataFrame with a date range from start_date_for_cal to end_date_for_cal
df_holidays = pd.DataFrame(
    pd.date_range(start_date_for_cal, end_date_for_cal),
    columns=["date"],
)

# Format the 'date' column to match the '%Y-%m-%d' format
df_holidays['date'] = df_holidays['date'].dt.strftime('%Y-%m-%d')

# Display the first three rows of the DataFrame
df_holidays.head(3)

In [None]:
# Set the 'date' column as the index and join the 'holidays' DataFrame on the 'date' column
# Fill missing values with 0 after the join
df_holidays = df_holidays.set_index("date").join(
    holidays.set_index("date"), 
    how="left",
).fillna(0)

In [None]:
# Convert the 'holiday' column to integer type
df_holidays['holiday'] = df_holidays['holiday'].astype(int)

# Reset the index, bringing the 'date' column back as a regular column
df_holidays = df_holidays.reset_index(drop=False)

# Display the first three rows of the DataFrame
df_holidays.head(3)

In [None]:
df_holidays.tail(3)
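
The cells above build the holiday flag with a left join followed by `fillna(0)`. As a hedged alternative sketch, the same kind of 0/1 indicator can be produced in one step with `Series.isin`. The one-year date range and the `calendar` variable name are chosen only for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Illustrative one-year window (an assumption; the notebook uses a longer range).
start, end = "2017-01-01", "2017-12-31"

# DatetimeIndex of US federal holidays inside the window.
holiday_dates = USFederalHolidayCalendar().holidays(start=start, end=end)

# One row per calendar day, flagged 1 on holidays and 0 otherwise.
calendar = pd.DataFrame({"date": pd.date_range(start, end)})
calendar["holiday"] = calendar["date"].isin(holiday_dates).astype(int)

# Match the '%Y-%m-%d' string format used elsewhere in this notebook.
calendar["date"] = calendar["date"].dt.strftime("%Y-%m-%d")

print(calendar[calendar["holiday"] == 1].head(3))
```

Both approaches should give the same flags. The join version generalizes more easily if the holiday table carries extra columns, such as the holiday name.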

### <span style="color:#ff5f27;"> 🌤 Meteorological measurements from VisualCrossing</span>

You will fetch weather data, so you need an API key from [VisualCrossing](https://www.visualcrossing.com/). You can get one via [this link](https://www.visualcrossing.com/weather-api).

#### Don't forget to create a `.env` configuration file inside this directory to store the necessary environment variables:

`WEATHER_API_KEY = "YOUR_API_KEY"`

> If you create it after you have started running this notebook, restart the Python kernel (because `functions.py` will not have these variables in its namespace).

![](images/api_keys_env_file.png)
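
A small guard can make a missing key fail early with a readable message. The helper below is a hypothetical sketch, not part of this tutorial's `features` package. It assumes the variables from `.env` have already been loaded into the process environment, for example by `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; it only reads os.environ and does
    not load the .env file itself.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and restart the kernel."
        )
    return value
```

Calling `require_env("WEATHER_API_KEY")` before requesting weather data would surface a configuration problem immediately instead of as a failed API call.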

In [None]:
# Convert the 'date' column to string type
df_enhanced.date = df_enhanced.date.astype(str)

# Find the minimum and maximum dates in the 'date' column
start_date, end_date = df_enhanced.date.min(), df_enhanced.date.max()

In [None]:
# Get weather data for New York City within the specified date range
df_weather = meteorological_measurements.get_weather_data(
    city="nyc",
    start_date=str(start_date).split()[0],
    end_date=str(end_date).split()[0],
)
df_weather.tail(3)

In [None]:
# Create Unix timestamp columns from the 'date' column of each DataFrame
df_enhanced["timestamp"] = df_enhanced["date"].apply(
    meteorological_measurements.convert_date_to_unix
)
df_holidays["timestamp"] = df_holidays["date"].apply(
    meteorological_measurements.convert_date_to_unix
)
df_weather["timestamp"] = df_weather["date"].apply(
    meteorological_measurements.convert_date_to_unix
)
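
For readers curious what such a date-to-Unix conversion involves, here is a minimal sketch. It is not the source of `meteorological_measurements.convert_date_to_unix`: the function name `date_to_unix_ms`, the choice of UTC midnight, and the millisecond unit are all assumptions.

```python
from datetime import datetime, timezone


def date_to_unix_ms(date_str: str) -> int:
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp in milliseconds.

    Assumption: the date is interpreted as midnight UTC. The tutorial's own
    helper may use a different time zone or unit.
    """
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


print(date_to_unix_ms("2023-01-01"))  # 1672531200000
```

Pinning the time zone explicitly keeps the resulting timestamps identical regardless of where the notebook runs.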

---

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

---

## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

A [feature group](https://docs.hopsworks.ai/3.0/concepts/fs/feature_group/fg_overview/) can be seen as a collection of conceptually related features. In this case, you will create the following feature groups: Citi Bike usage per station, station information, meteorological measurements in NYC, and US holidays.
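
Each feature group below declares a primary key, so it can be worth confirming that the key columns really identify rows uniquely before inserting. The following is a hedged sketch using plain pandas. The helper `find_duplicate_keys` and the sample data are hypothetical, and the check is a precaution rather than a requirement stated by this tutorial.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, primary_key: list) -> pd.DataFrame:
    """Return every row whose primary-key combination occurs more than once."""
    return df[df.duplicated(subset=primary_key, keep=False)]


# Hypothetical usage data with one repeated (date, station_id) pair.
usage = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-01"],
        "station_id": ["5329.03", "6140.05", "5329.03", "5329.03"],
        "users_count": [2, 1, 1, 5],
    }
)

duplicates = find_duplicate_keys(usage, ["date", "station_id"])
print(duplicates)
```

An empty result means the chosen columns are safe to use as a primary key. Any rows returned should be aggregated or dropped first.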

In [None]:
citibike_usage_fg = fs.get_or_create_feature_group(
    name="citibike_usage",
    version=1,
    description="Citibike stations usage data.",
    primary_key=["date", "station_id"],
    event_time="timestamp",
)

In [None]:
citibike_usage_fg.insert(df_enhanced)

In [None]:
citibike_stations_info_fg = fs.get_or_create_feature_group(
    name="citibike_stations_info",
    version=1,
    description="Citibike stations information.",
    primary_key=['station_id'],
)

In [None]:
citibike_stations_info_fg.insert(df_stations_info)

In [None]:
us_holidays_fg = fs.get_or_create_feature_group(
    name="us_holidays",
    version=1,
    description="US holidays calendar.",
    primary_key=["date"],
    event_time="timestamp",
)

In [None]:
us_holidays_fg.insert(df_holidays)

In [None]:
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name="meteorological_measurements",
    version=1,
    description="Meteorological measurements for NYC.",
    primary_key=["date"],
    event_time="timestamp",
)

In [None]:
meteorological_measurements_fg.insert(df_weather)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Feature Pipeline </span>

In the next notebook, you will parse new monthly data for the feature groups.
