# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/bitcoin/1_bitcoin_feature_backfill.ipynb)

## 🗒️ This notebook is divided into the following sections:
1. Fetch historical data.
2. Connect to the Hopsworks feature store.
3. Create feature groups and insert them into the feature store.

![tutorial-flow](../../images/01_featuregroups.png)

---

### <span style="color:#ff5f27;"> 📝 Imports</span>

In [1]:
# !pip install -U hopsworks --quiet

!pip install -U unicorn-binance-rest-api --quiet
!pip install -U python-dotenv --quiet
!pip install -U textblob --quiet
!pip install -U vaderSentiment --quiet
!pip install -U tweepy --quiet
!pip install -U plotly --quiet


In [2]:
!pip install -U unicorn-binance-suite --quiet


In [3]:
# Hosted notebook environments may not have the local features package
import os

def need_download_modules():
    if 'google.colab' in str(get_ipython()):
        return True
    if 'HOPSWORKS_PROJECT_ID' in os.environ:
        return True
    return False

if need_download_modules():
    print("Downloading modules")
    os.system('mkdir -p features')
    os.system('cd features && wget https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/master/advanced_tutorials/bitcoin/features/bitcoin_price.py')
    os.system('cd features && wget https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/master/advanced_tutorials/bitcoin/features/tweets.py')
else:
    print("Local environment")

Downloading modules


In [4]:
import os

# Uncomment and fill in if you are running on Colab
# os.environ['TWITTER_API_KEY'] = ''
# os.environ['TWITTER_API_SECRET'] = ''
# os.environ['TWITTER_ACCESS_TOKEN'] = ''
# os.environ['TWITTER_ACCESS_TOKEN_SECRET'] = ''

# os.environ['BINANCE_API_KEY'] = ''
# os.environ['BINANCE_API_SECRET'] = ''

In [5]:
import pandas as pd

from features import bitcoin_price, tweets

Importing tweets
 - tweepy
 - vaderSentiment
 - nltk


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yarnapp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yarnapp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/yarnapp/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


---
## <span style="color:#ff5f27;"> 🧙🏼‍♂️ Parsing Data</span>

You will parse time-series Bitcoin data from Binance using your own credentials, so you need a free Binance account and have to [create API keys](https://www.binance.com/en/support/faq/360002502072).

You also need to [contact Twitter](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api) to obtain their API keys.


#### Don't forget to create a `.env` configuration file inside this directory to store all the necessary environment variables:

`TWITTER_API_KEY = "YOUR_API_KEY"`

`TWITTER_API_SECRET = "YOUR_API_SECRET"`

`TWITTER_ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"`

`TWITTER_ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"`


`BINANCE_API_KEY = "YOUR_API_KEY"`

`BINANCE_API_SECRET = "YOUR_API_SECRET"`

> If you create it after running this notebook, restart the Python kernel, because `functions.py` will not have these variables in its namespace.

![](images/api_keys_env_file.png)
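
Before calling the Binance or Twitter parsers, it can help to confirm that every credential listed above is actually visible to the running kernel. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, not part of the tutorial's `features` package, and it assumes the variables have already been loaded into the process environment (for example from the `.env` file).

```python
import os

# Variable names taken from the `.env` template above.
REQUIRED_VARS = [
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "BINANCE_API_KEY",
    "BINANCE_API_SECRET",
]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or blank in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```

If the check reports missing names after you have created the `.env` file, restarting the kernel as described above is the first thing to try.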

### <span style='color:#ff5f27'> 📈 Bitcoin Data Parsing</span>

In [6]:
# we work with tweets newer than '2021-02-05'
df_bitcoin = bitcoin_price.parse_btc_data(
    start_date="2021-02-05", #"2024-01-05"
    end_date="today",
)

df_bitcoin.reset_index(drop=True, inplace=True)

print()
print(f"Parsed {df_bitcoin.shape[0]} rows.")
print()

df_bitcoin.head(3)

2024-04-20 15:25:36,616 INFO: New instance of unicorn-binance-rest-api_2.2.1-python_3.10.11-compiled on Linux 6.2.0-39-generic for exchange None started ...
2024-04-20 15:25:36,628 INFO: Loading license file `lucit_license.ini`
2024-04-20 15:25:36,663 INFO: Loading profile `LUCIT`
2024-04-20 15:25:36,663 INFO: New instance of lucit-licensing-python_1.8.2-python_3.10.11-compiled on Linux 6.2.0-39-generic started ...
2024-04-20 15:25:36,964 INFO: Initiating `colorama_0.4.6`

Parsed 1171 rows.



Unnamed: 0,date,open,high,low,close,volume,quote_av,trades,tb_base_av,tb_quote_av,unix
0,2021-02-05 00:00:00,36936.65,38310.12,36570.0,38290.24,66681.334275,2509278000.0,1853253,32756.385031,1232714000.0,1612483200000
1,2021-02-06 00:00:00,38289.32,40955.51,38215.94,39186.94,98757.311183,3922095000.0,2291646,52015.513362,2065181000.0,1612569600000
2,2021-02-07 00:00:00,39181.01,39700.0,37351.0,38795.69,84363.679763,3256521000.0,1976357,40764.388959,1574483000.0,1612656000000


In [7]:
df_bitcoin_processed = bitcoin_price.process_btc_data(df_bitcoin)
df_bitcoin_processed.tail(3)

Unnamed: 0,date,open,high,low,close,volume,quote_av,trades,tb_base_av,tb_quote_av,...,exp_std_14_days,momentum_14_days,rate_of_change_14_days,strength_index_14_days,std_56_days,exp_mean_56_days,exp_std_56_days,momentum_56_days,rate_of_change_56_days,strength_index_56_days
1168,2024-04-18,61277.38,64117.09,60803.35,63470.08,43601.60918,2726741000.0,2142511,20870.20705,1305027000.0,...,3272.959638,-5017.71,-6.414775,43.95064,5129.861015,63864.319253,8378.145402,12181.66,25.078615,53.668408
1169,2024-04-19,63470.09,65450.0,59600.01,63818.01,69774.30271,4419893000.0,2828284,34941.50216,2214810000.0,...,3122.122782,-4002.61,-7.370515,44.757668,4705.82424,63862.694367,8229.852416,13073.86,23.754533,53.869795
1170,2024-04-20,63818.01,64268.58,63090.07,63917.51,12889.45438,822262000.0,766949,6521.05787,416020100.0,...,2959.061998,-4978.49,-7.847245,45.001553,4279.097075,63864.617723,8084.186206,12349.29,23.562596,53.928105


> Older records may have a time of 11pm or 9pm, while newer ones have 10pm. That's because of timezones and daylight saving time. Let's apply `fix_unix` to make the `unix` column usable.

In [8]:
df_bitcoin_processed.unix = df_bitcoin_processed.unix.apply(bitcoin_price.fix_unix)

In [9]:
df_bitcoin_processed.date = df_bitcoin_processed.date.astype(str)

In [10]:
df_bitcoin_processed.unix.count()

1171

In [11]:
import datetime

datetime.datetime.fromtimestamp(df_bitcoin_processed.unix[0] / 1000, datetime.timezone.utc)

datetime.datetime(2021, 2, 5, 0, 0, tzinfo=datetime.timezone.utc)
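
The timezone note above describes daily records that land one to three hours before midnight UTC. One possible way to put such records on a common daily grid is to round each millisecond timestamp to the nearest midnight UTC. The sketch below illustrates that idea only; `round_to_utc_midnight` is a hypothetical helper, and it is an assumption, not the actual implementation of `bitcoin_price.fix_unix` or `tweets.fix_unix`.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day


def round_to_utc_midnight(unix_ms):
    """Round a millisecond Unix timestamp to the nearest midnight UTC."""
    return ((int(unix_ms) + DAY_MS // 2) // DAY_MS) * DAY_MS


# 1657836000000 is 22:00 UTC on 2022-07-14, i.e. a "10pm" record
fixed = round_to_utc_midnight(1657836000000)
print(datetime.fromtimestamp(fixed / 1000, timezone.utc))
# prints 2022-07-15 00:00:00+00:00
```

Rounding to the nearest day, rather than truncating, moves a 9pm, 10pm, or 11pm record forward to the following midnight instead of back to the previous one.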

### <span style='color:#ff5f27'> 💭 Tweets Data</span>

In [12]:
tweets_textblob = pd.read_csv("https://repo.hops.works/dev/davit/bitcoin/tweets_textblob.csv")
tweets_textblob.unix = tweets_textblob.unix.apply(tweets.fix_unix)

In [13]:
tweets_vader = pd.read_csv("https://repo.hops.works/dev/davit/bitcoin/tweets_vader.csv")
tweets_vader.unix = tweets_vader.unix.apply(tweets.fix_unix)

In [14]:
# Keep only the first 10 characters of each date string (the YYYY-MM-DD part)
tweets_textblob.date = tweets_textblob.date.apply(lambda x: x[:10])
tweets_vader.date = tweets_vader.date.apply(lambda x: x[:10])

# Drop the first column of each frame
tweets_textblob.drop(tweets_textblob.columns[0], axis=1, inplace=True)
tweets_vader.drop(tweets_vader.columns[0], axis=1, inplace=True)

In [15]:
tweets_textblob.tail(3)

Unnamed: 0,date,subjectivity,polarity,unix
525,2022-07-15,9579.905834,3556.036103,1657836000000
526,2022-07-16,8612.760526,3231.040465,1657922400000
527,2022-07-17,4215.003529,1331.957065,1658008800000


In [16]:
tweets_vader.tail(3)

Unnamed: 0,date,compound,unix
525,2022-07-15,6349.6604,1657836000000
526,2022-07-16,5737.0585,1657922400000
527,2022-07-17,2427.5174,1658008800000


In [None]:
# datetime.datetime.fromtimestamp(tweets_textblob.unix[0] / 1000, datetime.timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

In [None]:
# df_bitcoin_processed.unix = tweets_textblob.unix

len(tweets_textblob.unix)

In [None]:
len(df_bitcoin_processed.unix)

In [21]:
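# Reuse the Bitcoin timestamps for the tweet rows (pandas aligns the assignment on the shared 0..N-1 index)
tweets_textblob.unix = df_bitcoin_processed.unix[:len(tweets_textblob.unix)]
tweets_vader.unix = df_bitcoin_processed.unix[:len(tweets_vader.unix)]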

In [22]:
tweets_textblob.unix == df_bitcoin_processed.unix[:len(tweets_textblob.unix)]

0      True
1      True
2      True
3      True
4      True
       ... 
523    True
524    True
525    True
526    True
527    True
Name: unix, Length: 528, dtype: bool

In [23]:
tweets_textblob.unix == tweets_vader.unix

0      True
1      True
2      True
3      True
4      True
       ... 
523    True
524    True
525    True
526    True
527    True
Name: unix, Length: 528, dtype: bool

In [24]:
tweets_vader.unix == df_bitcoin_processed.unix[:len(tweets_vader.unix)]

0      True
1      True
2      True
3      True
4      True
       ... 
523    True
524    True
525    True
526    True
527    True
Name: unix, Length: 528, dtype: bool
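
Copying the first rows of the Bitcoin `unix` column onto the tweet frames only assigns each tweet row the right day if both frames are gap-free daily series that start on the same date. The sketch below shows one way to test the gap-free part with plain Python; `find_daily_gaps` is a hypothetical helper and assumes the timestamps are in milliseconds and already sorted.

```python
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day


def find_daily_gaps(timestamps_ms):
    """Return positions i where timestamps_ms[i] is not exactly one day after timestamps_ms[i - 1]."""
    return [
        i
        for i in range(1, len(timestamps_ms))
        if timestamps_ms[i] - timestamps_ms[i - 1] != DAY_MS
    ]


# Three consecutive days from the Bitcoin output above, followed by a skipped day
sample = [1612483200000, 1612569600000, 1612656000000, 1612656000000 + 2 * DAY_MS]
print(find_daily_gaps(sample))
# prints [3]
```

On the real data this could be called as `find_daily_gaps(df_bitcoin_processed.unix.tolist())`; an empty list would indicate that every row is exactly one day after the previous one. Comparing the `date` columns of the two frames row by row is a complementary check on the shared start date.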

---

## <span style="color:#ff5f27;"> 📡 Connecting to the Hopsworks Feature Store </span>

In [25]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/11394
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🪄 Creating Feature Groups </span>

### <span style='color:#ff5f27'> 📈 Bitcoin Price Feature Group</span>

In [26]:
btc_price_fg = fs.get_or_create_feature_group(
    name='bitcoin_price',
    description='Bitcoin price aggregated by day',
    version=1,
    primary_key=['unix'],
    online_enabled=True,
    event_time='unix',
)

btc_price_fg.insert(df_bitcoin_processed)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/11394/fs/11342/fg/12444


Uploading Dataframe: 0.00% |          | Rows 0/1171 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_price_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_price_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f7f1be388b0>, None)

### <span style='color:#ff5f27'> 💭 Tweets Feature Groups</span>

In [27]:
tweets_textblob_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_textblob',
    version=1,
    primary_key=['unix'],
    online_enabled=True,
    event_time='unix',
)

tweets_textblob_fg.insert(tweets_textblob)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/11394/fs/11342/fg/12445


Uploading Dataframe: 0.00% |          | Rows 0/528 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_tweets_textblob_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_tweets_textblob_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f7f1bdabdc0>, None)

In [28]:
tweets_vader_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_vader',
    version=1,
    primary_key=['unix'],
    online_enabled=True,
    event_time='unix',
)

tweets_vader_fg.insert(tweets_vader)

Feature Group created successfully, explore it at 
https://snurran.hops.works/p/11394/fs/11342/fg/12446


Uploading Dataframe: 0.00% |          | Rows 0/528 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_tweets_vader_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_tweets_vader_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7f7f1be070d0>, None)
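
All three feature groups above use `unix` as both the primary key and the event time, so each row is expected to carry a distinct `unix` value. A quick pre-insert check for repeated keys can be sketched with the standard library as follows; `duplicate_keys` is a hypothetical helper, not part of the Hopsworks API.

```python
from collections import Counter


def duplicate_keys(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]


print(duplicate_keys([1612483200000, 1612569600000, 1612569600000]))
# prints [1612569600000]
```

For the frames in this notebook the call would look like `duplicate_keys(tweets_vader.unix.tolist())`, and an empty list means no key is repeated. With pandas, the `Series.is_unique` property gives the same yes/no answer in one attribute access.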

## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Feature Pipeline</span>

In the next notebook, you will parse new monthly data for the feature groups.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/bitcoin/2_bitcoin_feature_pipeline.ipynb)