# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 02: Feature Pipeline</span>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/bitcoin/2_bitcoin_feature_pipeline.ipynb)


## 🗒️ This notebook is divided into the following sections:
1. Parsing Data.
2. Feature Group Insertion.

### <span style="color:#ff5f27;"> 📝 Imports</span>

In [1]:
!pip install -U hopsworks --quiet

!pip install -U unicorn-binance-rest-api --quiet
!pip install -U python-dotenv --quiet
!pip install -U textblob --quiet
!pip install -U vaderSentiment --quiet
!pip install -U tweepy --quiet
!pip install -U plotly --quiet


In [2]:
# Hosted notebook environments may not have the local features package
import os

def need_download_modules():
    if 'google.colab' in str(get_ipython()):
        return True
    if 'HOPSWORKS_PROJECT_ID' in os.environ:
        return True
    return False

if need_download_modules():
    print("Downloading modules")
    os.system('mkdir -p features')
    os.system('cd features && wget https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/master/advanced_tutorials/bitcoin/features/bitcoin_price.py')
    os.system('cd features && wget https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/master/advanced_tutorials/bitcoin/features/tweets.py')
else:
    print("Local environment")

Downloading modules


In [3]:
import os

# Fill in your credentials if you are running on Colab.
# Note: leaving these empty overwrites any values already set in the environment.
os.environ['TWITTER_API_KEY'] = ''
os.environ['TWITTER_API_SECRET'] = ''
os.environ['TWITTER_ACCESS_TOKEN'] = ''
os.environ['TWITTER_ACCESS_TOKEN_SECRET'] = ''

os.environ['BINANCE_API_KEY'] = ''
os.environ['BINANCE_API_SECRET'] = ''

In [4]:
import pandas as pd
from features import bitcoin_price, tweets

Importing tweets
 - tweepy
 - vaderSentiment
 - nltk


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yarnapp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yarnapp/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/yarnapp/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


---
## <span style="color:#ff5f27;"> 🧙🏼‍♂️ Parsing Data</span>

You will fetch Bitcoin time-series data from Binance using your own credentials, so you need a free Binance account and [API keys](https://www.binance.com/en/support/faq/360002502072).

You also need Twitter API keys, which you can [request from Twitter](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api).


#### Don't forget to create a `.env` configuration file inside this directory to store all the necessary environment variables:

`TWITTER_API_KEY = "YOUR_API_KEY"`

`TWITTER_API_SECRET = "YOUR_API_KEY"`

`TWITTER_ACCESS_TOKEN = "YOUR_API_KEY"`

`TWITTER_ACCESS_TOKEN_SECRET = "YOUR_API_KEY"`


`BINANCE_API_KEY = "YOUR_API_KEY"`

`BINANCE_API_SECRET = "YOUR_API_KEY"`

> If you create the file after you have already run this notebook, restart the Python kernel, because `functions.py` will not have these variables in its namespace.

![](images/api_keys_env_file.png)
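
Before calling the Binance or Twitter parsers, it can help to confirm that every variable listed above is actually set. The following is a minimal sketch that uses only the standard library; the helper name `missing_credentials` is hypothetical and is not part of the tutorial's `features` package.

```python
import os

# Variable names taken from the `.env` listing above.
REQUIRED_VARS = [
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "BINANCE_API_KEY",
    "BINANCE_API_SECRET",
]


def missing_credentials(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or blank in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_credentials()
if missing:
    print("Missing or empty credentials:", ", ".join(missing))
else:
    print("All credentials are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.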

### <span style='color:#ff5f27'> 📈 Bitcoin Data</span>

In [5]:
# Fetch more than 56 days of data: the longest window aggregation in the feature engineering spans 56 days.
df_bitcoin = bitcoin_price.parse_btc_data(number_of_days_ago=57)
df_bitcoin.head(3)

2024-04-20 15:31:03,043 INFO: New instance of unicorn-binance-rest-api_2.2.1-python_3.10.11-compiled on Linux 6.2.0-39-generic for exchange None started ...
2024-04-20 15:31:03,056 INFO: Loading license file `lucit_license.ini`
2024-04-20 15:31:03,104 INFO: Loading profile `LUCIT`
2024-04-20 15:31:03,105 INFO: New instance of lucit-licensing-python_1.8.2-python_3.10.11-compiled on Linux 6.2.0-39-generic started ...
2024-04-20 15:31:03,406 INFO: Initiating `colorama_0.4.6`


| | date | open | high | low | close | volume | quote_av | trades | tb_base_av | tb_quote_av | unix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-23 00:00:00 | 51288.42 | 51548.54 | 50521.0 | 50744.15 | 30545.79544 | 1559144000.0 | 1487039 | 14610.80906 | 745821000.0 | 1708646400000 |
| 1 | 2024-02-24 00:00:00 | 50744.15 | 51698.0 | 50585.0 | 51568.22 | 16560.4211 | 847793400.0 | 855015 | 8462.15627 | 433239600.0 | 1708732800000 |
| 2 | 2024-02-25 00:00:00 | 51568.21 | 51958.55 | 51279.8 | 51728.85 | 18721.63159 | 966951100.0 | 923992 | 9544.17672 | 492952500.0 | 1708819200000 |
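
The comment in the cell above asks for more than 56 days because a window aggregation over N days has no complete window until N rows are available. The sketch below illustrates this with a plain-Python trailing mean on synthetic values. It is not the code used by `bitcoin_price.process_btc_data`, and the function name `rolling_mean` is hypothetical.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` items; None until a full window exists."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


# 58 synthetic daily values (1.0 .. 58.0), not real prices.
closes = [float(x) for x in range(1, 59)]
means = rolling_mean(closes, window=56)

# Rows 0..54 have no complete 56-day window; rows 55..57 do.
print(sum(m is None for m in means))  # 55
print(means[55:])                     # [28.5, 29.5, 30.5]
```

With 58 rows, only the last three have a full 56-day window, which matches the three rows shown by `tail(3)` in the next cell.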


In [6]:
df_bitcoin_processed = bitcoin_price.process_btc_data(df_bitcoin)
df_bitcoin_processed.tail(3)

| | date | open | high | low | close | volume | quote_av | trades | tb_base_av | tb_quote_av | ... | exp_std_14_days | momentum_14_days | rate_of_change_14_days | strength_index_14_days | std_56_days | exp_mean_56_days | exp_std_56_days | momentum_56_days | rate_of_change_56_days | strength_index_56_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | 2024-04-18 | 61277.38 | 64117.09 | 60803.35 | 63470.08 | 43601.60918 | 2726741000.0 | 2142511 | 20870.20705 | 1305027000.0 | ... | 3259.91581 | -5017.71 | -6.414775 | 44.593135 | 5129.861015 | 66726.875601 | 3956.587429 | 0.0 | 25.078615 | 0.0 |
| 56 | 2024-04-19 | 63470.09 | 65450.0 | 59600.01 | 63818.01 | 69774.30271 | 4419893000.0 | 2828284 | 34941.50216 | 2214810000.0 | ... | 3110.61663 | -4002.61 | -7.370515 | 45.380545 | 4705.82424 | 66609.483564 | 3918.524698 | 13073.86 | 23.754533 | 56.075637 |
| 57 | 2024-04-20 | 63818.01 | 64268.58 | 63090.07 | 63958.24 | 12927.63062 | 824703300.0 | 769531 | 6537.65889 | 417081600.0 | ... | 2946.10378 | -4937.76 | -7.788523 | 45.715393 | 4278.672431 | 66503.049082 | 3874.583771 | 12390.02 | 23.641334 | 56.133849 |


In [7]:
df_bitcoin_processed.date = df_bitcoin_processed.date.astype(str)

### <span style='color:#ff5f27'> 💭 Tweets Data</span>

In [None]:
df_tweets_parsed = tweets.get_last_tweets()
df_tweets_parsed.head()

In [None]:
tweets_textblob = tweets.textblob_processing(df_tweets_parsed)

In [None]:
tweets_vader = tweets.vader_processing(df_tweets_parsed)

In [None]:
tweets_textblob.date = tweets_textblob.date.apply(lambda x: x[:10])
tweets_vader.date = tweets_vader.date.apply(lambda x: x[:10])

In [None]:
tweets_textblob.head()

---

### <span style="color:#ff5f27;"> 📡 Connecting to the Hopsworks Feature Store </span>

In [8]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/11394
Connected. Call `.close()` to terminate connection gracefully.


In [9]:
btc_price_fg = fs.get_or_create_feature_group(
    name='bitcoin_price',
    version=1,
)

tweets_textblob_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_textblob',
    version=1,
)

tweets_vader_fg = fs.get_or_create_feature_group(
    name='bitcoin_tweets_vader',
    version=1,
)

---

### <span style='color:#ff5f27'> 💫 Filling the gap in tweets</span>

In [35]:
btc_dates = btc_price_fg.read().date.sort_values().reset_index(drop=True).astype(str)

Finished: Reading data from Hopsworks, using ArrowFlight (0.60s) 


In [36]:
stored_tweets_df = tweets_textblob_fg.read()

Finished: Reading data from Hopsworks, using ArrowFlight (0.46s) 


In [37]:
stored_dates = stored_tweets_df.date.apply(lambda x: str(x)[:10]).drop_duplicates().sort_values().reset_index(drop=True)

In [38]:
btc_dates

0       2021-02-05
1       2021-02-06
2       2021-02-07
3       2021-02-08
4       2021-02-09
           ...    
1166    2024-04-16
1167    2024-04-17
1168    2024-04-18
1169    2024-04-19
1170    2024-04-20
Name: date, Length: 1171, dtype: object

In [39]:
stored_dates

0      2021-02-05
1      2021-02-06
2      2021-02-07
3      2021-02-08
4      2021-02-09
          ...    
523    2022-07-13
524    2022-07-14
525    2022-07-15
526    2022-07-16
527    2022-07-17
Name: date, Length: 528, dtype: object

In [40]:
missing_dates = list(set(btc_dates) - set(stored_dates))

In [41]:
len(missing_dates)

643
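
The set difference above returns the missing dates in arbitrary order (the notebook sorts the resulting frames later). A minimal sketch of the same idea on toy data, with the result sorted up front, is shown below. Because the dates are ISO `YYYY-MM-DD` strings, lexicographic order and chronological order agree. The helper name `find_missing_dates` is hypothetical.

```python
def find_missing_dates(all_dates, stored_dates):
    """Return dates in `all_dates` that are absent from `stored_dates`, sorted ascending."""
    return sorted(set(all_dates) - set(stored_dates))


# Toy inputs standing in for `btc_dates` and `stored_dates`.
btc = ["2022-07-16", "2022-07-17", "2022-07-18", "2022-07-19"]
stored = ["2022-07-16", "2022-07-17"]

print(find_missing_dates(btc, stored))  # ['2022-07-18', '2022-07-19']
```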

In [43]:
# Placeholder rows (every value set to 1.0) for the dates that have no stored tweets.
tweets_textblob_fix = pd.DataFrame(
    {
        "date": missing_dates,
        "subjectivity": [1.0] * len(missing_dates),
        "polarity": [1.0] * len(missing_dates),
    })

In [45]:
tweets_vader_fix = pd.DataFrame(
    {
        "date": missing_dates,
        "compound": [1.0] * len(missing_dates),
    })

In [46]:
tweets_vader_fix

| | date | compound |
|---|---|---|
| 0 | 2022-12-20 | 1.0 |
| 1 | 2023-10-22 | 1.0 |
| 2 | 2023-05-31 | 1.0 |
| 3 | 2022-08-25 | 1.0 |
| 4 | 2023-01-19 | 1.0 |
| ... | ... | ... |
| 638 | 2022-12-08 | 1.0 |
| 639 | 2023-04-12 | 1.0 |
| 640 | 2024-01-15 | 1.0 |
| 641 | 2023-07-09 | 1.0 |


In [47]:
tweets_vader_fix["unix"] = tweets_vader_fix.date.apply(tweets.convert_date_to_unix)
tweets_textblob_fix["unix"] = tweets_textblob_fix.date.apply(tweets.convert_date_to_unix)

In [48]:
tweets_vader_fix.sort_values("date")

| | date | compound | unix |
|---|---|---|---|
| 230 | 2022-07-18 | 1.0 | 1658102400000 |
| 85 | 2022-07-19 | 1.0 | 1658188800000 |
| 608 | 2022-07-20 | 1.0 | 1658275200000 |
| 542 | 2022-07-21 | 1.0 | 1658361600000 |
| 70 | 2022-07-22 | 1.0 | 1658448000000 |
| ... | ... | ... | ... |
| 634 | 2024-04-16 | 1.0 | 1713225600000 |
| 226 | 2024-04-17 | 1.0 | 1713312000000 |
| 447 | 2024-04-18 | 1.0 | 1713398400000 |
| 186 | 2024-04-19 | 1.0 | 1713484800000 |


In [49]:
# If you ran the tweet-parsing cells above, combine the parsed tweets with the placeholder rows instead:
# tweets_vader_batch = pd.concat([tweets_vader_fix, tweets_vader]).sort_values("date").reset_index(drop=True)
# tweets_textblob_batch = pd.concat([tweets_textblob_fix, tweets_textblob]).sort_values("date").reset_index(drop=True)

# Otherwise, upload only the placeholder rows, sorted by date.
tweets_vader_batch = tweets_vader_fix.sort_values("date").reset_index(drop=True)
tweets_textblob_batch = tweets_textblob_fix.sort_values("date").reset_index(drop=True)

In [50]:
len(tweets_textblob_batch)

643

In [51]:
import datetime
datetime.datetime.fromtimestamp(tweets_textblob_batch.unix[0] / 1000, datetime.timezone.utc)

datetime.datetime(2022, 7, 18, 0, 0, tzinfo=datetime.timezone.utc)
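
The implementation of `tweets.convert_date_to_unix` is not shown in this notebook. Judging only from the outputs above (2022-07-18 maps to 1658102400000, which is midnight UTC), an equivalent conversion might look like the sketch below. The function name `date_to_unix_ms` is hypothetical, and the real helper may differ.

```python
import datetime


def date_to_unix_ms(date_str):
    """Convert a 'YYYY-MM-DD' string to milliseconds since the epoch at midnight UTC."""
    dt = datetime.datetime.strptime(date_str, "%Y-%m-%d").replace(
        tzinfo=datetime.timezone.utc
    )
    return int(dt.timestamp() * 1000)


ms = date_to_unix_ms("2022-07-18")
print(ms)  # 1658102400000

# Round trip, mirroring the check in the cell above.
print(datetime.datetime.fromtimestamp(ms / 1000, datetime.timezone.utc))
```

Pinning the timezone to UTC keeps the result independent of the machine's local timezone.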

In [52]:
tweets_textblob_batch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 643 entries, 0 to 642
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          643 non-null    object 
 1   subjectivity  643 non-null    float64
 2   polarity      643 non-null    float64
 3   unix          643 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 20.2+ KB


---

## <span style="color:#ff5f27;">⬆️ Uploading new data to the Feature Store</span>

### <span style='color:#ff5f27'> 📈 Bitcoin Feature Group</span>

In [53]:
btc_price_fg.insert(df_bitcoin_processed)

Uploading Dataframe: 0.00% |          | Rows 0/58 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_price_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_price_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7ff55010f520>, None)

### <span style='color:#ff5f27'> 💭 Tweets Feature Groups</span>

In [54]:
tweets_textblob_fg.insert(tweets_textblob_batch)

Uploading Dataframe: 0.00% |          | Rows 0/643 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_tweets_textblob_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_tweets_textblob_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7ff5501753c0>, None)

In [55]:
tweets_vader_fg.insert(tweets_vader_batch)

Uploading Dataframe: 0.00% |          | Rows 0/643 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: bitcoin_tweets_vader_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://snurran.hops.works/p/11394/jobs/named/bitcoin_tweets_vader_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x7ff55d1a8520>, None)

## <span style="color:#ff5f27;">⏭️ **Next:** Part 03: Training Pipeline </span>

In the next notebook, you will create a feature view and a training dataset, train a model, and register it in the Hopsworks Model Registry.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/bitcoin/3_bitcoin_training_pipeline.ipynb)