## Data processing for BOW

This notebook receives the cleaned tweets from the file tweets_nlp_modelling_v2.csv, a second version of the dataset built with stratified sampling.

The tweets are tokenized and the tokens are then grouped by date, so that each day is represented by all of its tokens.
The bag-of-words (BOW) function is applied to these daily token lists.

Libraries and stopwords:

In [24]:
import pandas as pd
from itertools import chain
import nltk
import joblib
from functions.tweets_tokenization import (
    tokenize_tweets,
    dictionary_tweets,
    bow_tweets,
)

# English stopword list from NLTK (requires the 'stopwords' corpus to be downloaded)
stopwords = nltk.corpus.stopwords.words('english')
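
The helpers imported from functions.tweets_tokenization are project code that this notebook does not show. As a rough illustration only, a tokenizer of this kind might lowercase each tweet, keep alphanumeric words, and drop stopwords. The name `sketch_tokenize_tweets`, the regular expression, and the small stand-in stopword set are all assumptions, not the project's implementation.

```python
import re

# Stand-in stopword set for illustration only (assumption);
# the notebook loads NLTK's English list instead.
SKETCH_STOPWORDS = {"and", "both", "have", "for", "a", "the", "to", "of"}


def sketch_tokenize_tweets(tweets, stopwords=SKETCH_STOPWORDS):
    """Hypothetical tokenizer: returns one list of tokens per tweet."""
    tokenized = []
    for tweet in tweets:
        # Lowercase, then keep runs of letters and digits
        # (this drops emojis, punctuation, and whitespace).
        words = re.findall(r"[a-z0-9]+", tweet.lower())
        # Remove stopwords while preserving token order.
        tokenized.append([w for w in words if w not in stopwords])
    return tokenized


example = sketch_tokenize_tweets(["Bitcoin and ETH both have bullish setups"])
```

Returning one list per input tweet is what allows the result to be assigned directly to a dataframe column with one row per tweet.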

Import the tweets and Bitcoin price data from CSV:

In [25]:
directory = '~/PycharmProjects/tfm_hugopobil'
df = pd.read_csv(f'{directory}/data/sampled_data/tweets_nlp_modelling_v2.csv')
btc_usd_grouped = pd.read_csv(f'{directory}/data/sampled_data/btc_usd_grouped_v2.csv')
df = df.set_index('date_clean')

In [26]:
print(df.shape)
df.head()

(22120, 7)


| date_clean | tweets | cleaned_tweets | crypto_sentiment | subjectivity | polarity | sentiment | target |
|---|---|---|---|---|---|---|---|
| 2021-02-05 | Bitcoin and ETH both have bullish setups for a... | Bitcoin ETH bullish setup move higher BTC woul... | | 0.416667 | 0.35 | neutral | False |
| 2021-02-05 | 4⃣ 🎙️ Bloomberg LP CryptoOutlook 2021 with ⬇️... | Bloomberg LP CryptoOutlook cryptocurrency bitc... | | 0.0 | 0.0 | positive | False |
| 2021-02-05 | ⬇️⬇️ $BTC SELLING PRESSURE ALERT 📉 Price tradi... | BTC SELLING PRESSURE ALERT Price trading aroun... | | 0.0 | 0.0 | positive | False |
| 2021-02-05 | If hyperinflation does hit again, think of the... | If hyperinflation hit think inflation like flo... | | 0.541667 | -0.291667 | negative | False |
| 2021-02-05 | DeriBot Daily Trading Report 5.02.2021 11:42 U... | DeriBot Daily Trading Report UTC Bitcoin Tradi... | | 0.0 | 0.0 | positive | False |


### Tweets tokenization without grouping

Each tweet is tokenized individually, producing one list of tokens per row.

In [27]:
df['tokens'] = tokenize_tweets(df.tweets.to_list())

### Group tokens by date

This creates a dataframe in which all tokens sharing the same date are accumulated into a single list per day.

In [34]:
df_grouped = df.groupby(df.index).agg({'tokens': lambda x: list(chain(*x.to_list()))})
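
To make the grouping step concrete, here is a small sketch on toy data. It uses the same `groupby` and `chain` aggregation as the cell above; the dataframe `toy` and its contents are invented for illustration.

```python
from itertools import chain

import pandas as pd

# Toy data (assumption): three tweets, two of them on the same day.
toy = pd.DataFrame(
    {"tokens": [["btc", "up"], ["eth"], ["btc", "down"]]},
    index=pd.Index(
        ["2021-02-05", "2021-02-05", "2021-02-06"], name="date_clean"
    ),
)

# Group rows by their index value and flatten the per-tweet token lists
# of each day into one list.
toy_grouped = toy.groupby(toy.index).agg(
    {"tokens": lambda x: list(chain(*x.to_list()))}
)
```

Each row of `toy_grouped` now holds every token seen on that day, in tweet order, so duplicates across tweets are preserved and can later be counted by the BOW step.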

### Model Data Preparation

X = BOW of the tweets grouped by day
Y (target) = daily Bitcoin returns

In [51]:
df_index = df_grouped.index

Calculate the daily return for Bitcoin. Note that pct_change returns a fractional change rather than a percentage, so a value of 0.10 means +10%.

In [44]:
btc_usd_grouped['return'] = btc_usd_grouped['Adj Close'].pct_change()
btc_usd_grouped_returns = btc_usd_grouped.set_index('Date')['return']
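
A short sketch of what `pct_change` produces, using made-up prices. The column names match the cell above; the values are assumptions chosen to make the arithmetic easy to check.

```python
import pandas as pd

# Toy prices (assumption): three consecutive days.
prices = pd.DataFrame(
    {
        "Date": ["2021-02-06", "2021-02-07", "2021-02-08"],
        "Adj Close": [100.0, 110.0, 99.0],
    }
)

# pct_change computes (price_t / price_{t-1}) - 1 for each row.
prices["return"] = prices["Adj Close"].pct_change()
returns = prices.set_index("Date")["return"]

# The first value is NaN because there is no previous day to compare against.
# 100 -> 110 gives 0.10 and 110 -> 99 gives -0.10 (fractions, not percent).
```

The leading NaN matters later: the first day of the price series has no return, so it is removed when NA values are dropped.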

Join both dataframes on the date index and drop the rows with NA values. The result is a dataframe with the daily tokens from the tweets and the daily Bitcoin return as the target variable.

In [62]:
model_data = df_grouped.join(btc_usd_grouped_returns).dropna()
print(model_data.shape)
model_data.head()

(107, 2)


| date_clean | tokens | return |
|---|---|---|
| 2021-02-07 | [crypto, trader, stressing, cryptocurrency, bi... | -0.009234 |
| 2021-02-08 | [btc, going, signal, minute, chart, price, bit... | 0.187465 |
| 2021-02-09 | [psychological, barrier, know, real, one, bitc... | 0.006162 |
| 2021-02-10 | [join, nai, 1swimhw, write, articles, earn, ea... | -0.033625 |
| 2021-02-13 | [dash, expected, move, beginning, upper, targe... | -0.008406 |
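
The join and `dropna` step can be sketched on toy data as follows. `DataFrame.join` performs a left join on the index by default, so dates are matched by label and must use the same type and format in both objects. All values below are invented for illustration.

```python
import pandas as pd

# Toy daily tokens (assumption): no tweets were collected on 2021-02-08.
tokens_by_day = pd.DataFrame(
    {"tokens": [["btc"], ["eth", "btc"], ["doge"]]},
    index=pd.Index(
        ["2021-02-06", "2021-02-07", "2021-02-09"], name="date_clean"
    ),
)

# Toy returns (assumption): the first day has no return, and there is
# no price row for 2021-02-09.
returns = pd.Series(
    [float("nan"), 0.10, -0.10],
    index=pd.Index(["2021-02-06", "2021-02-07", "2021-02-08"], name="Date"),
    name="return",
)

# Left join on the index: every tweet day is kept, returns matched by date.
joined = tokens_by_day.join(returns)

# dropna removes days without a return: 2021-02-06 (NaN return)
# and 2021-02-09 (no matching price row).
toy_model_data = joined.dropna()
```

Only days present in both sources with a defined return survive, which is why the joined dataframe can have fewer rows than either input.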


Calculate the BOW and the training set for documents with fewer than 3 words:

In [78]:
dictionary_model_data = dictionary_tweets(model_data['tokens'])
X_model_data, doc2bow_model_data = bow_tweets(model_data['tokens'], dictionary_model_data)
X_model_data.shape

(107, 43433)
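
The shape (107, 43433) is consistent with one row per day and one column per distinct token. Since `dictionary_tweets` and `bow_tweets` are not shown here, the following is only a minimal standard-library sketch of the general idea. The functions `sketch_dictionary` and `sketch_bow`, their return types, and the toy documents are assumptions; the project's helpers may work differently.

```python
from collections import Counter


def sketch_dictionary(docs):
    """Hypothetical stand-in for dictionary_tweets: maps token -> integer id."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            # Assign ids in order of first appearance.
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def sketch_bow(docs, token_to_id):
    """Hypothetical stand-in for bow_tweets.

    Returns a dense count matrix (one row per document, one column per
    token id) and the sparse (token_id, count) pairs for each document.
    """
    matrix = []
    sparse_docs = []
    for doc in docs:
        # Count occurrences of each known token id in this document.
        counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
        sparse_docs.append(sorted(counts.items()))
        matrix.append([counts.get(i, 0) for i in range(len(token_to_id))])
    return matrix, sparse_docs


docs = [["btc", "up", "btc"], ["eth", "up"]]
vocab = sketch_dictionary(docs)
X_toy, doc2bow_toy = sketch_bow(docs, vocab)
```

With two documents and three distinct tokens, `X_toy` has two rows and three columns. In practice a sparse representation is usually preferable when the vocabulary has tens of thousands of entries, since most counts are zero.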

Save to local:

In [75]:
joblib.dump(doc2bow_model_data, '/Users/hpp/PycharmProjects/tfm_hugopobil/data/model_data/doc2bow.joblib')
model_data.to_csv('/Users/hpp/PycharmProjects/tfm_hugopobil/data/model_data/model_data.csv')
joblib.dump(X_model_data, '/Users/hpp/PycharmProjects/tfm_hugopobil/data/model_data/X_model_data.joblib')

['/Users/hpp/PycharmProjects/tfm_hugopobil/data/model_data/X_model_data.joblib']