## Data processing for BOW

This notebook receives the cleaned tweets from the file tweets_nlp_modelling_V2, a second version built with stratified sampling.

The file is processed to obtain the tokens, which are then passed to the bag-of-words (BOW) function.
Tokens will be grouped by date so that each day includes all of its tokens.

Libraries and stopwords:

In [13]:
import pandas as pd
from itertools import chain
import nltk
import joblib
from functions.tweets_tokenization import (
    tokenize_tweets,
    dictionary_tweets,
    bow_tweets,
)

stopwords = nltk.corpus.stopwords.words('english')

Import the data from CSV:

Load the tweets file and the grouped Bitcoin price file.

In [14]:
directory = '~/PycharmProjects/tfm_hugopobil'
df = pd.read_csv(f'{directory}/data/sampled_data/tweets_nlp_modelling_v3.csv')
btc_usd_grouped = pd.read_csv(f'{directory}/data/sampled_data/btc_usd_grouped_v2.csv')
df = df.set_index('date_clean')

In [15]:
print(df.shape)
df.head()

(20565, 7)


date_clean,tweets,cleaned_tweets,crypto_sentiment,subjectivity,polarity,sentiment,target
2021-02-06,"""Will Institutional Investment Keep Pouring In...",Will Institutional Investment Keep Pouring Int...,positive,0.0,0.0,positive,True
2021-02-06,BTC Bitcoin You know where the WSB money is g...,BTC Bitcoin You know WSB money going WallStree...,positive,0.0,0.0,positive,True
2021-02-06,"🔼🔼 ₿1 = $38,868 (00:56 UTC)\n$BTC prices conti...",UTC BTC price continue rise Change since midni...,positive,0.0,0.0,positive,True
2021-02-06,BTC Bitcoin All the way up! 🚀 🚀 💵 💵 /xVyLbbWRiu,BTC Bitcoin All way xVyLbbWRiu,positive,0.0,0.0,positive,True
2021-02-06,Keep going BTC bitcoin,Keep going BTC bitcoin,positive,0.0,0.0,positive,True


### Tweets tokenization without grouping

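### Group tokens by date

The tweets are tokenized first, giving one token list per tweet. Grouping then accumulates all tokens that share a date into a single token list per day. The grouping cell is commented out here, so the data stays at tweet level.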

In [16]:
df['tokens'] = tokenize_tweets(df.tweets.to_list())
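
The body of `tokenize_tweets` is not shown in this notebook. Judging only from the output further down (lowercased tokens, common words such as "the" removed), a rough stand-in could look like the sketch below. The function name `tokenize_tweets_sketch`, the regular expression and the tiny stopword set are assumptions made for illustration, not the project's implementation.

```python
import re

# Assumption: a tiny stand-in for nltk's English stopword list, so the
# sketch runs without downloading any corpus.
STOPWORDS = {"will", "you", "where", "the", "is", "all", "up", "to", "in", "a"}


def tokenize_tweets_sketch(tweets, stopwords=STOPWORDS):
    """Lowercase each tweet, keep alphanumeric tokens, drop stopwords."""
    tokenized = []
    for tweet in tweets:
        words = re.findall(r"[a-z0-9]+", tweet.lower())
        tokenized.append([w for w in words if w not in stopwords])
    return tokenized


tokens = tokenize_tweets_sketch([
    "Keep going BTC bitcoin",
    "BTC Bitcoin All the way up!",
])
print(tokens)
```

Each tweet maps to its own token list, which is the shape the `tokens` column needs.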

In [5]:
# df_grouped = df.groupby(df.index).agg({'tokens': lambda x: list(chain(*x.to_list()))})
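
The commented-out line concatenates the per-tweet token lists of each date with `chain`. The same idea is sketched here with plain Python containers so it can be run without the dataset; the sample rows and the names `rows`, `lists_by_date` and `grouped` are made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Made-up (date, tokens) pairs standing in for the rows of df.
rows = [
    ("2021-02-06", ["keep", "going", "btc", "bitcoin"]),
    ("2021-02-06", ["btc", "bitcoin", "way"]),
    ("2022-01-22", ["surprises", "people", "willing", "spend", "nft", "bitcoin"]),
]

# Collect the token lists that belong to each date.
lists_by_date = defaultdict(list)
for date, tokens in rows:
    lists_by_date[date].append(tokens)

# Flatten the lists of each date into one list, as chain(*x.to_list()) does.
grouped = {
    date: list(chain.from_iterable(token_lists))
    for date, token_lists in lists_by_date.items()
}
print(grouped["2021-02-06"])
```

Repeated tokens are kept on purpose, since a bag-of-words model counts occurrences.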

### Model Data Preparation

X = BOW for grouped tweets by days
Y (target) = Bitcoin returns

In [17]:
df

date_clean,tweets,cleaned_tweets,crypto_sentiment,subjectivity,polarity,sentiment,target,tokens
2021-02-06,"""Will Institutional Investment Keep Pouring In...",Will Institutional Investment Keep Pouring Int...,positive,0.000000,0.000000,positive,True,"[institutional, investment, keep, pouring, bit..."
2021-02-06,BTC Bitcoin You know where the WSB money is g...,BTC Bitcoin You know WSB money going WallStree...,positive,0.000000,0.000000,positive,True,"[btc, bitcoin, know, wsb, money, going, wallst..."
2021-02-06,"🔼🔼 ₿1 = $38,868 (00:56 UTC)\n$BTC prices conti...",UTC BTC price continue rise Change since midni...,positive,0.000000,0.000000,positive,True,"[utc, btc, prices, continue, rise, change, sin..."
2021-02-06,BTC Bitcoin All the way up! 🚀 🚀 💵 💵 /xVyLbbWRiu,BTC Bitcoin All way xVyLbbWRiu,positive,0.000000,0.000000,positive,True,"[btc, bitcoin, way, xvylbbwriu]"
2021-02-06,Keep going BTC bitcoin,Keep going BTC bitcoin,positive,0.000000,0.000000,positive,True,"[keep, going, btc, bitcoin]"
...,...,...,...,...,...,...,...,...
2022-01-22,Let’s be honest after watching this video plea...,Let honest watching video please tell bullish ...,positive,0.775000,0.475000,neutral,False,"[let, honest, watching, video, please, tell, b..."
2022-01-22,🚀 We got hashing! Alot more to onboard but sup...,We got hashing Alot onboard super excited new ...,positive,0.445202,0.074116,neutral,False,"[got, hashing, alot, onboard, super, excited, ..."
2022-01-22,What surprises me most is that people are will...,What surprise people willing spend NFT Bitcoin,positive,0.750000,0.250000,neutral,False,"[surprises, people, willing, spend, nft, bitcoin]"
2022-01-22,"BNB dumped -20.243% 1d , current price is $ 34...",BNB dumped current price Want buy dip Sigh Up ...,positive,0.400000,0.000000,positive,True,"[bnb, dumped, 1d, current, price, want, buy, d..."


Join both dataframes and drop NA values, so that the resulting dataframe holds the tweet tokens together with the daily Bitcoin returns, which serve as the target variable.
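
The join itself does not appear in a cell of this notebook. A minimal sketch of what it could look like follows, assuming both frames are indexed by `date_clean` and that the Bitcoin frame carries a column named `returns`. That column name and the sample values are hypothetical, since the layout of `btc_usd_grouped` is not shown.

```python
import pandas as pd

# Toy tweet-level frame indexed by date, mirroring df after set_index.
tweets = pd.DataFrame(
    {"tokens": [["btc", "bitcoin"], ["keep", "going"], ["nft", "bitcoin"]]},
    index=pd.Index(["2021-02-06", "2021-02-06", "2021-02-07"], name="date_clean"),
)

# Hypothetical daily frame: one row per date, `returns` is an assumed name.
btc = pd.DataFrame(
    {"date_clean": ["2021-02-06", "2021-02-07"], "returns": [0.021, None]}
).set_index("date_clean")

# Attach the daily return to every tweet of that day, then drop the rows
# where no return is available.
model_data = tweets.join(btc, how="left").dropna(subset=["returns"])
print(model_data)
```

A left join keeps one row per tweet, so every tweet of a day receives the same target value.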

Calculate the BOW and the training set for documents with fewer than 3 words.

In [18]:
dictionary_model_data = dictionary_tweets(df['tokens'])
X_model_data, doc2bow_model_data = bow_tweets(df['tokens'], dictionary_model_data)
X_model_data.shape

(20565, 43663)
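
The shape above reads as one row per tweet (20565) and one column per distinct token in the dictionary (43663). The helpers `dictionary_tweets` and `bow_tweets` are not shown, so the following is only a plain-Python sketch of the general bag-of-words idea; `build_dictionary` and `bow_matrix` are hypothetical names.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_matrix(documents, vocab):
    """Return one row of token counts per document (dense, for clarity)."""
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        # Dicts preserve insertion order, so columns follow the vocab indices.
        matrix.append([counts.get(token, 0) for token in vocab])
    return matrix


docs = [["keep", "going", "btc", "bitcoin"], ["btc", "bitcoin", "way", "btc"]]
vocab = build_dictionary(docs)
X = bow_matrix(docs, vocab)
shape = (len(X), len(vocab))
print(shape)
```

A dense list of lists is only practical for toy data. At 20565 rows by 43663 columns almost every entry is zero, so a sparse representation is the usual choice.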

Save to local:

In [19]:
joblib.dump(dictionary_model_data, '/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/dictionary.joblib')

['/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/dictionary.joblib']

In [20]:
joblib.dump(doc2bow_model_data, '/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/doc2bow.joblib')
df.to_csv('/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/model_data.csv')
joblib.dump(X_model_data, '/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/X_model_data.joblib')

['/Users/hpp/PycharmProjects/tfm_hugopobil/models/topic_analisis/X_model_data.joblib']
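
A later notebook can restore these artifacts with `joblib.load`. The round trip is sketched below in a temporary directory so that it runs anywhere; the small `dictionary` and `X` objects are stand-ins for the real ones, and building the paths with `pathlib` is a suggestion rather than what the cells above do.

```python
import tempfile
from pathlib import Path

import joblib

# Stand-ins for dictionary_model_data and X_model_data.
dictionary = {"btc": 0, "bitcoin": 1}
X = [[1, 1], [2, 0]]

with tempfile.TemporaryDirectory() as tmp:
    # Build the output folder once instead of repeating the full path.
    out_dir = Path(tmp) / "models" / "topic_analisis"
    out_dir.mkdir(parents=True)

    joblib.dump(dictionary, out_dir / "dictionary.joblib")
    joblib.dump(X, out_dir / "X_model_data.joblib")

    # Loading returns the objects exactly as they were saved.
    restored_dictionary = joblib.load(out_dir / "dictionary.joblib")
    restored_X = joblib.load(out_dir / "X_model_data.joblib")

print(restored_dictionary == dictionary, restored_X == X)
```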