# Tweet Sentiment Analyzer for Binance Coin (BNB)

Market sentiment may be an important indicator for predicting the price of a token. We build a sentiment classifier on tweets about Binance Coin using several models: Logistic Regression, SVM, Random Forest and a deep learning model. The data was collected from OpenBlender.io and contains around 1,300 tweets from the `CZ Binance` account.

In [8]:
import os
import pandas as pd
import numpy as np
import json
import plotly.express as px
import seaborn as sns

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

#Tensorflow / Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Plotting and SVM
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
from sklearn import svm

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [9]:
df = pd.read_csv(os.path.join('data', 'binance', 'binance_tweets_combined.csv'), index_col='date')
df

date,author,text,timestamp,datetime,text_cleaned,vader_sentiment,flair_sentiment,textBlob_sentiment,Date,Close/Last,Volume,Open,High,Low
2022-06-02 00:00:00,CZ Binance,"I love most tihings in the world, but I hate j...",1654195218,2022-06-02 18:40:18,love tihings world hate jet lag,-0.7717,0.995204,0.066667,06/02/2022,308.21,,300.50,310.20,306.60
2022-06-02 00:00:00,CZ Binance,#MalaysiaBoleh,1654181362,2022-06-02 14:49:22,malaysiaboleh,0.0000,0.583073,0.000000,06/02/2022,308.21,,300.50,310.20,306.60
2022-06-02 00:00:00,CZ Binance,Just finished a live interview with CNBC Squaw...,1654167185,2022-06-02 10:53:05,finish live interview cnbc squawk box studio haha,0.4588,-0.642573,0.168182,06/02/2022,308.21,,300.50,310.20,306.60
2022-06-02 00:00:00,CZ Binance,I'll be speaking at the Point Zero Forum in Zu...,1654163469,2022-06-02 09:51:09,ill speak point zero forum zurich switzerland ...,0.0000,0.716512,0.000000,06/02/2022,308.21,,300.50,310.20,306.60
2022-06-02 00:00:00,CZ Binance,Humbled to receive this #cryptowarrior gift fr...,1654153218,2022-06-02 07:00:18,humble receive cryptowarrior gift malaysia par...,0.4404,0.997130,0.000000,06/02/2022,308.21,,300.50,310.20,306.60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-06-04 00:00:00,CZ Binance,Tweets that hurt other people's finances are n...,1622799722,2021-06-04 09:42:02,tweet hurt people finance funny irresponsible,-0.5267,-0.999880,-0.125000,06/04/2021,410.64,,406.32,415.05,388.13
2021-06-04 00:00:00,CZ Binance,"Greater power with great responsibility, great...",1622798109,2021-06-04 09:15:09,great power great responsibility great great p...,0.9552,-0.973885,0.267857,06/04/2021,410.64,,406.32,415.05,388.13
2021-06-04 00:00:00,CZ Binance,Don't be manipulated. #HODL Not financial advice.,1622789397,2021-06-04 06:49:57,dont manipulate hodl financial advice,0.2924,-0.998453,0.000000,06/04/2021,410.64,,406.32,415.05,388.13
2021-06-03 00:00:00,CZ Binance,"Want to see photos from Miami, of real people ...",1622733347,2021-06-03 15:15:47,want see photo miami real people hang,0.0772,0.999000,0.200000,06/03/2021,409.11,,393.57,428.71,394.86


In [5]:
df.columns

Index(['author', 'text', 'timestamp', 'datetime', 'text_cleaned',
       'vader_sentiment', 'flair_sentiment', 'textBlob_sentiment', 'Date',
       'Close/Last', 'Volume', 'Open', 'High', 'Low'],
      dtype='object')

In [20]:
df.drop(columns=['author', 'text', 'datetime', 'Date', 'Volume', 'High', 'Low', 'timestamp'], inplace=True)

In [21]:
df

date,text_cleaned,vader_sentiment,flair_sentiment,textBlob_sentiment,Close/Last,Open,movement
2022-06-02,love tihings world hate jet lag,-0.7717,0.995204,0.066667,308.21,300.50,1
2022-06-02,malaysiaboleh,0.0000,0.583073,0.000000,308.21,300.50,1
2022-06-02,finish live interview cnbc squawk box studio haha,0.4588,-0.642573,0.168182,308.21,300.50,1
2022-06-02,ill speak point zero forum zurich switzerland ...,0.0000,0.716512,0.000000,308.21,300.50,1
2022-06-02,humble receive cryptowarrior gift malaysia par...,0.4404,0.997130,0.000000,308.21,300.50,1
...,...,...,...,...,...,...,...
2021-06-04,tweet hurt people finance funny irresponsible,-0.5267,-0.999880,-0.125000,410.64,406.32,1
2021-06-04,great power great responsibility great great p...,0.9552,-0.973885,0.267857,410.64,406.32,1
2021-06-04,dont manipulate hodl financial advice,0.2924,-0.998453,0.000000,410.64,406.32,1
2021-06-03,want see photo miami real people hang,0.0772,0.999000,0.200000,409.11,393.57,1


### Preprocessing

The text preprocessing is done in `binance_price_analyzer.ipynb`.

### Check the temporal distribution of tweets

In [22]:
fig = px.histogram(df, x=df.index)
fig.show()

### Create labels
Here we label each day as `1` if the price increased (or stayed flat) and `0` if it decreased.

In [23]:
# Label 1 if the close is at or above the open, 0 otherwise
df['movement'] = (df['Close/Last'] >= df['Open']).astype(int)

In [24]:
# Convert the index to datetime objects
df.index = pd.to_datetime(df.index)

In [25]:
# Aggregate to one row per day by averaging the numeric columns (use 'W' instead of 'D' for a weekly basis)
df_resample = df.resample('D').mean(numeric_only=True).dropna()

In [26]:
# Check the tweet distribution
fig = px.histogram(df_resample, x=df_resample.index)
fig.show()

### Generate Embeddings using Vectorizer

We start with the simplest way of turning a sentence into a vector.

#### Tfidf Vectorizer
Commonly used options include `CountVectorizer`, `TfidfVectorizer`, and pre-trained word embeddings such as `Word2Vec` and `GloVe`. To my understanding, the first two are the simplest way to create one vector directly for a whole **sentence**. The latter two produce a vector for each **word** of a sentence, and these then have to be combined, which adds complexity.

In [27]:
# Type conversion
df['text_cleaned'] = df['text_cleaned'].astype(str)

In [15]:
# We may tune the number of features by adjusting the following parameters
tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=100, ngram_range=(1,3))

In [16]:
# X = vectorizer.fit_transform(df_combine['text_cleaned'])
X1 = tfidf_vectorizer.fit_transform(df['text_cleaned'])

Checking the number of features is important because it determines the memory footprint and the training speed of the model, particularly for deep learning. In our experiments, a dense matrix of roughly 200,000 × 10,000 (samples × features) was the upper limit on a machine with 32 GB of RAM and an Intel i5 CPU.
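To see where a limit of that order comes from, the sketch below estimates the memory a dense `float64` matrix needs. The helper name `dense_matrix_gib` is made up for this example, and the estimate ignores any copies a model makes during training.

```python
import numpy as np

def dense_matrix_gib(n_rows: int, n_cols: int, dtype=np.float64) -> float:
    """Approximate size in GiB of a dense array with the given shape and dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1024**3

# The upper limit quoted above
print(f"{dense_matrix_gib(200_000, 10_000):.1f} GiB")
# A matrix of about 1,400 tweets by a few hundred features, as in this notebook
print(f"{dense_matrix_gib(1397, 543):.4f} GiB")
```

A matrix at the quoted limit already takes about half of a 32 GB machine before any model is trained, whereas the matrix used here is negligible.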

In [16]:
len(tfidf_vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

543

In [17]:
# tfidf_vectorizer.get_feature_names_out()
X1 = X1.toarray()

In [18]:
X1.shape

(1397, 543)

In [19]:
df_vec = pd.DataFrame(X1)

In [21]:
df_vec.set_index(pd.to_datetime(df.index), inplace=True)

In [22]:
df_vec

date,0,1,2,3,4,5,6,7,8,9,...,533,534,535,536,537,538,539,540,541,542
2022-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2022-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-06-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-06-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-06-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-06-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Resampling the embeddings
There can be multiple tweets in one day. Different aggregations can be used to combine them, such as `max()`, `min()` or `mean()`. The code below uses `max()`, which keeps the strongest TF-IDF weight of each term across the day's tweets. `mean()` would instead capture the meaning of all samples, but the sentiment may be diluted when there are many tweets in a day.
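The difference between the two aggregations can be seen on a toy example. The values below are hypothetical TF-IDF weights, and the column names `term_a` and `term_b` are made up for illustration.

```python
import pandas as pd

# Hypothetical TF-IDF weights for two terms: two tweets on June 1, one on June 2
toy = pd.DataFrame(
    {"term_a": [0.6, 0.0, 0.2], "term_b": [0.0, 0.8, 0.0]},
    index=pd.to_datetime(["2022-06-01", "2022-06-01", "2022-06-02"]),
)

daily_max = toy.resample("D").max()    # strongest weight of each term per day
daily_mean = toy.resample("D").mean()  # average weight, shrinks as tweets per day grow

print(daily_max)
print(daily_mean)
```

With `max()`, a term mentioned in a single tweet keeps its full weight for the day; with `mean()`, the same term is scaled down by the number of tweets posted that day.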

In [23]:
df_vec_resample = df_vec.resample('D').max().dropna(how='all')

In [24]:
df_vec_resample

date,0,1,2,3,4,5,6,7,8,9,...,533,534,535,536,537,538,539,540,541,542
2021-06-03,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2021-06-04,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2021-06-05,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2021-06-06,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2021-06-07,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.304608,0.0,0.0,0.506814,0.637186,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-05-29,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.506814,0.637186,0.0,0.0,0.0
2022-05-30,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.423463,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2022-05-31,0.493306,0.513813,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2022-06-01,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.491908,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0


In [27]:
df_train_test = pd.merge(df_vec_resample, df_resample, left_index=True, right_index=True, how='inner')
df_train_test.drop(columns=['vader_sentiment', 'flair_sentiment', 'textBlob_sentiment', 'Close/Last', 'Open'], inplace=True)

### Train Test Split

In [29]:
X = df_train_test.iloc[:, :-1]
y = df_train_test.iloc[:, -1:].values.ravel()

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression

In [31]:
def scoring(y_true, y_pred):
    """Print accuracy, F1, precision and recall for binary predictions."""
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)

    print(f'Accuracy score is: {accuracy}')
    print('---')
    print(f'F1 score is: {f1}')
    print(f'Precision score is: {precision}')
    print(f'Recall score is: {recall}')

In [32]:
# Grid search over the regularization strength C and the penalty type
grid = {"C": np.logspace(-10, 10, 7), "penalty": ["l1", "l2"]}  # l1 = lasso, l2 = ridge
logreg = LogisticRegression(fit_intercept=False, solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=5)
logreg_cv.fit(X_train, y_train)

print("tuned hyperparameters :(best parameters) ", logreg_cv.best_params_)
print("accuracy :", logreg_cv.best_score_)  # mean cross-validated accuracy

tuned hyperparameters :(best parameters)  {'C': 10000000000.0, 'penalty': 'l1'}
accuracy : 0.5592592592592592


In [33]:
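# Note: C and penalty here differ from the best parameters printed by the grid search
# (C=1e10, 'l1'); pass logreg_cv.best_params_ instead to refit with the tuned values.
log_reg = LogisticRegression(fit_intercept=False, C=4641588, solver='liblinear', penalty="l2")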
model_log = log_reg.fit(X_train, y_train)
log_test_preds = log_reg.predict(X_test)
scoring(y_test, log_test_preds)

Accuracy score is: 0.5588235294117647
---
F1 score is: 0.5161290322580646
Precision score is: 0.5333333333333333
Recall score is: 0.5


### Random Forest

In [34]:
grid = {'n_estimators': np.arange(1, 100, 2)}

In [35]:
forest_classifier = RandomForestClassifier()
forest_model = forest_classifier.fit(X_train, y_train)

In [36]:
# Grid search over the number of trees
forest_cv = GridSearchCV(forest_classifier, grid, cv=5)
forest_cv.fit(X_train, y_train)
print("tuned hyperparameters :(best parameters) ", forest_cv.best_params_)
print("accuracy :", forest_cv.best_score_)  # mean cross-validated accuracy

tuned hyperparameters :(best parameters)  {'n_estimators': 23}
accuracy : 0.5555555555555556


In [37]:
# Note: n_estimators=80 differs from the value printed by the grid search (23)
forest_classifier_tuned = RandomForestClassifier(n_estimators=80)
forest_model = forest_classifier_tuned.fit(X_train, y_train)
forest_preds = forest_model.predict(X_test)
scoring(y_test, forest_preds)


Accuracy score is: 0.6323529411764706
---
F1 score is: 0.626865671641791
Precision score is: 0.6
Recall score is: 0.65625


### SVM

In [38]:
clf = svm.SVC()
clf.fit(X_train, y_train)
svm_pred = clf.predict(X_test)
scoring(y_test, svm_pred)

Accuracy score is: 0.6029411764705882
---
F1 score is: 0.6666666666666666
Precision score is: 0.5510204081632653
Recall score is: 0.84375


### Deep Learning: Sequential Model

In [29]:
df

date,text_cleaned,vader_sentiment,flair_sentiment,textBlob_sentiment,Close/Last,Open,movement
2022-06-02,love tihings world hate jet lag,-0.7717,0.995204,0.066667,308.21,300.50,1
2022-06-02,malaysiaboleh,0.0000,0.583073,0.000000,308.21,300.50,1
2022-06-02,finish live interview cnbc squawk box studio haha,0.4588,-0.642573,0.168182,308.21,300.50,1
2022-06-02,ill speak point zero forum zurich switzerland ...,0.0000,0.716512,0.000000,308.21,300.50,1
2022-06-02,humble receive cryptowarrior gift malaysia par...,0.4404,0.997130,0.000000,308.21,300.50,1
...,...,...,...,...,...,...,...
2021-06-04,tweet hurt people finance funny irresponsible,-0.5267,-0.999880,-0.125000,410.64,406.32,1
2021-06-04,great power great responsibility great great p...,0.9552,-0.973885,0.267857,410.64,406.32,1
2021-06-04,dont manipulate hodl financial advice,0.2924,-0.998453,0.000000,410.64,406.32,1
2021-06-03,want see photo miami real people hang,0.0772,0.999000,0.200000,409.11,393.57,1


#### Generating token sequences with the Keras Tokenizer

In [30]:
tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(df['text_cleaned'])

In [31]:
max_length = max([len(s.split()) for s in df['text_cleaned']])

In [32]:
vocab_size = len(tokenizer_obj.word_index) + 1

In [34]:
df_deep = df[['text_cleaned', 'movement']]

In [36]:
X_deep_train, X_deep_test, y_deep_train, y_deep_test = train_test_split(df_deep.iloc[:, :-1], df_deep.iloc[:, -1:], test_size=0.2)

In [47]:
X_deep_train

Unnamed: 0_level_0,text_cleaned
date,Unnamed: 1_level_1
2022-05-23,alright time upgrade something say
2022-05-21,please help upvote good question downvote blat...
2021-09-09,happy birthday binancetr
2021-12-22,big step bnb move close dao structure exchange...
2022-05-15,always right perspective failure canwill happe...
...,...
2021-07-11,get taxi driver say thank make app honestly ma...
2022-01-22,suspect bad guy create account bank crypto exc...
2021-10-16,burn bnb
2021-07-18,look back pro improve time


In [50]:
X_deep_train_tokens = tokenizer_obj.texts_to_sequences(X_deep_train['text_cleaned'])
X_deep_test_tokens = tokenizer_obj.texts_to_sequences(X_deep_test['text_cleaned'])

X_deep_train_pad = pad_sequences(X_deep_train_tokens, maxlen=max_length, padding='post')
X_deep_test_pad = pad_sequences(X_deep_test_tokens, maxlen=max_length, padding='post')

In [76]:
X_deep_train_pad

array([[1390,    9,  456, ...,    0,    0,    0],
       [ 326,   37, 1415, ...,    0,    0,    0],
       [ 111,  836, 2611, ...,    0,    0,    0],
       ...,
       [ 149,    5,    0, ...,    0,    0,    0],
       [  63,   82, 2853, ...,    0,    0,    0],
       [   2,   39,  852, ...,    0,    0,    0]])

In [61]:
model = tf.keras.models.Sequential([
     tf.keras.layers.Embedding(vocab_size, 100, input_length=max_length),
     tf.keras.layers.GlobalAveragePooling1D(),
     tf.keras.layers.Dense(6, activation='relu'),
     tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
     loss='binary_crossentropy',  # matches the single sigmoid output and the 0/1 labels
     optimizer='adam',
     metrics=['accuracy']
)

In [63]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 29, 100)           308800    
_________________________________________________________________
global_average_pooling1d (Gl (None, 100)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 6)                 606       
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 7         
Total params: 309,413
Trainable params: 309,413
Non-trainable params: 0
_________________________________________________________________


In [62]:
h = model.fit(
     X_deep_train_pad, y_deep_train,
     validation_data=(X_deep_test_pad, y_deep_test),
     epochs=15,
     # Stop when the validation loss stops improving rather than the training accuracy
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)]
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15


### Visualization t-SNE

In [65]:
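# Only two components are plotted below, so two are computed
# (n_iter is named max_iter in scikit-learn 1.5 and later)
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)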


In [68]:
np.concatenate((X_deep_test_pad, X_deep_train_pad))

array([[  48,  406,  304, ...,    0,    0,    0],
       [   5,    0,    0, ...,    0,    0,    0],
       [ 402, 3087,    0, ...,    0,    0,    0],
       ...,
       [ 149,    5,    0, ...,    0,    0,    0],
       [  63,   82, 2853, ...,    0,    0,    0],
       [   2,   39,  852, ...,    0,    0,    0]])

In [69]:
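# Pad the tweets in their original order so that each t-SNE row lines up with the same row of df
# (concatenating the shuffled test and train arrays would misalign tweets and coordinates)
df_embeddings = pd.DataFrame(pad_sequences(tokenizer_obj.texts_to_sequences(df['text_cleaned']), maxlen=max_length, padding='post'))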

In [71]:
tsne_results = tsne.fit_transform(df_embeddings)

In [72]:
df_subset = df.copy()
df_subset['tsne-2d-one'] = tsne_results[:, 0]
df_subset['tsne-2d-two'] = tsne_results[:, 1]


In [73]:
df_subset

date,text_cleaned,vader_sentiment,flair_sentiment,textBlob_sentiment,Close/Last,Open,movement,tsne-2d-one,tsne-2d-two
2022-06-02,love tihings world hate jet lag,-0.7717,0.995204,0.066667,308.21,300.50,1,0.614203,0.791021
2022-06-02,malaysiaboleh,0.0000,0.583073,0.000000,308.21,300.50,1,-13.264234,2.525984
2022-06-02,finish live interview cnbc squawk box studio haha,0.4588,-0.642573,0.168182,308.21,300.50,1,11.259660,-1.540276
2022-06-02,ill speak point zero forum zurich switzerland ...,0.0000,0.716512,0.000000,308.21,300.50,1,-0.562552,-1.592566
2022-06-02,humble receive cryptowarrior gift malaysia par...,0.4404,0.997130,0.000000,308.21,300.50,1,-11.062160,-1.518840
...,...,...,...,...,...,...,...,...,...
2021-06-04,tweet hurt people finance funny irresponsible,-0.5267,-0.999880,-0.125000,410.64,406.32,1,10.381190,3.661389
2021-06-04,great power great responsibility great great p...,0.9552,-0.973885,0.267857,410.64,406.32,1,-0.635920,-5.695401
2021-06-04,dont manipulate hodl financial advice,0.2924,-0.998453,0.000000,410.64,406.32,1,-9.905621,-3.049061
2021-06-03,want see photo miami real people hang,0.0772,0.999000,0.200000,409.11,393.57,1,8.728004,6.010097


In [74]:
len(df_subset.text_cleaned.value_counts())

1332

In [75]:
fig = px.scatter(df_subset, x='tsne-2d-one', y='tsne-2d-two', color='movement', hover_data=['text_cleaned'])
fig.show()

#### PCA

In [51]:
pca = PCA(n_components=3)
pca_result = pca.fit_transform(X1)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1] 
df_subset['pca-three'] = pca_result[:,2]


In [52]:
fig = px.scatter_3d(df_subset, x='pca-one', y='pca-two', z='pca-three', color='vader_sentiment', hover_data=['text_cleaned'])
fig.show()

### Findings
- The deep learning model stops improving after a few epochs, which suggests the data may be insufficient.
- The deep learning model does not give a clear accuracy benefit. Likely reasons include the limited amount of data and the quality of the tweets. Generally people use around 1.5 GB of tweets and label them manually. With data of this size and complexity, a simpler model is a better fit.
- Logistic Regression and SVM do surprisingly better than deep learning, with test accuracies of about 56% and 60%; the Random Forest reaches about 63%.
- Manual labelling is likely to be critical for good accuracy.
- Predicting price movement is not an easy task, as the underlying assumption may not hold (i.e. positive sentiment can still be followed by a drop in price).
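An accuracy near 60% is easier to judge against a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` (imported at the top of the notebook but otherwise unused) on synthetic labels. `X_toy` and `y_toy` are made-up stand-ins for `X_train` and `y_train`, so the printed number says nothing about the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_toy = rng.random((100, 5))            # features are ignored by the dummy model
y_toy = np.array([1] * 55 + [0] * 45)   # synthetic labels, 55% "up" days

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

On the real split, fitting the baseline on `X_train, y_train` and scoring it on `X_test, y_test` would give the reference point against which the classifier accuracies above should be compared.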

### Suggestions
- Introduce a label for neutral sentiment instead of using a binary label.
- Consider another metric for labelling the data, such as positive when `price_change` is greater than a certain threshold and negative when it falls below the opposite threshold. How large `price_change` must be can then be tuned.
- Properly labelled data would likely boost accuracy considerably. It would also allow more training samples, since the vectors would no longer need to be merged into one per day and labelled by price movement.
- Try pre-trained embeddings such as `GloVe` and `Word2Vec`. Reference: https://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
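The threshold-based labelling with a neutral class suggested above could look roughly like the sketch below. The function name `label_movement`, the use of a relative price change, and the 1% default threshold are all assumptions made for illustration; the prices are hypothetical.

```python
import numpy as np
import pandas as pd

def label_movement(open_price: pd.Series, close_price: pd.Series, threshold: float = 0.01) -> pd.Series:
    """Return 1 (up), -1 (down) or 0 (neutral) from the relative daily price change."""
    price_change = (close_price - open_price) / open_price
    labels = np.where(price_change > threshold, 1,
                      np.where(price_change < -threshold, -1, 0))
    return pd.Series(labels, index=open_price.index, name="movement")

# Hypothetical prices for three days: clear rise, nearly flat, clear drop
prices = pd.DataFrame({"Open": [300.0, 400.0, 410.0], "Close/Last": [308.0, 401.0, 395.0]})
prices["movement"] = label_movement(prices["Open"], prices["Close/Last"], threshold=0.01)
print(prices)
```

The `threshold` argument is the value to tune: a larger threshold moves more days into the neutral class and leaves fewer, but clearer, up and down examples.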

### Reference

For building a full-fledged NLP model, https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/blocks/bert-encoder is very helpful for understanding how advanced models such as BERT take semantic meaning into account.