# Twitter Sentiment Analysis - 04 Modeling

The stock market draws investors seeking to maximize profit, so interest in stock market prediction, from both the technical and the financial side, keeps growing. Prediction is notoriously difficult, however, because the market depends on many factors that are hard to anticipate or quantify, such as political events and the influence of social media platforms like Twitter.

In this final part of the project, we combine the stock data and its features with a vectorized representation of the tweets for December 2022 to predict whether the adjusted closing price at the end of a trading day is higher or lower than on the previous trading day. The models considered are Logistic Regression, KNN, SVM, Random Forest, and K-means Clustering.

Two types of predictions: 
1. Using overall daily tweet sentiment scores to predict adjusted closing price
2. Using vectorized representation of tweets to predict adjusted closing price

**Link(s) to previous notebook(s)**: \
00_Historical_Data_2014: https://github.com/parisvu07/Springboard_Data_Science/tree/main/Capstone_2_Twitter_Sentiment_Analysis \
01_Data_Wrangling: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/01_Data_Wrangling.ipynb \
02_Exploratory_Data_Analysis: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/02_Exploratory_Data_Analysis.ipynb \
03_Preprocessing_and_Training_Data: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/03_Preprocessing_and_Training_Data.ipynb

Quick fix for "Unable to render rich display": copy and paste the notebook link to https://nbviewer.org

## 4.1 Importing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warning messages to ensure clean outputs
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA

import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Word2Vec
LabeledSentence = gensim.models.doc2vec.TaggedDocument
from tqdm import tqdm
import string
import spacy
np.random.seed(42)

from gensim.models import KeyedVectors 
from nltk import word_tokenize
from nltk.corpus import stopwords

In [2]:
#Importing stock data from notebook "02_Exploratory_Data_Analysis"
stock_data = pd.read_csv('03_stock_data.csv', encoding='latin-1')
stock_data = stock_data.set_index('Dates')
stock_data.head()

Dates,Adj Close,stock_volume,%_change_Open,%_change_High,%_change_Low,%_change_Close,%_change_Volume,twitter_volume
2022-12-01,148.309998,71250400,,,,,,1451
2022-12-02,147.809998,65447400,-1.518116,-0.757731,-0.654803,-0.337132,-8.144516,1551
2022-12-05,146.630005,68826400,1.240064,1.972972,0.082396,-0.798317,5.162925,1738
2022-12-06,142.910004,64727200,-0.473707,-2.398619,-2.641151,-2.536999,-5.955854,2072
2022-12-07,140.940002,69721100,-3.318151,-2.66803,-1.352874,-1.378491,7.715304,1912


In [3]:
#Importing tweet data from previous notebook "03_Preprocessing_and_Training_Data"
tweets_data = pd.read_csv('03_tweets_data.csv', lineterminator='\n')
tweets_data = tweets_data.dropna()
tweets_data.head()

Unnamed: 0,Dates,Time,user,likes,source,text,Subjectivity,Polarity,Analysis,Sentiment,...,mention_count,punct_count,avg_wordlength,unique_vs_words,stopwords_vs_words,clean_text,tokens,tweet_without_stopwords,tweet_lemmatized,vec
0,2022-12-30,20:29:43,LlcBillionaire,0,Twitter Web App,New Yearâs food traditions around the world,0.454545,0.136364,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.714286,1.0,0.142857,new yearâ food tradition around the world,"['new', 'yearâ', 'food', 'tradition', 'around'...",new yearâ food tradition around world,"['new', 'yearâ\x80\x99', 'food', 'tradition', ...",[-2.31491498e-01 6.51704955e-02 1.90656667e-...
1,2022-12-30,20:29:32,skitontop1,0,Twitter Web App,Entries &amp; exits Daily! \nDiscord link belo...,0.5,0.3,Positive,1.0,...,0,"{'! count': 1, '"" count': 0, '# count': 0, '$ ...",7.428571,1.0,0.0,entrie amp exit daily \ndi cord link belo...,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",entrie amp exit daily di cord link belowð,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",[ 3.40714295e-02 9.44165736e-02 -8.08280031e-...
2,2022-12-30,20:29:28,StockJobberOG,0,Twitter Web App,$AAPL $MSFT $SPY $TSLA $AMZN $BRK.B\n\n,0.0,0.0,Neutral,0.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.166667,1.0,0.0,aapl m ft py t la amzn brk b\n\n,"['aapl', 'm', 'ft', 'py', 't', 'la', 'amzn', '...",aapl ft py la amzn brk b,"['aapl', 'ft', 'py', 'la', 'amzn', 'brk', 'b']",[-0.11772024 0.12171325 0.28293075 -0.131354...
3,2022-12-30,20:29:11,LlcBillionaire,0,Twitter Web App,The biggest â and maybe the best â financi...,0.15,0.5,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",5.2,0.9,0.433333,the bigge t â and maybe the be t â financi...,"['the', 'bigge', 't', 'â', 'and', 'maybe', 'th...",bigge â maybe â financial olution hould u ...,"['bigge', 'â\x80\x94', 'maybe', 'â\x80\x94', '...",[-5.57102112e-02 1.48102408e-01 7.54352845e-...
4,2022-12-30,20:28:29,skitontop1,0,Twitter Web App,"# Chatroom interms of \n\nalert,calls,Analysis...",1.0,0.6,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 1, '$ ...",9.2,1.0,0.2,chatroom interm of \n\nalert call analy i ...,"['chatroom', 'interm', 'of', 'alert', 'call', ...",chatroom interm alert call analy,"['chatroom', 'interm', 'alert', 'call', 'analy']",[-1.83531667e-01 2.19245007e-01 -1.40175003e-...


In [4]:
#Importing merged dataframes from previous notebook "03_Preprocessing_and_Training_Data"
merged_dataframes = pd.read_csv('03_merged_dataframes.csv', lineterminator='\n')
merged_dataframes = merged_dataframes.set_index('Dates')
merged_dataframes.head()

Dates,Adj Close,stock_volume,twitter_volume,likes,Subjectivity,Polarity,Sentiment,open_trend,high_trend,low_trend,close_trend,volume_trend,Sentiment_Score
2022-12-01,148.309998,71250400,1451,3.35827,0.341031,0.16663,0.418668,0,0,0,0,0,Positive
2022-12-02,147.809998,65447400,1551,2.422508,0.336724,0.179263,0.434727,0,0,0,0,0,Positive
2022-12-05,146.630005,68826400,1738,16.589788,0.285005,0.119601,0.320138,1,1,1,0,1,Negative
2022-12-06,142.910004,64727200,2072,3.363636,0.308533,0.138852,0.345839,0,0,0,0,0,Negative
2022-12-07,140.940002,69721100,1912,3.910183,0.306545,0.141816,0.385379,0,0,0,0,1,Negative


## 4.2 Method 1: Using overall daily tweet sentiment score to predict adjusted closing price

This is a binary classification problem in supervised learning. The following models are considered (K-means Clustering is an unsupervised method rather than a classifier):

Logistic Regression \
K-Nearest Neighbor (KNN) \
Support vector machine (SVM) \
Random Forest \
K-means Clustering 

Evaluating a model by training and testing on the same dataset can lead to overfitting, so model evaluation is based on splitting the dataset into a training set and a validation set. However, the result then depends on the random choice of the (train, validation) pair. To overcome this, we use cross-validation: under the k-fold CV approach, the training set is split into k smaller sets (folds), a model is trained on k-1 of the folds, and it is validated on the remaining fold. This is repeated so that each fold serves once as the validation set, and the scores are averaged.
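
As a minimal sketch of the k-fold procedure described above, the loop below runs 5-fold cross-validation by hand. It uses synthetic data rather than this notebook's dataset; the data, `n_splits=5`, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's stock/tweet features)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Stratified folds keep the class balance roughly equal in every fold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_demo, y_demo):
    # Fit on k-1 folds, validate on the held-out fold
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print("Mean accuracy:", np.mean(fold_scores))
print("Std of accuracy:", np.std(fold_scores))
```

Because the scaler sits inside the pipeline, it is refitted on the training folds only, so no information from the validation fold leaks into preprocessing.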

Classification/ Confusion Matrix: This matrix summarizes the correct and incorrect classifications that a classifier produced for a given dataset. Rows and columns of the matrix correspond to the true and predicted classes, respectively. The two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of misclassifications. The matrix therefore provides estimates of the true classification and misclassification rates.
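
As a small worked example, the rates can be read directly off a 2x2 matrix. The sketch below uses the confusion matrix printed for the logistic regression model in section 4.2.2; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from section 4.2.2 (rows = true class, columns = predicted class)
cm = np.array([[3, 1],
               [0, 1]])

# For a binary problem, the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()                 # share of correct predictions
misclassification_rate = (fp + fn) / cm.sum()   # share of wrong predictions
recall = tp / (tp + fn)                         # true "up" days that were found
precision = tp / (tp + fp)                      # predicted "up" days that were right

print(accuracy, misclassification_rate, recall, precision)
```

With only five test observations, a single misclassified day moves accuracy by 0.2, which is why these rates should be read with caution.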


In [5]:
#Transforming "Sentiment_Score" in the merged_dataframes dataset into binary codes
merged_dataframes = pd.get_dummies(merged_dataframes, columns = ['Sentiment_Score'])
merged_dataframes = merged_dataframes.drop(['Polarity'], axis=1)
merged_dataframes.head()

Dates,Adj Close,stock_volume,twitter_volume,likes,Subjectivity,Sentiment,open_trend,high_trend,low_trend,close_trend,volume_trend,Sentiment_Score_Negative,Sentiment_Score_Positive
2022-12-01,148.309998,71250400,1451,3.35827,0.341031,0.418668,0,0,0,0,0,0,1
2022-12-02,147.809998,65447400,1551,2.422508,0.336724,0.434727,0,0,0,0,0,0,1
2022-12-05,146.630005,68826400,1738,16.589788,0.285005,0.320138,1,1,1,0,1,1,0
2022-12-06,142.910004,64727200,2072,3.363636,0.308533,0.345839,0,0,0,0,0,1,0
2022-12-07,140.940002,69721100,1912,3.910183,0.306545,0.385379,0,0,0,0,1,1,0


### 4.2.1 Train Test Split

A value of 1 means the adjusted closing price rose compared with the previous trading day's close; 0 means it fell.
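
One way such a label could be derived from the price series is sketched below. The helper `label_trend` is hypothetical, and the previous notebook may have computed `close_trend` differently; the prices are the first five `Adj Close` values shown earlier.

```python
import pandas as pd

def label_trend(prices: pd.Series) -> pd.Series:
    """Return 1 where the price rose versus the previous row, else 0.

    Hypothetical helper: the first row has no previous value and gets 0.
    """
    return (prices.diff() > 0).astype(int)

# First five adjusted closing prices from the stock data shown earlier
adj_close = pd.Series(
    [148.309998, 147.809998, 146.630005, 142.910004, 140.940002],
    index=pd.to_datetime(
        ["2022-12-01", "2022-12-02", "2022-12-05", "2022-12-06", "2022-12-07"]
    ),
    name="Adj Close",
)

close_trend_demo = label_trend(adj_close)
print(close_trend_demo.tolist())

# A toy series with a rise on the second day (made-up values)
toy_trend = label_trend(pd.Series([1.0, 2.0, 1.5]))
print(toy_trend.tolist())
```

The price fell on each of these five days, so every label in `close_trend_demo` is 0, which agrees with the `close_trend` column in the table above.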

In [6]:
#Defining our X and y 
X = merged_dataframes.drop('close_trend', axis=1)
y = merged_dataframes['close_trend'].values.reshape(-1,1)
y[:10]

array([[0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0]])

In [7]:
#Splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### 4.2.2 Logistic Regression

In [8]:
#Fitting a scaled logistic regression pipeline for each value of C
C_param_range = [0.001,0.01,0.1,1,10,100]

table = pd.DataFrame(columns = ['C_parameter','Accuracy'])
table['C_parameter'] = C_param_range

for j, C in enumerate(C_param_range):

    # Apply logistic regression model to training data
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty = 'l2', C = C, random_state = 40)
    )
    pipe.fit(X_train, y_train)

    # Predict using model
    y_pred_lr = pipe.predict(X_test)

    # Saving accuracy score in table
    table.iloc[j,1] = accuracy_score(y_test,y_pred_lr)

table

Unnamed: 0,C_parameter,Accuracy
0,0.001,0.8
1,0.01,0.8
2,0.1,1.0
3,1.0,0.8
4,10.0,0.8
5,100.0,0.8
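
The loop above compares values of C by accuracy on the test set. A common alternative is to choose C by cross-validation on the training data only and keep the test set for a single final check. The sketch below shows this with `GridSearchCV` on synthetic data; the data, `cv=5`, and the variable names are assumptions, not this notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's merged dataframe)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# make_pipeline names each step after its lowercased class name,
# so the C parameter is addressed as "logisticregression__C"
search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    param_grid={"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_te, y_te)
print(best_C, test_accuracy)
```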


In [9]:
#Confusion matrix and accuracy for the last pipeline fitted in the loop above (C = 100)
cnf_matrix = confusion_matrix(y_test,y_pred_lr)
print(cnf_matrix)
accuracy_lr = pipe.score(X_test,y_test)

print(accuracy_lr)

[[3 1]
 [0 1]]
0.8


In [10]:
cv_scores_test= cross_val_score(pipe,X_test,y_test,cv=2,scoring='roc_auc')
cv_scores_train= cross_val_score(pipe,X_train,y_train,cv=2,scoring='roc_auc')
print(cv_scores_test)
cv_scores_lr_test= cv_scores_test.mean()
cv_scores_lr_train= cv_scores_train.mean()
cv_scores_std_test_lr= cv_scores_test.std()
print ('Mean cross validation test score: ' +str(cv_scores_lr_test))
print ('Mean cross validation train score: ' +str(cv_scores_lr_train))
print ('Standard deviation in cv test scores: ' +str(cv_scores_std_test_lr))

[nan nan]
Mean cross validation test score: nan
Mean cross validation train score: 0.8916666666666667
Standard deviation in cv test scores: nan


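We can see that the merged_dataframes data (stock data combined with daily tweet sentiment) does not yield meaningful results: the test split contains only five trading days, so the accuracy values are coarse and the cross-validated ROC AUC on the test split is undefined (nan). We could have created more binary codes for other features such as stock_volume and twitter_volume. But because our focus is on Natural Language Processing, we will stop here and move on to a text classification model using Word2Vec.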
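
To illustrate why a ROC AUC score can come back as nan: the metric is undefined when the true labels being scored contain a single class, which easily happens when a handful of rows is split into folds. The helper `safe_roc_auc` below is hypothetical and only makes that condition explicit.

```python
import math
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or nan when only one class is present (hypothetical helper)."""
    if len(set(y_true)) < 2:
        # AUC compares scores of positives against negatives,
        # so it cannot be computed without both classes
        return float("nan")
    return float(roc_auc_score(y_true, y_score))

# A fold holding only "down" days: the score is undefined
one_class_auc = safe_roc_auc([0, 0], [0.2, 0.7])

# A fold with both classes, where the positive gets the highest score
two_class_auc = safe_roc_auc([0, 0, 1], [0.1, 0.4, 0.9])

print(one_class_auc, two_class_auc)
```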

## 4.3 Method 2: Using vectorized representation of tweets to predict adjusted closing price

Word2Vec is a collection of algorithms which can produce word embeddings. Word embeddings are vectors which describe the semantic meaning of words as points in space.
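
As a toy illustration of "meaning as points in space", words used in similar contexts end up with vectors that point in similar directions, which is commonly measured with cosine similarity. The three-dimensional vectors below are made up for the example; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-dimensional "embeddings" (assumption: illustrative values only)
toy_embeddings = {
    "stock": np.array([0.9, 0.1, 0.0]),
    "share": np.array([0.8, 0.2, 0.1]),
    "pizza": np.array([0.0, 0.1, 0.9]),
}

sim_related = cosine_similarity(toy_embeddings["stock"], toy_embeddings["share"])
sim_unrelated = cosine_similarity(toy_embeddings["stock"], toy_embeddings["pizza"])
print(sim_related, sim_unrelated)
```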



In [11]:
#loading pre-trained embedding (GloVe Twitter vectors, 200 dimensions, saved as gensim KeyedVectors)
wv = KeyedVectors.load('glove-twitter-200.kv')

In [12]:
tweets_data.head()

Unnamed: 0,Dates,Time,user,likes,source,text,Subjectivity,Polarity,Analysis,Sentiment,...,mention_count,punct_count,avg_wordlength,unique_vs_words,stopwords_vs_words,clean_text,tokens,tweet_without_stopwords,tweet_lemmatized,vec
0,2022-12-30,20:29:43,LlcBillionaire,0,Twitter Web App,New Yearâs food traditions around the world,0.454545,0.136364,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.714286,1.0,0.142857,new yearâ food tradition around the world,"['new', 'yearâ', 'food', 'tradition', 'around'...",new yearâ food tradition around world,"['new', 'yearâ\x80\x99', 'food', 'tradition', ...",[-2.31491498e-01 6.51704955e-02 1.90656667e-...
1,2022-12-30,20:29:32,skitontop1,0,Twitter Web App,Entries &amp; exits Daily! \nDiscord link belo...,0.5,0.3,Positive,1.0,...,0,"{'! count': 1, '"" count': 0, '# count': 0, '$ ...",7.428571,1.0,0.0,entrie amp exit daily \ndi cord link belo...,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",entrie amp exit daily di cord link belowð,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",[ 3.40714295e-02 9.44165736e-02 -8.08280031e-...
2,2022-12-30,20:29:28,StockJobberOG,0,Twitter Web App,$AAPL $MSFT $SPY $TSLA $AMZN $BRK.B\n\n,0.0,0.0,Neutral,0.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.166667,1.0,0.0,aapl m ft py t la amzn brk b\n\n,"['aapl', 'm', 'ft', 'py', 't', 'la', 'amzn', '...",aapl ft py la amzn brk b,"['aapl', 'ft', 'py', 'la', 'amzn', 'brk', 'b']",[-0.11772024 0.12171325 0.28293075 -0.131354...
3,2022-12-30,20:29:11,LlcBillionaire,0,Twitter Web App,The biggest â and maybe the best â financi...,0.15,0.5,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",5.2,0.9,0.433333,the bigge t â and maybe the be t â financi...,"['the', 'bigge', 't', 'â', 'and', 'maybe', 'th...",bigge â maybe â financial olution hould u ...,"['bigge', 'â\x80\x94', 'maybe', 'â\x80\x94', '...",[-5.57102112e-02 1.48102408e-01 7.54352845e-...
4,2022-12-30,20:28:29,skitontop1,0,Twitter Web App,"# Chatroom interms of \n\nalert,calls,Analysis...",1.0,0.6,Positive,1.0,...,0,"{'! count': 0, '"" count': 0, '# count': 1, '$ ...",9.2,1.0,0.2,chatroom interm of \n\nalert call analy i ...,"['chatroom', 'interm', 'of', 'alert', 'call', ...",chatroom interm alert call analy,"['chatroom', 'interm', 'alert', 'call', 'analy']",[-1.83531667e-01 2.19245007e-01 -1.40175003e-...


Word2Vec can't create a vector for a word that is not in its vocabulary. Because of this, we keep only the words found in the model's vocabulary ("if word in wv.key_to_index") when creating the full list of word vectors.
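
The vocabulary check also matters when turning a whole tweet into one vector. A common approach, sketched below, is to average the vectors of the in-vocabulary tokens and skip the rest. The helper `tweet_vector` and the tiny lookup table are hypothetical stand-ins for the loaded embeddings, and whether the `vec` column was built this way is an assumption.

```python
import numpy as np

# Tiny stand-in for the embedding lookup (assumption: made-up 2-d vectors)
toy_vectors = {
    "apple": np.array([1.0, 0.0]),
    "stock": np.array([0.0, 1.0]),
}

def tweet_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "aapl" is out of vocabulary here and is skipped
vec_mixed = tweet_vector(["apple", "stock", "aapl"], toy_vectors, dim=2)

# No known tokens at all: fall back to a zero vector
vec_unknown = tweet_vector(["aapl", "brk"], toy_vectors, dim=2)

print(vec_mixed, vec_unknown)
```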

In [13]:
# Grab all the tweets
tweets = tweets_data['clean_text']
print(tweets.shape)
tweets_data[tweets_data.isnull().any(axis=1)]

(117841,)


Unnamed: 0,Dates,Time,user,likes,source,text,Subjectivity,Polarity,Analysis,Sentiment,...,mention_count,punct_count,avg_wordlength,unique_vs_words,stopwords_vs_words,clean_text,tokens,tweet_without_stopwords,tweet_lemmatized,vec


In [14]:
# Create a list of strings, where each string is a tweet
tweets_list = [tweet for tweet in tweets]

# Collapse the list of strings into a single long string for processing
big_tweet_string = ' '.join(tweets_list)

# Tokenize the string into words (word_tokenize is imported in cell 1)
tokens = word_tokenize(big_tweet_string)

# Remove non-alphabetic tokens, such as punctuation
words = [word.lower() for word in tokens if word.isalpha()]

# Filter out stopwords
stop_words = set(stopwords.words('english'))

words = [word for word in words if word not in stop_words]

# Print first 10 words
words[:10]

['new',
 'food',
 'tradition',
 'around',
 'world',
 'entrie',
 'amp',
 'exit',
 'daily',
 'di']

In [15]:
# Filter the list of vectors to include only those that Word2Vec has a vector for
vector_list = [wv[word] for word in words if word in wv.key_to_index]

# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in wv.key_to_index]

# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)

# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
new,0.27554,0.15505,-0.39506,0.35,0.018967,-0.43623,0.65921,-0.17615,-0.28261,-0.50848,...,0.067459,-0.17782,0.049174,0.26724,-0.061817,0.34782,-0.58347,-0.3004,0.28612,-0.063445
food,-0.69175,-0.14259,0.38653,-0.23141,-0.20408,-0.21565,0.77839,0.002269,-0.072446,-0.60134,...,-0.25049,-0.33623,0.18491,-0.48235,0.31425,0.24499,-0.24404,0.080309,0.3406,0.70451
tradition,-0.46827,-0.077617,0.37846,0.035308,-0.092955,-0.27471,0.39512,-0.16638,0.12507,0.04185,...,0.38377,-0.020489,0.80381,-0.17868,0.05453,0.2103,0.70303,-0.29521,0.29471,-0.60142
around,-0.54024,-0.17328,0.49958,-0.2198,0.18734,0.45666,0.86513,-0.28611,-0.45031,0.46856,...,0.2157,0.20454,-0.50304,-0.14797,0.25776,0.26054,0.32295,0.18986,0.022764,0.073641
world,0.035771,0.62946,0.27443,-0.36455,0.39189,-0.41298,0.12398,-0.34995,0.27725,0.000376,...,0.43318,-0.23037,0.019838,-0.21725,0.16818,0.61857,0.009801,0.11341,0.029805,-0.61934


### 4.3.1 Dimensionality Reduction with t-SNE
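
A hedged sketch of what reducing 200-dimensional word vectors to two dimensions with t-SNE could look like is given below. It uses random vectors as stand-ins for the embeddings; the sample size, `perplexity=10`, and the variable names are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Random stand-ins for 200-dimensional word vectors (assumption: not real embeddings)
demo_vectors = rng.normal(size=(60, 200))

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedded = tsne.fit_transform(demo_vectors)

# One 2-d point per input vector, ready for a scatter plot
print(embedded.shape)
```

With real embeddings, each row of `embedded` could be plotted and labeled with its word to inspect which words cluster together.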

In [None]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

In [None]:
#metrics was not imported in cell 1, so import it here
from sklearn import metrics

predicted_y_test = classifier.predict(X_test)
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted_y_test))
print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted_y_test, average='micro'))
print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted_y_test, average='micro'))