# Twitter Sentiment Analysis - 04 Modeling

The stock market is where investors seek to maximize their potential profits, and consequently interest in stock market prediction, from both the technical and the financial side, keeps rising. However, stock market prediction is notoriously challenging because the market depends on diverse factors that are hard to predict and hard to model, such as political variables and the effect of social media platforms such as Twitter.

In this final part of the project, we combine the stock data and its features with vectorized representations of the tweets for December 2022 to predict whether the adjusted closing price at the end of a trading day is greater than or less than that of the previous trading day. The models run are KNN, Logistic Regression, Decision Tree, Random Forest, SVM, and ANN.

**Link(s) to previous notebook(s)**: \
00_Historical_Data_2014: https://github.com/parisvu07/Springboard_Data_Science/tree/main/Capstone_2_Twitter_Sentiment_Analysis \
01_Data_Wrangling: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/01_Data_Wrangling.ipynb \
02_Exploratory_Data_Analysis: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/02_Exploratory_Data_Analysis.ipynb \
03_Preprocessing_and_Training_Data: https://github.com/parisvu07/Springboard_Data_Science/blob/main/Capstone_2_Twitter_Sentiment_Analysis/03_Preprocessing_and_Training_Data.ipynb

Quick fix for "Unable to render rich display": copy and paste the notebook link to https://nbviewer.org

## 4.1 Importing

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warning messages to ensure clean outputs
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Word2Vec
LabeledSentence = gensim.models.doc2vec.TaggedDocument
from tqdm import tqdm
import string
import spacy
np.random.seed(42)

In [4]:
#Importing stock data from notebook "02_Exploratory_Data_Analysis"
stock_data = pd.read_csv('03_stock_data.csv', encoding='latin-1')
stock_data = stock_data.drop('Time', axis=1)
stock_data = stock_data.set_index('Dates')
stock_data.head()

Unnamed: 0,Adj Close,stock_volume,%_change_Open,%_change_High,%_change_Low,%_change_Close,%_change_Volume,twitter_volume
0,148.309998,71250400,,,,,,1451
1,147.809998,65447400,-1.518116,-0.757731,-0.654803,-0.337132,-8.144516,1551
2,146.630005,68826400,1.240064,1.972972,0.082396,-0.798317,5.162925,1738
3,142.910004,64727200,-0.473707,-2.398619,-2.641151,-2.536999,-5.955854,2072
4,140.940002,69721100,-3.318151,-2.66803,-1.352874,-1.378491,7.715304,1912


In [5]:
#Importing tweet data from previous notebook "03_Preprocessing_and_Training_Data"
tweets_data = pd.read_csv('03_tweets_data.csv', lineterminator='\n')
tweets_data.head()

Unnamed: 0,Dates,Time,user,source,text,Subjectivity,Polarity,Analysis,Sentiment,char_count,...,mention_count,punct_count,avg_wordlength,avg_sentlength,unique_vs_words,clean_text,tokens,tweet_without_stopwords,tweet_lemmatized,vec
0,2022-12-30,20:29:43,LlcBillionaire,Twitter Web App,10 New Yearâs food traditions around the world,0.454545,0.136364,Positive,1.0,49,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.125,8.0,1.0,new yearâ food tradition around the world,"['new', 'yearâ', 'food', 'tradition', 'around'...",new yearâ food tradition around world,"['new', 'yearâ\x80\x99', 'food', 'tradition', ...",[-2.31491498e-01 6.51704955e-02 1.90656667e-...
1,2022-12-30,20:29:32,skitontop1,Twitter Web App,Entries &amp; exits Daily! \nDiscord link belo...,0.5,0.3,Positive,1.0,52,...,0,"{'! count': 1, '"" count': 0, '# count': 0, '$ ...",7.428571,3.5,1.0,entrie amp exit daily \ndi cord link belo...,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",entrie amp exit daily di cord link belowð,"['entrie', 'amp', 'exit', 'daily', 'di', 'cord...",[ 3.40714295e-02 9.44165736e-02 -8.08280031e-...
2,2022-12-30,20:29:28,StockJobberOG,Twitter Web App,$AAPL $MSFT $SPY $TSLA $AMZN $BRK.B\n\n,0.0,0.0,Neutral,0.0,37,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",6.166667,6.0,1.0,aapl m ft py t la amzn brk b\n\n,"['aapl', 'm', 'ft', 'py', 't', 'la', 'amzn', '...",aapl ft py la amzn brk b,"['aapl', 'ft', 'py', 'la', 'amzn', 'brk', 'b']",[-0.11772024 0.12171325 0.28293075 -0.131354...
3,2022-12-30,20:29:11,LlcBillionaire,Twitter Web App,The biggest â and maybe the best â financi...,0.15,0.5,Positive,1.0,160,...,0,"{'! count': 0, '"" count': 0, '# count': 0, '$ ...",5.16129,31.0,0.903226,the bigge t â and maybe the be t â financi...,"['the', 'bigge', 't', 'â', 'and', 'maybe', 'th...",bigge â maybe â financial olution hould u ...,"['bigge', 'â\x80\x94', 'maybe', 'â\x80\x94', '...",[-5.57102112e-02 1.48102408e-01 7.54352845e-...
4,2022-12-30,20:28:29,skitontop1,Twitter Web App,"#1 Chatroom interms of \n\nalert,calls,Analysi...",1.0,0.6,Positive,1.0,47,...,0,"{'! count': 0, '"" count': 0, '# count': 1, '$ ...",9.4,5.0,1.0,chatroom interm of \n\nalert call analy i ...,"['chatroom', 'interm', 'of', 'alert', 'call', ...",chatroom interm alert call analy,"['chatroom', 'interm', 'alert', 'call', 'analy']",[-1.83531667e-01 2.19245007e-01 -1.40175003e-...


In [7]:
#Importing merged dataframes from previous notebook "03_Preprocessing_and_Training_Data"
merged_dataframes = pd.read_csv('03_merged_dataframes.csv', lineterminator='\n')
merged_dataframes.head()

Unnamed: 0,Adj Close,stock_volume,twitter_volume,likes,Subjectivity,Polarity,Sentiment,open_trend,high_trend,low_trend,close_trend,volume_trend,Sentiment_Score
0,148.309998,71250400,1451,3.35827,0.341031,0.16663,0.418668,0,0,0,0,0,Positive
1,147.809998,65447400,1551,2.422508,0.336724,0.179263,0.434727,0,0,0,0,0,Positive
2,146.630005,68826400,1738,16.589788,0.285005,0.119601,0.320138,1,1,1,0,1,Negative
3,142.910004,64727200,2072,3.363636,0.308533,0.138852,0.345839,0,0,0,0,0,Negative
4,140.940002,69721100,1912,3.910183,0.306545,0.141816,0.385379,0,0,0,0,1,Negative
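
The prediction target described in the introduction (whether the adjusted close rose or fell relative to the previous trading day) can be derived directly from a price column. The sketch below is only an illustration: the prices and dates are hypothetical, the column name `up` is made up for this example, and it is an assumption that a label of 1 means "higher than the previous trading day". The `close_trend` column shown above looks like a label of this kind, but how it was computed is not shown in this notebook.

```python
import pandas as pd

# Hypothetical adjusted closing prices for five consecutive trading days
prices = pd.DataFrame(
    {"Adj Close": [100.0, 101.5, 101.0, 103.2, 103.2]},
    index=pd.to_datetime(
        ["2022-12-01", "2022-12-02", "2022-12-05", "2022-12-06", "2022-12-07"]
    ),
)

# Day-over-day change; the first day has no predecessor, so its diff is NaN
change = prices["Adj Close"].diff()

# Drop the first day, then label 1 if the price rose and 0 otherwise
labeled = prices.loc[change.notna()].copy()
labeled["up"] = (change.dropna() > 0).astype(int)

print(labeled)
```

An unchanged price is labeled 0 here; whether a flat day should count as "down" is a design choice that depends on how the target is defined.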


## 4.3 Word2Vec

In [None]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

In [None]:
# Predict on the held-out test set and report the evaluation metrics
predicted_y_test = classifier.predict(X_test)
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted_y_test))
# Note: with average='micro' on single-label data, precision and recall both equal accuracy
print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted_y_test, average='micro'))
print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted_y_test, average='micro'))
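
The two cells above rely on `X_train`, `X_test`, `y_train`, and `y_test`, which are not defined in this part of the notebook. As a rough sketch of one way they could be built, the example below assumes that the `vec` column holds each tweet's vector as a string (which is how it would come back from a CSV file) and that `Sentiment` is the label. The toy data, the `parse_vec` helper, and the split parameters are all hypothetical and are not taken from the original project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for tweets_data: 'vec' holds stringified vectors, 'Sentiment' the label
toy = pd.DataFrame({
    "vec": ["[ 0.9  0.1]", "[ 0.8  0.2]", "[ 1.0  0.0]", "[ 0.7  0.3]",
            "[-0.9  0.1]", "[-0.8  0.2]", "[-1.0  0.0]", "[-0.7  0.3]"],
    "Sentiment": [1, 1, 1, 1, 0, 0, 0, 0],
})

def parse_vec(text):
    """Convert a string such as '[0.1 -0.2]' back into a float array."""
    return np.array(text.strip("[]").split(), dtype=float)

# Stack the per-tweet vectors into a 2-D feature matrix
X = np.vstack(toy["vec"].apply(parse_vec).tolist())
y = toy["Sentiment"].to_numpy()

# Hold out 25% of the rows, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print("Test accuracy:", classifier.score(X_test, y_test))
```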

## 4.4 Stock

This is a supervised classification problem: each trading day carries a known label. We used the following models:

- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Random Forest
- K-means Clustering (an unsupervised clustering method rather than a classifier)

Evaluating a model by training and testing on the same dataset gives an overly optimistic score and hides overfitting. Hence model evaluation is based on splitting the dataset into a training set and a validation set. However, the result then depends on the random choice of the (train, validation) pair. To overcome this, we use cross-validation: under the k-fold CV approach, the training set is split into k smaller sets (folds), a model is trained on k-1 of the folds and validated on the remaining fold, and this is repeated so that each fold serves once as the validation set. The k validation scores are then averaged.
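
The k-fold procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual evaluation: the data is synthetic (generated with `make_classification`), and the choice of five folds, logistic regression, and accuracy as the score are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic two-class data standing in for the real features and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Split the data into k = 5 folds; each fold is used once for validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Train on k-1 folds, score on the remaining fold, repeat for every fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Because stock data is ordered in time, shuffled folds can let a model train on days that come after the ones it is validated on; scikit-learn's `TimeSeriesSplit` may be a better fit in that setting.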

Classification (confusion) matrix: this matrix summarizes the correct and incorrect classifications that a classifier produced for a given dataset. Rows and columns correspond to the true and predicted classes, respectively. For a two-class problem, the two diagonal cells (upper left, lower right) give the number of correct classifications, where the predicted class coincides with the actual class of the observation. The off-diagonal cells give the counts of misclassifications. The matrix thus provides estimates of the correct classification and misclassification rates.
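
As a small worked example of the matrix described above, the sketch below builds a confusion matrix for ten hypothetical predictions with `sklearn.metrics.confusion_matrix` and derives the accuracy and misclassification rate from it. The labels are invented for illustration and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = price down, 1 = price up)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[3 1]
#  [2 4]]

# Unpack the four cells of the 2x2 matrix
tn, fp, fn, tp = cm.ravel()

# Diagonal cells are correct, off-diagonal cells are misclassified
accuracy = (tn + tp) / cm.sum()
misclassification_rate = (fp + fn) / cm.sum()
print("Accuracy:", accuracy)                              # 0.7
print("Misclassification rate:", misclassification_rate)  # 0.3
```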
