# Stock News Sentiment Analysis with AvgWord2vec from Scratch

This notebook predicts whether stock prices will rise or fall based on sentiment analysis of news headlines. The data is available on Kaggle: https://www.kaggle.com/datasets/avisheksood/stock-news-sentiment-analysismassive-dataset?select=Sentiment_Stock_data.csv.
This is a large dataset with 108,301 unique values; to keep training fast, only the first 5,000 rows are used below.

We are going to solve the classification problem using Natural Language Processing with the following steps:
- Text preprocessing: tokenization, stopword removal, lemmatization
- Converting text to vectors using an Average Word2vec model built from scratch
- Training a RandomForest model
- Model performance evaluation

The implementation uses Python with NLTK, gensim, and scikit-learn.

### Importing the libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tensorflow as tf
tf.__version__

'2.9.1'

In [3]:
import nltk
import re
from nltk.corpus import stopwords

In [4]:
nltk.download('stopwords')
nltk.download('all')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mngembu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

In [5]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [6]:
from tensorflow.keras.layers import Embedding  # maps integer word indices to dense vectors (the Keras imports in this cell are not used below)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

### Importing and preprocessing the data

In [7]:
df = pd.read_csv('Sentiment_Stock_data.csv')
df = df.head(5000)

In [8]:
df.head()

Unnamed: 0.1  Unnamed: 0  Sentiment  Sentence
0             0           0          According to Gran , the company has no plans t...
1             1           1          For the last quarter of 2010 , Componenta 's n...
2             2           1          In the third quarter of 2010 , net sales incre...
3             3           1          Operating profit rose to EUR 13.1 mn from EUR ...
4             4           1          Operating profit totalled EUR 21.1 mn , up fro...

In [9]:
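# keep only the label and the text; copy() avoids SettingWithCopyWarning on later in-place edits
df = df[['Sentiment', 'Sentence']].copy()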

In [10]:
df.shape

(5000, 2)

In [11]:
df.isnull().sum()

Sentiment    0
Sentence     0
dtype: int64

In [12]:
# No null values were found above; dropna is kept as a safeguard

df.dropna(inplace=True)

In [13]:
# drop=True discards the old index instead of adding it as an 'index' column
df.reset_index(drop=True, inplace=True)

In [14]:
# specifying the independent and dependent features
X = df['Sentence']
y = df['Sentiment']

In [15]:
# copying X for preprocessing

sentences = X.copy()

In [16]:
# Clean each sentence: keep letters only, lowercase, drop stopwords, lemmatize

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once; keep 'not' and 'no' because they carry sentiment
all_stopwords = set(stopwords.words('english')) - {'not', 'no'}

corpus = []
for sentence in sentences:
    text = re.sub('[^a-zA-Z]', ' ', sentence)
    text = text.lower().split()
    # WordNetLemmatizer treats every word as a noun by default
    text = [lemmatizer.lemmatize(word) for word in text if word not in all_stopwords]
    corpus.append(' '.join(text))

In [17]:
corpus[0]

'according gran company no plan move production russia although company growing'

In [18]:
len(corpus)

5000

In [19]:
X[0]

'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing '
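
The difference between the raw headline and `corpus[0]` comes from stripping non-letters, lowercasing, and dropping stopwords while keeping the negations. Here is a minimal sketch of those steps, assuming a tiny hand-written stopword set (`TOY_STOPWORDS` is hypothetical, not NLTK's list) and skipping lemmatization so that it runs without any NLTK data:

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# the notebook uses NLTK's English list minus 'not' and 'no'.
TOY_STOPWORDS = {"to", "the", "has", "all", "that", "is", "where"}

def clean_headline(text, stopwords=TOY_STOPWORDS):
    """Keep letters only, lowercase, split, and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

headline = "According to Gran , the company has no plans to move all production to Russia"
cleaned = clean_headline(headline)
print(cleaned)
```

Because lemmatization is skipped in this sketch, "plans" stays plural, whereas the notebook's output shows "plan".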

In [20]:
# Tokenization: split each cleaned sentence into a list of words
# (sent_tokenize is not needed because every corpus entry is a single sentence)
# simple_preprocess lowercases and drops tokens shorter than 2 or longer than 15 characters
words = [simple_preprocess(sent) for sent in corpus]

In [21]:
words

[['according',
  'gran',
  'company',
  'no',
  'plan',
  'move',
  'production',
  'russia',
  'although',
  'company',
  'growing'],
 ['last',
  'quarter',
  'componenta',
  'net',
  'sale',
  'doubled',
  'eur',
  'eur',
  'period',
  'year',
  'earlier',
  'moved',
  'zero',
  'pre',
  'tax',
  'profit',
  'pre',
  'tax',
  'loss',
  'eur'],
 ['third',
  'quarter',
  'net',
  'sale',
  'increased',
  'eur',
  'mn',
  'operating',
  'profit',
  'eur',
  'mn'],
 ['operating',
  'profit',
  'rose',
  'eur',
  'mn',
  'eur',
  'mn',
  'corresponding',
  'period',
  'representing',
  'net',
  'sale'],
 ['operating',
  'profit',
  'totalled',
  'eur',
  'mn',
  'eur',
  'mn',
  'representing',
  'net',
  'sale'],
 ['finnish',
  'talentum',
  'report',
  'operating',
  'profit',
  'increased',
  'eur',
  'mn',
  'eur',
  'mn',
  'net',
  'sale',
  'totaled',
  'eur',
  'mn',
  'eur',
  'mn'],
 ['clothing',
  'retail',
  'chain',
  'sepp',
  'sale',
  'increased',
  'eur',
  'mn',
  'opera

In [22]:
len(words)

5000

### Training an AvgWord2vec vectorizer from scratch

In [23]:
import gensim

In [24]:
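# Train Word2vec from scratch on the tokenized sentences
# (vector_size defaults to 100; words seen fewer than 2 times are dropped)
model = gensim.models.Word2Vec(words, window=5, min_count=2)
model.wv.index_to_key  # view the vocabulary, most frequent words first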

['eur',
 'mn',
 'company',
 'sale',
 'profit',
 'share',
 'year',
 'net',
 'million',
 'finnish',
 'said',
 'operating',
 'mln',
 'period',
 'quarter',
 'group',
 'market',
 'finland',
 'euro',
 'service',
 'business',
 'first',
 'new',
 'oyj',
 'loss',
 'also',
 'compared',
 'operation',
 'co',
 'corresponding',
 'contract',
 'per',
 'product',
 'price',
 'percent',
 'helsinki',
 'total',
 'today',
 'financial',
 'order',
 'http',
 'stock',
 'report',
 'solution',
 'capital',
 'increased',
 'month',
 'unit',
 'rose',
 'value',
 'system',
 'not',
 'plant',
 'customer',
 'bank',
 'based',
 'result',
 'well',
 'second',
 'investment',
 'corporation',
 'nokia',
 'mobile',
 'january',
 'earlier',
 'technology',
 'project',
 'construction',
 'board',
 'increase',
 'area',
 'third',
 'decreased',
 'september',
 'building',
 'production',
 'hel',
 'expected',
 'industry',
 'plc',
 'according',
 'deal',
 'right',
 'last',
 'agreement',
 'usd',
 'oy',
 'pct',
 'part',
 'network',
 'ceo',
 'term

In [25]:
# number of sentences the model was trained on (not the vocabulary size,
# which is len(model.wv.index_to_key))
model.corpus_count

5000
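
Note that `corpus_count` counts training sentences, while the vocabulary is whatever survives the `min_count` threshold. The following is a rough sketch of that threshold using a plain `Counter`; the `toy_sentences` are made up, and gensim does more internally, so this only illustrates the filtering idea:

```python
from collections import Counter

# Made-up tokenized sentences standing in for `words`.
toy_sentences = [
    ["operating", "profit", "rose", "eur", "mn"],
    ["net", "sale", "increased", "eur", "mn"],
    ["operating", "profit", "totalled", "eur", "mn"],
]

MIN_COUNT = 2  # same threshold passed to Word2Vec in the notebook

counts = Counter(token for sentence in toy_sentences for token in sentence)
# Keep tokens seen at least MIN_COUNT times, most frequent first.
vocab = [tok for tok, n in counts.most_common() if n >= MIN_COUNT]

corpus_count = len(toy_sentences)  # number of sentences, like model.corpus_count
vocab_size = len(vocab)            # distinct kept words, like len(model.wv.index_to_key)
print(corpus_count, vocab_size, vocab)
```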

In [28]:
model.wv.similar_by_word('operating')

[('period', 0.9989944696426392),
 ('quarter', 0.9988884329795837),
 ('decreased', 0.9987687468528748),
 ('increased', 0.9986798763275146),
 ('loss', 0.9985779523849487),
 ('fell', 0.9985073208808899),
 ('earlier', 0.9983648657798767),
 ('profit', 0.9981814026832581),
 ('eur', 0.9976856112480164),
 ('compared', 0.9976353645324707)]
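
The scores returned by `similar_by_word` are cosine similarities between word vectors. Scores this close to 1 for every neighbour may mean the vectors have not separated much yet, which would be plausible for a model trained on only 5,000 short sentences. Below is a small sketch of the metric itself, using made-up 3-dimensional vectors rather than anything from the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not taken from the trained model.
operating = [0.2, 0.9, -0.4]
profit = [0.4, 1.8, -0.8]     # same direction as `operating`, twice the length
unrelated = [0.9, -0.2, 0.0]  # orthogonal to `operating`

sim_same_direction = cosine_similarity(operating, profit)
sim_orthogonal = cosine_similarity(operating, unrelated)
print(sim_same_direction, sim_orthogonal)
```

Cosine similarity ignores vector length, so the first pair scores essentially 1 even though one vector is twice as long as the other.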

In [29]:
#viewing the shape of the vector for one word in our vocabulary
model.wv['president'].shape

(100,)

In [30]:
#Computing AvgWord2vec for each sentence
#By default, each word has a 100-dimensional vector. Averaging the vectors of the words in a
#sentence gives one fixed-length (100,) vector per sentence, whatever the sentence length.

def avg_word2vec(doc):
    # key_to_index is a dict, so the membership test is a fast lookup
    vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    if not vectors:
        # no known words: fall back to a zero vector instead of NaN
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)
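
Averaging turns a variable-length list of word vectors into one fixed-length sentence vector. Here is a minimal sketch with a made-up 3-dimensional embedding table (`toy_vectors` and `toy_avg_word2vec` are hypothetical names; the real model uses 100 dimensions), including a zero-vector fallback for sentences with no known words:

```python
import numpy as np

# Made-up embeddings standing in for model.wv.
toy_vectors = {
    "profit": np.array([1.0, 0.0, 2.0]),
    "rose": np.array([3.0, 2.0, 0.0]),
}
DIM = 3

def toy_avg_word2vec(tokens, vectors=toy_vectors, dim=DIM):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'sharply' is out of vocabulary and is simply skipped.
known_avg = toy_avg_word2vec(["profit", "rose", "sharply"])
empty_avg = toy_avg_word2vec(["unknown"])
print(known_avg, empty_avg)
```

Without the fallback, a sentence made up only of out-of-vocabulary words would yield NaN rather than a vector.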
        

In [31]:
!pip install tqdm


In [32]:
from tqdm import tqdm

In [33]:
# applying the avg_word2vec function to all sentences

words_vec=[]
for i in tqdm(range(len(words))):
    words_vec.append(avg_word2vec(words[i]))

100%|██████████| 5000/5000 [00:00<00:00, 5298.21it/s]


In [34]:
type(words_vec)

list

In [35]:
words_vec

[array([-0.24757478,  0.42571878,  0.31095597, -0.04447806, -0.09454096,
        -0.6227726 ,  0.08542449,  0.89853877, -0.22477524, -0.21829048,
        -0.07160708, -0.62665665,  0.05249409,  0.17669888,  0.1952278 ,
        -0.3951819 ,  0.10501365, -0.4720787 , -0.0404615 , -0.71343744,
         0.2971852 ,  0.20867932,  0.3509106 , -0.13666669, -0.3406262 ,
         0.07821681, -0.3183104 , -0.22895892, -0.28279334, -0.03514183,
         0.25738803,  0.01363556,  0.05279075, -0.29751316, -0.1318129 ,
         0.37287122,  0.02685004, -0.33794004, -0.32138693, -0.631899  ,
         0.12959264, -0.21197276, -0.11478321, -0.11618725,  0.27312934,
        -0.17905982, -0.4142777 , -0.03254316,  0.18004897,  0.39149898,
         0.07812975, -0.36824816, -0.2424777 , -0.0964099 , -0.08583249,
         0.17742594,  0.26543695, -0.00825998, -0.29126212,  0.09174905,
        -0.00370977,  0.01524464, -0.08630229,  0.05916343, -0.34955046,
         0.5089164 ,  0.02779514,  0.29919243, -0.4

In [36]:
#converting the list to an array

X_new = np.array(words_vec)

In [37]:
#viewing vectors for the entire X dataset. This becomes our input feature
X_new

array([[-0.24757478,  0.42571878,  0.31095597, ..., -0.3182377 ,
         0.1982293 , -0.10232978],
       [-0.37557673,  0.534475  ,  0.14846532, ..., -0.38985902,
        -0.21506481,  0.05807285],
       [-0.57945544,  0.8262501 ,  0.18602258, ..., -0.588685  ,
        -0.41258904,  0.11419427],
       ...,
       [-0.49729508,  0.7367208 ,  0.23090349, ..., -0.5227668 ,
        -0.24519973,  0.05375622],
       [-0.41269735,  0.6199984 ,  0.33289528, ..., -0.46271572,
         0.03666983, -0.05312057],
       [-0.06729066,  0.11395471,  0.08154456, ..., -0.08654394,
         0.05470617, -0.02518323]], dtype=float32)

In [38]:
#viewing vector for the first sentence
X_new[0]

array([-0.24757478,  0.42571878,  0.31095597, -0.04447806, -0.09454096,
       -0.6227726 ,  0.08542449,  0.89853877, -0.22477524, -0.21829048,
       -0.07160708, -0.62665665,  0.05249409,  0.17669888,  0.1952278 ,
       -0.3951819 ,  0.10501365, -0.4720787 , -0.0404615 , -0.71343744,
        0.2971852 ,  0.20867932,  0.3509106 , -0.13666669, -0.3406262 ,
        0.07821681, -0.3183104 , -0.22895892, -0.28279334, -0.03514183,
        0.25738803,  0.01363556,  0.05279075, -0.29751316, -0.1318129 ,
        0.37287122,  0.02685004, -0.33794004, -0.32138693, -0.631899  ,
        0.12959264, -0.21197276, -0.11478321, -0.11618725,  0.27312934,
       -0.17905982, -0.4142777 , -0.03254316,  0.18004897,  0.39149898,
        0.07812975, -0.36824816, -0.2424777 , -0.0964099 , -0.08583249,
        0.17742594,  0.26543695, -0.00825998, -0.29126212,  0.09174905,
       -0.00370977,  0.01524464, -0.08630229,  0.05916343, -0.34955046,
        0.5089164 ,  0.02779514,  0.29919243, -0.4757278 ,  0.38

In [39]:
X_new[0].shape

(100,)

In [40]:
#viewing the first sentence
words[0]

['according',
 'gran',
 'company',
 'no',
 'plan',
 'move',
 'production',
 'russia',
 'although',
 'company',
 'growing']

### Training a RandomForest model on the training data

In [41]:
y_new=np.array(y)

In [42]:
type(X_new)

numpy.ndarray

In [43]:
X_new.shape

(5000, 100)

In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.3, random_state=42)

In [45]:
y_train

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [46]:
#implementing RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=200, criterion='entropy')
rnd_clf.fit(X_train, y_train)


RandomForestClassifier(criterion='entropy', n_estimators=200)
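
The classifier is created without a `random_state`, so repeated runs can build slightly different trees and give slightly different scores. The sketch below uses synthetic data (generated with `make_classification`, not the headline vectors) to show that fixing the seed makes predictions repeatable; the sizes and seed values are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the notebook's X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict on the same data."""
    clf = RandomForestClassifier(n_estimators=20, criterion="entropy", random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.predict(X_demo)

# Two independent fits with the same seed give identical predictions.
same_seed_match = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_match)
```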

### Model predictions and performance evaluation

In [48]:
# Predicting the test data
y_pred = rnd_clf.predict(X_test)
y_pred

array([0, 1, 0, ..., 0, 1, 0], dtype=int64)

In [49]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[989,  49],
       [264, 198]], dtype=int64)

In [50]:
# Getting the accuracy score

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.7913333333333333

In [51]:
# Getting the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.95      0.86      1038
           1       0.80      0.43      0.56       462

    accuracy                           0.79      1500
   macro avg       0.80      0.69      0.71      1500
weighted avg       0.79      0.79      0.77      1500
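
Every figure in the report can be recomputed from the confusion matrix, which makes it easier to see that accuracy is carried mostly by class 0. This sketch uses the four counts printed earlier; the majority-class baseline is an added point of comparison, not something the notebook computes:

```python
# Counts copied from the confusion matrix output (rows = true class, columns = predicted).
tn, fp = 989, 49    # true class 0
fn, tp = 264, 198   # true class 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Accuracy of always predicting the majority class (class 0).
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

The model beats the majority baseline of about 0.69, but a recall of 0.43 on class 1 means that more than half of the class 1 headlines in the test set are predicted as class 0.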

