# Stock News Sentiment Analysis with LSTM RNN - Deep Learning NLP

This notebook focuses on predicting whether stock prices will increase or decrease based on sentiment analysis of news headlines. In this study I will use data available on Kaggle: https://www.kaggle.com/datasets/avisheksood/stock-news-sentiment-analysismassive-dataset?select=Sentiment_Stock_data.csv
This is a large dataset with 108,301 unique values, so I will only use the first 5,000 observations.

We are going to solve the classification problem using Natural Language Processing with the following steps:
- Text preprocessing: tokenization, stopword removal, and lemmatization
- Converting text to vectors: a one-hot representation that turns each sentence into an array of vocabulary indices, followed by a word embedding
- Building and training the LSTM model
- Model performance evaluation

We will use Python with TensorFlow/Keras, NLTK, and scikit-learn.

### Importing the libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tensorflow as tf
tf.__version__

'2.9.1'

In [3]:
import nltk
import re
from nltk.corpus import stopwords

In [4]:
nltk.download('stopwords')
nltk.download('all')  # downloads every NLTK package, including the WordNet data the lemmatizer needs

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mngembu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\mngembu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | ...
[nltk_data]    | Downloading package wmt15_eval to
[nltk_data]    |     C:\Users\mngembu\AppData\Roaming\nltk_data...
[nltk_data]    |   Package wmt15_eval is already up-to-date!
[nltk_data]    | 

True

In [5]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [6]:
from tensorflow.keras.layers import Embedding  # embedding layer maps each word index to a dense vector
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

### Importing and preprocessing the data

In [7]:
df = pd.read_csv('Sentiment_Stock_data.csv')
df = df.head(5000)

In [8]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Sentiment | Sentence |
|---|---|---|---|
| 0 | 0 | 0 | According to Gran , the company has no plans t... |
| 1 | 1 | 1 | For the last quarter of 2010 , Componenta 's n... |
| 2 | 2 | 1 | In the third quarter of 2010 , net sales incre... |
| 3 | 3 | 1 | Operating profit rose to EUR 13.1 mn from EUR ... |
| 4 | 4 | 1 | Operating profit totalled EUR 21.1 mn , up fro... |


In [9]:
df = df[['Sentiment', 'Sentence']]

In [10]:
df.shape

(5000, 2)

In [11]:
df.isnull().sum()

Sentiment    0
Sentence     0
dtype: int64

In [12]:
# Dropping null values (none were found above, so this is only a safeguard)

df = df.dropna()

In [13]:
df = df.reset_index(drop=True)  # drop=True avoids adding the old index as a new column

In [14]:
# specifying the independent and dependent features
X = df['Sentence']
y = df['Sentiment']

In [15]:
# copying X for preprocessing

sentences = X.copy()

In [16]:
# Cleaning each sentence: keep letters only, lowercase, drop stopwords, lemmatize

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once; keep 'not' and 'no' because negation carries sentiment
all_stopwords = set(stopwords.words('english')) - {'not', 'no'}

corpus = []
for sentence in sentences:
    text = re.sub('[^a-zA-Z]', ' ', sentence)  # replace non-letters with spaces
    text = text.lower().split()
    # reduce each remaining word to its meaningful root (lemma)
    text = [lemmatizer.lemmatize(word) for word in text if word not in all_stopwords]
    corpus.append(' '.join(text))

In [17]:
corpus[0]

'according gran company no plan move production russia although company growing'

In [18]:
len(corpus)

5000

In [19]:
corpus

['according gran company no plan move production russia although company growing',
 'last quarter componenta net sale doubled eur eur period year earlier moved zero pre tax profit pre tax loss eur',
 'third quarter net sale increased eur mn operating profit eur mn',
 'operating profit rose eur mn eur mn corresponding period representing net sale',
 'operating profit totalled eur mn eur mn representing net sale',
 'finnish talentum report operating profit increased eur mn eur mn net sale totaled eur mn eur mn',
 'clothing retail chain sepp l sale increased eur mn operating profit rose eur mn eur mn',
 'consolidated net sale increased reach eur operating profit amounted eur compared loss eur prior year period',
 'foundry division report sale increased eur mn eur mn corresponding period sale machine shop division increased eur mn eur mn corresponding period',
 'helsinki afx share closed higher led nokia announced plan team sanyo manufacture g handset nokian tyre fourth quarter earnings re ...

In [20]:
X[0]

'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing '
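
To make the cleaning step easier to follow, here is a minimal sketch that applies the same regex, lowercasing, and stopword filtering to the headline above without any NLTK downloads. `DEMO_STOPWORDS` and `clean` are hypothetical names introduced only for this illustration, and the stopword list is a small hand-picked subset rather than NLTK's full English list. Lemmatization is left out, so `plans` stays plural here, whereas the notebook's `corpus[0]` contains `plan`.

```python
import re

# Hypothetical, hand-picked stopword list for illustration only;
# the notebook uses NLTK's English list minus 'not' and 'no'.
DEMO_STOPWORDS = {"to", "the", "has", "all", "that", "is", "where"}

def clean(sentence, stopwords=DEMO_STOPWORDS):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stopwords)

headline = ("According to Gran , the company has no plans to move all "
            "production to Russia , although that is where the company is growing ")
cleaned = clean(headline)
print(cleaned)
```

Note that `no` survives the filtering, which is the reason for removing it from the stopword list: dropping it would flip the meaning of the headline.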

### Vectorization

#### Onehot representation

In [26]:
# Vocabulary size
voc_size=50000

In [27]:
onehot_repr=[one_hot(words,voc_size) for words in corpus]
onehot_repr

[[22225, 19818, 10323, 782, 10399, 20337, 24390, 6166, 23172, 10323, 4803],
 [39182, 13264, 35745, 15352, 18777, 25870, 12560, 12560, 21136, 8919, 26709, 26887, 44711, 13101, 41953, 42698, 13101, 41953, 44947, 12560],
 [40484, 13264, 15352, 18777, 46706, 12560, 28780, 25817, 42698, 12560, 28780],
 [25817, 42698, 16403, 12560, 28780, 12560, 28780, 45889, 21136, 32378, 15352, 18777],
 [25817, 42698, 29477, 12560, 28780, 12560, 28780, 32378, 15352, 18777],
 [46106, 42926, 11171, 25817, 42698, 46706, 12560, 28780, 12560, 28780, 15352, 18777, 19199, 12560, 28780, 12560, 28780],
 [31105, 47203, 15432, 32449, 27313, 18777, 46706, 12560, 28780, 25817, 42698, 16403, 12560, 28780, 12560, 28780],
 [47135, 15352, 18777, 46706, 40589, 12560, 25817, 42698, 14369, 12560, 15101, 44947, 12560, 33065, 8919, 21136],
 [14696, 15481, 11171, 18777, ...

In [28]:
corpus[1]

'last quarter componenta net sale doubled eur eur period year earlier moved zero pre tax profit pre tax loss eur'

In [29]:
onehot_repr[1]

[39182,
 13264,
 35745,
 15352,
 18777,
 25870,
 12560,
 12560,
 21136,
 8919,
 26709,
 26887,
 44711,
 13101,
 41953,
 42698,
 13101,
 41953,
 44947,
 12560]
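
Despite its name, Keras `one_hot` does not build one-hot vectors. It hashes each word into an integer index between 1 and `voc_size - 1`, which is why the repeated word `eur` shows up as the same index (12560) every time in the output above. The sketch below imitates that idea with a stable MD5 hash. `hashed_indices` is a hypothetical helper written for this illustration, and the MD5 choice is an assumption: as far as I know, Keras uses Python's built-in `hash` by default, so the actual indices will differ from the ones in this notebook and can change between sessions.

```python
import hashlib

def hashed_indices(text, vocab_size):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1].

    Index 0 is never produced, so it stays free for padding.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

sentence = "operating profit rose eur mn eur mn"
idx = hashed_indices(sentence, vocab_size=50000)
print(idx)

# The same word always lands on the same index
print(idx[3] == idx[5], idx[4] == idx[6])
```

Because this is hashing rather than a lookup table, two different words can collide on the same index. A vocabulary size well above the number of distinct words, such as the 50,000 used here, keeps that risk low but does not remove it.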

#### Word Embedding

In [30]:
# Padding (or truncating) every sentence to the same length
sent_length=30
embedded_docs=pad_sequences(onehot_repr,padding='post',maxlen=sent_length)
print(embedded_docs)

[[22225 19818 10323 ...     0     0     0]
 [39182 13264 35745 ...     0     0     0]
 [40484 13264 15352 ...     0     0     0]
 ...
 [36613 12387 24525 ...     0     0     0]
 [45675 33537 12070 ...     0     0     0]
 [  693 25347    74 ...     0     0     0]]


In [31]:
embedded_docs[1]

array([39182, 13264, 35745, 15352, 18777, 25870, 12560, 12560, 21136,
        8919, 26709, 26887, 44711, 13101, 41953, 42698, 13101, 41953,
       44947, 12560,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0])
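
As a rough picture of what `pad_sequences(..., padding='post', maxlen=sent_length)` does, the sketch below re-implements the behaviour in plain Python. `pad_post` is a hypothetical helper for illustration, not part of Keras. One detail worth noticing: `padding='post'` only controls where the zeros go. The `truncating` argument keeps its default of `'pre'`, so a sentence longer than `maxlen` loses its first tokens, not its last ones.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front (truncating='pre')."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

short = [5, 6, 7]
long = [1, 2, 3, 4, 5, 6]
result = pad_post([short, long], maxlen=4)
print(result)  # [[5, 6, 7, 0], [3, 4, 5, 6]]
```

If the start of a headline matters more than the end, passing `truncating='post'` to `pad_sequences` would keep the first 30 tokens instead.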

### Building the LSTM model

In [32]:
# Creating the model
embedding_vector_features=40  # each index in embedded_docs is represented by a 40-dimensional vector
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(LSTM(500))  # number of LSTM units; try different values
model.add(Dense(1,activation='sigmoid'))  # sigmoid since the output is binary
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 30, 40)            2000000   
                                                                 
 lstm (LSTM)                 (None, 500)               1082000   
                                                                 
 dense (Dense)               (None, 1)                 501       
                                                                 
Total params: 3,082,501
Trainable params: 3,082,501
Non-trainable params: 0
_________________________________________________________________
None
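
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector. The variable names are mine, chosen to mirror the notebook's settings.

```python
voc_size = 50000      # vocabulary size used for the hashed indices
embedding_dim = 40    # embedding_vector_features in the notebook
lstm_units = 500

# Embedding: one 40-dimensional vector per vocabulary index
embedding_params = voc_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2000000 1082000 501 3082501
```

About two thirds of the weights sit in the embedding table, so lowering `voc_size` is the quickest way to shrink the model.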


In [33]:
len(embedded_docs),y.shape

(5000, (5000,))

### Model training

In [34]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [35]:
X_final.shape,y_final.shape

((5000, 30), (5000,))

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.3, random_state=42)

In [37]:
y_train

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [38]:
# Model training (note that the test split is also used as validation data here)
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=75)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2d6ec0aa1f0>

### Predictions and model performance

In [40]:
# Predicting the test data
y_pred = model.predict(X_test)
y_pred



array([[0.2876182 ],
       [0.9970237 ],
       [0.00162018],
       ...,
       [0.18684569],
       [0.9346901 ],
       [0.00169083]], dtype=float32)

In [42]:
# Making the predictions binary with a 0.6 cutoff (stricter than the usual 0.5)

y_pred=np.where(y_pred > 0.6, 1,0)  # the cutoff could be tuned with an ROC curve
y_pred

array([[0],
       [1],
       [0],
       ...,
       [0],
       [1],
       [0]])

In [43]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[996,  42],
       [203, 259]], dtype=int64)

In [44]:
# Getting the accuracy score

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8366666666666667

In [45]:
# Getting the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89      1038
           1       0.86      0.56      0.68       462

    accuracy                           0.84      1500
   macro avg       0.85      0.76      0.78      1500
weighted avg       0.84      0.84      0.83      1500
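
To see where the numbers in the report come from, the sketch below recomputes accuracy and the class 1 metrics directly from the confusion matrix above. It relies on scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Values copied from the confusion matrix above
tn, fp = 996, 42
fn, tp = 203, 259

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of the headlines predicted 1, how many were 1
recall_1 = tp / (tp + fn)      # of the headlines that were 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.8367 0.86 0.56 0.68
```

The recall of 0.56 for class 1 means the model misses 203 of the 462 positive headlines in the test set, so the overall accuracy of about 84% is driven largely by the more frequent class 0.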

