# Stock News Sentiment Analysis with ML - NLP BoW

This notebook predicts whether stock prices will rise or fall based on sentiment analysis of news headlines. The data is available on Kaggle: https://www.kaggle.com/datasets/avisheksood/stock-news-sentiment-analysismassive-dataset?select=Sentiment_Stock_data.csv.
This is a huge dataset with 108,301 unique values, so I will only use the first 5,000 observations.

We are going to solve the classification problem using Natural Language Processing with the following steps:
- Text preprocessing: tokenization, stopword removal, and lemmatization
- Converting text to vectors using Bag of Words  
- Training a RandomForest classifier  
- Measuring the model performance on the test data  

We will use Python, with NLTK for text preprocessing and scikit-learn for vectorization and modeling.

### Importing the libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import tensorflow as tf
tf.__version__

'2.9.1'

In [3]:
import nltk
import re
from nltk.corpus import stopwords

In [4]:
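nltk.download('stopwords')
# 'all' fetches every NLTK package; the code below only needs 'stopwords'
# and the WordNet data used by the lemmatizer
nltk.download('all')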

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mngembu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

In [5]:
#from nltk import sent_tokenize
#from gensim.utils import simple_preprocess

### Importing and preprocessing the data

In [7]:
df = pd.read_csv('Sentiment_Stock_data.csv')
df = df.head(5000)

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,Sentiment,Sentence
0,0,0,"According to Gran , the company has no plans t..."
1,1,1,"For the last quarter of 2010 , Componenta 's n..."
2,2,1,"In the third quarter of 2010 , net sales incre..."
3,3,1,Operating profit rose to EUR 13.1 mn from EUR ...
4,4,1,"Operating profit totalled EUR 21.1 mn , up fro..."


In [9]:
df = df[['Sentiment', 'Sentence']]

In [10]:
df.shape

(5000, 2)

In [11]:
df.isnull().sum()

Sentiment    0
Sentence     0
dtype: int64

In [12]:
# Dropping null values (the check above shows there are none, so this is a safeguard)

df = df.dropna()

In [13]:
# drop=True discards the old index instead of adding it as an extra 'index' column
df = df.reset_index(drop=True)

In [14]:
# specifying the independent and dependent features
X = df['Sentence']
y = df['Sentiment']

In [15]:
# copying X for preprocessing

sentences = X.copy()

In [16]:
# Cleaning each sentence: keep letters only, lowercase, drop stopwords,
# then lemmatize each word to its meaningful root

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stopword set once, keeping the negations 'not' and 'no'
# because they can flip the sentiment of a headline
all_stopwords = set(stopwords.words('english')) - {'not', 'no'}

corpus = []
for sentence in sentences:
    text = re.sub('[^a-zA-Z]', ' ', sentence)  # replace non-letters with spaces
    text = text.lower().split()
    text = [lemmatizer.lemmatize(word) for word in text if word not in all_stopwords]
    corpus.append(' '.join(text))

In [17]:
corpus[0]

'according gran company no plan move production russia although company growing'

In [18]:
len(corpus)

5000

In [19]:
corpus

['according gran company no plan move production russia although company growing',
 'last quarter componenta net sale doubled eur eur period year earlier moved zero pre tax profit pre tax loss eur',
 'third quarter net sale increased eur mn operating profit eur mn',
 'operating profit rose eur mn eur mn corresponding period representing net sale',
 'operating profit totalled eur mn eur mn representing net sale',
 'finnish talentum report operating profit increased eur mn eur mn net sale totaled eur mn eur mn',
 'clothing retail chain sepp l sale increased eur mn operating profit rose eur mn eur mn',
 'consolidated net sale increased reach eur operating profit amounted eur compared loss eur prior year period',
 'foundry division report sale increased eur mn eur mn corresponding period sale machine shop division increased eur mn eur mn corresponding period',
 'helsinki afx share closed higher led nokia announced plan team sanyo manufacture g handset nokian tyre fourth quarter earnings re

In [20]:
X[0]

'According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing '
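
Comparing the raw headline with its cleaned version shows that punctuation, digits, and common words disappear. The sketch below reproduces the regex and stopword steps on a single made-up headline without any dependencies. The `STOPWORDS` set is a tiny hypothetical stand-in for NLTK's English list, and lemmatization is left out, so it only approximates what the notebook does.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list
# with 'not' and 'no' taken out of it.
STOPWORDS = {"the", "in", "did", "to", "is", "that"}

def clean(sentence):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

# Made-up headline, not taken from the dataset
example = "Sales did not rise 12.5 % in 2011 , the company said"
print(clean(example))  # sales not rise company said
```

Note that "not" survives the filter, which is the reason the notebook removes it from the stopword list: dropping it would leave "sales rise company said", with the opposite meaning.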

### Using Bag of Words to convert text to vectors

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
     

In [23]:
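# binary=True records only whether a feature occurs, not how often;
# ngram_range=(2,2) builds the vocabulary from bigrams (pairs of adjacent words) only
cv = CountVectorizer(binary=True, ngram_range=(2,2))
X_new = cv.fit_transform(corpus)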


In [24]:
X_new[0]

<1x28655 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [25]:
X_new.shape

(5000, 28655)
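
To make the numbers above less opaque, here is a minimal pure-Python sketch of a binary bag of bigrams, applied to two of the cleaned headlines from the corpus. It is not scikit-learn's implementation: it assumes whitespace tokenization and drops tokens shorter than two characters, which is meant to mirror `CountVectorizer`'s default token pattern. The function name `binary_bigram_bow` is made up for this example.

```python
def binary_bigram_bow(docs):
    """Build a binary bag-of-bigrams matrix as a list of 0/1 rows."""
    # Assumption: whitespace tokens, with single-character tokens dropped
    tokenized = [[t for t in doc.split() if len(t) >= 2] for doc in docs]
    # Pair each token with its right-hand neighbour
    bigrams = [[" ".join(pair) for pair in zip(toks, toks[1:])] for toks in tokenized]
    vocab = sorted({bg for doc in bigrams for bg in doc})
    index = {bg: i for i, bg in enumerate(vocab)}
    rows = []
    for doc in bigrams:
        row = [0] * len(vocab)
        for bg in doc:
            row[index[bg]] = 1  # presence only, like binary=True
        rows.append(row)
    return vocab, rows

docs = [
    "according gran company no plan move production russia although company growing",
    "third quarter net sale increased eur mn operating profit eur mn",
]
vocab, rows = binary_bigram_bow(docs)
print(len(vocab))    # 19 distinct bigrams across both headlines
print(sum(rows[0]))  # 10, matching the 10 stored elements of X_new[0]
print(sum(rows[1]))  # 9, because 'eur mn' occurs twice but is counted once
```

With only bigrams as features, each new headline adds mostly unseen word pairs, which is why 5,000 short headlines already produce 28,655 columns.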

### Training a RandomForest Classifier

In [26]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.3, random_state=42)

In [27]:
X_train.shape

(3500, 28655)

In [28]:
y_train.shape

(3500,)
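
The shapes follow directly from `test_size=0.3`: 30% of the 5,000 rows go to the test set and the rest to training. The sketch below illustrates the idea with a seeded shuffle of row indices. It is an illustration only, not scikit-learn's algorithm, so it selects different rows than `train_test_split` does; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5000, 0.3, seed=42)
print(len(train_idx), len(test_idx))  # 3500 1500
```

Fixing the seed plays the same role as `random_state=42` in the notebook: rerunning the split returns the same partition.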

In [29]:
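# Fitting a random forest of 200 trees that choose splits by entropy (information gain).
# No random_state is set, so the scores below can differ slightly between runs.
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=200, criterion='entropy')
rnd_clf.fit(X_train, y_train)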


RandomForestClassifier(criterion='entropy', n_estimators=200)
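
With `criterion='entropy'`, each tree picks the split that reduces the entropy of the class labels the most. The following sketch shows that calculation on hypothetical class counts; the helper names and the numbers are illustrative assumptions and are not taken from the fitted model.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy reduction obtained by splitting `parent` counts into `children`."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 60 negative and 40 positive headlines, split on whether
# some bigram is present in the headline.
parent = [60, 40]
children = [[50, 10], [10, 30]]
print(round(entropy(parent), 3))
print(round(information_gain(parent, children), 3))
```

A split that separates the classes well leaves purer child nodes, so the weighted child entropy drops and the gain is larger.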

### Model Performance

In [30]:
# Making predictions on the test set

y_pred=rnd_clf.predict(X_test)

In [31]:
y_pred

array([0, 1, 0, ..., 0, 1, 0], dtype=int64)

In [32]:
# Confusion matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[1025,   13],
       [ 249,  213]], dtype=int64)

In [33]:
# Getting the accuracy score

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.8253333333333334

In [34]:
# Getting the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.99      0.89      1038
           1       0.94      0.46      0.62       462

    accuracy                           0.83      1500
   macro avg       0.87      0.72      0.75      1500
weighted avg       0.85      0.83      0.80      1500
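
Every figure in the report can be derived from the confusion matrix shown earlier. As a check, this short sketch recomputes accuracy, precision, recall, and F1 from those four counts using plain Python; `per_class_metrics` is a helper written for this example, not a scikit-learn function.

```python
# Confusion matrix from the test set: rows are true classes, columns are predictions
cm = [[1025, 13],
      [249, 213]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp  # predicted cls, truly another class
    fn = sum(cm[cls]) - tp                             # truly cls, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.2f}")
for cls in (0, 1):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The recall of 0.46 for class 1 means that 249 of the 462 positive headlines in the test set were predicted as class 0. The overall accuracy of 0.83 is carried largely by class 0, which makes up 1,038 of the 1,500 test rows.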

