## Stock Sentiment Analysis 

In this notebook, I conduct a sentiment analysis of the top 25 headlines for each day from 2000-01-03 to 2016-06-30 and use the resulting scores to predict the movement of the stock market on each of those days.
I have divided the notebook into the following parts:

- Loading the data and libraries
- Sentiment analysis (NLTK and BERT)
- Manipulating the dataset
- Running a classification algorithm on the data
- Predicting the results

### 1) Data and Library Loading

In [1]:
#Importing pandas and numpy to manipulate datasets
import pandas as pd
import numpy as np

#Importing nltk library
import nltk
#nltk.download('vader_lexicon')

#Importing the tokenizer and model classes from the transformers library
#!pip install transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
#Importing torch
import torch

In [2]:
#Importing the data
#Loading the CSV file of the Top25 headlines from 2000-01-03 to 2016-06-30

df = pd.read_csv('Data.csv', encoding="ISO-8859-1")
df['Date'] = pd.to_datetime(df['Date']).dt.normalize()
df.head()


| | Date | Label | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7 | Top8 | ... | Top16 | Top17 | Top18 | Top19 | Top20 | Top21 | Top22 | Top23 | Top24 | Top25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-01-03 | 0 | A 'hindrance to operations': extracts from the... | Scorecard | Hughes' instant hit buoys Blues | Jack gets his skates on at ice-cold Alex | Chaos as Maracana builds up for United | Depleted Leicester prevail as Elliott spoils E... | Hungry Spurs sense rich pickings | Gunners so wide of an easy target | ... | Flintoff injury piles on woe for England | Hunters threaten Jospin with new battle of the... | Kohl's successor drawn into scandal | The difference between men and women | Sara Denver, nurse turned solicitor | Diana's landmine crusade put Tories in a panic | Yeltsin's resignation caught opposition flat-f... | Russian roulette | Sold out | Recovering a title |
| 1 | 2000-01-04 | 0 | Scorecard | The best lake scene | Leader: German sleaze inquiry | Cheerio, boyo | The main recommendations | Has Cubie killed fees? | Has Cubie killed fees? | Has Cubie killed fees? | ... | On the critical list | The timing of their lives | Dear doctor | Irish court halts IRA man's extradition to Nor... | Burundi peace initiative fades after rebels re... | PE points the way forward to the ECB | Campaigners keep up pressure on Nazi war crime... | Jane Ratcliffe | Yet more things you wouldn't know without the ... | Millennium bug fails to bite |
| 2 | 2000-01-05 | 0 | Coventry caught on counter by Flo | United's rivals on the road to Rio | Thatcher issues defence before trial by video | Police help Smith lay down the law at Everton | Tale of Trautmann bears two more retellings | England on the rack | Pakistan retaliate with call for video of Walsh | Cullinan continues his Cape monopoly | ... | South Melbourne (Australia) | Necaxa (Mexico) | Real Madrid (Spain) | Raja Casablanca (Morocco) | Corinthians (Brazil) | Tony's pet project | Al Nassr (Saudi Arabia) | Ideal Holmes show | Pinochet leaves hospital after tests | Useful links |
| 3 | 2000-01-06 | 1 | Pilgrim knows how to progress | Thatcher facing ban | McIlroy calls for Irish fighting spirit | Leicester bin stadium blueprint | United braced for Mexican wave | Auntie back in fashion, even if the dress look... | Shoaib appeal goes to the top | Hussain hurt by 'shambles' but lays blame on e... | ... | Putin admits Yeltsin quit to give him a head s... | BBC worst hit as digital TV begins to bite | How much can you pay for... | Christmas glitches | Upending a table, Chopping a line and Scoring ... | Scientific evidence 'unreliable', defence claims | Fusco wins judicial review in extradition case | Rebels thwart Russian advance | Blair orders shake-up of failing NHS | Lessons of law's hard heart |
| 4 | 2000-01-07 | 1 | Hitches and Horlocks | Beckham off but United survive | Breast cancer screening | Alan Parker | Guardian readers: are you all whingers? | Hollywood Beyond | Ashes and diamonds | Whingers - a formidable minority | ... | Most everywhere: UDIs | Most wanted: Chloe lunettes | Return of the cane 'completely off the agenda' | From Sleepy Hollow to Greeneland | Blunkett outlines vision for over 11s | Embattled Dobson attacks 'play now, pay later'... | Doom and the Dome | What is the north-south divide? | Aitken released from jail | Gone aloft |


The dataset has been loaded. The dataframe (df) contains the top 25 headlines for each day and the Label associated with them. The Label is either 1 or 0. A Label of 1 means that the stock price ended the day higher than it opened, whereas a Label of 0 means the opposite.
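To make the Label definition concrete, here is a minimal sketch of how such a label could be derived from daily prices. The `prices` frame, its `Open` and `Close` columns, and the values in it are hypothetical and are not part of Data.csv; treating a flat day as 0 is also my assumption.

```python
import pandas as pd

# Hypothetical daily prices; column names and values are made up for illustration.
prices = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
        "Open": [100.0, 101.5, 99.0],
        "Close": [99.2, 101.5, 100.4],
    }
)

# 1 if the close is strictly above the open, otherwise 0.
# Assumption: a day that closes exactly at its open counts as 0.
prices["Label"] = (prices["Close"] > prices["Open"]).astype(int)
print(prices)
```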

In the following lines of code, I combine the 25 headlines of each day into a single column, Combined_Headlines.

In [4]:
#Combining the top 25 headlines of the day into a single column
df['Combined_Headlines'] = df[df.columns[2:]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)

#Keeping only the columns needed for the analysis. The .copy() makes stock_data an
#independent dataframe, so adding columns to it later does not trigger a SettingWithCopyWarning
df_cleaned = df[['Date', 'Label','Combined_Headlines']]
stock_data = df_cleaned.copy()
stock_data.head()

| | Date | Label | Combined_Headlines |
|---|---|---|---|
| 0 | 2000-01-03 | 0 | A 'hindrance to operations': extracts from the... |
| 1 | 2000-01-04 | 0 | Scorecard,The best lake scene,Leader: German s... |
| 2 | 2000-01-05 | 0 | Coventry caught on counter by Flo,United's riv... |
| 3 | 2000-01-06 | 1 | Pilgrim knows how to progress,Thatcher facing ... |
| 4 | 2000-01-07 | 1 | Hitches and Horlocks,Beckham off but United su... |
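The row-wise join used above can be illustrated on a toy dataframe. This is a small sketch, not the notebook's data: the `toy` frame has only three headline columns and one deliberately missing value, and it selects the headline columns by name (`headline_cols`, a name I introduce here) rather than by position.

```python
import numpy as np
import pandas as pd

# Toy frame with the same layout as the notebook's data, but only three headline columns.
toy = pd.DataFrame(
    {
        "Date": ["2000-01-03", "2000-01-04"],
        "Label": [0, 1],
        "Top1": ["Scorecard", "Sold out"],
        "Top2": ["Russian roulette", np.nan],  # a missing headline
        "Top3": ["Recovering a title", "Gone aloft"],
    }
)

# Select the headline columns by name instead of by position.
headline_cols = [c for c in toy.columns if c.startswith("Top")]

# For each row: drop missing headlines, cast to str, and join with commas.
toy["Combined_Headlines"] = toy[headline_cols].apply(
    lambda row: ",".join(row.dropna().astype(str)), axis=1
)
print(toy["Combined_Headlines"].tolist())
```

Selecting by name is one possible design choice: if the cell is run a second time, a positional slice such as `df.columns[2:]` would also pick up the already created Combined_Headlines column, whereas the name filter would not.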


### 2) Sentiment Analysis 

#### 2.1) NLTK


In [5]:
#NLTK Sentiment Analyser
# importing the required libraries to analyze the sentiments
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import unicodedata

# instantiating the Sentiment Analyzer
sid = SentimentIntensityAnalyzer()
stock_data['NLTK_Scores'] = stock_data['Combined_Headlines'].apply(lambda x: sid.polarity_scores(x)['compound'])
stock_data.head()


| | Date | Label | Combined_Headlines | NLTK_Scores |
|---|---|---|---|---|
| 0 | 2000-01-03 | 0 | A 'hindrance to operations': extracts from the... | -0.1531 |
| 1 | 2000-01-04 | 0 | Scorecard,The best lake scene,Leader: German s... | -0.9942 |
| 2 | 2000-01-05 | 0 | Coventry caught on counter by Flo,United's riv... | 0.7783 |
| 3 | 2000-01-06 | 1 | Pilgrim knows how to progress,Thatcher facing ... | -0.9313 |
| 4 | 2000-01-07 | 1 | Hitches and Horlocks,Beckham off but United su... | -0.977 |
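The NLTK_Scores column holds VADER's compound score, which lies between -1 (most negative) and +1 (most positive). As a rough sketch of how these values might be read, the snippet below buckets the five scores shown above into coarse classes. The ±0.05 cut-off is the convention commonly quoted for VADER, and both it and the helper name `vader_label` are assumptions of mine; the notebook itself keeps the raw score.

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment class.

    The threshold of 0.05 is an assumed convention, not something the notebook sets.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The five compound scores from the output above.
scores = [-0.1531, -0.9942, 0.7783, -0.9313, -0.977]
labels = [vader_label(s) for s in scores]
print(list(zip(scores, labels)))
```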


#### 2.2) BERT Sentiment Analyser

In the section below, I use a pre-trained BERT language model to compute a sentiment score for each day's headlines.

In [None]:
#Loading the tokenizer that belongs to the pre-trained BERT sentiment model
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

#Loading the pre-trained BERT model with its sequence classification head
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [None]:
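#Defining the function that tokenizes a string and returns its predicted sentiment class
def sentiment_score(review):
    #truncation=True caps the input at 512 tokens, the longest sequence BERT accepts
    tokens = tokenizer.encode(review, return_tensors='pt', truncation=True, max_length=512)
    #Inference only, so gradient tracking is switched off
    with torch.no_grad():
        result = model(tokens)
    #argmax gives the index of the highest logit; adding 1 makes the score start at 1
    return int(torch.argmax(result.logits))+1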
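As far as I understand the model card, this model rates text on a scale of 1 to 5 stars, which is why `sentiment_score` adds 1 to the argmax index. The sketch below reproduces that step in plain Python on made-up logits, and also computes a probability-weighted mean rating as one possible finer-grained alternative. The function name `star_rating`, the logits, and the weighted mean are illustrative assumptions, not part of the notebook.

```python
import math


def star_rating(logits):
    """Return (most likely rating, probability-weighted mean rating).

    `logits` holds one raw score per class; class index i stands for i + 1 stars.
    """
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Same idea as int(torch.argmax(result.logits)) + 1 in sentiment_score.
    best = max(range(len(logits)), key=lambda i: logits[i]) + 1

    # Expected rating under the softmax distribution.
    expected = sum((i + 1) * p for i, p in enumerate(probs))
    return best, expected


# Made-up logits for the five classes, for illustration only.
logits = [0.2, 1.5, 3.1, 0.7, -1.0]
best, expected = star_rating(logits)
print(best, round(expected, 3))
```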

In [None]:
#Scoring each day's combined headlines; x[:512] keeps only the first 512 characters of the string
stock_data['BERT_sentiment'] = stock_data['Combined_Headlines'].apply(lambda x: sentiment_score(x[:512]))
stock_data.head()
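One thing worth keeping in mind about the cell above: `x[:512]` cuts the string after 512 characters, while the model's limit of 512 applies to tokens. The sketch below uses 25 made-up headlines of roughly realistic length to show how many of them survive a 512-character slice. The headline text is hypothetical, so the exact count will differ on the real data.

```python
# 25 made-up headlines standing in for one day's Top1..Top25 values.
headlines = [
    f"Hypothetical headline number {i} about markets and politics"
    for i in range(1, 26)
]

combined = ",".join(headlines)
truncated = combined[:512]  # same slice as in the notebook cell

# Every comma inside the slice marks the end of one complete headline
# (this works here because the made-up headlines contain no commas themselves).
kept = truncated.count(",")

print(f"combined length: {len(combined)} characters")
print(f"complete headlines inside the first 512 characters: {kept} of {len(headlines)}")
```

If most of the 25 headlines should influence the BERT score, one option would be to score each headline separately and aggregate the results; that is a design choice the notebook does not make here.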