# Problem Statement: Stock Sentiment Analysis using News Headlines

- The dataset combines world news headlines with stock price shifts and is available on Kaggle.
- There are 25 columns of top news headlines for each day in the dataframe.
- The data ranges from 2008 to 2016; the additional data from 2000 to 2008 was scraped from Yahoo Finance.
- Labels are based on the Dow Jones Industrial Average stock index.
- Class 1 ---> the stock price increased.
- Class 0 ---> the stock price stayed the same or decreased.
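
The labelling rule in the list above can be illustrated with a short sketch. The closing prices are made up, and comparing each close with the previous day's close is an assumption; the source only states that the labels are based on the Dow Jones Industrial Average.

```python
# Hypothetical DJIA closing prices for five consecutive trading days.
closes = [11000.0, 11050.5, 11050.5, 10990.2, 11010.0]

def label_days(prices):
    """Return 1 when the close rose versus the previous day, else 0.

    Assumption: the label compares consecutive closes; 'stayed the same'
    and 'decreased' both map to class 0, as described in the list above.
    """
    return [1 if today > prev else 0 for prev, today in zip(prices, prices[1:])]

labels = label_days(closes)
print(labels)  # [1, 0, 0, 1]
```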

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords

In [2]:
data=pd.read_csv('data.csv', encoding = "ISO-8859-1")
data.head(2)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite


In [3]:
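# Splitting data into training and test sets
# The cutoffs use the same YYYY-MM-DD layout as the Date column, since strings compare lexicographically
train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']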
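
Because `Date` is stored as text, the comparisons in this cell are lexicographic, so the cutoff string has to follow the same `YYYY-MM-DD` layout as the column. A minimal sketch with made-up dates shows how a cutoff written without dashes behaves, next to a cutoff in the matching layout and a variant that parses the dates first:

```python
from datetime import date

# Made-up dates in the same YYYY-MM-DD layout as the Date column.
dates = ["2014-12-31", "2015-01-02", "2016-07-01"]

# A cutoff without dashes compares '-' against '0' at position 4,
# and '-' sorts before any digit, so 2015 rows pass the "< cutoff" test.
before_compact = [d < "20150101" for d in dates]

# A cutoff in the same layout gives the intended chronological split.
before_iso = [d < "2015-01-01" for d in dates]

# Parsing to date objects avoids relying on the string layout at all.
before_parsed = [date.fromisoformat(d) < date(2015, 1, 1) for d in dates]

print(before_compact)  # [True, True, False]
print(before_iso)      # [True, False, False]
print(before_parsed)   # [True, False, False]
```

In pandas, converting the column with `pd.to_datetime` before filtering would play the same role as the parsed variant.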

In [4]:
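# Removing punctuation (every non-letter character, digits included, becomes a space)
# .copy() makes train_data independent of train, so the in-place edit is safe
train_data = train.iloc[:,2:27].copy()
train_data.replace("[^a-zA-Z]", " ", regex=True, inplace=True)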
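
To see what the pattern does to a single headline, here is a small sketch using the standard `re` module on two headlines from the first row shown earlier. The third input string is invented to show that digits are replaced as well.

```python
import re

# Two headlines from the first row of the dataframe shown earlier.
samples = ["Kohl's successor drawn into scandal",
           "Jack gets his skates on at ice-cold Alex"]

# Same pattern as the cell above: anything that is not a letter becomes a space.
cleaned = [re.sub("[^a-zA-Z]", " ", s) for s in samples]
print(cleaned)

# Invented example: digits are replaced too, not only punctuation.
no_digits = re.sub("[^a-zA-Z]", " ", "Top 10 stocks")
print(repr(no_digits))
```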

In [5]:
# Renaming the columns to '0'..'24' for ease of access
new_index = [str(i) for i in range(25)]
train_data.columns = new_index
train_data.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,A hindrance to operations extracts from the...,Scorecard,Hughes instant hit buoys Blues,Jack gets his skates on at ice cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,Derby raise a glass to Strupar s debut double,Southgate strikes Leeds pay the penalty,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl s successor drawn into scandal,The difference between men and women,Sara Denver nurse turned solicitor,Diana s landmine crusade put Tories in a panic,Yeltsin s resignation caught opposition flat f...,Russian roulette,Sold out,Recovering a title


In [6]:
# Converting headlines to lowercase
for index in new_index:
    train_data[index] = train_data[index].str.lower()
    
train_data.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
0,a hindrance to operations extracts from the...,scorecard,hughes instant hit buoys blues,jack gets his skates on at ice cold alex,chaos as maracana builds up for united,depleted leicester prevail as elliott spoils e...,hungry spurs sense rich pickings,gunners so wide of an easy target,derby raise a glass to strupar s debut double,southgate strikes leeds pay the penalty,...,flintoff injury piles on woe for england,hunters threaten jospin with new battle of the...,kohl s successor drawn into scandal,the difference between men and women,sara denver nurse turned solicitor,diana s landmine crusade put tories in a panic,yeltsin s resignation caught opposition flat f...,russian roulette,sold out,recovering a title


In [7]:
headlines = []
for row in range(0,len(train_data)):
    headlines.append(' '.join(str(x) for x in train_data.iloc[row,0:25]))
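
The same row-wise join can also be written without an explicit loop. The sketch below uses a tiny stand-in dataframe (the name `toy` and its contents are illustrative only) and checks that both forms give the same strings:

```python
import pandas as pd

# Tiny stand-in for train_data: two days, three headline columns.
toy = pd.DataFrame({"0": ["scorecard", "sold out"],
                    "1": ["russian roulette", "jane ratcliffe"],
                    "2": ["recovering a title", "dear doctor"]})

# Loop form, as in the cell above.
looped = [" ".join(str(x) for x in toy.iloc[row, 0:3]) for row in range(len(toy))]

# Row-wise join: astype(str) mirrors str(x), agg applies the join to each row.
joined = toy.astype(str).agg(" ".join, axis=1).tolist()

print(joined)
```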

In [8]:
headlines[500]

'judge frederick motz has thrown england take the one day plunge summit called as violence spreads players go back to school for scandal fa charges fulham over mass brawl a fateful month for the fancied few sun set to leave china for newcastle in     m deal infamous five are let off lightly doubts grow over istabraq mancini resigns from fiorentina and the name of the next man u manager is       woman stabbed to death at euston station one day squad and itinerary south african open  rookie butterfield turns up heat thursday s racing results napster reboots with trial service zimbabwe to  accept international vote monitors  you saw the hampster dance usa today has been to is the consumer electronics show councils in need of a marketing makeover dea birkett  an end to apartheid schooling mps rebel over health bill musharraf strikes a blow in propaganda war climbi    ink attacker charged'

In [9]:
## Implementation using the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvector = TfidfVectorizer(ngram_range=(5,5))
traindataset2 = tfidfvector.fit_transform(headlines)
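
With `ngram_range=(5,5)` every feature is a run of five consecutive words, while `(2,2)` would use word pairs. The helper below (a hypothetical `word_ngrams`, not part of scikit-learn) sketches the idea on one headline. It splits on whitespace only, which is a simplification: scikit-learn's default tokenizer also drops one-character tokens.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

headline = "the difference between men and women"

five_grams = word_ngrams(headline, 5)
bigrams = word_ngrams(headline, 2)

print(five_grams)
print(bigrams)
```

A six-word headline yields only two 5-grams but five bigrams, which gives a feel for how sparse 5-gram features are.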

In [77]:
## Implementation using the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(headlines)

In [10]:
# implement RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(traindataset2,train['Label'])

RandomForestClassifier(criterion='entropy', n_estimators=200)

In [11]:
## Predict for the Test Dataset
test_transform= []
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
test_dataset = tfidfvector.transform(test_transform)
predictions = randomclassifier.predict(test_dataset)
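
The training headlines were stripped of non-letters and lowercased before vectorizing, whereas the test rows here are joined as they are. `TfidfVectorizer` lowercases by default, but digits survive in the test text. One way to keep both splits consistent is a shared helper; `prepare_headlines` and the toy frame below are hypothetical names used only for this sketch.

```python
import pandas as pd

def prepare_headlines(frame, first_col=2, last_col=27):
    """Clean the headline columns and join each row into one string."""
    text = frame.iloc[:, first_col:last_col].astype(str)
    # Same cleaning as the training cells: non-letters to spaces, then lowercase.
    text = text.replace("[^a-zA-Z]", " ", regex=True)
    text = text.apply(lambda col: col.str.lower())
    return text.agg(" ".join, axis=1).tolist()

# Toy frame with the same layout: Date, Label, then headline columns.
toy = pd.DataFrame({"Date": ["2015-01-02"], "Label": [1],
                    "Top1": ["Oil hits $50"], "Top2": ["Sold out"]})

prepared = prepare_headlines(toy, 2, 4)
print(prepared)
```

Calling the same helper on `train` and `test` would give the vectorizer text that went through identical steps.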

In [12]:
## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [13]:
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(score)
report=classification_report(test['Label'],predictions)
print(report)

[[120  66]
 [  1 191]]
0.8227513227513228
              precision    recall  f1-score   support

           0       0.99      0.65      0.78       186
           1       0.74      0.99      0.85       192

    accuracy                           0.82       378
   macro avg       0.87      0.82      0.82       378
weighted avg       0.87      0.82      0.82       378
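
The figures in the report follow directly from the confusion matrix. As a worked check in plain Python, using only the four counts printed above (rows are true classes, columns are predicted classes):

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
matrix = [[120, 66],
          [1, 191]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

def precision(cls):
    """Correct predictions of cls divided by all predictions of cls."""
    predicted = matrix[0][cls] + matrix[1][cls]
    return matrix[cls][cls] / predicted

def recall(cls):
    """Correct predictions of cls divided by all true members of cls."""
    return matrix[cls][cls] / sum(matrix[cls])

print(round(accuracy, 4))
for cls in (0, 1):
    print(cls, round(precision(cls), 2), round(recall(cls), 2))
```

The low recall for class 0 comes from the 66 days with label 0 that were predicted as class 1.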

