### STOCK PRICE PREDICTION BASED ON NEWS HEADLINES

Input Dataset: Daily News for Stock Market Prediction
https://www.kaggle.com/aaron7sun/stocknews

The input data combines historical news headlines from Reddit's WorldNews channel (the top 25 headlines per date) with Dow Jones Industrial Average (DJIA) data scraped from Yahoo Finance.

Output Data: Stock index movement (the Label column)

    1 -> Increase
    0 -> Decrease

In [1]:
# import statements

import pandas as pd
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import warnings
warnings.filterwarnings('ignore')

### I. Data Load

In [2]:
path = r'Combined_News_DJIA/Combined_News_DJIA.csv'
news_data = pd.read_csv(path, encoding= 'UTF-8')
news_data.head(1)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""


### II. Data Reorder

In [3]:
# move Label to the last column: Date, Top1..Top25, Label
columns_list = ["Date"] + [f"Top{i}" for i in range(1, 26)] + ["Label"]
news_data = news_data[columns_list]

### III. Data Cleaning

#### III. a.  Drop null records

In [4]:
#checking for null values
news_data.isnull().sum()

Date     0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
Label    0
dtype: int64

In [5]:
# records with null values
null_cols = news_data.columns[news_data.isnull().any()]
news_data.loc[news_data.isnull().any(axis=1), null_cols]

Unnamed: 0,Top23,Top24,Top25
277,,,
348,"b""Ayatollah Montazeri's Legacy: In death he m...",,
681,Prince Charles wins some kind of a record,,


In [6]:
# dropping the null records
news_data = news_data.dropna()
news_data.reset_index(drop=True, inplace=True)

#checking for null values
print(news_data.isnull().sum())

Date     0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    0
Top24    0
Top25    0
Label    0
dtype: int64


#### III. b.  Train Test Split

In [7]:
# Date is stored as a 'YYYY-MM-DD' string, so the cutoff uses the same format
train = news_data[news_data['Date'] < '2015-01-01']
test = news_data[news_data['Date'] >= '2015-01-01']

In [8]:
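# select the headline columns (everything except Date and Label);
# copy so the in-place cleaning below does not act on a view of train/test
data_train = train.iloc[:,1:-1].copy()
data_test = test.iloc[:,1:-1].copy()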
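
The split compares date strings rather than parsed dates. A minimal sketch, using made-up sample dates and an assumed cutoff of `2015-01-01`, of why zero-padded `YYYY-MM-DD` strings can be compared directly: their lexicographic order matches chronological order, provided the cutoff is written in the same format.

```python
from datetime import date

# Hypothetical sample dates in the dataset's YYYY-MM-DD format.
dates = ["2014-12-30", "2014-12-31", "2015-01-02", "2016-07-01"]
cutoff = "2015-01-01"  # assumed cutoff: first day of the test period

# Plain string comparison partitions the dates...
train_dates = [d for d in dates if d < cutoff]
test_dates = [d for d in dates if d >= cutoff]

# ...and agrees with comparing properly parsed dates.
parsed_train = [
    d for d in dates
    if date.fromisoformat(d) < date.fromisoformat(cutoff)
]

print(train_dates)
print(test_dates)
print(parsed_train == train_dates)
```

Converting the column with `pd.to_datetime` would be a more robust alternative if the date format were ever to vary.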

#### III. c.  Remove punctuation marks from news columns

In [9]:
# strip the leading b left over from the bytes literal (only when a quote follows,
# so headlines that genuinely start with the letter b are left intact)
data_train.replace(r"^b(?=['\"])", '', regex=True, inplace=True)
data_test.replace(r"^b(?=['\"])", '', regex=True, inplace=True)

# remove single quotes / apostrophes from news text
data_train.replace("'", '', regex=True, inplace=True)
data_test.replace("'", '', regex=True, inplace=True)

# remove one leading whitespace character or double quote from news text
data_train.replace(r'^[\s"]', '', regex=True, inplace=True)
data_test.replace(r'^[\s"]', '', regex=True, inplace=True)

# remove a remaining double quote from start of news text
data_train.replace('^"', '', regex=True, inplace=True)
data_test.replace('^"', '', regex=True, inplace=True)

# remove a double quote from end of news text
data_train.replace('"$', '', regex=True, inplace=True)
data_test.replace('"$', '', regex=True, inplace=True)
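
The headlines carry the wrapper of a Python bytes literal (`b'...'` or `b"..."`). The following is a consolidated sketch of the same cleanup on a single string, not the cell's exact code. The helper name `strip_bytes_wrapper` is hypothetical, and stripping the `b` only when a quote follows is an assumption about the intended behaviour.

```python
import re

def strip_bytes_wrapper(text: str) -> str:
    """Remove the b'...' / b"..." wrapper and stray quotes from one headline."""
    # Drop the leading b only when it is followed by a quote character.
    text = re.sub(r"""^b(?=['"])""", "", text)
    # Drop single quotes and apostrophes everywhere.
    text = text.replace("'", "")
    # Trim leading whitespace/double quotes and one trailing double quote.
    text = re.sub(r'^[\s"]+|"$', "", text)
    return text

print(strip_bytes_wrapper("b'BREAKING: Musharraf to be impeached.'"))
print(strip_bytes_wrapper('b"No Help for Mexico\'s Kidnapping Surge"'))
print(strip_bytes_wrapper("britain votes"))  # no wrapper, so the b is kept
```

A bare `^b` pattern would also remove the first letter of an unwrapped headline that happens to start with "b", which is why the lookahead is used here.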

In [10]:
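# regex alternation that matches any single punctuation character
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])

# regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
for col in data_train.columns:
    data_train[col] = data_train[col].str.replace(RE_PUNCTUATION, '', regex=True)

for col in data_test.columns:
    data_test[col] = data_test[col].str.replace(RE_PUNCTUATION, '', regex=True)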
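
The punctuation pattern only works when it is interpreted as a regular expression. A small standard-library sketch of the difference on one headline from the data: a regex substitution removes the punctuation, while a literal substring replacement (which is what `Series.str.replace` does with `regex=False`) leaves the text unchanged. `str.translate` is shown as an equivalent alternative.

```python
import re
import string

RE_PUNCTUATION = "|".join(re.escape(ch) for ch in string.punctuation)
sample = "BREAKING: Musharraf to be impeached."

# Interpreted as a regex: every punctuation character is removed.
as_regex = re.sub(RE_PUNCTUATION, "", sample)

# Interpreted literally: the long pattern string never occurs in the text.
as_literal = sample.replace(RE_PUNCTUATION, "")

# Equivalent result without regular expressions.
translated = sample.translate(str.maketrans("", "", string.punctuation))

print(as_regex)
print(as_literal == sample)
print(translated == as_regex)
```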

#### III. d.  Normalise

In [11]:
for col in data_train.columns:
    data_train[col] = data_train[col].str.lower()
    
for col in data_test.columns:
    data_test[col] = data_test[col].str.lower()

#### III. e.  Combining each row to create bag of words

In [12]:
news_train = []

for row in data_train.index:
    # label-based lookup, matching the loop over data_train.index
    news_train.append(' '.join(data_train.loc[row]))

In [13]:
news_test = []

for row in data_test.index:
    news_test.append(' '.join(data_test.loc[row]))

### IV. Bag of Words Creation

In [14]:
countVectorizer = CountVectorizer(ngram_range=(1,1))
traindataset = countVectorizer.fit_transform(news_train)
testdataset = countVectorizer.transform(news_test)
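
To make the bag-of-words step concrete, here is a simplified standard-library sketch of unigram counting on made-up documents. It is not scikit-learn's implementation (for instance, it splits on whitespace only); it illustrates that the vocabulary is learned from the training documents and that words seen only at test time are ignored.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (one string per day).
train_docs = ["russia georgia conflict", "georgia troops withdraw"]
test_docs = ["russia troops oil"]

# "fit": learn the vocabulary from the training documents only,
# sorted here to give a stable column order.
vocabulary = sorted({tok for doc in train_docs for tok in doc.split()})

def to_counts(doc):
    """'transform': count occurrences of each vocabulary word in doc."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocabulary]

train_matrix = [to_counts(doc) for doc in train_docs]
test_matrix = [to_counts(doc) for doc in test_docs]

print(vocabulary)
print(train_matrix)
print(test_matrix)  # "oil" is not in the vocabulary, so it is dropped
```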

### V. Training the Model 

In [15]:
randomforestClassifier = RandomForestClassifier(n_estimators= 200, criterion='entropy')
randomforestClassifier.fit(traindataset, train['Label'])

RandomForestClassifier(criterion='entropy', n_estimators=200)

### VI. Predictions

In [16]:
predictions = randomforestClassifier.predict(testdataset)

### VII. Evaluation

In [17]:
matrix = confusion_matrix(test['Label'], predictions)
matrix

array([[ 22, 164],
       [ 24, 168]], dtype=int64)

In [18]:
acc_score = accuracy_score(test['Label'], predictions)
acc_score

0.5026455026455027

In [19]:
report = classification_report(test['Label'], predictions)
print(report)

              precision    recall  f1-score   support

           0       0.48      0.12      0.19       186
           1       0.51      0.88      0.64       192

    accuracy                           0.50       378
   macro avg       0.49      0.50      0.42       378
weighted avg       0.49      0.50      0.42       378
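
The reported metrics can be re-derived from the confusion matrix by hand. The sketch below hard-codes the matrix printed above (rows are true labels, columns are predicted labels) and also computes a majority-class baseline, i.e. the accuracy of always predicting the more frequent class. The variable names are illustrative.

```python
# Confusion matrix from the evaluation above:
# rows = true label (0, 1), columns = predicted label (0, 1).
matrix = [[22, 164],
          [24, 168]]
(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_up = tp / (tp + fp)   # precision for class 1
recall_up = tp / (tp + fn)      # recall for class 1
recall_down = tn / (tn + fp)    # recall for class 0

# Accuracy of always predicting the majority class in the test set.
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
print(f"precision(1)={precision_up:.2f} recall(1)={recall_up:.2f} recall(0)={recall_down:.2f}")
```

On these numbers the model predicts class 1 for most days, and its accuracy is slightly below the majority-class baseline.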

