# SENTIMENT ANALYSIS FOR STOCK MARKET PREDICTION

## 1. Data Setup

### Importing DJIA Stock Indices

In this section we do the following:
1. Read the DJIA index data and load it into a dataframe.
2. Set the date as the index.
3. Keep only the closing prices (close and adjusted close).

In [205]:
import numpy as np
import csv, json
import pandas as pd

#################################################################################################
# Preparing DJIA data
# Reading DJIA index prices csv file
# newline='' lets the csv module handle line endings itself (Python 3)
with open('data/DJIA_table.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    # Converting the csv file reader to a list
    data_list = list(spamreader)

header = data_list[0] 
data_list = data_list[1:] 

data_list = np.asarray(data_list)

# Selecting date, close and adjusted close for each day (columns 0, 4 and 6)
selected_data = data_list[:, [0, 4, 6]]

df = pd.DataFrame(data=selected_data[:, 1:],
                  index=selected_data[:, 0],
                  columns=['close', 'adj close'],
                  dtype='float64')


df.head()

Unnamed: 0,close,adj close
2016-07-01,17949.369141,17949.369141
2016-06-30,17929.990234,17929.990234
2016-06-29,17694.679688,17694.679688
2016-06-28,17409.720703,17409.720703
2016-06-27,17140.240234,17140.240234
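
The manual `csv.reader` and NumPy slicing above can also be expressed with `pandas.read_csv`. The following is a minimal sketch, not the notebook's method: it assumes the file's header names the columns `Date`, `Close` and `Adj Close` (the notebook itself only relies on their positions 0, 4 and 6), and `load_closing_prices` is a hypothetical helper. The in-memory sample reuses two rows from the output above, with the unused columns filled with placeholder zeros.

```python
import io

import pandas as pd


def load_closing_prices(source):
    """Load a DJIA price table and keep only the closing prices, indexed by date.

    `source` is a file path or file-like object. The column names used here
    are assumptions about the CSV header.
    """
    prices = pd.read_csv(source, index_col="Date",
                         usecols=["Date", "Close", "Adj Close"])
    # Rename to match the column names used in this notebook.
    return prices.rename(columns={"Close": "close", "Adj Close": "adj close"})


# Tiny stand-in for data/DJIA_table.csv; the zeros are placeholders, not real data.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2016-07-01,0,0,0,17949.369141,0,17949.369141\n"
    "2016-06-30,0,0,0,17929.990234,0,17929.990234\n"
)
sample_df = load_closing_prices(sample_csv)
print(sample_df)
```

With the real file, the call would be `load_closing_prices('data/DJIA_table.csv')`, provided the header matches the assumed names.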


### Importing Reddit News Data

1. Load the Reddit news headlines into a dataframe.
2. Set the date as the index.

In [206]:
# newline='' lets the csv module handle line endings itself (Python 3)
with open('data/RedditNews.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    # Converting the csv file reader to a list
    newsdata_list = list(spamreader)

# Separating header from the data
header1 = newsdata_list[0] 
newsdata_list = newsdata_list[1:] 
newsdata_list = np.asarray(newsdata_list)

selected_data = newsdata_list[:, [0, 1]]
newsdf = pd.DataFrame(data=selected_data[:, 1:],
                      index=selected_data[:, 0],
                      columns=['articles'],
                      dtype='str')

newsdf.head()

Unnamed: 0,articles
2016-07-01,A 117-year-old woman in Mexico City finally re...
2016-07-01,IMF chief backs Athens as permanent Olympic host
2016-07-01,"The president of France says if Brexit won, so..."
2016-07-01,British Man Who Must Give Police 24 Hours' Not...
2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...
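
Because several headlines share the same date, the index of `newsdf` is not unique. A quick way to see how many headlines each date contributes is to count rows per index value. The sketch below runs on a small stand-in frame rather than the real data: `sample_news` and `headlines_per_day` are hypothetical names, and the headlines are truncated strings copied from the outputs displayed in this notebook.

```python
import pandas as pd

sample_news = pd.DataFrame(
    {"articles": [
        "A 117-year-old woman in Mexico City finally re...",
        "IMF chief backs Athens as permanent Olympic host",
        "Jamaica proposes marijuana dispensers for tour...",
    ]},
    index=["2016-07-01", "2016-07-01", "2016-06-30"],
)

# The date index repeats, one row per headline.
print(sample_news.index.is_unique)

# Number of headlines per date.
headlines_per_day = sample_news.groupby(level=0).size()
print(headlines_per_day)
```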


### Merging DJIA and News Data

1. Merge all the headlines from a single day into one string.
2. Create a new dataframe with the date, the price and the merged headlines.
3. Save the dataframe as a csv file.

In [207]:
df['articles'] = ''

In [238]:
datelist = df.index.values

In [210]:
newsdict = {}
for day in datelist:
    print("Processing: " + str(day))
    # Concatenate every headline published on this day
    currentstr = ''
    count = 0
    for index, row in newsdf.iterrows():
        if index == day:
            currentstr = currentstr + row['articles'] + ' '
            count = count + 1
    newsdict[day] = currentstr
    print(str(count) + 'Articles from ' + str(day) + ' processed.')


Processing: 2016-07-01
25Articles from 2016-07-01 processed.
Processing: 2016-06-30
25Articles from 2016-06-30 processed.
Processing: 2016-06-29
25Articles from 2016-06-29 processed.
Processing: 2016-06-28
25Articles from 2016-06-28 processed.
Processing: 2016-06-27
25Articles from 2016-06-27 processed.
...

In [240]:
for index, row in df.iterrows():
    text = newsdict[index]
    df.at[index, 'articles'] = text
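
The nested loop above walks the whole `newsdf` once for every trading day. An equivalent result can be sketched with a single `groupby`, shown here on small stand-in frames; `sample_prices`, `sample_news` and `merged` are hypothetical names, and the price frame holds only the `adj close` column for brevity. Like the loop, it appends a space after every headline and leaves an empty string for days without news.

```python
import pandas as pd

sample_prices = pd.DataFrame(
    {"adj close": [17949.369141, 17929.990234]},
    index=["2016-07-01", "2016-06-30"],
)
sample_news = pd.DataFrame(
    {"articles": [
        "IMF chief backs Athens as permanent Olympic host",
        "A 117-year-old woman in Mexico City finally re...",
    ]},
    index=["2016-07-01", "2016-07-01"],
)

# Concatenate all headlines that share a date, in their original order.
merged = sample_news.groupby(level=0)["articles"].apply(
    lambda headlines: "".join(h + " " for h in headlines)
)

# Align on the price index; days without headlines get ''.
sample_prices["articles"] = merged.reindex(sample_prices.index).fillna("")
print(sample_prices)
```

This makes one pass over the headlines instead of one pass per trading day, which matters when the news table has many rows per date.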

In [228]:
# Strip punctuation; regex=True is required for pattern matching in recent pandas
df["articles"] = df['articles'].str.replace(r'[^\w\s]', '', regex=True)
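
The pattern `[^\w\s]` matches any single character that is neither a word character (letter, digit or underscore) nor whitespace, so replacing every match with an empty string strips punctuation. A short illustration on a stand-in Series follows (`sample` and `cleaned` are hypothetical names); note that hyphenated words are fused and thousands separators disappear.

```python
import pandas as pd

sample = pd.Series([
    "A 117-year-old woman in Mexico City",
    "2,500 Scientists To Australia: If You Want To",
])

# Remove every character that is not a word character or whitespace.
cleaned = sample.str.replace(r"[^\w\s]", "", regex=True)
print(cleaned.tolist())
```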

In [243]:
df['prices'] = df['adj close'].apply(np.int64)
df

Unnamed: 0,close,adj close,articles,prices
2016-07-01,17949.369141,17949.369141,A 117-year-old woman in Mexico City finally re...,17949
2016-06-30,17929.990234,17929.990234,Jamaica proposes marijuana dispensers for tour...,17929
2016-06-29,17694.679688,17694.679688,Explosion At Airport In Istanbul Yemeni former...,17694
2016-06-28,17409.720703,17409.720703,"2,500 Scientists To Australia: If You Want To ...",17409
2016-06-27,17140.240234,17140.240234,Barclays and RBS shares suspended from trading...,17140
2016-06-24,17400.750000,17400.750000,David Cameron to Resign as PM After EU Referen...,17400
2016-06-23,18011.070312,18011.070312,Today The United Kingdom decides whether to re...,18011
2016-06-22,17780.830078,17780.830078,German government agrees to ban fracking indef...,17780
2016-06-21,17829.730469,17829.730469,An Australian athlete who has competed in six ...,17829
2016-06-20,17804.869141,17804.869141,A staggering 87 percent of Venezuelans say the...,17804
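
Applying `np.int64` to each value truncates the fractional part rather than rounding, which is why 17400.750000 becomes 17400 in the table above. The sketch below contrasts truncation with rounding on three values taken from that table, using `astype` as a vectorized equivalent of the `apply` call. The names `adj_close`, `truncated` and `rounded` are illustrative, and the rounded variant is an addition for comparison, not something the notebook does.

```python
import numpy as np
import pandas as pd

adj_close = pd.Series([17949.369141, 17400.750000, 17829.730469])

# Same result as adj_close.apply(np.int64): the fraction is dropped.
truncated = adj_close.astype(np.int64)

# Rounding to the nearest integer first gives different values.
rounded = adj_close.round().astype(np.int64)

print(truncated.tolist())
print(rounded.tolist())
```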


In [None]:
df_stocks = df[['prices', 'articles']]

In [None]:
df_stocks.to_csv('combinedDataFile.csv')
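
`to_csv` writes the date index as the first column with an empty header, so reading the file back without options yields a column named `Unnamed: 0`. A minimal sketch of the round trip follows, using an in-memory buffer instead of `combinedDataFile.csv` and a one-row stand-in for `df_stocks` (`sample_stocks`, `buffer` and `restored` are hypothetical names); passing `index_col=0` restores the dates as the index.

```python
import io

import pandas as pd

sample_stocks = pd.DataFrame(
    {"prices": [17949],
     "articles": ["IMF chief backs Athens as permanent Olympic host"]},
    index=["2016-07-01"],
)

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
sample_stocks.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```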