# Assignment 4 - Using NLP to play the stock market

In this assignment, we'll use everything we've learned to analyze corporate news and pick stocks. Be aware that in this assignment, we're trying to beat the benchmark of random chance (aka better than 50%).

This assignment will involve building three models:

**1. An RNN based on word inputs**

**2. A CNN based on character inputs**

**3. A neural net architecture that merges the previous two models**

You will apply these models to predict whether a stock's return will be positive or negative on the same day as a news publication.

## Your X - Reuters news data

Reuters is a news outlet that reports on corporations, among many other things. Stored in the `news_reuters.csv` file is news data listed in columns. The corresponding columns are the `ticker`, `name of company`, `date of publication`, `headline`, `first sentence`, and `news category`.

In this assignment it is up to you to decide how to clean this dataset. For instance, many of the first sentences begin with a location name showing where the reporting was done. This is largely irrelevant information and will probably just make your data noisier. You can also choose to subset on a certain news category, which might enhance your model performance and also limit the size of your data.
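As an illustration of the dateline cleaning mentioned here, the following is a minimal sketch using a regular expression. The pattern and the `strip_dateline` helper are assumptions made for this example, not part of the assignment: it treats one or more leading all-caps tokens, optionally followed by a month and day, as the dateline. It will also strip a sentence that legitimately starts with an all-caps word (for example a ticker or "U.S."), so inspect the results before relying on it.

```python
import re

# Hypothetical pattern: all-caps city tokens at the start of the sentence,
# optionally followed by a "Month day" date.
_MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December|"
    "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept|Sep|Oct|Nov|Dec"
)
DATELINE = re.compile(
    r"^\s*(?:[A-Z][A-Z.\-]+\s+)+"                  # e.g. "NEW YORK " or "LONDON "
    r"(?:(?:" + _MONTHS + r")\.?\s+\d{1,2}\s+)?"   # e.g. "April 15 "
)


def strip_dateline(sentence):
    """Remove a leading dateline such as 'NEW YORK April 15 ' if one is present."""
    return DATELINE.sub("", sentence, count=1)


cleaned = strip_dateline("NEW YORK April 15 Apple Inc on Tuesday lost an attempt")
print(cleaned)
```

On a DataFrame column this could be applied with `.apply(strip_dateline)` after replacing missing values with empty strings.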

## Your Y - Stock information from Yahoo! Finance

Trading data from Yahoo! Finance was collected and then normalized using the [S&P 500](https://en.wikipedia.org/wiki/S%26P_500_Index). This is stored in the `stockReturns.json` file. 

In our dataset, the ticker for the S&P is `^GSPC`. Each ticker is compared to the S&P and then judged on whether it is outperforming (positive value) or under-performing (negative value) the S&P. Each value is reported on a daily interval from 2004 to now.

Below is a diagram of the data in the JSON file. Note that there are three types of data: short (1-day return), mid (7-day return), and long (28-day return).

```
          term (short/mid/long)
         /         |         \
   ticker A   ticker B   ticker C
      /   \      /   \      /   \
  date1 date2 date1 date2 date1 date2
```

You will need to pick a length of time to focus on (day, week, month). You are welcome to train models on each dataset as well.  
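One way to work with this nesting is to flatten a single term into a long table with one row per ticker and date. The sketch below assumes the file parses to `returns[term][ticker][date]`, as in the diagram. The toy dictionary (with made-up values), the `flatten_term` helper, and the column names are illustrative only.

```python
import pandas as pd

# Toy stand-in for the parsed contents of stockReturns.json (values are made up).
returns = {
    "short": {
        "AAA": {"20040106": -0.0013, "20040107": 0.0162},
        "BBB": {"20040106": 0.0015},
    },
}


def flatten_term(returns, term):
    """Turn returns[term][ticker][date] into rows of (ticker, date, return)."""
    rows = [
        (ticker, int(date), value)
        for ticker, by_date in returns[term].items()
        for date, value in by_date.items()
    ]
    return pd.DataFrame(rows, columns=["ticker", "date", "return"])


short_returns = flatten_term(returns, "short")
print(short_returns)
```

With the real file, `returns` would come from `json.load` on `stockReturns.json`. Converting the date keys to integers here keeps them comparable with integer dates in the news data.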

Transform the return data such that the outcome will be binary:

```
label[y < 0] = 0
label[y >= 0] = 1
```
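The pseudocode above can be written as a single vectorized comparison. This is a small sketch with made-up return values; note that a return of exactly zero is labeled 1, matching the `>=` in the rule.

```python
import numpy as np

returns = np.array([-0.0013, 0.0162, 0.0, -0.0089])  # made-up daily returns

# True where the return is non-negative, then cast booleans to 0/1
labels = (returns >= 0).astype(int)
print(labels)
```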

Finally, this data needs to be joined on the date and ticker: for each date of news publication, we want to join the corresponding corporation's news with its return information. We make the assumption that the day's return will reflect the sentiment of the news, regardless of timing.
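A join on two keys can be sketched with `pandas.DataFrame.merge`. The two toy frames and their column names are assumptions for illustration. An inner join keeps only the news rows that have a matching return, and the key columns must share a dtype in both frames (for example, both integer dates) or no rows will match.

```python
import pandas as pd

# Toy news and label tables (all values are made up)
news = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB"],
    "date": [20040106, 20040107, 20040106],
    "headline": ["h1", "h2", "h3"],
})
labels = pd.DataFrame({
    "ticker": ["AAA", "BBB", "BBB"],
    "date": [20040106, 20040106, 20040107],
    "label": [0, 1, 1],
})

# Keep only (ticker, date) pairs present in both tables
joined = news.merge(labels, on=["ticker", "date"], how="inner")
print(joined)
```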


# Your models - RNN, CNN, and RNN+CNN

Your RNN model needs to be based on word inputs: embed the words, encode them with an RNN layer, and finish with a decoding step (such as softmax or some other choice).

Your CNN model will be based on characters. For reference on how to do this, look at the CNN class demonstration in the course repository.

Finally you will combine the architecture for both of these models, either [merging](https://github.com/ShadyF/cnn-rnn-classifier) using the [Functional API](https://keras.io/getting-started/functional-api-guide/) or [stacking](http://www.aclweb.org/anthology/S17-2134). See the links for reference.

For each of these models, you will need to:
1. Create a train and test set, retaining the same test set for every model
2. Show the architecture for each model, printing it in your Python notebook
3. Report the performance according to some metric
4. Compare the performance of all of these models in a table (precision and recall)
5. Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.
6. For each model, calculate the return from the three most probable positive stock returns. Compare it to the actual return. Print this information in a table.
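For the last item, one possible reading is to take the three test rows with the highest predicted probability of a positive return and average their actual returns. The sketch below follows that interpretation. The `top_k_return` helper, the column names, and the numbers are all hypothetical.

```python
import pandas as pd


def top_k_return(test_frame, k=3):
    """Return the k rows with the highest predicted probability and their mean actual return."""
    top = test_frame.nlargest(k, "prob_positive")
    return top, top["actual_return"].mean()


# Made-up predictions and realized returns for four test rows
test_frame = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "prob_positive": [0.9, 0.2, 0.7, 0.8],
    "actual_return": [0.01, -0.02, 0.03, -0.01],
})

picks, mean_return = top_k_return(test_frame)
print(picks)
print("Mean actual return of the picks:", mean_return)
```

Here `prob_positive` stands in for the model's sigmoid output on the test set, and `actual_return` for the untransformed return before it was binarized.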

### Good luck!

## Model 1: RNN

In [1]:
import json
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support
#json.loads()

In [2]:
## Read in the Reuters news dataset and assign it to the X variable
X = pd.read_csv('news_reuters.csv')
X.columns = ['ticker', 'name of company', 'date of publication', 'headline', 'first sentence', 'news category']

In [3]:
X.head()

Unnamed: 0,ticker,name of company,date of publication,headline,first sentence,news category
0,AA,Alcoa Corporation,20110708,Global markets weekahead: Lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
1,AA,Alcoa Corporation,20110708,Jobs halt Wall Street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory
2,AA,Alcoa Corporation,20110708,REFILE-TABLE-Australia's top carbon polluters,CANBERRA July 8 Following is a list of Austr...,normal
3,AA,Alcoa Corporation,20110708,US STOCKS-Jobs data hits stocks but earnings ...,* Google slumps on downgrade one of Nasdaq's...,normal
4,AA,Alcoa Corporation,20110708,US STOCKS-Jobs halt Wall St rally investors e...,* Dow off 0.5 pct S&P down 0.7 pct Nasdaq o...,normal


In [4]:
X.shape

(215287, 6)

In [5]:
## Read in the stock returns and assign them to stock
stock = pd.read_json('stockReturns.json')
stock.head()

Unnamed: 0,long,mid,short
AAPL,"{'20040106': -0.0023, '20040107': -0.0016, '20...","{'20040106': 0.06760000000000001, '20040107': ...","{'20040106': -0.0013000000000000002, '20040107..."
ABB,"{'20040106': 0.09630000000000001, '20040107': ...","{'20040106': 0.09340000000000001, '20040107': ...","{'20040106': 0.0015, '20040107': -0.0107000000..."
ABMD,"{'20040106': 0.08360000000000001, '20040107': ...","{'20040106': 0.039400000000000004, '20040107':...","{'20040106': 0.0102, '20040107': 0.0217, '2004..."
ABR,"{'20040413': 0.0367, '20040414': 0.0053, '2004...","{'20040413': 0.0082, '20040414': 0.01970000000...","{'20040413': 0.013900000000000001, '20040414':..."
ACAD,"{'20040602': -0.049300000000000004, '20040603'...","{'20040602': -0.0821, '20040603': -0.0611, '20...","{'20040602': -0.0346, '20040603': -0.0005, '20..."


In [6]:
stock.index

Index(['AAPL', 'ABB', 'ABMD', 'ABR', 'ACAD', 'ACAT', 'ACFC', 'ACRX', 'ADMA',
       'ADMS',
       ...
       'WPXP', 'WSFSL', 'WSO.B', 'WU', 'XCO', 'XLNX', 'ZBK', 'ZBRA', 'ZIXI',
       '^GSPC'],
      dtype='object', length=501)

In [7]:
## Assign the 'short' column of stock to the variable y as the target
y = stock['short']

In [8]:
y.head()

AAPL    {'20040106': -0.0013000000000000002, '20040107...
ABB     {'20040106': 0.0015, '20040107': -0.0107000000...
ABMD    {'20040106': 0.0102, '20040107': 0.0217, '2004...
ABR     {'20040413': 0.013900000000000001, '20040414':...
ACAD    {'20040602': -0.0346, '20040603': -0.0005, '20...
Name: short, dtype: object

In [9]:
y.keys

<bound method Series.keys of AAPL     {'20040106': -0.0013000000000000002, '20040107...
ABB      {'20040106': 0.0015, '20040107': -0.0107000000...
ABMD     {'20040106': 0.0102, '20040107': 0.0217, '2004...
ABR      {'20040413': 0.013900000000000001, '20040414':...
ACAD     {'20040602': -0.0346, '20040603': -0.0005, '20...
ACAT                                                    {}
ACFC     {'20041007': 0.0019, '20041008': 0.0302, '2004...
ACRX     {'20110215': -0.0129, '20110216': -0.022600000...
ADMA     {'20131022': 0.0002, '20131023': 0.0095, '2013...
ADMS     {'20140415': -0.0223, '20140416': -0.0247, '20...
AEP      {'20040106': -0.0108, '20040107': 0.0065000000...
AFA                                                     {}
AGIO     {'20130726': 0.0258, '20130730': 0.01560000000...
AGN      {'20040106': 0.0023, '20040107': -0.0076, '200...
AGR      {'20151222': 0.028, '20151223': 0.039900000000...
AGU                                                     {}
AHL      {'20040106': -0.01

In [10]:
##Convert the y value to dataframe
y= pd.DataFrame.from_dict(y)

In [11]:
y.head()

Unnamed: 0,short
AAPL,"{'20040106': -0.0013000000000000002, '20040107..."
ABB,"{'20040106': 0.0015, '20040107': -0.0107000000..."
ABMD,"{'20040106': 0.0102, '20040107': 0.0217, '2004..."
ABR,"{'20040413': 0.013900000000000001, '20040414':..."
ACAD,"{'20040602': -0.0346, '20040603': -0.0005, '20..."


In [12]:
## Reset y index
y.reset_index(inplace=True)

In [13]:
##convert column short in y to a series
ww = y['short'].apply(pd.Series)

In [14]:
ww.head()

Unnamed: 0,20040106,20040107,20040108,20040109,20040113,20040114,20040115,20040116,20040121,20040122,...,20180405,20180406,20180410,20180411,20180412,20180413,20180417,20180418,20180419,20180420
0,-0.0013,0.0162,0.0311,-0.0089,0.0226,-0.0083,-0.063,-0.0068,-0.0169,-0.0153,...,0.0001,-0.0037,0.0021,0.0009,0.0016,0.0063,0.0031,-0.0031,-0.023,-0.0333
1,0.0015,-0.0107,0.1233,-0.0059,0.0376,0.044,0.026,-0.0482,0.0041,-0.0062,...,0.0114,0.0057,-0.0017,-0.0036,-0.0047,0.0038,0.0015,0.0064,0.0397,0.0057
2,0.0102,0.0217,0.013,0.0103,-0.0086,0.0153,-0.011,0.0272,-0.0065,0.0549,...,0.0056,0.0017,0.0247,-0.0075,-0.0064,0.0022,0.0164,0.0096,0.0019,0.0068
3,,,,,,,,,,,...,-0.0091,0.0176,-0.02,0.0009,-0.0025,-0.0098,-0.0048,-0.0054,-0.0,0.0132
4,,,,,,,,,,,...,-0.0163,-0.0053,0.0559,0.0794,0.0017,-0.0175,0.0357,0.0042,-0.0104,0.0207


In [15]:
#assigning the column index in y variable as ticker in ww
ww['ticker'] = y['index']

In [16]:
ww.head()

Unnamed: 0,20040106,20040107,20040108,20040109,20040113,20040114,20040115,20040116,20040121,20040122,...,20180406,20180410,20180411,20180412,20180413,20180417,20180418,20180419,20180420,ticker
0,-0.0013,0.0162,0.0311,-0.0089,0.0226,-0.0083,-0.063,-0.0068,-0.0169,-0.0153,...,-0.0037,0.0021,0.0009,0.0016,0.0063,0.0031,-0.0031,-0.023,-0.0333,AAPL
1,0.0015,-0.0107,0.1233,-0.0059,0.0376,0.044,0.026,-0.0482,0.0041,-0.0062,...,0.0057,-0.0017,-0.0036,-0.0047,0.0038,0.0015,0.0064,0.0397,0.0057,ABB
2,0.0102,0.0217,0.013,0.0103,-0.0086,0.0153,-0.011,0.0272,-0.0065,0.0549,...,0.0017,0.0247,-0.0075,-0.0064,0.0022,0.0164,0.0096,0.0019,0.0068,ABMD
3,,,,,,,,,,,...,0.0176,-0.02,0.0009,-0.0025,-0.0098,-0.0048,-0.0054,-0.0,0.0132,ABR
4,,,,,,,,,,,...,-0.0053,0.0559,0.0794,0.0017,-0.0175,0.0357,0.0042,-0.0104,0.0207,ACAD


In [17]:
ww.set_index('ticker', inplace=True)

In [18]:
ww.head()

ticker,20040106,20040107,20040108,20040109,20040113,20040114,20040115,20040116,20040121,20040122,...,20180405,20180406,20180410,20180411,20180412,20180413,20180417,20180418,20180419,20180420
AAPL,-0.0013,0.0162,0.0311,-0.0089,0.0226,-0.0083,-0.063,-0.0068,-0.0169,-0.0153,...,0.0001,-0.0037,0.0021,0.0009,0.0016,0.0063,0.0031,-0.0031,-0.023,-0.0333
ABB,0.0015,-0.0107,0.1233,-0.0059,0.0376,0.044,0.026,-0.0482,0.0041,-0.0062,...,0.0114,0.0057,-0.0017,-0.0036,-0.0047,0.0038,0.0015,0.0064,0.0397,0.0057
ABMD,0.0102,0.0217,0.013,0.0103,-0.0086,0.0153,-0.011,0.0272,-0.0065,0.0549,...,0.0056,0.0017,0.0247,-0.0075,-0.0064,0.0022,0.0164,0.0096,0.0019,0.0068
ABR,,,,,,,,,,,...,-0.0091,0.0176,-0.02,0.0009,-0.0025,-0.0098,-0.0048,-0.0054,-0.0,0.0132
ACAD,,,,,,,,,,,...,-0.0163,-0.0053,0.0559,0.0794,0.0017,-0.0175,0.0357,0.0042,-0.0104,0.0207


In [19]:
## Apply stack to ww and assign the result to new_y
new_y = ww.stack()

In [20]:
new_y

ticker          
AAPL    20040106   -0.0013
        20040107    0.0162
        20040108    0.0311
        20040109   -0.0089
        20040113    0.0226
        20040114   -0.0083
        20040115   -0.0630
        20040116   -0.0068
        20040121   -0.0169
        20040122   -0.0153
        20040123    0.0206
        20040127    0.0188
        20040128   -0.0134
        20040129    0.0042
        20040130   -0.0065
        20040203   -0.0007
        20040204   -0.0198
        20040205    0.0263
        20040206    0.0059
        20040210    0.0040
        20040211    0.0248
        20040212    0.0049
        20040213   -0.0299
        20040218    0.0045
        20040219   -0.0230
        20040220   -0.0066
        20040224    0.0110
        20040225    0.0143
        20040226    0.0080
        20040227    0.0440
                     ...  
^GSPC   20180228    0.0000
        20180301    0.0000
        20180302    0.0000
        20180306    0.0000
        20180307    0.0000
        201

In [21]:
##Converting to dataframe
new_y = new_y.to_frame(name=None)

In [22]:
new_y.head()

ticker,,0
AAPL,20040106,-0.0013
AAPL,20040107,0.0162
AAPL,20040108,0.0311
AAPL,20040109,-0.0089
AAPL,20040113,0.0226


In [23]:
## Reset the index
new_y.reset_index(inplace=True)

In [24]:
new_y.head()

Unnamed: 0,ticker,level_1,0
0,AAPL,20040106,-0.0013
1,AAPL,20040107,0.0162
2,AAPL,20040108,0.0311
3,AAPL,20040109,-0.0089
4,AAPL,20040113,0.0226


In [25]:
new_y.shape

(908537, 3)

In [26]:
## Assigning the column name
new_y.columns= ['ticker','date of publication','target']

In [27]:
new_y.head()

Unnamed: 0,ticker,date of publication,target
0,AAPL,20040106,-0.0013
1,AAPL,20040107,0.0162
2,AAPL,20040108,0.0311
3,AAPL,20040109,-0.0089
4,AAPL,20040113,0.0226


In [28]:
#function to convert target to 0 or 1

def convertLabel(target):
    
    if target < 0:
        return 0
    else:
        return 1

In [29]:
new_y['target'] = new_y['target'].apply(convertLabel)

In [30]:
new_y.head()

Unnamed: 0,ticker,date of publication,target
0,AAPL,20040106,0
1,AAPL,20040107,1
2,AAPL,20040108,1
3,AAPL,20040109,0
4,AAPL,20040113,1


In [31]:
X.head()

Unnamed: 0,ticker,name of company,date of publication,headline,first sentence,news category
0,AA,Alcoa Corporation,20110708,Global markets weekahead: Lacking conviction,LONDON Investors are unlikely to gain strong c...,normal
1,AA,Alcoa Corporation,20110708,Jobs halt Wall Street rally investors eye ear...,NEW YORK Stocks fell on Friday as a weak jobs ...,topStory
2,AA,Alcoa Corporation,20110708,REFILE-TABLE-Australia's top carbon polluters,CANBERRA July 8 Following is a list of Austr...,normal
3,AA,Alcoa Corporation,20110708,US STOCKS-Jobs data hits stocks but earnings ...,* Google slumps on downgrade one of Nasdaq's...,normal
4,AA,Alcoa Corporation,20110708,US STOCKS-Jobs halt Wall St rally investors e...,* Dow off 0.5 pct S&P down 0.7 pct Nasdaq o...,normal


In [32]:
X.dtypes

ticker                 object
name of company        object
date of publication     int64
headline               object
first sentence         object
news category          object
dtype: object

In [33]:
new_y.dtypes

ticker                 object
date of publication    object
target                  int64
dtype: object

In [34]:
#convert date of publication to int
new_y['date of publication'] = new_y['date of publication'].astype(int)

In [35]:
new_y.dtypes

ticker                 object
date of publication     int64
target                  int64
dtype: object

In [36]:
## Merge X and new_y into one dataframe
df = pd.merge(X,new_y)

In [37]:
df.head()

Unnamed: 0,ticker,name of company,date of publication,headline,first sentence,news category,target
0,AAPL,1-800 FLOWERSCOM Inc,20140415,Apple cannot escape U.S. states' e-book antitr...,NEW YORK Apple Inc on Tuesday lost an attempt ...,normal,0
1,AAPL,1-800 FLOWERSCOM Inc,20140415,Apple cannot escape U.S. states' e-book antitr...,NEW YORK April 15 Apple Inc on Tuesday lost a...,normal,0
2,AAPL,1-800 FLOWERSCOM Inc,20140415,Keep Steve Jobs' personality out of trial: tec...,SAN FRANCISCO Witnesses at an upcoming trial o...,topStory,0
3,AAPL,1-800 FLOWERSCOM Inc,20140415,Smartphone makers carriers embrace anti-theft...,NEW YORK April 15 Major U.S. wireless carrier...,normal,0
4,AAPL,1-800 FLOWERSCOM Inc,20140415,UPDATE 1-Keep Steve Jobs' personality out of t...,,normal,0


In [38]:
df.shape

(30903, 7)

In [39]:
# Check which text is longer, the headline or the first sentence: the first sentence is longer
df.iloc[22,4]

" SEOUL  July 7 Samsung Electronics   the world's top maker of memory chips and televisions  estimated its April-June operating profit would fall 26 percent year on year  as its LCD display business is widely expected to report another loss."

In [40]:
df.iloc[22,3]

'Samsung estimates Q2 profit down 26 pct '

In [41]:
## new x and y and then train-test split 
new_x = df['first sentence']
new_y = df['target']

### Data Preparation


In [42]:
import numpy as np
import nltk
vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in new_x]

In [43]:
import itertools

In [44]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique words tokens." % len(word_freq.items()))

Found 31889 unique words tokens.


In [45]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print("Using vocabulary size %d." % vocabulary_size)
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

Using vocabulary size 8000.
The least frequent word in our vocabulary is 'restructured' and appeared 7 times.


In [46]:
# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

#print("\nExample sentence: '%s'" % sentences[0])
print("\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])


Example sentence after Pre-processing: '['NEW', 'YORK', 'Apple', 'Inc', 'on', 'Tuesday', 'lost', 'an', 'attempt', 'to', 'dismiss', 'lawsuits', 'by', 'state', 'attorneys', 'general', 'accusing', 'it', 'of', 'conspiring', 'with', 'five', 'major', 'publishers', 'to', 'fix', 'e-book', 'prices', '.']'


In [47]:
type(tokenized_sentences)

list

In [48]:
tokenized_sentences_x = pd.Series(tokenized_sentences)

In [49]:
type(tokenized_sentences_x)

pandas.core.series.Series

In [50]:
type(new_x)

pandas.core.series.Series

In [51]:
new_x.shape

(30903,)

In [52]:
tokenized_sentences_x.shape

(30903,)

In [53]:
tokenized_sentences_x.head()

0    [NEW, YORK, Apple, Inc, on, Tuesday, lost, an,...
1    [NEW, YORK, April, 15, Apple, Inc, on, Tuesday...
2    [SAN, FRANCISCO, UNKNOWN_TOKEN, at, an, upcomi...
3    [NEW, YORK, April, 15, Major, U.S., wireless, ...
4                                                   []
dtype: object

In [54]:
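# Create the training data: map each token to its vocabulary index.
# Note that sent[:-1] drops the last token of every sentence (usually the final period).
X_train_tokenized = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences], dtype=object)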

In [55]:
#testing to see if the words were tokenized
X_train_tokenized[0]

[43,
 44,
 26,
 11,
 6,
 37,
 758,
 35,
 995,
 1,
 1661,
 1060,
 20,
 386,
 3338,
 1008,
 662,
 17,
 3,
 1449,
 15,
 512,
 184,
 1267,
 1,
 2040,
 1746,
 209]

In [56]:
X_train_tokenized

array([ list([43, 44, 26, 11, 6, 37, 758, 35, 995, 1, 1661, 1060, 20, 386, 3338, 1008, 662, 17, 3, 1449, 15, 512, 184, 1267, 1, 2040, 1746, 209]),
       list([43, 44, 73, 187, 26, 11, 6, 37, 758, 35, 995, 1, 1661, 1060, 20, 386, 3338, 1008, 662, 17, 3, 1449, 15, 512, 184, 1267, 1, 2040, 1746, 209]),
       list([95, 106, 7999, 29, 35, 744, 366, 57, 7999, 2071, 5, 1103, 1160, 440, 67, 58, 1647, 1, 264, 1297, 22, 26, 1476, 1040, 1203, 48, 51, 4, 7999, 51, 554, 184, 455, 113, 4458, 5, 4, 108, 312]),
       ...,
       list([7999, 7055, 97, 7999, 1070, 25, 230, 1, 126, 1184, 2414, 11, 8, 2184, 65, 10, 14, 2093, 30, 5, 330, 0, 294, 626, 50, 276, 179, 119, 15, 0, 1833]),
       list([73, 259, 7999, 7055, 97, 7999, 1070, 25, 230, 1, 126, 1184, 2414, 11, 8, 2184, 65, 10, 14, 2093, 30, 5, 330, 0, 294, 626, 50, 276, 179, 119, 15, 0, 1833]),
       list([13, 7999, 399, 597, 125, 7, 634, 63, 295, 107])], dtype=object)

In [57]:
### - Everything in data prep seems good

In [58]:
X_train_tokenized.shape

(30903,)

In [59]:
new_y.shape

(30903,)

In [60]:
new_x.shape

(30903,)

## Train and Test Split

In [61]:
#Train Test split 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_train_tokenized, new_y, test_size=0.3, random_state=0)

## Model 1: RNN

In [62]:
# truncate and pad input sequences
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences

max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

Using TensorFlow backend.


In [63]:
embedding_vecor_length = 32

In [64]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.layers import Conv1D, MaxPooling1D

# create the model
model = Sequential()

model.add(Embedding(31889, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           1020448   
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 1,073,749
Trainable params: 1,073,749
Non-trainable params: 0
_________________________________________________________________
None


In [65]:
#Fit model
model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the RNN model
rnn_scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (rnn_scores[1]*100))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Accuracy: 61.19%


#### RNN ACCURACY


In [66]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("RNN Accuracy: %.2f%%" % (scores[1]*100))

RNN Accuracy: 61.19%


In [67]:
y_pred = model.predict(X_test)

In [68]:
y_pred = (y_pred > 0.5).astype(int)

In [69]:
rnn_precision_recall = precision_recall_fscore_support(y_test, y_pred, average='micro')

In [70]:
# The precision, recall and F-score
rnn_precision_recall

(0.61190810052852984, 0.61190810052852984, 0.61190810052852984, None)

In [71]:
print("RNN Precision score: %.2f%%" % (rnn_precision_recall[0]*100))

RNN Precision score: 61.19%


In [72]:
print("RNN Recall score: %.2f%%" % (rnn_precision_recall[1]*100))

RNN Recall score: 61.19%


In [73]:
print("RNN F score: %.2f%%" % (rnn_precision_recall[2]*100))

RNN F score: 61.19%
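Precision, recall, and F-score all equal the accuracy here because `average='micro'` pools true positives, false positives, and false negatives across both classes. In single-label classification every mistake counts once as a false positive and once as a false negative, so the pooled ratios collapse to accuracy. The sketch below shows this with made-up labels in plain NumPy; the `counts` helper is hypothetical. In scikit-learn, `average='binary'` reports the positive class only, which is what the first pair of numbers corresponds to.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # made-up labels
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1])  # made-up predictions


def counts(y_true, y_pred, cls):
    """True positives, false positives and false negatives for one class."""
    tp = int(np.sum((y_true == cls) & (y_pred == cls)))
    fp = int(np.sum((y_true != cls) & (y_pred == cls)))
    fn = int(np.sum((y_true == cls) & (y_pred != cls)))
    return tp, fp, fn


# Positive class only
tp, fp, fn = counts(y_true, y_pred, 1)
binary_precision = tp / (tp + fp)
binary_recall = tp / (tp + fn)

# Micro average: pool the counts of both classes before dividing
pooled = np.sum([counts(y_true, y_pred, c) for c in (0, 1)], axis=0)
micro_precision = pooled[0] / (pooled[0] + pooled[1])
micro_recall = pooled[0] / (pooled[0] + pooled[2])

accuracy = float(np.mean(y_true == y_pred))

print("binary:", binary_precision, binary_recall)
print("micro: ", micro_precision, micro_recall, "accuracy:", accuracy)
```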


In [74]:
yt = y_test
yt = yt.to_frame(name=None)
yt.reset_index(inplace = True)
yt.columns = ['index','true_label']
yt['pred'] = y_pred
token_data = tokenized_sentences_x.to_frame(name=None)
token_data.reset_index(inplace = True)
token_data.columns= ['index','data']
ty = pd.merge(yt,token_data)

ty['Correct prediction'] = ty['true_label'] == ty['pred']
rnn_bad_classification = ty[ty['Correct prediction'] == False]
rnn_good_classification = ty[ty['Correct prediction'] == True]    


## Model 2: CNN

In [75]:
# Use the raw text here, not the encoded or tokenized version
#new_x new_y
import tensorflow as tf
import numpy as np
from keras.utils.np_utils import to_categorical

In [76]:
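docs = []        # each document is a list of lower-cased characters
sentiments = []  # binary label for each document

for sentence, sentiment in zip(new_x, new_y):
    # Iterating over a string yields its characters, so this lower-cases character by character
    chars_cleaned = [ch.lower() for ch in sentence]
    docs.append(chars_cleaned)
    sentiments.append(sentiment)

len(docs), len(sentiments)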

(30903, 30903)

In [77]:
maxlen = 1024 
nb_filter = 256
dense_outputs = 1024
filter_kernels = [7, 7, 3, 3, 3, 3]
n_out = 2
batch_size = 80
nb_epoch = 10

In [78]:
txt = ''
for doc in docs:
    for s in doc:
        txt += s
chars = set(txt)
vocab_size = len(chars)
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

total chars: 93


In [79]:
from keras.preprocessing.sequence import pad_sequences

def vectorize_sentences(data, char_indices):
    X = []
    for sentences in data:
        x = [char_indices[w] for w in sentences]
        x2 = np.eye(len(char_indices))[x]
        X.append(x2)
    return (pad_sequences(X, maxlen=maxlen))

news_data = vectorize_sentences(docs,char_indices)
news_data.shape
ydata = to_categorical(sentiments)



In [80]:
ydata  # Two columns because to_categorical one-hot encodes the labels (one column per class)

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])

In [81]:
Xtrain, Xtest, ytrain, ytest = train_test_split(news_data, ydata, test_size=0.3, random_state=0)

In [84]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import Conv1D, MaxPooling1D

# One-hot character input: (sequence length, alphabet size)
inputs = Input(shape=(maxlen, vocab_size), name='input', dtype='float32')

# Two wide convolutions, each followed by max pooling
conv = Conv1D(filters=nb_filter, kernel_size=filter_kernels[0],
              padding='valid', activation='relu')(inputs)
conv = MaxPooling1D(pool_size=3)(conv)

conv1 = Conv1D(filters=nb_filter, kernel_size=filter_kernels[1],
               padding='valid', activation='relu')(conv)
conv1 = MaxPooling1D(pool_size=3)(conv1)

# Four narrow convolutions, with pooling only after the last one
conv2 = Conv1D(filters=nb_filter, kernel_size=filter_kernels[2],
               padding='valid', activation='relu')(conv1)

conv3 = Conv1D(filters=nb_filter, kernel_size=filter_kernels[3],
               padding='valid', activation='relu')(conv2)

conv4 = Conv1D(filters=nb_filter, kernel_size=filter_kernels[4],
               padding='valid', activation='relu')(conv3)

conv5 = Conv1D(filters=nb_filter, kernel_size=filter_kernels[5],
               padding='valid', activation='relu')(conv4)
conv5 = MaxPooling1D(pool_size=3)(conv5)
conv5 = Flatten()(conv5)

# Two fully connected layers with dropout
z = Dropout(0.5)(Dense(dense_outputs, activation='relu')(conv5))
z = Dropout(0.5)(Dense(dense_outputs, activation='relu')(z))

pred = Dense(n_out, activation='sigmoid', name='output')(z)

model = Model(inputs=inputs, outputs=pred)

model.compile(loss='binary_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])



In [None]:
# model.fit(Xtrain, ytrain, batch_size=32,
#           epochs=1, validation_split=0.2, verbose=False)

model.fit(Xtrain, ytrain, batch_size=64,
           epochs=3)
cnn_scores = model.evaluate(Xtest, ytest, verbose=0)
print("CNN Accuracy: %.2f%%" % (cnn_scores[1]*100))

Epoch 1/3
Epoch 2/3

### CNN ACCURACY

In [None]:
cnn_scores = model.evaluate(Xtest, ytest, verbose=0)
print("CNN Accuracy: %.2f%%" % (cnn_scores[1]*100))

In [None]:
ypred = model.predict(Xtest)

ypred = (ypred > 0.5).astype(int)

In [None]:
cnn_precision_recall = precision_recall_fscore_support(ytest, ypred, average='micro')

cnn_precision_recall

In [None]:
print("CNN Precision score: %.2f%%" % (cnn_precision_recall[0]*100))

In [None]:
print("CNN Recall score: %.2f%%" % (cnn_precision_recall[1]*100))

In [None]:
print("CNN F score: %.2f%%" % (cnn_precision_recall[2]*100))

In [None]:
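# ytest is a one-hot NumPy array (it has no .to_frame), so recover class labels with argmax.
# The CNN split reused test_size=0.3 and random_state=0 on data of the same length,
# so y_test.index from the RNN split identifies the same test rows.
yt = pd.DataFrame({'index': y_test.index,
                   'true_label': ytest.argmax(axis=1)})
yt['pred'] = model.predict(Xtest).argmax(axis=1)
token_data = tokenized_sentences_x.to_frame(name=None)
token_data.reset_index(inplace = True)
token_data.columns= ['index','data']
ty = pd.merge(yt,token_data)

ty['Correct prediction'] = ty['true_label'] == ty['pred']
cnn_bad_classification = ty[ty['Correct prediction'] == False]
cnn_good_classification = ty[ty['Correct prediction'] == True]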


## Model 3: RNN+CNN

In [None]:
# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# Training
batch_size = 30
epochs = 2

In [None]:
model = Sequential()
model.add(Embedding(31889, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.25))

## CNN
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))

      
## RNN
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])




In [None]:
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=3,
          validation_data=(X_test, y_test))

####  RNN + CNN ACCURACY

In [None]:
cnn_rnn_scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN and RNN Combined Accuracy: %.2f%%" % (cnn_rnn_scores[1]*100))

In [None]:
y_pred = model.predict(X_test)

y_pred = (y_pred > 0.5).astype(int)

cnn_rnn_precision_recall = precision_recall_fscore_support(y_test, y_pred, average='micro')

cnn_rnn_precision_recall

In [None]:
print("CNN and RNN Combined Precision score: %.2f%%" % (cnn_rnn_precision_recall[0]*100))

In [None]:
print("CNN and RNN Combined Recall score: %.2f%%" % (cnn_rnn_precision_recall[1]*100))

In [None]:
print("CNN and RNN Combined Fscore: %.2f%%" % (cnn_rnn_precision_recall[2]*100))

In [None]:
yt = y_test
yt = yt.to_frame(name=None)
yt.reset_index(inplace = True)
yt.columns = ['index','true_label']
yt['pred'] = y_pred
token_data = tokenized_sentences_x.to_frame(name=None)
token_data.reset_index(inplace = True)
token_data.columns= ['index','data']
ty = pd.merge(yt,token_data)

ty['Correct prediction'] = ty['true_label'] == ty['pred']
cnn_rnn_bad_classification = ty[ty['Correct prediction'] == False]
cnn_rnn_good_classification = ty[ty['Correct prediction'] == True]    




### 4: Compare the performance of all of these models in a table (precision and recall)

In [None]:
#Compare the performance of all of these models in a table (precision and recall)

In [None]:
import pandas as pd

# Collect precision, recall and accuracy for the three models in one table
performance_df = {
    'Index': [1, 2, 3],
    'Model': ['RNN', 'CNN', 'CNN and RNN'],
    'Precision': [rnn_precision_recall[0], cnn_precision_recall[0], cnn_rnn_precision_recall[0]],
    'Recall': [rnn_precision_recall[1], cnn_precision_recall[1], cnn_rnn_precision_recall[1]],
    'Accuracy': [rnn_scores[1], cnn_scores[1], cnn_rnn_scores[1]],
}
df = pd.DataFrame(data=performance_df)
df.set_index('Index', inplace=True)
df

### 5: Look at your labeling and print out the underlying data compared to the labels - for each model print out 2-3 examples of a good classification and a bad classification. Make an assertion why your model does well or poorly on those outputs.

#### Good RNN Prediction

In [None]:
print('Four Good RNN prediction')

In [None]:
rnn_good_classification[:4]

#### BAD RNN Prediction

In [None]:
print('Four Bad RNN prediction')

In [None]:
rnn_bad_classification[:4]

#### Good CNN Prediction

In [None]:
print('Four Good CNN prediction')

In [None]:
cnn_good_classification[:4]

#### BAD CNN Prediction

In [None]:
print('Four Bad CNN prediction')

In [None]:
cnn_bad_classification[:4]

#### Good RNN and CNN Prediction

In [None]:
print('Four Good CNN+RNN prediction')

In [None]:
cnn_rnn_good_classification[:4]

#### BAD RNN and CNN Prediction

In [None]:
print('Four Bad CNN+RNN prediction')

In [None]:
cnn_rnn_bad_classification[:4]