# News Article Relevance Classification (RNNs)

In this hands-on workshop, we'll practice classifying text using RNNs.

Input: roughly 8,000 news articles, each labeled for relevance to the US economy.

Task: Fit a model that classifies each article as relevant or not relevant to the US economy. This is a binary classification task.

Reference: https://github.com/msahamed/yelp_comments_classification_nlp

## Dataset

CSV: https://www.figure-eight.com/wp-content/uploads/2016/03/Full-Economic-News-DFE-839861.csv

Source: https://www.figure-eight.com/data-for-everyone/

Description:

>Contributors read snippets of news articles. They then noted if the article was relevant to the US economy and, if so, what the tone of the article was. Tone was judged on a 9 point scale (from 1 to 9, with 1 representing the most negativity). Dataset contains these judgments as well as the dates, source titles, and text. Dates range from 1951 to 2014.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../data/news-article/Full-Economic-News-DFE-839861.csv', encoding='latin1',
                usecols=['relevance', 'text'])
df.head()

| | relevance | text |
|---|---|---|
| 0 | yes | NEW YORK -- Yields on most certificates of dep... |
| 1 | no | The Wall Street Journal Online</br></br>The Mo... |
| 2 | no | WASHINGTON -- In an effort to achieve banking ... |
| 3 | no | The statistics on the enormous costs of employ... |
| 4 | yes | NEW YORK -- Indecision marked the dollar's ton... |


In [3]:
df.drop(df.loc[df.relevance=='not sure'].index, inplace=True)
df.relevance.unique()

array(['yes', 'no'], dtype=object)

In [4]:
# check the distribution of the relevant / irrelevant articles
df.groupby(['relevance']).size()

relevance
no     6571
yes    1420
dtype: int64
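
The counts above show that the classes are imbalanced: there are about 4.6 irrelevant articles for every relevant one. The sketch below uses those two counts to compute the accuracy of a trivial "always predict no" model, which is a useful floor when reading accuracy figures later. It also computes "balanced" class weights. The weighting formula is an assumption for illustration and is not something the notebook itself uses.

```python
# Class counts copied from the groupby output above.
counts = {'no': 6571, 'yes': 1420}
total = sum(counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# "Balanced" weights: n_samples / (n_classes * class_count).
# Keys follow the 1 = 'yes', 0 = 'no' label encoding used later in the notebook.
label_ids = {'no': 0, 'yes': 1}
class_weight = {label_ids[name]: total / (len(counts) * n)
                for name, n in counts.items()}

print(f'majority baseline accuracy: {majority_baseline:.4f}')
print('class weights:', {k: round(v, 3) for k, v in class_weight.items()})
```

A dictionary of this shape could be passed to Keras through the `class_weight` argument of `model.fit`, so that mistakes on the rarer "yes" class cost more during training.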

### Tokenization and Vectorization

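Like scikit-learn and spaCy, Keras ships text pre-processing utilities. Its Tokenizer can turn documents into TF-IDF vectors (texts_to_matrix) or into sequences of integer word ids (texts_to_sequences), which is the form an RNN consumes.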

https://keras.io/preprocessing/text/

https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/

In [5]:
from keras.preprocessing.text import Tokenizer

docs = ['I need to learn machine learning.',
        'I have to go to work every day.',
        'I rather go on holiday.',
        'Maybe I can train a model to do my job for me some day.']

# create the tokenizer
tokenizer = Tokenizer()

# fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)

# summarize what was learned
print('Word count', tokenizer.word_counts)
print('\nDocument count', tokenizer.document_count)
print('\nWord index', tokenizer.word_index)
print('\nDocument index', tokenizer.word_docs)

# Tf-idf encode documents
encoded_docs = tokenizer.texts_to_matrix(docs, mode='tfidf')
print(encoded_docs)

# Sequence encode documents
sequences = tokenizer.texts_to_sequences(docs)
print(sequences)

Using TensorFlow backend.


Word count OrderedDict([('i', 4), ('need', 1), ('to', 4), ('learn', 1), ('machine', 1), ('learning', 1), ('have', 1), ('go', 2), ('work', 1), ('every', 1), ('day', 2), ('rather', 1), ('on', 1), ('holiday', 1), ('maybe', 1), ('can', 1), ('train', 1), ('a', 1), ('model', 1), ('do', 1), ('my', 1), ('job', 1), ('for', 1), ('me', 1), ('some', 1)])

Document count 4

Word index {'i': 1, 'to': 2, 'go': 3, 'day': 4, 'need': 5, 'learn': 6, 'machine': 7, 'learning': 8, 'have': 9, 'work': 10, 'every': 11, 'rather': 12, 'on': 13, 'holiday': 14, 'maybe': 15, 'can': 16, 'train': 17, 'a': 18, 'model': 19, 'do': 20, 'my': 21, 'job': 22, 'for': 23, 'me': 24, 'some': 25}

Document index defaultdict(<class 'int'>, {'need': 1, 'learning': 1, 'i': 4, 'machine': 1, 'learn': 1, 'to': 3, 'work': 1, 'every': 1, 'day': 2, 'go': 2, 'have': 1, 'on': 1, 'rather': 1, 'holiday': 1, 'model': 1, 'do': 1, 'my': 1, 'can': 1, 'job': 1, 'me': 1, 'train': 1, 'for': 1, 'some': 1, 'maybe': 1, 'a': 1})
[[0.         0.58778666 ...
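
To make the tokenizer output above less of a black box, here is a plain-Python sketch that rebuilds the word index and the first TF-IDF weight for the same four toy documents. The punctuation filter and the TF-IDF formula (`tf = 1 + ln(count)`, `idf = ln(1 + n_docs / (1 + doc_freq))`) are assumptions about how the Keras Tokenizer behaves. They were chosen because they reproduce the word index and the `0.58778666` value printed above. The function names are hypothetical.

```python
import math
from collections import Counter

docs = ['I need to learn machine learning.',
        'I have to go to work every day.',
        'I rather go on holiday.',
        'Maybe I can train a model to do my job for me some day.']

# Assumed to match the default punctuation filter of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()

tokenized = [tokenize(d) for d in docs]

# word_counts: total occurrences; word_docs: number of documents containing the word.
word_counts = Counter(tok for doc in tokenized for tok in doc)
word_docs = Counter(tok for doc in tokenized for tok in set(doc))

# Index 1 goes to the most frequent word; ties keep first-seen order (stable sort).
ranked = sorted(word_counts, key=word_counts.get, reverse=True)
word_index = {word: i for i, word in enumerate(ranked, start=1)}

def tfidf(word, doc_tokens):
    """TF-IDF weight of `word` in one document, under the assumed formula."""
    count = doc_tokens.count(word)
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)
    idf = math.log(1 + len(docs) / (1 + word_docs[word]))
    return tf * idf

# Sequence encoding: each document becomes a list of word ids.
sequences = [[word_index[tok] for tok in doc] for doc in tokenized]

print(word_index)
print(sequences[0])
print(round(tfidf('i', tokenized[0]), 8))
```

Note that 'to' has a word count of 4 but a document count of 3, because it occurs twice in the second document. This is the difference between `word_counts` and `word_docs` in the output above.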

In [6]:
# LSTMs are trained on batches of equal-length sequences,
# so pad or truncate every article to sequence_length word ids

from keras.preprocessing.sequence import pad_sequences

sequence_length = 50

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['text'])

sequences = tokenizer.texts_to_sequences(df['text'])
X = pad_sequences(sequences, maxlen=sequence_length)

In [7]:
X

array([[  11,  328,   22, ...,   67, 1638, 3040],
       [  63, 1391,    1, ...,    7,  209, 7986],
       [  41,  486,    9, ...,    1,  223,  129],
       ...,
       [   8,   33, 1280, ...,    4,  661,  188],
       [  14,    1,  567, ...,  211,   40,  415],
       [   4,  725,   11, ...,  133, 1598,   63]])

In [8]:
X.shape

(7991, 50)
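
Every article now has exactly 50 ids, so longer articles were cut and shorter ones were filled with zeros. The sketch below mimics this in plain Python, assuming the `pad_sequences` defaults of padding and truncating at the front (`padding='pre'`, `truncating='pre'`). The helper name `pad_pre` is hypothetical.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pad and truncate at the front so every sequence has length maxlen."""
    padded = []
    for seq in sequences:
        kept = seq[max(len(seq) - maxlen, 0):]     # keep only the last maxlen ids
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy = [[1, 5, 2, 6, 7, 8], [3, 4]]
print(pad_pre(toy, maxlen=4))   # [[2, 6, 7, 8], [0, 0, 3, 4]]
```

Under these defaults the model only sees the last 50 tokens of each article. Passing `truncating='post'` to `pad_sequences` would keep the first 50 instead, which may matter for news text where the lead paragraph carries most of the topic.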

In [9]:
# word -> integer id dictionary learned from the articles (most frequent word = 1)
tokenizer.word_index

{'the': 1,
 'br': 2,
 'of': 3,
 'to': 4,
 'a': 5,
 'in': 6,
 'and': 7,
 'that': 8,
 'for': 9,
 'is': 10,
 'on': 11,
 'as': 12,
 'at': 13,
 'by': 14,
 'it': 15,
 'said': 16,
 'with': 17,
 'from': 18,
 'was': 19,
 'are': 20,
 'but': 21,
 'year': 22,
 'have': 23,
 'be': 24,
 'has': 25,
 'its': 26,
 'an': 27,
 'market': 28,
 'new': 29,
 'this': 30,
 'more': 31,
 's': 32,
 'will': 33,
 '1': 34,
 'u': 35,
 'or': 36,
 'than': 37,
 'stock': 38,
 'he': 39,
 'their': 40,
 'up': 41,
 'they': 42,
 'which': 43,
 'about': 44,
 'would': 45,
 'rates': 46,
 'percent': 47,
 'federal': 48,
 'economic': 49,
 'not': 50,
 'were': 51,
 'last': 52,
 'rate': 53,
 'interest': 54,
 'some': 55,
 'economy': 56,
 'prices': 57,
 '2': 58,
 'billion': 59,
 'been': 60,
 'one': 61,
 'inflation': 62,
 'million': 63,
 'his': 64,
 'had': 65,
 'after': 66,
 '5': 67,
 'other': 68,
 '3': 69,
 'when': 70,
 'who': 71,
 'first': 72,
 'york': 73,
 'years': 74,
 'investors': 75,
 'over': 76,
 '4': 77,
 'out': 78,
 'growth': 79,
 ...

In [10]:
# total number of words (vocabulary)
n_vocab = len(tokenizer.word_index)
print(n_vocab)

50726
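
The word index starts at 1 and padding uses 0, so the ids in `X` can take any value from 0 to `n_vocab` inclusive. That is `n_vocab + 1` distinct values, which matters for any lookup table indexed by these ids, such as an embedding matrix. The sketch below illustrates the off-by-one with plain lists and the 25-word toy vocabulary from earlier. `make_table` is a hypothetical stand-in, not a Keras function.

```python
n_vocab = 25                      # size of the toy vocabulary from the earlier example
ids = range(1, n_vocab + 1)       # ids handed out by a 1-based word index
PAD = 0                           # id used for padding

def make_table(rows, dim=3):
    """Stand-in for an embedding matrix: one zero vector per row."""
    return [[0.0] * dim for _ in range(rows)]

too_small = make_table(n_vocab)          # valid row numbers: 0 .. n_vocab - 1
big_enough = make_table(n_vocab + 1)     # valid row numbers: 0 .. n_vocab

try:
    too_small[max(ids)]                  # look up the largest word id
    fits = True
except IndexError:
    fits = False

print('n_vocab rows cover the largest id:', fits)
print('n_vocab + 1 rows cover it:', len(big_enough[max(ids)]) == 3)
```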


## Train

1. Train an LSTM classifier
2. Report precision, recall and F1 with scikit-learn's classification_report
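
With only about 18% of the articles labeled relevant, accuracy alone can look good while the "yes" class is mostly missed. That is why step 2 asks for per-class metrics. Here is a minimal sketch of what those metrics measure, written in plain Python so the arithmetic is visible. The labels and probabilities are made up, and `binary_report` is a hypothetical helper.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'accuracy': accuracy}

# Made-up labels and sigmoid outputs, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05]

# Threshold the probabilities at 0.5 to get hard class predictions.
y_pred = [int(p >= 0.5) for p in y_prob]

print(binary_report(y_true, y_pred))
```

In the notebook itself, the same thresholding would be applied to the output of `model.predict` on held-out articles, and `sklearn.metrics.classification_report(y_true, y_pred)` would tabulate these figures for both classes.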

In [11]:
y = df.relevance.map({'yes': 1, 'no': 0})

In [12]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.optimizers import Adam
from keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
import time

model = Sequential()
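# word ids run from 1 to n_vocab and 0 is the padding id, so the embedding needs n_vocab + 1 rows
model.add(Embedding(n_vocab + 1, 100, input_length=sequence_length)) # 100 = embedding size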
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])


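# log training curves for TensorBoard, one run directory per launch
tensorboard = TensorBoard(log_dir='./logs/news_article/%d' % time.time())
# stop once val_loss (the default monitor) has not improved for 2 epochs
earlystopping = EarlyStopping(patience=2)
# save the model whenever the training loss reaches a new minimum
checkpoint = ModelCheckpoint('lstm-news-article-{epoch:02d}-{loss:.4f}.hdf5',
                             monitor='loss', save_best_only=True, mode='min')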

model.fit(X, y, epochs=200, callbacks=[tensorboard, earlystopping, checkpoint],
          validation_split=.2)

Train on 6392 samples, validate on 1599 samples
Epoch 1/200
Epoch 2/200
Epoch 3/200


<keras.callbacks.History at 0x19f5a91b0b8>
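
Training was configured for 200 epochs but the log ends at epoch 3, which is what patience-based early stopping is meant to do. The sketch below simulates that rule as it is commonly understood: stop once the monitored loss has failed to improve for `patience` consecutive epochs. The loss values are hypothetical because the notebook output does not show the real ones, and `stopping_epoch` is an illustrative helper rather than a Keras function.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0       # improvement: remember it and reset the counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: best at epoch 1, then two worse epochs.
print(stopping_epoch([0.45, 0.47, 0.52, 0.60], patience=2))   # 3
# Hypothetical losses that keep improving: no early stop.
print(stopping_epoch([0.50, 0.45, 0.44, 0.43], patience=2))   # None
```

A stop at epoch 3 with `patience=2` would be consistent with the validation loss being lowest after the first epoch. This pattern often points to overfitting on a small dataset, though the missing metrics make it impossible to confirm here.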