# Sentiment Prediction Using Deep Learning - Artificial Neural Network

In this section, I create an Artificial Neural Network (ANN), then train and test it on a dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/kernels. The trained ANN is then applied to all 3 given datasets (CNBC, Reuters, and the Guardian) to evaluate whether each headline/preview is positive, neutral, or negative.

In [1]:
import pandas as pd
import numpy as np
from keras.utils import np_utils
from sentiment_module import tokenize_stem

df = pd.read_csv("../input/sentiment-analysis-for-financial-news/Labeled-headlines.csv", header = None, encoding='latin-1', names=["Sentiment", "Headlines"])
df['Sentiment'] = df['Sentiment'].replace("negative",0).replace("neutral",1).replace("positive",2)

corpus = []
for item in df['Headlines']:
    corpus.append(tokenize_stem(item))

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 0].values

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
print(X.shape)
print(y.shape)

(4846, 6679)
(4846,)


In [3]:
# One-hot encode the sentiment labels in y (3 classes)
y = np_utils.to_categorical(y, num_classes=3)
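
For reference, one-hot encoding turns each integer label into a row with a single 1 at the label's index. The NumPy sketch below reproduces that behaviour for three classes. `one_hot` is a hypothetical helper written only for illustration; it is not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1 at each label's index."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

# 0 = negative, 1 = neutral, 2 = positive (the encoding used above)
encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```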

In [4]:
# Split the data into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create an Artificial Neural Network

In [5]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(128, input_dim=(X_train.shape[1]), activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=20, batch_size=32)

Epoch 1/20
...
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f00c8e88850>

In [6]:
model.evaluate(x=X_test, y=y_test, batch_size=None, verbose=1, sample_weight=None)



[1.9163249731063843, 0.7371134161949158]
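
The two values returned are the test loss (categorical cross-entropy) followed by the accuracy metric requested in `compile`. A loss of about 1.92 is higher than ln 3 ≈ 1.10, the loss of a uniform guess over three classes, which suggests that some of the wrong predictions are made with high confidence. As a rough illustration of how the two numbers are derived, the sketch below computes both by hand on made-up toy data. The function names and the clipping constant `eps` are assumptions chosen for this example, not the Keras internals.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

def accuracy(y_true, y_prob):
    """Fraction of rows whose most probable class matches the target class."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

# Toy data (made up for illustration): three samples, three classes
y_true = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
y_prob = np.array([[0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.3, 0.6, 0.1]])

loss = categorical_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```

In this toy case the third sample is misclassified, so it lowers the accuracy and contributes the largest term to the loss.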

# Applying the model to generate sentiment predictions

## Import data

In [7]:
from part1_cleaning import get_clean_data
df1, df2, df3 = get_clean_data()

## CNBC Headlines and Previews

In [8]:
from sentiment_module import tokenize_stem

# Predicting Headlines
corpus_hl1 = []
for item in df1['Headlines']:
    corpus_hl1.append(tokenize_stem(item))
pred_hl1 = cv.transform(corpus_hl1).toarray()
y_pred_hl1 = model.predict(pred_hl1)

In [9]:
print(y_pred_hl1.shape)
print(y_pred_hl1[0:10])

(2790, 3)
[[1.4323562e-04 8.7205112e-02 9.1265172e-01]
 [1.8325380e-05 9.9936324e-01 6.1845948e-04]
 [4.3294109e-05 8.0374338e-02 9.1958237e-01]
 [6.8006479e-08 9.9846882e-01 1.5310686e-03]
 [6.7102340e-10 9.9999428e-01 5.7502625e-06]
 [2.4883882e-06 9.9959761e-01 3.9995305e-04]
 [8.3639682e-09 9.9997890e-01 2.1140811e-05]
 [3.7892234e-10 9.9999750e-01 2.5375555e-06]
 [2.0827121e-05 6.9974917e-01 3.0022994e-01]
 [1.4516988e-05 9.9840206e-01 1.5834352e-03]]


In [10]:
from sentiment_module import cluster_extraction

# Clustering Headlines
hl_sentiment = cluster_extraction(y_pred_hl1)
hl_sentiment[0:10]

[2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
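
The implementation of `cluster_extraction` is not shown in this section. The labels it returns for the first ten headlines match the index of the largest probability in each row, so a plausible equivalent, assuming it is a plain arg-max, is sketched below. `extract_labels` is a hypothetical name used only here.

```python
import numpy as np

def extract_labels(probabilities):
    """Pick the most probable class (0, 1 or 2) for every row."""
    return np.argmax(probabilities, axis=1).tolist()

# Rows 1, 2 and 9 of the headline predictions printed above
sample = np.array([
    [1.4323562e-04, 8.7205112e-02, 9.1265172e-01],
    [1.8325380e-05, 9.9936324e-01, 6.1845948e-04],
    [2.0827121e-05, 6.9974917e-01, 3.0022994e-01],
])
print(extract_labels(sample))  # [2, 1, 1]
```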

In [11]:
# Predicting Descriptions/Previews
corpus_ds1 = []
for item in df1['Description']:
    corpus_ds1.append(tokenize_stem(item))
pred_ds1 = cv.transform(corpus_ds1).toarray()
y_pred_ds1 = model.predict(pred_ds1)

In [12]:
print(y_pred_ds1.shape)
print(y_pred_ds1[0:10])

(2790, 3)
[[6.5378106e-07 7.6185656e-01 2.3814283e-01]
 [6.2313903e-04 8.8500679e-02 9.1087621e-01]
 [8.6409109e-08 9.9999499e-01 4.8940583e-06]
 [7.6581623e-12 1.0000000e+00 4.2624585e-08]
 [1.6219007e-04 2.5488646e-06 9.9983525e-01]
 [6.2313903e-04 8.8500679e-02 9.1087621e-01]
 [1.2388120e-06 8.9816451e-01 1.0183418e-01]
 [5.9796795e-03 9.6053543e-04 9.9305975e-01]
 [3.0668882e-05 1.5729654e-05 9.9995363e-01]
 [1.0490964e-01 8.5255809e-02 8.0983454e-01]]


In [13]:
# Clustering Descriptions/Previews
ds_sentiment = cluster_extraction(y_pred_ds1)
ds_sentiment[0:10]

[1, 2, 1, 1, 2, 2, 1, 2, 2, 2]

### Combining

Finally, to determine the sentiment of the article, I evaluate both the sentiment of the headline and the sentiment of the preview. Firstly, if at least one of the two (headline and preview) is positive and the other isn't negative, the article is labelled positive. Secondly, if both are neutral, or if one is negative and the other is positive, the article is labelled neutral. Thirdly, if at least one of the two is negative and the other isn't positive, the article is labelled negative.

In [14]:
from sentiment_module import combine_sentiments
ann_c_sentiment = combine_sentiments(hl_sentiment, ds_sentiment)
ann_c_sentiment[0:10]

[2, 2, 2, 1, 2, 2, 1, 2, 2, 2]
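
The combination rules described above can be sketched in plain Python as follows. This is only an illustration of the stated rules, not the actual source of `combine_sentiments`; the names `combine_pair` and `combine` are hypothetical.

```python
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2

def combine_pair(headline, preview):
    """Combine one headline label and one preview label into an article label."""
    pair = {headline, preview}
    if POSITIVE in pair and NEGATIVE not in pair:
        return POSITIVE
    if NEGATIVE in pair and POSITIVE not in pair:
        return NEGATIVE
    # Both neutral, or one positive and one negative
    return NEUTRAL

def combine(headlines, previews):
    """Apply combine_pair element-wise to two equally long label lists."""
    return [combine_pair(h, p) for h, p in zip(headlines, previews)]

# First ten CNBC headline and preview labels from the cells above
hl = [2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
ds = [1, 2, 1, 1, 2, 2, 1, 2, 2, 2]
print(combine(hl, ds))  # [2, 2, 2, 1, 2, 2, 1, 2, 2, 2]
```

For these ten articles the sketch gives the same labels as `combine_sentiments`.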

## Reuters Headlines and Previews

### Predicting

In [15]:
# Headlines
corpus_hl2 = []
for item in df2['Headlines']:
    corpus_hl2.append(tokenize_stem(item))
pred_hl2 = cv.transform(corpus_hl2).toarray()
y_pred_hl2 = model.predict(pred_hl2)
print(y_pred_hl2.shape)

(32673, 3)


In [16]:
print(y_pred_hl2.shape)
print(y_pred_hl2[0:10])

(32673, 3)
[[2.2078127e-10 9.9987781e-01 1.2223334e-04]
 [7.2164670e-02 1.1377629e-02 9.1645771e-01]
 [6.9285268e-03 7.5315787e-03 9.8553991e-01]
 [1.0199434e-09 9.9999917e-01 8.9203985e-07]
 [6.8162160e-05 1.7636581e-01 8.2356602e-01]
 [3.0626248e-11 9.9999988e-01 1.2705000e-07]
 [1.3832501e-04 9.9833518e-01 1.5265293e-03]
 [1.1676603e-07 9.9992001e-01 7.9819685e-05]
 [9.1137981e-06 9.9896622e-01 1.0246177e-03]
 [1.5637579e-05 1.6279037e-05 9.9996805e-01]]


In [17]:
# Clustering Headlines
hl_sentiment = cluster_extraction(y_pred_hl2)
hl_sentiment[0:10]

[1, 2, 2, 1, 2, 1, 1, 1, 1, 2]

In [18]:
# Descriptions/Previews
corpus_ds2 = []
for item in df2['Description']:
    corpus_ds2.append(tokenize_stem(item))
pred_ds2 = cv.transform(corpus_ds2).toarray()
y_pred_ds2 = model.predict(pred_ds2)
print(y_pred_ds2.shape)

(32673, 3)


In [19]:
print(y_pred_ds2.shape)
print(y_pred_ds2[0:10])

(32673, 3)
[[3.41031178e-21 1.00000000e+00 1.18194923e-13]
 [1.27763763e-18 1.00000000e+00 1.27154363e-13]
 [3.83013884e-07 9.99963284e-01 3.63821200e-05]
 [2.75795853e-09 9.99999404e-01 6.10373320e-07]
 [1.18721694e-15 1.00000000e+00 2.64517463e-10]
 [5.46280852e-08 9.99938488e-01 6.14668606e-05]
 [3.39692968e-10 9.99963284e-01 3.66930399e-05]
 [2.79700013e-11 1.00000000e+00 1.91119316e-08]
 [8.01147870e-10 9.94458556e-01 5.54147549e-03]
 [1.15398926e-07 5.33638662e-08 9.99999881e-01]]


In [20]:
# Clustering Descriptions/Previews
ds_sentiment = cluster_extraction(y_pred_ds2)
ds_sentiment[0:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

### Combining

As with the CNBC data, I evaluate each article's sentiment based on both the sentiment of its headline and the sentiment of its preview.

In [21]:
from sentiment_module import combine_sentiments
ann_r_sentiment = combine_sentiments(hl_sentiment, ds_sentiment)
ann_r_sentiment[0:10]

[1, 2, 2, 1, 2, 1, 1, 1, 1, 2]

## The Guardian Headlines and Previews

### Predicting

In [22]:
# Headlines
corpus_hl3 = []
for item in df3['Headlines']:
    corpus_hl3.append(tokenize_stem(item))
pred_hl3 = cv.transform(corpus_hl3).toarray()
y_pred_hl3 = model.predict(pred_hl3)
print(y_pred_hl3.shape)

(17795, 3)


In [23]:
print(y_pred_hl3.shape)
print(y_pred_hl3[0:10])

(17795, 3)
[[9.78029930e-05 9.97460723e-01 2.44133756e-03]
 [3.89612973e-01 4.95873332e-01 1.14513658e-01]
 [8.57681481e-10 9.99996901e-01 3.05865069e-06]
 [3.67214384e-11 9.99999881e-01 1.67136491e-07]
 [7.53350032e-04 1.66809097e-01 8.32437575e-01]
 [6.09501376e-06 9.99744356e-01 2.49548903e-04]
 [6.49304911e-06 9.99558628e-01 4.34971327e-04]
 [5.90442028e-07 9.99886870e-01 1.12537324e-04]
 [1.52127686e-04 9.95626688e-01 4.22113528e-03]
 [6.49312336e-04 9.94738162e-01 4.61243792e-03]]


In [24]:
# Clustering Headlines
hl_sentiment = cluster_extraction(y_pred_hl3)
hl_sentiment[0:10]

[1, 1, 1, 1, 2, 1, 1, 1, 1, 1]

In [25]:
# The Guardian's headline sentiment is the only variable that dictates the sentiment of the Guardian's articles
ann_g_sentiment = hl_sentiment