# Sentiment Prediction Using Deep Learning - Artificial Neural Network

In this section, I create an Artificial Neural Network (ANN), then train and test it on a dataset retrieved from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/kernels. The trained ANN is then applied to all three datasets (CNBC, Reuters, and the Guardian) to classify each headline/preview as positive, neutral, or negative.

In [1]:
import sys
sys.path.insert(0, './lib')  # make the local sentiment_module importable
import pandas as pd
import numpy as np
from keras.utils import np_utils
from sklearn.feature_extraction.text import CountVectorizer
from sentiment_module import tokenize_stem

# Load the labelled dataset; the CSV has no header row
df = pd.read_csv("./data/dataset.csv", header=None, encoding='latin-1', names=["Sentiment", "Headlines"])
# Encode the labels as integers: negative=0, neutral=1, positive=2
df['Sentiment'] = df['Sentiment'].replace({"negative": 0, "neutral": 1, "positive": 2})

# Tokenize and stem every headline
corpus = [tokenize_stem(item) for item in df['Headlines']]

# Build the bag-of-words matrix; cv is stored so the same vocabulary can be reused later
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 0].values
%store cv

[nltk_data] Downloading package stopwords to C:\Users\Long's
[nltk_data]     XPS13\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stored 'cv' (CountVectorizer)


In [2]:
print(X.shape)
print(y.shape)

(4846, 6679)
(4846,)
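
The shape above means one row per headline and one column per vocabulary term. As a minimal sketch of what `CountVectorizer` is doing here, the toy example below (made-up sentences, not the dataset, with hypothetical `toy_*` names) fits a vocabulary on two documents and then transforms an unseen one. Words that were not in the fitted vocabulary are simply ignored by `transform`, which is also what happens when `cv.transform` is applied to the CNBC, Reuters, and Guardian text in the cells further down.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, for illustration only (not taken from the dataset)
toy_train_docs = ["profit rose sharply", "profit fell"]
toy_new_docs = ["profit rose after merger"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and counts terms in the training documents
toy_train = toy_vectorizer.fit_transform(toy_train_docs).toarray()
# transform reuses the learned vocabulary; unseen words ("after", "merger") are dropped
toy_new = toy_vectorizer.transform(toy_new_docs).toarray()

# Columns are ordered alphabetically by term
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_train)
print(toy_new)
```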


In [3]:
# One-hot encode y (3 classes) so it matches the 3-unit softmax output layer
y = np_utils.to_categorical(y, num_classes=3)
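
To make the effect of this step concrete, here is a small NumPy sketch of one-hot encoding. The `one_hot` helper is hypothetical and written only for illustration; it is not the Keras implementation, but it should produce the same kind of array for integer labels 0, 1, and 2.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

# negative=0, neutral=1, positive=2
toy_encoded = one_hot([0, 1, 2, 1], num_classes=3)
print(toy_encoded)
```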

In [4]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create an Artificial Neural Network

In [5]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Fully connected network: bag-of-words input -> five ReLU hidden layers -> 3-way softmax
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3, activation='softmax'))

# Targets are one-hot encoded, so use categorical cross-entropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train for 10 epochs with mini-batches of 32 samples
model.fit(X_train, y_train, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x19f4f330f10>

In [6]:
# Evaluate on the held-out test set; returns [loss, accuracy]
model.evaluate(X_test, y_test, verbose=1)



[1.84946608543396, 0.7329896688461304]
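
The two numbers are the test loss followed by the test accuracy. Accuracy alone does not show which classes get confused with each other, so one option is to build a confusion matrix from the predicted probabilities. The sketch below uses made-up toy arrays and a hypothetical helper name; in the notebook, the inputs would presumably be `y_test` and `model.predict(X_test)`.

```python
import numpy as np

def confusion_matrix_from_probs(y_true_onehot, y_prob, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    true_labels = np.argmax(y_true_onehot, axis=1)
    pred_labels = np.argmax(y_prob, axis=1)
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # Count each (true, predicted) pair, including repeated pairs
    np.add.at(matrix, (true_labels, pred_labels), 1)
    return matrix

# Toy one-hot labels and softmax-like outputs (illustrative values only)
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]])
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.2, 0.3, 0.5],
                     [0.1, 0.2, 0.7]])

toy_cm = confusion_matrix_from_probs(toy_true, toy_prob)
# Accuracy is the share of samples on the diagonal
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_cm)
print(toy_accuracy)
```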

# Applying the model to generate sentiment predictions

## CNBC Headlines and Previews

In [7]:
%store -r df1
from sentiment_module import tokenize_stem

# Predicting Headlines
corpus_hl1 = []
for item in df1['Headlines']:
    corpus_hl1.append(tokenize_stem(item))
pred_hl1 = cv.transform(corpus_hl1).toarray()
y_pred_hl1 = model.predict(pred_hl1)

In [8]:
print(y_pred_hl1.shape)
print(y_pred_hl1[0:10])

(2790, 3)
[[3.8671424e-04 3.9488778e-01 6.0472560e-01]
 [4.8199145e-04 9.9943560e-01 8.2302082e-05]
 [7.4745796e-05 3.3978444e-02 9.6594685e-01]
 [3.3359075e-07 9.9015737e-01 9.8423082e-03]
 [4.3858744e-08 9.9999571e-01 4.3189279e-06]
 [1.9821286e-04 9.9975377e-01 4.7994236e-05]
 [1.8200850e-05 9.9998021e-01 1.6045101e-06]
 [4.8642636e-07 9.9972171e-01 2.7774836e-04]
 [3.8362483e-05 9.0547568e-01 9.4486080e-02]
 [1.2204961e-05 9.9850696e-01 1.4807507e-03]]


In [9]:
from sentiment_module import cluster_extraction

# Convert headline class probabilities to sentiment labels
hl_sentiment = cluster_extraction(y_pred_hl1)
hl_sentiment[0:10]

[2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
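
The source of `cluster_extraction` is not shown here, but the labels above are consistent with picking the class with the highest predicted probability in each row. A minimal sketch under that assumption (the `extract_labels` name is hypothetical), using the first three prediction rows printed earlier:

```python
import numpy as np

def extract_labels(probabilities):
    """Assumed behaviour: return the index of the largest probability per row."""
    return np.argmax(probabilities, axis=1).tolist()

# First three rows of y_pred_hl1 as printed above
toy_rows = np.array([
    [3.8671424e-04, 3.9488778e-01, 6.0472560e-01],
    [4.8199145e-04, 9.9943560e-01, 8.2302082e-05],
    [7.4745796e-05, 3.3978444e-02, 9.6594685e-01],
])

toy_labels = extract_labels(toy_rows)
print(toy_labels)  # 0 = negative, 1 = neutral, 2 = positive
```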

In [10]:
# Predicting Descriptions/Previews
corpus_ds1 = []
for item in df1['Description']:
    corpus_ds1.append(tokenize_stem(item))
pred_ds1 = cv.transform(corpus_ds1).toarray()
y_pred_ds1 = model.predict(pred_ds1)

In [11]:
print(y_pred_ds1.shape)
print(y_pred_ds1[0:10])

(2790, 3)
[[6.30989305e-08 5.67937076e-01 4.32062894e-01]
 [1.23058697e-02 5.38669825e-02 9.33827221e-01]
 [4.88903834e-06 9.99994993e-01 1.11769445e-07]
 [3.26736971e-09 9.99999404e-01 6.40586848e-07]
 [5.72288409e-05 4.66213278e-05 9.99896169e-01]
 [1.23058697e-02 5.38669825e-02 9.33827221e-01]
 [8.25300006e-08 9.63330388e-01 3.66694406e-02]
 [3.28034103e-01 4.71051876e-03 6.67255402e-01]
 [1.51691688e-02 3.85451800e-04 9.84445393e-01]
 [1.96458935e-03 5.76814055e-01 4.21221316e-01]]


In [12]:
# Convert description/preview class probabilities to sentiment labels
ds_sentiment = cluster_extraction(y_pred_ds1)
ds_sentiment[0:10]

[1, 2, 1, 1, 2, 2, 1, 2, 2, 1]

### Combining

Finally, to determine the sentiment of each article, I combine the sentiment of the headline with the sentiment of the preview. First, if at least one of the two (headline and preview) is positive and the other is not negative, the article is labelled positive. Second, if both are neutral, or if one is positive and the other is negative, the article is labelled neutral. Third, if at least one of the two is negative and the other is not positive, the article is labelled negative.
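
The three rules can be sketched as follows. This is not the code of `combine_sentiments` from `sentiment_module` (which is not shown); `combine_sentiments_sketch` is a hypothetical stand-in that assumes the labels 0 = negative, 1 = neutral, 2 = positive used throughout.

```python
def combine_sentiments_sketch(headline_labels, preview_labels):
    """Combine per-headline and per-preview labels into one label per article."""
    combined = []
    for headline, preview in zip(headline_labels, preview_labels):
        # Shift 0/1/2 to -1/0/+1 so that opposite sentiments cancel out
        score = (headline - 1) + (preview - 1)
        if score > 0:
            combined.append(2)   # positive
        elif score < 0:
            combined.append(0)   # negative
        else:
            combined.append(1)   # both neutral, or one positive and one negative
    return combined

# First ten CNBC labels from the cells above
toy_headlines = [2, 1, 2, 1, 1, 1, 1, 1, 1, 1]
toy_previews = [1, 2, 1, 1, 2, 2, 1, 2, 2, 1]
toy_combined = combine_sentiments_sketch(toy_headlines, toy_previews)
print(toy_combined)
```

Shifting the labels to -1/0/+1 keeps the logic to a single sum, at the cost of treating a headline and a preview as equally important.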

In [13]:
from sentiment_module import combine_sentiments
ann_c_sentiment = combine_sentiments(hl_sentiment, ds_sentiment)
ann_c_sentiment[0:10]

[2, 2, 2, 1, 2, 2, 1, 2, 2, 1]

In [14]:
# storing data for the result dataframe
%store -r final_df1
final_df1['ann_sentiment'] = ann_c_sentiment
%store final_df1

Stored 'final_df1' (DataFrame)


## Reuters Headlines and Previews

### Predicting

In [15]:
%store -r df2

# Headlines
corpus_hl2 = []
for item in df2['Headlines']:
    corpus_hl2.append(tokenize_stem(item))
pred_hl2 = cv.transform(corpus_hl2).toarray()
y_pred_hl2 = model.predict(pred_hl2)
print(y_pred_hl2.shape)

(32696, 3)


In [16]:
print(y_pred_hl2.shape)
print(y_pred_hl2[0:10])

(32696, 3)
[[2.7165914e-08 9.9998093e-01 1.9104598e-05]
 [6.4099318e-01 9.9399149e-02 2.5960761e-01]
 [3.5100028e-02 6.5218367e-02 8.9968163e-01]
 [8.0143036e-07 9.9999845e-01 6.7549411e-07]
 [1.4344887e-05 1.2273005e-01 8.7725562e-01]
 [1.8500049e-08 1.0000000e+00 4.8505182e-09]
 [2.2652510e-03 9.9744177e-01 2.9303078e-04]
 [1.7349186e-05 9.9995959e-01 2.3010365e-05]
 [2.9326597e-05 9.9978524e-01 1.8542174e-04]
 [1.5160785e-05 2.4653543e-04 9.9973828e-01]]


In [17]:
# Convert headline class probabilities to sentiment labels
hl_sentiment = cluster_extraction(y_pred_hl2)
hl_sentiment[0:10]

[1, 0, 2, 1, 2, 1, 1, 1, 1, 2]

In [18]:
# Descriptions/Previews
corpus_ds2 = []
for item in df2['Description']:
    corpus_ds2.append(tokenize_stem(item))
pred_ds2 = cv.transform(corpus_ds2).toarray()
y_pred_ds2 = model.predict(pred_ds2)
print(y_pred_ds2.shape)

(32696, 3)


In [19]:
print(y_pred_ds2.shape)
print(y_pred_ds2[0:10])

(32696, 3)
[[4.33482796e-16 1.00000000e+00 3.37171087e-14]
 [5.73501695e-13 1.00000000e+00 9.13492338e-15]
 [4.10903510e-07 9.99995232e-01 4.46318109e-06]
 [6.18947752e-07 9.99999285e-01 9.34166664e-08]
 [1.88232719e-12 1.00000000e+00 1.06565715e-10]
 [9.87342048e-08 9.99984860e-01 1.49945427e-05]
 [5.99503096e-08 9.99999523e-01 3.27153913e-07]
 [7.49539320e-10 1.00000000e+00 8.05277893e-12]
 [3.04293146e-10 9.99999046e-01 9.35450657e-07]
 [1.04396491e-07 2.91091908e-07 9.99999642e-01]]


In [20]:
# Convert description/preview class probabilities to sentiment labels
ds_sentiment = cluster_extraction(y_pred_ds2)
ds_sentiment[0:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

### Combining

As with the CNBC data, I evaluate each article's sentiment from both the sentiment of its headline and the sentiment of its preview.

In [21]:
from sentiment_module import combine_sentiments
ann_r_sentiment = combine_sentiments(hl_sentiment, ds_sentiment)
ann_r_sentiment[0:10]

[1, 0, 2, 1, 2, 1, 1, 1, 1, 2]

In [22]:
# storing data for the result dataframe
%store -r final_df2
final_df2['ann_sentiment'] = ann_r_sentiment
%store final_df2

Stored 'final_df2' (DataFrame)


## The Guardian Headlines and Previews

### Predicting

In [23]:
%store -r df3

# Headlines
corpus_hl3 = []
for item in df3['Headlines']:
    corpus_hl3.append(tokenize_stem(item))
pred_hl3 = cv.transform(corpus_hl3).toarray()
y_pred_hl3 = model.predict(pred_hl3)
print(y_pred_hl3.shape)

(17794, 3)


In [24]:
print(y_pred_hl3.shape)
print(y_pred_hl3[0:10])

(17794, 3)
[[3.5169342e-04 9.9960142e-01 4.6925092e-05]
 [4.3137833e-03 9.9530089e-01 3.8536280e-04]
 [9.2211485e-06 9.9644655e-01 3.5441997e-03]
 [1.3374630e-03 9.9777168e-01 8.9078065e-04]
 [1.3539822e-01 7.9606497e-01 6.8536855e-02]
 [5.2598962e-06 9.9999428e-01 4.8217618e-07]
 [2.3469028e-08 9.9999881e-01 1.2345121e-06]
 [1.5334741e-03 2.7048718e-03 9.9576169e-01]
 [3.1593016e-05 9.9988854e-01 7.9831792e-05]
 [3.5610481e-04 9.9943393e-01 2.0995163e-04]]


In [25]:
# Convert headline class probabilities to sentiment labels
hl_sentiment = cluster_extraction(y_pred_hl3)
hl_sentiment[0:10]

[1, 1, 1, 1, 1, 1, 1, 2, 1, 1]

In [26]:
# For the Guardian, the headline sentiment alone determines the article sentiment
ann_g_sentiment = hl_sentiment

In [27]:
# storing data for the result dataframe
%store -r final_df3
final_df3['ann_sentiment'] = ann_g_sentiment
%store final_df3

Stored 'final_df3' (DataFrame)
