# Sentiment Prediction Using Machine Learning - Model Selection

In part 4, I applied Natural Language Processing (NLP) models and libraries to evaluate the sentiment of each article's headline (for all 3 major news platforms) and preview (for CNBC and Reuters only).
In part 5, I apply Machine Learning and Deep Learning models to predict the sentiment of each headline and preview from its tokenized form. Throughout part 5, I use the data collected from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/kernels as the training and validation data.

Specifically, in part 5.1, I tokenize the texts from the dataset, encode their sentiments (negative, neutral, positive) as categories, and apply popular Machine Learning models to see which one yields the best result. This result is saved to the final dataframe for my conclusion.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

In [2]:
import sys
sys.path.insert(0, './lib')  # make the local sentiment_module importable
from sentiment_module import tokenize_stem

df = pd.read_csv("./data/dataset.csv", header = None, encoding='latin-1', names=["Sentiment", "Headlines"])

# tokenize and stem every headline
corpus = [tokenize_stem(item) for item in df['Headlines']]
%store corpus

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 0].values
%store cv



Stored 'corpus' (list)
Stored 'cv' (CountVectorizer)
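
To make the bag-of-words step concrete, here is a minimal sketch of what `CountVectorizer` produces on a toy corpus. The three strings are made-up stand-ins for stemmed headlines, not rows from the dataset, and the exact output format of `tokenize_stem` is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stemmed headlines (illustrative only, not from the dataset)
toy_corpus = ["profit rise", "profit fall sharpli", "sale rise"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# Vocabulary ordered by column index (columns are sorted alphabetically)
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_X)
```

Each row is one text and each column counts one vocabulary word, so the feature matrix grows as wide as the vocabulary.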


In [3]:
# encode the sentiment labels in y as integer class codes
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
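
`LabelEncoder` assigns integer codes in sorted order of the class names. Assuming the labels in the dataset are the strings `negative`, `neutral`, and `positive` (as listed in the introduction), the sketch below shows the resulting mapping, which is what the rows and columns of the confusion matrices in this section refer to.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings; the real column may use different spellings
toy_labels = ["neutral", "positive", "negative", "neutral"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# classes_ holds the original names, indexed by their integer code
mapping = {name: int(code) for code, name in enumerate(toy_le.classes_)}
print(mapping)
print(toy_codes)
```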

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [5]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
logistic_c = LogisticRegression()
logistic_c.fit(X_train, y_train)
y_pred = logistic_c.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[2 2]
 [2 1]
 [1 1]
 ...
 [0 0]
 [1 0]
 [1 1]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
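
The warning above means the default solver stopped at its iteration cap before converging. One way to address it, following the warning's own suggestion, is to raise `max_iter`. The sketch below uses synthetic data from `make_classification` as a stand-in for the headline features; the value 1000 is an arbitrary choice, and refitting on the real data may change the scores reported in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the bag-of-words features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)

# Raise the iteration cap (default is 100); 1000 is an arbitrary choice
demo_clf = LogisticRegression(max_iter=1000)
demo_clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(demo_clf.n_iter_)
```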


In [7]:
print(confusion_matrix(y_test, y_pred))

[[ 71  41  16]
 [ 28 464  83]
 [ 24  99 144]]


In [8]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import f1_score
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.87250961]
 [0.87250961 1.        ]]

0.7
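
For single-label multiclass problems like this one, the micro-averaged F1 score equals plain accuracy, so it can be read straight off the confusion matrix as the diagonal sum divided by the total. The sketch below checks this against the logistic regression matrix printed above and also derives per-class recall, which the single score hides. The class order (negative, neutral, positive) is an assumption based on `LabelEncoder` sorting the label names.

```python
import numpy as np

# Confusion matrix reported above for logistic regression
# (rows = true class, columns = predicted class)
cm = np.array([[71, 41, 16],
               [28, 464, 83],
               [24, 99, 144]])

# Micro-averaged F1 equals accuracy: correct predictions / all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by its row total
recall = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
print(np.round(recall, 3))
```

The same calculation applies to every confusion matrix in this section, which makes it easy to see which class a model handles worst.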


### K-Nearest Neighbors (5 neighbors)

In [9]:
from sklearn.neighbors import KNeighborsClassifier
knn_c = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_c.fit(X_train, y_train)
y_pred = knn_c.predict(X_test)

In [10]:
print(confusion_matrix(y_test, y_pred))

[[ 60  30  38]
 [ 26 255 294]
 [ 19  83 165]]


In [11]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.84830087]
 [0.84830087 1.        ]]

0.4948453608247423


### SVM

In [12]:
from sklearn.svm import SVC
svm_c = SVC(kernel = 'linear', random_state = 0)
svm_c.fit(X_train, y_train)
y_pred = svm_c.predict(X_test)

In [13]:
print(confusion_matrix(y_test, y_pred))

[[ 70  41  17]
 [ 73 429  73]
 [ 44 107 116]]


In [14]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.82587291]
 [0.82587291 1.        ]]

0.634020618556701


### Kernel SVM (Gaussian RBF)

In [15]:
from sklearn.svm import SVC
rbf_c = SVC(kernel = 'rbf', random_state = 0)
rbf_c.fit(X_train, y_train)
y_pred = rbf_c.predict(X_test)

In [16]:
print(confusion_matrix(y_test, y_pred))

[[ 31  77  20]
 [  2 569   4]
 [  0 182  85]]


In [17]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.88890799]
 [0.88890799 1.        ]]

0.7061855670103093


### Gaussian Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB
bayes_c = GaussianNB()
bayes_c.fit(X_train, y_train)
y_pred = bayes_c.predict(X_test)

In [19]:
print(confusion_matrix(y_test, y_pred))

[[ 70  24  34]
 [ 81 296 198]
 [ 68  78 121]]


In [20]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.77240692]
 [0.77240692 1.        ]]

0.5020618556701031


### Classification Tree

In [21]:
from sklearn.tree import DecisionTreeClassifier
tree_c = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
tree_c.fit(X_train, y_train)
y_pred = tree_c.predict(X_test)

In [22]:
print(confusion_matrix(y_test, y_pred))

[[ 59  49  20]
 [ 20 473  82]
 [ 22 101 144]]


In [23]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.87082266]
 [0.87082266 1.        ]]

0.6969072164948453


### Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier
forest_c = RandomForestClassifier(n_estimators = 10, random_state = 0)
forest_c.fit(X_train, y_train)
y_pred = forest_c.predict(X_test)

In [25]:
print(confusion_matrix(y_test, y_pred))

[[ 44  63  21]
 [  7 537  31]
 [ 12 146 109]]


In [26]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.87810154]
 [0.87810154 1.        ]]

0.711340206185567


### XGBoost

In [27]:
import xgboost as xgb
classifier = xgb.XGBClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [28]:
confusion_matrix(y_test, y_pred)

array([[ 68,  49,  11],
       [  6, 539,  30],
       [ 10, 116, 141]], dtype=int64)

In [29]:
print(cosine_similarity([y_test, y_pred]))
print()
print(f1_score(y_test, y_pred, zero_division=1, average='micro'))

[[1.         0.90946066]
 [0.90946066 1.        ]]

0.7711340206185567


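After applying these Machine Learning models, I conclude that the XGBoost classifier yields the best result, with a micro-averaged F1 score of about 0.77 against about 0.71 for the next best model (the random forest). In the next part, I will use this model to predict the sentiment of the financial news headlines I collected.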
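
The scores reported in this section can be gathered into one table for comparison. The values below are copied from the outputs above (rounded to four decimals); the dictionary and variable names are only illustrative.

```python
import pandas as pd

# Micro-averaged F1 scores as printed in the cells above
scores = {
    "Logistic Regression": 0.7000,
    "K-Nearest Neighbors": 0.4948,
    "SVM (linear)": 0.6340,
    "Kernel SVM (RBF)": 0.7062,
    "Gaussian Naive Bayes": 0.5021,
    "Classification Tree": 0.6969,
    "Random Forest": 0.7113,
    "XGBoost": 0.7711,
}

# Sort from best to worst so the ranking is visible at a glance
summary = (
    pd.Series(scores, name="micro_f1")
    .sort_values(ascending=False)
    .to_frame()
)
print(summary)
```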

In [31]:
xgb_classifier = classifier
%store xgb_classifier

Stored 'xgb_classifier' (XGBClassifier)
