# Sentiment Prediction Using Machine Learning - Model Selection

In part 4, I applied Natural Language Processing (NLP) models and libraries to evaluate the sentiment of each article's headline (for all 3 major news platforms) and preview (for CNBC and Reuters only).
In part 5, I apply Machine Learning and Deep Learning models to predict the sentiment of each headline and preview from its tokenized text. Throughout part 5, I use the dataset from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/kernels as the training and validation data.

Specifically, in part 5.1, I tokenize the texts from the dataset, encode their sentiments as categories (negative, neutral, positive), and apply popular Machine Learning models to see which one yields the best result. This result is saved to the final dataframe used in my conclusion.
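To make the tokenize-and-encode step concrete, here is a minimal sketch on three made-up headlines. The toy dataframe, the `label_map` dictionary, and the `*_toy` names are illustrative assumptions and are not part of the notebook; only the column names and the 0/1/2 encoding follow the cells below.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy headlines and labels, standing in for the Kaggle dataset.
toy = pd.DataFrame({
    "Sentiment": ["positive", "negative", "neutral"],
    "Headlines": [
        "profit rose sharply",
        "profit fell sharply",
        "company holds annual meeting",
    ],
})

# Map the three sentiment categories to integer class labels.
label_map = {"negative": 0, "neutral": 1, "positive": 2}
y_toy = toy["Sentiment"].map(label_map).to_numpy()

# Build a bag-of-words matrix: one row per headline, one column per term.
cv_toy = CountVectorizer()
X_toy = cv_toy.fit_transform(toy["Headlines"]).toarray()

# CountVectorizer sorts its vocabulary, so sorted terms follow column order.
vocab = sorted(cv_toy.vocabulary_)
print(vocab)
print(X_toy)
print(y_toy)
```

Each row of `X_toy` counts how often each vocabulary term occurs in one headline, which is the kind of feature matrix the models below are trained on.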

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from matplotlib.colors import ListedColormap

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# Load the labelled financial-news dataset (columns: sentiment label, headline text)
df = pd.read_csv("./data/dataset.csv", header=None, encoding='latin-1', names=["Sentiment", "Headlines"])
# Encode the three sentiment classes as integers
df['Sentiment'] = df['Sentiment'].map({"negative": 0, "neutral": 1, "positive": 2})
# Restore the preprocessed corpus saved earlier with %store
%store -r corpus
# Bag-of-words features: one column per vocabulary term
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, 0].values

In [3]:
y

array([1, 1, 0, ..., 0, 0, 0], dtype=int64)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [5]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
logistic_c = LogisticRegression()
logistic_c.fit(X_train, y_train)
y_pred = logistic_c.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[2 2]
 [2 1]
 [1 1]
 ...
 [0 0]
 [1 0]
 [1 1]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
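
The warning above means the solver stopped at its iteration limit before converging, and it suggests raising `max_iter`. A minimal sketch of that suggestion follows, assuming synthetic data from `make_classification` as a stand-in for the headline features. The helper `converged` and the value 1000 are illustrative choices, not settings taken from the notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data, used only as a stand-in for the headline features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_demo = StandardScaler().fit_transform(X_demo)


def converged(max_iter):
    """Fit a logistic regression and report whether the solver converged."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression(max_iter=max_iter).fit(X_demo, y_demo)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)


print("max_iter=1    converged:", converged(1))
print("max_iter=1000 converged:", converged(1000))
```

On the real bag-of-words matrix a larger limit may still be slow, so a different solver, as the warning also suggests, is another option to try.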


In [7]:
print(confusion_matrix(y_test, y_pred))

[[ 71  41  16]
 [ 28 464  83]
 [ 24  99 144]]


In [8]:
accuracy_score(y_test, y_pred)

0.7
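
The accuracy can be checked by hand from the confusion matrix printed above: the correct predictions sit on the diagonal. The sketch below redoes that arithmetic and also derives per-class recall, which is an extra metric not computed in the notebook. Rows are true classes and columns are predicted classes, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix from the logistic regression cell
# (rows: true class, columns: predicted class;
#  0 = negative, 1 = neutral, 2 = positive).
cm = np.array([
    [71, 41, 16],
    [28, 464, 83],
    [24, 99, 144],
])

# Accuracy is the share of test samples on the diagonal: 679 / 970.
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the true count of each class.
recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall.round(3))
```

The recall values show that the neutral class is recognized far more reliably than the negative and positive classes, which the single accuracy figure hides.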

### K-Nearest Neighbors (5 neighbors)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_c = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn_c.fit(X_train, y_train)
y_pred = knn_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### SVM

In [None]:
from sklearn.svm import SVC
svm_c = SVC(kernel = 'linear', random_state = 0)
svm_c.fit(X_train, y_train)
y_pred = svm_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### Kernel SVM (Gaussian RBF)

In [None]:
from sklearn.svm import SVC
rbf_c = SVC(kernel = 'rbf', random_state = 0)
rbf_c.fit(X_train, y_train)
y_pred = rbf_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
bayes_c = GaussianNB()
bayes_c.fit(X_train, y_train)
y_pred = bayes_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### Classification Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree_c = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
tree_c.fit(X_train, y_train)
y_pred = tree_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### Classification Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest_c = RandomForestClassifier(n_estimators = 10, random_state = 0)
forest_c.fit(X_train, y_train)
y_pred = forest_c.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
accuracy_score(y_test, y_pred)

### XGBoost

In [None]:
import xgboost as xgb
classifier = xgb.XGBClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [None]:
confusion_matrix(y_test, y_pred)

In [None]:
accuracy_score(y_test, y_pred)

After applying these Machine Learning models, I conclude that the XGBoost classifier yields the best result. In the next part, I will use this model to predict the sentiment of the financial news headlines I collected.
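
A comparison like this can also be run in one loop, which keeps the scores side by side. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification`, only a subset of the models above, and `max_iter=1000` for the logistic regression. The names `candidates`, `scores`, and `best_name` are hypothetical, and the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data as a stand-in for the bag-of-words features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A subset of the models tried above; others can be added to the dict.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
    "Classification Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Report the models from best to worst and name the top scorer.
best_name = max(scores, key=scores.get)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
print("Best on this toy data:", best_name)
```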

In [None]:
xgb_classifier = classifier
%store xgb_classifier
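
`%store` only works inside IPython. If the model needs to be loaded from a plain Python script, one possible alternative is to write it to disk with `joblib`. The sketch below assumes a scikit-learn decision tree as a stand-in for `xgb_classifier` and a hypothetical file name `model.joblib` in a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in model and data; in the notebook this would be xgb_classifier.
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# Reload the model and check that it predicts the same labels.
restored = joblib.load(path)
same = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions match after reload:", same)
```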