# Sentiment Classifier

This is a sentiment classifier built with a Random Forest. The datasets are pulled from Kaggle:
__link_here__


First, import the libraries needed for machine learning.

In [27]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV


## Data Exploration and Cleansing

A few additional imports help with cleaning the input data. The input is text, and it needs to undergo some cleansing to reach a more standardized form before it is given to the model.

In [3]:
import re #regular expressions
import string
import nltk #Natural Language Toolkit
from nltk.stem import WordNetLemmatizer

In [4]:
#input_data = pd.read_csv(r'C:\Users\Patrick\Documents\GitHub\bootcamp_capstone\kaggle_dataset\sentiment_analysis_financial_news\all-data.csv'
#                , encoding = "ISO-8859-1", header=None, names=['sentiment', 'text'])


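# header=1 makes pandas treat the second line of the file as the header row
# (its labels are replaced by `names`); any line above it is skipped.
input_data = pd.read_csv(r'C:\Users\Patrick\Documents\GitHub\bootcamp_capstone\kaggle_dataset\stock-market_sentiment\stock_data.csv',
                        encoding="ISO-8859-1", header=1, names=['text', 'sentiment'] )

input_data.head(10)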

Unnamed: 0,text,sentiment
0,user: AAP MOVIE. 55% return for the FEA/GEED i...,1
1,user I'd be afraid to short AMZN - they are lo...,1
2,MNTA Over 12.00,1
3,OI Over 21.37,1
4,PGNX Over 3.04,1
5,AAP - user if so then the current downtrend wi...,-1
6,Monday's relative weakness. NYX WIN TIE TAP IC...,-1
7,GOOG - ower trend line channel test & volume s...,1
8,AAP will watch tomorrow for ONG entry.,1
9,i'm assuming FCX opens tomorrow above the 34.2...,1


A quick preview shows that the data lacks a consistent form. Some rows have special characters while others have none, and every line is a different length.
The cleanup begins by removing special characters and converting everything to lower case. Then each word is lemmatized, which reduces it to its dictionary base form (for example, "opens" becomes "open") to get a level of consistency.

Stopwords are also removed. These are common function words like "a", "the", "and", and "this". They don't add much information to a sentence, but exist because of grammar rules for human readability. They aren't necessary for the ML model to extract the sentiment. The stopword list used here is NLTK's English list.

In [5]:
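nltk.download("stopwords")
# WordNetLemmatizer needs the WordNet corpus; if it is missing, run nltk.download("wordnet")
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests

lemmatizer = WordNetLemmatizer()

# keep letters, digits, whitespace and '%'; everything else becomes a space
pattern = r'[^a-zA-Z0-9\s\%]'
cleaned_buffer = []
for x in input_data['text']:
    temp = re.sub(pattern, " ", x)
    temp = temp.lower()
    temp = temp.split()
    temp = [lemmatizer.lemmatize(word) for word in temp if word not in stopwords]
    temp = ' '.join(temp)
    cleaned_buffer.append(temp)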
    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Patrick\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
input_data['cleaned'] = cleaned_buffer
input_data.head(10)

Unnamed: 0,text,sentiment,cleaned
0,user: AAP MOVIE. 55% return for the FEA/GEED i...,1,user aap movie 55% return fea geed indicator 1...
1,user I'd be afraid to short AMZN - they are lo...,1,user afraid short amzn looking like near monop...
2,MNTA Over 12.00,1,mnta 12 00
3,OI Over 21.37,1,oi 21 37
4,PGNX Over 3.04,1,pgnx 3 04
5,AAP - user if so then the current downtrend wi...,-1,aap user current downtrend break otherwise sho...
6,Monday's relative weakness. NYX WIN TIE TAP IC...,-1,monday relative weakness nyx win tie tap ice i...
7,GOOG - ower trend line channel test & volume s...,1,goog ower trend line channel test volume support
8,AAP will watch tomorrow for ONG entry.,1,aap watch tomorrow ong entry
9,i'm assuming FCX opens tomorrow above the 34.2...,1,assuming fcx open tomorrow 34 25 trigger buy s...


In some cases, the data cleansing has removed more information than necessary, making the resulting statement rather meaningless (rows 2, 3, and 4 above, where a price such as "12.00" is split into "12 00"). With the many other rows of data available to learn from, the model can tolerate a few rows like these.
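One way to keep prices such as "12.00" intact is to leave a period in place when it sits between two digits. The sketch below is an assumption about how the cleaning step could be adjusted, not the approach used in this notebook. `clean_keep_decimals` is a hypothetical helper, and stopword removal and lemmatizing are left out to keep it short.

```python
import re

def clean_keep_decimals(text):
    """Hypothetical variant of the cleaning step that preserves decimal numbers."""
    # Replace everything except letters, digits, whitespace, '%' and '.' with a space
    text = re.sub(r"[^a-zA-Z0-9\s%.]", " ", text)
    # Drop any period that is not sitting between two digits
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", text)
    # Lower-case and collapse repeated whitespace
    return " ".join(text.lower().split())

print(clean_keep_decimals("MNTA Over 12.00"))                         # mnta over 12.00
print(clean_keep_decimals("AAP will watch tomorrow for ONG entry."))  # aap will watch tomorrow for ong entry
```

Note that `TfidfVectorizer`'s default `token_pattern` also splits on punctuation, so a custom `token_pattern` would be needed as well for the decimals to survive vectorization.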

## Creating Training and Test Data

Now that the data is cleansed, the next step is to split it into a training set and a test set. The data must also be converted into a numeric format, as the machine learning model cannot work with raw text.

Converting the input data into numbers is done with a TF-IDF vectorizer. It scores each word or phrase by how often it appears in a row, weighted down by how many rows contain it, so terms that show up everywhere count for less. It's set to use n-grams of one to three consecutive words. These TF-IDF scores are the features that the ML model will use.
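To make the idea concrete, here is a simplified, pure-Python sketch of the computation. It assumes whitespace tokenization, a smoothed IDF, and L2 normalization, which mirrors scikit-learn's defaults only roughly; `ngrams` and `tfidf_weights` are hypothetical helpers and the two example rows are made up.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams of 1..max_n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_weights(docs, max_n=3):
    """Return one {ngram: weight} dict per document, L2-normalised."""
    counts_per_doc = [Counter(ngrams(doc.split(), max_n)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each n-gram appear?
    doc_freq = Counter(gram for counts in counts_per_doc for gram in counts)
    weighted = []
    for counts in counts_per_doc:
        # Term frequency times smoothed inverse document frequency
        raw = {gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
               for gram, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({gram: w / norm for gram, w in raw.items()})
    return weighted

docs = ["aap watch tomorrow", "watch fcx open tomorrow"]  # made-up cleaned rows
weights = tfidf_weights(docs)
print(sorted(weights[0]))  # six features: three words, two pairs, one triple
print(weights[0]["aap"] > weights[0]["watch"])  # True: "aap" is rarer, so it weighs more
```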

In [7]:
xtrain, xtest, ytrain, ytest = train_test_split( input_data['cleaned'], input_data['sentiment'],
                                                               test_size=.4, random_state=10)
#60% of the data will be used for training. 40% to test.
#the random state number is so that each time this is run it generates the same result


In [8]:
tfidf = TfidfVectorizer(ngram_range=(1,3))
xtrain_tf = tfidf.fit_transform(xtrain)
xtest_tf = tfidf.transform(xtest)

In [9]:
print("nsamples: %d, nfeatures: %d" % xtest_tf.shape)
print(xtest_tf[1:3])

nsamples: 2316, nfeatures: 57630
  (0, 45368)	0.510097570366633
  (0, 25696)	0.510097570366633
  (0, 25695)	0.510097570366633
  (0, 14855)	0.3210467802934407
  (0, 12960)	0.3410723837858898
  (1, 31576)	0.822655797157474
  (1, 31525)	0.4690720683613014
  (1, 5270)	0.32126131744492964


Now for actually building and training the model. The model is a Random Forest Classifier from scikit-learn.

The number of cross-validation folds used here (cv=5) is arbitrary, and can be changed.

In [10]:
rand_forest = RandomForestClassifier()
scores = cross_val_score(rand_forest, xtrain_tf, ytrain, cv=5)
print(scores)

[0.76546763 0.74244604 0.75395683 0.73093525 0.72622478]
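The five fold scores are easier to compare with later results once they are reduced to a mean and a spread. A small sketch using only the standard library, with the scores copied from the output above:

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross_val_score output above
fold_scores = [0.76546763, 0.74244604, 0.75395683, 0.73093525, 0.72622478]

avg = mean(fold_scores)
spread = pstdev(fold_scores)  # population standard deviation, the same convention as np.std

print(f"mean accuracy: {avg:.3f} +/- {spread:.3f}")  # mean accuracy: 0.744 +/- 0.014
```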


In [18]:
#the first version of hyperparameter tuning
params = { 'n_estimators' : [5, 10, 25, 50, 100], 'max_depth' : [2, 5, 10, 20, None]}

grid_search = GridSearchCV(rand_forest, params)  # cv is not set, so the default 5-fold split is used
grid_search.fit(xtrain_tf, ytrain.values)

all_means = grid_search.cv_results_['mean_test_score']
all_std_dev = grid_search.cv_results_['std_test_score']
all_params = grid_search.cv_results_['params']
for cand_params, cand_mean, cand_std in zip(all_params, all_means, all_std_dev):
    print(cand_params, "\t", cand_mean, "\t", cand_std)

{'max_depth': 2, 'n_estimators': 5} 	 0.6309729853005204 	 0.0006478831006395653
{'max_depth': 2, 'n_estimators': 10} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 2, 'n_estimators': 25} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 2, 'n_estimators': 50} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 2, 'n_estimators': 100} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 5, 'n_estimators': 5} 	 0.6321244790910787 	 0.0016862283817811348
{'max_depth': 5, 'n_estimators': 10} 	 0.6327008479671594 	 0.001481704893706652
{'max_depth': 5, 'n_estimators': 25} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 5, 'n_estimators': 50} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 5, 'n_estimators': 100} 	 0.6315485248688658 	 0.00021230278025421434
{'max_depth': 10, 'n_estimators': 5} 	 0.6416221259303796 	 0.009419443001572986
{'max_depth': 10, 'n_estimators': 10} 	 0.6370173947297493 	 0.003610886162135585
{'max_depth

In [20]:
grid_search.best_estimator_

RandomForestClassifier()

After performing the grid search across every combination of depth and estimators and printing the results, there is a clear jump in accuracy when there are no constraints on depth. While standard deviation trends downwards with more estimators, there isn't much of a difference between 50 and 100 estimators. Diminishing returns are very strong at that point.
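The cost of the grid search follows directly from the size of the parameter grid. This sketch enumerates the same combinations with `itertools.product`; it illustrates the idea and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

params = {'n_estimators': [5, 10, 25, 50, 100], 'max_depth': [2, 5, 10, 20, None]}

# One value per parameter, for every possible pairing
names = sorted(params)
combos = [dict(zip(names, values))
          for values in product(*(params[name] for name in names))]

cv_folds = 5  # GridSearchCV's default when cv is not given

print(len(combos))             # 25 candidate settings
print(len(combos) * cv_folds)  # 125 model fits, plus one final refit on all the training data
print(combos[0])               # {'max_depth': 2, 'n_estimators': 5}
```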


For fun, run the search again with even more input parameters, this time using random search. It will take a while, so this is a great time to take a break and remember to eat.

In [22]:
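n_estimators = [int(x) for x in np.linspace(start=5, stop=300, num=5)]
# 'auto' means the same as 'sqrt' for classifiers in older scikit-learn; it was removed in version 1.3
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 100, num=5)]
max_depth.append(None)

# note: an integer min_samples_split must be at least 2, so candidates that draw 1 fail to fit
min_samples_split = [1,2,4,10,20,25,50,100]
min_samples_leaf = [1,2,3,4]
bootstrap = [True, False]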

grid = {'n_estimators':n_estimators,
        'max_features':max_features,
        'max_depth':max_depth,
        'min_samples_split':min_samples_split,
        'min_samples_leaf':min_samples_leaf,
        'bootstrap':bootstrap
       }



In [24]:
#Whereas grid search tests every combination (X1, X2, X3, Y1, Y2, Y3, Z1, Z2....), random search
#instead picks points within the "grid" of parameters to test
#this will take some time, so use n_jobs=-1 to use all the processors on the computer
#DO NOT RUN WITH n_iter=1000 (that takes a solid 7 hours)
rf_rand = RandomizedSearchCV(estimator=rand_forest, param_distributions=grid, n_iter=100, 
                            cv=5, random_state=10,n_jobs=-1)
rf_rand.fit(xtrain_tf, ytrain)

70 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Patrick\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Patrick\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 450, in fit
    trees = Parallel(
  File "C:\Users\Patrick\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\Patrick\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\Pat

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 32, 55, 77, 100,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 3, 4],
                                        'min_samples_split': [1, 2, 4, 10, 20,
                                                              25, 50, 100],
                                        'n_estimators': [5, 78, 152, 226, 300]},
                   random_state=10)
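The numbers in the warning above can be sanity-checked by treating the search as a draw from the full grid. This is a sketch only: it uses Python's `random` module, so it will not reproduce the exact candidates scikit-learn picked, and the seed is arbitrary.

```python
import random
from itertools import product

# Same value lists as in the search above
grid = {
    'n_estimators': [5, 78, 152, 226, 300],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [10, 32, 55, 77, 100, None],
    'min_samples_split': [1, 2, 4, 10, 20, 25, 50, 100],
    'min_samples_leaf': [1, 2, 3, 4],
    'bootstrap': [True, False],
}

names = list(grid)
all_combos = [dict(zip(names, values)) for values in product(*grid.values())]
print(len(all_combos))  # 3840 possible settings in the full grid

rng = random.Random(10)                  # seeded so the draw is repeatable
sampled = rng.sample(all_combos, k=100)  # 100 distinct settings, like n_iter=100
print(len(sampled) * 5)                  # 500 fits with cv=5

# Settings that use min_samples_split=1, which scikit-learn rejects
invalid = [combo for combo in sampled if combo['min_samples_split'] == 1]
print(len(invalid))
```

One in eight grid points uses `min_samples_split=1`, so roughly 12 or 13 failing candidates out of 100 would be expected. The 70 failed fits reported above correspond to 14 candidates times 5 folds, which is consistent with that being the likely cause.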

In [25]:
rf_rand.best_estimator_

RandomForestClassifier(bootstrap=False, min_samples_leaf=2, min_samples_split=4,
                       n_estimators=152)

In [30]:
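#now create a RandomForestClassifier based on the best estimator from the random search.
#note: these values differ slightly from rf_rand.best_estimator_ above
#(min_samples_split=5 instead of 4, and min_samples_leaf left at its default of 1)
forest = RandomForestClassifier(bootstrap=False, max_features='sqrt', min_samples_split=5, n_estimators=152)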
forest.fit(xtrain_tf, ytrain)
forest_predictions = forest.predict(xtest_tf)
results = metrics.classification_report(ytest, forest_predictions)

In [29]:
print(results)

              precision    recall  f1-score   support

          -1       0.64      0.65      0.64       826
           1       0.80      0.80      0.80      1490

    accuracy                           0.74      2316
   macro avg       0.72      0.72      0.72      2316
weighted avg       0.75      0.74      0.75      2316
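A couple of quick calculations help put the report in context. The sketch below recomputes the macro F1 from the rounded precision and recall figures, and works out the accuracy of a baseline that always predicts the majority class. The numbers are copied from the report; the baseline comparison is an added illustration.

```python
# Figures copied from the classification report above
support = {-1: 826, 1: 1490}
precision = {-1: 0.64, 1: 0.80}
recall = {-1: 0.65, 1: 0.80}

total = sum(support.values())

# F1 is the harmonic mean of precision and recall
f1 = {label: 2 * precision[label] * recall[label] / (precision[label] + recall[label])
      for label in support}
macro_f1 = sum(f1.values()) / len(f1)

# Accuracy of a baseline that always predicts the majority class (1)
baseline = max(support.values()) / total

print(total)               # 2316
print(round(macro_f1, 2))  # 0.72
print(round(baseline, 3))  # 0.643
```

The reported accuracy of 0.74 is about ten points above this baseline.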



This is just basic hyperparameter tuning to find the best parameters for the constructor. More in-depth tuning has to take place after using the model for some time to understand what other areas need to be optimized.

The model built above with the parameters from the search is now ready for production. The fitted model and the TF-IDF vectorizer are each saved to a pickle file, which avoids the need to retrain the model every time a prediction is desired, and also avoids the need to store or provide the training data.

In [31]:
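import pickle

#save the trained model; "with" closes the file once the dump finishes
filename = "my_saved_model"
with open(filename, 'wb') as model_file:
    pickle.dump(forest, model_file)

#save the fitted vectorizer too, since new text must be transformed the same way
filename = "my_saved_tfidf"
with open(filename, 'wb') as tfidf_file:
    pickle.dump(tfidf, tfidf_file)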
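Using the saved files later could look like the sketch below. `load_artifacts` and `predict_sentiment` are hypothetical helpers, and the default file names assume the two files written above sit in the working directory. New text should go through the same cleaning steps as the training data before it is passed in.

```python
import pickle

def load_artifacts(model_path="my_saved_model", tfidf_path="my_saved_tfidf"):
    """Load the pickled classifier and the fitted TF-IDF vectorizer."""
    with open(model_path, "rb") as model_file:
        model = pickle.load(model_file)
    with open(tfidf_path, "rb") as tfidf_file:
        vectorizer = pickle.load(tfidf_file)
    return model, vectorizer

def predict_sentiment(cleaned_texts, model, vectorizer):
    """Vectorize already-cleaned text with the fitted vectorizer, then classify it."""
    features = vectorizer.transform(cleaned_texts)  # transform only; never refit here
    return model.predict(features)
```

For example, `model, vectorizer = load_artifacts()` followed by `predict_sentiment(["aap watch tomorrow ong entry"], model, vectorizer)` would return one label per input string. Pickle files should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote them.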