# Sentiment Classifier

This is a sentiment classifier built with a Random Forest. The datasets are pulled from Kaggle:
__link_here__


First, import the libraries needed for machine learning.

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV


## Data Exploration and Cleansing

A few additional imports help with cleaning the input data. The input is text, and it needs to undergo some cleansing to reach a more standardized form before it is given to the model.

In [4]:
import re  # regular expressions
import string
import nltk  # Natural Language Toolkit
from nltk.stem import WordNetLemmatizer

In [5]:
#input_data = pd.read_csv(r'C:\Users\Patrick\Documents\GitHub\bootcamp_capstone\kaggle_dataset\sentiment_analysis_financial_news\all-data.csv'
#                , encoding = "ISO-8859-1", header=None, names=['sentiment', 'text'])

input_data = pd.read_csv(r'C:\Users\Patrick\Documents\GitHub\bootcamp_capstone\kaggle_dataset\stock-market_sentiment\stock_data.csv',
                        encoding="ISO-8859-1", header=1, names=['text', 'sentiment'] )

input_data.head(30)

Unnamed: 0,text,sentiment
0,user: AAP MOVIE. 55% return for the FEA/GEED i...,1
1,user I'd be afraid to short AMZN - they are lo...,1
2,MNTA Over 12.00,1
3,OI Over 21.37,1
4,PGNX Over 3.04,1
5,AAP - user if so then the current downtrend wi...,-1
6,Monday's relative weakness. NYX WIN TIE TAP IC...,-1
7,GOOG - ower trend line channel test & volume s...,1
8,AAP will watch tomorrow for ONG entry.,1
9,i'm assuming FCX opens tomorrow above the 34.2...,1


A quick preview shows that the data lacks a consistent form. Some rows have special characters, while others have none at all, and each line is of a different length.
The cleanup begins by removing special characters and converting everything to lower case. Then each word is lemmatized, which reduces inflected forms to a common base form (for example, "opens" becomes "open") to get a level of consistency.

Stopwords are also removed. These are common function words like "a", "the", "and", and "this". They don't add much information to a sentence, but exist because of grammar rules for human readability. They aren't necessary for the ML model to extract the sentiment. The stopword list used here is NLTK's English list.

In [8]:
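nltk.download("stopwords")
# Note: WordNetLemmatizer also relies on the "wordnet" corpus (nltk.download("wordnet")).
# Build the stopword set once so it is not rebuilt for every word
stopwords = set(nltk.corpus.stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

# Replace everything except letters, digits, whitespace, and '%' with a space
pattern = r'[^a-zA-Z0-9\s%]'
cleaned_buffer = []
for x in input_data['text']:
    temp = re.sub(pattern, " ", x)
    temp = temp.lower()
    temp = temp.split()
    # Drop stopwords, then lemmatize the remaining words
    temp = [lemmatizer.lemmatize(word) for word in temp if word not in stopwords]
    temp = ' '.join(temp)
    cleaned_buffer.append(temp)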
    

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Patrick\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
input_data['cleaned'] = cleaned_buffer
input_data.head(10)

Unnamed: 0,text,sentiment,cleaned
0,user: AAP MOVIE. 55% return for the FEA/GEED i...,1,user aap movie 55% return fea geed indicator 1...
1,user I'd be afraid to short AMZN - they are lo...,1,user afraid short amzn looking like near monop...
2,MNTA Over 12.00,1,mnta 12 00
3,OI Over 21.37,1,oi 21 37
4,PGNX Over 3.04,1,pgnx 3 04
5,AAP - user if so then the current downtrend wi...,-1,aap user current downtrend break otherwise sho...
6,Monday's relative weakness. NYX WIN TIE TAP IC...,-1,monday relative weakness nyx win tie tap ice i...
7,GOOG - ower trend line channel test & volume s...,1,goog ower trend line channel test volume support
8,AAP will watch tomorrow for ONG entry.,1,aap watch tomorrow ong entry
9,i'm assuming FCX opens tomorrow above the 34.2...,1,assuming fcx open tomorrow 34 25 trigger buy s...


In some cases, the data cleansing has removed more information than necessary, making the resulting statement rather meaningless (rows 2, 3, and 4 above): "Over" is dropped as a stopword, and the decimal point is stripped, so "12.00" becomes "12 00". With the additional rows of data, this can be overcome.
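The effect on rows 2 to 4 can be reproduced in isolation. The sketch below compares the notebook's pattern with a hypothetical alternative (`DECIMAL_SAFE_PATTERN`, not part of the original notebook) that keeps a dot when it sits between two digits. Stopword removal and lemmatization are left out so the snippet runs without NLTK.

```python
import re

# Pattern used in the notebook: keep letters, digits, whitespace, and '%'
ORIGINAL_PATTERN = r'[^a-zA-Z0-9\s%]'

# Hypothetical alternative (an assumption, not the author's method):
# also strip '.', except when it has a digit on both sides
DECIMAL_SAFE_PATTERN = r'[^a-zA-Z0-9\s%.]|(?<!\d)\.|\.(?!\d)'


def strip_specials(text, pattern):
    """Replace matches with a space, lower-case, and collapse whitespace."""
    return " ".join(re.sub(pattern, " ", text).lower().split())


original = strip_specials("MNTA Over 12.00", ORIGINAL_PATTERN)
decimal_safe = strip_specials("MNTA Over 12.00", DECIMAL_SAFE_PATTERN)
sentence_end = strip_specials("AAP will watch tomorrow for ONG entry.", DECIMAL_SAFE_PATTERN)

print(original)      # the price is split into two tokens
print(decimal_safe)  # the price survives as one token
print(sentence_end)  # a sentence-final dot is still removed
```

Whether keeping prices intact helps the classifier is an open question; it mainly changes which tokens the vectorizer sees.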

## Creating Training and Test Data

Now that the data is cleansed, the next step is to split it into a training set and a test set. The data must also be converted into a numeric format, as the machine learning model cannot work with raw text.

Converting the input data into numbers is done with a TF-IDF vectorizer. It weights each term by how often it appears in a row, scaled down for terms that appear in many rows. With ngram_range=(1,3), a term is any sequence of one to three consecutive words. These TF-IDF weights are the features that the ML model will use.
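To make the n-gram and weighting ideas concrete, here is a pure-Python sketch. It is not scikit-learn's implementation: it splits on whitespace only, the two documents are made up, and it shows only the inverse-document-frequency part, using the smoothed formula that scikit-learn documents as its default. The term-frequency multiplication and the final row normalization are omitted.

```python
import math
from collections import Counter


def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def smoothed_idf(docs):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1, where df counts documents containing t."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(word_ngrams(doc)))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


toy_docs = ["aap watch tomorrow", "aap user downtrend"]  # made-up examples

grams = word_ngrams(toy_docs[0])
idf = smoothed_idf(toy_docs)

print(grams)
print(idf["aap"], idf["watch"])  # a term shared by both rows gets the lower weight
```

A three-word row already yields six terms, which is why the real matrix ends up with tens of thousands of features.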

In [13]:
xtrain, xtest, ytrain, ytest = train_test_split(input_data['cleaned'], input_data['sentiment'],
                                                test_size=.4, random_state=10)
# 60% of the data is used for training, 40% for testing.
# Fixing random_state makes the split reproducible across runs.


In [14]:
tfidf = TfidfVectorizer(ngram_range=(1,3))
xtrain_tf = tfidf.fit_transform(xtrain)
xtest_tf = tfidf.transform(xtest)

In [22]:
print("nsamples: %d, nfeatures: %d" % xtest_tf.shape)
print(xtest_tf[1:3])

nsamples: 2316, nfeatures: 57630
  (0, 45368)	0.510097570366633
  (0, 25696)	0.510097570366633
  (0, 25695)	0.510097570366633
  (0, 14855)	0.3210467802934407
  (0, 12960)	0.3410723837858898
  (1, 31576)	0.822655797157474
  (1, 31525)	0.4690720683613014
  (1, 5270)	0.32126131744492964


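The test set has 2,316 rows and 57,630 features. The matrix is sparse, so only nonzero entries are printed: each line shows a (row, feature index) pair followed by its TF-IDF weight. The row numbers restart at 0 because the output is a slice containing rows 1 and 2.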

The next step is building and training the model. The model is a Random Forest classifier from scikit-learn.

In [28]:
rand_forest = RandomForestClassifier()
scores = cross_val_score(rand_forest, xtrain_tf, ytrain, cv=5)
print(scores)

[0.76546763 0.73956835 0.75251799 0.7381295  0.72622478]
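The five fold accuracies printed above are easier to compare across runs once they are summarized as a mean and a spread. A minimal sketch, with the scores copied by hand from the output rather than read from the live `scores` array:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
fold_scores = [0.76546763, 0.73956835, 0.75251799, 0.7381295, 0.72622478]

mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the 5 folds

print(f"accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
```

Inside the notebook, `scores.mean()` gives the same mean directly. `scores.std()` uses the population formula, so it comes out slightly smaller than `stdev`.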


In [30]:
# A basic version of hyperparameter tuning, to try to improve the scores
params = {'n_estimators': [5, 10, 25, 50, 100], 'max_depth': [2, 5, 10, 20, None]}

grid_search = GridSearchCV(rand_forest, params)
grid_search.fit(xtrain_tf, ytrain.values)

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [2, 5, 10, 20, None],
                         'n_estimators': [5, 10, 25, 50, 100]})
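The cost of this search grows with the product of the list lengths. A rough sketch of the arithmetic, assuming 5-fold cross-validation, which is the GridSearchCV default in recent scikit-learn releases (no cv argument is passed above):

```python
from itertools import product

# Same grid as in the cell above
params = {'n_estimators': [5, 10, 25, 50, 100], 'max_depth': [2, 5, 10, 20, None]}
n_folds = 5  # assumption: GridSearchCV default when cv is not given

# Every combination of one value per parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", total_fits, "cross-validation fits")
```

After fitting, `grid_search.best_params_` and `grid_search.best_score_` report the winning combination and its mean cross-validated score.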