### AIDI 1002 - AI Algorithms - Final Project

### Sentiment Analysis

Michael Molnar - Durham College #100806823

## Notebook 4:  Sentiment Analysis of User Text

#### Problem Statement:

How can we use machine learning to automatically extract the sentiment of every review, comment, blog post, or news article that mentions your business or your products?  

This project will create a model that automatically analyzes text and predicts its sentiment: negative, neutral, or positive.  This solution will allow a business to automatically parse and sort the reviews and comments it receives, and to analyze how customers feel about the business and brand.  This analysis will allow a company to determine how feelings towards it change over time, or after the release of a new product or a shift in direction.  Unhappy customers can be automatically identified and prioritized.  

The proposed solution will be a classification model trained on real product reviews to identify the key words and phrases that most accurately predict the sentiment of a sample of text.

#### Focus of Notebook 4:

In the previous notebook I determined that the best model was Logistic Regression.  In this notebook I will create the pipeline that takes user text and processes it to match the data the model was trained and tested on.  I will then recreate the selected model from the previous notebook and create functions to generate sentiments for text input.  

The final function will take in the text, do all of the processing, and generate the sentiment.  It will produce the probabilities for each of the three classes.  Finally, it will make use of the study done at the end of the last notebook to identify the top five words or phrases that had the greatest impact on the model's class prediction.

### Import Necessary Packages

In [2]:
import pandas as pd
import numpy as np
import re
import string
from num2words import num2words
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### Prepare for Input Text

In this section I will use all of the text cleaning functions I created in previous notebooks.  All of these will be combined into one function that will transform a user's text into a suitable form to be applied to the Logistic Regression model.

In [3]:
# Function to expand contractions
def decontract(sentence):
    sentence = re.sub(r"won\'t", "will not", sentence)
    sentence = re.sub(r"can\'t", "can not", sentence)
    
    sentence = re.sub(r"n\'t", " not", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'s", " is", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'t", " not", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'m", " am", sentence)
    return sentence
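The order of the substitutions in `decontract` matters: the irregular forms "won't" and "can't" are handled before the generic `n't` rule.  The sketch below uses only two of the rules and a hypothetical helper named `expand` (not part of the notebook) to show what happens when the order is reversed.

```python
import re

def expand(sentence, specific_first=True):
    """Expand "won't" using one specific and one generic rule, in either order."""
    specific = (r"won\'t", "will not")
    generic = (r"n\'t", " not")
    rules = [specific, generic] if specific_first else [generic, specific]
    for pattern, replacement in rules:
        sentence = re.sub(pattern, replacement, sentence)
    return sentence

print(expand("I won't buy this"))                        # I will not buy this
print(expand("I won't buy this", specific_first=False))  # I wo not buy this
```

Note that these patterns match only the straight apostrophe (`'`).  Text typed with a typographic apostrophe (`’`) passes through unchanged.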

In [4]:
# Function to remove linebreaks
def remove_linebreaks(input):
    text = re.compile(r'\n')
    return text.sub(r' ',input)

In [5]:
# Function to remove punctuation
def remove_punctuation(input):
    no_punc = [char for char in input if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    return no_punc

In [6]:
# Function to replace numbers with text
def replace_numbers(text):
    words = []
    for word in text.split():
        if word.isdigit():
            words.append(num2words(word))
        else:
            words.append(word)
    return " ".join(words)

In [7]:
# Generate the list of stopwords
stopwords_list = stopwords.words('english')
stopwords_list.remove('no')
stopwords_list.remove('not')
stopwords_list.remove('very')
stopwords_list.remove('only')

In [8]:
# Function to remove stopwords
def remove_stopwords(input):
    no_stop = [word for word in input.split() if word not in stopwords_list]
    no_stop = ' '.join(no_stop)
    return no_stop
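The notebook does not spell out here why "no", "not", "very", and "only" are taken off the stop word list.  A likely reason is that they carry sentiment information such as negation and emphasis.  The sketch below illustrates the effect with a toy stop word set and a hypothetical helper `strip_stopwords`; it does not use NLTK's actual list.

```python
# Toy stop word list for illustration only; the notebook uses NLTK's English list
toy_stopwords = {"this", "is", "a", "not", "very"}

def strip_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in `stopwords`."""
    return " ".join(word for word in text.split() if word not in stopwords)

review = "this is not a very good brush"

# With the full toy list, the negation disappears
print(strip_stopwords(review, toy_stopwords))   # good brush

# Keeping "not" and "very" preserves the negation
kept = toy_stopwords - {"not", "very"}
print(strip_stopwords(review, kept))            # not very good brush
```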

In [9]:
# Function to stem text
def stem_text(input):
    stemmer = SnowballStemmer('english')
    text = input.split()
    words = ""
    for i in text:
        words += (stemmer.stem(i))+' '
    return words 

I now combine all of these processing techniques into one function that will be used on a user's text.  I note again that I remove punctuation twice - before and after converting numbers to strings - to ensure that any dashes that have been added will be removed.

In [11]:
def process_input(input):
    input = decontract(input)
    input = remove_linebreaks(input)
    input = remove_punctuation(input)
    input = input.lower()
    input = replace_numbers(input)
    input = remove_punctuation(input)
    input = remove_stopwords(input)
    input = stem_text(input)
    return input
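To illustrate why punctuation is removed both before and after the number conversion, here is a minimal sketch.  It uses `spell_number`, a hypothetical stand-in for `num2words` that covers a single value, and `strip_punctuation`, which mirrors `remove_punctuation` above.

```python
import string

def spell_number(token):
    """Stand-in for num2words: spells out "21" and leaves anything else alone."""
    return {"21": "twenty-one"}.get(token, token)

def strip_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

text = "bought 21!"

# Without the first pass the token is "21!", which is not recognized as a number
print("21!".isdigit())                                   # False

first_pass = strip_punctuation(text)                     # bought 21
spelled = " ".join(spell_number(t) for t in first_pass.split())
print(spelled)                                           # bought twenty-one

# The second pass removes the hyphen introduced by the conversion
second_pass = strip_punctuation(spelled)
print(second_pass)                                       # bought twentyone
```

Because punctuation is deleted rather than replaced with a space, hyphenated number words are joined into a single token.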

### Load the Data the Model was Trained and Tested On

In [12]:
X_train = pd.read_csv('clean_training_data.csv')
X_test = pd.read_csv('clean_testing_data.csv')
y_train = pd.read_csv('training_labels.csv')
y_test = pd.read_csv('testing_labels.csv')

In [13]:
# Remove extra column
X_train.drop(columns=['Unnamed: 0'], inplace=True)
X_test.drop(columns=['Unnamed: 0'], inplace=True)
y_train.drop(columns=['Unnamed: 0'], inplace=True)
y_test.drop(columns=['Unnamed: 0'], inplace=True)

In [14]:
# Convert from dataframes to series
X_train = X_train['reviewText']
X_test = X_test['reviewText']
y_train = y_train['label']
y_test = y_test['sentiment']

In [15]:
# Check the shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(77271,)
(24664,)
(77271,)
(24664,)


### Fit the Logistic Regression Model

In [16]:
# As noted, I will use a Count Vectorizer with a combination of unigrams, bigrams, and trigrams
# This combination produced the highest accuracy in the algorithm testing notebook

vectorizer = CountVectorizer(ngram_range=(1,3))

# Vectorization
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)
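As a rough illustration of what `ngram_range=(1,3)` means, the sketch below lists every unigram, bigram, and trigram in a short token sequence.  The helper `extract_ngrams` is hypothetical and splits on whitespace only; `CountVectorizer` applies its own tokenization (by default it ignores single-character tokens), so this is an approximation of the idea rather than a reimplementation.

```python
def extract_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = "not worth money".split()
print(extract_ngrams(tokens))
# ['not', 'worth', 'money', 'not worth', 'worth money', 'not worth money']
```

Each distinct n-gram seen in the training data becomes one column of the vectorized matrix, which is why the feature count grows so quickly when bigrams and trigrams are included.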

In [17]:
# Fitting the Logistic Regression classifier and checking the accuracies
lr = LogisticRegression(solver='liblinear', multi_class='ovr')
lr.fit(X_train_cv, y_train) 
lr_train_preds = lr.predict(X_train_cv)
lr_preds = lr.predict(X_test_cv)
lr_train_acc = accuracy_score(y_train, lr_train_preds)
lr_test_acc = accuracy_score(y_test, lr_preds)
print("Training Accuracy:", lr_train_acc)
print("Testing Accuracy:", lr_test_acc)

Training Accuracy: 0.974078244101927
Testing Accuracy: 0.8607281868310087


### Match the Vectorizer Features and the Coefficients

This will repeat the process from the last notebook of matching the features in the vectorizer with the model's coefficients for each of the three classes.  This will allow me to match the words and phrases found in the input text to these coefficients and then rank them according to their importance to the prediction of the label.

In [23]:
lr.coef_.shape

(3, 1753074)

In [24]:
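# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead
len(vectorizer.get_feature_names())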

1753074

In [25]:
# The coefficients for the negative class
negative_coef = {
    word: coef for word, coef in zip(
        vectorizer.get_feature_names(), lr.coef_[0])
    }

In [26]:
# The coefficients for the neutral class
neutral_coef = {
    word: coef for word, coef in zip(
        vectorizer.get_feature_names(), lr.coef_[1])
    }

In [27]:
# The coefficients for the positive class
positive_coef = {
    word: coef for word, coef in zip(
        vectorizer.get_feature_names(), lr.coef_[2])
    }
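The three cells above rely on rows 0, 1, and 2 of `lr.coef_` corresponding to the negative, neutral, and positive classes.  This holds because scikit-learn stores the class labels in sorted order in `lr.classes_`, and the labels here sort alphabetically as negative, neutral, positive.  The sketch below shows the same pairing built in one step; the lists `classes`, `feature_names`, and `coef` are toy stand-ins with made-up values, not output from the fitted model.

```python
# Toy stand-ins for lr.classes_, the vectorizer's feature names, and lr.coef_
classes = ["negative", "neutral", "positive"]
feature_names = ["bad", "okay", "great"]
coef = [
    [1.5, -0.2, -1.1],   # row 0 -> "negative"
    [-0.3, 0.9, -0.4],   # row 1 -> "neutral"
    [-1.2, -0.1, 1.8],   # row 2 -> "positive"
]

# One {feature: coefficient} dictionary per class label
coef_by_class = {
    label: dict(zip(feature_names, row))
    for label, row in zip(classes, coef)
}

print(coef_by_class["positive"]["great"])   # 1.8
print(coef_by_class["negative"]["bad"])     # 1.5
```

Keying the dictionaries by label rather than by row index avoids depending on the class order by hand.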

### Make Predictions for New Text

In [44]:
def predict_sentiment():
    """
    This is a simplified function.  It will allow the user to enter their text and then process and vectorize it.
    It will then predict the sentiment according to the Logistic Regression model that has been fit.
    It will then print the probabilities assigned to each of the three classes.
    """
    # Prompt the user for text
    user_text = input('Enter Your Text Here: \n\n')
    # Process 
    processed_text = [process_input(user_text)]
    # Transform the text according to the Count Vectorizer that has been fit
    vec_text = vectorizer.transform(processed_text)
    
    # Predict the class label
    predicted_sentiment = lr.predict(vec_text)
    # Generate the probabilities for the labels
    confidence = lr.predict_proba(vec_text)
           
    # Print a summary
    print('\n\nThe Predicted Sentiment is: ', predicted_sentiment[0].upper())
    print('\nAnalysis:')
    print('\nProbability of Negative Label:', np.around(confidence[0][0]*100, 2), '%')
    print('Probability of Neutral Label:', np.around(confidence[0][1]*100, 2), '%')
    print('Probability of Positive Label:', np.around(confidence[0][2]*100, 2), '%')
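For intuition about the three percentages printed above: as far as I understand scikit-learn's behaviour with `multi_class='ovr'`, each class gets its own binary score, the scores are passed through a sigmoid, and the results are rescaled so they sum to one.  The sketch below reproduces that calculation in plain Python.  The function name `ovr_probabilities` and the scores are hypothetical, chosen only for illustration.

```python
import math

def ovr_probabilities(scores):
    """Sigmoid of each per-class score, rescaled so the values sum to 1."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]

# Hypothetical decision scores for (negative, neutral, positive)
probs = ovr_probabilities([-4.0, -1.0, 3.0])
print([round(p * 100, 2) for p in probs])
```

The class with the largest score always receives the largest probability, so the predicted label and the highest percentage agree.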

### Make Predictions and Analysis for New Text

In [43]:
def predict_sentiment_with_analysis():
    """
    This function extends the previous one.  After the class label is predicted it will
    create a list of the vectorized features of the input.  It will then create a dictionary
    of the model's coefficients for these features based on the label.  Finally, it will sort
    these values and print out a dataframe of the five with the highest importance.
    """
    # Prompt the user for text
    user_text = input('Enter Your Text Here: \n\n')
    # Process
    processed_text = [process_input(user_text)]
    # Transform the text according to the Count Vectorizer that has been fit
    vec_text = vectorizer.transform(processed_text)
    
    # Predict the class label
    predicted_sentiment = lr.predict(vec_text)
    # Generate the probabilities for the labels
    confidence = lr.predict_proba(vec_text)
    
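    # Get the features of the input text and store the English words or phrases
    # (the feature names are looked up once rather than on every iteration)
    feature_names = vectorizer.get_feature_names()
    phrases = []
    for item in vec_text.indices:
        phrases.append(feature_names[item])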
    
    # Create a dictionary of the model's coefficients for these features 
    importances = dict()
    for phrase in phrases:
        # predicted_sentiment is a one-element array, so compare its first entry
        if predicted_sentiment[0] == 'negative':
            importances[phrase] = negative_coef.get(phrase)
        elif predicted_sentiment[0] == 'neutral':
            importances[phrase] = neutral_coef.get(phrase)
        elif predicted_sentiment[0] == 'positive':
            importances[phrase] = positive_coef.get(phrase)
            
    # Sort this dictionary according to its values
    importances = sorted(importances.items(), key=lambda x: x[1], reverse=True)
        
    # Create a dataframe of these features and their coefficients, keeping the top five
    importances = pd.DataFrame(importances)
    importances = importances[:5]
    importances = importances.rename(columns={0: 'Word or Phrase', 1: 'Model Coefficient'},
                                        index={0: 'First', 1: 'Second', 2: 'Third', 3: 'Fourth', 4: 'Fifth'})
        
    # Print the results
    print('\n\nThe Predicted Sentiment is: ', predicted_sentiment[0].upper())
    print('\nAnalysis:')
    print('\nProbability of Negative Label:', np.around(confidence[0][0]*100, 2), '%')
    print('Probability of Neutral Label:', np.around(confidence[0][1]*100, 2), '%')
    print('Probability of Positive Label:', np.around(confidence[0][2]*100, 2), '%')
    print('\nThe Five Most Important Stemmed Words or Phrases Are: \n')
    print(importances)
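The ranking step can be illustrated on its own with a small example.  The sketch below uses made-up coefficients in `toy_coef` and a hypothetical helper `top_phrases`.  It selects the five largest values with `heapq.nlargest` instead of sorting the whole dictionary, which is one possible alternative to the approach above and not the notebook's own code.

```python
import heapq

# Made-up coefficients for the predicted class, for illustration only
toy_coef = {"great": 1.8, "glad": 1.2, "paint": 0.1, "dri": -0.6,
            "high qualiti": 0.9, "gift": 0.4, "recommend": 1.0}

def top_phrases(phrases, coef, k=5):
    """Return the k (phrase, coefficient) pairs with the largest coefficients."""
    scored = [(phrase, coef[phrase]) for phrase in phrases if phrase in coef]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

ranked = top_phrases(list(toy_coef), toy_coef)
for phrase, weight in ranked:
    print(f"{phrase:<15}{weight:>6.2f}")
```

As in the function above, the ranking uses the coefficient alone, so a phrase that appears several times in the text is not weighted more heavily than one that appears once.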

### Testing the Model on Real Reviews

I have searched on Amazon for one positive, one negative, and one neutral review and I will examine the results of my model.

#### Positive Review

<b>Source:</b>

https://www.amazon.ca/product-reviews/B07XP1CNRW/ref=acr_dp_hist_5?ie=UTF8&filterByStar=five_star&reviewerType=all_reviews#reviews-filter-bar

<b>Rating:</b>

5/5

<b>Text:</b>

I was looking to get back into painting and these paints were incredible! I was unsure of whether to purchase or not because they were much more affordable than some other options I saw but I am so glad I purchased these. Whether these are for children or for yourself, you can enjoy and you really get the feel that they are high quality. I've worked with expensive acrylic paints before and these are very similar. They blend very seamlessly together, and the colour selection is absolutely gorgeous. The paint tubes are actually a decent size so you have good value there as well. None of the paints were dried up or anything, and so far they have exceeded my expectations. They arrived in very nice packaging as well so I would imagine that it would make a great gift. Highly recommend this!!

In [45]:
predict_sentiment()

Enter Your Text Here: 

I was looking to get back into painting and these paints were incredible! I was unsure of whether to purchase or not because they were much more affordable than some other options I saw but I am so glad I purchased these. Whether these are for children or for yourself, you can enjoy and you really get the feel that they are high quality. I've worked with expensive acrylic paints before and these are very similar. They blend very seamlessly together, and the colour selection is absolutely gorgeous. The paint tubes are actually a decent size so you have good value there as well. None of the paints were dried up or anything, and so far they have exceeded my expectations. They arrived in very nice packaging as well so I would imagine that it would make a great gift. Highly recommend this!!


The Predicted Sentiment is:  POSITIVE

Analysis:

Probability of Negative Label: 0.0 %
Probability of Neutral Label: 0.13 %
Probability of Positive Label: 99.87 %


In [46]:
predict_sentiment_with_analysis()

Enter Your Text Here: 

I was looking to get back into painting and these paints were incredible! I was unsure of whether to purchase or not because they were much more affordable than some other options I saw but I am so glad I purchased these. Whether these are for children or for yourself, you can enjoy and you really get the feel that they are high quality. I've worked with expensive acrylic paints before and these are very similar. They blend very seamlessly together, and the colour selection is absolutely gorgeous. The paint tubes are actually a decent size so you have good value there as well. None of the paints were dried up or anything, and so far they have exceeded my expectations. They arrived in very nice packaging as well so I would imagine that it would make a great gift. Highly recommend this!!


The Predicted Sentiment is:  POSITIVE

Analysis:

Probability of Negative Label: 0.0 %
Probability of Neutral Label: 0.13 %
Probability of Positive Label: 99.87 %

The Five Most 

The model is almost certain of the positive sentiment here.  The review contains some very telling words, with "great", "glad", and "gorgeous" being the top three.

#### Neutral Review

<b>Source:</b>

https://www.amazon.ca/product-reviews/B076KCGRS6/ref=?ie=UTF8&filterByStar=three_star&reviewerType=all_reviews&pageNumber=1#reviews-filter-bar

<b>Rating:</b>

3/5

<b>Text:</b>

These pencils are pretty good. They're not amazing but they're good enough for beginners at a decent price. The leads are quite strong. I've been using them for a few weeks now and I haven't had one break yet. What bothers me is that they aren't labeled, so it can be hard to tell which color is which. I often have to hold it up into the light to tell the difference, and I still have a hard time since some of them look so similar. I'll usually just scribble it on a separate piece of paper beforehand, just to be sure I have the correct pencil. In the future, I plan on labeling them myself to avoid wasting so much time. The variety of colors are okay, but they're not the most "natural" of hues (think comic book colors) I had to go out and purchase additional colors separately (because I needed more pastel colors) Artists might need a different gray pencil because it looks more like a sandy beige...and the white pencil does absolutely nothing. Overall I am happy with this purchase...but as they run out individually, I will probably replace each of them one by one with different brands, just so I can choose the exact colors I need for specific projects.

In [47]:
predict_sentiment()

Enter Your Text Here: 

These pencils are pretty good. They're not amazing but they're good enough for beginners at a decent price. The leads are quite strong. I've been using them for a few weeks now and I haven't had one break yet. What bothers me is that they aren't labeled, so it can be hard to tell which color is which. I often have to hold it up into the light to tell the difference, and I still have a hard time since some of them look so similar. I'll usually just scribble it on a separate piece of paper beforehand, just to be sure I have the correct pencil. In the future, I plan on labeling them myself to avoid wasting so much time. The variety of colors are okay, but they're not the most "natural" of hues (think comic book colors) I had to go out and purchase additional colors separately (because I needed more pastel colors) Artists might need a different gray pencil because it looks more like a sandy beige...and the white pencil does absolutely nothing. Overall I am happy wit

In [48]:
predict_sentiment_with_analysis()

Enter Your Text Here: 

These pencils are pretty good. They're not amazing but they're good enough for beginners at a decent price. The leads are quite strong. I've been using them for a few weeks now and I haven't had one break yet. What bothers me is that they aren't labeled, so it can be hard to tell which color is which. I often have to hold it up into the light to tell the difference, and I still have a hard time since some of them look so similar. I'll usually just scribble it on a separate piece of paper beforehand, just to be sure I have the correct pencil. In the future, I plan on labeling them myself to avoid wasting so much time. The variety of colors are okay, but they're not the most "natural" of hues (think comic book colors) I had to go out and purchase additional colors separately (because I needed more pastel colors) Artists might need a different gray pencil because it looks more like a sandy beige...and the white pencil does absolutely nothing. Overall I am happy wit

The model is almost certain of the neutral class.  It has relied most on the words "okay" and "decent" in making the determination.

#### Negative Review

Since the model was so sure of the first two labels, for this one I have selected a review that was rated one out of five but does not contain any extremely negative words or phrases.  There is no "terrible", "horrible", "waste of money", or other such phrase.

<b>Source:</b>

https://www.amazon.ca/product-reviews/B076KCGRS6/ref=cm_cr_unknown?ie=UTF8&filterByStar=one_star&reviewerType=all_reviews&pageNumber=1#reviews-filter-bar

<b>Rating:</b>

1/5

<b>Text:</b>

We ordered the full 3 sets and before we knew it they were breaking constantly. Thinking it was our sharpener, we used a different one and the problem persisted. You will get breakages but not losing 50mm in less than a hour. Before long that pencil will be gone.
We returned ours for a full refund and bought some Castles, much better.
Don’t know how these are constructed, but it looks like a 2 piece outer with pressured stain, which disguises the grain whether good or bad.
We contacted the seller who was very sympathetic and sent us a free box of Cobras, but sadly no different, actually these were worse.

In [49]:
predict_sentiment()

Enter Your Text Here: 

We ordered the full 3 sets and before we knew it they were breaking constantly. Thinking it was our sharpener, we used a different one and the problem persisted. You will get breakages but not losing 50mm in less than a hour. Before long that pencil will be gone. We returned ours for a full refund and bought some Castles, much better. Don’t know how these are constructed, but it looks like a 2 piece outer with pressured stain, which disguises the grain whether good or bad. We contacted the seller who was very sympathetic and sent us a free box of Cobras, but sadly no different, actually these were worse.


The Predicted Sentiment is:  NEGATIVE

Analysis:

Probability of Negative Label: 82.56 %
Probability of Neutral Label: 17.44 %
Probability of Positive Label: 0.0 %


In [50]:
predict_sentiment_with_analysis()

Enter Your Text Here: 

We ordered the full 3 sets and before we knew it they were breaking constantly. Thinking it was our sharpener, we used a different one and the problem persisted. You will get breakages but not losing 50mm in less than a hour. Before long that pencil will be gone. We returned ours for a full refund and bought some Castles, much better. Don’t know how these are constructed, but it looks like a 2 piece outer with pressured stain, which disguises the grain whether good or bad. We contacted the seller who was very sympathetic and sent us a free box of Cobras, but sadly no different, actually these were worse.


The Predicted Sentiment is:  NEGATIVE

Analysis:

Probability of Negative Label: 82.56 %
Probability of Neutral Label: 17.44 %
Probability of Positive Label: 0.0 %

The Five Most Important Stemmed Words or Phrases Are: 

       Word or Phrase  Model Coefficient
First          return           1.895161
Second            bad           1.473265
Third            se

With this negative review the coefficients are not as large as with the two previous tests.  This is because the text does not contain any of the most predictive negative words and phrases that were identified in the last notebook.  Still, the model is over 82% sure that this review is negative, and it lists the words it finds most predictive.

### Conclusions

The model performs well on user text: it assigned the expected label to all three reviews tested above, and its accuracy on the held-out test set is about 86%.  So far this project has consisted of an EDA of the Arts, Crafts, and Sewing subset of the Amazon Product Reviews dataset, in which common words and phrases were identified.  Functions have been written to clean and process text.  This involved expanding contractions, converting numbers to text, removing punctuation, linebreaks, and stop words, and then stemming the text.  Six machine learning classifiers have been examined and tested using a Count Vectorizer with unigrams, bigrams, trigrams, and combinations of these.  In the end a Logistic Regression classifier was chosen and tuned, coupled with a Count Vectorizer that used a combination of unigrams, bigrams, and trigrams.  Finally, a pipeline was created to process and vectorize input text in order to use the model to predict its sentiment. 

### Next Steps:

A Cloud deployment of this model.