<font size = 40 color=darkgreen>Sentiment Analysis</font><br>
Sentiment analysis is the process of understanding the opinion of an author about a subject. It comprises three elements:
1. The opinion (*Positive - Neutral - Negative*) and/or emotion (*Joy - Surprise - Anger - Disgust*) 
2. The subject being talked about
3. The opinion holder<br>

Sentiment analysis is used to give insight into how people are talking about a subject.<BR> 
Some common areas of application are:
- Social media monitoring
- Brand monitoring
- Customer service
- Product analysis
- Market research and analysis<BR>

<font size=5 color=darkgreen>Movie Review Model</font><br>
This notebook uses a Naive Bayes classifier to predict whether the sentiment of a given movie review is positive or negative.<br>
The model uses a cleaned version of the data found on Kaggle:<br> https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews#*<br><br>
See the other notebook for the data cleaning:<BR>
https://github.com/michael-william/Sentiment-analysis/blob/master/Sentiment-analysis.ipynb<br><br>
**Steps for building the model:**<br>
1. Importing libraries and reading data
2. Organizing the data 
3. Tokenizing the data
4. Removing noise from the data
5. Normalizing the data
6. Checking the word density
7. Preparing the model
8. Building the model
9. Conclusion

# <font color=teal>Import data</font>

## Importing required libraries

In [28]:
# Data manipulation
import pandas as pd
import numpy as np
import random
from wordcloud import STOPWORDS
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize
from langdetect import detect_langs
from nltk.tag import pos_tag
from nltk import WordNetLemmatizer
from nltk import FreqDist
from nltk import classify
from nltk import NaiveBayesClassifier
import re, string


#ML libraries
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Options for pandas
pd.options.display.max_columns
pd.options.display.max_rows = 30

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michaelcondon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaelcondon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/michaelcondon/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

20

## Importing data

In [9]:
# Reading data from Git
source = 'https://github.com/michael-william/Sentiment-analysis/raw/master/IMDB_Dataset.csv'
df = pd.read_csv(source)

# <font color=teal> Organizing the data</font>
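We need to split the data into 'train' and 'test' sets so that we can train the model on one part of the data and then check its accuracy on reviews it has not seen but whose sentiment is known.<br>
As a first step, this phase separates the original dataframe into 2 dataframes (pos_df and neg_df) by sentiment label and converts their reviews into lists of strings. The train/test split itself is made once the data has been prepared for the model.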

In [12]:
pos_df = df[df.sentiment=='positive']
neg_df = df[df.sentiment=='negative']

In [14]:
pos_strings = list(pos_df.review.astype('str'))
neg_strings = list(neg_df.review.astype('str'))

# <font color=teal> Tokenizing the data</font>
We need to process the language of the reviews into a format that can be understood by the machine. Tokenization splits the string version of each review into smaller parts called tokens. Tokens can consist of words, emoticons, hashtags, etc.<br>
The final output of this phase should be 2 objects (pos_tokens and neg_tokens), each a list of reviews in which every review is itself a list of tokens.<br>
A tokenized review should look like this:<br>
['A', 'wonderful', 'little', 'production', '.', '<', 'br', '/', '>']
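
To illustrate what tokenization produces, here is a minimal sketch that uses a regular expression as a simplified stand-in for NLTK's `word_tokenize`. The helper name `simple_tokenize` and the pattern are assumptions made for this illustration; `word_tokenize` treats contractions and several other cases differently.

```python
import re

# Hypothetical stand-in for word_tokenize: a run of word characters is one
# token, and every other non-space character is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split a review string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

review = "A wonderful little production.<br />"
tokens = simple_tokenize(review)
print(tokens)
```

Note that the HTML line break is split into several tokens ('<', 'br', '/', '>'), which is why 'br' shows up later as a custom stop word.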

In [15]:
# tokenizing positive and negative reviews
pos_tokens = [word_tokenize(x) for x in pos_strings]
neg_tokens = [word_tokenize(x) for x in neg_strings]

# <font color=teal> Remove Noise</font>
Noise is any part of the text that does not add meaning or information. Stopwords are a type of noise and consist of words like 'a', 'the', and 'and'. Other noise can be hyperlinks, symbols, and other special characters.<br>
The final output of this phase will be a 'de-noised' version of our token objects.
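
As a rough sketch of the idea, the filter below drops punctuation, stop words, and non-alphabetic tokens from a single tokenized review. The helper `is_noise` and the tiny stop-word set are hypothetical and only serve this example; the notebook itself uses a much larger stop-word list.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration.
demo_stop_words = {"a", "the", "and", "br"}

def is_noise(token, stop_words):
    """Return True for punctuation, stop words, and non-alphabetic tokens."""
    return (
        token in string.punctuation
        or token.lower() in stop_words
        or not token.isalpha()
    )

tokens = ["A", "wonderful", "little", "production", ".", "<", "br", "/", ">"]
kept = [t.lower() for t in tokens if not is_noise(t, demo_stop_words)]
print(kept)
```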

## Creating a function to remove noise from our tokenized lists

In [16]:
# defining a frozen set of stop words
stop_words = ENGLISH_STOP_WORDS.union(["'s",'br','film','movie',"''",'``',"n't",
                                      '1/2',])

In [67]:
# function for removing noise

def remove_noise(tokens, stop_words = ()):
    cleaned_tokens = []
    
    for token in tokens:
        # removing hyperlinks (raw strings keep the regex escapes intact)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', "", token)
        # removing @mentions
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)
        
        # keeping only alphabetic tokens that are not punctuation or stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words and token.isalpha():
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

## Applying our noise removal function to create clean token lists

In [68]:
# creating list of clean pos_tokens
clean_pos_tokens = [remove_noise(x,stop_words) for x in pos_tokens]

In [69]:
# creating list of clean neg_tokens
clean_neg_tokens = [remove_noise(x,stop_words) for x in neg_tokens]

# <font color=teal> Normalizing the data</font>
Words have different forms: for instance, 'ran', 'run', and 'running' are various forms of the same verb 'run'. Normalization is the process of converting a word to its canonical form. It helps with grouping words with the same meaning but different forms.<br>
We will be using lemmatization, which normalizes a word using vocabulary and morphological analysis of the words in the text.<br>A few steps need to be completed:<br>
1. We will use 'wordnet' from nltk as a lexical database for the English language to help determine the base word (downloaded at the beginning of this notebook).
2. We will use 'averaged_perceptron_tagger' as the resource that determines the part of speech of a word from its context in a sentence.
3. We will use pos_tag to tag each word with its part of speech (noun, verb, etc.) so that the lemmatizer knows how to treat it.

Stemming is another method of normalization; it is faster, but not as accurate as lemmatization.

The output of this phase is a normalized version of every tokenized review.
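
The link between the tagger and the lemmatizer is a small mapping: `pos_tag` returns Penn Treebank tags such as 'NNS' or 'VBG', while the WordNet lemmatizer expects a one-letter part-of-speech code. A minimal sketch of that mapping, using a hypothetical helper name `penn_to_wordnet` and the same three-way split as this notebook:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter code WordNet expects."""
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS: nouns
        return "n"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    return "a"                 # everything else is treated as an adjective

for tag in ["NNS", "VBG", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Treating every remaining tag as an adjective is a simplification; adverbs ('RB'), for example, could be mapped to WordNet's 'r' code instead.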

## Creating a function to lemmatize our token lists

In [70]:
# A function that takes a list of tokens and applies lemmatization
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer() # initiating WordNetLemmatizer
    lemmatized_sentence = [] # creating empty list to hold words after they've been lemmatized
    for word, tag in pos_tag(tokens): # iterating through tokens and their part-of-speech tags
        if tag.startswith('NN'): # NN=Noun
            pos='n'
        elif tag.startswith('VB'): # VB=Verb
            pos='v'
        else:
            pos = 'a' # everything else is treated as an adjective
        lemmatized_sentence.append(lemmatizer.lemmatize(word,pos)) # appending the lemma of the word
    return lemmatized_sentence

## Applying our lemmatize function to create new cleaned token lists

In [71]:
pos_lemmatized = [lemmatize_sentence(x) for x in clean_pos_tokens]
neg_lemmatized = [lemmatize_sentence(x) for x in clean_neg_tokens]

# <font color=teal> Determine word density</font>
We want to create a bank of words associated with each of our cleaned, tokenized lists (positive and negative).
The end output of this phase is a list of the 10 most common words in the positive reviews and in the negative reviews.
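
The counting step can be sketched with the standard library alone. The example below uses `collections.Counter`, which offers the same `most_common` method as NLTK's `FreqDist`; the review data is made up for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical cleaned, lemmatized reviews (one token list per review).
demo_reviews = [
    ["great", "story", "great", "character"],
    ["good", "story"],
    ["great", "time"],
]

# Flatten the list of lists into one stream of tokens and count them.
word_counts = Counter(chain.from_iterable(demo_reviews))
top_two = word_counts.most_common(2)
print(top_two)
```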

## Creating a function to get all words from a cleaned token list

In [72]:
def get_all_words(cleaned_tokens):
    for tokens in cleaned_tokens:
        for token in tokens:
            yield token

## Checking for top common words in positive and negative reviews

In [73]:
freq_dist_pos = FreqDist(get_all_words(pos_lemmatized)).most_common(10)
freq_dist_neg = FreqDist(get_all_words(neg_lemmatized)).most_common(10)

In [74]:
freq_dist_pos

[('do', 19746),
 ('like', 19146),
 ('good', 19095),
 ('time', 15397),
 ('great', 14186),
 ('just', 14043),
 ('character', 13712),
 ('story', 13711),
 ('make', 13036),
 ('watch', 12059)]

In [75]:
freq_dist_neg

[('do', 24821),
 ('like', 23411),
 ('just', 21001),
 ('bad', 20816),
 ('good', 20800),
 ('make', 15215),
 ('time', 14366),
 ('watch', 14057),
 ('character', 14046),
 ('really', 12293)]

# <font color=teal> Prepare the model</font>
First, we will prepare the data to be fed into the model. We will use the Naive Bayes classifier in NLTK to perform the modeling exercise. The model requires not just a list of words in a review, but a Python dictionary with words as keys and True as values.
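
To make the expected input format concrete, here is a small sketch of one labeled example. The helper name `to_feature_dict` is hypothetical; it is meant to show the shape of the data rather than replace the generator used in this notebook.

```python
def to_feature_dict(tokens):
    """Turn a token list into the {word: True} mapping used as model input."""
    return {token: True for token in tokens}

features = to_feature_dict(["great", "story", "great"])
labeled_example = (features, "Positive")
print(labeled_example)
```

Because dictionary keys are unique, a word that appears several times in a review is recorded only once: the model sees whether a word is present, not how often.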

In [76]:
# generator function to change the format of the cleaned data to dictionary format
def get_tokens_for_model(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tokens)


In [77]:
# applying generator to pos_lemmatized list
positive_dataset = [(token_dict, "Positive")
                     for token_dict in get_tokens_for_model(pos_lemmatized)]

# applying generator to neg_lemmatized list
negative_dataset = [(token_dict, "Negative")
                     for token_dict in get_tokens_for_model(neg_lemmatized)]

# combining pos and neg to get a complete dataset
dataset = positive_dataset + negative_dataset

# shuffling order of full dataset
random.shuffle(dataset)

# splitting the full dataset into train and test sets
total_observations = len(dataset)
cutoff = int(np.ceil((total_observations * .5)))
train_data = dataset[:cutoff]
test_data = dataset[cutoff:]
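
Because the shuffle is not seeded, the split and therefore the reported accuracy can vary slightly between runs. A minimal sketch of a repeatable split, assuming a fixed seed (the value 42 and the helper name `split_dataset` are arbitrary choices for this example):

```python
import random

def split_dataset(dataset, train_fraction=0.5, seed=42):
    """Shuffle a copy of the dataset with a fixed seed, then cut it in two."""
    shuffled = list(dataset)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical labeled examples in the same (features, label) format.
demo = [({"word%d" % i: True}, "Positive" if i % 2 else "Negative")
        for i in range(10)]
train, test = split_dataset(demo)
print(len(train), len(test))
```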

# <font color=teal> Build and test the model</font>
We will be using the NaiveBayesClassifier to build the model. The trained classifier can give us the probability of a positive sentiment given a written review.
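
For intuition about what a Naive Bayes classifier does, the sketch below scores each label as the log prior plus the sum of the log likelihoods of the words present in a review. It is a simplified illustration written from scratch, not NLTK's implementation: the function names, the add-one smoothing, and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_examples):
    """Count how many examples of each label contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for features, label in labeled_examples:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
    return label_counts, word_counts

def classify(features, label_counts, word_counts):
    """Return the label with the highest log posterior score."""
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior of the label
        for word in features:
            # add-one smoothing keeps unseen words from zeroing the score
            p_word = (word_counts[label][word] + 1) / (n_label + 2)
            score += math.log(p_word)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data in the (features, label) format.
demo_train = [
    ({"great": True, "story": True}, "Positive"),
    ({"wonderful": True, "great": True}, "Positive"),
    ({"bad": True, "story": True}, "Negative"),
    ({"unwatchable": True, "bad": True}, "Negative"),
]
model = train_naive_bayes(demo_train)
prediction = classify({"great": True, "wonderful": True}, *model)
print(prediction)
```

The "naive" part is the assumption that words occur independently of each other given the label, which is what allows the per-word likelihoods to be combined by simple addition in log space.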

In [78]:
# Initiating the classifier and training it on the train data
classifier = NaiveBayesClassifier.train(train_data)

# checking the accuracy of the model by applying it to the test data
print("Accuracy is:", classify.accuracy(classifier, test_data))

# getting the top 10 words impacting the model
# (show_most_informative_features prints its table itself and returns None)
classifier.show_most_informative_features(10)

Accuracy is: 0.85728
Most Informative Features
                     uwe = True           Negati : Positi =     30.1 : 1.0
             abomination = True           Negati : Positi =     29.4 : 1.0
                    boll = True           Negati : Positi =     28.1 : 1.0
                bearable = True           Negati : Positi =     24.1 : 1.0
             unwatchable = True           Negati : Positi =     22.4 : 1.0
            collaborator = True           Positi : Negati =     20.5 : 1.0
               geraldine = True           Positi : Negati =     19.2 : 1.0
                   felix = True           Positi : Negati =     16.7 : 1.0
                  farley = True           Positi : Negati =     16.5 : 1.0
               camcorder = True           Negati : Positi =     16.3 : 1.0


# <font color=teal> Conclusion</font>
This model was able to correctly guess the sentiment (positive or negative) of a written review with an accuracy of 85.7%. This is an improvement over the basic 'polarity' calculation from the TextBlob library, which was only accurate to 71%, nearly 15 percentage points lower than our new model. However, this new model is computationally heavy and takes a long time to train, so deployment on a large scale may be difficult.

## All code needed to build and run model

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import random
import re, string
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, WordNetLemmatizer, classify, NaiveBayesClassifier
from nltk.tag import pos_tag

#ML libraries
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Reading data from Git
source = 'https://github.com/michael-william/Sentiment-analysis/raw/master/IMDB_Dataset.csv'
df = pd.read_csv(source)

# Splitting data into positive and negative dataframes
pos_df = df[df.sentiment=='positive']
neg_df = df[df.sentiment=='negative']

# Converting review columns of dataframes to a list of strings
pos_strings = list(pos_df.review.astype('str'))
neg_strings = list(neg_df.review.astype('str'))

# Tokenizing positive and negative lists
pos_tokens = [word_tokenize(x) for x in pos_strings]
neg_tokens = [word_tokenize(x) for x in neg_strings]

# Defining a frozen set of stop words with additional elements specifically for this dataset
stop_words = ENGLISH_STOP_WORDS.union(["'s",'br','film','movie',"''",'``',"n't"])

# Function for removing noise
def remove_noise(tokens, stop_words = ()):
    cleaned_tokens = []
    
    for token in tokens:
        # removing hyperlinks (raw strings keep the regex escapes intact)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*/\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', "", token)
        # removing @mentions
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)
        
        # keeping only alphabetic tokens that are not punctuation or stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words and token.isalpha():
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

# Creating list of clean pos_tokens and neg_tokens
clean_pos_tokens = [remove_noise(x,stop_words) for x in pos_tokens]
clean_neg_tokens = [remove_noise(x,stop_words) for x in neg_tokens]

# A function that takes a list of tokens and applies lemmatization
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer() # initiating WordNetLemmatizer
    lemmatized_sentence = [] # creating empty list to hold words after they've been lemmatized
    for word, tag in pos_tag(tokens): # iterating through tokens and their part-of-speech tags
        if tag.startswith('NN'): # NN=Noun
            pos='n'
        elif tag.startswith('VB'): # VB=Verb
            pos='v'
        else:
            pos = 'a' # everything else is treated as an adjective
        lemmatized_sentence.append(lemmatizer.lemmatize(word,pos))
    return lemmatized_sentence

# Lemmatizing cleaned tokens
pos_lemmatized = [lemmatize_sentence(x) for x in clean_pos_tokens]
neg_lemmatized = [lemmatize_sentence(x) for x in clean_neg_tokens]

# Generator function to change the format of the cleaned data to dictionary format
def get_tokens_for_model(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tokens)

# applying generator to pos_lemmatized list
positive_dataset = [(token_dict, "Positive")
                     for token_dict in get_tokens_for_model(pos_lemmatized)]

# applying generator to neg_lemmatized list
negative_dataset = [(token_dict, "Negative")
                     for token_dict in get_tokens_for_model(neg_lemmatized)]

# combining pos and neg to get a complete dataset
dataset = positive_dataset + negative_dataset

# shuffling order of full dataset
random.shuffle(dataset)

# splitting the full dataset into equal size train and test sets
total_observations = len(dataset)
cutoff = int(np.ceil((total_observations * .5)))
train_data = dataset[:cutoff]
test_data = dataset[cutoff:]

# Initiating the classifier and training it on the train data
classifier = NaiveBayesClassifier.train(train_data)

# checking the accuracy of the model by applying it to the test data
print("Accuracy is:", classify.accuracy(classifier, test_data))

# getting the top 10 words impacting the model
# (show_most_informative_features prints its table itself and returns None)
classifier.show_most_informative_features(10)

***Special thanks to Shaumik Daityari and his walk-through on sentiment analysis which was the base used for this notebook***
https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk