# Sentiment Analysis

This notebook builds a Naive Bayes classifier that labels a piece of text as positive, negative, or neutral.


First, import NLTK and download the data packages the notebook needs.

In [49]:
import nltk

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

nltk.download('punkt')
nltk.download('twitter_samples')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

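To train the model, we use NLTK's twitter_samples corpus: 1,500 positive tweets, 1,500 negative tweets, and 3,000 tweets from tweets.20150430-223406.json, which we treat as neutral.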


In [50]:
from nltk.corpus import twitter_samples

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[:1500]
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')[:1500]
neutral_tweet_tokens = twitter_samples.tokenized('tweets.20150430-223406.json')[:3000]



In [51]:
print(len(positive_tweet_tokens))
print(len(negative_tweet_tokens))
print(len(neutral_tweet_tokens))

1500
1500
3000


In [52]:
def get_data_model(cleaned_tokens_list):
    # Yield one {token: True} feature dictionary per tweet.
    for tokens in cleaned_tokens_list:
        yield {token: True for token in tokens}

The functions below, clean() and lemmatize_sentence(), clean the data by removing stop words, URLs, mentions, digits, single-character tokens, and punctuation.

We do this using regular expressions and WordNetLemmatizer.

Lemmatization reduces each token to a dictionary form of the word and relies on a part-of-speech tag for every token, so this step takes a while to run.
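
As a rough illustration of the regex part of this cleaning, the sketch below strips URLs, mentions, and digits from a few made-up tokens using only the standard library. The function name strip_noise, the simplified URL pattern, and the sample tokens are assumptions for this example and are not part of the notebook; stop word removal and lemmatization are left out.

```python
import re

# Simplified patterns, assumed for this sketch only.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@[A-Za-z0-9_]+")
DIGITS = re.compile(r"[0-9_]+")


def strip_noise(tokens):
    """Lowercase tokens, remove URLs, mentions and digits, drop what is too short."""
    kept = []
    for raw in tokens:
        token = raw.lower()
        for pattern in (URL, MENTION, DIGITS):
            token = pattern.sub("", token)
        # Keep only tokens longer than one character.
        if len(token) > 1:
            kept.append(token)
    return kept


sample = ["@user", "Loving", "the", "new", "phone", "!", "https://t.co/abc123", "2015"]
print(strip_noise(sample))  # ['loving', 'the', 'new', 'phone']
```

URLs and mentions are removed before any punctuation is touched, because their patterns depend on characters such as ':' '/' and '@' still being present.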


In [53]:
import re, string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [54]:
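# Build the stop word set once; calling stopwords.words() for every token is slow.
# With no argument it returns the stop words of every language NLTK bundles.
STOP_WORDS = set(stopwords.words())

# Compile the patterns once, as raw strings so the backslashes reach the regex engine.
URL = re.compile(r'https?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                 r'(?:%[0-9a-fA-F][0-9a-fA-F]))+')
MENTION = re.compile(r'@[A-Za-z0-9_]+')
REPLACE_NO_SPACE = re.compile(r"[.;!'/?,#\"\[\]]")
DIGITS = re.compile(r'[0-9_]+')


def clean(tokens):
    data_clean = []

    for data in tokens:
        token = data.lower()
        # Strip URLs and mentions first, while ':' '/' '.' and '@' are still intact.
        token = URL.sub('', token)
        token = MENTION.sub('', token)
        token = REPLACE_NO_SPACE.sub('', token)
        token = DIGITS.sub('', token)
        if token not in STOP_WORDS and len(token) > 1 and token not in string.punctuation:
            data_clean.append(token)

    return lemmatize_sentence(data_clean)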


def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []

    for token, tag in pos_tag(tokens):
        #NN - noun, VB - verb
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation:
            lemmatized_sentence.append(token.lower())

    return lemmatized_sentence

Now we clean the positive, negative, and neutral datasets with the functions defined above.

This can take a while, depending on the size of your sample. Please wait for it to complete.

In [55]:
positive_cleaned_tokens_list = [clean(tokens) for tokens in positive_tweet_tokens]
negative_cleaned_tokens_list = [clean(tokens) for tokens in negative_tweet_tokens]
neutral_cleaned_tokens_list = [clean(tokens) for tokens in neutral_tweet_tokens]

In [57]:
print(positive_cleaned_tokens_list[200])


['anyway', ':-)']


Now that the data is clean, we build the model input with the get_data_model() function defined above.
NLTK's Naive Bayes classifier expects each sample as a Python dictionary of features rather than a list of words, so every token becomes a key with the value True.
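
For illustration, this is roughly what one such sample looks like for a made-up cleaned tweet. The helper name to_features and the tokens are hypothetical and only used in this example.

```python
def to_features(tokens):
    """Map each token to True, giving one presence-feature dictionary per tweet."""
    return {token: True for token in tokens}


cleaned_tweet = ["great", "day", "great", "friend"]
features = to_features(cleaned_tweet)
print(features)  # {'great': True, 'day': True, 'friend': True}

# A training sample pairs the feature dictionary with its label.
labelled_sample = (features, "Positive")
```

Repeated tokens collapse into a single key, so the features record only whether a word is present, not how often it occurs.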

We use a threshold of 0.8 to split each class into training and test samples, i.e. 80% training data and 20% test data.
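
The split itself is plain list slicing. A minimal sketch with a stand-in list of ten samples follows; split_by_threshold is a hypothetical helper, not a function from the notebook.

```python
def split_by_threshold(samples, threshold=0.8):
    """Return (train, test), where train holds the first `threshold` share of samples."""
    cut = int(threshold * len(samples))
    return samples[:cut], samples[cut:]


samples = list(range(10))  # stand-in for a list of (features, label) pairs
train_part, test_part = split_by_threshold(samples)
print(len(train_part), len(test_part))  # 8 2
```

Because the slices are taken in order, the split is deterministic. If the order of the source data might matter, shuffling before splitting is a common precaution.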

In [73]:
positive_tokens_for_model = get_data_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_data_model(negative_cleaned_tokens_list)
neutral_tokens_for_model = get_data_model(neutral_cleaned_tokens_list)

positive_dataset = [(data, "Positive") for data in positive_tokens_for_model]

negative_dataset = [(data, "Negative") for data in negative_tokens_for_model]

neutral_dataset = [(data, "Neutral") for data in neutral_tokens_for_model]

threshold = 0.8
pos_len = int(threshold*len(positive_dataset))
neg_len = int(threshold*len(negative_dataset))
neu_len = int(threshold*len(neutral_dataset))

train_data = positive_dataset[:pos_len] + negative_dataset[:neg_len] + neutral_dataset[:neu_len]
test_data = positive_dataset[pos_len:] + negative_dataset[neg_len:] + neutral_dataset[neu_len:]




In [76]:
print(len(train_data))
print(len(test_data))

4800
1200


Finally, we train the model with NLTK's NaiveBayesClassifier.
We then store the trained model in a pickle file on local storage, so the API can load it instead of retraining when it scores a user-provided string.

In [77]:
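from nltk import classify, NaiveBayesClassifier
import os
import pickle

classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data)*100)

# Create the output directory if needed and close the file once the model is written.
os.makedirs('models', exist_ok=True)
with open('models/final_prediction.pickle', 'wb') as f:
    pickle.dump(classifier, f)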

Accuracy is: 99.41666666666666
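
To give a feel for what a Naive Bayes classifier does with presence features, here is a small from-scratch sketch with add-one smoothing. It is a simplification and not NLTK's implementation; the class name TinyNaiveBayes and the training samples are made up for this example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Naive Bayes over {token: True} presence features with add-one smoothing."""

    def fit(self, labelled_samples):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for features, label in labelled_samples:
            self.label_counts[label] += 1
            for token in features:
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def classify(self, features):
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(token present | label)
            score = math.log(count / total)
            for token in features:
                if token in self.vocab:  # skip tokens never seen in training
                    seen = self.token_counts[label][token]
                    score += math.log((seen + 1) / (count + 2))
            scores[label] = score
        return max(scores, key=scores.get)


toy_train = [
    ({"great": True, "day": True}, "Positive"),
    ({"love": True, "great": True}, "Positive"),
    ({"awful": True, "day": True}, "Negative"),
    ({"hate": True, "awful": True}, "Negative"),
]
toy_model = TinyNaiveBayes().fit(toy_train)
print(toy_model.classify({"great": True, "movie": True}))  # Positive
print(toy_model.classify({"awful": True}))  # Negative
```

Each label's score starts from how common the label is and is then adjusted by how often each input token appeared with that label in training. The label with the highest score wins.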


In [78]:
modelfile = 'models/final_prediction.pickle'
# Use a context manager so the file is closed after loading.
with open(modelfile, 'rb') as f:
    model = pickle.load(f)
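
The save and load steps above form a simple round trip. The sketch below shows the same pattern on a stand-in object written to a temporary directory; the dictionary and the file name model.pickle are placeholders, not the notebook's classifier or path.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable object behaves the same way.
stand_in_model = {"labels": ["Positive", "Negative", "Neutral"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pickle")
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.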

In [89]:
def find(text):
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text)
    # Build a {token: True} feature dictionary, the format the classifier was trained on.
    prediction = model.classify({token: True for token in tokens})
    return prediction


In [91]:
text_1 = "This was an amazing movie!"
text_2 = "It was average"
text_3 = "Life is terrible."

print(find(text_1))
print(find(text_2))
print(find(text_3))

Positive
Neutral
Negative
