# +++++++++ Sentiment Analysis Classifier ++++++++++

Importing the packages required to build the classifier

In [1]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

Defining a function that turns a list of words into a feature dictionary

In [3]:
def extract_features(word_list):
    # Map every distinct word to True (bag-of-words presence features)
    return {word: True for word in word_list}
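
To see what this function returns, the sketch below runs it on a short word list. The sample words are made up for demonstration only. Repeated words collapse into a single key, so the dictionary records whether a word is present, not how often it occurs.

```python
def extract_features(word_list):
    # Map every distinct word to True (bag-of-words presence features)
    return {word: True for word in word_list}

sample_words = ["a", "great", "great", "movie"]  # hypothetical input
features = extract_features(sample_words)
print(features)
```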

Using the NLTK movie review corpus to train the classifier

In [9]:
# Load the file IDs of the positive and negative reviews
positive_fileids = movie_reviews.fileids('pos')
negative_fileids = movie_reviews.fileids('neg')

Building labelled feature sets from the positive and negative reviews

In [10]:
features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive') for f in positive_fileids]
features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative') for f in negative_fileids]

Dividing the data into training and test sets

In [11]:
# Split the data into train and test (80/20)
threshold_factor = 0.8
threshold_positive = int(threshold_factor * len(features_positive))
threshold_negative = int(threshold_factor * len(features_negative))

Building the training and test sets

In [12]:
features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
print("\nNumber of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))


Number of training datapoints: 1600
Number of test datapoints: 400
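
The split arithmetic can be checked with a minimal sketch on placeholder lists. It assumes 1,000 items per class, which is consistent with the counts printed above. The `pos` and `neg` lists and the `split_by_class` helper are hypothetical stand-ins for the real feature lists.

```python
def split_by_class(pos, neg, factor=0.8):
    """Cut each class at the same fraction so both sets stay balanced."""
    cut_pos = int(factor * len(pos))
    cut_neg = int(factor * len(neg))
    train = pos[:cut_pos] + neg[:cut_neg]
    test = pos[cut_pos:] + neg[cut_neg:]
    return train, test

# Placeholder data: 1000 labelled items per class (assumed sizes)
pos = [({"word": True}, "Positive")] * 1000
neg = [({"word": True}, "Negative")] * 1000

train, test = split_by_class(pos, neg)
print(len(train), len(test))
```

Because the slices are taken in file order and not shuffled, the test set always contains the last 20% of each class.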


Training Naive Bayes Classifier

In [15]:
# Train a Naive Bayes classifier
classifier = NaiveBayesClassifier.train(features_train)
print("Accuracy:", nltk.classify.util.accuracy(classifier, features_test))

Accuracy: 0.735
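
Accuracy here is the fraction of test items whose predicted label matches the true label. With 400 test reviews, 0.735 corresponds to 294 correct predictions. A minimal sketch of that calculation follows. It uses a hypothetical stand-in classifier (`KeywordClassifier`) and four made-up test items, so it runs without the corpus.

```python
class KeywordClassifier:
    """Hypothetical stand-in: predicts 'Negative' when 'bad' is present."""
    def classify(self, featureset):
        return "Negative" if "bad" in featureset else "Positive"

def accuracy(classifier, gold):
    # gold is a list of (featureset, label) pairs, like features_test
    correct = sum(classifier.classify(fs) == label for fs, label in gold)
    return correct / len(gold)

gold = [
    ({"good": True}, "Positive"),
    ({"bad": True}, "Negative"),
    ({"not": True, "bad": True}, "Positive"),  # the stand-in gets this wrong
    ({"dull": True}, "Negative"),              # the stand-in gets this wrong
]
score = accuracy(KeywordClassifier(), gold)
print(score)
```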


Finding the words that most strongly indicate whether a review is positive or negative

In [17]:
print("Top 10 most informative words:")
for item in classifier.most_informative_features()[:10]:
    print(item[0])

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
astounding
avoids
fascination
symbol
animators


Creating sample inputs to test the classifier

In [30]:
# Sample input 
input = [
    "It is a awesome movie, you should watch it", 
    "This is a bad movie. I would never recommend it to anyone.",
    "The action of salman khan is pretty great in this movie", 
    "The direction was bad" 
]

Running the classifier on those inputs

In [33]:
print( "Predictions:")
for review in input:
    print("\nReview:", review)
    probdist = classifier.prob_classify(extract_features(review.split()))
    pred_sentiment = probdist.max()
    print(pred_sentiment)

Predictions:

Review: It is a awesome movie, you should watch it
Positive

Review: This is a bad movie. I would never recommend it to anyone.
Negative

Review: The action of salman khan is pretty great in this movie
Positive

Review: The direction was bad
Negative


Printing the predictions together with their probabilities

In [38]:
print( "\nPredictions:")
for review in input:
    print("\nReview:", review)
    probdist = classifier.prob_classify(extract_features(review.split()))
    pred_sentiment = probdist.max()
    print("Sentiment:", pred_sentiment)
    print( "Probability:", round(probdist.prob(pred_sentiment),2))


Predictions:

Review: It is a awesome movie, you should watch it
Sentiment: Positive
Probability: 0.63

Review: This is a bad movie. I would never recommend it to anyone.
Sentiment: Negative
Probability: 0.64

Review: The action of salman khan is pretty great in this movie
Sentiment: Positive
Probability: 0.61

Review: The direction was bad
Sentiment: Negative
Probability: 0.62
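
Note that `review.split()` splits on whitespace only, so tokens such as `movie,` and `It` keep their punctuation and capitalisation. If the training tokens are lowercase with punctuation separated (we believe this is how the `movie_reviews` corpus is tokenised, but treat it as an assumption), such tokens will not match any trained feature. Below is a minimal sketch of a tokenizer that normalises the input first. `simple_tokenize` is a hypothetical helper, not part of NLTK.

```python
import re

def simple_tokenize(text):
    # Lowercase, then keep runs of letters and apostrophes as tokens
    return re.findall(r"[a-z']+", text.lower())

review = "It is a awesome movie, you should watch it"
print(review.split())           # whitespace split keeps "It" and "movie,"
print(simple_tokenize(review))  # normalised tokens
```

The normalised tokens could then be passed to `extract_features` in place of `review.split()`.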


How does the code work?

We use NLTK’s Naive Bayes classifier. The feature extractor collects the unique words of a review and stores them in a dictionary that maps each word to True, because NLTK classifiers expect the features of each item as a dictionary.

After dividing the data into training and test sets, we train the classifier to label reviews as positive or negative.

If you look at the most informative words, you can see words such as “outstanding”, which indicate positive reviews, and words such as “insulting”, which indicate negative reviews. This is useful because it tells us which words signal strong reactions.

This is how we perform sentiment analysis in Python.
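
To make the idea behind the classifier more concrete, here is a simplified Naive Bayes sketch written from scratch. It is not NLTK's implementation: it uses add-one smoothing and scores only the words present in the input, both of which are simplifying assumptions. The function names (`train_nb`, `classify_nb`) and the four training items are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count how many items per label exist and how many contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify_nb(model, features):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # log prior plus the log likelihood of each known word
        score = math.log(n / total)
        for word in features:
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (n + 2))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data in the same (featureset, label) shape as above
train = [
    ({"outstanding": True, "film": True}, "Positive"),
    ({"great": True, "film": True}, "Positive"),
    ({"insulting": True, "film": True}, "Negative"),
    ({"bad": True, "plot": True}, "Negative"),
]
model = train_nb(train)
print(classify_nb(model, {"outstanding": True, "film": True}))
print(classify_nb(model, {"insulting": True, "unseen": True}))
```

A word such as "outstanding" pushes the score towards the label it was seen with in training, which is why such words show up among the most informative features.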

# ++++++++ Web Scraping for reviews ++++++++++

Importing the required libraries

In [39]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

xxxxxxxx product name

In [42]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "D:/DataScience/Introtallent/Python Project.csv"
# Every row written below has a single field, so pandas later reads
# the review text into the first column ("Index")
with open(file, "w", encoding="utf-8") as f:
    f.write("Index,Reviews\n")
    for page in range(1, 8):
        print(page)
        url = 'https://www.ecommerceWebsiteName.com/abc/product-reviews/xyzxyz/ref=cm_cr_getr_d_paging_btm_prev_{}?ie=UTF8&reviewerType=all_reviews&pageNumber={}'.format(page, page)
        soup = BeautifulSoup(urlopen(url), "html.parser")
        # Each review sits in its own container div
        for block in soup.find_all("div", {"class": "a-section review aok-relative"}):
            body = block.find("span", {"class": "a-size-base review-text review-text-content"})
            if body is None:  # skip containers without a review body
                continue
            # Collapse whitespace so each review occupies exactly one line
            text = " ".join(body.get_text().split())
            # Replace commas so the text stays in a single CSV field
            f.write(text.replace(",", "|") + "\n")

1
2
3
4
5
6
7
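
The scraping loop replaces commas with `|` so that review text does not spill into extra CSV columns. An alternative, sketched here under the assumption that standard CSV quoting is acceptable downstream, is to let Python's `csv` module quote the fields. The sample reviews are made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Made-up sample reviews containing a comma and double quotes
reviews = ["Soft, comfortable sheets", 'Said "Egyptian" but felt synthetic']

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerow(["Index", "Reviews"])
for index, text in enumerate(reviews, start=1):
    writer.writerow([index, text])

# Reading the data back recovers the original text, commas and quotes included
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```

With this layout the review text sits in the `Reviews` column, so the later cells would need to read that column instead of `Index`.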


Importing other libraries

In [44]:
import pandas as pd

In [45]:
review_data = pd.read_csv("D:/DataScience/Introtallent/Python Project.csv")

Pre-processing our dataset

In [47]:
import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    # remove HTML comments (<!-- ... -->)
    text=re.sub("<!--?.*?-->","",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text
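
As a quick check of what `pre_process` does, the sketch below repeats the function and applies it to a made-up review string. Digits and punctuation, including apostrophes, are each replaced by a space, which is why "can't" appears as "can t" in the cleaned reviews printed further down.

```python
import re

def pre_process(text):
    text = text.lower()                      # lowercase
    text = re.sub("<!--?.*?-->", "", text)   # drop HTML comments
    text = re.sub("(\\d|\\W)+", " ", text)   # digits/punctuation -> one space
    return text

sample = "You can't go wrong, 5 stars!"  # made-up review text
cleaned = pre_process(sample)
print(repr(cleaned))  # 'you can t go wrong stars '
```

Note the trailing space in the result. It is harmless here because the text is later split on whitespace.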

In [48]:
review_data['Index'] = review_data['Index'].apply(pre_process)

Converting our dataset to a list

In [50]:
doc = review_data['Index'].tolist()

Running the pre-built classifier on the scraped reviews and obtaining the predictions

In [52]:
print("Predictions:")
for review in doc:
    print("\nReview:", review)
    probdist = classifier.prob_classify(extract_features(review.split()))
    pred_sentiment = probdist.max()
    print("\nSentiment:",pred_sentiment)

Predictions:

Review: especially with an item that is counterfeited as often as egyptian cotton sheets you can t just depend on the product description you must also consider who is selling the product lowest price is not always the most important factor someone can list a product labeled egyptian quality but that is really just synthetic material under the same label as the genuine product i suspect that many of the poorer ratings here came from that situation look for sellers who have a good rating for many sales not just a small number i can only rate the product received based on the particular seller i bought from egyptian cotton factory outlet store i chose this seller because of their high ratings and because i assumed that their shipping from inside the us would have faster delivery than other sellers who imported i am sure that many of the other sellers also have a good product but i did not purchase from them the product that i received from the seller that i chose was outsta