# Sentiment Analysis Using NLTK

Sentiment Analysis, or Opinion Mining, is a sub-field of Natural Language Processing (NLP) that tries to identify and extract opinions within a given text. Its aim is to gauge the attitudes, sentiments, evaluations, and emotions of a speaker or writer based on the computational treatment of subjectivity in a text.

## NLTK's built-in VADER module
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It combines a sentiment lexicon with a set of rules. A sentiment lexicon is a list of lexical features (e.g., words) which are generally labelled according to their semantic orientation as either positive or negative.
VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews, and product reviews. Besides telling us whether a text is positive or negative, VADER also tells us how positive or negative the sentiment is.

In [2]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/muhkhoi/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [4]:
sid = SentimentIntensityAnalyzer()

### Checking the sentiment of sentences
`polarity_scores` returns a dictionary of sentiment scores for the given sentence: `neg`, `neu`, and `pos` are the proportions of the text rated negative, neutral, and positive, while `compound` is a single overall score normalized to the range -1 (most negative) to +1 (most positive).

In [5]:
a = 'This is good'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.408, 'pos': 0.592, 'compound': 0.4404}
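
The compound value above can be reproduced by hand. This is a minimal sketch, assuming that VADER normalizes the summed word valences with x / sqrt(x² + α) using α = 15, and that the lexicon assigns "good" a valence of 1.9 while "This" and "is" carry none. Both numbers are assumptions worth checking against your installed `nltk` version, and `normalize` here is an illustrative helper, not an import from the library.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Assumption: "good" is the only scored word in 'This is good', with valence 1.9.
compound = round(normalize(1.9), 4)
print(compound)  # 0.4404
```

Because the denominator grows with the raw score, adding more strongly positive words pushes the result toward 1 without ever reaching it, which is why long, enthusiastic reviews often score above 0.9.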

In [6]:
a = 'This is the best, and awesome movie EVER MADE!!'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.441, 'pos': 0.559, 'compound': 0.8715}

In [7]:
a = 'This is WORST movie that ever disgraced the screen'
sid.polarity_scores(a)

{'neg': 0.528, 'neu': 0.472, 'pos': 0.0, 'compound': -0.8331}

### Analyzing Amazon Reviews Sentiment Data

In [8]:
import pandas as pd

In [9]:
df = pd.read_csv('amazonreviews.tsv',sep='\t')

In [10]:
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [11]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

### Checking for NaN or blank fields

In [15]:
df.isna().sum()

label     0
review    0
dtype: int64

In [16]:
df.dropna(inplace=True)

In [17]:
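blanks = []  # index labels of reviews that contain only whitespace
for i, lb, rv in df.itertuples():  # each row unpacks to (index, label, review)
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)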

In [19]:
blanks

[]

In [20]:
#df.drop(blanks,inplace=True)

In [24]:
print(sid.polarity_scores(df['review'][0]))
print(df['review'][0])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}
Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^


### Apply polarity_scores to each review

In [27]:
df['scores'] = df['review'].apply(sid.polarity_scores)

### Unpack the compound score into its own column

In [29]:
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])

In [44]:
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


### Labelling with a lambda expression

In [47]:
df['new_scores'] = df['compound'].apply(lambda scores: 'pos' if scores >= 0 else 'neg')

### Another way: using a function

In [50]:
def posneg(scores):
    if scores >= 0:
        return 'pos'
    else:
        return 'neg'

df['new_scores'] = df['compound'].apply(posneg)

In [54]:
df['new_scores'].value_counts()

pos    6944
neg    3056
Name: new_scores, dtype: int64
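
Note that the `>= 0` cutoff labels a review with a compound score of exactly 0 (for example, one with no lexicon words) as positive. A commonly cited convention for VADER is to treat scores of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. The sketch below assumes that convention; `three_way_label` and `threshold` are hypothetical names introduced for illustration.

```python
def three_way_label(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg', or 'neu' using a symmetric band."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'

# Compound scores taken from the example sentences earlier, plus a neutral 0.0.
examples = [0.4404, 0.0, -0.8331]
labels = [three_way_label(c) for c in examples]
print(labels)  # ['pos', 'neu', 'neg']
```

Since this dataset only has `pos` and `neg` labels, neutral predictions would have to be dropped or mapped to one of the two classes before comparing against `df['label']`.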

## Comparing VADER's predictions with the existing sentiment labels

In [55]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [56]:
accuracy_score(df['label'],df['new_scores'])

0.7091

In [58]:
print(classification_report(df['label'],df['new_scores']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [60]:
print(confusion_matrix(df['label'],df['new_scores']))

[[2622 2475]
 [ 434 4469]]
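
The accuracy and the per-class precision and recall reported above all follow from this confusion matrix. As a sketch, the arithmetic can be checked in plain Python, assuming scikit-learn's layout of true labels in rows and predicted labels in columns, with classes in sorted order (`neg`, then `pos`). The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, both ordered (neg, pos).
cm = [[2622, 2475],
      [434, 4469]]
(tn, fp), (fn, tp) = cm  # treating 'pos' as the positive class

total = tn + fp + fn + tp
accuracy = (tn + tp) / total       # share of all reviews labelled correctly
pos_precision = tp / (tp + fp)     # predicted pos that are truly pos
pos_recall = tp / (tp + fn)        # true pos that were found
neg_precision = tn / (tn + fn)     # predicted neg that are truly neg
neg_recall = tn / (tn + fp)        # true neg that were found

print(round(accuracy, 4))
print(round(neg_precision, 2), round(neg_recall, 2))
print(round(pos_precision, 2), round(pos_recall, 2))
```

The low recall for `neg` reflects the 2475 negative reviews that VADER scored at or above 0, which is consistent with the predicted labels skewing toward `pos` (6944 against 3056).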
