# Sentiment Analysis
Now that we've seen word vectors we can start to investigate sentiment analysis. The goal is to find commonalities between documents, with the understanding that similarly *combined* vectors should correspond to similar sentiments.

While the scope of sentiment analysis is very broad, we will focus our work in three ways.

### 1. Polarity classification
We won't try to determine if a sentence is objective or subjective, fact or opinion. Rather, we care only if the text expresses a *positive*, *negative* or *neutral* opinion.
### 2. Document level scope
We'll also try to aggregate all of the sentences in a document or paragraph, to arrive at an overall opinion.
### 3. Coarse analysis
We won't try to perform a fine-grained analysis that would determine the degree of positivity/negativity. That is, we're not trying to guess how many stars a reviewer awarded, just whether the review was positive or negative.

## Broad Steps:
* First, consider the text being analyzed. A model trained on paragraph-long movie reviews might not be effective on tweets. Make sure to use an appropriate model for the task at hand.
* Next, decide the type of analysis to perform. In the previous section on text classification we used a bag-of-words technique that considered only single tokens, or *unigrams*. Some rudimentary sentiment analysis models go one step further, and consider two-word combinations, or *bigrams*. In this section, we'd like to work with complete sentences, and for this we're going to import a trained NLTK lexicon called *VADER*.
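To make the unigram/bigram distinction concrete, here is a minimal sketch in plain Python. The whitespace tokenizer and the sample sentence are illustrative assumptions, not part of the course material; a real pipeline would use a proper tokenizer.

```python
# Illustrative only: a naive whitespace tokenizer, not what NLTK or VADER use.
sentence = "this movie was not good"

# Unigrams are the single tokens.
unigrams = sentence.split()

# Bigrams pair each token with the token that follows it.
bigrams = list(zip(unigrams, unigrams[1:]))

print(unigrams)
print(bigrams)
```

The bigram `('not', 'good')` keeps the negation attached to the word it modifies, which a unigram bag-of-words model loses.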

## NLTK's VADER module
VADER is an NLTK module that provides sentiment scores based on words used ("completely" boosts a score, while "slightly" reduces it), on capitalization & punctuation ("GREAT!!!" is stronger than "great."), and negations (words like "isn't" and "doesn't" affect the outcome).
<br>To view the source code visit https://www.nltk.org/_modules/nltk/sentiment/vader.html

**Download the VADER lexicon.** You only need to do this once.

In [1]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Mike\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

<div class="alert alert-danger">NOTE: At the time of this writing there's a <a href='https://github.com/nltk/nltk/issues/2053'>known issue</a> with SentimentIntensityAnalyzer that raises a harmless warning on loading<br>
<tt><font color=black>&emsp;UserWarning: The twython library has not been installed.<br>&emsp;Some functionality from the twitter package will not be available.</tt>

This is due to be fixed in an upcoming NLTK release. For now, if you want to avoid it you can (optionally) install the NLTK twitter library with<br>
<tt><font color=black>&emsp;conda install nltk[twitter]</tt><br>or<br>
<tt><font color=black>&emsp;pip3 install -U nltk[twitter]</tt></div>

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()



The `polarity_scores()` method of VADER's `SentimentIntensityAnalyzer` takes in a string and returns a dictionary of scores in each of four categories:
* negative
* neutral
* positive
* compound *(computed by summing the word-level valence scores and normalizing the result to lie between -1 and 1)*


So let's create a really simple string:

In [9]:
a = 'This was a good movie.'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

You get back a dictionary with a negative value, a neutral value, a positive value, and a compound value that condenses the overall sentiment into a single normalized score.

As we'd expect, the negative value is zero, since the text says this is a good movie. The string has some neutral words and some positive ones. The `neg`, `neu` and `pos` values are proportions that each max out at 1.0, while the compound score ranges from -1.0 to 1.0.
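To see where a compound value like 0.4404 can come from, here is a minimal sketch of the normalization step. It assumes the formula and the constant `alpha = 15` found in the NLTK source linked earlier, and it assumes that "good" is the only scored word in the sentence, with a lexicon valence of 1.9. The function name `normalize_valence` is hypothetical; check the source and your downloaded lexicon before relying on any of these values.

```python
import math

def normalize_valence(total_valence, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1].

    Assumed to mirror the normalization in NLTK's vader.py; alpha=15 is
    the constant we believe that module uses.
    """
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    return max(-1.0, min(1.0, score))

# Assumption: 'good' is the only scored word in 'This was a good movie.'
# and its lexicon valence is 1.9.
print(round(normalize_valence(1.9), 4))
```

Under those assumptions the result rounds to the 0.4404 reported above. Sentences with boosters, capitalization or extra punctuation adjust the valence sum before this step, so they will not reduce to a single lexicon lookup.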

So now let's try a more complicated string. Notice we're going to capitalize "ever made" and have three exclamation points.

In [10]:
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

As mentioned previously, VADER accounts for things like repeated punctuation and capitalization.

Here the positive score is higher than in the previous example and the neutral share has dropped. The compound score is also much more positive.

Finally, let's try a very negative string.

In [11]:
a = 'This was the worst film to ever disgrace the screen.'
sid.polarity_scores(a)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

This is quite a negative review, and VADER picks it up.

Now there is no positive component, only neutral and negative, so the compound score becomes negative.

- A compound score of zero is completely neutral
- A compound score above zero indicates some degree of positive sentiment
- A compound score below zero indicates some degree of negative sentiment
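These rules can be sketched as a small helper. The function name `label_from_compound` and the optional `margin` parameter are hypothetical, introduced here for illustration only. A small margin such as 0.05 is a commonly cited convention for carving out a neutral band around zero, not something this section relies on.

```python
def label_from_compound(compound, margin=0.0):
    """Map a compound score to 'pos', 'neg' or 'neu'.

    With margin=0.0 only an exact zero counts as neutral.
    """
    if compound > margin:
        return 'pos'
    if compound < -margin:
        return 'neg'
    return 'neu'

# Compound scores from the three examples above, plus an exact zero
for score in (0.4404, 0.8877, -0.8074, 0.0):
    print(score, label_from_compound(score))
```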

## Use VADER to analyze Amazon Reviews
For this exercise we're going to apply `SentimentIntensityAnalyzer` to a dataset of 10,000 Amazon reviews. Like our movie reviews datasets, these are labeled as either "pos" or "neg". At the end we'll determine the accuracy of our sentiment analysis with VADER.

> The file is tab-separated, so we need to pass `sep='\t'` to `read_csv`.
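As a minimal sketch of why the separator matters, the snippet below parses a tiny in-memory sample rather than the real file. The two sample rows are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample in a label<TAB>review layout; note the commas inside the text.
sample = "label\treview\npos\tGreat, simply great!\nneg\tBroke after a day, sadly.\n"

# sep='\t' tells pandas to split columns on tabs instead of commas.
sample_df = pd.read_csv(io.StringIO(sample), sep='\t')

print(sample_df.shape)
print(list(sample_df.columns))
```

With the default comma separator, the reviews would be split at their commas rather than at the tab.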

In [12]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


Once the file is read in, you can view the first rows by calling `head()` on the DataFrame. We have two columns:

- `label`: `pos` for positive or `neg` for negative
- `review`: the actual text of the review

To see how many positive and negative labels we have, select the `label` column and call `value_counts()`. The counts show slightly more negative reviews than positive ones, and 10,000 reviews overall.

In [13]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

### Clean the data:

Next we'll do a little cleaning to confirm that there are no empty records, and then we'll run a first review through VADER.

Recall that our moviereviews.tsv file contained empty records. Let's check to see if any exist in amazonreviews.tsv.

#### Steps:

The first step drops any rows with missing values. The second step, further down, drops any rows whose review consists only of whitespace.

```python
df.dropna(inplace=True)
```

Depending on where your datasets come from, they may or may not contain missing values, but it's always a good idea to check.

Next, we loop over the index, label and review of each row:

```python
for i, lb, rv in df.itertuples():
    ...  # loop body goes here
```

The names `i`, `lb` and `rv` are just placeholders. `df.itertuples()` returns each row as a tuple containing the index, the label and the review text.

Putting it together:

```python
for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
```

If the review is a string, we check whether it consists only of whitespace with `isspace()`. If it does, we append its index to a list called `blanks`, which starts out empty (`blanks = []`).

In [14]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)

After running this, let's check whether `blanks` caught anything.

In [4]:
blanks

[]

The list is empty, so there was nothing to drop. If it had contained any index positions, the `df.drop()` call at the end of the cell above would have removed those rows:

In [5]:
df.drop(blanks,inplace=True)

Since there are no blanks here, that call has no effect.

In [15]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In this case there were no empty records. Good!
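One subtlety of the whitespace test is worth a quick sketch: `str.isspace()` returns `False` for an empty string. The sample values below are hypothetical stand-ins for the `review` column, and the `strip()` variant is an alternative shown for comparison, not the method used in this section.

```python
# Hypothetical sample values standing in for the 'review' column
reviews = ['Great product', '   ', '\t\n', '', None]

# str.isspace() is True only for non-empty strings made up entirely of whitespace
whitespace_only = [i for i, rv in enumerate(reviews)
                   if isinstance(rv, str) and rv.isspace()]

# Stripping first also catches the empty string ''
blank_or_empty = [i for i, rv in enumerate(reviews)
                  if isinstance(rv, str) and not rv.strip()]

print(whitespace_only)   # [1, 2]
print(blank_or_empty)    # [1, 2, 3]
```

In practice `pd.read_csv` turns empty fields into NaN by default, so `dropna()` already handles that case for data loaded from a file.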

## Let's run the first review through VADER

Now let's run the first review through VADER.

Let's check out the text of the first review:

In [17]:
df.iloc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

We can see it's quite positive: "This sound track was beautiful!", "it has the best music!", plenty of exclamation points, and so on.

Now let's check the polarity scores:

In [16]:
sid.polarity_scores(df.loc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

VADER picked up a small amount of negativity, possibly from words such as "hate" and "crude", or from phrases like "anyone who cares to listen!" that can read as slightly negative out of context. Most of the text is neutral or positive, so the compound score is strongly positive. That agrees with the label of the first review, which is positive:

In [11]:
df.loc[0]['label']

'pos'

Great! Our first review was labeled "positive", and earned a positive compound score.

## Adding Scores and Labels to the DataFrame

Now let's add scores and labels to the DataFrame.

In this next section we'll add columns to the original DataFrame to store polarity_score dictionaries, extracted compound scores, and new "pos/neg" labels derived from the compound score. We'll use this last column to perform an accuracy test.

We'll create a new column named `scores` by taking the `review` column and calling its `apply()` method, which applies `sid.polarity_scores()` to every review in the DataFrame. The lambda takes a review and returns its polarity scores:

In [19]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


This may take a little while, because it runs `polarity_scores()` on every single review. Once it finishes, check the head of the DataFrame and you'll see a new `scores` column containing the dictionaries.

What we really want is just the compound score, so let's create a new `compound` column:

In [20]:
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


Notice that the first five labels (`label`) are all positive, and the compound scores (`compound`) are all positive as well.

Based on the compound score, we'll apply a simple rule:

- If it's greater than or equal to zero, it's positive.

- If it's less than zero, it's negative.

Note that this rule counts a perfectly neutral score of zero as positive, since the dataset has only two labels. Then we'll compare these derived labels to the true labels that we already know.

We'll store the result in one last column, `comp_score`, which turns the score into a string that matches the format of our existing labels.


In [21]:
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos


It looks like we're matching on the first five.

Now let's produce an overall report on accuracy, comparing the labels derived from VADER's compound score to the manual labels in this dataset.

## Report on Accuracy
Finally, we'll use scikit-learn to determine how close VADER came to our original 10,000 labels.

In [23]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

Let's first just get the accuracy score and we can do that by simply saying:

In [24]:
accuracy_score(df['label'],df['comp_score'])

0.7091

Here we're comparing VADER's output against the manual labels. The `label` column was assigned by hand: essentially, a person read these reviews and decided whether each was positive or negative.

The accuracy score comes out at about 0.71. If we chose positives and negatives at random, we would expect an accuracy of around 0.5.

So we're doing noticeably better than random guessing, which is quite good given that a single line of code produces the polarity scores.
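A slightly stricter baseline than random guessing is to always predict the most frequent label. As a rough sketch, using only the label counts and the accuracy reported earlier in this section:

```python
# Label counts reported by value_counts() earlier in this section
counts = {'neg': 5097, 'pos': 4903}
total = sum(counts.values())

# Baseline: always predict the most frequent label
majority_label = max(counts, key=counts.get)
majority_accuracy = counts[majority_label] / total

vader_accuracy = 0.7091  # from accuracy_score above

print(majority_label, majority_accuracy)
print(round(vader_accuracy - majority_accuracy, 4))
```

Because the two classes are nearly balanced, this baseline sits close to 0.5, and VADER's 0.71 is still a clear improvement over it.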


Now let's print the classification report:

In [17]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



We pass in the true `label` values followed by our calculated `comp_score` values. The report shows precision, recall and F1 score, broken down for negative and positive reviews.

VADER has more trouble with negative reviews than with positive ones: recall for `neg` is only 0.51, so roughly half of the negative reviews were labeled positive. If you read through some of these Amazon reviews, the text is sometimes hard to follow and sometimes sarcastic. Sarcasm is very hard for a lexicon-based tool like VADER to detect.

Finally, let's print a confusion matrix:

In [25]:
print(confusion_matrix(df['label'],df['comp_score']))

[[2623 2474]
 [ 435 4468]]



This tells us that VADER correctly identified an Amazon review as "positive" or "negative" roughly 71% of the time. It does not perform well on negative reviews: 2,474 of the 5,097 negative reviews were classified as positive. That's not bad considering how simple the process is, but it falls short of what state-of-the-art deep learning methods for sentiment analysis can achieve.
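The figures in the classification report can be re-derived from the confusion matrix by hand. The sketch below assumes scikit-learn's usual layout, with rows as true labels and columns as predictions, both in the alphabetical order `['neg', 'pos']`.

```python
# Confusion matrix from above: rows are true labels, columns are predictions,
# both assumed to be ordered ['neg', 'pos'].
cm = [[2623, 2474],
      [435, 4468]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: of everything predicted as a class, how much was correct.
# Recall: of everything truly in a class, how much was found.
neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])
neg_recall = cm[0][0] / (cm[0][0] + cm[0][1])
pos_precision = cm[1][1] / (cm[0][1] + cm[1][1])
pos_recall = cm[1][1] / (cm[1][0] + cm[1][1])

print(round(accuracy, 4))
print(round(neg_precision, 2), round(neg_recall, 2))
print(round(pos_precision, 2), round(pos_recall, 2))
```

The rounded values match the accuracy score and the precision and recall columns of the classification report.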



## Up Next: Sentiment Analysis Project