### Extract Information:
    
- Tag each data item with sentiment values
- Create one or more categorical data series
- Analyse the categorical sentiment data

### Act:

- Trade in financial markets
- Change or reallocate ad budgets
- Tailor electoral strategy
- Decide product recall strategies

### Applications of sentiment analysis:

- Event-driven trading (company earnings vs. forecast): "buy the rumor, sell the news"

- Polarity: positive or negative?
- Subjectivity: subjective or objective?
- Aspects: part or whole?

### Rule-based and ML-based Binary Classifiers:
- Rule-based: static. ML-based: dynamic.
- Rule-based: experts are needed to formulate the rules. ML-based: no expert skill is needed.
- Rule-based: can operate on isolated problem instances. ML-based: needs a corpus of data and cannot operate on isolated problem instances.
- Rule-based: to update the classifier, update the rules. ML-based: to update the classifier, update the corpus.
- Rule-based: no training step. ML-based: requires a training step.

## Rule-based approach

Building is hard, Using is easy:

- Split text into words
- Calculate the polarity of each word (this requires a sentiment lexicon; ignore stop words and neutral words)
- Aggregate word polarities
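
The three steps above can be sketched with a toy lexicon. The lexicon, the stop-word set and the function name are made up for illustration; a real system would rely on a published sentiment lexicon.

```python
# Toy sentiment lexicon: +1 for positive words, -1 for negative words.
# These entries are illustrative assumptions, not a real lexicon.
TOY_LEXICON = {"good": 1, "great": 1, "tasty": 1,
               "bad": -1, "terrible": -1, "slow": -1}
TOY_STOP_WORDS = {"the", "a", "was", "but", "and", "is"}

def toy_polarity(text):
    """Split the text, look up each word's polarity, and sum the results."""
    words = text.lower().split()                 # step 1: split text into words
    polarities = [TOY_LEXICON.get(word, 0)       # step 2: word polarity (0 = neutral or unknown)
                  for word in words
                  if word not in TOY_STOP_WORDS]
    return sum(polarities)                       # step 3: aggregate word polarities

print(toy_polarity("the food was great but the service was slow and bad"))  # -1
```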

Limitations of a Simplistic Approach:
- Polarity alone loses intensity information
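
A small illustration of this limitation, using invented valence values: a sign-only score cannot tell a mild opinion from a strong one, while a valence-weighted score can.

```python
# Hypothetical valence scores (values invented for illustration only).
TOY_VALENCE = {"good": 1.9, "excellent": 2.7, "poor": -2.1, "horrible": -2.5}

def sign_only(text):
    """Add +1 or -1 per sentiment word, discarding intensity."""
    valences = [TOY_VALENCE.get(word, 0.0) for word in text.lower().split()]
    return sum((v > 0) - (v < 0) for v in valences)

def valence_sum(text):
    """Add the valence of each sentiment word, keeping intensity."""
    return sum(TOY_VALENCE.get(word, 0.0) for word in text.lower().split())

for sentence in ("good movie", "excellent movie"):
    print(sentence, sign_only(sentence), valence_sum(sentence))
```

Both sentences get the same sign-only score, but the valence sum ranks "excellent movie" above "good movie".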

VADER (Valence Aware Dictionary and sEntiment Reasoner)
 
 - Built into nltk
 - Both an algorithm and a dataset (lexicon)
 - Support for emoticons, idioms, punctuation, negation, double negation, emphasis, contrast and booster words ("so", "really")
 
SentiWordNet

In [None]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [8]:
import nltk

In [6]:
from nltk.sentiment import vader



In [7]:
sia = vader.SentimentIntensityAnalyzer()
sia.polarity_scores("What a terrible restaurant")

{'compound': -0.4767, 'neg': 0.608, 'neu': 0.392, 'pos': 0.0}
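
The `compound` value is a single normalized score between -1 and 1. As a sketch of where -0.4767 can come from, the snippet below assumes the normalization used in the VADER reference implementation, x / sqrt(x² + α) with α = 15, and assumes that the lexicon gives "terrible" a valence of -2.1 while the other words contribute nothing. Both are assumptions about the library's internals, not something shown in this notebook.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Assumed valence of "terrible" in the VADER lexicon: -2.1
print(round(normalize_compound(-2.1), 4))  # -0.4767
```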

## Classifying Movie Reviews with VADER

Download the data from [here](https://www.cs.cornell.edu/people/pabo/movie-review-data/). The dataset contains two files, one with positive and one with negative processed sentences/snippets, 5331 in each.


In [13]:
with open(r'C:\Users\jli\Downloads\rt-polaritydata\rt-polaritydata\rt-polarity.pos') as pf:
    posReviews = pf.readlines()

In [14]:
len(posReviews)

5331

In [16]:
posReviews[0]

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . \n'

In [17]:
with open(r'C:\Users\jli\Downloads\rt-polaritydata\rt-polaritydata\rt-polarity.neg') as pf:
    negReviews = pf.readlines()

In [18]:
len(negReviews)

5331

In [19]:
negReviews[0]

'simplistic , silly and tedious . \n'
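
The two loading cells repeat the same steps, so they could be folded into one helper. This is only a sketch: `load_reviews` is a hypothetical name, and the `latin-1` encoding is an assumption about how the downloaded files are encoded (without it, Python 3 may fail to decode some accented characters).

```python
import io

def load_reviews(path, encoding="latin-1"):
    """Read one review per line, dropping surrounding whitespace and blank lines."""
    with io.open(path, encoding=encoding) as review_file:
        return [line.strip() for line in review_file if line.strip()]
```

Unlike `readlines()`, this drops the trailing `' \n'` from each review. Calling `load_reviews` once with the `rt-polarity.pos` path and once with the `rt-polarity.neg` path would replace the two `with open(...)` cells.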

In [21]:
import nltk
from nltk.sentiment import vader
sia = vader.SentimentIntensityAnalyzer()
def vaderSentiment(review):
    return sia.polarity_scores(review)['compound']

def getReviewSentiment(sentimentCalculator):
    posCompound = [sentimentCalculator(review) for review in posReviews]
    negCompound = [sentimentCalculator(review) for review in negReviews]
    
    return {'Positive_Reviews': posCompound, 'Negative_Reviews': negCompound}

In [23]:
vaderResults = getReviewSentiment(vaderSentiment)

In [24]:
vaderResults.keys()

['Positive_Reviews', 'Negative_Reviews']

In [26]:
len(vaderResults['Positive_Reviews'])

5331

In [27]:
vaderResults['Positive_Reviews'][:10]

[0.3612,
 0.8069,
 0.2617,
 0.8271,
 0.6592,
 0.5994,
 0.4215,
 -0.5994,
 0.0938,
 0.4939]

In [36]:
# calculate the accuracy on positive reviews
pos_accuracy = sum(1.0 for x in vaderResults['Positive_Reviews'] if x > 0 ) / len(vaderResults['Positive_Reviews'])
print(pos_accuracy)

0.694428812606


In [37]:
neg_accuracy = sum(1.0 for x in vaderResults['Negative_Reviews'] if x < 0 ) / len(vaderResults['Negative_Reviews'])
print(neg_accuracy)

0.400862877509


In [46]:
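def runDiagnostics(reviewResult):
    posReviewResults = reviewResult['Positive_Reviews']
    negReviewResults = reviewResult['Negative_Reviews']
    
    # a positive review is correct if its score is > 0, a negative one if its score is < 0;
    # a score of exactly 0 counts as wrong in both cases
    pctTruePos = float(sum(x > 0 for x in posReviewResults)) / len(posReviewResults)
    pctTrueNeg = float(sum(x < 0 for x in negReviewResults)) / len(negReviewResults)
    # both classes have the same number of reviews, so the overall accuracy is the plain mean
    overallPct = (pctTruePos + pctTrueNeg) / 2
    # print(...) with a single string argument works in both Python 2 and Python 3
    print("Accuracy on positive reviews = " + "%.2f" % (pctTruePos*100) + "%")
    print("Accuracy on negative reviews = " + "%.2f" % (pctTrueNeg*100) + "%")
    print("Overall Accuracy = " + "%.2f" % (overallPct*100) + "%")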

In [47]:
runDiagnostics(getReviewSentiment(vaderSentiment))

Accuracy on positive reviews = 69.44%
Accuracy on negative reviews = 40.09%
Overall Accuracy = 54.76%
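
`runDiagnostics` counts a review as correct only when its score is strictly positive (or strictly negative), so a score of exactly 0 is an error for both classes. A sketch of a helper that separates the three cases; `score_breakdown` is a hypothetical name and the sample scores are made up.

```python
def score_breakdown(scores):
    """Return the fraction of scores that are positive, negative, and exactly zero."""
    total = float(len(scores))
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    zero = len(scores) - positive - negative
    return {"positive": positive / total,
            "negative": negative / total,
            "zero": zero / total}

sample_scores = [0.36, -0.6, 0.0, 0.0, 0.81]  # invented values, not real results
print(score_breakdown(sample_scores))
```

Applied to `vaderResults['Negative_Reviews']`, this would show how many of the misclassified negative reviews received a neutral score rather than a positive one.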


## Classifying Movie Reviews with SentiWordNet

 - Words, lemmas, synsets
 - SentiWordNet extends WordNet with polarity scores

In [38]:
from nltk.corpus import sentiwordnet as swn

In [39]:
list(swn.senti_synsets('dog'))  # list() in case senti_synsets returns an iterator

[SentiSynset('dog.n.01'),
 SentiSynset('frump.n.01'),
 SentiSynset('dog.n.03'),
 SentiSynset('cad.n.01'),
 SentiSynset('frank.n.02'),
 SentiSynset('pawl.n.01'),
 SentiSynset('andiron.n.01'),
 SentiSynset('chase.v.01')]

In [40]:
list(swn.senti_synsets('dog'))[3].neg_score()

1.0

In [41]:
list(swn.senti_synsets('dog'))[3].pos_score()

0.0

In [42]:
from nltk.corpus import sentiwordnet as swn
def superNaiveSentiment(review):
    reviewPolarity = 0.0
    numExceptions = 0  # number of words with no SentiWordNet entry
    
    for word in review.lower().split():
        weight = 0.0
        try:
            # only relies on the first meaning of the word;
            # list() is used because senti_synsets may return an iterator
            common_meaning = list(swn.senti_synsets(word))[0]
            if common_meaning.pos_score() > common_meaning.neg_score():
                weight += common_meaning.pos_score()
            elif common_meaning.pos_score() < common_meaning.neg_score():
                weight -= common_meaning.neg_score()
        except IndexError:  # the word has no synsets
            numExceptions += 1
        reviewPolarity += weight
    return reviewPolarity

In [48]:
runDiagnostics(getReviewSentiment(superNaiveSentiment))

Accuracy on positive reviews = 69.44%
Accuracy on negative reviews = 40.09%
Overall Accuracy = 54.76%


In [49]:
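from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords as nltkStopwords
# stopwords is not defined in the cells shown; it is assumed here to be NLTK's
# English stop-word list (requires the nltk 'stopwords' corpus)
stopwords = set(nltkStopwords.words('english'))

def naiveSentiment(review):
    reviewPolarity = 0.0
    numExceptions = 0
    
    for word in review.lower().split():
        numMeanings = 0
        if word in stopwords:
            continue
        weight = 0.0
        try:
            # average the net score (pos - neg) over all non-neutral meanings
            for meaning in swn.senti_synsets(word):
                if meaning.pos_score() > meaning.neg_score():
                    weight += (meaning.pos_score() - meaning.neg_score())
                    numMeanings += 1
                elif meaning.pos_score() < meaning.neg_score():
                    weight -= (meaning.neg_score() - meaning.pos_score())
                    numMeanings += 1
        except Exception:  # skip words that cannot be looked up
            numExceptions += 1
        if numMeanings > 0:
            reviewPolarity += (weight/numMeanings)
    return reviewPolarity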

In [51]:
runDiagnostics(getReviewSentiment(naiveSentiment))

Accuracy on positive reviews = 69.44%
Accuracy on negative reviews = 40.09%
Overall Accuracy = 54.76%
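
The two scorers differ in how they treat a word with several meanings: `superNaiveSentiment` uses only the first synset, while `naiveSentiment` averages the net score over all non-neutral synsets. A sketch of that difference, with invented `(pos, neg)` pairs standing in for SentiWordNet scores:

```python
# Hypothetical (pos_score, neg_score) pairs for the senses of one word.
# The numbers are invented; real values would come from swn.senti_synsets(word).
toy_senses = [(0.0, 0.0), (0.125, 0.5), (0.0, 1.0), (0.25, 0.0)]

def first_sense_weight(senses):
    """Signed score of the first sense only (the superNaiveSentiment rule)."""
    pos, neg = senses[0]
    if pos > neg:
        return pos
    if pos < neg:
        return -neg
    return 0.0

def averaged_weight(senses):
    """Mean of (pos - neg) over senses where pos != neg (the naiveSentiment rule)."""
    nets = [pos - neg for pos, neg in senses if pos != neg]
    return sum(nets) / len(nets) if nets else 0.0

print(first_sense_weight(toy_senses))  # 0.0, because the first sense is neutral
print(averaged_weight(toy_senses))     # -0.375
```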
