# A Better Sentiment Analysis System

#### IS620 Final Project

##### Author: Partha Banerjee, CUNY MSDA

This project aims to build a model that analyzes sentiment more accurately. The goal is to train the model to classify “not cool” as negative, rather than positive as most models do because of the positive word “cool”.

Sentiment analysis helps modern businesses in many ways: prospective buyers can use the sentiments of other buyers to decide on a product they are planning to buy; producers can plan their product lines based on buyer sentiment and take corrective measures to address negative sentiment about their products; and marketers can use it for research and recommendations. Social media such as Twitter and Facebook play an important role in spreading this sentiment very quickly.

While working on this project, I got a good deal of help from the textbook and the following sites:

* <a href="http://ravikiranj.net/posts/2012/code/how-build-twitter-sentiment-analyzer/">how to build a twitter sentiment analyzer?</a>
* <a href="http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/">Text Classification For Sentiment Analysis – Stopwords And Collocations</a>

### Setup Environment

Let us start with setting up the environment with all necessary libraries in one place.

In [1]:
import re, math, collections, itertools, os
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder

### Feature Evaluator

As we learned during the course, “features” are an important part of sentiment analysis: they are whatever is being analyzed in an attempt to correlate with the labels. In this code, the features are the words in each review.

To build the feature corpus, I use the *sentence polarity dataset v1.0*, which contains 5,331 positive and 5,331 negative processed sentences/snippets. It was introduced by Bo Pang and Lillian Lee (Cornell) in Pang/Lee ACL 2005 and released in July 2005. Although the data was collected from movie reviews, we can still use it for our purpose.
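To make the feature representation concrete, here is a minimal sketch of how one snippet is tokenized and turned into a bag-of-words dictionary. The regular expression is the one used in this notebook; the sample sentence and the helper name `to_bag_of_words` are made up for illustration.

```python
import re

# Same pattern as in the notebook: runs of word characters/apostrophes,
# or a single punctuation mark from the listed set.
TOKEN_PATTERN = r"[\w']+|[.,!?;]"

def to_bag_of_words(sentence):
    """Hypothetical helper: tokenize a snippet and mark every token as present."""
    tokens = re.findall(TOKEN_PATTERN, sentence.rstrip())
    return tokens, dict((token, True) for token in tokens)

# Made-up sample line, not taken from the dataset.
tokens, features = to_bag_of_words("it's not cool, really!\n")
print(tokens)
print(features)
```

Note that punctuation marks become features too, which is why entries such as `(',', 'funny')` can show up once bigrams are added.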

In [17]:
def evaluate_features(feature_fun):
    posFeatures = []
    negFeatures = []
    # http://stackoverflow.com/questions/367155/splitting-a-string-into-words-and-punctuation
    # Breaks up the sentences into lists of individual words (as selected by the 
    # input mechanism) and appends 'pos' or 'neg' after each list
    with open('./data/rt-polarity.pos', 'r') as posSentences:
        for i in posSentences:
            posWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
            posWords = [feature_fun(posWords), 'pos']
            posFeatures.append(posWords)
    with open('./data/rt-polarity.neg', 'r') as negSentences:
        for i in negSentences:
            negWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
            negWords = [feature_fun(negWords), 'neg']
            negFeatures.append(negWords)

    # Now we need to split the data into 80:20 ratio as training and 
    # testing data for a Naive Bayes classifier.
    posCutoff = int(math.floor(len(posFeatures)*0.8))
    negCutoff = int(math.floor(len(negFeatures)*0.8))
    trainFeatures = posFeatures[:posCutoff] + negFeatures[:negCutoff]
    testFeatures = posFeatures[posCutoff:] + negFeatures[negCutoff:]

    # Trains a Naive Bayes Classifier using NLTK
    classifier = NaiveBayesClassifier.train(trainFeatures)

    #initiates referenceSets and testSets
    referenceSets = collections.defaultdict(set)
    testSets = collections.defaultdict(set)

    # Puts correctly labeled sentences in referenceSets and the 
    # predictively labeled version in testsets
    for i, (features, label) in enumerate(testFeatures):
        referenceSets[label].add(i)
        predicted = classifier.classify(features)
        testSets[predicted].add(i)

    print 'train on {:,} instances, test on {:,} instances'.format( \
                            len(trainFeatures), len(testFeatures))
    print 'accuracy:', nltk.classify.util.accuracy(classifier, testFeatures)
    print 'pos precision:', nltk.metrics.precision(referenceSets['pos'], \
                            testSets['pos'])
    print 'pos recall:', nltk.metrics.recall(referenceSets['pos'], \
                            testSets['pos'])
    print 'neg precision:', nltk.metrics.precision(referenceSets['neg'], \
                            testSets['neg'])
    print 'neg recall:', nltk.metrics.recall(referenceSets['neg'], \
                            testSets['neg'])
    print
    classifier.show_most_informative_features(10)
    
    return classifier
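The 80:20 split above can be checked with simple arithmetic. The sketch below applies the same cutoff rule to placeholder lists of 5,331 items each (stand-ins for the real feature lists; `split_80_20` is a hypothetical helper name).

```python
import math

def split_80_20(items):
    """Take the first 80% of items for training and the rest for testing."""
    cutoff = int(math.floor(len(items) * 0.8))
    return items[:cutoff], items[cutoff:]

# Placeholder data with the same sizes as the polarity dataset.
pos_items = list(range(5331))
neg_items = list(range(5331))

pos_train, pos_test = split_80_20(pos_items)
neg_train, neg_test = split_80_20(neg_items)

print(len(pos_train) + len(neg_train))  # training instances
print(len(pos_test) + len(neg_test))    # testing instances
```

Because the cutoff is taken in file order without shuffling, the test set is always the last 20% of each file.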

Now let us create a feature selection mechanism that uses all words.

In [21]:
def make_complete_dict(words):
    return dict([(word, True) for word in words])

print 'Using bag of words feature selection:\n'
monogramClassifier = evaluate_features(make_complete_dict)

Using bag of words feature selection:

train on 8,528 instances, test on 2,134 instances
accuracy: 0.778819119025
pos precision: 0.787996127783
pos recall: 0.762886597938
neg precision: 0.770208900999
neg recall: 0.794751640112

Most Informative Features
              engrossing = True              pos : neg    =     18.3 : 1.0
                mediocre = True              neg : pos    =     13.7 : 1.0
                   flaws = True              pos : neg    =     13.7 : 1.0
               absorbing = True              pos : neg    =     13.0 : 1.0
                 generic = True              neg : pos    =     13.0 : 1.0
                  boring = True              neg : pos    =     12.4 : 1.0
              refreshing = True              pos : neg    =     12.3 : 1.0
               inventive = True              pos : neg    =     12.3 : 1.0
                    flat = True              neg : pos    =     11.8 : 1.0
                 triumph = True              pos : neg    =     11.7 : 1.0

An accuracy of 77.88% is good, but we will try to improve on it. The precision and recall values are also close to each other, indicating that the classifier treats both classes fairly evenly. The output ends with the most informative features.
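The precision and recall figures are computed from sets of test-item indices: a reference set (items truly carrying a label) and a predicted set (items the classifier gave that label). The sketch below uses hand-written helpers with toy index sets to show the set-based definitions; it is an illustration, not NLTK's own code.

```python
def precision(reference, predicted):
    """Fraction of predicted items that are truly in the reference set."""
    return len(reference & predicted) / float(len(predicted))

def recall(reference, predicted):
    """Fraction of reference items that were actually predicted."""
    return len(reference & predicted) / float(len(reference))

# Toy example: items 0-3 are truly 'pos'; the classifier labeled 2, 3, 4 as 'pos'.
reference_pos = {0, 1, 2, 3}
predicted_pos = {2, 3, 4}

print(precision(reference_pos, predicted_pos))
print(recall(reference_pos, predicted_pos))
```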

**Stopword Filtering**

Let us now remove stopwords and see the accuracy parameters.

In [22]:
stopset = set(stopwords.words('english'))
 
def stopword_filtered_word_feats(words):
    return dict([(word, True) for word in words if word not in stopset])
 
stopwordClassifier = evaluate_features(stopword_filtered_word_feats)

train on 8,528 instances, test on 2,134 instances
accuracy: 0.77038425492
pos precision: 0.766882516189
pos recall: 0.77694470478
neg precision: 0.773979107312
neg recall: 0.763823805061

Most Informative Features
              engrossing = True              pos : neg    =     18.3 : 1.0
                mediocre = True              neg : pos    =     13.7 : 1.0
                   flaws = True              pos : neg    =     13.7 : 1.0
               absorbing = True              pos : neg    =     13.0 : 1.0
                 generic = True              neg : pos    =     13.0 : 1.0
                  boring = True              neg : pos    =     12.4 : 1.0
              refreshing = True              pos : neg    =     12.3 : 1.0
               inventive = True              pos : neg    =     12.3 : 1.0
                    flat = True              neg : pos    =     11.8 : 1.0
                 triumph = True              pos : neg    =     11.7 : 1.0


Accuracy has gone down from 77.88% to 77.04%. Negative recall has also dropped (from 79.48% to 76.38%), although positive recall rose slightly. This indicates that stopwords carry information for sentiment classification, so we should not remove them.
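One plausible reason stopwords matter is that negation words are themselves stopwords. The sketch below uses a small hand-picked stop set (an assumption standing in for NLTK's full English list, which to my knowledge contains these words) to show what the filter does to a negated phrase.

```python
# Hand-picked subset for illustration only; the notebook uses
# stopwords.words('english') from NLTK instead.
sample_stopset = set(["he", "you", "not", "so"])

def stopword_filtered(words, stopset):
    """Keep only the words that are not in the stop set."""
    return dict((word, True) for word in words if word not in stopset)

plain = stopword_filtered("you look so cool".split(), sample_stopset)
negated = stopword_filtered("he looks not so cool".split(), sample_stopset)

print(plain)
print(negated)
```

After filtering, the negated phrase no longer contains "not", so the only sentiment-bearing word left is "cool".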

**Bigram Collection**

Let us now include bigrams and see how the accuracy metrics change. We will use NLTK's bigram features for this. The BigramCollocationFinder maintains two internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square. These scoring functions measure the collocation correlation of two words, basically whether the bigram occurs about as frequently as each individual word.
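To illustrate what a chi-square score for a bigram measures, here is a hand-rolled sketch of Pearson's chi-square on the 2x2 contingency table built from a bigram count, the two word counts, and the total number of bigrams. The function name, argument names, and counts are assumptions for illustration; this is not NLTK's implementation.

```python
def chi_square(n_ii, n_ix, n_xi, n_xx):
    """Pearson chi-square for a bigram (w1, w2).

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams starting with w1
    n_xi: count of bigrams ending with w2
    n_xx: total number of bigrams
    """
    n_io = n_ix - n_ii                # w1 followed by something else
    n_oi = n_xi - n_ii                # something else followed by w2
    n_oo = n_xx - n_ii - n_io - n_oi  # neither w1 first nor w2 second
    numerator = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    denominator = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return numerator / float(denominator)

# Made-up counts: the two words co-occur far more often than chance.
strong = chi_square(8, 10, 10, 100)
# Made-up counts: co-occurrence exactly matches independence.
independent = chi_square(1, 10, 10, 100)

print(strong)
print(independent)
```

A high score means the two words appear together much more often than their individual frequencies would suggest; a score near zero means the pairing looks like chance.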

In [23]:
def bigram_word_features(words, score_fn=BigramAssocMeasures.chi_sq, n=500):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])
 
bigramClassifier = evaluate_features(bigram_word_features)

train on 8,528 instances, test on 2,134 instances
accuracy: 0.787722586692
pos precision: 0.792380952381
pos recall: 0.779756326148
neg precision: 0.783210332103
neg recall: 0.795688847235

Most Informative Features
              engrossing = True              pos : neg    =     18.3 : 1.0
                mediocre = True              neg : pos    =     13.7 : 1.0
          (',', 'funny') = True              pos : neg    =     13.7 : 1.0
                   flaws = True              pos : neg    =     13.7 : 1.0
           ('dull', ',') = True              neg : pos    =     13.7 : 1.0
               absorbing = True              pos : neg    =     13.0 : 1.0
          ('to', 'care') = True              neg : pos    =     13.0 : 1.0
           ('up', 'for') = True              pos : neg    =     13.0 : 1.0
                 generic = True              neg : pos    =     13.0 : 1.0
   ('examination', 'of') = True              pos : neg    =     13.0 : 1.0


Accuracy is now up from 77.88% to 78.77%. Both positive and negative precision and recall have also increased. So we can conclude that including bigrams can improve classifier effectiveness.

### Get into the Business

Having seen the benefits of bigrams, let us build our final model to find the sentiment of tweets. For this I reuse some of the code snippets from above.

In [6]:
posFeatures = []
negFeatures = []
with open('./data/rt-polarity.pos', 'r') as posSentences:
    for i in posSentences:
        posWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
        posWords = [bigram_word_features(posWords), 'pos']
        posFeatures.append(posWords)
with open('./data/rt-polarity.neg', 'r') as negSentences:
    for i in negSentences:
        negWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
        negWords = [bigram_word_features(negWords), 'neg']
        negFeatures.append(negWords)

# Now we need to split the data into 80:20 ratio as training and 
# testing data for a Naive Bayes classifier.
posCutoff = int(math.floor(len(posFeatures)*0.8))
negCutoff = int(math.floor(len(negFeatures)*0.8))
trainFeatures = posFeatures[:posCutoff] + negFeatures[:negCutoff]
testFeatures = posFeatures[posCutoff:] + negFeatures[negCutoff:]

In [7]:
# Trains a Naive Bayes Classifier using NLTK
classifier = NaiveBayesClassifier.train(trainFeatures)
# show_most_informative_features prints the table itself and returns None
classifier.show_most_informative_features(20)

Most Informative Features
              engrossing = True              pos : neg    =     18.3 : 1.0
                mediocre = True              neg : pos    =     13.7 : 1.0
          (',', 'funny') = True              pos : neg    =     13.7 : 1.0
                   flaws = True              pos : neg    =     13.7 : 1.0
           ('dull', ',') = True              neg : pos    =     13.7 : 1.0
               absorbing = True              pos : neg    =     13.0 : 1.0
          ('to', 'care') = True              neg : pos    =     13.0 : 1.0
           ('up', 'for') = True              pos : neg    =     13.0 : 1.0
                 generic = True              neg : pos    =     13.0 : 1.0
   ('examination', 'of') = True              pos : neg    =     13.0 : 1.0
                  boring = True              neg : pos    =     12.4 : 1.0
              refreshing = True              pos : neg    =     12.3 : 1.0
        ('with', 'such') = True              pos : neg    =     12.3 : 1.0

In [24]:
# show_most_informative_features prints the table itself and returns None
bigramClassifier.show_most_informative_features(20)

Most Informative Features
              engrossing = True              pos : neg    =     18.3 : 1.0
                mediocre = True              neg : pos    =     13.7 : 1.0
          (',', 'funny') = True              pos : neg    =     13.7 : 1.0
                   flaws = True              pos : neg    =     13.7 : 1.0
           ('dull', ',') = True              neg : pos    =     13.7 : 1.0
               absorbing = True              pos : neg    =     13.0 : 1.0
          ('to', 'care') = True              neg : pos    =     13.0 : 1.0
           ('up', 'for') = True              pos : neg    =     13.0 : 1.0
                 generic = True              neg : pos    =     13.0 : 1.0
   ('examination', 'of') = True              pos : neg    =     13.0 : 1.0
                  boring = True              neg : pos    =     12.4 : 1.0
              refreshing = True              pos : neg    =     12.3 : 1.0
        ('with', 'such') = True              pos : neg    =     12.3 : 1.0

In [25]:
def predictSentiment(tweet, classifier):
    twt = []
    for key, value in tweet.iteritems():
        twt.append(value)
        
    twFeatures = []
    for i in twt[0]:
        twWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
        twWords = [bigram_word_features(twWords), 'tbd']
        twFeatures.append(twWords)

    for i, (features, label) in enumerate(twFeatures):
        predicted = classifier.classify(features)
        print "{} - {}".format(twt[0][i], predicted)

In [30]:
test = {0: ["you look so cool", "he looks not so cool", "so great", "not so great"]}

In [28]:
predictSentiment(test, stopwordClassifier)

you look so cool - pos
he looks not so cool - pos
so great - pos
not so great - pos


In [27]:
predictSentiment(test, monogramClassifier)

you look so cool - pos
he looks not so cool - neg
so great - pos
not so great - pos


In [29]:
predictSentiment(test, bigramClassifier)

you look so cool - pos
he looks not so cool - neg
so great - pos
not so great - neg
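A likely explanation for the bigram classifier handling negation better is that word pairs such as `('not', 'so')` become features in their own right. Since `bigram_word_features` asks for up to 500 bigrams per snippet, a short snippet should end up keeping every adjacent pair. The sketch below builds that feature set with plain `zip` instead of NLTK's collocation finder (an assumption made to keep the example dependency-free; `words_and_bigrams` is a hypothetical name).

```python
import itertools

def words_and_bigrams(words):
    """All single words plus all adjacent word pairs, each marked as present."""
    bigrams = list(zip(words, words[1:]))
    return dict((gram, True) for gram in itertools.chain(words, bigrams))

plain = words_and_bigrams("so great".split())
negated = words_and_bigrams("not so great".split())

# Features that only the negated phrase has.
only_in_negated = set(negated) - set(plain)
print(only_in_negated)
```

The negated phrase differs from the plain one by the word "not" and the pair `('not', 'so')`, which gives the classifier extra evidence that a unigram model with stopwords removed would not see.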


### Get Data from Twitter

Finally, we need data whose sentiment we can analyze, so I extract tweets from Twitter. The extraction is driven by a keyword and a time period; I have chosen these just to demonstrate the project.
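The time period is translated into one-day `since`/`until` windows. The following sketch reproduces that logic in isolation with a fixed, arbitrary date so the result is repeatable; the function names are hypothetical and no request is sent to Twitter.

```python
import datetime

def week_dates(today):
    """Today and the six previous days as YYYY-MM-DD strings, newest first."""
    return [(today - datetime.timedelta(days=i)).strftime("%Y-%m-%d")
            for i in range(7)]

def date_windows(dates, time):
    """One-day windows: six for 'lastweek', one for 'today', none otherwise."""
    if time == 'lastweek':
        count = 6
    elif time == 'today':
        count = 1
    else:
        count = 0
    return [{'since': dates[i + 1], 'until': dates[i]} for i in range(count)]

# Arbitrary fixed date chosen only for the example.
dates = week_dates(datetime.date(2016, 3, 10))
print(date_windows(dates, 'today'))
print(len(date_windows(dates, 'lastweek')))
```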

In [10]:
import argparse, urllib, urllib2, json, random
import os, oauth2, datetime, re
from datetime import timedelta

class TwitterData:
    def __init__(self):
        self.currDate = datetime.datetime.now()
        self.weekDates = []
        self.weekDates.append(self.currDate.strftime("%Y-%m-%d"))
        for i in range(1,7):
            dateDiff = timedelta(days=-i)
            newDate = self.currDate + dateDiff
            self.weekDates.append(newDate.strftime("%Y-%m-%d"))

    def getTwitterData(self, keyword, time):
        self.weekTweets = {}
        if(time == 'lastweek'):
            for i in range(0,6):
                params = {'since': self.weekDates[i+1], 'until': \
                          self.weekDates[i]}
                self.weekTweets[i] = self.getData(keyword, params)
        elif(time == 'today'):
            for i in range(0,1):
                params = {'since': self.weekDates[i+1], 'until': \
                          self.weekDates[i]}
                self.weekTweets[i] = self.getData(keyword, params)
        return self.weekTweets
    
    def parse_config(self):
        config = {}
        if os.path.exists('config.json'):
            with open('config.json') as f:
                config.update(json.load(f))
        return config

    def oauth_req(self, url, http_method="GET", post_body=None,
                  http_headers=None):
        config = self.parse_config()
        consumer = oauth2.Consumer(key=config.get('consumer_key'), \
                                   secret=config.get('consumer_secret'))
        token = oauth2.Token(key=config.get('access_token'), \
                             secret=config.get('access_token_secret'))
        client = oauth2.Client(consumer, token)

        resp, content = client.request(
            url,
            method=http_method,
            body=post_body or '',
            headers=http_headers
        )
        return content
    
    def getData(self, keyword, params=None):
        maxTweets = 50
        url = 'https://api.twitter.com/1.1/search/tweets.json?'    
        data = {'q': keyword, 'lang': 'en', 'result_type': 'recent', \
                'count': maxTweets, 'include_entities': 0}

        if params:
            for key, value in params.iteritems():
                data[key] = value
        
        url += urllib.urlencode(data)
        
        response = self.oauth_req(url)
        jsonData = json.loads(response)
        
        tweets = []
        if 'errors' in jsonData:
            print "API Error"
            print jsonData['errors']
        else:
            for item in jsonData['statuses']:
                tweets.append(item['text'])
        return tweets
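To see what `getData` actually requests, the sketch below builds the search URL the same way but stops before any network call. The import fallback lets it run on both Python 2 and Python 3; `build_search_url` and the sample dates are assumptions for illustration.

```python
try:
    from urllib import urlencode        # Python 2, as used in this notebook
except ImportError:
    from urllib.parse import urlencode  # Python 3

def build_search_url(keyword, params=None, max_tweets=50):
    """Assemble the search URL from the default query plus optional extras."""
    data = {'q': keyword, 'lang': 'en', 'result_type': 'recent',
            'count': max_tweets, 'include_entities': 0}
    data.update(params or {})
    return 'https://api.twitter.com/1.1/search/tweets.json?' + urlencode(data)

url = build_search_url('#trump', {'since': '2016-03-09', 'until': '2016-03-10'})
print(url)
```

The `#` in the keyword is percent-encoded as `%23`, so hashtags can be passed in as typed.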

Define the keyword and time period used to extract data from Twitter.

In [11]:
keyword = raw_input('Enter hash tag (with #) you want to retrieve? ')
print keyword

Enter hash tag (with #) you want to retrieve? #trump
#trump


In [12]:
#keyword = '#Trump'
time = 'today'
twitterData = TwitterData()
tweets = twitterData.getTwitterData(keyword, time)

In [13]:
# Use a context manager so the file is flushed and closed
with open("./data/tweets.txt", 'w') as tweetFile:
    json.dump(tweets, tweetFile)

**Cleanup data**

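* Convert the tweets to lower case
* Remove non-ASCII (Unicode) characters
* Remove URLs
* Remove @usernames
* Collapse additional white space
* Replace #hashtags with the bare word
* Strip surrounding quotes and a leading "rt" (retweet marker)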
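The cleanup steps can be traced on a single example. The sketch below mirrors them in the same order, written so that it runs on both Python 2 and Python 3; the function name `clean_tweet` and the sample tweet are made up for illustration.

```python
import re

def clean_tweet(tweet):
    """Apply the cleanup steps to one tweet, in order."""
    tweet = tweet.lower()                                      # lower case
    tweet = tweet.encode('ascii', 'ignore').decode('ascii')    # drop non-ASCII
    tweet = re.sub(r'(www\.\S+)|(https?://\S+)', ' ', tweet)   # remove URLs
    tweet = re.sub(r'@\S+', ' ', tweet)                        # remove @usernames
    tweet = re.sub(r'\s+', ' ', tweet)                         # collapse whitespace
    tweet = re.sub(r'#(\S+)', r'\1', tweet)                    # #word -> word
    tweet = tweet.strip('\'"')                                 # strip quotes
    if tweet.startswith('rt '):                                # drop retweet marker
        tweet = tweet[3:]
    return tweet

# Made-up tweet containing a mention, a URL, a hashtag and an ellipsis character.
sample = u'RT @user: Check https://t.co/abc #Trump   is NOT cool\u2026'
cleaned = clean_tweet(sample)
print(cleaned)
```

Only quotes are stripped from the ends, so a tweet that starts with a mention or a URL keeps a leading space and a following "rt " is then not removed.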

In [14]:
def processTweet(tweet):
    #Convert to lower case
    tweet = tweet.lower()
    #Remove unicode
    tweet = tweet.encode('ascii','ignore')
    #Remove www.* or https?://*
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))',' ',tweet)
    tweet = tweet.replace("https://","")
    tweet = tweet.replace("https:","")
    #Remove @username
    tweet = re.sub('@[^\s]+',' ',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    #Finally remove rt from the beginning
    if tweet[:3] == "rt ":
        tweet = tweet[3:]
    return tweet

processedTweet = {k: map(processTweet, v) for k, v in tweets.items()}

In [15]:
# Use a context manager so the file is flushed and closed
with open("./data/processedTweets.txt", 'w') as processedFile:
    json.dump(processedTweet, processedFile)

In [16]:
predictSentiment(processedTweet, classifier)

i am getting tired of the false trump attacks. this sums up liberals vs nazis scary but true! liberallogic greta ht - neg
so any trump supporter could have this little coward arrested for making a death threat, correct?  - neg
 isn't it time to remove donald trump from the hall of fame considering the racist rule and all? wwe trump - neg
 donald trump live in iowa buildthewall trumpkin trumptrain trumptoday trumpmiami makeamericagreatagain trump - pos
so any trump supporter could have this little coward arrested for making a death threat, correct?  - neg
nice try cont. new leftist anti-trump narrative they "care" abt gop "dilemma". spare the false "mournful state" pity. cnn trump2016 - pos
rt pat buchanan's message to america! get behind trump!!! makeamericagreatagain  - pos
so any trump supporter could have this little coward arrested for making a death threat, correct?  - neg
i am getting tired of the false trump attacks. this sums up liberals vs nazis scary but true! liberallogic gr