# Sentiment Analysis

In [6]:
import os
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.metrics import accuracy_score

### Dataset

IMDB movie reviews dataset: http://ai.stanford.edu/~amaas/data/sentiment/
* 25000 positive & 25000 negative reviews
* 50/50 training/test split
* 7 stars or more -> positive review
* 4 stars or fewer -> negative review
* at most 30 reviews per movie
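
The labeling rule above can be sketched in a few lines. This is only an illustration: it assumes that the review files in the extracted archive are named `<id>_<rating>.txt` (for example `0_9.txt`), and the helper name `label_from_filename` is hypothetical, not something the dataset ships with.

```python
import re

# Assumption: review files are named "<id>_<rating>.txt", e.g. "0_9.txt".
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def label_from_filename(filename):
    """Map the star rating in a file name to 1 (positive), 0 (negative) or None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    rating = int(match.group(2))
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    # Ratings of 5 and 6 fall between the two classes and get no label.
    return None


print(label_from_filename("0_9.txt"))
print(label_from_filename("12_3.txt"))
```

In practice the `pos` and `neg` folder names already encode the label, so parsing the rating is only needed if you want the star value itself.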

In [7]:
def read_corpus(dataset):
    """Read all reviews of one split ('train' or 'test'); labels are 1 = positive, 0 = negative."""
    corpus = []
    labels = []
    base_path = '/Users/maxim/codebase/python/spiced_projects/data/aclImdb/'
    for rev in ['pos', 'neg']:
        rev_dir = os.path.join(base_path, dataset, rev)
        for file in os.listdir(rev_dir):
            # Explicit encoding so the result does not depend on the platform default
            with open(os.path.join(rev_dir, file), 'r', encoding='utf-8') as f:
                corpus.append(f.read())
            labels.append(1 if rev == 'pos' else 0)
    return corpus, labels

In [8]:
corpus_train, y_train = read_corpus('train')
corpus_test, y_test = read_corpus('test')

### Approaches

1. rule-based (unsupervised)
2. machine learning (supervised & unsupervised)

#### 1.a. *Simple rule-based approach*: lexicon-based method

We start with two lexicons of words associated with positive and negative sentiments.

`positive-words.txt`: https://gist.github.com/mkulakowski2/4289437

`negative-words.txt`: https://gist.github.com/mkulakowski2/4289441

Let's imagine you have an unlabeled dataset of movie reviews. How would you use these lists of positive and negative words to infer the sentiment of the reviews?
* count positive and negative words from the lexicon in each review and assign majority class
* same as above, but remove all occurrences of "not + [word]" (or count them in the opposite category)
* determine sentiment by sentences, count positive vs negative sentences in a review
* apply weighting schemes to words (e.g. bad=1, garbage=2; superlatives)
* first/last word in review that is in either of the lexicons determines the sentiment (e.g. "I thought it was a _fantastic_...")
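
The first two ideas in the list can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function names `tokenize` and `score_review` are hypothetical, only the single word "not" is treated as a negation, and matching is done on whole lowercase tokens rather than on substrings.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def score_review(text, pos_lexicon, neg_lexicon):
    """Count lexicon hits per token; a hit directly after "not" counts for the opposite class."""
    pos_set, neg_set = set(pos_lexicon), set(neg_lexicon)
    tokens = tokenize(text)
    n_pos = n_neg = 0
    for i, token in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] == "not"
        if token in pos_set:
            if negated:
                n_neg += 1
            else:
                n_pos += 1
        elif token in neg_set:
            if negated:
                n_pos += 1
            else:
                n_neg += 1
    return n_pos, n_neg


# Tiny made-up lexicons, only for demonstration.
print(score_review("This was not good, it was terrible.", ["good"], ["terrible", "bad"]))
```

The majority class is then whichever count is larger. Ties still need a rule of their own, for example a random choice or a default class.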

In [9]:
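def read_words(sentiment):
    """Load the 'positive' or 'negative' lexicon, skipping ';' comment lines and blank lines."""
    path = f'/Users/maxim/codebase/python/spiced_projects/data/posneg/{sentiment}-words.txt'
    # The context manager closes the file even if reading fails
    with open(path, mode='r') as f:
        lines = f.readlines()
    return [line.strip('\n') for line in lines if not line.startswith(';') and len(line) > 1]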

In [10]:
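def determine_sentiment(corpus, neg_lexicon, pos_lexicon):
    """Predict 1 (positive) or 0 (negative) per text by comparing lexicon hit counts."""
    y_pred = []
    for text in corpus:
        # Note: `w in text` is a case-sensitive substring test, so 'bad' also matches
        # 'badge', and each lexicon word is counted at most once per text.
        n_pos = sum(w in text for w in pos_lexicon)
        n_neg = sum(w in text for w in neg_lexicon)
        if n_pos > n_neg:
            y_pred.append(1)
        elif n_pos < n_neg:
            y_pred.append(0)
        else:
            # Ties are broken at random, so repeated runs can differ slightly
            y_pred.append(np.random.choice([0, 1]))
    return y_pred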

In [11]:
positive_words = read_words('positive')

In [12]:
negative_words = read_words('negative')

In [28]:
y_pred_lexicon = determine_sentiment(corpus_test, negative_words, positive_words)

In [29]:
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, y_pred_lexicon)

0.66536

#### 1.b. *Advanced rule-based approach*: VADER Sentiment Analysis

[VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for sentiment analysis that takes into account not only the polarity of a sentiment (positive vs. negative) but also its intensity.
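
To make the difference between polarity and intensity concrete, here is a toy sketch of a valence-based scorer. It is not VADER's implementation: the lexicon values, the booster words and the function name `toy_intensity_score` are all made up for illustration, and the real library applies many more rules.

```python
# Toy valence lexicon: the sign gives the polarity, the magnitude the intensity.
# All values below are invented for this example and are NOT taken from VADER.
TOY_VALENCE = {"good": 1.5, "great": 3.0, "bad": -2.0, "horrible": -3.0}

# Hypothetical booster words that strengthen the word that follows them.
TOY_BOOSTERS = {"very": 0.5, "extremely": 1.0}


def toy_intensity_score(text):
    """Sum the valence of all known words, pushing boosted words further from zero."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_VALENCE.get(token)
        if valence is None:
            continue
        if i > 0 and tokens[i - 1] in TOY_BOOSTERS:
            boost = TOY_BOOSTERS[tokens[i - 1]]
            # Boost in the direction of the word's own polarity.
            valence += boost if valence > 0 else -boost
        total += valence
    return total


for sentence in ["good movie", "very good movie", "great movie", "extremely bad movie"]:
    print(sentence, toy_intensity_score(sentence))
```

A plain positive/negative word list would treat "good" and "great" as the same signal, whereas a valence lexicon can rank them.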

In [3]:
#!pip install vaderSentiment

In [13]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

**Your tasks**

1. Take a look at the VADER GitHub repo and try to answer these questions: https://github.com/cjhutto/vaderSentiment

    * Locate the "lexicon" (dictionary). What can we find in the dictionary, and more specifically: what do the values in the file represent (check out the README)?
    * Locate the implementation of the "rules":
        * Does VADER take punctuation into account?
        * Which words intensify a sentiment?
        * What happens if one word is in ALL CAPS? What if the whole text is in ALL CAPS?


2. Implement sentiment analysis using VADER, following the README file here: https://github.com/cjhutto/vaderSentiment#code-examples

    * For each review in your test corpus, determine the sentiment (positive or negative), and compare that with the labels for your test set to determine accuracy
    * How does this compare with the accuracy of the simple lexicon-based approach?


3. For your project:

    * Get tweets from MongoDB
    * Clean the tweets
    * Do sentiment analysis with VADER
    * Save tweet and sentiment in postgres
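
For the "clean the tweets" step, a minimal sketch could look like this. The function name `clean_tweet` and the choice of what to remove are assumptions for illustration. Since punctuation and capitalization are among the things the VADER questions above ask about, the sketch leaves them untouched and only strips retweet markers, URLs, mentions and the `#` sign.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+:?")   # also drops the colon after "RT @user:"
RETWEET_PATTERN = re.compile(r"^RT\s+")


def clean_tweet(text):
    """Remove retweet markers, URLs and mentions; keep hashtag words, case and punctuation."""
    text = RETWEET_PATTERN.sub("", text)
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = text.replace("#", "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_tweet("RT @user: LOVED this #movie!!! https://t.co/abc"))
```

Reading from MongoDB and writing to Postgres depend on your own connection setup and schema, so they are left out here.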

In [34]:
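def determine_sentiment_using_vader(sentences):
    """Label each text 1 (positive) if VADER's compound score is above 0, else 0."""
    analyzer = SentimentIntensityAnalyzer()
    y_pred = []
    for sentence in sentences:
        # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'
        vs = analyzer.polarity_scores(sentence)
        # A compound score of exactly 0 is counted as negative here
        y_pred.append(1 if vs['compound'] > 0 else 0)
#         print("{:-<5} {}".format(sentence, str(vs)))

    return y_pred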

# sentences = [
#     'This movie was good',
#     'This movie was GOOD',
#     'This movie WAS good',
#     'This MOVIE was good',
#     'THIS MOVIE WAS GOOD',
#     'This movie was goooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooood',
# ]
# determine_sentiment_using_vader(sentences)

In [37]:
y_pred_lexicon_vader = determine_sentiment_using_vader(corpus_test)

In [38]:
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score(y_test, y_pred_lexicon_vader)

0.6974
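
The function above splits at a compound score of 0. The VADER README instead suggests treating scores of 0.05 and above as positive, -0.05 and below as negative, and everything in between as neutral. A small sketch of that rule, with the hypothetical helper name `label_from_compound`:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


for score in [0.8, 0.02, -0.3]:
    print(score, label_from_compound(score))
```

Because the IMDB labels are binary, reviews that land in the neutral band would still need to be assigned to one of the two classes before computing accuracy.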

In [42]:
# str.strip() treats its argument as a set of characters, so '//' behaves like '/'
s = '///asdasd//ff/'
s.strip('//')

'asdasd//ff'