# Sentiment to Spyplanes

You can see the content this notebook was based on (with a lot more words) [right over here](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/).

Our sentences:

* I love this kitten
* That article was pure garbage
* Your feedback is appreciated :)
* Your feedback is appreciated 🤮
* That restaurant was great, but I'm not sure if I'll go there again!

Before we get started on sentiment, though, we need to **do a little setup.**

## Install what needs installing

We'll need to install a few tools before we move on.

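* **matplotlib:** graphing library
* **pandas:** data analysis (although we're only using it to build a table)
* **NLTK:** text and sentiment analysis tool (old workhorse)
* **TextBlob:** text and sentiment analysis tool (a bit more convenient than NLTK)
* **eli5:** explains what our classifiers are paying attention to (we'll use it near the end)
* **twython:** Twitter client library that NLTK's sentiment tools look for when they load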

In [None]:
!pip install matplotlib pandas nltk textblob eli5 twython

And now a little additional setup for our old friend NLTK.

In [None]:
import nltk

nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('movie_reviews')

Download a couple of datasets for later...

In [None]:
!wget --quiet -O reviews-marked.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/reviews-marked.csv?raw=true"
!wget --quiet -O sentiment140-subset.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/sentiment140-subset.csv?raw=true"

# Scoring our sentences

Let's feed our sentences into **NLTK** and see what happens.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()

sia.polarity_scores("I love this kitten")

In [None]:
text = "I hate this keyboard"
sia.polarity_scores(text)

In [None]:
text = "Your feedback is appreciated :)"
sia.polarity_scores(text)

In [None]:
text = "Your feedback is appreciated 🤮"
sia.polarity_scores(text)

In [None]:
text = "That restaurant was great, but I'm not sure if I'll go there again"
sia.polarity_scores(text)

In [None]:
text = "This article was pure garbage"
sia.polarity_scores(text)
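
The `compound` value is the one most people read off these results: a single score between -1 and +1. A common convention, suggested in the VADER project's README, is to call anything at or above 0.05 positive, anything at or below -0.05 negative, and everything in between neutral. Here's a minimal sketch of that rule. `label_compound` is a made-up helper name, not part of NLTK, and the scores in the loop are invented for illustration.

```python
def label_compound(compound, threshold=0.05):
    """Turn a VADER compound score (-1 to +1) into a rough label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Invented scores for illustration, not real VADER output
for score in [0.6, 0.0, -0.4]:
    print(score, label_compound(score))
```

To use it on a real sentence you'd pass in `sia.polarity_scores(text)['compound']`.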

## TextBlob

TextBlob is another library for performing text analysis, and it has **two ways** of performing sentiment analysis.

### Option A

In [None]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

In [None]:
blob = TextBlob("I love this kitten")
blob.sentiment

In [None]:
blob = TextBlob("I hate this keyboard")
blob.sentiment

In [None]:
blob = TextBlob("This article was pure garbage")
blob.sentiment

### Option B

In [None]:
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

blob = blobber("This article was pure garbage")
blob.sentiment

# Comparing all of our sentiment analysis tools

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)

sentences = pd.DataFrame({'content': [
    "I love this kitten",
    "I hate keyboard",
    "I appreciate the feedback :)",
    "I appreciate the feedback 🤮",
    "This article was garbage",
    "This article was pure garbage",
    "That restaurant was great, but I'm not sure if I'll go there again",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
    "Sick moves, bro",
    "ur a nazi",
]})

sentences

In [None]:
def get_scores(content):
    blob = TextBlob(content)
    nb_blob = blobber(content)
    sia_scores = sia.polarity_scores(content)
    
    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
    })

scores = sentences.content.apply(get_scores)
scores.style.background_gradient(cmap='RdYlGn', axis=None, low=0.4, high=0.4)
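
Why subtract `p_neg` from `p_pos`? TextBlob's default polarity and NLTK's compound score both already run from -1 to +1, but the Naive Bayes analyzer hands back two probabilities that add up to 1. Subtracting one from the other stretches them onto the same -1 to +1 scale so the columns can be compared. A tiny sketch, where `bayes_to_polarity` is a hypothetical helper and the probabilities are invented:

```python
def bayes_to_polarity(p_pos, p_neg):
    """Map a pair of probabilities (summing to 1) onto a -1 to +1 scale."""
    return p_pos - p_neg

# Invented probabilities for illustration
print(bayes_to_polarity(1.0, 0.0))    # certain it's positive
print(bayes_to_polarity(0.5, 0.5))    # coin flip
print(bayes_to_polarity(0.25, 0.75))  # leaning negative
```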

## What's it used for?

* UpShot's Trump + State of the Union: https://www.nytimes.com/interactive/2017/02/28/upshot/trump-sounds-different-tone-in-first-address-to-congress.html
* WaPo's App Stores: https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/
* AJC's Doctors and Sex Abuse: http://doctors.ajc.com/
* BuzzFeed's Spies in the Skies: https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes
* Trump on Twitter: https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-presidency.html

# Building our sentiment analysis tools

We'll start by reading in a list of tweets that are tagged as either positive or negative.

In [None]:
import pandas as pd

df = pd.read_csv("sentiment140-subset.csv")
df.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
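# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() works in 1.0 and later
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())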
words_df.head()
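
If you're wondering what those numbers in the table are, here's a rough, pure-Python sketch of the TF-IDF idea: count each word, scale it up if the word is rare across documents, then normalize each row. It follows what I understand to be scikit-learn's default smoothed formula, but it is a simplification, not the library's code. The `tfidf` function is hypothetical, and it splits on spaces, while `TfidfVectorizer` has its own tokenizer (which, by default, skips one-letter words like "I").

```python
import math
from collections import Counter

def tfidf(docs):
    """Simplified TF-IDF: word count * smoothed IDF, then L2-normalize each row."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Guard against empty documents so we never divide by zero
        norm = math.sqrt(sum(value ** 2 for value in raw.values())) or 1.0
        rows.append({word: value / norm for word, value in raw.items()})
    return rows

docs = ["i love this kitten", "i hate this keyboard"]
for row in tfidf(docs):
    print({word: round(score, 2) for word, score in row.items()})
```

Words that show up in both sentences ("i", "this") end up with lower scores than the words unique to each one.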

## Build our classifiers

Now that we have a list of words, we can say hey, learn to associate the appearance of these words with either positivity or negativity!

And did I mention that not only do we get to pick our dataset, there are also **multiple kinds of classifiers?** Let's try two.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# TF-IDF word scores + positive/negative
X = words_df
y = df.polarity

# Train a LinearSVC classifier
svc = LinearSVC()
svc.fit(X, y)

# Train a Multinomial Naive Bayes classifier
bayes = MultinomialNB()
bayes.fit(X, y)

In [None]:
# Score the words in the sentences from before, using the same vocabulary
vectors = vectorizer.transform(sentences.content)

new_scores = sentences.copy()

# SVC predictions
new_scores['pred_svc'] = svc.predict(vectors)
new_scores['svc_score'] = svc.decision_function(vectors)

# Bayes predictions + probabilities
new_scores['pred_bayes'] = bayes.predict(vectors)
# Probability that it's positive
new_scores['bayes_positive_prob'] = bayes.predict_proba(vectors)[:,1]

## Checking out our results

Beware that the scoring here isn't the same as up above! That's why we're skipping out on the coloring this time.

In [None]:
new_scores
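
The two score columns live on different scales. `svc_score` comes from `decision_function`, which gives a signed distance from the classifier's dividing line: it can be any number, and the sign picks the side. `bayes_positive_prob` is a probability between 0 and 1, with 0.5 as the tipping point. A small sketch of how you might read each one, assuming (as the notebook does when it takes column 1 of `predict_proba`) that the second class is the positive one. Both helper names are made up and the numbers are invented.

```python
def side_from_margin(margin):
    """LinearSVC-style score: the sign picks the class, the size hints at confidence."""
    return "positive" if margin > 0 else "negative"

def side_from_probability(prob_positive):
    """Naive Bayes-style score: above 0.5 leans positive."""
    return "positive" if prob_positive > 0.5 else "negative"

# Invented scores for illustration
print(side_from_margin(1.3), side_from_margin(-0.2))
print(side_from_probability(0.9), side_from_probability(0.4))
```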

## Explaining our classifiers

In [None]:
import eli5

eli5.show_weights(svc, vec=vectorizer, top=(5, 5))
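
Roughly speaking, that table is the classifier's per-word weights, sorted, with the five most positive and five most negative words kept. Here's a pure-Python sketch of the same idea. It is not eli5's implementation; `top_weights` is a hypothetical helper and the words and weights are invented.

```python
def top_weights(feature_names, weights, n_pos=5, n_neg=5):
    """Return the most positive and most negative (word, weight) pairs."""
    ranked = sorted(zip(weights, feature_names), reverse=True)
    positive = [(name, w) for w, name in ranked[:n_pos] if w > 0]
    negative = [(name, w) for w, name in ranked[::-1][:n_neg] if w < 0]
    return positive, negative

# Invented words and weights for illustration
names = ["love", "hate", "great", "the", "awful"]
weights = [2.1, -1.8, 1.5, 0.05, -2.4]

positive, negative = top_weights(names, weights, n_pos=2, n_neg=2)
print("pushes positive:", positive)
print("pushes negative:", negative)
```

With a real linear model you could feed it the vectorizer's feature names and the first row of `svc.coef_`.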

# Classifying with the Washington Post

We'll be reproducing part of [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/?arc404=true), from the Washington Post.

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", 300)

# Read in our data, then drop ones without a text
# review and get rid of a few unwanted columns
df = pd.read_csv("reviews-marked.csv")
df = df.dropna(subset=['Review'])
df = df.drop(columns=['Country', 'Date', 'Version'])
df.head()

Split our dataset into ones we've labeled and ones that don't have labels yet.

In [None]:
known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()

Count the words inside each review.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(known.Review)

# Build a dataframe of words, purely out of curiosity
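# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() works in 1.0 and later
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())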
words_df.head(5)

Train a classifier to understand the difference between the two categories.

In [None]:
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(known.Review)

X = matrix
y = known.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
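
What's `class_weight='balanced'` doing? When one category is much rarer than the other, a classifier can score well by mostly ignoring it. The balanced setting gives each class a weight of `n_samples / (n_classes * count)`, which is the formula in scikit-learn's documentation, so mistakes on the rare class cost more. A sketch of the arithmetic with an invented 90/10 split; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Invented labels: 90 of one class, 10 of the other
labels = [0] * 90 + [1] * 10
print(balanced_class_weights(labels))
```

The rare class ends up weighted nine times as heavily as the common one.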

In [None]:
X = vectorizer.transform(unknown.Review)

unknown['predicted'] = clf.predict(X)
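# Despite the column name, decision_function returns a signed distance from
# the decision boundary, not a probability. Higher means a stronger match.
unknown['predicted_proba'] = clf.decision_function(X)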

How many are in each category?

In [None]:
unknown.predicted.value_counts()

Which ones might we be interested in?

In [None]:
unknown.sort_values(by='predicted_proba', ascending=False).head(10)

What does it base those decisions on?

In [None]:
import eli5

eli5.explain_weights(clf, vec=vectorizer)