# Sentiment to Spyplanes

You can see the content this notebook was based on (with a lot more words) [right over here](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/).

Our sentences:

* I love this kitten
* That article was pure garbage
* Your feedback is appreciated :)
* Your feedback is appreciated 🤮
* That restaurant was great, but I'm not sure if I'll go there again!

Before we get started on sentiment, though, we need to **do a little setup.**

## Install what needs installing

We'll need to install a few tools before we move on.

* **matplotlib:** graphing library
* **pandas:** data analysis (although we're only using it to build a table)
* **NLTK:** text and sentiment analysis tool (old workhorse)
* **TextBlob:** text and sentiment analysis tool (a bit more convenient than NLTK)
* **scikit-learn:** machine learning library (for building our own classifiers later on)
* **eli5:** explains what a classifier has learned

In [1]:
!pip install matplotlib pandas nltk textblob eli5 twython scikit-learn


And now a little additional setup for our old friend NLTK.

In [2]:
import nltk

nltk.download('vader_lexicon')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/soma/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /Users/soma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Download a couple datasets for later...

In [32]:
!wget --quiet -O reviews-marked.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/reviews-marked.csv?raw=true"
!wget --quiet -O sentiment140-subset.csv "https://github.com/jsoma/sentiment-to-spyplanes/blob/master/sentiment140-subset.csv?raw=true"

# Scoring our sentences

Let's feed our sentences into **NLTK** and see what happens.

In [3]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()

sia.polarity_scores("I love this kitten")



{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.6369}

In [4]:
text = "I hate this keyboard"
sia.polarity_scores(text)

{'neg': 0.649, 'neu': 0.351, 'pos': 0.0, 'compound': -0.5719}

In [5]:
text = "Your feedback is appreciated :)"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.743}

In [6]:
text = "Your feedback is appreciated 🤮"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 0.476, 'pos': 0.524, 'compound': 0.5106}

In [7]:
text = "That restaurant was great, but I'm not sure if I'll go there again"
sia.polarity_scores(text)

{'neg': 0.153, 'neu': 0.688, 'pos': 0.159, 'compound': 0.0276}

In [8]:
text = "This article was pure garbage"
sia.polarity_scores(text)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
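To see where a `compound` value comes from, here is a minimal sketch of the normalization VADER applies to the summed word valences. It assumes the lexicon rates "love" at 3.2 and "hate" at -2.7 and that no other word in those sentences contributes; both values are assumptions about the installed `vader_lexicon`, though they do reproduce the compound scores above. The 0.05 cutoffs in `label` are the conventional thresholds suggested by VADER's authors, not something NLTK applies for you.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash a summed valence into the range (-1, 1), VADER-style."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def label(compound, threshold=0.05):
    """Turn a compound score into a coarse label (conventional cutoffs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Assumed lexicon valences: "love" = 3.2, "hate" = -2.7
for word, valence in [("love", 3.2), ("hate", -2.7)]:
    compound = normalize(valence)
    print(word, round(compound, 4), label(compound))

# The restaurant sentence scored 0.0276, which lands inside the neutral band
print(label(0.0276))
```

Seen this way, "This article was pure garbage" scoring 0.0 suggests that none of its words carried any valence in the lexicon.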

## TextBlob

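TextBlob has **two ways** of performing sentiment analysis: the default analyzer, which returns a polarity and a subjectivity score, and the `NaiveBayesAnalyzer`, which returns a classification along with positive and negative probabilities.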

In [9]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

In [10]:
blob = TextBlob("I love this kitten")
blob.sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

In [11]:
blob = TextBlob("I hate this keyboard")
blob.sentiment

Sentiment(polarity=-0.8, subjectivity=0.9)

In [12]:
blob = TextBlob("This article was pure garbage")
blob.sentiment

Sentiment(polarity=0.21428571428571427, subjectivity=0.5)

In [13]:
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

blob = blobber("This article was pure garbage")
blob.sentiment

Sentiment(classification='neg', p_pos=0.3898306696279278, p_neg=0.610169330372073)

## Comparison

In [14]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)

sentences = pd.DataFrame({'content': [
    "I love this kitten",
    "I hate keyboard",
    "I appreciate the feedback :)",
    "I appreciate the feedback 🤮",
    "This article was garbage",
    "This article was pure garbage",
    "That restaurant was great, but I'm not sure if I'll go there again",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
    "Sick moves, bro",
    "ur a nazi",
]})

sentences

| | content |
|---|---|
| 0 | I love this kitten |
| 1 | I hate keyboard |
| 2 | I appreciate the feedback :) |
| 3 | I appreciate the feedback 🤮 |
| 4 | This article was garbage |
| 5 | This article was pure garbage |
| 6 | That restaurant was great, but I'm not sure if I'll go there again |
| 7 | I'm not sure how I feel about toast |
| 8 | Did you see the baseball game yesterday? |
| 9 | The package was delivered late and the contents were broken |


In [15]:
def get_scores(content):
    blob = TextBlob(content)
    nb_blob = blobber(content)
    sia_scores = sia.polarity_scores(content)
    
    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
    })

scores = sentences.content.apply(get_scores)
scores.style.background_gradient(cmap='RdYlGn', axis=None, low=0.4, high=0.4)

| | content | textblob | textblob_bayes | nltk |
|---|---|---|---|---|
| 0 | I love this kitten | 0.5 | -0.0879325 | 0.6369 |
| 1 | I hate keyboard | -0.8 | -0.206089 | -0.5719 |
| 2 | I appreciate the feedback :) | 0.5 | -0.299545 | 0.6908 |
| 3 | I appreciate the feedback 🤮 | 0.0 | -0.299545 | 0.4019 |
| 4 | This article was garbage | 0.0 | -0.519103 | 0.0 |
| 5 | This article was pure garbage | 0.214286 | -0.220339 | 0.0 |
| 6 | That restaurant was great, but I'm not sure if I'll go there again | 0.275 | 0.186505 | 0.0276 |
| 7 | I'm not sure how I feel about toast | -0.25 | 0.394659 | -0.2411 |
| 8 | Did you see the baseball game yesterday? | -0.4 | 0.61305 | 0.0 |
| 9 | The package was delivered late and the contents were broken | -0.35 | -0.57427 | -0.4767 |
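
The `textblob_bayes` column is `p_pos - p_neg`, which rescales the Naive Bayes probabilities onto the same -1 to 1 range as the other two tools. A small sketch of that arithmetic, using the probabilities printed earlier for "This article was pure garbage" (the helper name `signed_score` is made up for illustration):

```python
def signed_score(p_pos, p_neg):
    """Map a pair of class probabilities onto a -1 (negative) to 1 (positive) scale."""
    return p_pos - p_neg

# Probabilities from the NaiveBayesAnalyzer output earlier in the notebook
p_pos = 0.3898306696279278
p_neg = 0.610169330372073

score = signed_score(p_pos, p_neg)
print(round(score, 6))

# Because the two probabilities sum to 1, this is the same as 2 * p_pos - 1
print(round(2 * p_pos - 1, 6))
```

The result matches row 5 of the table, so a value near 0 here means the Bayes analyzer was close to a coin flip rather than confidently neutral.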


* https://www.nytimes.com/interactive/2017/02/28/upshot/trump-sounds-different-tone-in-first-address-to-congress.html
* https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-presidency.html
* https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/
* http://doctors.ajc.com/
* https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes

# Building our own

We'll start by reading in a list of tweets that are tagged as either positive (`1`) or negative (`0`).

In [16]:
import pandas as pd

df = pd.read_csv("sentiment140-subset.csv")
df.head()

| | polarity | text |
|---|---|---|
| 0 | 0 | @kconsidder You never tweet |
| 1 | 0 | Sick today coding from the couch. |
| 2 | 1 | @ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that |
| 3 | 1 | Wii fit says I've lost 10 pounds since last time |
| 4 | 0 | @MrKinetik Not a thing!!! I don't really have a life..... |


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,...,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.334095,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.427465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
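
Each number in that table is a TF-IDF weight: a word's count in the tweet, scaled up for words that are rare across the dataset, then normalized so each tweet's vector has length 1. The sketch below hand-rolls that calculation for three made-up tweets. It follows scikit-learn's documented defaults (smoothed IDF, L2 normalization) but uses a naive whitespace tokenizer, so treat it as an illustration rather than a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Three made-up documents, purely for illustration
docs = ["you never tweet", "you are sick today", "sick sick sick"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf(tokens):
    """Return an L2-normalized TF-IDF vector as a {word: weight} dict."""
    counts = Counter(tokens)
    raw = {word: counts[word] * idf[word] for word in counts}
    length = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / length for word, weight in raw.items()}

for tokens in tokenized:
    print({word: round(weight, 3) for word, weight in tfidf(tokens).items()})
```

In the first document "never" ends up with a larger weight than "you", because "you" also appears in the second document and is therefore less distinctive.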


## Build our classifiers

Now that we have a list of words, we can say hey, learn to associate the appearance of these words with either positivity or negativity!

And did I mention that not only do we get to pick our dataset, there are also **multiple kinds of classifiers?** Let's try two.

In [18]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

# TF-IDF word scores (features) + positive/negative labels (target)
X = words_df
y = df.polarity

# Train a LinearSVC classifier
svc = LinearSVC()
svc.fit(X, y)

# Train a Multinomial Naive Bayes classifier
bayes = MultinomialNB()
bayes.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [19]:
# Vectorize the sentences from before, using the vocabulary learned from the tweets
vectors = vectorizer.transform(sentences.content)

new_scores = sentences.copy()

# SVC predictions
new_scores['pred_svc'] = svc.predict(vectors)
new_scores['svc_score'] = svc.decision_function(vectors)

# Bayes predictions + probabilities
new_scores['pred_bayes'] = bayes.predict(vectors)
# Probability that it's positive
new_scores['bayes_positive_prob'] = bayes.predict_proba(vectors)[:,1]

## Checking out our results

Beware that the scoring here isn't the same as up above! Neither `svc_score` nor `bayes_positive_prob` runs from -1 to 1 like the earlier tools did. That's why we're skipping out on the coloring this time.

In [20]:
new_scores

| | content | pred_svc | svc_score | pred_bayes | bayes_positive_prob |
|---|---|---|---|---|---|
| 0 | I love this kitten | 1 | 0.719146 | 1 | 0.67268 |
| 1 | I hate keyboard | 0 | -1.498996 | 0 | 0.123175 |
| 2 | I appreciate the feedback :) | 1 | 0.828317 | 1 | 0.843951 |
| 3 | I appreciate the feedback 🤮 | 1 | 0.828317 | 1 | 0.843951 |
| 4 | This article was garbage | 0 | -0.302569 | 0 | 0.412105 |
| 5 | This article was pure garbage | 0 | -0.302569 | 0 | 0.412105 |
| 6 | That restaurant was great, but I'm not sure if I'll go there again | 0 | -0.038524 | 1 | 0.533919 |
| 7 | I'm not sure how I feel about toast | 0 | -0.524692 | 0 | 0.416819 |
| 8 | Did you see the baseball game yesterday? | 1 | 0.162518 | 1 | 0.509662 |
| 9 | The package was delivered late and the contents were broken | 0 | -0.924342 | 0 | 0.219788 |
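
A short sketch of how the predicted labels follow from the two score columns, using a few rows copied from the table above. It assumes the usual cutoffs for a two-class problem: 0 for the SVM's decision function and 0.5 for the Naive Bayes probability.

```python
# (content, svc_score, bayes_positive_prob) copied from the results table
rows = [
    ("I love this kitten", 0.719146, 0.67268),
    ("I hate keyboard", -1.498996, 0.123175),
    ("That restaurant was great, but I'm not sure if I'll go there again", -0.038524, 0.533919),
    ("Did you see the baseball game yesterday?", 0.162518, 0.509662),
]

# SVC: sign of the decision function. Bayes: is positive the likelier class?
predictions = [
    (int(svc_score > 0), int(bayes_prob > 0.5))
    for _, svc_score, bayes_prob in rows
]

for (content, _, _), (pred_svc, pred_bayes) in zip(rows, predictions):
    verdict = "agree" if pred_svc == pred_bayes else "disagree"
    print(f"{content[:30]:<30} svc={pred_svc} bayes={pred_bayes} ({verdict})")
```

The restaurant sentence is the one place the two classifiers disagree, and both of its scores sit very close to their cutoffs.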


## Explaining our classifiers

In [21]:
import eli5

eli5.show_weights(svc, vec=vectorizer, top=(5, 5))

| Weight | Feature |
|---|---|
| +1.560 | worry |
| +1.520 | welcome |
| +1.363 | smile |
| +1.322 | excited |
| +1.314 | thx |
| … 452 more positive … | |
| … 539 more negative … | |
| -1.950 | wont |
| -2.008 | sadly |
| -2.065 | shame |


# Classifying with the Washington Post

We'll be reproducing part of [Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors](https://www.washingtonpost.com/technology/2019/11/22/apple-says-its-app-store-is-safe-trusted-place-we-found-reports-unwanted-sexual-behavior-six-apps-some-targeting-minors/?arc404=true), from the Washington Post.

In [22]:
import pandas as pd
pd.set_option("display.max_colwidth", 300)

# Read in our data, then drop ones without a text
# review and get rid of a few unwanted columns
df = pd.read_csv("reviews-marked.csv")
df = df.dropna(subset=['Review'])
df = df.drop(columns=['Country', 'Date', 'Version'])
df.head()

Unnamed: 0,Rating,Review,source,racism,bullying,sexual
0,5,It’s a great app to meet new people and chat in very satisfied with downloading this app i recommend this app if you like to chat or just to meet new people. And you can choose which country To find different users!,holla,,,
1,5,"Holla is an excellent app, where I get to know new people every time and even get to make new friends. I truly recommend this application to all people!",holla,,,
2,1,Get rid of micro transactions or i will find a new app to use. Why should i have to pay for that it’s so stupid,holla,0.0,0.0,0.0
3,5,"Free to use app, meet people around the world.",holla,,,
4,5,I got this app and everything has been different. I’ve met so many interesting people. From around the world. I was recently reunited with my high school girlfriend. We’re getting married. I met and married The love of my Life thanks to Holla. Thanks Holla!!!!!,holla,,,


In [23]:
known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(known.Review)

# Build a dataframe of words, purely out of curiosity
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(5)

Unnamed: 0,000,10,100,13,14,15,16,17,18,19,...,للدردشة,للرمنسية,للعب,مخصص,مكان,من,نطاق,والصداقة,وضع,ومكان
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.115461,0.0,0.10512,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(known.Review)

X = matrix
y = known.sexual

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)

LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
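
`class_weight='balanced'` asks the classifier to weight each class inversely to how often it appears, which helps when one label is much rarer than the other. scikit-learn documents the weight for each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to made-up labels; the 950/50 split is hypothetical and not taken from `reviews-marked.csv`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 950 unflagged reviews, 50 flagged ones
labels = [0.0] * 950 + [1.0] * 50
weights = balanced_class_weights(labels)
print(weights)
```

With that split, a mistake on a flagged review costs roughly 19 times as much as a mistake on an unflagged one, so the classifier can't get away with predicting 0 for everything.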

In [26]:
X = vectorizer.transform(unknown.Review)

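unknown['predicted'] = clf.predict(X)
# Despite the column name, decision_function returns a signed distance from
# the decision boundary, not a probability (LinearSVC has no predict_proba)
unknown['predicted_proba'] = clf.decision_function(X)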

In [27]:
unknown.predicted.value_counts()


0.0    55247
1.0      478
Name: predicted, dtype: int64

In [28]:
unknown.sort_values(by='predicted_proba', ascending=False).head(10)

Unnamed: 0,Rating,Review,source,racism,bullying,sexual,predicted,predicted_proba
53538,3,I like this app but there is so many horny guys and they are all 30 and asking for nudes,chat-for-strangers,,,,1.0,1.153481
30392,1,Six 18yd men asked for nudes I’m 13,holla,,,,1.0,1.129564
39875,1,The only thing you’re going to get on the site is fake news there is asking me to go on to another site to pay to watch their nudes,skout,,,,1.0,1.077133
16547,2,Just a bunch of horny guys wanting nudes,holla,,,,1.0,1.03121
19428,2,All the guys on here ever ask for is nudes like I don't want to send my nudes to you,skout,,,,1.0,1.008181
20616,1,"Almost all the guys on the app ask girls for nudes and if you don't send them it they'll literally get upset and unfriend you on either snapchat or yellow itself, the people on there is shallow. 🤧",holla,,,,1.0,0.878374
22834,1,Just a bunch of guys that want to show off their privates,holla,,,,1.0,0.867645
21071,1,I just want to say that all these guys downloading this or reviewing just to get girls to send nudes are asking to be trolled.,chat-for-strangers,,,,1.0,0.850985
22327,1,"The app is old men on there, guys harass you, they treat women on there like we want sex and they say they’ll pay you for sex smh. This app needs to be shut down a lot of creepy old guys and some creepy young guys. They don’t read your profile they just harass you over and over again. The women ...",skout,,,,1.0,0.803721
22904,4,There are guys using hot photos of girls \nAnd in their\nMen searching for women.,skout,,,,1.0,0.76675


In [29]:
import eli5

eli5.explain_weights(clf, vec=vectorizer)

| Weight | Feature |
|---|---|
| +1.901 | nudes |
| +1.589 | guys |
| +1.413 | men |
| +1.251 | thing |
| +1.219 | filter |
| +1.217 | without |
| +1.131 | report |
| +1.053 | off |
| +0.975 | video |
| +0.965 | on |
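
These weights are all a linear classifier has to work with: a review's score is the sum of each word's weight times its TF-IDF value, plus an intercept, and a positive score means class 1. The sketch below uses three weights from the table above; the TF-IDF values and the intercept are invented for illustration, so the resulting number is not a real model output.

```python
# Weights copied from the eli5 table above
weights = {"nudes": 1.901, "guys": 1.589, "report": 1.131}

# Hypothetical TF-IDF values for one review, and a made-up intercept
review_tfidf = {"nudes": 0.6, "guys": 0.5, "app": 0.4}
intercept = -0.9

# Words without a listed weight contribute nothing in this sketch
score = intercept + sum(
    weights.get(word, 0.0) * value for word, value in review_tfidf.items()
)
prediction = 1 if score > 0 else 0
print(round(score, 4), prediction)
```

This is also why a word like "guys" deserves a second look: it pushes every review that contains it toward the flagged class, whether or not the review describes unwanted behavior.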
