<a href="https://colab.research.google.com/github/jmbanda/CSC8980_NLP_Spring2021/blob/main/Class_15_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Designing your own sentiment analysis tool

While there are a lot of tools that will automatically give us the sentiment of a piece of text, we learned that they don't always agree! Let's design our own to see both how these tools work internally and how we can test them to see how well they perform.

**Original Source: https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/investigating-sentiment-analysis/notebooks/Designing%20your%20own%20sentiment%20analysis%20tool.ipynb.**

### Prep work: Downloading necessary files
Before we get started, we need to download all of the data we'll be using.
* **sentiment140-subset.csv:** cleaned subset of Sentiment140 data - half a million tweets marked as positive or negative


In [1]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip -P data
!unzip -n -d data data/sentiment140-subset.csv.zip

--2021-03-02 23:49:31--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/sentiment140-subset.csv.zip
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17927149 (17M) [application/zip]
Saving to: ‘data/sentiment140-subset.csv.zip’


2021-03-02 23:49:32 (23.2 MB/s) - ‘data/sentiment140-subset.csv.zip’ saved [17927149/17927149]

Archive:  data/sentiment140-subset.csv.zip
  inflating: data/sentiment140-subset.csv  


## Training on tweets

Let's say we were going to analyze the sentiment of tweets. If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.

Luckily, we have **Sentiment140** - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score as to whether they're negative or positive. We'll use it to build our own machine learning algorithm to separate positivity from negativity.

### Read in our data

In [2]:
import pandas as pd

df = pd.read_csv("data/sentiment140-subset.csv", nrows=30000)
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was ..."
3,1,Wii fit says I've lost 10 pounds since last ti...
4,0,@MrKinetik Not a thing!!! I don't really have...


It isn't a very complicated dataset. `polarity` is `1` if the tweet is positive and `0` if it's negative, and `text` is the text of the tweet itself.

How many rows do we have?

In [3]:
df.shape

(30000, 2)

How many **positive** tweets compared to how many **negative** tweets?

In [4]:
df.polarity.value_counts()

1    15064
0    14936
Name: polarity, dtype: int64

## Train our algorithm


### Vectorize our tweets

Create a `TfidfVectorizer` and use it to vectorize our tweets. Since we don't have all the time in the world, we should probably use `max_features` to only take a selection of terms - how about 1000 for now?

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
# Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334095,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.427465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
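
To get a feel for what the vectorizer is doing, here is a minimal sketch on three made-up sentences (the `toy_` names and the sentences are invented for illustration, they aren't part of the dataset). It shows two things: `max_features` keeps only the most frequent words, and each row comes back as a set of weights rather than raw counts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented "tweets" - not from Sentiment140
toy_tweets = [
    "I love this kitten",
    "I hate this keyboard",
    "love love hate",
]

# Keep only the 3 most frequent words, the same idea as max_features=1000 above
toy_vectorizer = TfidfVectorizer(max_features=3)
toy_vectors = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each kept word to its column number
print(sorted(toy_vectorizer.vocabulary_))

# One row per sentence, one column per kept word
print(toy_vectors.shape)

# Each cell is a TF-IDF weight, not a raw count
print(toy_vectors.toarray().round(2))
```

Notice that "I" never shows up: by default the vectorizer ignores one-character words. "kitten" and "keyboard" are missing too, since they weren't frequent enough to make the cut.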


### Setting up our variables

Because we want to fit in with all the other programmers, we need to create two variables: one called `X` and one called `y`.

`X` is all of our **features**, the things we use to predict positive or negative. That's going to be our words.

`y` is all of our **labels**, the positive or negative rating. We'll use the `polarity` column for that.

In [7]:
X = words_df
y = df.polarity

### Picking an algorithm

What kind of algorithm do we want? Who knows, we don't know anything about machine learning! **Let's just pick ALL OF THEM.**

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

### Training our algorithms

When we teach our algorithm about what a positive or a negative tweet looks like, this is called **training**. Training can take different amounts of time based on what kind of algorithm you are using.

In [11]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 14 s, sys: 821 ms, total: 14.8 s
Wall time: 7.52 s


In [12]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 29.9 s, sys: 87.5 ms, total: 30 s
Wall time: 30 s


In [15]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 395 ms, sys: 0 ns, total: 395 ms
Wall time: 400 ms


In [16]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 184 ms, sys: 4.92 ms, total: 189 ms
Wall time: 127 ms


Think about: **How long did each take to train?** How much faster were some compared to others?
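
If you'd rather collect the timings in a variable than read them off `%%time`, one option is to time each `.fit` yourself. This is only a sketch: it uses a small random dataset (`X_demo` and `y_demo` are made up) so it runs in a moment, and the numbers it prints will differ from machine to machine.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up data: 500 rows, 20 non-negative features (like TF-IDF weights)
rng = np.random.default_rng(0)
X_demo = rng.random((500, 20))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

demo_models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svc": LinearSVC(),
    "bayes": MultinomialNB(),
}

# Time each .fit call separately
timings = {}
for name, model in demo_models.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    timings[name] = time.perf_counter() - start

# Fastest first
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```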

## Use our models

Now that we've trained our models, **they can try to predict whether some content is positive or negative**.

### Preparing the data

**Add a few more sentences below.** They should be a mix of positive and negative. They can be boring, they can be exciting, they can be short, they can be long.

In [17]:
# Create some test data

pd.set_option("display.max_colwidth", 200)

unknown = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})
unknown

Unnamed: 0,content
0,I love love love love this kitten
1,I hate hate hate hate this keyboard
2,I'm not sure how I feel about toast
3,Did you see the baseball game yesterday?
4,The package was delivered late and the contents were broken
5,Trashy television shows are some of my favorites
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
7,"I find chirping birds irritating, but I know I'm not the only one"


First we need to **vectorize** our sentences into numbers, so the algorithm can understand them.

Our algorithm only knows **certain words.** Run `vectorizer.get_feature_names()` to show you the list of the words it knows.

In [18]:
print(vectorizer.get_feature_names())

['10', '100', '11', '12', '15', '1st', '20', '2day', '2nd', '30', 'able', 'about', 'account', 'actually', 'add', 'after', 'afternoon', 'again', 'ago', 'agree', 'ah', 'ahh', 'ahhh', 'air', 'album', 'all', 'almost', 'alone', 'already', 'alright', 'also', 'although', 'always', 'am', 'amazing', 'amp', 'an', 'and', 'annoying', 'another', 'any', 'anymore', 'anyone', 'anything', 'anyway', 'app', 'apparently', 'apple', 'appreciate', 'are', 'around', 'art', 'as', 'ask', 'asleep', 'ass', 'at', 'ate', 'aw', 'awake', 'awards', 'away', 'awesome', 'aww', 'awww', 'baby', 'back', 'bad', 'band', 'bbq', 'bday', 'be', 'beach', 'beautiful', 'because', 'bed', 'been', 'beer', 'before', 'behind', 'being', 'believe', 'best', 'bet', 'better', 'big', 'bike', 'birthday', 'bit', 'bitch', 'black', 'blip', 'blog', 'blue', 'body', 'boo', 'book', 'books', 'bored', 'boring', 'both', 'bought', 'bout', 'box', 'boy', 'boys', 'break', 'breakfast', 'bring', 'bro', 'broke', 'broken', 'brother', 'brothers', 'btw', 'bus', 'bu

Usually when we use the vectorizer, we write code like this:
    
```python
vectors = vectorizer.fit_transform(....)
```

Which both learns all the words **and** counts them. In this case **we already have the list of words we know, we only want to count them.** So instead of `.fit_transform`, we just use `.transform`:

```python
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = ......
```

Finish making your `unknown_words_df` in the cell below.

In [19]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names())
unknown_words_df.head()

Unnamed: 0,10,100,11,12,15,1st,20,2day,2nd,30,able,about,account,actually,add,after,afternoon,again,ago,agree,ah,ahh,ahhh,air,album,all,almost,alone,already,alright,also,although,always,am,amazing,amp,an,and,annoying,another,...,work,worked,working,works,world,worried,worry,worse,worst,worth,would,wouldn,wow,write,writing,wrong,wtf,www,xd,xoxo,xx,xxx,ya,yay,yea,yeah,year,years,yep,yes,yesterday,yet,yo,you,young,your,yourself,youtube,yum,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.417209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.537291,0.0,0.0,0.244939,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215967,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
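
Why are so many of those rows nothing but zeros? A small sketch, using a throwaway vectorizer and invented sentences, shows what `.transform` does with words it never saw during fitting: it silently drops them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A throwaway vectorizer that only ever learns five words
demo_vectorizer = TfidfVectorizer()
demo_vectorizer.fit(["love this kitten", "hate this keyboard"])

# Every word here was seen during fit
known = demo_vectorizer.transform(["love this keyboard"])

# None of these words were seen during fit
unseen = demo_vectorizer.transform(["chirping birds"])

# .nnz is the number of non-zero cells in the row
print(known.shape, known.nnz)
print(unseen.shape, unseen.nnz)
```

The second sentence still gets a row with the right number of columns, but every cell is zero. As far as the model is concerned, that sentence is empty.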


Confirm `unknown_words_df` is 8 rows (one per sentence) and 1,000 columns (one per word the vectorizer knows).

In [20]:
unknown_words_df.shape

(8, 1000)

### Predicting with our models

To make a prediction for each of our sentences, you can use `.predict` with each of our models. For example, it would look like this for logistic regression:

```python
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
```

`.predict` gives you a `0` (negative) or a `1` (positive). With logistic regression, you can **also ask for the probability that the sentence is in the `1` category** instead of just the category itself. To do that, you use this code:

```python
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]
```

**Add new columns for each of the models you trained.** If the model has a `.predict_proba`, add that as a column as well. 

* **Tip:** Tab completion is helpful for knowing whether `.predict_proba` is an option.
* **Tip:** Don't forget the `[:,1]` after `.predict_proba`, it means "give me the probability for category `1`"
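
If the `[:,1]` feels mysterious, this small sketch might help. It trains a logistic regression on four invented data points (`X_demo` and `y_demo` are made up) and shows that `.predict_proba` returns one column per category, in the order listed in `.classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Four made-up rows with a single feature each
X_demo = np.array([[0.0], [0.2], [0.8], [1.0]])
y_demo = np.array([0, 0, 1, 1])

demo_model = LogisticRegression().fit(X_demo, y_demo)
proba = demo_model.predict_proba(X_demo)

# The column order of predict_proba follows classes_
print(demo_model.classes_)

# One row per input, one column per category
print(proba.shape)

# Column 1 is the probability of category 1 (positive)
print(proba[:, 1])
```

Each row of `proba` adds up to 1, so the first column is just one minus the second.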

In [22]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]

In [None]:
unknown

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_proba,pred_svc,pred_bayes,pred_bayes_proba
0,I love love love love this kitten,1,0.950442,1,0.848665,1,1,0.747222
1,I hate hate hate hate this keyboard,0,0.009593,0,0.0,0,0,0.122383
2,I'm not sure how I feel about toast,0,0.180952,0,0.24,0,0,0.416819
3,Did you see the baseball game yesterday?,1,0.615063,1,0.66,1,1,0.509662
4,The package was delivered late and the contents were broken,0,0.058171,0,0.46,0,0,0.219788
5,Trashy television shows are some of my favorites,0,0.330293,0,0.44,0,1,0.534234
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",1,0.558548,0,0.26,1,1,0.533493
7,"I find chirping birds irritating, but I know I'm not the only one",0,0.060122,0,0.44,0,0,0.295739


### Questions

* What do the numbers mean? What's the difference between a 0 and a 1? A 0.5? Negative numbers?
* Were there any sentences the classifiers seemed to disagree about? How do you feel about the amount they disagree? 
* What's the difference between using a 0/1 to talk about sentiment compared to 0-1? When might you use one compared to another?
* What's the difference between a linear regression model (imported above, but never trained) and the classifiers we're using? Why might it fit or not fit this problem?
* Between 0-1, what range do you think counts as "negative," "positive" and "neutral"?
* Does the variation in scores reflect the variation you would see among people? Or is it better or worse?

## Testing our models

We can actually see **which model performs the best!** Remember how we trained our models on tweets? We can ask each model about each tweet, and see if it gets the right answer.

In [23]:
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


Our original dataframe is a list of many, many tweets. We turned this into `X` - vectorized words - and `y` - whether the tweet is negative or positive.

Earlier we used `.fit(X, y)` to train on all of our data. This time, **we can test our models** by doing a train/test split: train on one part of the data, then see if the predictions for the held-out part match the actual labels.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
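
By default `train_test_split` shuffles the rows and holds out 25% of them for testing, and the shuffle is different on every run. Here is a sketch on made-up data (`X_demo` and `y_demo` are invented) with two optional arguments you might want: `random_state` to get the same split every time, and `stratify` to keep the positive/negative balance the same in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up rows, half labelled 0 and half labelled 1
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    random_state=42,   # same split every time you run it
    stratify=y_demo,   # keep the 0/1 balance in both halves
)

# 75% of rows for training, 25% for testing
print(X_tr.shape, X_te.shape)
```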

In [25]:
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 31.7 s, sys: 790 ms, total: 32.5 s
Wall time: 26.8 s


### Confusion matrices

To see how well they did, we'll use a ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) for each one. I think confusion matrices are called that because they are confusing.

In [26]:
from sklearn.metrics import confusion_matrix
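
Before looking at the real ones, here is a sketch with seven invented labels so you can count along by hand. Rows are the actual labels and columns are the predictions, with category `0` first.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: three actual negatives (0), four actual positives (1)
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

matrix_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(matrix_demo)

# Reading left to right, top to bottom:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = matrix_demo.ravel()
print(tn, fp, fn, tp)
```

The top-left and bottom-right cells are the ones the model got right. Everything else is a mistake.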

#### Logistic Regression

In [28]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2722,974
Is positive,886,2918


#### Random forest

In [29]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2764,932
Is positive,1040,2764


#### SVC

In [30]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2719,977
Is positive,882,2922


#### Multinomial Naive Bayes

In [31]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,2772,924
Is positive,983,2821


### Percentage-based confusion matrices

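Those are kind of irritating in that they're just raw counts. Let's try percentages instead. Dividing each row by its total shows what share of the actual negatives (or positives) ended up in each predicted column, so `0.75` means 75%.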

#### Logistic regression

In [33]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.736472,0.263528
Is positive,0.232913,0.767087


#### Random forest

In [34]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.747835,0.252165
Is positive,0.273396,0.726604


#### SVC

In [35]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.73566,0.26434
Is positive,0.231861,0.768139


#### Multinomial Naive Bayes

In [36]:
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.75,0.25
Is positive,0.258412,0.741588
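
The `.div(matrix.sum(axis=1), axis=0)` step used above divides each row by its own total. Recent versions of scikit-learn (0.22 and later) can do the same thing for you with `normalize="true"`. A sketch on invented labels, checking that the two approaches give the same answer:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels: four actual negatives, four actual positives
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# By hand: divide every row by its own total
by_hand = confusion_matrix(y_true_demo, y_pred_demo)
by_hand = by_hand / by_hand.sum(axis=1, keepdims=True)

# Built in: normalize over the true (row) labels
built_in = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(built_in)
print(np.allclose(by_hand, built_in))
```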


## Review

If you find yourself unsatisfied with a tool, you can try to build your own! This is exactly what we tried to do, using the **Sentiment140 dataset** and several machine learning algorithms.

Sentiment140 is a database of tweets that come pre-labeled with positive or negative sentiment, assigned automatically by presence of a `:)` or `:(`.  Our first step was using a **vectorizer** to convert the tweets into numbers a computer could understand.

After that, we built four different **models** using different machine learning algorithms. Each one was fed a list of each tweet's **features** - the words - and each tweet's **label** - the sentiment - in the hopes that later it could predict labels if given new tweets. This process of teaching the algorithm is called **training**.

In order to test our algorithms, we split our data into sections - **train** and **test** datasets. You teach the algorithm with the first group, and then ask it for predictions on the second set. You can then compare its predictions to the right answers using a **confusion matrix**.

Although **different algorithms took different amounts of time to train**, they all ended up with roughly 74-75% accuracy on the test set.
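
Where does the accuracy figure come from? Accuracy is just the correct predictions (the diagonal of the confusion matrix) divided by all predictions. This sketch redoes that arithmetic with the counts copied from the count-based confusion matrices above.

```python
# Counts copied from the confusion matrices above, in the order
# (true negatives, false positives, false negatives, true positives)
counts = {
    "logistic regression": (2722, 974, 886, 2918),
    "random forest": (2764, 932, 1040, 2764),
    "linear SVC": (2719, 977, 882, 2922),
    "naive Bayes": (2772, 924, 983, 2821),
}

accuracy = {}
for name, (tn, fp, fn, tp) in counts.items():
    # Correct answers divided by all answers
    accuracy[name] = (tn + tp) / (tn + fp + fn + tp)
    print(f"{name}: {accuracy[name]:.1%}")
```

Since `train_test_split` shuffles randomly, rerunning the notebook will give slightly different counts and slightly different accuracies.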

## Discussion topics

* Which models performed the best? Were there big differences?
* Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?
* They all had very different training times. Which ones offer the best combination of performance and not making you wait around for an hour?
* If you have a decent algorithm that trains more quickly, what could that mean about feature selection or the size of your training set? Why did we use `max_features=` and `nrows=`?
* Is 75% accuracy good?
* Do your feelings change if the performance is described as "incorrect one out of every four times?"
* What would your accuracy be for a random guess?
* How do you feel about sentiment analysis?
* What would you feel comfortable using our sentiment classifier for?

# Sentiment Analysis using Libraries

Source: https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/sentiment-analysis-is-bad/notebooks/Comparing%20sentiment%20analysis%20tools.ipynb#scrollTo=JVOidBrYXS3r

# NLTK: Natural Language Toolkit

[Natural Language Toolkit](https://www.nltk.org/) is the basis for a lot of text analysis done in Python. It's old and terrible and slow, but it's just been used for so long and does so many things that it's generally the default when people get into text analysis. The new kid on the block is [spaCy](https://spacy.io/) (but it doesn't do sentiment analysis out of the box so we're leaving it out of this).

When you first run NLTK, you need to download some datasets to make sure it will be able to do everything you want.

In [37]:
import nltk
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

To do sentiment analysis with NLTK, it only takes a couple lines of code. To determine sentiment, it's using a tool called **VADER**.

In [38]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
sia.polarity_scores("This restaurant was great, but I'm not sure if I'll go there again.")



{'compound': 0.0276, 'neg': 0.153, 'neu': 0.688, 'pos': 0.159}

Asking `SentimentIntensityAnalyzer` for the `polarity_scores` gave us four values in a dictionary:

- **negative (`neg`):** the negative sentiment in the sentence
- **neutral (`neu`):** the neutral sentiment in the sentence
- **positive (`pos`):** the positive sentiment in the sentence
- **compound:** the aggregated sentiment, from -1 (most negative) to +1 (most positive)
    
Seems simple enough!

In [39]:
text = "I just got a call from my boss - does he realise it's Saturday?"
sia.polarity_scores(text)

{'compound': 0.0, 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}

Just like in real life, if you use an emoticon you can be read as being more positive:

In [40]:
text = "I just got a call from my boss - does he realise it's Saturday? :)"
sia.polarity_scores(text)

{'compound': 0.4588, 'neg': 0.0, 'neu': 0.786, 'pos': 0.214}

But what if we swap out the emoticon for an emoji?

In [41]:
text = "I just got a call from my boss - does he realise it's Saturday? 😊"
sia.polarity_scores(text)

{'compound': 0.0, 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}

Back to neutral! Why didn't it understand the emoji the same way it understood the emoticon? Well, **text analysis tools only know the words that they've been taught,** and if VADER's never seen 😊 before it won't know what to think of it.

# TextBlob

TextBlob is built on top of NLTK, but is infinitely easier to use. It's still slow, but _it's so so so easy to use_. 

You can just feed TextBlob your sentence, then ask for a `.sentiment`!

In [42]:
from textblob import TextBlob
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

In [43]:
blob = TextBlob("This restaurant was great, but I'm not sure if I'll go there again.")
blob.sentiment

Sentiment(polarity=0.275, subjectivity=0.8194444444444444)

**How could it possibly be easier than that?!?!?** This time we get a `polarity` and a `subjectivity` instead of all of those different scores, but it's basically the same idea.

If you like options: it turns out TextBlob actually has multiple sentiment analysis tools! How fun! We can plug in a different analyzer to get a different result.

In [44]:
blobber = Blobber(analyzer=NaiveBayesAnalyzer())

blob = blobber("This restaurant was great, but I'm not sure if I'll go there again.")
blob.sentiment

Sentiment(classification='pos', p_pos=0.5879425317005774, p_neg=0.41205746829942275)

Wow, that's a **very different result.** To understand why it's so different, we need to talk about where these sentiment numbers come from.

# But where do those numbers come from?

The most important thing to understand is **sentiment is always just an opinion.** In this case it's an opinion, yes, but specifically **the opinion of a machine.**

## VADER

NLTK's Sentiment Intensity Analyzer works by using something called **VADER**, which is a list of words that have a sentiment associated with each of them.

|Word|Sentiment rating|
|---|---|
|tragedy|-3.4|
|rejoiced|2.0|
|disaster|-3.1|
|great|3.1|

If you have more positives, the sentence is more positive. If you have more negatives, it's more negative. It can also take into account things like capitalization - you can read more about the classifier [here](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html), or the actual paper it came out of [here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

**How do they know what's positive/negative?** They came up with a very big list of words, then asked people on the internet to score them, paying one cent for each word scored.
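
To make the word-list idea concrete, here is a toy scorer built from only the four words in the table above. It is a heavily simplified sketch, not VADER itself: the real tool has thousands of words plus extra rules for things like negation, capitalization and punctuation. The `toy_compound` function is made up for this example, and the squashing formula with `alpha=15` is my understanding of how VADER turns a raw total into a score between -1 and 1, so treat it as an assumption.

```python
import math

# Ratings copied from the table above; the real lexicon is far bigger
toy_lexicon = {"tragedy": -3.4, "rejoiced": 2.0, "disaster": -3.1, "great": 3.1}


def toy_compound(text, alpha=15):
    """Add up the ratings of known words, then squash the total into -1..1."""
    words = text.lower().split()
    # Unknown words count as 0, which is why they can't move the score
    total = sum(toy_lexicon.get(word.strip(".,!?"), 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


print(toy_compound("What a great day"))
print(toy_compound("A tragedy and a disaster"))
print(toy_compound("Did you see the baseball game yesterday?"))
```

The last sentence scores exactly zero because none of its words are in the list.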

## TextBlob's `.sentiment`

TextBlob's sentiment analysis is based on a separate library called [pattern](https://www.clips.uantwerpen.be/pattern).

> The sentiment analysis lexicon bundled in Pattern focuses on adjectives. It contains adjectives that occur frequently in customer reviews, hand-tagged with values for polarity and subjectivity.

Same kind of thing as NLTK's VADER, but it specifically looks at words from customer reviews.

**How do they know what's positive/negative?** They look at (mostly) adjectives that occur in customer reviews and hand-tag them.

## TextBlob's `.sentiment` + NaiveBayesAnalyzer

TextBlob's other option uses a `NaiveBayesAnalyzer`, which is a machine learning technique. When you use this option with TextBlob, the sentiment is coming from "an NLTK classifier trained on a movie reviews corpus."

**How do they know what's positive/negative?** By looking at movie reviews and their scores, the computer _automatically learned_ which words are associated with a positive or negative rating.
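
Here is a sketch of that "learn it from labelled reviews" idea. It is not TextBlob's actual code: it swaps in scikit-learn's `MultinomialNB`, and the four "reviews" are invented and far too few to learn anything real. The point is only that nobody hand-scores the words. The model works out which words go with which label from the examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four invented reviews with hand-assigned labels
reviews = [
    "a wonderful moving film",
    "wonderful acting and a great story",
    "a dull boring film",
    "boring story and terrible acting",
]
labels = ["pos", "pos", "neg", "neg"]

# Count the words, then let naive Bayes learn from the counts and labels
toy_counts = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_counts.fit_transform(reviews), labels)

# Score a review the model has never seen
new_review = toy_counts.transform(["a wonderful story"])
print(toy_nb.classes_)
print(toy_nb.predict(new_review))
print(toy_nb.predict_proba(new_review).round(3))
```

"wonderful" only ever appeared in positive reviews, so it pulls the prediction toward `pos`, while "story" appeared once in each and doesn't help either side.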

## What's this mean for me?

When you're doing sentiment analysis with tools like this, you should have a few major questions: 

* What kind of dataset does the list of known words come from?
* Do they use all the words, or a selection of the words?
* Where do the positive/negative scores come from?

Let's compare the tools we've used so far.

|technique|word source|word selection|scores|
|---|---|---|---|
|NLTK (VADER)|everywhere|hand-picked|internet people, word-by-word|
|TextBlob|product reviews|hand-picked, mostly adjectives|internet people, word-by-word|
|TextBlob + NaiveBayesAnalyzer|movie reviews|all words|automatic based on score|

A major thing that should jump out at you is **how different the sources are.**

While VADER focuses on content found everywhere, TextBlob's two options are specific to certain domains. The [original paper for VADER](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) passive-aggressively noted that VADER is effective at general use, but being trained on a specific domain can have benefits: 

> While some algorithms performed decently on test data from the specific domain for which it was expressly trained, they do not significantly outstrip the simple model we use.

They're basically saying, "if you train a model on words from a certain field, it will be good at sentiment in that certain field."

# Analyzing differences in sentiment analysis tools

Because they're built differently, sentiment analysis tools don't always agree. Let's take a set of sentences and compare each analyzer's understanding of them.

In [45]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)

df = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})
df

Unnamed: 0,content
0,I love love love love this kitten
1,I hate hate hate hate this keyboard
2,I'm not sure how I feel about toast
3,Did you see the baseball game yesterday?
4,The package was delivered late and the contents were broken
5,Trashy television shows are some of my favorites
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
7,"I find chirping birds irritating, but I know I'm not the only one"


In [46]:
def get_scores(content):
    blob = TextBlob(content)
    nb_blob = blobber(content)
    sia_scores = sia.polarity_scores(content)
    
    return pd.Series({
        'content': content,
        'textblob': blob.sentiment.polarity,
        'textblob_bayes': nb_blob.sentiment.p_pos - nb_blob.sentiment.p_neg,
        'nltk': sia_scores['compound'],
    })

scores = df.content.apply(get_scores)
scores.style.background_gradient(cmap='RdYlGn', axis=None, low=0.4, high=0.4)

Unnamed: 0,content,textblob,textblob_bayes,nltk
0,I love love love love this kitten,0.5,-0.087933,0.9571
1,I hate hate hate hate this keyboard,-0.8,-0.214151,-0.9413
2,I'm not sure how I feel about toast,-0.25,0.394659,-0.2411
3,Did you see the baseball game yesterday?,-0.4,0.61305,0.0
4,The package was delivered late and the contents were broken,-0.35,-0.57427,-0.4767
5,Trashy television shows are some of my favorites,0.0,0.040076,0.4215
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",0.8,0.717875,-0.6296
7,"I find chirping birds irritating, but I know I'm not the only one",-0.2,0.257148,-0.25


Wow, those really don't agree with one another! Which one do you agree with the most? Did it get everything "right?"
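
One rough way to put a number on the disagreement is to check, sentence by sentence, whether all three tools at least agree on the direction (positive, negative, or exactly zero). This sketch uses the scores copied from the table above. The `sign` helper is made up for the example, and treating exactly `0.0` as its own category is a choice you might make differently.

```python
# Scores copied from the table above: (textblob, textblob_bayes, nltk)
rows = [
    (0.5, -0.087933, 0.9571),
    (-0.8, -0.214151, -0.9413),
    (-0.25, 0.394659, -0.2411),
    (-0.4, 0.61305, 0.0),
    (-0.35, -0.57427, -0.4767),
    (0.0, 0.040076, 0.4215),
    (0.8, 0.717875, -0.6296),
    (-0.2, 0.257148, -0.25),
]


def sign(value):
    """Return 1 for positive, -1 for negative, 0 for exactly zero."""
    return (value > 0) - (value < 0)


# A row "agrees" when every tool gives the same sign
agreeing = [i for i, row in enumerate(rows) if len({sign(v) for v in row}) == 1]

print(agreeing)
print(f"{len(agreeing)} of {len(rows)} sentences")
```

Only the "hate hate hate" keyboard and the broken package get the same direction from all three tools.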

While it seemed like magic to be able to plug a sentence into a sentiment analyzer and get a result back... maybe things aren't as magical as we thought.

# Review

**Sentiment analysis** is judging whether a piece of text has positive or negative emotion. We covered several tools for doing automatic sentiment analysis: **NLTK**, and two techniques inside of **TextBlob**.

Each tool uses different data to determine what is positive and negative, and while some use **humans** to flag things as positive or negative, others use automatic **machine learning**.

As a result of these differences, each tool can come up with very **different sentiment scores** for the same piece of text.

# Discussion topics

The first questions are about whether an analyzer can be applied in situations other than where it was trained. Among other things, you'll want to think about whether the language it was trained on is similar to the language you're using it on.

**Is it okay to use a sentiment analyzer built on product reviews to check the sentiment of tweets?** How about to check the sentiment of wine reviews?

**Is it okay to use a sentiment analyzer trained on everything to check the sentiment of tweets?** How about to check the sentiment of wine reviews?

**Let's say it's a night of political debates.** If I'm trying to report on whether people generally like or dislike what is happening throughout the debates, could I use these sorts of tools on tweets?

We're using the incredibly vague word "okay" on purpose, as there are varying levels of comfort depending on your situation. Are you doing this for preliminary research? Are you publishing the results in a journal, in a newspaper, in a report at work, in a public policy recommendation? What if I tell you that the ideal of "I'd only use a sentiment analysis tool trained exactly for my specific domain" is both _rare and impractical?_

As we saw in the last section, **these tools don't always agree with one another, which might be problematic.**

* What might make them agree or disagree?
* Do we think one is the "best?"
* Can you think of any ways to test which one is the 'best' for our purposes?