# Designing your own sentiment analysis tool

While there are a lot of tools that will automatically give us the sentiment of a piece of text, we learned that they don't always agree! Let's design our own to see both how these tools work internally and how we can test them to see how well they perform.

## Training on tweets

Let's say we were going to analyze the sentiment of tweets. **If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.** We wouldn't need VADER or pattern or anything like that. We'd be able to _know_ we had a good dataset!

Luckily, we have **Sentiment140** - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score as to whether they're negative or positive. We'll use it to build our own machine learning algorithm to separate positivity from negativity.

I'm providing **sentiment140-subset.csv** for you: a _cleaned_ subset of Sentiment140 data. It contains half a million tweets marked as positive or negative.

### Read in our data

Read in `sentiment140-subset.csv` and take a look at it.

In [1]:
import pandas as pd
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_columns", 200)

# Read in your dataset
df = pd.read_csv("sentiment140-subset.csv")
df.head()



Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


The subset is originally 500,000 tweets, but we don't have all the time in the world! I'm going to cut it down to 3,000 instead. **Be sure you run this code, or else you might be stuck training your language models for a very long time!**

In [2]:
# In theory we would like a sample of 3000 random tweets, which you
# can do with this code:
# df = df.sample(3000)
# the problem is I'd like to say things later about specific
# tweets, so I'm going to force us to keep the first 3000 instead
df = df[:3000]

It isn't a very complicated dataset. `polarity` is whether the tweet is positive (`1`) or negative (`0`), and `text` is the text of the tweet itself.

How many rows do we have? **Make sure it's 3,000.**

In [3]:
df.shape

(3000, 2)

How many **positive** tweets compared to how many **negative** tweets?

In [7]:
df.polarity.value_counts()

0    1504
1    1496
Name: polarity, dtype: int64

## Train our model

To build our model, we're going to use a machine learning library called [scikit-learn](https://scikit-learn.org/stable/). It's a "classical" machine learning library, which means it isn't the "this is a black-box neural network doing magic that we don't understand" kind of machine learning. We'll be able to easily look inside.

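You can install it with `pip install scikit-learn`. (The `sklearn` name on PyPI, used in the cell below, is just a placeholder package that pulls in `scikit-learn`, and it has since been deprecated.)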

> This section is going to be a lot of cut and paste/just running code I've already put together (and maybe tweaking it a little). We'll get deeper into sklearn as we go forward in our machine learning journey!

In [8]:
!pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp38-cp38-macosx_10_13_x86_64.whl (7.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.2.0-py3-none-any.whl (12 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.7.1-cp38-cp38-macosx_10_9_x86_64.whl (32.6 MB)
Using legacy 'setup.py install' for sklearn, since package 'wheel' is not installed.
Installing collected packages: threadpoolctl, scipy, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed scikit-learn-0.24.2 scipy-1.7.1 sklearn-0.0 threadpoolctl-2.2.0
You should consider upgrading via the '/Users/nisha/.pyenv/versions/3.8.10/bin/python -m pip install --upgrade pip' comma

### Counting words

Remember how we could just make a word cloud and call it a language model? We're going to do the same thing here! It's specifically going to be a **bag of words** model, where we don't care about the order that words are in.

It's also going to do a little trick that makes **less common words more meaningful.** This makes common words like `the` and `a` fade away in importance. Technically speaking this "little trick" is called TF-IDF (term-frequency inverse-document-frequency), but all you need to know is "the more common a word is, the less we'll pay attention to it."
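To see the "fade away" effect in action, here's a minimal sketch on three made-up sentences (the sentences and variable names are invented for illustration, they aren't from Sentiment140). scikit-learn keeps a "how rare is this word" weight for each word in `idf_`: the lower the number, the less attention the word gets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented sentences: "the" shows up in all of them, "kitten" in only one
toy_tweets = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the kitten is adorable",
]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_tweets)

# vocabulary_ maps each word to its column number, idf_ holds the weight per column
idf_by_word = {
    word: toy_vectorizer.idf_[index]
    for word, index in toy_vectorizer.vocabulary_.items()
}

print("the:", idf_by_word["the"])
print("kitten:", idf_by_word["kitten"])
```

Because `the` appears in every sentence it ends up with a smaller weight than `kitten`, which only appears once.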

The code below creates a `TfidfVectorizer` – a fancy word counter – and uses it to convert our tweets into word counts.

**Since we don't have all the time and energy in the world and want to keep our CO2 to a minimum,** let's only take a selection of words. We can use `max_features` to only take the most common words - let's try the top 1000 for now.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()

Unnamed: 0,10,100,11,12,15,16,1st,20,2nd,30,able,about,accident,account,actually,after,afternoon,again,ago,agree,ah,ahhh,air,airport,alive,all,allowed,almost,alone,already,alright,also,always,am,amazing,amp,an,and,annoying,another,answer,any,anymore,anyone,anything,anyway,apartment,apparently,apple,appreciate,are,area,aren,arent,argh,army,around,arrived,art,article,as,ask,asked,asleep,ass,at,ate,august,austin,available,avatar,aw,awake,awards,away,awesome,aww,awww,awwww,baby,back,background,bad,ball,band,baseball,be,beach,beautiful,because,bed,been,before,being,believe,best,better,big,bike,birthday,...,using,ve,version,very,via,video,videos,visit,voice,vote,vs,wait,waiting,wake,walk,walking,wanna,want,wanted,wants,was,wasn,wasnt,watch,watched,watching,water,way,we,wear,weather,wedding,wednesday,week,weekend,weeks,weird,welcome,well,went,were,what,whats,when,where,which,while,white,who,whole,why,wife,will,win,windows,wine,wish,wishing,with,without,woke,won,wonder,wont,woo,woot,words,work,working,works,world,worry,worse,worst,would,wow,write,writing,wrong,wtf,www,xd,xx,xxx,xxxx,ya,yay,yea,yeah,year,years,yes,yesterday,yet,yo,you,your,youre,youtube,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.328873,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.224692,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.439415,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.199659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.429504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.340996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Each word (or token, as we learned!) gets a column, and each tweet gets a row. A zero means the word did not show up in the tweet, while any other number means it did. A score of `1.0` means it's the only word in the tweet (or the only word that the language model is paying attention to).

For example, you see `0.429504` under `10` for the fourth tweet. That means `10` was a pretty important word in the fourth tweet! In the same vein, if you scroll to the far far right you can see our first tweet got a score of `0.328873` under `you`.

Tweets aren't very long so you usually have only a handful of non-zero values for each row. If each row was a book with a lot of words, you'd have lower values spread out across all of the words.
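Scrolling through a thousand columns of zeros isn't much fun, so here's a small sketch of how you might pull out just the non-zero scores for a single row. It uses three invented tweets and made-up variable names (`toy_tweets`, `toy_words_df`) rather than our real data, but the same filtering trick should work on `words_df`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented tweets, one of which is a single word
toy_tweets = ["yay", "so tired today", "today was so so good"]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_tweets)

# Column names in column order (vocabulary_ maps word -> column number)
toy_columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_words_df = pd.DataFrame(toy_vectors.toarray(), columns=toy_columns)

# Keep only the words that actually appear in each tweet
first_row = toy_words_df.iloc[0]
second_row = toy_words_df.iloc[1]
print(first_row[first_row > 0])
print(second_row[second_row > 0])
```

The one-word tweet gets a single score of `1.0`, while the three-word tweet splits its attention across three smaller scores.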

### Checking our word list

Use `vectorizer.get_feature_names()` to look at the words that were chosen. (In scikit-learn 1.0 and later this method is called `get_feature_names_out()`.) Do you have any thoughts or feelings about this list?

In [11]:
vectorizer.get_feature_names()

['10',
 '100',
 '11',
 '12',
 '15',
 '16',
 '1st',
 '20',
 '2nd',
 '30',
 'able',
 'about',
 'accident',
 'account',
 'actually',
 'after',
 'afternoon',
 'again',
 'ago',
 'agree',
 'ah',
 'ahhh',
 'air',
 'airport',
 'alive',
 'all',
 'allowed',
 'almost',
 'alone',
 'already',
 'alright',
 'also',
 'always',
 'am',
 'amazing',
 'amp',
 'an',
 'and',
 'annoying',
 'another',
 'answer',
 'any',
 'anymore',
 'anyone',
 'anything',
 'anyway',
 'apartment',
 'apparently',
 'apple',
 'appreciate',
 'are',
 'area',
 'aren',
 'arent',
 'argh',
 'army',
 'around',
 'arrived',
 'art',
 'article',
 'as',
 'ask',
 'asked',
 'asleep',
 'ass',
 'at',
 'ate',
 'august',
 'austin',
 'available',
 'avatar',
 'aw',
 'awake',
 'awards',
 'away',
 'awesome',
 'aww',
 'awww',
 'awwww',
 'baby',
 'back',
 'background',
 'bad',
 'ball',
 'band',
 'baseball',
 'be',
 'beach',
 'beautiful',
 'because',
 'bed',
 'been',
 'before',
 'being',
 'believe',
 'best',
 'better',
 'big',
 'bike',
 'birthday',
 'bit'

<i>The tweets seem fairly benign because most words are pretty neutral.</i>

### Setting up our variables and training a language model

Now we'll use our word counts to build a language model that can do sentiment analysis! Because we want to fit in with all the other programmers who use machine learning, we need to create two variables: one called `X` and one called `y`.

`X` is our **features**, the things we use to predict positive or negative. In this case, it's going to be our words. We'll be using words to predict whether a tweet is positive or negative.

`y` is our **labels**, the positive or negative rating that we want to predict. We'll use the `polarity` column for that.

In [12]:
X = words_df
y = df.polarity

### Picking an architecture

We talked about picking an **architecture** in class. To a large degree, a model (language model, vision model, etc) is a combination of an architecture, a dataset, and a handful of other choices. The models we talked about in class were mostly "neural nets" that had components like "bidirectional masking" and other buzzwords we couldn't understand. It's the exact same thing for classical machine learning!

So what kind of architecture do we want? Who knows, we don't know anything about machine learning! **Let's just pick ALL OF THEM.**

> **Sidenote:** Blindly picking multiple architectures and seeing which one performs the best is a completely valid thing to do in data science. To a large degree, it's a lot of "if it works, it works! who cares why?"

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

### Training our language models

When we teach our language model about what a positive or a negative tweet looks like, this is called **training**. Training can take different amounts of time based on what kind of algorithm you are using.

For the scikit-learn library, you use `.fit(X, y)` to teach a model how to predict the labels (`y`: positive, negative) from the features (`X`: the word usage).
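Since every scikit-learn model shares the same `.fit(X, y)` pattern, trying several architectures can be as simple as a loop. Here's a rough sketch on six invented mini-tweets (the texts, labels and variable names are all made up for illustration). With this little data the predictions don't mean much, the point is only that the same two lines work for every model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented training data: 1 = positive, 0 = negative
toy_texts = ["love this", "great day", "so happy", "hate this", "awful day", "so sad"]
toy_labels = [1, 1, 1, 0, 0, 0]

# Features: the TF-IDF scores for each text
toy_X = TfidfVectorizer().fit_transform(toy_texts)

toy_models = {
    "logreg": LogisticRegression(),
    "svc": LinearSVC(),
    "bayes": MultinomialNB(),
}

toy_predictions = {}
for name, model in toy_models.items():
    model.fit(toy_X, toy_labels)                  # training
    toy_predictions[name] = model.predict(toy_X)  # one 0/1 guess per text
    print(name, toy_predictions[name])
```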

In [14]:
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)

CPU times: user 3.7 s, sys: 83.4 ms, total: 3.79 s
Wall time: 1.96 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1000000000.0, max_iter=1000)

In [15]:
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)

CPU times: user 945 ms, sys: 12.9 ms, total: 958 ms
Wall time: 978 ms


RandomForestClassifier(n_estimators=50)

In [16]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 32.7 ms, sys: 1.61 ms, total: 34.3 ms
Wall time: 38.7 ms


LinearSVC()

In [17]:
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)

CPU times: user 20.2 ms, sys: 2.44 ms, total: 22.6 ms
Wall time: 22.5 ms


MultinomialNB()

**How long did each take to train?** Were any much faster than others? While we didn't fly any planes across the ocean to build these, at the very least a model that takes a long time to train can be *annoying*.

<i>
<br> Logistic regression took 1.96 seconds - this was the slowest
<br> RandomForestClassifier took 978 milliseconds
<br> LinearSVC took 38.7 milliseconds
<br> MultinomialNB took 22.5 milliseconds - this was the fastest
</i>

## Use our models

Now that we've trained our language models, **we can use them to predict whether some text is positive or negative**.

### Preparing the data

I started us off, but **add a few more sentences below.** They should be a mix of positive and negative. They can be boring, they can be exciting, they can be short, they can be long. Honestly, you could paste a book in there if you were dedicated enough.

In [40]:
# Create some test data
unknown = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
    "I'm going to miss you so much when you're gone",
    "I had a marvellous time",
    "You must be fun at parties",
    "It's Monday and I'm already tired",
    "Chocolate is my comfort food",
    "This documentary is both fascinating and creepy.",
    "Ridiculous, I'm speechless",
    "Thank you for coming to my TED talk"    
]})
unknown

Unnamed: 0,content
0,I love love love love this kitten
1,I hate hate hate hate this keyboard
2,I'm not sure how I feel about toast
3,Did you see the baseball game yesterday?
4,The package was delivered late and the contents were broken
5,Trashy television shows are some of my favorites
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it."
7,"I find chirping birds irritating, but I know I'm not the only one"
8,I'm going to miss you so much when you're gone
9,I had a marvellous time


First we need to **vectorize** our new sentences into numbers, so the language model can understand them. In this case, we're doing the fancy word counting we talked about before.

Our algorithm only knows **certain words.** It learned them when we were training it! Run `vectorizer.get_feature_names()` to remind yourself of the words the vectorizer knows.

In [41]:
vectorizer.get_feature_names()


Run the code below to complete `unknown_words_df`, the word counts for all of the texts we wrote above.

> When I say "word counts" I mean "TF-IDF word counts that are word counts but adjusted in a very specific way to make more common words less important" (but you knew that already!)

It **only counts words that were in the training data**, because those are the only words it can understand as being positive or negative. Any new or unknown words will be thrown out!
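Here's a tiny sketch of what "thrown out" looks like in practice. The two training sentences and the variable names are invented for illustration: a vectorizer that has only ever seen those sentences gives a sentence of brand-new words a row of nothing but zeros.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer only learns the words in these two invented sentences
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["i love my kitten", "i hate my keyboard"])

# "love" was in the training text, "marvellous" and "toast" were not
known = toy_vectorizer.transform(["love love love"])
unseen = toy_vectorizer.transform(["marvellous toast"])

# .nnz is the number of non-zero scores in the result
print("known words:", known.nnz)
print("unseen words:", unseen.nnz)
```

A row of all zeros gives the model nothing to work with, so its prediction for that sentence is essentially a default guess.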

In [42]:
# Put it through the vectorizer
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names())
unknown_words_df.head()

Unnamed: 0,10,100,11,12,15,16,1st,20,2nd,30,able,about,accident,account,actually,after,afternoon,again,ago,agree,ah,ahhh,air,airport,alive,all,allowed,almost,alone,already,alright,also,always,am,amazing,amp,an,and,annoying,another,answer,any,anymore,anyone,anything,anyway,apartment,apparently,apple,appreciate,are,area,aren,arent,argh,army,around,arrived,art,article,as,ask,asked,asleep,ass,at,ate,august,austin,available,avatar,aw,awake,awards,away,awesome,aww,awww,awwww,baby,back,background,bad,ball,band,baseball,be,beach,beautiful,because,bed,been,before,being,believe,best,better,big,bike,birthday,...,using,ve,version,very,via,video,videos,visit,voice,vote,vs,wait,waiting,wake,walk,walking,wanna,want,wanted,wants,was,wasn,wasnt,watch,watched,watching,water,way,we,wear,weather,wedding,wednesday,week,weekend,weeks,weird,welcome,well,went,were,what,whats,when,where,which,while,white,who,whole,why,wife,will,win,windows,wine,wish,wishing,with,without,woke,won,wonder,wont,woo,woot,words,work,working,works,world,worry,worse,worst,would,wow,write,writing,wrong,wtf,www,xd,xx,xxx,xxxx,ya,yay,yea,yeah,year,years,yes,yesterday,yet,yo,you,your,youre,youtube,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.412292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.532476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.453394,0.0,0.0,0.207613,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.261463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.371519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.512289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Notice how it only has 1,000 columns: those are the 1,000 features (words) that we told our model to pay attention to.

Now that we've counted the words for the sentences of unknown sentiment, **we can use our model to make predictions about whether they're positive or negative.**

### Predicting with our models

To make a prediction for each of these new, unknown-sentiment sentences, we can use `.predict` with each of our models. For example, it would look like this for logistic regression:

```python
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
```

To add the prediction for the "random forest," we'd run similar `forest.predict` code, which will give you a `0` (negative) or a `1` (positive).

#### But: probabilities!

**We don't always want just a `0` or a `1`, though**. That "YES IT'S POSITIVE" or "NO, IT'S NEGATIVE" energy is very forceful but not always appropriate: sometimes a sentence is just *kind of* positive or there's just a *little bit of a chance* that it's negative, and we're interested in the *degree*.

To know the *chance* that something is positive, we can use this code:

```python
unknown['pred_logreg_prob'] = logreg.predict_proba(unknown_words_df)[:,1]
```

**Add these new columns for each of the models you trained** - `logreg`, `forest`, `svc` and `bayes`. Everything except for LinearSVC can also do `.predict_proba`, so you should add those values as columns as well.

* **Tip:** Tab is helpful for knowing whether `.predict_proba` is an option for a given model.
* **Tip:** Don't forget the `[:,1]` after `.predict_proba`! It means "give me the probability that it's category `1` (aka positive)."
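If the `[:,1]` feels mysterious, this small sketch shows what `.predict_proba` returns before you slice it. The numbers are invented (a single made-up feature instead of a thousand word scores), and `toy_model` is just an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: low feature values are negative (0), high ones positive (1)
toy_X = np.array([[0.0], [0.2], [0.8], [1.0]])
toy_y = np.array([0, 0, 1, 1])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

probabilities = toy_model.predict_proba(np.array([[0.1], [0.9]]))

print(toy_model.classes_)    # the order of the columns below
print(probabilities)         # one row per input, one column per class
print(probabilities[:, 1])   # just the "it's positive" column
```

Each row has two numbers that add up to 1: the chance of category `0` and the chance of category `1`. `LinearSVC` doesn't have `.predict_proba`; it has `.decision_function` instead, which gives a signed distance from the dividing line rather than a probability.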

In [43]:
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_prob'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions (doesn't support probabilities)
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_prob'] = bayes.predict_proba(unknown_words_df)[:,1]

Once you're done making your predictions, **let's look at the results!**

In [44]:
unknown

Unnamed: 0,content,pred_logreg,pred_logreg_proba,pred_forest,pred_forest_prob,pred_svc,pred_bayes,pred_bayes_prob
0,I love love love love this kitten,1,0.9973705,1,0.965361,1,1,0.764178
1,I hate hate hate hate this keyboard,0,0.0002454401,0,0.061246,0,0,0.159454
2,I'm not sure how I feel about toast,0,0.4809118,0,0.3,0,0,0.491845
3,Did you see the baseball game yesterday?,1,0.9969174,1,0.76,1,1,0.560423
4,The package was delivered late and the contents were broken,0,0.01487959,1,0.58,0,0,0.3258
5,Trashy television shows are some of my favorites,0,0.3909988,0,0.34,0,1,0.50935
6,"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",1,0.9999924,0,0.44,1,1,0.668624
7,"I find chirping birds irritating, but I know I'm not the only one",0,0.2167344,0,0.34,0,0,0.328348
8,I'm going to miss you so much when you're gone,0,1.793904e-05,0,0.26,0,0,0.346803
9,I had a marvellous time,1,0.5576659,1,0.712228,1,1,0.538495


### Questions

**What do the numbers mean?** What's the difference between a 0 and a 1? A 0.5? (I don't *think* you should have any negative numbers)

<i>0 means negative. 1 means positive. 0.5 means neutral.</i>

**Were there any sentences that the language models seemed to disagree about?** How do you feel about the amount they disagree? Do any of the disagreements make you think specific models are useless/super smart?

<i>
<br> The package was delivered late and the contents were broken - Strangely, Forest thought this was positive.
<br> Trashy television shows are some of my favorites - Only Bayes realized that this was meant to be positive.
<br> I'm seeing a Kubrick film tomorrow, I hear not so great things about it. - Only Forest figured out this is negative.
<br> Ridiculous, I'm speechless - Forest missed the mark here, thinking this could be positive
<br>
<br> They didn't disagree too much, but I thought Bayes was more right than wrong.
</i>

**What's the difference between using a simple 0/1 to talk about sentiment compared to the range between 0-1?** When might you use one or the other?

<i>0/1 should be used for simple sentiment analysis while 0-1 should be used if you need to get a more accurate assessment and the subject is more serious.</i>

**Between 0-1, what range do you think counts as "negative," "positive" and "neutral"?** For example, are things positive as soon as you hit 0.5? Or does it take getting to 0.7 or 0.8 or 0.95 to really be able to call something "positive"?

<i>I think once it hits 0.5, that means it could be called "positive."</i>
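If you want to experiment with different cut-offs, one option is to turn the probabilities into labels yourself. This is only a sketch: the sentences and probabilities are invented, and the 0.4 and 0.6 boundaries are an arbitrary assumption, not a recommendation.

```python
import pandas as pd

# Invented probabilities, standing in for a column like pred_logreg_prob
toy_predictions = pd.DataFrame({
    "content": ["sentence a", "sentence b", "sentence c", "sentence d"],
    "prob_positive": [0.05, 0.45, 0.55, 0.97],
})

# Hypothetical cut-offs: up to 0.4 is negative, above 0.6 is positive,
# and anything in between is treated as neutral/unsure
toy_predictions["label"] = pd.cut(
    toy_predictions["prob_positive"],
    bins=[0.0, 0.4, 0.6, 1.0],
    labels=["negative", "neutral", "positive"],
    include_lowest=True,
)

print(toy_predictions)
```

Changing the numbers in `bins` is a quick way to see how many sentences flip from one label to another.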

## Testing our models

Instead of talking about our *feelings* about which model is our favorite, **we can actually test our language models to see which performs the best!** Our metrics aren't going to end up on [paperswithcode.com](https://paperswithcode.com/) but they'll be good enough for us.

Remember our original tweets, the ones we used to train our models? We were able to teach our model what a positive and a negative tweet was because each tweet was marked as positive or negative.

To see how good our model is, we can give each model a known tweet and say "is this positive or negative?" Then we'll compare the result to what's in our dataset. If the tweet was positive, did it predict positive?

In [45]:
# Let's remind ourselves what our data looks like
df.head()

Unnamed: 0,polarity,text
0,0,@kconsidder You never tweet
1,0,Sick today coding from the couch.
2,1,"@ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that"
3,1,Wii fit says I've lost 10 pounds since last time
4,0,@MrKinetik Not a thing!!! I don't really have a life.....


Our original dataframe is a list of many, many tweets. We turned this into `X` - vectorized words - and `y` - whether the tweet is negative or positive.

Earlier, we used `.fit(X, y)` to train each model on all of our data, so we have these wonderful pre-trained models now. **But if we're testing our language model on a tweet it's already seen, isn't that kind of like cheating?** It already knows the answer!

Instead, we'll give our models 75% of our tweets as training data to learn from, and then keep 25% separate to quiz them on later (that's the default split `train_test_split` uses when you don't ask for a specific size). It's like when a teacher gives you a study guide that's *similar* to what will be on the test, but not *exactly* the same.

This is called a **train-test split**, and you always use the exact same code to do it. Yes, the models would be smarter if we gave them all of the data, but then we wouldn't be able to test them!

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
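
If you want to control the split yourself, `train_test_split` takes a few optional arguments. The sketch below uses made-up stand-in data (`X_demo`, `y_demo` are hypothetical names, not the notebook's `X` and `y`) to show the default 25% test set next to an explicit 80/20 split. The `random_state=42` value is an arbitrary assumption that just makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 3,000 rows, half labeled 0 and half labeled 1
X_demo = np.arange(6000).reshape(3000, 2)
y_demo = np.array([0, 1] * 1500)

# Default behavior: 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))

# Explicit 80/20 split that is repeatable (random_state) and keeps the
# same negative/positive balance in both halves (stratify)
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(len(X_tr80), len(X_te20))
```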

> Note about real life: When deploying a model into actual use, you typically pick the best-performing model after train/test split evaluation and then train it *again* using all of your data. If it was the best with 75% of the data, it's probably even better with 100% of the data! Kind of like how you like to have homework answer keys after you turn the homework in.
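
Here is a rough sketch of that "evaluate, then retrain on everything" idea. It uses synthetic data from `make_classification` and a logistic regression as a stand-in for whichever model wins. The names `candidate` and `final_model` are hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our vectorized tweets and their labels
X_all, y_all = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=0)

# Step 1: train on the training split and check the score on held-out data
candidate = LogisticRegression(max_iter=1000)
candidate.fit(X_tr, y_tr)
test_score = candidate.score(X_te, y_te)
print(test_score)

# Step 2: once we're happy, make a fresh copy with the same settings
# and train it on ALL of the data
final_model = clone(candidate).fit(X_all, y_all)
```

The catch is that `final_model` has now seen every row, so there is nothing left to test it on. The score from step 1 is the only honest estimate we have.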

Now that we've split our tweets into training and testing tweets, we can use our training data to teach our model what positive and negative tweets look like. **Add training for random forest, linear SVC, and Naive Bayes models.**

Later we'll see how accurate they are when looking at the other 25% of the tweets.

In [48]:
print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)

Training logistic regression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training random forest
Training SVC
Training Naive Bayes


MultinomialNB()
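
About that warning in the output: it comes from the logistic regression, and it means the solver stopped at its iteration limit before it finished settling on an answer. The message itself suggests raising `max_iter`. The sketch below assumes `logreg` is a scikit-learn `LogisticRegression`. The value 1000 is an arbitrary assumption, and the data is synthetic.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Record any warnings raised while fitting so we can inspect them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    logreg_demo = LogisticRegression(max_iter=1000)  # default limit is 100
    logreg_demo.fit(X_demo, y_demo)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
print("Convergence warnings:", len(convergence_warnings))
print("Iterations actually used:", logreg_demo.n_iter_[0])
```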

### Confusion matrices

To see how well each model performs on the test dataset, we'll use a ["confusion matrix"](https://en.wikipedia.org/wiki/Confusion_matrix) for each one. I think confusion matrices are called that because they are confusing.

**We'll talk about them a lot more in class because they're my favorite thing on the entire planet.**

In [49]:
from sklearn.metrics import confusion_matrix

#### Logistic Regression confusion matrix

The basic idea of a confusion matrix is it **compares the actual values to the predicted values for each tweet.** It's just like how a teacher would compare the answers on your quiz to the answer key.

If the language model predicts the same as the actual answer, great! But instead of just giving you the percent you got correct, the benefit of a confusion matrix is that **it also tells you which types of questions you got wrong.** 

For example, we can know if we always accidentally predict negative tweets as positive ones. That's more useful than just knowing we got 75% correct!
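
Before running it on real tweets, here's a tiny made-up example so you can see where each number lands. The six "tweets" below are invented: 0 means negative and 1 means positive.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Answer key vs. what a pretend model guessed
actual    = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0]

# Rows are the actual labels, columns are the predicted labels
toy_matrix = confusion_matrix(actual, predicted)

label_names = pd.Series(["negative", "positive"])
toy_table = pd.DataFrame(
    toy_matrix,
    columns="Predicted " + label_names,
    index="Is " + label_names,
)
print(toy_table)
```

The top-left and bottom-right cells are the correct guesses. The other two cells are the mistakes: negative tweets called positive (top right) and positive tweets called negative (bottom left).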

In [50]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,234,147
Is positive,158,211


In [51]:
# Yes, we can also be lazy and ask for just the score
logreg.score(X_test, y_test)

0.5933333333333334
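
That score isn't magic: it's the correct predictions (the diagonal of the confusion matrix) divided by the total number of test tweets. A quick check using the numbers from the logistic regression matrix above:

```python
import numpy as np

# Counts copied from the logistic regression confusion matrix
logreg_matrix = np.array([[234, 147],
                          [158, 211]])

correct = np.trace(logreg_matrix)   # top-left + bottom-right
total = logreg_matrix.sum()         # every test tweet
accuracy = correct / total
print(correct, total, accuracy)
```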

#### Random forest

In [53]:
# YOUR CODE HERE
# Add a confusion matrix for the random forest
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,255,126
Is positive,114,255


In [52]:
# YOUR CODE HERE
# Find the overall score for the random forest
forest.score(X_test, y_test)

0.68

#### SVC

In [54]:
# YOUR CODE HERE
# Add a confusion matrix for the linear SVC
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,254,127
Is positive,120,249


In [56]:
# YOUR CODE HERE
# Find the overall score for the linear SVC
svc.score(X_test, y_test)

0.6706666666666666

#### Multinomial Naive Bayes

In [55]:
# YOUR CODE HERE
# Add a confusion matrix for the naive bayes
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,266,115
Is positive,117,252


In [57]:
# YOUR CODE HERE
# Find the overall score for the naive bayes
bayes.score(X_test, y_test)

0.6906666666666667

### Percentage-based confusion matrices

Sometimes it's kind of irritating that they're just raw numbers. With a little crazy code, we can calculate them as percentages instead.

#### Logistic regression

In [58]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.614173,0.385827
Is positive,0.428184,0.571816


Out of all of the negative tweets, what percent did we accurately predict?

<i>61%</i>

Did we do better predicting negative tweets or positive tweets?

<i>It's better at predicting negative tweets.</i>
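
If the `.div(...)` trick feels like too much, scikit-learn can do the row-wise division for you with `normalize="true"`. The sketch below rebuilds stand-in label arrays from the logistic regression counts shown earlier (the real `y_test` is in a different order, but the counts are the same) and checks that both approaches agree.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in labels rebuilt from the counts 234 / 147 / 158 / 211
y_true_demo = np.array([0] * 381 + [1] * 369)
y_pred_demo = np.array([0] * 234 + [1] * 147 + [0] * 158 + [1] * 211)

# Manual version: divide each row by that row's total
raw = confusion_matrix(y_true_demo, y_pred_demo)
manual = raw / raw.sum(axis=1, keepdims=True)

# Built-in version: normalize over the true labels (the rows)
by_row = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(by_row.round(6))
```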

#### Random forest

In [59]:
# YOUR CODE HERE
# Calculate a percentage-based confusion matrix for the random forest
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.669291,0.330709
Is positive,0.308943,0.691057


How does the random forest compare to the logistic regression?

<i>It's better at predicting negative tweets and much better at predicting positive tweets.</i>

#### Linear SVC

In [60]:
# YOUR CODE HERE
# Calculate a percentage-based confusion matrix for linear SVC
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.666667,0.333333
Is positive,0.325203,0.674797


The linear SVC doesn't do as well as the random forest, but it does have one benefit. **Can you remember what it was?** We discovered it even before we used our models!

<i>It's better at predicting negative tweets and positive tweets than Logistic Regression but Forest still has the edge.</i>

#### Multinomial Naive Bayes

In [62]:
# YOUR CODE HERE
# Calculate a percentage-based confusion matrix for naive bayes
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Unnamed: 0,Predicted negative,Predicted positive
Is negative,0.698163,0.301837
Is positive,0.317073,0.682927
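
We just copied and pasted the same block eight times. One possible cleanup is wrapping it in a function. This is only a sketch: `labeled_confusion_matrix` is a hypothetical helper name, and the six toy tweets are invented so the example can run on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB


def labeled_confusion_matrix(model, X_test, y_test, as_percent=False):
    """Hypothetical helper: confusion matrix with readable row/column names."""
    matrix = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    label_names = pd.Series(["negative", "positive"])
    table = pd.DataFrame(
        matrix,
        columns="Predicted " + label_names,
        index="Is " + label_names,
    )
    if as_percent:
        # Divide each row by its total so every row adds up to 1
        table = table.div(table.sum(axis=1), axis=0)
    return table


# Tiny invented dataset, scored on its own training data only to keep
# the example short (the real notebook uses a separate test set!)
toy_texts = ["i love this", "so happy today", "great fun",
             "i hate this", "so sad today", "awful day"]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_X = CountVectorizer().fit_transform(toy_texts)
toy_model = MultinomialNB().fit(toy_X, toy_labels)

toy_table = labeled_confusion_matrix(toy_model, toy_X, toy_labels, as_percent=True)
print(toy_table)
```

With a helper like this, each model's matrix would become a single line, for example `labeled_confusion_matrix(forest, X_test, y_test, as_percent=True)`.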


## Review

If you find yourself unsatisfied with a tool, you can try to build your own! This is exactly what we tried to do, using the **Sentiment140 dataset** and several machine learning algorithms.

Sentiment140 is a database of tweets that come pre-labeled with positive or negative sentiment, assigned automatically by the presence of a `:)` or `:(`. Our first step was using a **vectorizer** to convert the tweets into numbers a computer could understand.

After that, we built four different **language models** using different machine learning algorithms. Each one was fed a list of each tweet's **features** - the words - and each tweet's **label** - the sentiment - in the hopes that later it could predict labels if given new tweets. This process of teaching the algorithm is called **training**.

In order to test our algorithms, we split our data into two parts - **train** and **test** datasets. You teach the algorithm with the first group, and then ask it for predictions on the second set. You can then compare its predictions to the right answers and view the results in a **confusion matrix**.

Although **different algorithms took different amounts of time to train**, in this run they all ended up between roughly 59% and 69% accuracy on the test tweets.

## Discussion topics

Which models performed the best? Were there big differences?

<i>I thought Bayes was the best although it was second to Forest when we did the percentage-based confusion matrix. I don't think there were glaring differences.</i>

**Do you think it's more important to be sensitive to negativity or positivity?** Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?

If your answer is "it depends," give me an example.

<i>If you want to improve your product, for example, it would be better if more positive things were incorrectly marked as negative. You want to know what the negatives are.</i>
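
If a model gives you a probability instead of a flat yes/no (some scikit-learn classifiers offer this through `predict_proba`), you can decide which kind of mistake to prefer by moving the cutoff. The numbers below are invented for illustration, and the function names are hypothetical.

```python
import numpy as np

# Invented example: 1 = positive, 0 = negative
actual = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Pretend model output: probability that each tweet is positive
prob_positive = np.array([0.10, 0.35, 0.55, 0.65, 0.45, 0.60, 0.80, 0.95])


def negative_recall(threshold):
    """Share of truly negative tweets that get labeled negative."""
    predicted = (prob_positive >= threshold).astype(int)
    return (predicted[actual == 0] == 0).mean()


def positive_recall(threshold):
    """Share of truly positive tweets that get labeled positive."""
    predicted = (prob_positive >= threshold).astype(int)
    return (predicted[actual == 1] == 1).mean()


for threshold in [0.5, 0.7]:
    print(threshold, negative_recall(threshold), positive_recall(threshold))
```

Raising the cutoff catches more of the negative tweets, but at the cost of calling more positive tweets negative. That is the trade-off the question is asking about.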

**Our models all had very different training times.** Which model(s) do you think offer the best combination of performance and not making you wait around for an hour?

<i>Bayes was the fastest and most accurate of them all.</i>
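
Rather than going by feel, you could time each model's training. A minimal sketch, using random stand-in "word counts" instead of our real tweets and scikit-learn's default-ish settings (the notebook's own models may be configured differently). The exact timings will vary from machine to machine.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Random stand-in data: 500 "tweets", 50 "words", counts of 0, 1 or 2
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(500, 50))
y_demo = rng.integers(0, 2, size=500)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100),
    "linear SVC": LinearSVC(),
    "naive Bayes": MultinomialNB(),
}

# Time how long each .fit() call takes
training_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    training_seconds[name] = time.perf_counter() - start

for name, seconds in training_seconds.items():
    print(f"{name}: {seconds:.4f} seconds")
```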

In the Gebru paper, "language model size" was discussed frequently. Google, Facebook, Microsoft and others are all trying to build larger and larger models in the hopes that they do a better job representing language.

**What are two ways we could increase our model size?**

<i>Train it using more data and from various sources.</i>

If you're feeling like having a wild time, **experiment with how increasing your model size affects training time and accuracy.** You'll just need to change a few numbers and run all of the cells again.
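
Which numbers? Assuming the earlier cells limit both how many tweets we sample and how many words the vectorizer keeps, those are the two dials. The sketch below shows the second one with scikit-learn's `CountVectorizer` and its `max_features` option. The toy tweets are invented, and the notebook's own vectorizer may be set up differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = [
    "i love this sunny day",
    "i hate this rainy day",
    "coding from the couch today",
    "lost my keys again so sad",
]

# max_features caps how many words (columns) the model gets to see;
# None means "keep every word"
shapes = []
for max_features in [3, 10, None]:
    vectorizer = CountVectorizer(max_features=max_features)
    X_demo = vectorizer.fit_transform(toy_tweets)
    shapes.append(X_demo.shape)
    print(max_features, X_demo.shape)
```

More columns means more for each model to learn from, and usually more time spent training.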

Is 75% accuracy good?

<i>For non-journalism purposes, yes. But if this is being used in journalism, you need it to be more than 95% accurate.</i>

Do your feelings change if the performance is described as "incorrect one out of every four times?"

<i>Yes, but then again I'm not the biggest fan of this anyway!</i>

If you randomly guessed positive or negative for each tweet, what would (roughly) your performance be?

<i>About 50%, since the tweets are split roughly evenly between positive and negative.</i>
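
You can check an estimate like this with a quick simulation. The sketch below flips a coin for every test tweet, many times over, and averages the accuracy. The label counts (381 negative, 369 positive) come from the confusion matrices above. The seed and the 1,000 repetitions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same mix of labels as our test set: 381 negative, 369 positive
actual = np.array([0] * 381 + [1] * 369)

# Guess 0 or 1 at random for every tweet, 1,000 times over
accuracies = []
for _ in range(1000):
    guesses = rng.integers(0, 2, size=len(actual))
    accuracies.append((guesses == actual).mean())

mean_accuracy = float(np.mean(accuracies))
print(mean_accuracy)
```

With two roughly balanced labels, coin-flipping lands close to 50%, so that is the bar our models need to clear.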

**How do you feel about sentiment analysis?** Did this and/or the previous notebook make you feel any differently about it?

<i>I never knew anything about it before and now I'm highly sceptical of it.</i>

What would you feel comfortable using our sentiment classifier for?

<i>For something low-stakes.</i>