# Train A Sentiment Classifier

The Yelp dataset comes from the [Yelp academic download](https://www.yelp.com/dataset/download). This lesson is adapted from the classifier example in [TextBlob's documentation](https://textblob.readthedocs.io/en/dev/classifiers.html#classifiers).

In [1]:
import pandas as pd
from textblob.classifiers import NaiveBayesClassifier

In [2]:
# import nltk
# nltk.download('punkt_tab') # we need to do this once for the tokenizer

In [3]:
reviews = pd.read_csv('small_yelp_reviews.csv')

In [4]:
reviews.head()

| Unnamed: 0 | review_id | business_id | text | date |
|---|---|---|---|---|
| 0 | X6EkKbJpOgcwO7zBHVtLrw | XiATUgtzkuxn1IoOwFy1Wg | 5 stars every time. Only reason you wouldn't l... | 2022-01-08 19:52:50 |
| 1 | -2yNYo2eVEqdeV5dELFPLQ | hA8c2kWI_8DXZE1jyVxlBQ | GREAT staff and GREAT food!! My boyfriend had ... | 2022-01-02 00:48:22 |
| 2 | LIRhXeeHR2RSFB2SRJjIPg | mRKxPMk9jHgCj07MwAMUzw | Got a chicken bowl. The flavor wasn't bad but ... | 2022-01-16 19:48:40 |
| 3 | MAbABE4lnYuKdPWvDUCUOA | fgtnOag-DaTsZTHPsgnWSQ | Pick out this place by Yelp. Food was just so-... | 2022-01-17 18:04:06 |
| 4 | FOr1Q4dZByod684CHRRphA | vHr8qhM4CXYB3Ol_9yS6QQ | Sushi pier has been on our list for a while, s... | 2022-01-06 20:00:34 |


## We Need To Annotate Our Data

To train our classifier, we need to assign a positive or negative sentiment to each review. You're each going to annotate 24 reviews as positive or negative. This will give us a dataset of 192 reviews, which we'll split into training and test sets (80% train, 20% test). Then we'll pass the final 8 unseen reviews to the classifier to see how we did!

To annotate the data, add a `sentiment` column to the CSV and fill it in for your assigned rows.

The possible values for the column are `pos` and `neg`.
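Once the annotated rows are combined, they need to become `(text, label)` tuples, the format `NaiveBayesClassifier` is trained on in this lesson, and be split 80/20. The following is a minimal sketch of that step using only the standard library. The function name `split_labeled_reviews`, the seed, and the made-up rows are illustrative assumptions, not part of the lesson's data.

```python
import random

VALID_LABELS = {"pos", "neg"}

def split_labeled_reviews(rows, train_frac=0.8, seed=42):
    """Validate labels, shuffle, and return (train, test) lists of (text, label) tuples."""
    pairs = []
    for row in rows:
        label = row["sentiment"].strip().lower()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected sentiment {row['sentiment']!r}")
        pairs.append((row["text"], label))
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

# Made-up rows standing in for the combined, annotated CSV.
# With pandas, the rows could come from reviews.to_dict("records").
rows = [
    {"text": f"review {i}", "sentiment": "pos" if i % 2 else "neg"}
    for i in range(192)
]
train_pairs, test_pairs = split_labeled_reviews(rows)
print(len(train_pairs), len(test_pairs))
```

Because 80% of 192 is not a whole number, the sketch rounds the training size down, so the two sets will not be exactly 80/20.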

In [5]:
# Placeholder data to use until our annotated CSVs are combined
train = [
    ("I love this sandwich.", "pos"),
    ("this is an amazing place!", "pos"),
    ("I feel very good about these beers.", "pos"),
    ("this is my best work.", "pos"),
    ("what an awesome view", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff.", "neg"),
    ("I can't deal with this", "neg"),
    ("he is my sworn enemy!", "neg"),
    ("my boss is horrible.", "neg"),
]
test = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
    ("I can't believe I'm doing this.", "neg"),
]

## Train Our Classifier

In [6]:
cl = NaiveBayesClassifier(train)

Let's see what's driving the model:

In [7]:
cl.show_informative_features(5)

Most Informative Features
            contains(my) = True              neg : pos    =      1.7 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
             contains(I) = False             pos : neg    =      1.4 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0
            contains(my) = False             pos : neg    =      1.3 : 1.0
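To see where these ratios come from, the sketch below recomputes two of them by hand from the training sentences. It assumes add-0.5 smoothing over the two feature values (True/False) and uses a simplified whitespace tokenizer rather than TextBlob's own; the helper names `has_word` and `smoothed_prob` are hypothetical. Under those assumptions the results round to the values in the table.

```python
train = [
    ("I love this sandwich.", "pos"),
    ("this is an amazing place!", "pos"),
    ("I feel very good about these beers.", "pos"),
    ("this is my best work.", "pos"),
    ("what an awesome view", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff.", "neg"),
    ("I can't deal with this", "neg"),
    ("he is my sworn enemy!", "neg"),
    ("my boss is horrible.", "neg"),
]

def has_word(text, word):
    # Simplified tokenizer (assumption): split on whitespace, strip punctuation.
    return word in [tok.strip(".,!?") for tok in text.split()]

def smoothed_prob(word, present, label, data, alpha=0.5):
    """Estimate P(contains(word) == present | label) with add-alpha smoothing."""
    docs = [text for text, lab in data if lab == label]
    matches = sum(1 for text in docs if has_word(text, word) == present)
    # Two possible feature values (True/False), hence 2 * alpha below.
    return (matches + alpha) / (len(docs) + 2 * alpha)

# contains(my) = True, neg : pos
my_ratio = smoothed_prob("my", True, "neg", train) / smoothed_prob("my", True, "pos", train)
# contains(an) = False, neg : pos
an_ratio = smoothed_prob("an", False, "neg", train) / smoothed_prob("an", False, "pos", train)
print(round(my_ratio, 1), round(an_ratio, 1))
```

With only ten training sentences, words like "my" and "an" end up as the strongest signals, which is a good reason to train on the larger annotated dataset.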


## Remember Our Accuracy Metric?

In [8]:
cl.accuracy(test)

0.8333333333333334
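Accuracy is the fraction of test examples whose predicted label matches the annotated one, so 0.8333 on six test sentences means five were classified correctly. The sketch below computes the same metric by hand. The `keyword_rule` function is a toy stand-in invented for illustration, not the trained classifier, so the example it gets wrong is not necessarily the one the classifier missed.

```python
test = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
    ("I can't believe I'm doing this.", "neg"),
]

def accuracy(predict, labeled_examples):
    """Return the share of examples where predict(text) equals the label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)

def keyword_rule(text):
    # Toy rule (assumption): call a sentence negative if it contains a negation word.
    negations = {"not", "can't"}
    return "neg" if negations & set(text.lower().split()) else "pos"

acc = accuracy(keyword_rule, test)
print(acc)
```

Any callable that maps a string to a label can be passed as `predict`, including the `classify` method of a trained classifier.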

## Let's Try Something Completely New!

In [9]:
cl.classify("This is an amazing library!")

'pos'

In [10]:
prob_dist = cl.prob_classify("This one's a doozy.")  # returns a probability distribution over the labels

In [11]:
prob_dist.prob('pos'), prob_dist.prob('neg')

(0.6311475409836058, 0.3688524590163945)
