# Train A Sentiment Classifier

The Yelp dataset comes from the [Yelp academic download](https://www.yelp.com/dataset/download). This lesson is adapted from the classifier example in [TextBlob's documentation](https://textblob.readthedocs.io/en/dev/classifiers.html#classifiers).

In [1]:
import pandas as pd
from textblob.classifiers import NaiveBayesClassifier

In [2]:
# Run these two lines once to download the tokenizer data TextBlob needs:
# import nltk
# nltk.download('punkt_tab')

In [3]:
reviews = pd.read_csv('small_yelp_reviews.csv')

In [4]:
reviews.head()

Unnamed: 0,review_id,business_id,text,date
0,X6EkKbJpOgcwO7zBHVtLrw,XiATUgtzkuxn1IoOwFy1Wg,5 stars every time. Only reason you wouldn't l...,2022-01-08 19:52:50
1,-2yNYo2eVEqdeV5dELFPLQ,hA8c2kWI_8DXZE1jyVxlBQ,GREAT staff and GREAT food!! My boyfriend had ...,2022-01-02 00:48:22
2,LIRhXeeHR2RSFB2SRJjIPg,mRKxPMk9jHgCj07MwAMUzw,Got a chicken bowl. The flavor wasn't bad but ...,2022-01-16 19:48:40
3,MAbABE4lnYuKdPWvDUCUOA,fgtnOag-DaTsZTHPsgnWSQ,Pick out this place by Yelp. Food was just so-...,2022-01-17 18:04:06
4,FOr1Q4dZByod684CHRRphA,vHr8qhM4CXYB3Ol_9yS6QQ,"Sushi pier has been on our list for a while, s...",2022-01-06 20:00:34


In [5]:
reviews.shape

(200, 4)

## We Need To Annotate Our Data

We need to assign a positive or negative sentiment to each review in order to train our classifier. Each of you will annotate 24 reviews as positive or negative, giving us a dataset of 192 annotated reviews. We'll split that dataset into train (80%) and test (20%) sets, then pass the final 8 unseen reviews into the classifier to see how we did!

To annotate the data, add a `sentiment` column to the CSV and fill it in for your assigned rows.

The possible annotations for the column are: `pos` or `neg`.
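
The bookkeeping behind this plan can be checked with a few lines of arithmetic. This is only a sketch: the helper name `plan_annotation` is hypothetical, and the figure of 8 annotators is inferred from 192 / 24 rather than stated in the lesson.

```python
def plan_annotation(total_reviews, per_annotator, n_annotators, train_frac=0.8):
    """Return (annotated, held_out, n_train, n_test) for an annotation plan."""
    annotated = per_annotator * n_annotators      # reviews that receive a label
    held_out = total_reviews - annotated          # unseen reviews kept for the final demo
    n_train = round(annotated * train_frac)       # training share, rounded to whole reviews
    n_test = annotated - n_train                  # everything else is the test set
    return annotated, held_out, n_train, n_test


annotated, held_out, n_train, n_test = plan_annotation(200, 24, 8)
print(annotated, held_out, n_train, n_test)  # 192 8 154 38
```

In practice the annotated total drifts a little from the plan once real files come in, so the split sizes are best recomputed from the data actually collected.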

In [6]:
import glob

In [28]:
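annotated_data = pd.concat([
    pd.read_csv(
        f,
        names=['review_id', 'business_id', 'text', 'date', 'sentiment']
    )
    .dropna(subset=['sentiment'])  # keep only rows that were actually annotated
    # The annotator tag is the part of the filename after the last underscore.
    # removesuffix (Python 3.9+) drops the literal '.csv'; rstrip('.csv') would
    # instead strip any trailing '.', 'c', 's', or 'v' characters.
    .assign(annotator=f.split("_")[-1].removesuffix('.csv'))
    for f in glob.glob('./manually-annotated-reviews/*')
])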
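
A detail worth keeping in mind when deriving the annotator tag from a filename: `str.rstrip('.csv')` removes any trailing run of the characters `.`, `c`, `s`, and `v`, not the literal suffix `.csv`. The sketch below shows the difference on a made-up filename; `annotator_from_path` is a hypothetical helper, and `str.removesuffix` assumes Python 3.9 or later.

```python
def annotator_from_path(path):
    """Extract the annotator tag from a path such as './dir/reviews_NK.csv'."""
    filename = path.split("/")[-1]          # drop the directory part
    tag_with_ext = filename.split("_")[-1]  # text after the last underscore
    return tag_with_ext.removesuffix(".csv")


example = "./manually-annotated-reviews/reviews_vics.csv"  # hypothetical filename
last_part = example.split("_")[-1]      # 'vics.csv'

stripped = last_part.rstrip(".csv")     # strips characters, leaving 'vi'
tag = annotator_from_path(example)      # strips the suffix, leaving 'vics'
print(stripped, tag)
```

For the annotator tags that appear in this lesson both approaches give the same result, because none of them ends in one of those four characters.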

In [29]:
annotated_data['annotator'].value_counts()

annotator
NK            35
cf            26
reviews-MR    25
borja         25
LM            24
msh           23
GSB           23
g             16
Name: count, dtype: int64

In [30]:
annotated_data.shape

(197, 6)

In [31]:
annotated_data.to_csv('annotated_data.csv', index=False)

In [40]:
annotated_data = pd.read_csv('./annotated_data.csv')

In [99]:
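# Drop any header rows that were read in as data when the annotator files were combined.
clean_df = annotated_data.loc[lambda x: x['review_id'] != 'review_id'].copy()
# Keep labels that are exactly 'pos' or 'neg'; any other value (including one with stray whitespace) defaults to 'pos'.
clean_df['sentiment'] = clean_df['sentiment'].apply(lambda x: x.strip() if x in ['pos', 'neg'] else 'pos')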
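
The cleaning step defaults every label that is not exactly `pos` or `neg` to `pos`. A stricter alternative, sketched here with a hypothetical `normalize_label` helper, trims whitespace and case first and returns `None` for anything it does not recognize, so unexpected annotations can be reviewed instead of silently becoming positive.

```python
VALID_LABELS = {"pos", "neg"}


def normalize_label(raw):
    """Return 'pos' or 'neg' after trimming and lowercasing, or None if unrecognized."""
    if not isinstance(raw, str):
        return None  # e.g. NaN from an empty CSV cell
    label = raw.strip().lower()
    return label if label in VALID_LABELS else None


samples = ["pos", " neg", "NEG ", "positive", float("nan")]
normalized = [normalize_label(s) for s in samples]
print(normalized)  # ['pos', 'neg', 'neg', None, None]
```

With pandas this could be applied through `clean_df['sentiment'].apply(normalize_label)`, followed by a look at the rows where the result is missing. Note that doing so may change the class counts reported later in this lesson.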

In [100]:
clean_df.head()

Unnamed: 0,review_id,business_id,text,date,sentiment,annotator
1,LaO2ZSqc6rN3AZmDXcgU2Q,FpyjR9TiaO3JyynpF-y-7g,Service and food are meh. I came here once in ...,2022-01-08 14:52:13,neg,cf
2,m8wfNY6s0YaQCF08Ad-row,3YqUe2FTCQr0pPVK8oCv6Q,"After so much talk from my sister, I had to co...",2022-01-12 2:10:19,pos,cf
3,FgepcIqW9uWMBRGX_4xCig,dECEn8-37NHSyZbq2a1nQw,Amazing gem in the rough! They have a variety ...,2022-01-15 20:47:05,pos,cf
4,31DWipZCMv4M8RAU7LIb1Q,ttDkz3SO_58bAkEp7rSsNA,Hidden is key word. Upper floor of gym buil...,2022-01-09 21:45:48,pos,cf
5,9sMReBdqs47Mf3mrA2CzZA,1VPpbFms0augW1raf8cycw,"The staff was friendly and helpful, the boat r...",2022-01-16 23:51:15,neg,cf


In [101]:
clean_df.shape

(193, 6)

In [102]:
clean_df = clean_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [103]:
clean_df.head()

Unnamed: 0,review_id,business_id,text,date,sentiment,annotator
0,XUeXPZG2ZtqEeP3J936TQA,KHQXmUFiAD0FHvrMSakJBA,"Came here on a weeknight, they allowed us to c...",2022-01-09 19:03:24,pos,LM
1,HfHjM6sZqCE26Bv1OXpdrw,5MusOgK528q7uyGw0H9MVQ,Possibly the worst food I've ever had. You're ...,2022-01-12 00:51:49,neg,NK
2,AhZHHXu-f46v4zcv6G-9WA,dG-gZOWzn8iO1Rvv_fbXxA,I love coming here whether it's for food or dr...,2022-01-07 16:11:22,pos,msh
3,LL-N7hmC-HfTEbYKwHLYYA,GXFMD0Z4jEVZBCsbPf4CTQ,Everything was delish. And I'm pretty sure we ...,2022-01-14 22:00:31,pos,NK
4,4_YLMltSgfLpbOIyRn7dZQ,9Abj5AABzqdXREl1wzSudg,Visiting Peddlers Village for the 1st time. We...,1/16/22 3:22,pos,GSB


In [104]:
# Build (text, label) pairs: the first 154 shuffled rows (about 80%) for training, the remaining 39 for testing.
train = [ (r['text'], r['sentiment'].strip()) for i, r in clean_df[:154][['text', 'sentiment']].iterrows() ]
test = [ (r['text'], r['sentiment'].strip()) for i, r in clean_df[154:][['text', 'sentiment']].iterrows() ]
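
The cut at row 154 is roughly 80% of the 193 cleaned rows. The same shuffle-then-slice idea can be sketched in plain Python. `shuffle_and_split` is a hypothetical helper, the toy data is made up, and `random.Random(42)` will not reproduce the row order pandas produces with `random_state=42`.

```python
import random


def shuffle_and_split(pairs, train_frac=0.8, seed=42):
    """Shuffle (text, label) pairs reproducibly and split them into train and test lists."""
    rng = random.Random(seed)          # a private generator keeps the shuffle repeatable
    shuffled = list(pairs)             # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the 193 annotated reviews
toy_pairs = [(f"review {i}", "pos" if i % 3 else "neg") for i in range(193)]
toy_train, toy_test = shuffle_and_split(toy_pairs)
print(len(toy_train), len(toy_test))  # 154 39
```

Shuffling before slicing matters here because the combined data is grouped by annotator; without it, the test set would come entirely from whichever files were read last.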

In [105]:
train

[('Came here on a weeknight, they allowed us to choose if we wanted to sit inside or in the outside heated tent. We sat outside and it was cozy and secluded which was amazing. The crab cakes and oysters were the best out of the four things we ordered. They have a great happy hour, definitely recommend the chicken platter and mushroom flatbread. Will definitely be back when we are in Philly again!',
  'pos'),
 ("Possibly the worst food I've ever had. You're better off buying the Boston market TV dinner or literally anything else.",
  'neg'),
 ("I love coming here whether it's for food or drinks. The staff is typically friendly; last time I was there the bartender was super engaging. The aesthetic is also pleasing. I'm never waiting long.",
  'pos'),
 ("Everything was delish. And I'm pretty sure we ordered half the menu. The only negative thing I can say, and it's really just a preference.. The peach cobbler wasn't my favorite. Kind of dry.",
  'pos'),
 ("Visiting Peddlers Village for th

In [106]:
test

[("5 stars every time. Only reason you wouldn't like it is cause you don't know what's good for you. Love the while spot.",
  'pos'),
 ("Cozy Italian spot located in Davis island. One of my favorite things about Oggi is it's a full 4 course dining experience with the entree. Nothing is absolute spectacular but I've never had anything I wouldn't order again. My favorite is their bucatini.",
  'pos'),
 ("This place was amazing.  We got there at 5:30 pm, we had theater tickets at 7 pm.  There was a group of probably 25 that were sitting when we got there and a few other tables of people.  We know that everything is cooked fresh so with the giant table we thought of leaving not wanting to get to the theater late the server said you'll have time.  We ordered.  Then music starts blasting (we were under the speaker) then the owner (I think) sang some beautiful authentic Mexican songs!  She was amazing.  Insider note: ask her to sing you a song you will not be disappointed!  Food came out hot 

## Train Our Classifier

In [107]:
cl = NaiveBayesClassifier(train)
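
To build some intuition for what a Naive Bayes classifier does with word-presence features such as `contains(bad)`, here is a from-scratch toy. It is not TextBlob's or NLTK's implementation: the tokenizer, the smoothing constant `alpha=0.5`, and every function name are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    return {w.strip(".,!?;:'\"()").lower() for w in text.split()} - {""}


def train_nb(pairs, alpha=0.5):
    """Count, per label, how many documents contain each word."""
    label_counts = Counter(label for _, label in pairs)
    word_counts = defaultdict(Counter)  # label -> word -> number of docs containing it
    vocab = set()
    for text, label in pairs:
        words = tokenize(text)
        vocab |= words
        word_counts[label].update(words)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab, "alpha": alpha}


def log_scores(model, text):
    """Return {label: unnormalized log posterior} for a new text."""
    words = tokenize(text)
    total_docs = sum(model["labels"].values())
    alpha = model["alpha"]
    scores = {}
    for label, n_docs in model["labels"].items():
        score = math.log(n_docs / total_docs)  # class prior
        for word in model["vocab"]:
            # Smoothed probability that a document of this class contains the word
            p_present = (model["words"][label][word] + alpha) / (n_docs + 2 * alpha)
            score += math.log(p_present if word in words else 1.0 - p_present)
        scores[label] = score
    return scores


def classify(model, text):
    """Pick the label with the highest log score."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)


toy_train = [
    ("The food was great and the staff were friendly", "pos"),
    ("Great service, will be back", "pos"),
    ("Cold food and a rude server", "neg"),
    ("We waited 40 minutes and the food was cold", "neg"),
]
model = train_nb(toy_train)
print(classify(model, "friendly staff and great food"))  # pos
print(classify(model, "cold and rude"))                  # neg
```

Every vocabulary word contributes to the score, whether it is present in the text or absent from it, which is why even short inputs can produce confident predictions.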

Let's see what's driving the model:

In [108]:
cl.show_informative_features(20)

Most Informative Features
           contains(bad) = True              neg : pos    =      9.8 : 1.0
          contains(know) = True              neg : pos    =      9.8 : 1.0
        contains(looked) = True              neg : pos    =      9.8 : 1.0
          contains(most) = True              neg : pos    =      9.8 : 1.0
         contains(where) = True              neg : pos    =      9.8 : 1.0
       contains(minutes) = True              neg : pos    =      9.5 : 1.0
          contains(cold) = True              neg : pos    =      8.2 : 1.0
    contains(completely) = True              neg : pos    =      8.2 : 1.0
      contains(customer) = True              neg : pos    =      8.2 : 1.0
        contains(either) = True              neg : pos    =      8.2 : 1.0
         contains(extra) = True              neg : pos    =      8.2 : 1.0
       contains(instead) = True              neg : pos    =      8.2 : 1.0
          contains(open) = True              neg : pos    =      8.2 : 1.0
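
Each row compares how likely the feature is to be true in one class versus the other. The sketch below shows how such a ratio can arise. The class totals (47 negative and 107 positive training reviews) follow from the label counts reported in this lesson, but the per-word counts are hypothetical, and the add-0.5 smoothing is an assumption about the estimator rather than something this lesson confirms.

```python
def smoothed_prob(docs_with_word, docs_in_class, alpha=0.5):
    """Estimate P(contains(word) = True | class) with additive smoothing over two outcomes."""
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)


def likelihood_ratio(neg_hits, neg_total, pos_hits, pos_total, alpha=0.5):
    """Return how many times more likely the word is in a negative review than a positive one."""
    return smoothed_prob(neg_hits, neg_total, alpha) / smoothed_prob(pos_hits, pos_total, alpha)


# Hypothetical word counts: seen in 6 of 47 negative and 1 of 107 positive training reviews
ratio = likelihood_ratio(neg_hits=6, neg_total=47, pos_hits=1, pos_total=107)
print(f"neg : pos = {ratio:.2f} : 1.0")  # neg : pos = 9.75 : 1.0
```

With so few training reviews, a handful of occurrences is enough to push an ordinary word like "know" or "where" to the top of the list, so these ratios say more about this small sample than about sentiment in general.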

## Remember Our Accuracy Metric?

In [109]:
cl.accuracy(test)

0.8717948717948718

In [110]:
len(test)

39
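
Accuracy is the share of test items whose predicted label matches the annotated one. On 39 test reviews, a score of about 0.8718 corresponds to 34 correct predictions. A minimal sketch of the calculation, using toy labels arranged so that 34 of 39 match:

```python
def accuracy(gold, predicted):
    """Return the fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels: 39 items, 34 of which match
gold = ["pos"] * 39
predicted = ["pos"] * 34 + ["neg"] * 5
print(round(accuracy(gold, predicted), 4))  # 0.8718
```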

In [111]:
test[0]

("5 stars every time. Only reason you wouldn't like it is cause you don't know what's good for you. Love the while spot.",
 'pos')

In [112]:
cl.classify(test[0][0])  # pass only the review text, not the (text, label) tuple

'pos'

In [113]:
test_df = pd.DataFrame(test, columns=['text', 'sentiment'])
test_df['classify_result'] = test_df['text'].apply(cl.classify)

In [114]:
test_df.loc[lambda x: x['sentiment'] != x['classify_result']]

Unnamed: 0,text,sentiment,classify_result
7,I got here at 6:02pm on a Sunday and it was al...,neg,pos
8,"Service was great, but the food was meh. I ord...",neg,pos
18,"Context: Over 600,000 new COVID-19 cases on av...",neg,pos
23,Worst service ever! We used to enjoy going her...,neg,pos
33,"Returning customer, Been here a couple times....",neg,pos


In [124]:
test_df.loc[lambda x: x['sentiment'] == 'neg']

Unnamed: 0,text,sentiment,classify_result
3,Hate to say it but I don't understand what all...,neg,neg
7,I got here at 6:02pm on a Sunday and it was al...,neg,pos
8,"Service was great, but the food was meh. I ord...",neg,pos
13,Rating one star because there is no option for...,neg,neg
15,"I have two dogs, Isabelle and Willy. On Januar...",neg,neg
18,"Context: Over 600,000 new COVID-19 cases on av...",neg,pos
23,Worst service ever! We used to enjoy going her...,neg,pos
27,Buyer beware. I bought my 2021 Tacoma last Feb...,neg,neg
33,"Returning customer, Been here a couple times....",neg,pos
38,Do not hire these crook. They left big holes i...,neg,neg
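
The table above lists 10 negative test reviews, 5 of which were classified as `pos`. Since the test set has 39 reviews and all 5 errors are negative reviews, the 29 positive reviews were all classified correctly. A sketch of per-class precision and recall from those counts (the helper names are hypothetical):

```python
from collections import Counter

LABELS = ("pos", "neg")


def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


def recall(counts, label):
    """Of the items whose gold label is `label`, what fraction was predicted as `label`?"""
    total = sum(counts[(label, other)] for other in LABELS)
    return counts[(label, label)] / total if total else 0.0


def precision(counts, label):
    """Of the items predicted as `label`, what fraction truly has that label?"""
    total = sum(counts[(other, label)] for other in LABELS)
    return counts[(label, label)] / total if total else 0.0


# Rebuild the test-set outcome from the counts shown in this lesson
gold = ["pos"] * 29 + ["neg"] * 10
predicted = ["pos"] * 29 + ["pos"] * 5 + ["neg"] * 5

counts = confusion_counts(gold, predicted)
print("neg recall:", recall(counts, "neg"))        # 0.5
print("neg precision:", precision(counts, "neg"))  # 1.0
print("pos recall:", recall(counts, "pos"))        # 1.0
```

So while overall accuracy looks high, the classifier catches only half of the negative reviews.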


In [115]:
test_df.loc[lambda x: x['sentiment'] != x['classify_result']]['text'].values

array(['I got here at 6:02pm on a Sunday and it was already closed and no one could help. The hours of operation should be adjusted.',
       'Service was great, but the food was meh. I ordered the Western omelette and the onion and peppers were kind of raw. The pancakes my partner ordered upset both of our stomachs (either that or it was the water). Utensils were also a little dirty. Standard diner with room for improvement. The server was wearing a face mask the entire time, and the booths are distanced/separated, which we appreciated',
       'Context: Over 600,000 new COVID-19 cases on average each day now in the U.S., and Arizona cases at their highest level ever, and hospitals overwhelmed. This natural grocers has a big sign on the door as you go in, stating (correctly) that masks are required by government mandate, and further stating that people not wearing a mask will not be permitted in the store. \nOf the 3 dozen people I saw shopping, at least a third were not masked. I exp

In [116]:
clean_df['sentiment'].apply(lambda x: x.strip()).value_counts()

sentiment
pos    136
neg     57
Name: count, dtype: int64
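
These counts show a clear class imbalance, which puts the accuracy score in context: a model that always answers `pos` would already be right about 70% of the time on this data. A sketch of that majority-class baseline (`majority_baseline` is a hypothetical helper):

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(labels)


labels = ["pos"] * 136 + ["neg"] * 57
baseline_label, baseline_acc = majority_baseline(labels)
print(baseline_label, round(baseline_acc, 3))  # pos 0.705
```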

## Let's Try Something Completely New!

In [117]:
cl.classify("This is an amazing library!")

'pos'

In [122]:
prob_dist = cl.prob_classify("What a terrible restaurant with great service.")

In [123]:
prob_dist.prob('pos'), prob_dist.prob('neg')

(1.0, 1.9555702103835987e-16)
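
The near-certain probability for a sentence that contains both "terrible" and "great" is typical of Naive Bayes: it combines one likelihood per feature, so modest per-word differences compound into a very lopsided total. The sketch below shows how two log scores become probabilities. The scores are hypothetical, and `normalize_log_scores` is an illustrative helper, not a TextBlob function.

```python
import math


def normalize_log_scores(log_scores):
    """Convert per-label log scores into probabilities that sum to 1 (log-sum-exp trick)."""
    peak = max(log_scores.values())  # subtract the max so exp() cannot underflow to all zeros
    total = sum(math.exp(s - peak) for s in log_scores.values())
    return {label: math.exp(s - peak) / total for label, s in log_scores.items()}


# Hypothetical log scores: a gap of 36 in log space is a factor of about e^36
probs = normalize_log_scores({"pos": -120.0, "neg": -156.0})
print(probs)
```

Probabilities this extreme are better read as "the model's features lean heavily one way" than as a calibrated measure of confidence.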
