<a href="https://colab.research.google.com/github/madhugopinathan/deep-nlu/blob/master/yelp_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT (Bidirectional Encoder Representations from Transformers)

In [0]:
%config InlineBackend.figure_format = 'retina'

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [0]:
pd.options.display.max_colwidth = -1  # show full review text; pandas >= 1.0 expects None instead of -1


In [0]:
import spacy

In [2]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/50/89/ad0d6bb932d0a51793eaabcf1617a36ff530dc9ab9e38f765a35dc293306/pytorch_transformers-1.1.0-py3-none-any.whl (158kB)
Collecting sentencepiece (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting regex (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd9792681240657a4c0a599c10a81/regex-2019.08.19.tar.gz (654kB)
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... done
  Created wheel for regex: filename=regex-2019.8.19-cp36-cp36m

In [0]:
import torch
from pytorch_transformers import BertModel, BertTokenizer

In [4]:
pretrained_weights = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(pretrained_weights)



In [6]:
!wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz


--2019-08-22 05:08:15--  https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.164.229
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.164.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 196146755 (187M) [application/x-tar]
Saving to: ‘yelp_review_full_csv.tgz’


2019-08-22 05:08:20 (40.0 MB/s) - ‘yelp_review_full_csv.tgz’ saved [196146755/196146755]



In [7]:
!tar xvfz yelp_review_full_csv.tgz


yelp_review_full_csv/
yelp_review_full_csv/train.csv
yelp_review_full_csv/readme.txt
yelp_review_full_csv/test.csv


## Load Yelp Reviews

In [0]:
DATA_DIR = "./yelp_review_full_csv/"


In [0]:
df = pd.read_csv(DATA_DIR + "train.csv", header=None, names=['rating', 'review'])


In [14]:
df[df.review.str.contains("indian")].sample(5)


index,rating,review
16328,1,"Placed a take-out order of vegetable pakoras, shahi paneer, garlic naan, and papadum. Did not receive the pakoras at all, the naan was COLD (not even lukewarm), the papadum was burned, and the shahi paneer was crappy. As your lawyer, I advise you to become a patron of any other indian place in the valley but this one."
635579,3,"Simple store with basic indian groceries. Store is clean and staff are courteous. Items are pricey , minimum of 20% more than market price. I like the store except for the prices"
33289,2,"I've eaten here several times and I think the food tastes very good. The buffet is the only way to go here because the meals are overpriced and portions are small, Now the service is horrible, inattentive at best , I don't feel welcomed and the servers I have had were flippant and the last time argued with us regarding the tea, we were served tea that was old but he insisted it was fine. Kinda aggravating. So if you are just looking for some good indian buffet, with out service not a bad place."
85181,4,"Took some friends there to thank them for helping me move. We needed good vegetarian food and good beer within walking distance of my place, and it fit the bill perfectly. \nI thought the chef's choice appetizers were delicious and a great deal, as well as the beer selection and prices. The entrees were tasty too, but I've had much better indian food.\nOur waiter was patient and cheerful, and best of all, we ate al fresco, and the weather was so gorgeous, a man at the table next to us in his magnanimity, treated us to almond joy ice cream. mmm."
361192,2,The place was just ill maintained and dirty. All the tables had food remnants. The food was really bad - the snacks were soaked in oil and served in paper plates and cups. The batata wada was stale. The thali was tolerable but mostly insipid. The cherry on the pan had gone bad. The serving staff were simultaneously serving and mopping the floor. \n\nWe go often to the Rajbhog in the triangle area but this one in Charlotte just does not cut the mustard. You'll be better off at Woodlands or some other indian joint.


## BERT Tokenization

Notice how small the vocabulary is: only 30,522 entries, far fewer than the number of distinct words in a large English corpus.

In [33]:
tokenizer.vocab_size

30522
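One practical consequence of a small vocabulary is a small embedding table. The arithmetic below is a rough sketch; the hidden size of 768 is an assumption about `bert-base` models and is not shown anywhere in this notebook.

```python
# Rough size of the token embedding table: one vector per vocabulary entry.
vocab_size = 30522   # value reported by tokenizer.vocab_size above
hidden_size = 768    # assumption: hidden size of bert-base models

embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,} parameters in the token embedding table")
```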

In [0]:

# Review at index 16328 from the sample above
review = """Placed a take-out order of vegetable pakoras, shahi paneer, 
            garlic naan, and papadum. Did not receive the pakoras at all, 
            the naan was COLD (not even lukewarm), the papadum was burned, 
            and the shahi paneer was crappy. As your lawyer, I advise you 
            to become a patron of any other indian place in the valley 
            but this one."""

In [0]:
nlp = spacy.load("en")  # "en" is a spaCy 2.x shortcut; spaCy 3 needs a full model name such as "en_core_web_sm"

In [46]:
list(nlp(review))

[Placed, a, take, -, out, order, of, vegetable, pakoras, ,, shahi, paneer, ,, 
             , garlic, naan, ,, and, papadum, ., Did, not, receive, the, pakoras, at, all, ,, 
             , the, naan, was, COLD, (, not, even, lukewarm, ), ,, the, papadum, was, burned, ,, 
             , and, the, shahi, paneer, was, crappy, ., As, your, lawyer, ,, I, advise, you, 
             , to, become, a, patron, of, any, other, indian, place, in, the, valley, 
             , but, this, one, .]

In [47]:
len(list(nlp(review)))

77

In [49]:
len(tokenizer.tokenize(review))

89

In [50]:
tokenizer.tokenize(review)

['placed',
 'a',
 'take',
 '-',
 'out',
 'order',
 'of',
 'vegetable',
 'pak',
 '##ora',
 '##s',
 ',',
 'shah',
 '##i',
 'pan',
 '##eer',
 ',',
 'garlic',
 'na',
 '##an',
 ',',
 'and',
 'papa',
 '##du',
 '##m',
 '.',
 'did',
 'not',
 'receive',
 'the',
 'pak',
 '##ora',
 '##s',
 'at',
 'all',
 ',',
 'the',
 'na',
 '##an',
 'was',
 'cold',
 '(',
 'not',
 'even',
 'luke',
 '##war',
 '##m',
 ')',
 ',',
 'the',
 'papa',
 '##du',
 '##m',
 'was',
 'burned',
 ',',
 'and',
 'the',
 'shah',
 '##i',
 'pan',
 '##eer',
 'was',
 'crap',
 '##py',
 '.',
 'as',
 'your',
 'lawyer',
 ',',
 'i',
 'advise',
 'you',
 'to',
 'become',
 'a',
 'patron',
 'of',
 'any',
 'other',
 'indian',
 'place',
 'in',
 'the',
 'valley',
 'but',
 'this',
 'one',
 '.']
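To see which words were split, the pieces can be grouped back into words using the `##` prefix. The helper below is a minimal sketch; `group_wordpieces` is a hypothetical name and is not part of `pytorch_transformers`, and the token list is copied by hand from the start of the output above.

```python
def group_wordpieces(tokens):
    """Group a flat WordPiece list into one list of pieces per word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1].append(token)   # continuation of the previous word
        else:
            words.append([token])     # start of a new word (or a punctuation mark)
    return words


# First tokens of the output above, copied by hand.
tokens = ["placed", "a", "take", "-", "out", "order", "of", "vegetable",
          "pak", "##ora", "##s", ",", "shah", "##i", "pan", "##eer"]

# Print only the words that needed more than one piece.
for pieces in group_wordpieces(tokens):
    if len(pieces) > 1:
        word = pieces[0] + "".join(p[2:] for p in pieces[1:])
        print(word, "->", pieces)
```

On this excerpt only the dish names (`pakoras`, `shahi`, `paneer`) are split; the common English words each map to a single token.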

In [51]:
tokenizer.decode(tokenizer.encode('papadum'))

'papadum'

In [16]:
tokenizer.decode(tokenizer.encode(review))

'placed a take - out order of vegetable pakoras, shahi paneer, garlic naan, and papadum. did not receive the pakoras at all, the naan was cold ( not even lukewarm ), the papadum was burned, and the shahi paneer was crappy. as your lawyer, i advise you to become a patron of any other indian place in the valley but this one.'
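The round trip is not lossless: the text comes back lowercased, the line breaks are gone, and spaces remain around `-` and inside the parentheses. The sketch below reproduces that spacing on a short excerpt. It assumes that decoding amounts to joining tokens with spaces, gluing `##` pieces onto the previous token, and removing the space before a few punctuation marks; both helper names are hypothetical, and the library's own implementation may differ.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, then glue ## pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


def tidy_punctuation(text):
    """Drop the space before a few punctuation marks (illustrative subset)."""
    for mark in [".", ",", "!", "?"]:
        text = text.replace(" " + mark, mark)
    return text


# Excerpt of the tokenized review, copied by hand from the earlier output.
tokens = ["the", "na", "##an", "was", "cold", "(", "not", "even",
          "luke", "##war", "##m", ")", ",", "the", "papa", "##du", "##m",
          "was", "burned", ","]

print(tidy_punctuation(merge_wordpieces(tokens)))
```

Because parentheses and hyphens are not in the tidy-up list of this sketch, they keep their surrounding spaces, which matches `( not even lukewarm ),` in the decoded review.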

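BERT handles rare words with a sub-word tokenization algorithm called [WordPiece](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble). A word that is missing from the vocabulary is split into smaller pieces that are in it, and every piece after the first carries a `##` prefix: `papadum` becomes `papa`, `##du`, `##m`. This is how a 30,522-entry vocabulary can still represent words like `pakoras` and `paneer`, and it is also why BERT produces more tokens (89) than spaCy (77) for the same review.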
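The splitting step can be sketched as a greedy longest-match-first search over the vocabulary. The code below is a toy illustration, not the library's tokenizer: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is hand-picked to mirror pieces seen in the output above, and the real tokenizer also lowercases the text and splits off punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real one has 30,522 entries learned from a corpus.
toy_vocab = {"garlic", "papa", "##du", "##m", "na", "##an", "pan", "##eer"}

for word in ["garlic", "papadum", "naan", "paneer", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, `garlic` stays whole, `papadum`, `naan`, and `paneer` are split the same way as in the BERT output, and `xyz` falls back to the unknown token because none of its prefixes is in the vocabulary.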
