In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from fastai.tabular import *
from fastai.text import *

%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Data Science Task
1. Build an ML model that predicts the `sentiment`
2. Apply the model to a single restaurant to reveal the key aspects that drive overall customer `perception`
3. Anything else you think might be interesting!

In [15]:
path = untar_data(URLs.YELP_REVIEWS)
df_train = pd.read_csv(path/'train.csv', header=None, names=['rating', 'text']).sample(frac=0.01, random_state=1)
df_test = pd.read_csv(path/'test.csv', header=None, names=['rating', 'text']).sample(frac=0.01, random_state=1)

Taking a 1% random sample of each split (6,500 training and 500 test reviews) for fast development.

In [16]:
print(df_train.shape, df_test.shape)
df_train.head()

(6500, 2) (500, 2)


Unnamed: 0,rating,text
21194,1,Thank you for all the emails you sent me on my...
373117,4,"I, myself, and vietnamese and have tried all d..."
470627,3,"3.5 stars.\n\nI have had BFG on my list of \""m..."
256672,3,Edamame with truffle salt was great; really li...
465495,4,So I'm running errands before I go back to th...


In [17]:
# First review
df_train.loc[21194, 'text'][:500]

'Thank you for all the emails you sent me on my review! I was surprised at how many responses I recieved from people searching for the right dentist..\\nI shared my new dentist information and even got some movie tickets from my dentist for the referrals!\\nI find it funny how since I wrote this review how many people have reviewed with 5 stars... They must have a lot of friends and family! \\nI hope everyone reads my review and picks the right dentist for your needs!\\nHappy Holidays'

In [18]:
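# Make `rating` the index so the saved CSV has the label in column 0 and the text in column 1
# (the column order TextClasDataBunch.from_csv reads by default)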
df_train = df_train[['text', 'rating']].set_index('rating')
df_test = df_test[['text', 'rating']].set_index('rating')
df_train.head()

Unnamed: 0_level_0,text
rating,Unnamed: 1_level_1
1,Thank you for all the emails you sent me on my...
4,"I, myself, and vietnamese and have tried all d..."
3,"3.5 stars.\n\nI have had BFG on my list of \""m..."
3,Edamame with truffle salt was great; really li...
4,So I'm running errands before I go back to th...


In [19]:
df_train.to_csv(path / 'train_df_sample.csv')
df_test.to_csv(path / 'test_df_sample.csv')

### Tokenization
The first processing step splits the raw text into words, or `tokens`.

In [20]:
data = TextClasDataBunch.from_csv(path, 'train_df_sample.csv')
data.show_batch()

text,target
"xxbos xxmaj dear xxmaj pastavino , xxmaj chef xxmaj marc , and staff : \n xxmaj why did you screw me ? i walked out of your restaurant feeling dirty , ashamed , and ultimately screwed . \n 5-star reviews left and right , family members telling me it is impeccable , xxunk - xxunk of a chef painting a picture of a truly delicious xxmaj italian meal",2
"xxbos 3 or 4 ... 3 or 4 .. 3 or 4 .. xxup xxunk ! xxmaj darn this thing for not having xxunk . i knew this would be tough the minute i left this place . \n \n i find myself in food xxunk very often . i know that if i want xxmaj mexican food i can go to "" insert location here "" and leave",3
"xxbos xxmaj this place deserves a 2-star for the following reasons : xxup rude , xxup unprofessional and xxup mediocre drinks . xxmaj most of the time we would 've taken our money else where but hey , we drove all the way here , so we might as well just get our drinks and because the place itself might be worth it . \n \n xxmaj the place",2
"xxbos xxmaj let me preface the following review by saying that if i did n't absolutely have a terrible experience , i would n't have said anything . xxmaj this review is also concerning their window xxup only . i really wanted to go to their restaurant the next time we go to xxmaj new xxmaj york , but if this window is a representation of xxmaj serendipity 3 's",1
"xxbos xxmaj terrible . xxmaj food and service . \n \n i have been to just about every strip steak place and this is hands down my least favorite . xxmaj and not just by a little bit , by a rather large margin . i love steak and i love xxmaj french so this should have been a match made in heaven , perhaps how much i looked",1


* Each "'s" is kept together as a single token
* Contractions are split like this: "did", "n't"
* HTML symbols are cleaned up and all text is lowercased
* Several special tokens (all those beginning with `xx`) replace unknown tokens (see below), record the casing lost by lowercasing (`xxmaj` before a capitalized word, `xxup` before an all-caps word), or introduce different text fields (here we only have one).
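
The rules above can be illustrated with a toy tokenizer. This is only a sketch: `toy_tokenize` is a hypothetical helper written for this notebook, not fastai's tokenizer, and it covers just the contraction, possessive, and casing rules from the list.

```python
import re

BOS, MAJ, UP = "xxbos", "xxmaj", "xxup"

def toy_tokenize(text):
    """Toy approximation of the rules listed above (NOT fastai's real tokenizer)."""
    # Split "n't" and "'s" off the preceding word and isolate punctuation.
    raw = re.findall(r"\w+(?=n't)|n't|'s|\w+|[^\w\s]", text)
    tokens = [BOS]  # every text starts with the beginning-of-text marker
    for tok in raw:
        if tok.isupper() and len(tok) > 1:
            tokens.append(UP)   # the next token was written in ALL CAPS
        elif tok[0].isupper() and len(tok) > 1 and tok[1:].islower():
            tokens.append(MAJ)  # the next token was Capitalized
        tokens.append(tok.lower())
    return tokens

print(toy_tokenize("I didn't like Chef Marc's RUDE staff."))
```

Lowercasing shrinks the vocabulary, while the `xxmaj` and `xxup` markers keep the casing information available to the model.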

### Numericalization into `vocab`

* Each unique token is mapped to an integer index
* Only the 60,000 most frequent tokens are kept; all remaining tokens are replaced by the unknown token `xxunk`
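
A minimal sketch of this step in plain Python, assuming tokenized reviews as lists of strings. `build_vocab` and `numericalize` are hypothetical helpers, the special-token list is shortened, and the `min_freq` cutoff is an illustrative assumption rather than something stated in this notebook.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # shortened list, for illustration only

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    for special in SPECIALS:
        counts.pop(special, None)  # specials get fixed positions at the front
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    itos = SPECIALS + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Replace each token by its index; unseen or rare tokens map to xxunk."""
    unk = stoi["xxunk"]
    return [stoi.get(tok, unk) for tok in tokens]

docs = [["xxbos", "the", "food", "was", "good"],
        ["xxbos", "the", "service", "was", "bad"]]
itos, stoi = build_vocab(docs, max_vocab=10, min_freq=2)
print(itos)
print(numericalize(["xxbos", "the", "pizza", "was", "good"], stoi))
```

Words that fall outside the kept vocabulary ("pizza" and "good" here) all collapse onto the same `xxunk` index, which is why rare words show up as `xxunk` in the batches.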

In [21]:
# First 10 vocab entries: special tokens (unknown, padding, beginning of text, etc.) followed by '.'
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 '.']

In [22]:
# Example tokenised review
data.train_ds[0][0]

Text xxbos xxmaj the bad- xxmaj well for the price i expected more . xxmaj do n't get me wrong the food was good but boy was it expensive . i do nt usually eat at high end steakhouses because i can cook a steak at home . xxmaj we were out for a night on the town so why not splurge a little . i think our server was not as experienced as the others . i heard other servers xxunk the menu as well as mentioning things that were not on the menu . xxmaj ours did not do that . xxmaj the place smelled like sewage in the main area . xxmaj luckily our section was fine . xxmaj the air conditioning was awfully cold . xxmaj we spent just under $ 200 once tip was added for my wife and xxup i. xxmaj and that was without drinks . 
  xxmaj the good- 32 oz was cooked and seasoned just the way i like it . xxmaj its nice having a steak being served on an extremely hot plate . xxmaj mashed sweet potato were really tasty . xxmaj and the asparagus was perfectly cooked and still crunchy . xxmaj the chocolate m

`xxbos` marks the beginning of a text (here, one review), not the beginning of each sentence.

In [23]:
# Tokens replaced by their vocab indices (the model takes integer ids as input)
data.train_ds[0][0].data[:10]

array([   2,    5,   10, 7629,    5,  112,   20,   10,  204,   13], dtype=int64)

### Language Model
The model is pretrained on WikiText-103, a cleaned subset of Wikipedia.
It was trained to predict the next word, given all the previous words as input.
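
To make the next-word objective concrete, here is a small sketch of how training pairs can be formed from one numericalized text. The helper names are hypothetical and the ids are illustrative; this shows the idea, not fastai's data loading code.

```python
def shifted_pairs(token_ids):
    """Inputs and targets for next-word prediction: targets are the inputs shifted by one."""
    return token_ids[:-1], token_ids[1:]

def context_target_pairs(token_ids):
    """Expanded view: each target token paired with all the tokens that precede it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

ids = [2, 5, 10, 7629, 5, 112]  # illustrative token ids
x, y = shifted_pairs(ids)
print(x)  # model input
print(y)  # what the model should predict at each position
for context, target in context_target_pairs(ids):
    print(context, "->", target)
```

No labels are needed for this objective, which is why the language model can be fine-tuned on the review text alone before any rating is used.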

In [27]:
# Decrease batchsize if GPU can't handle the load

bs = 24       # range 12 - 48

In [28]:
path = untar_data(URLs.YELP_REVIEWS)
path.ls()

[WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/data_lm.pkl'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/data_save.pkl'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/readme.txt'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/test'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/test.csv'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/test_df_sample.csv'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/train'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/train.csv'),
 WindowsPath('C:/Users/luked/.fastai/data/yelp_review_full_csv/train_df_sample.csv')]

### Create databunches (the classifier reuses the language model's vocab)

In [38]:
data_lm = TextLMDataBunch.from_csv(path, 'train_df_sample.csv')
data_clas = TextClasDataBunch.from_csv(path, 'train_df_sample.csv', vocab=data_lm.train_ds.vocab)

In [39]:
data_lm.show_batch()

idx,text
0,"trips to the xxmaj strip in the last couple years . xxmaj we 'd read up on reviews beforehand so we knew that this was more a xxmaj broadway style show than a straight xxmaj cirque show . xxmaj so we did n't enter the theater with any xxunk . \n \n xxmaj earlier in the day as we purchased our tickets at the xxmaj box xxmaj office ,"
1,"something to the texture of the meat . \n \n xxmaj there is serious xxunk in downtown xxmaj phoenix now . xxmaj this place is not ready to succeed here . xxbos xxmaj this is n't a movie theater that i frequent regularly since i live in xxup xxunk , but the "" xxunk "" since xxmaj rave owned the place seems favorable . xxmaj they 've added extended"
2,". xxmaj but - once i am on the road , i see something that makes me more than a little disappointed . i had mentioned the rain / mudstorm from a few weeks ago ? xxmaj well , i had left my windows xxunk in the office parking lot that night - and had plenty on mud drops on both front doors - the interior xxunk and armrests ."
3,"chef . xxmaj he xxunk and tells us ( ! ) that the rolls are actually correct : that salmon colored roll is indeed tuna , and the tuna colored roll is , in fact , salmon . xxmaj when confronted with the issue of salmon colored xxunk ( not to mention half of xxmaj craig xxmaj xxunk 's xxunk ) being named after the wrong xxunk fish , the"
4,", i figured she was hand xxunk them in the back . i xxunk walking over to the bar and xxunk myself but i xxunk . \n \n xxmaj all in all , if i 'm ever in town again i most likely wo nt come here . xxmaj it was a total let down . \n * sad face * xxbos xxmaj been there several times ."


### Transfer learning Model (pre-trained `AWD_LSTM` network on `WikiText103`)

In [40]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [42]:
# Learning rate finder
learn.lr_find()

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.


In [None]:
learn.recorder.plot(skip_end=15)

In [None]:
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

In [None]:
# Now unfreeze all layers to fine-tune the whole model on the Yelp corpus

In [None]:
learn.unfreeze()
# May take a while to run
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

* The language model is now fine-tuned on the Yelp reviews
* On its own it can be used to predict the next word in a sentence
* The next step is to keep its encoder, drop the final next-word prediction layers, and replace them with a classification/regression head