# Seminar: Applied Deep Learning for NLP

In this notebook, we fine-tune pretrained transformers (GPT-2 for text generation, DistilBERT for text classification) on the Yelp review dataset, filtered down to restaurants. A separate text generation model is trained for each restaurant category.

We will use the trained models to generate new reviews and to classify new reviews. The generated reviews are stored and read aloud by our Alexa skill, while the classification model is called by the Alexa skill on the fly.

In [None]:
import tensorflow as tf
import os
import requests
import numpy as np
import json
import pandas as pd
import random

In [None]:
!pip install transformers==3.5.1

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Data Preprocessing

Download and unzip the dataset from Yelp (the download link is signed and expires, so it must be replaced every time).

In [None]:
# Replace with your own access link
# After downloading, rename the file to yelp_dataset.tgz
!wget 'https://yelp-dataset.s3.amazonaws.com/YDC14/yelp_dataset.tgz?Signature=VnGzZgV1ODNutsZMrTYtE8baPkU%3D&Expires=1610713473&AWSAccessKeyId=AKIAJ3CYHOIAD6T2PGKA'
# Unzip the file
!gunzip -c yelp_dataset.tgz | tar xvf -

In [None]:
df_business = pd.read_json('yelp_academic_dataset_business.json', lines=True)
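# fillna returns a new DataFrame; it is only displayed here and df_business itself is left unchanged
df_business.fillna('NullValue')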

In [None]:
# Keep only businesses that are restaurants and have existing reviews
df_filter = df_business[(df_business['categories'].str.contains('Restaurants', na=False))]
# Raise the threshold if you only want restaurants with more than a given number of reviews
df = df_filter[df_filter['review_count'] > 0]
# Save a smaller file with filtered restaurants
df.to_json(r'business_small.json', orient='records', lines=True)

Next, we generate JSON files containing all reviews and tips that belong to the selected businesses. Tips are simply shorter reviews; both are used for review generation.

In [None]:
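# df = pd.read_json('business_small.json', lines=True)
# A set makes the membership test fast; the name `id` is kept because later cells reuse it
id = set(df['business_id'])

# Write to 'review_small.json', the file name that is read back in a later cell
with open('review.json') as f, open('review_small.json', 'w') as output:
    for line in f:
        content = json.loads(line)
        bid = content['business_id']
        if bid in id:
            output.write(line)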

In [None]:
# Same filter for tips; the with-block closes (and flushes) both files
with open('tip.json') as f, open('tip_small.json', 'w') as output1:
    for line in f:
        content = json.loads(line)
        bid = content['business_id']
        if bid in id:
            output1.write(line)
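
The two cells above share one pattern: stream a line-delimited JSON file and keep only the records whose `business_id` belongs to the selected restaurants. A minimal sketch of that pattern as a reusable helper could look as follows. The function name `filter_json_lines` and the in-memory sample records are hypothetical and only serve as an illustration.

```python
import io
import json


def filter_json_lines(source, sink, wanted_ids):
    """Copy every line from source to sink whose business_id is in wanted_ids.

    Returns the number of lines written.
    """
    wanted = set(wanted_ids)  # set lookup is O(1) per line
    written = 0
    for line in source:
        if not line.strip():
            continue  # skip blank lines
        if json.loads(line)['business_id'] in wanted:
            sink.write(line)
            written += 1
    return written


# Hypothetical sample records standing in for review.json
sample = io.StringIO(
    '{"business_id": "a1", "text": "Great noodles"}\n'
    '{"business_id": "b2", "text": "Slow service"}\n'
    '{"business_id": "a1", "text": "Would come again"}\n'
)
kept = io.StringIO()
n_kept = filter_json_lines(sample, kept, ['a1'])
print(n_kept)
```

With real data, `source` and `sink` would be file objects opened in a `with` block, so both files are closed even if a line fails to parse.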

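Generate a JSON file with up to 100000 positive and 100000 negative reviews for review classification. Reviews with 3 or more stars are labeled "1" (positive), reviews with fewer than 3 stars are labeled "0" (negative), and only reviews longer than 20 words are kept.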

In [None]:
positive = 0
negative = 0

with open('review.json') as f, open('review_starsv2.json', 'w') as output2:
  for line in f:
    # Stop once both classes are full
    if positive == 100000 and negative == 100000:
      break
    content = json.loads(line)
    if content['business_id'] not in id:
      continue
    text = content['text']
    # Skip reviews of 20 words or fewer
    if len(text.split(' ')) <= 20:
      continue
    star = float(content['stars'])
    if star >= 3.0 and positive < 100000:
      output2.write(json.dumps({'text': text, 'stars': '1'}) + '\r\n')
      positive += 1
    elif star < 3.0 and negative < 100000:
      output2.write(json.dumps({'text': text, 'stars': '0'}) + '\r\n')
      negative += 1
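
The labeling rule used above can be isolated into a small function, which makes the thresholds easy to inspect and change. This is only a sketch: the name `label_review` and the `min_words` parameter are hypothetical, and the rule mirrors the cell above (a 3-star review counts as positive).

```python
def label_review(stars, text, min_words=20):
    """Return '1' (positive), '0' (negative), or None if the review is too short."""
    # Same length check as the cell above: split on single spaces
    if len(text.split(' ')) <= min_words:
        return None
    return '1' if float(stars) >= 3.0 else '0'


long_text = ' '.join(['word'] * 25)  # hypothetical 25-word review
print(label_review(4, long_text))
print(label_review(2.0, long_text))
print(label_review(5, 'too short'))
```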

In [None]:
df_b = pd.read_json('business_small.json', lines=True)
df_r = pd.read_json('review_small.json', lines=True)
df_t = pd.read_json('tip_small.json', lines=True)

In [None]:
df_b = df_b[['business_id', 'name', 'categories', 'hours']]
df_r = df_r[['business_id', 'text']]
df_t = df_t[['business_id', 'text']]
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat gives the same result
temp = pd.concat([df_r, df_t])

In [None]:
# Merge business and review
final = pd.merge(df_b, temp, on='business_id')
final.to_json(r'final.json', orient='records', lines=True)

In [None]:
# Download the whole files if necessary
# from google.colab import files
# files.download('final.json') 
# files.download('review_starsv2.json')

# Review Generation

In [None]:
# Change this to where the final json from above is
df = pd.read_json('final.json', lines = True)

In [None]:
df.head()

Unnamed: 0,business_id,name,categories,hours,text
0,7sb2FYLS2sejZKxRYF9mtg,Sakana,"Restaurants, Sushi Bars, Buffets, Japanese, Ba...","{'Monday': '11:30-0:0', 'Tuesday': '11:30-0:0'...",Yesterday was my first time at Sakana and I th...
1,7sb2FYLS2sejZKxRYF9mtg,Sakana,"Restaurants, Sushi Bars, Buffets, Japanese, Ba...","{'Monday': '11:30-0:0', 'Tuesday': '11:30-0:0'...","I really liked Sakana when it first opened, I ..."
2,7sb2FYLS2sejZKxRYF9mtg,Sakana,"Restaurants, Sushi Bars, Buffets, Japanese, Ba...","{'Monday': '11:30-0:0', 'Tuesday': '11:30-0:0'...",The best sushi bar in the United states the se...
3,7sb2FYLS2sejZKxRYF9mtg,Sakana,"Restaurants, Sushi Bars, Buffets, Japanese, Ba...","{'Monday': '11:30-0:0', 'Tuesday': '11:30-0:0'...",This location is located right behind my offic...
4,7sb2FYLS2sejZKxRYF9mtg,Sakana,"Restaurants, Sushi Bars, Buffets, Japanese, Ba...","{'Monday': '11:30-0:0', 'Tuesday': '11:30-0:0'...",What a great little place! \n\nThey offer a 2-...


In [None]:
# Example: list restaurant names for one category group; swap in another pattern from the list below if needed
category = df[df['categories'].str.contains('Chinese|Korean|Asian Fusion|Thai', na=False)]
res_name = category['name'].drop_duplicates()
res_name.sample(10)

254825        Soho Japanese Restaurant
28951         Gangnam Asian BBQ Dining
48619     The Cowfish Sushi Burger Bar
271033                    Pho Kim Long
18618                     District One
64582                    Chubby Cattle
503705                   Lotus of Siam
131561              Hakkasan Nightclub
109302                        SumoMaya
410408                   Chino Bandido
Name: name, dtype: object

We looked at the categories of all restaurants and grouped them into 10 more compact categories, each with a reasonable number of reviews:
1. Burger
2. Pizza|Italian
3. Vegetarian|Vegan|Salad
4. Chinese|Korean|Asian Fusion|Thai
5. Sushi|Japanese|Ramen
6. French|German|British|Fish & Chips
7. Seafood
8. Mexican|Latin American
9. Steakhouse
10. Others
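
The grouping above can be written down as a list of regular-expression patterns, the same kind of pattern that `str.contains` receives in the cells of this section. The sketch below assigns a single group to a category string. Note the assumptions: the names `CATEGORY_PATTERNS` and `assign_group` are hypothetical, and the "first match wins" rule is a choice made for this sketch, whereas the notebook filters each group independently, so one restaurant can appear in several groups.

```python
import re

# Patterns copied from the list above; order decides which group wins
# when a category string matches several patterns (assumption of this sketch).
CATEGORY_PATTERNS = [
    ('Burger', 'Burger'),
    ('Pizza|Italian', 'Pizza|Italian'),
    ('Vegetarian|Vegan|Salad', 'Vegetarian|Vegan|Salad'),
    ('Chinese|Korean|Asian Fusion|Thai', 'Chinese|Korean|Asian Fusion|Thai'),
    ('Sushi|Japanese|Ramen', 'Sushi|Japanese|Ramen'),
    ('French|German|British|Fish & Chips', 'French|German|British|Fish & Chips'),
    ('Seafood', 'Seafood'),
    ('Mexican|Latin American', 'Mexican|Latin American'),
    ('Steakhouse', 'Steakhouse'),
]


def assign_group(categories):
    """Return the first matching group name, or 'Others' if nothing matches."""
    if not categories:
        return 'Others'  # missing category string
    for name, pattern in CATEGORY_PATTERNS:
        if re.search(pattern, categories):
            return name
    return 'Others'


print(assign_group('Restaurants, Sushi Bars, Buffets, Japanese'))
print(assign_group('Restaurants, Diners'))
```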


In [None]:
category = df[df['categories'].str.contains('Chinese|Korean|Asian Fusion|Thai', na=False)]
# Strip newlines so that every review fits on a single line
category_list = category['text'].str.replace('\n', '', regex=False).tolist()

In [None]:
# Save reviews for one food type, one review per line
with open('category.txt', 'w', encoding='utf-8') as output:
  for review in category_list:
    output.write(str(review) + '\r\n')

In [None]:
# Load pre-trained model
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")

In [None]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

In [None]:
from transformers import TextDataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="category.txt",
    block_size=128,
)
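
To make the role of `block_size=128` more concrete: assuming `TextDataset` tokenizes the whole file into one long stream of token ids and cuts it into consecutive blocks of `block_size`, dropping a trailing remainder, the idea can be sketched in plain Python. The helper `split_into_blocks` is hypothetical and the integer "tokens" are made up; this is not the library's implementation.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token stream into consecutive full blocks; a short tail is dropped."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


tokens = list(range(10))  # hypothetical token ids 0..9
blocks = split_into_blocks(tokens, block_size=4)
print(blocks)
```

Because reviews are concatenated before cutting, a single block can span the end of one review and the start of the next.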



In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

In [None]:
# Fine-tune GPT-2 with the Hugging Face Trainer
from transformers import Trainer, TrainingArguments
train_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    weight_decay=0.01,
    do_predict=True,
    warmup_steps=1000,
    save_steps=1000,
    save_total_limit=5,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=train_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [None]:
%%time
trainer.train()

Step,Training Loss
500,3.822509
1000,3.645658
1500,3.57563
2000,3.502323
2500,3.472054
3000,3.447516
3500,3.420215
4000,3.395496
4500,3.381283
5000,3.363621


CPU times: user 1h 27min 54s, sys: 54min 47s, total: 2h 22min 41s
Wall time: 2h 23min 36s


TrainOutput(global_step=23721, training_loss=3.263541534926858)

In [None]:
prompt_text = "The salad is"
encoded_prompt = tokenizer.encode(prompt_text,
                                  add_special_tokens=False,
                                  return_tensors="pt")

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPT2LMHeadModel.from_pretrained('drive/MyDrive/nlp_models/text_generation_model_gpt_veggie')

In [None]:
generated_reviews = []

In [None]:
# Feel free to tune the parameters here.
# do_sample=True with top_p=0.9 (top_k=0 disables top-k) is nucleus sampling, not greedy decoding.
sampled_output = model.to(device).generate(input_ids=encoded_prompt.to(device),
                                           max_length=100,
                                           temperature=1.0,
                                           top_k=0,
                                           top_p=0.9,
                                           repetition_penalty=10.0,
                                           do_sample=True,
                                           num_return_sequences=5)

for sequence in sampled_output:
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces=True)
    print(text)
    print("-" * 100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The salad is good and the staff does not seem to have really bad service.  We ordered street tacos as a side order, which were pretty tasty!
Everything was made so well in front of us that I know they make excellent food at some large delis. For example Jose had no problem with their Al Pastor burrito when we asked about it for an appetizer or small entree on its own though since he's amazing but wanted something super filling vs plain chicken instead because both are boring
----------------------------------------------------------------------------------------------------
The salad is yummy.  You can add the dressing, but it's super rich on top and comes with potato purée....probably more than enough tomato for that creamy sauce to make a difference! The creme brûlé glazed donut bread pudding was so good I wasn't quite stuffed after dinner because my stomach had been artificially overloaded just thinking about eating here (I'm always full when in Vegas!)
This review has got five star
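
The generation cell above samples with `top_p=0.9`. As a rough illustration of what nucleus (top-p) filtering does to a next-token distribution, the sketch below keeps the most probable tokens until their cumulative probability reaches `top_p`, zeroes out the rest, and renormalizes. The function `top_p_filter` and the toy distribution are hypothetical; the library applies the same idea to logits inside `generate`.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    # Token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
filtered = top_p_filter(toy_probs, top_p=0.9)
print(filtered)
```

Lower values of `top_p` cut off more of the unlikely tail and tend to give safer but less varied text.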

In [None]:
# Save the fine-tuned model; it can be loaded again in the generation cells above
model.save_pretrained('drive/MyDrive/nlp_models/text_generation_model_gpt_chinese')

# Review Classification


In [None]:
review_labeled = pd.read_json('review_starsv2.json', lines = True)

In [None]:
review_labeled.count()

text     199998
stars    199998
dtype: int64

In [None]:
# Fix the seeds so that the shuffle (and therefore the split) is reproducible
np.random.seed(100)
random.seed(100)

review_labeled = review_labeled.iloc[np.random.permutation(len(review_labeled))]

texts, labels = review_labeled['text'].tolist(), review_labeled['stars'].tolist()
num_samples = len(texts)
training_samples = int(0.85 * num_samples)
validation_samples = int(0.05 * num_samples)
test_samples = int(0.10 * num_samples)

train_texts = texts[:training_samples]
train_labels = labels[:training_samples]
validation_texts = texts[training_samples: training_samples + validation_samples]
validation_labels = labels[training_samples: training_samples + validation_samples]
test_texts = texts[training_samples + validation_samples:]
test_labels = labels[training_samples + validation_samples:]
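
The split above truncates the training and validation sizes with `int(...)`, and the test slice takes everything that is left. A short sketch of the resulting sizes is given below; the helper `split_sizes` is hypothetical, and 199998 is the row count reported by `review_labeled.count()`.

```python
def split_sizes(num_samples, train_frac=0.85, val_frac=0.05):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    n_train = int(train_frac * num_samples)
    n_val = int(val_frac * num_samples)
    n_test = num_samples - n_train - n_val
    return n_train, n_val, n_test


sizes = split_sizes(199998)
print(sizes)
```

Because of the rounding, the test slice holds 20001 reviews here, slightly more than the value of `test_samples` (19999).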

In [None]:
import transformers
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'activation_13', 'vocab_layer_norm', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

In [None]:
max_len = 200

train_encodings = tokenizer(train_texts, truncation=True, max_length=max_len, padding=True)
validation_encodings = tokenizer(validation_texts, truncation=True, max_length=max_len, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, max_length=max_len, padding=True)
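
The tokenizer calls above combine `truncation=True, max_length=max_len` with `padding=True`, which pads to the longest sequence in the batch. The effect on token ids can be sketched in plain Python. This is a simplification under stated assumptions: `pad_and_truncate` is a hypothetical helper, the token ids are made up, `pad_id=0` is assumed, and a real tokenizer also keeps the special tokens in place when it truncates.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Cut sequences to max_len, then pad all of them to the longest one left."""
    truncated = [seq[:max_len] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


toy_batch = [[5, 6, 7, 8, 9, 10], [5, 6]]  # hypothetical token ids
ids, mask = pad_and_truncate(toy_batch, max_len=4)
print(ids)
print(mask)
```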

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
validation_dataset = tf.data.Dataset.from_tensor_slices((dict(validation_encodings), validation_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), test_labels))

In [None]:
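# Resume from a previously saved checkpoint; skip this line to start from 'distilbert-base-uncased'
model = TFDistilBertForSequenceClassification.from_pretrained('drive/MyDrive/model/text_classification_modelv3')
# Note: 1e-3 is a high learning rate for fine-tuning a transformer; values around 5e-5 are more common
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)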
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

Some layers from the model checkpoint at drive/MyDrive/model/text_classification_modelv2 were not used when initializing TFDistilBertForSequenceClassification: ['dropout_79']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at drive/MyDrive/model/text_classification_modelv2 and are newly initialized: ['dropout_99']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
%%time
model.fit(train_dataset.shuffle(1000).batch(64), 
          epochs=1,
          validation_data = validation_dataset.batch(64))

CPU times: user 7.78 s, sys: 216 ms, total: 8 s
Wall time: 10.3 s


<tensorflow.python.keras.callbacks.History at 0x7fed88342c88>

In [None]:
model.evaluate(test_dataset.batch(64))



[1.1879534721374512, 0.5400000214576721]

In [None]:
model.save_pretrained('drive/MyDrive/model/text_classification_modelv3')

In [None]:
print(test_samples)

19999


In [None]:
text = ['This restaurant is really ok. I do not like it much']
encodings = tokenizer(text, truncation=True, max_length=max_len, padding=True)
tfdataset = tf.data.Dataset.from_tensor_slices(dict(encodings))
preds = model.predict(tfdataset.batch(1))
# preds = tf.keras.activations.softmax(tf.convert_to_tensor(list(preds.values())))
preds = tf.keras.activations.softmax(tf.convert_to_tensor(preds)).numpy()
print(preds)
# First score is negative, the second one is positive

[[[0.8097635  0.19023655]]]
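
The prediction cell turns the two raw model outputs (logits) into probabilities with a softmax, where index 0 is the negative class and index 1 the positive class. A minimal stdlib sketch of that step is shown below; the logit values and the `LABELS` list are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ['negative', 'positive']  # index 0 is negative, index 1 is positive
toy_logits = [1.2, -0.25]  # hypothetical model output for one review
probs = softmax(toy_logits)
predicted = LABELS[probs.index(max(probs))]
print(probs, predicted)
```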


In [None]:
model.save_pretrained('drive/MyDrive/model/text_classification_modelv2')

# NLPRule Grammar Checking

In [None]:
!pip install nlprule

Collecting nlprule
  Downloading https://files.pythonhosted.org/packages/89/71/958b64ed704390e25c75f2c5de9d87be1d18402b9a5841cae4a763a0cb62/nlprule-0.3.0-cp36-cp36m-manylinux1_x86_64.whl (2.8MB)

In [None]:
from nlprule import Tokenizer, Rules, SplitOn

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

In [None]:
# Paste a generated text here; the first output is the original text, the second is the NLPRule correction
orig = "The asian food here is best in Las Vegas! I love to eat it at least once a month. It's cozy and tasty so don't try that sitting all day or spend too much on the food...I like their mix of Asian, Brazilian & Peruvian cuisine with an English twist lolThey also have seasonal taro poki which was fantastic!!!!"
print(orig + '\r\n')
rules.correct(orig)

The asian food here is best in Las Vegas! I love to eat it at least once a month. It's cozy and tasty so don't try that sitting all day or spend too much on the food...I like their mix of Asian, Brazilian & Peruvian cuisine with an English twist lolThey also have seasonal taro poki which was fantastic!!!!



"The asian food here is best in Las Vegas! I love to eat it at least once a month. It's cozy and tasty so don't try that sitting all day or spend too much on the food...I like their mix of Asian, Brazilian & Peruvian cuisine with an English twist lolThey also have seasonal taro poki which was fantastic!!!!"