## Yelp Review Generation with Transfer Learning

Text generation is a popular subfield of natural language processing (NLP) that aims to generate text from some **seed** text. The task can be handled with recurrent models such as RNNs or LSTMs, which are designed for sequential text data. However, training such a model from scratch is impractical on an ordinary computer, given the enormous amount of data it has to learn from.

Fortunately, newer training techniques make this tractable. **Transfer learning** is a machine learning technique that transfers knowledge gained from training on one problem to another, similar task. For instance, a neural network trained to classify cars can be reused to classify birds by replacing the later layers of the model.

More good news: several research labs are dedicated to open-sourcing machine learning models. In this notebook, we will use the **GPT-2** model trained by **OpenAI**. GPT-2, short for **Generative Pre-trained Transformer 2**, is a language model released in 2019 and trained on 8 million web pages. Its largest version has 1.5 billion parameters; the `gpt2` checkpoint loaded in this notebook is the smallest variant, with roughly 124 million parameters.

Thanks to the open-sourced model, we can download the pretrained parameters directly through the **Hugging Face** `transformers` library and fine-tune them on Yelp review data. This lets us adapt the knowledge already stored in GPT-2 to our review data.


In [1]:
import os
import datetime
import numpy as np
import pandas as pd
import tensorflow as tf
import warnings
import string
import gc
import re
from transformers import AutoTokenizer, TextDataset, DataCollatorForLanguageModeling
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')

### Prepare for Text Dataset

In [2]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [3]:
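def build_text_files(df, file_path):
    """
    Lowercase and whitespace-normalize every review in df.text, write them to
    file_path as one long string, and return that string.
    """
    text = ''
    for sent in df.text:
        summary = str(sent).strip().lower()
        summary = re.sub(r"\s", " ", summary)  # newlines and tabs become spaces
        text += summary + "  "
    # the context manager closes the file, so the text is flushed to disk
    with open(file_path, 'w') as f:
        f.write(text)
    return text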
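
To make the cleaning step concrete, here is a minimal sketch that applies the same normalization as `build_text_files` to a couple of made-up reviews, without touching the file system. The sample strings and the `normalize_review` helper name are hypothetical and only serve the illustration.

```python
import re

def normalize_review(review):
    """Apply the same per-review cleaning used in build_text_files."""
    cleaned = str(review).strip().lower()
    # \s matches one whitespace character at a time, so "\n\n" would become two spaces
    return re.sub(r"\s", " ", cleaned)

# hypothetical sample reviews, not taken from the Yelp data
sample_reviews = ["  Great Tacos!\nWill come back.  ", "Slow service,\tbut GOOD coffee."]

# each review is followed by two spaces, as in build_text_files
corpus = "".join(normalize_review(r) + "  " for r in sample_reviews)
print(repr(corpus))
```

Note that reviews are separated only by two spaces, so the training text carries no explicit end-of-review marker.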

In [4]:
def load_and_process_input_text(filepath, tokenizer):
    """
    Load and process input text for later modeling.
    """
    
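    df = pd.read_csv(filepath, index_col=0)
    # keep ASCII-only reviews
    df['only_ascii'] = df['text'].apply(lambda x: x.isascii())
    df = df[df.only_ascii].reset_index(drop=True)
    # keep reviews with at least 30 "useful" votes, then sample 1000 of them
    df = df[df['useful'] >= 30]
    df = df.sample(1000)

    # 70/30 train/test split
    train, test = train_test_split(df, test_size = 0.3)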
    
    # build text files
    train = build_text_files(train, 'data/train_data.txt')
    test = build_text_files(test, 'data/test_data.txt')
    return train, test


filepath = 'data/review.csv'
train, test = load_and_process_input_text(filepath, tokenizer)

In [5]:
def load_transformer_dataset(train_path, test_path, tokenizer):
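    """
    Build block-wise training and test datasets from the text files, plus a
    data collator for causal (non-masked) language modeling.
    """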
    train_dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = train_path,
        block_size = 128
    )
    
    test_dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = test_path,
        block_size = 128
    )
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer = tokenizer, mlm = False
    )
    
    return train_dataset, test_dataset, data_collator

train_path = 'data/train_data.txt'
test_path = 'data/test_data.txt'
train_dataset, test_dataset, data_collator = load_transformer_dataset(train_path, test_path, tokenizer)
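
As a rough illustration of what `block_size = 128` means, the sketch below splits a flat list of token ids into fixed-length blocks and drops the incomplete tail, which is, as far as we can tell, how `TextDataset` builds its examples. The helper `split_into_blocks` and the toy ids are hypothetical; the real class also tokenizes the file and caches the result. With `mlm = False`, the collator uses a copy of each block as the labels, and the model shifts them internally for next-token prediction.

```python
def split_into_blocks(token_ids, block_size):
    """Cut token_ids into consecutive blocks of block_size, dropping the remainder."""
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

# toy stand-in for a tokenized corpus: 300 fake token ids
token_ids = list(range(300))
blocks = split_into_blocks(token_ids, block_size=128)

print(len(blocks))     # number of full blocks
print(len(blocks[0]))  # length of each block

# causal language modeling: labels start as a copy of the inputs
inputs = blocks[0]
labels = list(inputs)
# the prediction at position t is compared with the token at position t + 1
next_token_pairs = list(zip(inputs[:-1], labels[1:]))
print(next_token_pairs[:3])
```

In this toy case the last 44 ids do not fill a block and are discarded, so a small corpus can lose a noticeable share of its text at the tail.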

In [6]:
del train, test
gc.collect()

0

### Initialize Trainer with Training Arguments and GPT-2 model

In [7]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

In [8]:
model = AutoModelForCausalLM.from_pretrained('gpt2')

In [9]:
train_args = TrainingArguments(
    output_dir = './gpt2',
    overwrite_output_dir = True,
    num_train_epochs = 3,
    per_device_train_batch_size = 32, # batch size for training
    per_device_eval_batch_size = 64, # batch size for evaluation
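    eval_steps = 400, # only used when step-based evaluation is enabled
    save_steps = 800, # save a checkpoint every 800 steps
    warmup_steps = 500, # learning-rate warmup steps; the run below totals only 126 steps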
    prediction_loss_only = True
)

trainer = Trainer(
    model = model,
    args = train_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)
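
The interaction between `warmup_steps = 500` and a short run is easy to overlook, so here is a small sketch of a linear warmup schedule, which is the default scheduler type in `Trainer` as far as we know. The peak learning rate of 5e-5 is an assumption (the library default, since `train_args` does not set one), the total of 126 steps is taken from the training output reported further down, and `linear_warmup_lr` is a hypothetical helper rather than a library function.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

peak_lr = 5e-5       # assumed: default learning rate of TrainingArguments
warmup_steps = 500   # from train_args
total_steps = 126    # global_step reported by trainer.train()

final_lr = linear_warmup_lr(total_steps, peak_lr, warmup_steps, total_steps)
fraction_of_peak = final_lr / peak_lr
print(final_lr, fraction_of_peak)
```

Under these assumptions the run ends while the learning rate is still ramping up, so a smaller `warmup_steps` may be worth trying for a dataset of this size.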

### Train and save model

In [10]:
trainer.train()

TrainOutput(global_step=126, training_loss=4.071806408110119, metrics={'train_runtime': 10013.24, 'train_samples_per_second': 0.013, 'total_flos': 383903776309248, 'epoch': 3.0})
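
The numbers in `TrainOutput` can be turned into a few more readable quantities. The sketch below is simple arithmetic on the reported values, assuming training ran on a single device so that each step consumed one batch of at most 32 blocks. The perplexity is only a rough figure, because `training_loss` is averaged over the whole run rather than measured on the test set.

```python
import math

global_step = 126        # from TrainOutput
num_train_epochs = 3     # from train_args
train_batch_size = 32    # per_device_train_batch_size, assuming a single device
training_loss = 4.071806408110119

steps_per_epoch = global_step // num_train_epochs
# each step consumes at most one batch, so this bounds the number of training blocks
max_train_blocks = steps_per_epoch * train_batch_size
# exponentiating the average cross-entropy loss (in nats) gives a perplexity estimate
perplexity = math.exp(training_loss)

print(steps_per_epoch, max_train_blocks, round(perplexity, 1))
```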

In [23]:
trainer.save_model()

### Test the model

In [11]:
from transformers import pipeline

In [24]:
# max_length is a generation argument passed at call time, e.g. generate_review(prompt, max_length = 100), not via `config`
generate_review = pipeline('text-generation', model = './gpt2', tokenizer = 'gpt2')

In [28]:
generate_review('This is the best ice cream shop')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "This is the best ice cream shop in the area and have a great selection. for those of you who are staying in the area with a friend or with new friends, we've got your back.   it's a great option for those looking"}]

In [29]:
generate_review('I love the beef noodle offered')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I love the beef noodle offered here. there's a huge chicken dish on each bun. it's really nice. i'll go back for a bowl. they didn't have any fresh bbq sauce in the bowl so i don't know"}]

In [30]:
generate_review('This is the worst steak in town')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This is the worst steak in town,  but  we have never eaten a steak with this amount of meat!    my husband loves their homemade burgers and we are looking forward to coming back from this experience.  we ordered our own buffalo'}]