<a href="https://colab.research.google.com/github/ikennedy240/text4demog/blob/master/text4demog_gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text4Demog Text Generation Using Transformers and GPT-2
The `transformers` library from [Hugging Face](https://huggingface.co/landing/inference-api/startups?utm_source=Google&utm_medium=Search&utm_campaign=Transformers+10x+Faster&utm_id=12055067954&gclid=CjwKCAjwzaSLBhBJEiwAJSRokg1r6FSo8X9OiDO2Gey41WMxEO8fNj8Odw2Twb9NmKBkrWFLnjAVtRoCGYEQAvD_BwE) is an easy-to-use deep learning library. It has tools for image and text data, and it offers many pre-trained models that you can download and fine-tune for your task. This short demo is based on work I did to produce a survey experiment that used computer-generated texts as treatments. It's designed to run in Google Colab, but you could run it fairly easily on any Jupyter kernel with a GPU (and maybe less easily on a kernel with no GPU).


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

    Navigate to Edit→Notebook Settings
    Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [4]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0
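
The rest of this notebook runs the model with PyTorch rather than TensorFlow, so it can also help to check the GPU from PyTorch. This is a minimal sketch, assuming you would rather fall back to the CPU than stop when no GPU is present (the notebook itself hard-codes `cuda` later on):

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
# The fallback is an assumption for illustration; fine-tuning on CPU will be slow.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("PyTorch will use: {}".format(device))
```

With a `device` chosen this way, `model.to(device)` works on either kind of kernel.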


In [None]:
# import dependencies 
import os
import time
import datetime
import pandas as pd
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)
!pip install transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
# note: transformers.AdamW is deprecated in favor of torch.optim.AdamW
from transformers import AdamW, get_linear_schedule_with_warmup

If you haven't connected to your Google Drive, you'll need to upload the data and alter the data path in the next chunk.

In [9]:
# read prepped data from github
df = pd.read_csv('https://raw.githubusercontent.com/ikennedy240/text4demog/master/data/cl_text4demog.csv')

# set the number of texts to include from each category
# note that it takes ~3mins to fine-tune with 100 texts, but more like 45 mins with 1000
n = 100

# slice samples from each category and one with mixed texts
hight50 = df[df.nh_text.str.contains('<hight50>')].sample(n).nh_text.to_list()
lowt50 = df[df.nh_text.str.contains('<lowt50>')].sample(n).nh_text.to_list()
mixed = df.sample(n*3).nh_text.to_list()
# Identify which slice to use in analysis below
text_list = hight50
text_list[:5]

["<hight50> 11516 4th Ave NE, Seattle\n\n 11516 5th AVE NE, SEATTLE, WA 98125 2 bedrooms 1 bathroom 1270 sq. ft. rambler Lease term: 12 months RENT: $1895/mo DEPOSIT: $1895 NO PETS NO SMOKING Newly updated 2 bed rambler just north of Northgate Mall. Great location for commuters!  Beautifully  landscaped yard, with private patio in the back yard, great for BBQ's and entertaining.  New carpet, paint, range and dishwasher.   Walking distance to mall and bus stop. Olympic Elementary, Nathan Hale High, Eckstein Middle School. NO Smoking,  and no pets, firm. Available for 12 mo lease. 1st month rent and equal deposit required. Subject to $40 rental appl. fee - see www.macphersonspm.com for details. MUST VIEW PROPERTY PRIOR TO APPLYING. Contact Leni at show contact info  click to show contact info / show contact info  click to show contact info",
 '<hight50> Extraordinary Gold Coast / Bike Storage\n\n\r Gold Coast Extraordinary 2 Bedroom 2 Bath Great building with modern apartments! Staff is 

In [10]:
# Load the GPT tokenizer.
special_tokens = ['<hight50>','<lowt50>','<mediumt50>']
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>') #gpt2-medium
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


3

In [11]:
print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

The max model length is 1024 for this model, although the actual embedding size for GPT small is 768
The beginning of sequence token <|startoftext|> has the id 50257
The end of sequence token <|endoftext|> has the id 50256
The padding token <|pad|> has the id 50258


In [12]:
batch_size = 2

In [13]:
class GPT2Dataset(Dataset):

  def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=768):

    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:

      encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
    
  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx] 
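
The `truncation=True, max_length=..., padding="max_length"` arguments make every example the same length. The sketch below imitates that behavior on plain lists of ids so the resulting `input_ids` and `attention_mask` are easy to inspect. `pad_or_truncate` is a hypothetical helper for illustration, not the tokenizer's own implementation, and it assumes padding and truncation both happen on the right:

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Keep at most max_length ids, dropping anything past that on the right.
    ids = list(token_ids[:max_length])
    # Real tokens get a 1 in the attention mask.
    mask = [1] * len(ids)
    # Fill the remainder with the pad id, marked 0 in the mask.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short text: <|startoftext|>, two word pieces, <|endoftext|>, then padding.
short_ids, short_mask = pad_or_truncate([50257, 10, 11, 50256], max_length=6, pad_id=50258)
print(short_ids)
print(short_mask)

# A long text: the tail, including the end-of-text id, is cut off.
long_ids, long_mask = pad_or_truncate([50257, 1, 2, 3, 50256], max_length=3, pad_id=50258)
print(long_ids)
print(long_mask)
```

The second case shows a side effect worth keeping in mind: texts longer than `max_length` lose their `<|endoftext|>` token when they are truncated.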

In [14]:
dataset = GPT2Dataset(text_list, tokenizer, max_length=768)

In [15]:
# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

   90 training samples
   10 validation samples


In [16]:
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )


In [17]:
# I'm not really doing anything with the config here
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up
model.resize_token_embeddings(len(tokenizer))

Embedding(50262, 768)
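
The new embedding size of 50262 can be checked by hand from the tokenizer outputs shown earlier. The arithmetic below is a small worked example that uses only numbers printed in this notebook:

```python
# Token ids 0..50256 belong to the pre-trained vocabulary;
# <|endoftext|> is already in it with id 50256.
base_vocab_size = 50257

# Tokens that were new when the tokenizer was created (ids 50257 and 50258).
new_named_tokens = ["<|startoftext|>", "<|pad|>"]

# Tokens added with add_special_tokens (the call returned 3).
category_tokens = ["<hight50>", "<lowt50>", "<mediumt50>"]

expected_vocab_size = base_vocab_size + len(new_named_tokens) + len(category_tokens)
print(expected_vocab_size)  # 50262
```

This matches the first dimension of `Embedding(50262, 768)`, which is what `len(tokenizer)` should report after the additions.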

In [18]:
# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")
model.cuda()

# This output shows all of the layers of the model. We've set the learning rate
# and dropout settings (and everything else) to be the same for all of the
# layers, but you might want to set varied learning rates for some layers.

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50262, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
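
As the comment above suggests, different parts of a model can be given different learning rates. PyTorch optimizers support this through parameter groups. The sketch below uses a tiny stand-in model rather than GPT-2, and the names `TinyModel`, `toy_model`, and `toy_optimizer` as well as the two learning rates are assumptions chosen for illustration:

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """A two-part toy model standing in for embeddings plus later layers."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 10)


toy_model = TinyModel()

# Each dict is one parameter group. A group without its own "lr"
# uses the default passed to the optimizer.
toy_optimizer = torch.optim.AdamW(
    [
        {"params": toy_model.embed.parameters(), "lr": 1e-5},  # slower updates
        {"params": toy_model.head.parameters()},               # default rate
    ],
    lr=5e-4,
    eps=1e-8,
)

for group in toy_optimizer.param_groups:
    print(group["lr"])
```

The same pattern would apply to GPT-2 by selecting parameters from the submodules listed in the printout, for example the embeddings versus the blocks in `h`.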


In [21]:
# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


# some parameters I cooked up that work reasonably well

epochs = 5
learning_rate = 5e-4
warmup_steps = 1e2
epsilon = 1e-8

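# this produces sample output every 100 batches within an epoch
# (the batch counter restarts each epoch, so with the 45 batches per epoch
# used here no samples are printed; lower this value to see some)
sample_every = 100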

optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps, 
                                            num_training_steps = total_steps)

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))
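
To see what the warmup schedule does with these settings, the sketch below recomputes the learning rate at a few steps in plain Python. It assumes the usual linear-warmup, linear-decay shape (rise from 0 to the base rate over the warmup steps, then fall linearly to 0 at the last step). `linear_warmup_multiplier` is a hypothetical helper written for this illustration, not a function from `transformers`:

```python
import math


def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Values taken from this notebook: 90 training samples, batch size 2, 5 epochs.
train_samples, batch_size, epochs = 90, 2, 5
batches_per_epoch = math.ceil(train_samples / batch_size)
total_steps = batches_per_epoch * epochs
warmup_steps = 100
learning_rate = 5e-4

print("batches per epoch:", batches_per_epoch)
print("total steps:", total_steps)
for step in (0, 50, 100, 160, total_steps):
    factor = linear_warmup_multiplier(step, warmup_steps, total_steps)
    print("step {:>3}: lr = {:.2e}".format(step, learning_rate * factor))
```

With 100 warmup steps out of 225, the learning rate is still rising for a little over two of the five epochs, which is worth knowing if you change `n`, `batch_size`, or `epochs`.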

In [25]:
# use this to set the directory to save the model
output_dir = 'gpt2_model'

training_stats = []

model = model.to(device)

# This chunk fine-tunes (trains) and saves the gpt-2 model
total_t0 = time.time()
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()        

        outputs = model(  b_input_ids,
                          labels=b_labels, 
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]  

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader), batch_loss, elapsed))

            model.eval()

            sample_outputs = model.generate(
                                    bos_token_id=random.randint(1,30000),
                                    do_sample=True,   
                                    top_k=50, 
                                    max_length = 200,
                                    top_p=0.95, 
                                    num_return_sequences=1
                                )
            for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
            
            model.train()

        loss.backward()

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)       
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        with torch.no_grad():        

            outputs = model(b_input_ids,
                            attention_mask=b_masks,
                            labels=b_labels)
          
            loss = outputs[0]  
            
        batch_loss = loss.item()
        total_eval_loss += batch_loss        

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)    

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


# Display floats with two decimal places.
pd.set_option('display.precision', 2)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap.
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

# Display the table.
print(df_stats)

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)


Training...

  Average training loss: 0.74
  Training epoch took: 0:00:46

Running Validation...
  Validation Loss: 1.15
  Validation took: 0:00:02

Training...

  Average training loss: 0.74
  Training epoch took: 0:00:46

Running Validation...
  Validation Loss: 1.15
  Validation took: 0:00:02

Training...

  Average training loss: 0.74
  Training epoch took: 0:00:46

Running Validation...
  Validation Loss: 1.15
  Validation took: 0:00:02

Training...

  Average training loss: 0.74
  Training epoch took: 0:00:46

Running Validation...
  Validation Loss: 1.15
  Validation took: 0:00:02

Training...

  Average training loss: 0.74
  Training epoch took: 0:00:46

Running Validation...
  Validation Loss: 1.15
  Validation took: 0:00:02

Training complete!
Total training took 0:03:56 (h:mm:ss)
       Training Loss  Valid. Loss Training Time Validation Time
epoch                                                          
1               0.74         1.15       0:00:46         0:00:02
2    

('gpt2_model/tokenizer_config.json',
 'gpt2_model/special_tokens_map.json',
 'gpt2_model/vocab.json',
 'gpt2_model/merges.txt',
 'gpt2_model/added_tokens.json')
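
One thing to be aware of in the training loop above: the labels are a copy of the input ids, so the `<|pad|>` positions are scored like any other token. The attention mask affects attention, not the loss. A common alternative, sketched below, sets the label at padded positions to -100, the value that the Hugging Face language-model loss ignores. `mask_padding_labels` is a hypothetical helper, not part of the original notebook, and the losses reported above were produced without it:

```python
import torch


def mask_padding_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids and replace padded positions with ignore_index."""
    labels = input_ids.clone()                    # leave the inputs untouched
    labels[attention_mask == 0] = ignore_index    # padded positions drop out of the loss
    return labels


# One toy example: start token, two word pieces, end token, two pads.
toy_ids = torch.tensor([[50257, 10, 11, 50256, 50258, 50258]])
toy_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
toy_labels = mask_padding_labels(toy_ids, toy_mask)
print(toy_labels)
```

In the loop, this would mean replacing `b_labels = batch[0].to(device)` with `b_labels = mask_padding_labels(b_input_ids, b_masks)`. Expect the reported loss values to change, since padded positions no longer count toward the average.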

## Model Evaluation
Now that we've trained the model, we want to see what kinds of texts it can produce. You have the option here of using the model that you've just trained or loading a model that you trained previously. There are already some models you can test in the `text4demog/models` directory.

In [26]:
if 'model' in globals():
  load_model = input("You have a model loaded already, would you like to load a different one? (Y/n)\n")

if 'model' not in globals() or load_model in ['y','Y']:
  model_path = input("Input the directory that contains your gpt-2 model:\n")

  # Reload a model and tokenizer that were saved with `save_pretrained()`
  model = GPT2LMHeadModel.from_pretrained(model_path)
  tokenizer = GPT2Tokenizer.from_pretrained(model_path)
  # Good practice: save your training arguments together with the trained model
  # torch.save(args, os.path.join(output_dir, 'training_args.bin'))

  # Move the model to the GPU
  device = torch.device("cuda")
  model = model.to(device)

model.eval()

# Automatically prepend the start of text token. 
prompt = input("Generating Text. Enter starting tokens or use (d)efault:\n")
if prompt =='d':
  prompt = "<|startoftext|>"
else:
  prompt = "<|startoftext|> "+prompt

n = input("How many texts should we generate?")
n = int(n)

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

print(generated)

sample_outputs = model.generate(
                              generated, 
                              #bos_token_id=random.randint(1,30000),
                              do_sample=True,   
                              top_k=50, 
                              max_length = 300,
                              top_p=0.95, 
                              num_return_sequences=n
                              )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
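
The `generate` call above samples with `top_k=50` and `top_p=0.95`. The sketch below shows the idea behind those two settings on a toy distribution of four words. It is a simplified illustration in plain Python, not the library's implementation; `top_k_top_p_filter` and the toy probabilities are made up for this example:

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p, and renormalize."""
    # Top-k: keep only the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p (nucleus): walk down the ranking until the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution to sample from.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"apartment": 0.5, "home": 0.3, "condo": 0.15, "castle": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

Here `top_k=3` drops "castle", and `top_p=0.8` then drops "condo", so the next word would be sampled from "apartment" and "home" only. Lower values make the generated texts more predictable, and higher values make them more varied.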

You have a model loaded already, would you like to load a different one? (Y/n)
n
Generating Text. Enter starting tokens or use (d)efault:
Great New Apartment
How many texts should we generate?3


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[50257,  3878,   968,  5949,  1823]], device='cuda:0')
0:  Great New Apartment! Beautiful & Clean Condition! Available for rent!
 Great location! Great neighborhood. Very clean and convenient!
 Great location! Great price! Great price!
Excellent job! Great location! Great amenities!
Great price! Excellent location!
This apartment is one of our favorite locations!


1:  Great New Apartment

Near Lake Oahu

1/2 mile from the Waterline

Indian Creek Shopping Center


Cameo, Waterline!
Close proximity to the lake with a nice home!
The Apartment is very spacious and has kitchen throughout
This home is beautiful with living room appliances including dishwasher, dishwasher, light back and many hardwood floors for entertaining.

   It is built in, but it is painted
Gorgeous living room with 1 bath w/ windows.   Great location for office and commercial tenants with great location.    This Apartment is only 4 minutes from Honolulu.
Beautiful Home with a nice spacious interior.


2:  Grea