<a href="https://colab.research.google.com/github/ikennedy240/text4demog/blob/master/text4demog_gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text4Demog Text Generation Using Transformers and GPT-2
The `transformers` library from [Hugging Face](https://huggingface.co/landing/inference-api/startups?utm_source=Google&utm_medium=Search&utm_campaign=Transformers+10x+Faster&utm_id=12055067954&gclid=CjwKCAjwzaSLBhBJEiwAJSRokg1r6FSo8X9OiDO2Gey41WMxEO8fNj8Odw2Twb9NmKBkrWFLnjAVtRoCGYEQAvD_BwE) is an easy-to-use deep learning library. It has tools for image and text data, and it offers many pre-trained models that you can download and fine-tune for your task. This short demo is based on work I did to produce a survey experiment that used computer-generated texts as treatments. It's designed to run in Google Colab, but you could run it fairly easily on any Jupyter kernel with a GPU (and maybe less easily on a kernel with no GPU).


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

    Navigate to Edit→Notebook Settings
    select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with tensorflow:

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## Make sure you have data access
Great! Next, I recommend connecting to your Google Drive. As long as the shared Google Drive folder `text4demog` is in your main Drive folder, the rest of the code should run as written. Alternatively, you could upload the data to Google Colab directly, in which case you need to change the `data_path` value below.

To mount your Google Drive, go to the 'files' pane on the left and click the folder with the Drive logo (far right). Then follow the instructions.

You can also copy the following code into a chunk and run it:
```
from google.colab import drive

# Mount at /content/drive so the paths used in the rest of the notebook resolve
drive.mount('/content/drive')
```

Then you can test to see if you have access to the `text4demog` folder by running `os.listdir('/content/drive/MyDrive/text4demog')`
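
The check described above can be wrapped so that a missing folder produces a readable message instead of a `FileNotFoundError`. This is only a sketch: the helper name `has_data_folder` is made up for illustration, and the path is the one quoted in the text.

```python
import os

def has_data_folder(folder: str) -> bool:
    """Return True if `folder` exists and is a directory."""
    return os.path.isdir(folder)

# Path taken from the text above; adjust it if you mounted Drive elsewhere.
folder = '/content/drive/MyDrive/text4demog'

if has_data_folder(folder):
    print(sorted(os.listdir(folder)))
else:
    print(f'{folder} not found: mount Google Drive or upload the data first')
```

If the second message appears, either the Drive is not mounted yet or the shared folder is not in the top level of your Drive.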

In [3]:
import os
import time
import datetime
import pandas as pd
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)
!pip install transformers
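from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import get_linear_schedule_with_warmup
# The AdamW in transformers is deprecated in favor of the PyTorch implementation.
# Note: torch's default weight_decay is 0.01, while the transformers version defaulted to 0.0.
from torch.optim import AdamW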

If you haven't connected to your Google Drive, you'll need to upload the data and change `data_path` in the next chunk.

In [None]:
data_path = '/content/drive/MyDrive/text4demog/cl_text4demog.csv'

In [67]:
df = pd.read_csv(data_path)
# slice samples from each category and one with mixed texts
hight50 = df[df.nh_text.str.contains('<hight50>')].sample(100).nh_text.to_list()
lowt50 = df[df.nh_text.str.contains('<lowt50>')].sample(100).nh_text.to_list()
mixed = df.sample(300).nh_text.to_list()
# Identify which slice to use in analysis below
text_list = hight50
text_list[:5]

['<hight50> Ballantyne Townhouse\n\n\r 2 Bed Room 2 1/2 Bath Townhouse in great location, close to everything. Has gas fireplace,\r sunken den, wet bar, eat in kitchen, large bed rooms, lots of closet space, private backyard. pool\r and tennis court, 2 private reserved parking spaces.',
 "<hight50> FOREST PARK NEWLY REHABBED BEAUTIFUL 2 BDR available NOW!\n\n\r This newly renovated apartment is a must see! Location location location! One block from the Blue line and 290! Four blocks from the green line! No worries about parking because it's included along with heat and water! This 2 bedroom apartment has walk-in closets, a large living room and dining room with a beautiful fireplace. Granite and new stainless steel appliances. EXTRA PERKS include PRIVATE ENCLOSED PORCH with lots of windows and backyard! Everything you need!  Pets considered. $1500.00 a month. Don't let this one get away! Available 6/1! CALL PREFERRED John\r show contact info\r  click to show contact info",
 "<hight50> 

In [46]:
# Load the GPT tokenizer.
special_tokens = ['<hight50>','<lowt50>','<mediumt50>']
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>') #gpt2-medium
tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


3

In [47]:
print("The max model length is {} tokens for this model; the hidden (embedding) size of GPT-2 small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

The max model length is 1024 tokens for this model; the hidden (embedding) size of GPT-2 small is 768
The beginning of sequence token <|startoftext|> has the id 50257
The end of sequence token <|endoftext|> has the id 50256
The padding token <|pad|> has the id 50258


In [48]:
batch_size = 2

In [49]:
class GPT2Dataset(Dataset):

  def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=768):

    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:

      encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
    
  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx] 
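
To make the `padding="max_length"` and `attention_mask` behavior in the dataset class concrete, here is a pure-Python sketch of what happens to one tokenized text. The function names and the short token ids `11` and `22` are placeholders for illustration. The `mask_labels` helper shows a common variant that hides pad positions from the loss with `-100`; the notebook itself does not do this and passes `input_ids` unchanged as labels.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding

def mask_labels(ids, mask, ignore_index=-100):
    """Optional variant: replace padded positions so the loss skips them."""
    return [tok if m == 1 else ignore_index for tok, m in zip(ids, mask)]

# <|startoftext|> = 50257, <|endoftext|> = 50256, <|pad|> = 50258 (ids printed above)
ids, mask = pad_and_mask([50257, 11, 22, 50256], max_length=6, pad_id=50258)
print(ids)                     # [50257, 11, 22, 50256, 50258, 50258]
print(mask)                    # [1, 1, 1, 1, 0, 0]
print(mask_labels(ids, mask))  # [50257, 11, 22, 50256, -100, -100]
```

Every item in the dataset therefore has the same length, which is what lets the `DataLoader` stack items into batches.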

In [50]:
dataset = GPT2Dataset(text_list, tokenizer, max_length=768)

In [51]:
# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

  900 training samples
  100 validation samples


In [52]:
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )


In [53]:
# I'm not really doing anything with the config, but here it is
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up
model.resize_token_embeddings(len(tokenizer))

Embedding(50262, 768)
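
The row count in the resized embedding follows from the tokens added earlier. A quick arithmetic check, using only numbers that appear in the outputs above (the variable names are just for illustration):

```python
# Vocabulary bookkeeping for the tokenizer set up above.
base_vocab = 50257                                   # pretrained GPT-2 vocabulary, ids 0..50256
new_named_tokens = ['<|startoftext|>', '<|pad|>']    # bos and pad are new; <|endoftext|> already exists
category_tokens = ['<hight50>', '<lowt50>', '<mediumt50>']

expected_rows = base_vocab + len(new_named_tokens) + len(category_tokens)
print(expected_rows)  # 50262, matching Embedding(50262, 768)
```

The five new rows start out untrained, which is why the tokenizer warns that the associated embeddings need fine-tuning.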

In [None]:
# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")
model.cuda()
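
The cell above assumes a GPU is present. If you want the notebook to degrade to CPU instead of failing, one possible approach is to pick the device name first. This is a sketch, and `pick_device` is a hypothetical helper, not part of the original notebook; training GPT-2 on CPU will be very slow.

```python
import importlib.util

def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

device_name = pick_device()
print(device_name)
```

In the notebook you could then write `device = torch.device(device_name)` followed by `model.to(device)`.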


In [55]:
# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)


# some parameters I cooked up that work reasonably well

epochs = 5
learning_rate = 5e-4
warmup_steps = 100  # an integer count of scheduler steps
epsilon = 1e-8

# this produces sample output every 100 steps
sample_every = 100

optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps, 
                                            num_training_steps = total_steps)

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))
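
To see what the scheduler does to the learning rate, here is a plain-Python sketch of a linear warmup followed by a linear decay. It is a simplified re-implementation for illustration, not the library's code. The value 900 is the training-set size printed earlier; the other numbers are the hyperparameters from the cell above.

```python
import math

def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Scale factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # ramp up from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # decay to 0

train_samples, batch_size, epochs = 900, 2, 5
batches_per_epoch = math.ceil(train_samples / batch_size)  # 450 batches
total_steps = batches_per_epoch * epochs                   # 2250 optimizer steps
learning_rate, warmup_steps = 5e-4, 100

for step in (0, 50, 100, 1175, 2250):
    lr = learning_rate * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(step, lr)
```

The rate climbs for the first 100 steps, peaks at `5e-4`, and then falls in a straight line until the last step.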

In [56]:
training_stats = []

model = model.to(device)

today_str = datetime.datetime.now()
today_str = today_str.strftime('%Y_%m_%d')

output_dir = input(f'''
Set a directory to save the model (d)efault: 
/content/gpt2_{today_str}\n
But use "/content/drive/MyDrive/<yourmodelname>" to save to google drive ''')

if output_dir == 'd':
  # Fall back to the default directory shown in the prompt
  output_dir = f'/content/gpt2_{today_str}'


Enter the model name, or use (d)efault: gpt2_2021_10_15



In [None]:
total_t0 = time.time()
for epoch_i in range(0, epochs):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()        

        outputs = model(  b_input_ids,
                          labels=b_labels, 
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]  

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader), batch_loss, elapsed))

            model.eval()

            sample_outputs = model.generate(
                                    bos_token_id=random.randint(1,30000),
                                    do_sample=True,   
                                    top_k=50, 
                                    max_length = 200,
                                    top_p=0.95, 
                                    num_return_sequences=1
                                )
            for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
            
            model.train()

        loss.backward()

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)       
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        with torch.no_grad():        

            outputs = model(b_input_ids,
                            attention_mask=b_masks,
                            labels=b_labels)
          
            loss = outputs[0]  
            
        batch_loss = loss.item()
        total_eval_loss += batch_loss        

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)    

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


# Display floats with two decimal places.
pd.set_option('display.precision', 2)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap.
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

# Display the table.
print(df_stats)

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
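
The losses recorded in `training_stats` are mean cross-entropy values, so they can be turned into perplexities with an exponential. A minimal sketch follows; the loss values are placeholders for illustration only, not results from this notebook. Because the labels here include padding positions, the resulting number is mainly useful for comparing epochs of this same setup.

```python
import math

def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)

# Placeholder rows shaped like the entries of training_stats.
example_stats = [
    {'epoch': 1, 'Valid. Loss': 1.0},
    {'epoch': 2, 'Valid. Loss': 0.5},
]

for row in example_stats:
    print(row['epoch'], round(perplexity(row['Valid. Loss']), 2))
```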

## Model Evaluation
Now that we've trained the model, we want to see what kinds of texts it can produce. You can either use the model you've just trained or load one that you trained previously. There are already some models you can test in the `text4demog/models` directory.
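
The generation call in the next cell samples with `top_k=50` and `top_p=0.95`. As a rough illustration of what those two settings do to a next-token distribution, here is a simplified pure-Python sketch. It is not the library's implementation, and the toy probabilities are made up.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalised cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}   # renormalise the survivors

toy_probs = [0.5, 0.3, 0.15, 0.04, 0.01]        # made-up next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)  # tokens 0 and 1 survive, with probabilities of about 0.625 and 0.375

rng = random.Random(42)
next_token = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

Lower values of either setting make the output more conservative; higher values allow more unusual word choices.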

In [75]:
if 'model' in globals():
  load_model = input("You have a model loaded already, would you like to load a different one? (Y/n)\n")

if 'model' not in globals() or load_model in ['y','Y']:
  model_path = input("Input the directory that contains your gpt-2 model:\n")

  # Reload a model and tokenizer that were saved with `save_pretrained()`.
  model = GPT2LMHeadModel.from_pretrained(model_path)
  tokenizer = GPT2Tokenizer.from_pretrained(model_path)
  # Good practice: save your training arguments together with the trained model
  # torch.save(args, os.path.join(output_dir, 'training_args.bin'))

  # Move the reloaded model to the GPU
  device = torch.device("cuda")
  model = model.to(device)

model.eval()

prompt = input("Generating Text. Enter starting tokens or use (d)efault:\n")
if prompt =='d':
  prompt = "<|startoftext|>"

n = input("How many texts should we generate?")
n = int(n)

generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
generated = generated.to(device)

print(generated)

sample_outputs = model.generate(
                              generated, 
                              #bos_token_id=random.randint(1,30000),
                              do_sample=True,   
                              top_k=50, 
                              max_length = 300,
                              top_p=0.95, 
                              num_return_sequences=n
                              )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

You have a model loaded already, would you like to load a different one? (Y/n)n
Generating Text. Enter starting tokens or use (d)efault:
d
How many texts should we generate?3


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


tensor([[50257]], device='cuda:0')
0: Newly Renovated, Stainless Steel Appliances, 2 Bedroom/1 Bathroom, Parking Included!


PROPERTY INFO
ID: 314471489
Rent: $2,998 / Month
Beds: 2
Bath: 1
Available Date: 10/31/2020
Pet: Cat Ok
Parking:: Available!
AVAILABLE NOW
2 Bedroom Apartment for $2,998. Includes a Washer/Dryer in unit
2nd Bedrooms
Living Room
Dining Room
Dining Room
Kitchen w/ SS Appliances w/ SS Appliances
Hardwood Floors
Pets Allowed
Carpet flooring
Heat and Hot Water Included in the Rent
Grocery Stores Near by
Great Price! Available 9/1!


1: 2 BED 1 BATH - GREAT LOCATION


This well located 2 bed/1 bath in prime Santa Monica location.
- Conveniently located
- 10 minute walk to beach
- Conveniently located just outside the door
- Great community for beach hopping! - Steps to everything!!
- No fee
- 1 parking space included
- Tenant pays for water
- Non-smoking included
- Cats OK (breed restrictions apply)
- No smoking
- Great water views!
- Walking distance to restaurants an