# Training GPT-J to replicate the Office Script
 
This script was originally built in Google Colab; the original notebook can be found [here](https://colab.research.google.com/drive/1o7GjDUTcOHWbWy9jjAz20h0B04Cc6QZ5). This version has been modified to work with the structure of the repository.

### Importing the Script
* From Kaggle: https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript

### Standardizing the way that information is presented to the Model
* We need a standard format that matches the outputs of the vision team. We should probably make a class for this, with methods that can send the prompt to the model.
### Importing the Model
* From HuggingFace (the transformers library), we will import GPT-J
* https://huggingface.co/EleutherAI/gpt-j-6B
### Tuning

## Attempt 2: Custom Transformer (probably will suck)

### Building the Model
### Pre-training
### Fine tuning

### General Imports

In [4]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount = True)
    runtime = 'colab'
except ModuleNotFoundError:
    runtime = 'local'

import os
import pandas as pd
import torch 
import nltk
import numpy as np

from transformers import TextDataset, GPT2LMHeadModel, Trainer
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# !pip install transformers
# !pip3 install datasets

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

data_path = "../Data/"
model_path = ""

### Importing the Data

* Collect it by scene

In [7]:
data_file = "drive/Shareddrives/Final Project Data/The-Office-Lines-V4.csv" if runtime == 'colab' else data_path + "The-Office-Lines-V4.csv"
# noun_file = "drive/Shareddrives/Final Project Data/nouns.csv" if runtime == 'colab' else data_path + "nouns.csv"
data = pd.read_csv(data_file)
data = data.drop("Unnamed: 6", axis=1)

# common_nouns = set(pd.read_csv(noun_file))


## Keys of the dictionary are scenes; values are lists of (speaker, line) tuples for that scene
data_dictionary = {}

for index, row in data.iterrows():
  if row['scene'] not in data_dictionary:
    data_dictionary[row['scene']] = []

  data_dictionary[row['scene']].append(  ( row['speaker'], row['line'] ) )

  
data_dictionary[1]


[('Michael',
  'All right Jim. Your quarterlies look very good. How are things at the library?'),
 ('Jim', "Oh, I told you. I couldn't close it. So..."),
 ('Michael',
  "So you've come to the master for guidance? Is this what you're saying, grasshopper?"),
 ('Jim', 'Actually, you called me in here, but yeah.'),
 ('Michael', "All right. Well, let me show you how it's done.")]
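The row-by-row grouping above can also be written with `collections.defaultdict`, which removes the explicit membership check. This is a minimal sketch: the `rows` list and the `group_by_scene` helper are hypothetical stand-ins for the DataFrame rows, using only the `scene`, `speaker`, and `line` columns.

```python
from collections import defaultdict

# Hypothetical rows standing in for the CSV (same three columns as above).
rows = [
    {"scene": 1, "speaker": "Michael", "line": "All right Jim."},
    {"scene": 1, "speaker": "Jim", "line": "Oh, I told you."},
    {"scene": 2, "speaker": "Michael", "line": "Hello."},
]

def group_by_scene(rows):
    """Map each scene number to its ordered list of (speaker, line) tuples."""
    scenes = defaultdict(list)
    for row in rows:
        scenes[row["scene"]].append((row["speaker"], row["line"]))
    return dict(scenes)

scene_lines = group_by_scene(rows)
print(scene_lines[1])
```

The lines within a scene keep the order in which they appear in the input, which is what the prompt format relies on.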

### Standardizing Presentation of the Text

Example of what we want to prompt the model with:

Prompt:

> Characters: Jim, Pam \\
> Objects in Scene: <What do we put here?> \\
> Lines:
>> Jim: Good morning, Pam \\
>> Pam:

Expected Text Completion:
> Characters: Jim, Pam \\
> Objects in Scene: <What do we put here?> \\
> Lines:
>> Jim: Good morning, Pam \\
>> Pam: **Good morning, Jim!**











### Version 1

In [None]:


class PromptGenerator:
  def __init__(self, data):
    assert type(data) == dict
    self.data       = data ### Dictionary where keys are scenes and values are lists of (speaker, line) tuples for that scene
    self.scenes     = set(self.data.keys())
    self.num_scenes = len(self.scenes)

  def to_text(self):
    text = ""
    for scene in self.scenes:
      text += self.get_prompt_for_scene(scene)  + " \n " # TODO: append new line to sentence

    return text 

  def get_prompt_for_scene(self, scene_number):
    assert scene_number in self.data

    scene_data                              = self.data[scene_number]
    _, characters_string, lines_in_scene    = self.get_characters_in_scene_and_lines(scene_number)
    # objects_in_scene                        = self.get_all_nouns_in_scene(lines_in_scene)


    prompt       = "Characters: " + characters_string  + " \n "   # TODO: append new line to sentence
    # prompt      += "Objects in Scene: " + objects_in_scene  + " \n " # TODO: Figure out how to generate what objects to put in scene  # TODO: append new line to sentence
    prompt      += "Lines: " + " \n ".join( lines_in_scene ) 

    return prompt


  def get_characters_in_scene_and_lines(self, scene_number):
    scene_data  = self.data[scene_number]

    characters_list = []
    lines_in_scene      = []
    for line_and_character in scene_data:
      characters_list.append(line_and_character[0])
      lines_in_scene.append(line_and_character[0] + ": " + line_and_character[1]) 

    characters_string = ", ".join(characters_list)

    return characters_list, characters_string, lines_in_scene

  def get_all_nouns(self):

    nouns = ""
    for scene_number in self.scenes: 
      scene_data                              = self.data[scene_number]
      _, characters_string, lines_in_scene    = self.get_characters_in_scene_and_lines(scene_number)
      objects_in_scene                        = self.get_all_nouns_in_scene(lines_in_scene)
      nouns                                   += " " + objects_in_scene
    
    return nouns


  def get_all_nouns_in_scene(self, text):
    if type(text) == list:
      text = " ".join(text)
    #Reference: https://stackoverflow.com/questions/33587667/extracting-all-nouns-from-a-text-file-using-nltk
    ## TODO: Think about this more and improve on later 
    temp = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(text)) if pos[0] == 'N' and (pos != 'NNP')]
    
    return " ".join( temp )

  


In [None]:
promptGenerator = PromptGenerator(data_dictionary)

In [None]:
promptGenerator.get_prompt_for_scene(1)

"Characters: Michael, Jim, Michael, Jim, Michael \n Lines: Michael: All right Jim. Your quarterlies look very good. How are things at the library? \n Jim: Oh, I told you. I couldn't close it. So... \n Michael: So you've come to the master for guidance? Is this what you're saying, grasshopper? \n Jim: Actually, you called me in here, but yeah. \n Michael: All right. Well, let me show you how it's done."

In [None]:
nouns = promptGenerator.get_all_nouns()
# pd.Series( nouns.split(" ") ).to_csv("drive/Shareddrives/Final Project Data/nouns.csv")
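In Version 1, the `Characters:` field repeats a speaker once per line (`Michael, Jim, Michael, Jim, Michael` for scene 1), because `characters_list` is appended to on every row. One possible fix, sketched here with a hypothetical `unique_in_order` helper, is to deduplicate while keeping first-appearance order.

```python
def unique_in_order(names):
    """Drop repeated names while keeping first-appearance order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set
    return list(dict.fromkeys(names))

speakers = ["Michael", "Jim", "Michael", "Jim", "Michael"]
print("Characters: " + ", ".join(unique_in_order(speakers)))
# prints: Characters: Michael, Jim
```

Unlike `set(...)`, this gives the same ordering on every run, so the same scene always produces the same prompt text.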

### Version 2

In [None]:
class Scene():
    def __init__(self, characters, lines, name = ""): # lines is a list of tuples of (speaker, line)
        self.characters = list(characters)
        self.lines = list(lines)

        self.text = "\n\n".join(self.lines)

        self.nouns = [word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(self.text)) if pos[0] == 'N' and (pos != 'NNP')]# and word in common_nouns]

        self.n_lines = len(lines)

        self.name = name

    def to_text(self, missing_lines = 0, return_missing = False):
        output = f"Characters: " + ", ".join(set(self.characters)) + "\n\n"

        output += "Nouns: " + ", ".join(self.nouns) + "\n\n" # Need to implement random sampling later

        output += "----TEXT----"

        if missing_lines < self.n_lines: output += "\n\n"

        output += "\n\n".join(
            [
                f"{character}: {line}"
                for character, line in zip(
                    self.characters[:self.n_lines -  missing_lines],
                    self.lines[:self.n_lines -  missing_lines]
                )
            ]
        )

        if missing_lines:
            output += f"\n\n{self.characters[max(0,self.n_lines-missing_lines)]}:"

        if return_missing:
            return output, self.to_text()
        else:
            return output
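The `missing_lines` argument hides the last lines of a scene and ends the prompt with the next speaker's name. Note that `return_missing=True` returns the full scene text rather than only the hidden part. The sketch below shows one alternative under stated assumptions: a hypothetical `split_scene` function that returns the prompt together with just the first held-out line, leaving out the `Characters:` and `Nouns:` header for brevity.

```python
def split_scene(speakers, lines, missing_lines=1):
    """Return (prompt, target): the prompt ends with the next speaker's
    name and a colon; the target is the first held-out line."""
    if not 0 < missing_lines <= len(lines):
        raise ValueError("missing_lines must be between 1 and len(lines)")
    keep = len(lines) - missing_lines
    shown = [f"{s}: {l}" for s, l in zip(speakers[:keep], lines[:keep])]
    prompt = "\n\n".join(shown + [f"{speakers[keep]}:"])
    return prompt, lines[keep]

prompt, target = split_scene(
    ["Jim", "Pam"],
    ["Good morning, Pam", "Good morning, Jim!"],
)
print(repr(prompt))
print(repr(target))
```

A pair like this could be used to compare a generated completion against the line that was actually spoken.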

In [None]:
breaks = [0] + [i + 1 for i, scene_num in enumerate(data["scene"][1:]) if scene_num != data["scene"][i]] + [len(data["scene"])]
n_scenes = len(breaks) - 1 # I added an extra "break" for the end of all the lines

scenes = [
    Scene(
        data["speaker"][breaks[i]:breaks[i+1]],
        data["line"][breaks[i]:breaks[i+1]],
        data["title"][breaks[i]]
    )
    for i in range(n_scenes)
]


prompts = [scene.to_text() for scene in scenes]
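The `breaks` list marks where the scene number changes between consecutive rows. For a non-empty column, the same indices can be computed with `itertools.groupby`, which groups runs of equal consecutive values. This is a sketch with a hypothetical `scene_breaks` helper and a toy list in place of `data["scene"]`.

```python
from itertools import groupby

def scene_breaks(scene_ids):
    """Return the start index of each run of equal consecutive ids,
    followed by the total length (the closing "break")."""
    breaks, position = [], 0
    for _, run in groupby(scene_ids):
        breaks.append(position)
        position += sum(1 for _ in run)
    breaks.append(position)
    return breaks

print(scene_breaks([1, 1, 1, 2, 2, 3]))
# prints: [0, 3, 5, 6]
```

Each adjacent pair `breaks[i], breaks[i+1]` is then the slice of rows belonging to one scene.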



### Viewing the Training Data

In [None]:
for i in range(10, 12): print(f"_____________________________________________________________________\n\n", prompts[i], "\n\n")



_____________________________________________________________________

 Characters: Jan, Michael, Michel, Todd Packer

Nouns: Question, terrific, rep., queen, coming, today, question, carpet, drapes, horrifying, person, lid, people, regime, office

----TEXT----

Michael: Question. How long do we have to...  Oh uh, Todd Packer, terrific rep. Do you mind if I take it?

Jan: Go ahead.

Michel: Packman.

Todd Packer: Hey, you big queen.

Michael: Oh, that's not appropriate.

Todd Packer: Hey, is old Godzillary coming in today?

Michael: Uh, I don't know what you mean.

Todd Packer: I've been meaning to ask her one question. Does the carpet match the drapes?

Michael: Oh, my God! Oh! That's... horrifying. Horrible. Horrible person.

Jan: So do you think we could keep a lid on this for now? I don't want to worry people unnecessarily.

Michael: No, absolutely. Under this regime, it will not leave this office.  Like that. 


_____________________________________________________________________

### Turning this into a dataset


In [None]:
from torch.utils.data import DataLoader, Dataset
from transformers import GPT2TokenizerFast
from transformers import TextDataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

tokenizer.decode(tokenizer.eos_token_id)

'<|endoftext|>'

### Building the Model

In [None]:
text_model = GPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)


### Training the model

In [None]:
path_to_dataset = "drive/Shareddrives/Final Project Data/dataset.csv"
prompts_df =  pd.Series(prompts, name='Prompts')#.to_frame()
prompts_df.to_csv(path_to_dataset)


In [None]:
from datasets import load_dataset
prompts_dataset = load_dataset("csv", data_files= path_to_dataset)
prompts_dataset = prompts_dataset.remove_columns('Unnamed: 0')
prompts_dataset_sample = prompts_dataset["train"].shuffle(seed=42).select(range(6000))



In [None]:
def compute_prompt_length(example):
    # Word count (whitespace split), not token count
    return {"prompt_length": len(example["Prompts"].split())}

prompts_dataset_sample = prompts_dataset_sample.map(compute_prompt_length)

# Inspect the shortest and longest prompts in the sample
print( "Length of shortest prompt: ", prompts_dataset_sample.sort("prompt_length")[:1]['prompt_length'] )
print( "Length of longest prompt: ", prompts_dataset_sample.sort("prompt_length")[-1:]['prompt_length'] )

prompts_dataset_sample = prompts_dataset_sample.remove_columns('prompt_length')  # TODO: Filter based on length?

Length of shortest prompt:  [6]
Length of longest prompt:  [982]
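The prompts range from 6 to 982 words, and the cell above leaves length filtering as a TODO. One way to approach it is a predicate on word count, sketched below. The thresholds `min_words=20` and `max_words=400` are hypothetical placeholders, not values chosen in this project, and the toy `examples` list stands in for the dataset.

```python
def within_length(example, min_words=20, max_words=400):
    """Keep prompts whose whitespace-split word count is in range.
    The default thresholds are illustrative assumptions only."""
    n_words = len(example["Prompts"].split())
    return min_words <= n_words <= max_words

examples = [
    {"Prompts": "Characters: Jim"},              # too short
    {"Prompts": " ".join(["word"] * 50)},        # kept
    {"Prompts": " ".join(["word"] * 1000)},      # too long
]
kept = [e for e in examples if within_length(e)]
print(len(kept))
```

A predicate of this shape should also be usable with the `filter` method of a `datasets` dataset, although that has not been tried here.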


In [None]:
prompts_dataset_sample = prompts_dataset_sample.train_test_split(train_size=0.9, seed=42)
prompts_dataset_sample["validation"] = prompts_dataset_sample.pop("test")
prompts_dataset_sample

DatasetDict({
    train: Dataset({
        features: ['Prompts'],
        num_rows: 5400
    })
    validation: Dataset({
        features: ['Prompts'],
        num_rows: 600
    })
})

In [None]:
# https://www.youtube.com/watch?v=P0MTXaeUJ9s
def tokenize(element, max_length=128):  # TODO: Best value for max length ?????
    outputs = tokenizer(
        element["Prompts"],
        truncation=True,
        max_length=max_length,
        return_overflowing_tokens=False,
        return_length=True,
    )
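    input_batch = []
    # NOTE: only sequences that reach max_length after truncation are kept,
    # so every prompt shorter than max_length tokens is dropped from the dataset.
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == max_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}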

tokenized_datasets = prompts_dataset_sample.map(
    tokenize, batched=True, remove_columns=prompts_dataset_sample["train"].column_names
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 3015
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 344
    })
})
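The drop from 5400 to 3015 training rows comes from the `length == max_length` check: short prompts are discarded, and long prompts contribute only their first `max_length` tokens. An alternative is to cut each long sequence into several full blocks. The sketch below shows that idea on a plain list of ids with a hypothetical `chunk_ids` helper; it is not what the cell above does.

```python
def chunk_ids(input_ids, block_size):
    """Split a token-id list into consecutive full blocks of block_size,
    dropping the shorter tail."""
    n_full = len(input_ids) // block_size
    return [
        input_ids[i * block_size:(i + 1) * block_size]
        for i in range(n_full)
    ]

ids = list(range(10))
print(chunk_ids(ids, 4))
# prints: [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With chunking, a 982-word scene would yield several training rows instead of one, at the cost of later chunks starting mid-scene without the `Characters:` header.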

In [None]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator       = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="pt")


#### NOTE: Shifting the inputs and labels to align them happens inside the model, 
#### so the data collator just copies the inputs to create the labels. 
#### Below, notice that input and labels are the same for a sample:
out = data_collator([tokenized_datasets["train"][i] for i in range(1)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

for key in out:
    print(f"{key}: {out[key][0]}")

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids shape: torch.Size([1, 100])
attention_mask shape: torch.Size([1, 100])
labels shape: torch.Size([1, 100])
input_ids: tensor([48393,    25, 24497,    11,  5395,   198,   198,    45,   977,    82,
           25,  6891,    11, 25152,    11,  3404,    11,   835,    11,  3348,
        31945,    11,  4043,    11,  3275,   198,   198,   650, 32541,   650,
          198,   198,    47,   321,    25,   921,   766, 29902,   338,  6891,
        25152,    30,   198,   198, 18050,    25,   337,    76,    12,    71,
         3020,    13,   198,   198,    47,   321,    25,  8975,   618,   339,
          338,   407,   994,    11,   314,  1949,   284,  3714,  3404,   287,
          340,    13,   198,   198, 18050,    25,  1400,   835,    13,  3914,
          338,   466,   428,   220,  3966,    13,   198,   198,    47,   321,
           25,  3423,    13,   198,   198, 18050,    25,  3086,    13,   198])
attention_mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
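Because the collator copies `input_ids` into `labels`, the next-token alignment has to happen later, inside the model's loss computation. The sketch below illustrates that shift on a plain list, using the first five ids from the output above. The `next_token_pairs` helper is hypothetical and only shows which token is the prediction target at each position.

```python
def next_token_pairs(input_ids):
    """Pair each input token with the token the model is trained to
    predict at that position (the following token)."""
    labels = list(input_ids)  # the collator's labels are a copy of the inputs
    return list(zip(input_ids[:-1], labels[1:]))

ids = [48393, 25, 24497, 11, 5395]
print(next_token_pairs(ids))
# prints: [(48393, 25), (25, 24497), (24497, 11), (11, 5395)]
```

A sequence of n tokens therefore yields n - 1 prediction targets.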

In [None]:
device           = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pt_train_dataset = tokenized_datasets["train"].with_format("torch", device=device)
# pt_test_dataset = tokenized_datasets["validation"].with_format("torch", device=device)

In [None]:
output_dir = "drive/Shareddrives/Final Project Data/results"
overwrite_output_dir = False
per_device_train_batch_size = 32
num_train_epochs = 25.0
save_steps = 500

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # defined above; checkpoints are written every save_steps steps
    )

trainer = Trainer(
        model=text_model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=pt_train_dataset,
        # eval_dataset=pt_test_dataset
)

In [None]:
trainer.train()
trainer.save_model()

***** Running training *****
  Num examples = 3015
  Num Epochs = 25
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2375
  Number of trainable parameters = 124439808


Step,Training Loss
500,2.4374
1000,2.1359
1500,1.9613
2000,1.8444
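The logged training loss can be read as a perplexity by exponentiating it, assuming the reported value is the mean per-token cross-entropy in nats. The sketch below applies that to the four logged steps. These are training-set figures, so they say nothing yet about held-out scenes.

```python
import math

# Training loss values copied from the log above
losses = {500: 2.4374, 1000: 2.1359, 1500: 1.9613, 2000: 1.8444}

# Perplexity = exp(mean cross-entropy), under the assumption stated above
perplexities = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexities.items():
    print(step, round(ppl, 2))
```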


Saving model checkpoint to drive/Shareddrives/Final Project Data/results/checkpoint-500
Configuration saved in drive/Shareddrives/Final Project Data/results/checkpoint-500/config.json
Model weights saved in drive/Shareddrives/Final Project Data/results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to drive/Shareddrives/Final Project Data/results/checkpoint-1000
Configuration saved in drive/Shareddrives/Final Project Data/results/checkpoint-1000/config.json
Model weights saved in drive/Shareddrives/Final Project Data/results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to drive/Shareddrives/Final Project Data/results/checkpoint-1500
Configuration saved in drive/Shareddrives/Final Project Data/results/checkpoint-1500/config.json
Model weights saved in drive/Shareddrives/Final Project Data/results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to drive/Shareddrives/Final Project Data/results/checkpoint-2000
Configuration saved in drive/Shareddrives/Final
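The "Total optimization steps" figure in the log follows from the dataset size, batch size, and epoch count. The sketch below reproduces it with a hypothetical `total_optimization_steps` helper, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (rounding the last partial batch up) times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * int(num_epochs)

print(total_optimization_steps(3015, 32, 25))
# prints: 2375
```

With 95 steps per epoch, a checkpoint every 500 steps corresponds to roughly one checkpoint every five epochs.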

### Generating text 

In [None]:
from transformers import GPT2LMHeadModel
output_dir = "drive/Shareddrives/Final Project Data/results"


fine_tuned_model = GPT2LMHeadModel.from_pretrained(output_dir)
# tokenizer        = GPT2TokenizerFast.from_pretrained(output_dir) ### TODO: Fix errors from this line

In [None]:
# example
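def generate_text(prompt, model, tokenizer, max_length=200, device=None):
  # Default to the device the model is already on, so inputs and weights match
  if device is None:
    device = model.device
  input_tokens = tokenizer.encode(prompt, return_tensors='pt').to(device)
  # NOTE: beam search without do_sample=True is deterministic, so temperature should have no effect here
  output       = model.generate(input_tokens, max_length=max_length, num_beams=5, no_repeat_ngram_size=2, early_stopping=True, temperature=0.8)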

  # output2      = model.generate( input_tokens, do_sample=True, max_length=max_length, pad_token_id=model.config.eos_token_id, top_k=50, top_p=0.95,) 
  # print(tokenizer.decode(output2[0], skip_special_tokens=True)) # TODO: Look into generate hyperparameters

  return tokenizer.decode(output[0], skip_special_tokens=True)
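The commented-out `generate` call switches to sampling with `top_k` and `top_p`, and the TODO asks what these hyperparameters do. The sketch below illustrates temperature scaling and top-k filtering on a plain list of logits. It is a conceptual illustration with a hypothetical `sample_top_k` helper, not the library's implementation, and it leaves out `top_p`.

```python
import math
import random

def sample_top_k(logits, k=50, temperature=0.8, rng=random):
    """Sample a token index from the k highest logits after dividing
    them by the temperature (lower temperature = sharper distribution)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

random.seed(0)
print(sample_top_k([2.0, 1.0, 0.1, -1.0], k=2))
```

With `k=1` this reduces to greedy decoding, since only the highest-scoring token is left to choose from.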

In [None]:
prompts[0][:201]

'Characters: Michael, Jim\n\nNouns: quarterlies, things, library, master, guidance, grasshopper, right\n\n----TEXT----\n\nMichael: All right Jim. Your quarterlies look very good. How are things at the library'

In [None]:
generate_text(prompts[0][:201], fine_tuned_model, tokenizer)

'Characters: Michael, Jim\n\nNouns: quarterlies, things, library, master, guidance, grasshopper, right\n\n----TEXT----\n\nMichael: All right Jim. Your quarterlies look very good. How are things at the library? I mean, the master has written a great white handbook on how to teach you this branch. I\'m very excited to be a part of that. Oh, and you should come to my master\'s lecture right now. It\'s called "Stopping the Scranton Insectivores." Oh my God. That\'s so exciting.  I can\'t believe this is happening right here in Philadelphia. You know, I was thinking back to Cici\'s old lawn mower that she had lying around. And I thought, "Oh, that\'s a good thing, because it\'ll make us all more docile around here." So, good to have you back. Thank you very much for coming. Bye Michael.\n'

In [None]:
prompts[0][:]

"Characters: Michael, Jim\n\nNouns: quarterlies, things, library, master, guidance, grasshopper, right\n\n----TEXT----\n\nMichael: All right Jim. Your quarterlies look very good. How are things at the library?\n\nJim: Oh, I told you. I couldn't close it. So...\n\nMichael: So you've come to the master for guidance? Is this what you're saying, grasshopper?\n\nJim: Actually, you called me in here, but yeah.\n\nMichael: All right. Well, let me show you how it's done."

In [None]:
print(
    generate_text(
        'Characters: Pam, Jim, Michael\n\nNouns: flowers, car, cat\n\n----TEXT----\n\nJim:',
        fine_tuned_model.to('cuda:0'), tokenizer
    )
)

Characters: Pam, Jim, Michael

Nouns: flowers, car, cat

----TEXT----

Jim: Oh, I'm so sorry. I thought you were going out for a flower petting zoo with me. But then I realized that you're also going to be spending some time with a cat. So I decided to give him a little overreaction, because I think he'd be more excited to have you around.  You know what? I'll just have to call Pam and tell her that we're having a karaoke... and she won't be able to resist it. It's gonna be weird. You're welcome Michael. Bye Pam. Thank you so much for talking to me! Bye Michael!  Oh my God, you guys are so cute.
: Michael, what are you doing? You look so... excited. How did you know that?  I mean, it's just a car and I don't want to drive
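The generation above keeps going past Jim's reply until `max_length` is reached, including a line with no speaker name. If only the next line of dialogue is wanted, the output can be trimmed after generation. This is a minimal sketch with a hypothetical `first_new_line` helper that returns the first non-empty line the model added after the prompt.

```python
def first_new_line(generated, prompt):
    """Return the first non-empty line the model added after the prompt."""
    if generated.startswith(prompt):
        continuation = generated[len(prompt):]
    else:
        continuation = generated
    for line in continuation.split("\n"):
        if line.strip():
            return line.strip()
    return ""

prompt = "----TEXT----\n\nJim:"
generated = prompt + " Oh, I'm so sorry.\n: Michael, what are you doing?"
print(first_new_line(generated, prompt))
# prints: Oh, I'm so sorry.
```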


### References

* https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript
* https://gist.github.com/mf1024/3df214d2f17f3dcc56450ddf0d5a4cd7
* https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
* https://huggingface.co/docs/transformers/perf_train_gpu_one
* https://colab.research.google.com/drive/1pkrFeHJPIbQO1ws4mKvKHiGnBiM_qGc5?usp=sharing#scrollTo=GoHwg1y2ZMdJ