In [1]:
import pandas as pd
from transformers import GPT2Tokenizer

In [2]:
df = pd.read_csv('data.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,amazon,twitter
0,0,Despite the fact that I have only played a sma...,I am coming to the borders and I will kill you...
1,1,I bought this charger in Jul 2003 and it worke...,im getting on borderlands and i will kill you ...
2,2,Check out Maha Energy's website. Their Powerex...,im coming on borderlands and i will murder you...
3,3,Reviewed quite a bit of the combo players and ...,im getting on borderlands 2 and i will murder ...
4,4,I also began having the incorrect disc problem...,im getting into borderlands and i can murder y...


In [4]:
len(df)

10000

In [5]:
reviews = "[REVIEW]" + df["amazon"].astype(str)
tweets = "[TWEET]" + df["twitter"].astype(str)

In [6]:
reviews

Unnamed: 0,amazon
0,[REVIEW]Despite the fact that I have only play...
1,[REVIEW]I bought this charger in Jul 2003 and ...
2,[REVIEW]Check out Maha Energy's website. Their...
3,[REVIEW]Reviewed quite a bit of the combo play...
4,[REVIEW]I also began having the incorrect disc...
...,...
9995,"[REVIEW]The device works well enough, but I am..."
9996,"[REVIEW]My daughter loves this, she is six mon..."
9997,[REVIEW]I purchased this item for my 6 month o...
9998,[REVIEW]My daughter started enjoying this arou...


In [7]:
corpus = pd.concat([reviews, tweets]).tolist()

In [11]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Register the two genre markers and a dedicated padding token (GPT-2 ships without one).
tokenizer.add_special_tokens({
    "additional_special_tokens": ["[REVIEW]", "[TWEET]"],
    "pad_token": "[PAD]",
})

inputs = tokenizer(
    corpus,
    truncation=True,
    max_length=256,
    padding="max_length",
    return_tensors="pt"
)

In [12]:
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel
import torch

In [13]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50260, 768)

In [14]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    save_strategy="epoch"
)

train_dataset = [
    {"input_ids": inputs["input_ids"][i],
     "attention_mask": inputs["attention_mask"][i],
     "labels": inputs["input_ids"][i].clone()}
    for i in range(len(inputs["input_ids"]))
]

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

In [15]:
trainer.train()




`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
500,2.0457
1000,0.898


TrainOutput(global_step=1250, training_loss=1.3586444091796874, metrics={'train_runtime': 3044.3571, 'train_samples_per_second': 13.139, 'train_steps_per_second': 0.411, 'total_flos': 5225840640000000.0, 'train_loss': 1.3586444091796874, 'epoch': 2.0})
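
The `global_step=1250` reported above can be checked by hand. This is a minimal sketch, assuming a single training device and a corpus of 10,000 reviews plus 10,000 tweets (from `len(df)` earlier); the variable names are illustrative and not part of the notebook.

```python
# Values copied from the cells above; a single device is assumed.
num_examples = 2 * 10_000            # reviews + tweets in `corpus`
per_device_batch_size = 4
gradient_accumulation_steps = 8
num_epochs = 2

# One optimizer step consumes batch_size * accumulation_steps examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# The division is exact here, so no rounding rule comes into play.
steps_per_epoch = num_examples // effective_batch_size
total_steps = steps_per_epoch * num_epochs

print(effective_batch_size, steps_per_epoch, total_steps)  # 32 625 1250
```

Because the logging table only shows steps 500 and 1000, the final 250 steps contribute to the averaged `training_loss` but never appear as their own row.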

In [16]:
model.save_pretrained("genre_aware_generator")
tokenizer.save_pretrained("genre_aware_generator")

('genre_aware_generator/tokenizer_config.json',
 'genre_aware_generator/special_tokens_map.json',
 'genre_aware_generator/vocab.json',
 'genre_aware_generator/merges.txt',
 'genre_aware_generator/added_tokens.json')

In [17]:
model = GPT2LMHeadModel.from_pretrained("genre_aware_generator")
tokenizer = GPT2Tokenizer.from_pretrained("genre_aware_generator")

In [18]:
def generate_text(genre_token, max_length=100):
    # Tokenize the genre prefix; this returns both input_ids and attention_mask.
    inputs = tokenizer(genre_token, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        # Pass the mask explicitly so generate() does not have to infer it.
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        # Pad with the dedicated [PAD] token added before training, not EOS.
        pad_token_id=tokenizer.pad_token_id
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=False)

In [19]:
print(generate_text("[REVIEW]"))


[REVIEW] I bought this book because I was looking for a book that would help me with my writing. I was disappointed. I was looking for a book that would help me with my writing. I was disappointed. I was looking for a book that would help me with my writing. I was disappointed. I was looking for a book that would help me with my writing. I was disappointed. I was looking for a book that would help me with my writing. I was disappointed. I was looking for
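
The sample above loops on the same two sentences, which is a common failure mode of greedy decoding (what `generate` does when no sampling options are given). One possible alternative is sketched below. `generate_text_sampled` is a hypothetical helper that is not part of the notebook, and the values chosen for `top_p`, `temperature`, and `no_repeat_ngram_size` are untuned assumptions.

```python
def generate_text_sampled(model, tokenizer, genre_token, max_new_tokens=100):
    """Generate one continuation with nucleus sampling instead of greedy decoding."""
    # The tokenizer output holds input_ids and attention_mask; both are forwarded.
    inputs = tokenizer(genre_token, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,              # sample instead of always taking the top token
        top_p=0.92,                  # nucleus sampling threshold (assumed value)
        temperature=0.8,             # mild sharpening of the distribution (assumed value)
        no_repeat_ngram_size=3,      # forbid any 3-gram from appearing twice
        pad_token_id=tokenizer.pad_token_id,
    )
    # skip_special_tokens=True drops [PAD] and also the genre marker itself.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

It would be called as `generate_text_sampled(model, tokenizer, "[REVIEW]")`. Because sampling is random, the output differs from run to run unless a seed is fixed.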


In [20]:
print(generate_text("[TWEET]"))

[TWEET] I'm so excited to see this movie. I'm so excited to see it.[PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD][PAD]
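
The run of `[PAD]` tokens above suggests the model was trained to predict padding: the labels in the training cell are a plain copy of `input_ids`, so padded positions count toward the loss. A minimal sketch of masking them follows, relying on the Hugging Face convention that a label of `-100` is ignored by the loss. `mask_pad_labels` is a hypothetical helper, plain lists stand in for the tensors used in the notebook, and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in transformers

def mask_pad_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token_id if keep == 1 else ignore_index
        for token_id, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by two padding positions.
example_ids = [50257, 40, 1842, 50259, 50259]
example_mask = [1, 1, 1, 0, 0]
labels = mask_pad_labels(example_ids, example_mask)
print(labels)  # [50257, 40, 1842, -100, -100]
```

With tensors the same effect is usually obtained with a boolean mask, for example `labels[attention_mask == 0] = -100`. Once padding is masked, the model has no learned stop signal unless each training text ends with an end-of-text token, so appending `tokenizer.eos_token` to every string before tokenizing would be a natural companion change; that is an assumption about a possible fix, not something this notebook does.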
