# Training tAIylor's version models

We will fine-tune two Hugging Face models on Taylor Swift lyrics. A BERT model will classify which era (album or album group) a set of lyrics comes from, and a GPT-2 model will generate new lyrics.

### Data:

The data has two fields: label and text.

**label:** the integer code of the album or album group the lyrics come from. The eras are taylor_swift, speak_now, reputation, nineteen89, red, lover, folklore, evermore.
The data comes in several versions that group the albums differently: all eras version, Biber version, genre version.

**text:** either 4 or 8 lines of lyrics from a song on the album.
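
As a rough illustration of the **text** field, the sketch below cuts a song's lyric lines into consecutive groups of 4 (or 8) lines. The helper name `chunk_lyrics`, the placeholder lines, and the choice to drop a short trailing group are assumptions made for illustration; the preprocessing that produced the CSV is not shown in this notebook.

```python
def chunk_lyrics(lines, lines_per_chunk=4):
    """Group lyric lines into consecutive chunks of `lines_per_chunk` lines.

    Hypothetical helper: a trailing group with fewer lines is dropped.
    """
    # Ignore blank lines such as the gaps between verses
    cleaned = [line.strip() for line in lines if line.strip()]
    chunks = []
    for start in range(0, len(cleaned) - lines_per_chunk + 1, lines_per_chunk):
        chunks.append("\n".join(cleaned[start:start + lines_per_chunk]))
    return chunks


# Placeholder lines stand in for real lyrics
song = [f"line {i}" for i in range(1, 10)]
chunks = chunk_lyrics(song, lines_per_chunk=4)
print(len(chunks))
print(chunks[0])
```

Each returned string would become one `text` value, paired with the integer `label` of the album it came from.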

In [None]:
%%capture
!pip install pandas
!pip install scikit-learn
# google.colab is preinstalled on Colab and does not need a pip install
import pandas as pd
from sklearn.model_selection import train_test_split
from google.colab import drive

In [None]:
drive.mount("/content/drive")

# Load in the data
data = pd.read_csv("/content/drive/My Drive/Grad/Text Analysis/fine_tune_data_genre_l4.csv")

# Separate the features (X) from the labels (y)
X = data["text"]
y = data["label"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

# Create pandas DataFrames for train and test
train_data = pd.DataFrame({"text": X_train, "label": y_train})
test_data = pd.DataFrame({"text": X_test, "label": y_test})

# Display the shapes of the training and testing sets
print("Training set shape: ", train_data.shape)
print("Testing set shape: ", test_data.shape)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Training set shape:  (1719, 2)
Testing set shape:  (430, 2)
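
The shapes above can be checked by hand. As far as we understand scikit-learn's behavior, a fractional `test_size` is rounded up to get the test set size and the training set receives the remainder. A minimal sketch of that arithmetic, assuming the CSV holds 1719 + 430 = 2149 rows (`split_sizes` is an illustrative helper, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size."""
    # The test set is rounded up; the training set gets what is left
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(2149, test_size=0.2)
print(n_train, n_test)  # 1719 430
```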


### BERT training for classification

Model used: bert-base-cased

https://huggingface.co/bert-base-cased

In [None]:
%%capture
!pip install torch
!pip install transformers
!pip install evaluate
!pip install transformers[torch] -U
!pip install accelerate -U
import torch
import torch.nn as nn
import numpy as np
import evaluate
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset, DatasetDict, load_dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
# Use GPU instead of CPU

# Check if GPU is available
if torch.cuda.is_available():
    # Get the name of the GPU
    gpu_name = torch.cuda.get_device_name(0)
    print(f"GPU: {gpu_name}")
else:
    print("GPU is not available. Switch to a GPU runtime.")

GPU: Tesla T4


In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Create a function to preprocess data
# Tokenize without padding; the data collator below pads each batch dynamically
def preprocess(data):
    return tokenizer(data["text"], truncation=True)

dataset_dict = {
    "train": Dataset.from_pandas(train_data[['label', 'text']]),
    "test": Dataset.from_pandas(test_data[['label', 'text']])
}

# Apply the preprocess function to both train and test splits
for split in dataset_dict.keys():
    dataset_dict[split] = dataset_dict[split].map(preprocess, batched=True)

# Dynamically pad lyrics text to the length of the longest element in its batch
data_collator = DataCollatorWithPadding(
    tokenizer,
    padding=True
)

dataset = DatasetDict(dataset_dict)

Map:   0%|          | 0/1719 [00:00<?, ? examples/s]

Map:   0%|          | 0/430 [00:00<?, ? examples/s]
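
To make the dynamic padding step concrete, here is a small pure-Python sketch of the idea: each batch is padded only to the length of its own longest sequence. The function `pad_batch` and the toy token IDs are illustrative assumptions, not the internals of `DataCollatorWithPadding`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token IDs (not looked up in the real BERT vocabulary)
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Since lyric snippets of 4 or 8 lines are short, padding per batch keeps the tensors much smaller than padding everything to the model's maximum length.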

In [None]:
# Set up the evaluation process
metrics = {
    "accuracy": accuracy_score,
    "precision": precision_score,
    "recall": recall_score,
    "f1": f1_score
}

def compute_metrics(p):
    predictions, labels = p.predictions, p.label_ids
    predictions = np.argmax(predictions, axis=1)

    metrics_dict = {}
    for metric_name, metric_func in metrics.items():
        if metric_name == "accuracy":
            metrics_dict[metric_name] = metric_func(labels, predictions)
        else:
            metrics_dict[metric_name] = metric_func(labels, predictions, average='weighted')

    return metrics_dict

# Load the pretrained model
# all_eras: 9 labels
# genre: 3 labels
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

class CustomModel(nn.Module):
    def __init__(self, model):
        super(CustomModel, self).__init__()
        self.model = model

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids, token_type_ids=token_type_ids,
                             attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        return loss, logits

# Instantiate custom model
custom_model = CustomModel(model)

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
custom_model.to(device)

# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=8,
    weight_decay=0.01,
    fp16=torch.cuda.is_available()  # mixed precision requires a GPU
)

trainer = Trainer(
    model=custom_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
# Finally, call the trainer
%time trainer.train()

Step,Training Loss
500,0.8853
1000,0.449
1500,0.1198
2000,0.0299
2500,0.0033
3000,0.0044


CPU times: user 10min 45s, sys: 10.5 s, total: 10min 56s
Wall time: 11min 47s


TrainOutput(global_step=3440, training_loss=0.21694329294354417, metrics={'train_runtime': 707.1566, 'train_samples_per_second': 19.447, 'train_steps_per_second': 4.865, 'total_flos': 0.0, 'train_loss': 0.21694329294354417, 'epoch': 8.0})
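
The `global_step=3440` value can be reproduced from the training set size. A minimal sketch, assuming a single device and no gradient accumulation (`total_training_steps` is an illustrative helper):

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_training_steps(n_examples=1719, batch_size=4, epochs=8)
print(steps)  # 3440
```

With loss logged every 500 steps, this also explains why the table stops at step 3000.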

In [None]:
# Save the fine-tuned model
drive_path = "/content/drive/My Drive/Grad/Text Analysis/"

model.save_pretrained(drive_path + "tAIylors-version-model-genre-v2")
tokenizer.save_pretrained(drive_path + "tAIylors-version-tokenizer-genre-v2")

('/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-genre-v2/tokenizer_config.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-genre-v2/special_tokens_map.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-genre-v2/vocab.txt',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-genre-v2/added_tokens.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-genre-v2/tokenizer.json')

## Evaluate performance on the test set

In [None]:
# Evaluate on the test set
results = trainer.evaluate(dataset["test"])

# Print the evaluation results
print(results)

{'eval_loss': 1.0532838106155396, 'eval_accuracy': 0.8558139534883721, 'eval_precision': 0.8543025313612191, 'eval_recall': 0.8558139534883721, 'eval_f1': 0.8529464130606982, 'eval_runtime': 5.7891, 'eval_samples_per_second': 74.277, 'eval_steps_per_second': 18.656, 'epoch': 8.0}
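
Note that `eval_recall` equals `eval_accuracy` above. That is expected when recall is averaged with `average='weighted'`: weighting each class's recall by its share of the true labels recovers overall accuracy. The sketch below checks this on made-up labels; `weighted_recall` and `accuracy` are illustrative helpers, not the scikit-learn implementations.

```python
from collections import Counter


def weighted_recall(labels, predictions):
    """Per-class recall, averaged with weights equal to class support."""
    support = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    total = len(labels)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)


def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Made-up labels for three classes
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
predictions = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]
print(weighted_recall(labels, predictions), accuracy(labels, predictions))
```

Weighted precision and F1 do not reduce to accuracy in the same way, which is why they differ slightly in the results.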


In [None]:

# Define a test example
# Lyrics from Fearless (Fearless)
# all_eras: 1
# genre: 0
test_example = '''There's somethin' bout the way
The street looks when it's just rained
There's a glow off the pavement
You walk me to the car'''

# Tokenize the test example
tokens = tokenizer(test_example, truncation=True, padding="max_length", return_tensors="pt")

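# Move the inputs to the same device as the model
tokens = {key: value.to(model.device) for key, value in tokens.items()}

# Make a forward pass with the model in evaluation mode (dropout disabled)
model.eval()
with torch.no_grad():
    outputs = model(**tokens)

# Get the predicted label (index of the largest logit)
predicted_label = torch.argmax(outputs.logits, dim=-1).item()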

# Print the result
print(f"Predicted Label: {predicted_label}")
print(outputs.logits)

Predicted Label: 0
tensor([[ 6.5254, -3.7612, -2.8027]], device='cuda:0')
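
The logits are unnormalized scores. To read them as class probabilities, a softmax can be applied. A minimal pure-Python sketch using the three values printed above (the `softmax` helper is written out here for illustration; in the notebook itself `torch.softmax` would do the same job):

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [6.5254, -3.7612, -2.8027]  # values printed above
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

The gap of roughly 9 to 10 between the top logit and the others means nearly all of the probability mass falls on label 0.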


### GPT-2 training for generation

Model used: gpt2

https://huggingface.co/gpt2

In [None]:
%%capture
import os
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import pipeline, set_seed
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import TextGenerationPipeline

In [None]:
# Create the Dataset and the DataLoader
drive.mount("/content/drive")

# Load in the data
data = pd.read_csv("/content/drive/My Drive/Grad/Text Analysis/fine_tune_data_all_eras.csv")
data.drop(columns=['label'], inplace=True)


class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, block_size=128):
        self.data = dataframe['text'].tolist()
        self.tokenizer = tokenizer
        self.block_size = block_size

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data[idx]
        tokens = self.tokenizer(
            text,
            return_tensors='pt',
            truncation=True,
            max_length=self.block_size,
            padding='max_length',
        )
        return {
            'input_ids': tokens['input_ids'].squeeze(),
            'attention_mask': tokens['attention_mask'].squeeze()
        }

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
dataset = CustomDataset(data, tokenizer)

# Set DataLoader (not used by the Trainer below, which builds its own)
train_dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=None)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load the pretrained model
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Move the model to the GPU if available
model.to("cuda" if torch.cuda.is_available() else "cpu")

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Train with the default loss, using the Trainer and TrainingArguments classes:

In [None]:
# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=8,
    per_device_train_batch_size=4,
    save_steps=10_000,
)

# Define a data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Train the model using the Trainer class
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)
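
With `mlm=False`, the collator builds causal language-modeling labels by copying `input_ids` and, as we understand it, replacing padding positions with `-100` so the loss ignores them. Because the pad token in this notebook is the EOS token, any EOS position is masked as well. A pure-Python sketch of that idea, with `make_clm_labels` as a hypothetical helper and toy token IDs:

```python
PAD_ID = 50256        # GPT-2's EOS token id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # positions with this label are skipped by the loss


def make_clm_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids as labels, masking out padding positions."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in input_ids]


# Toy sequence: three real tokens followed by two padding tokens
input_ids = [318, 257, 1332, PAD_ID, PAD_ID]
labels = make_clm_labels(input_ids)
print(labels)
```

The shift between inputs and targets (predicting token t+1 from tokens up to t) happens inside the model, so the labels can be a plain copy of the inputs.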

In [None]:
# Finally, call the trainer
%time trainer.train()

Step,Training Loss
500,3.1616
1000,2.4324
1500,2.0166
2000,1.7349


CPU times: user 6min 29s, sys: 1.79 s, total: 6min 31s
Wall time: 6min 38s


TrainOutput(global_step=2224, training_loss=2.26788083083338, metrics={'train_runtime': 398.1044, 'train_samples_per_second': 22.346, 'train_steps_per_second': 5.586, 'total_flos': 581113479168000.0, 'train_loss': 2.26788083083338, 'epoch': 8.0})
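
A cross-entropy loss can be easier to interpret as perplexity, the exponential of the loss. The sketch below converts the reported average training loss. Keep in mind that this value is averaged over the whole run on the training data, so it is only a rough indicator and not a held-out perplexity.

```python
import math

train_loss = 2.26788083083338  # average training loss reported above
perplexity = math.exp(train_loss)
print(f"Approximate training perplexity: {perplexity:.2f}")
```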

In [None]:
# Save the fine-tuned model
drive_path = "/content/drive/My Drive/Grad/Text Analysis/"

model.save_pretrained(drive_path + "tAIylors-version-model-generate-v2")
tokenizer.save_pretrained(drive_path + "tAIylors-version-tokenizer-generate-v2")

('/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2/tokenizer_config.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2/special_tokens_map.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2/vocab.json',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2/merges.txt',
 '/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2/added_tokens.json')

## Try some generations

In [None]:
# Play around with text generation

# Path to the directory where you saved the fine-tuned GPT-2 model and tokenizer
model_dir = "/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-model-generate-v2"
tokenizer_dir = "/content/drive/My Drive/Grad/Text Analysis/tAIylors-version-tokenizer-generate-v2"

# Load the fine-tuned GPT2 language modeling head model
fine_tuned_model = GPT2LMHeadModel.from_pretrained(model_dir)

# Load the GPT2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_dir)

# Move the model to the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
fine_tuned_model.to(device)

# Create a text generation pipeline using the fine-tuned GPT2 model and tokenizer
generator = TextGenerationPipeline(model=fine_tuned_model, tokenizer=tokenizer, device=device)

Top-k sampling works better than adjusting the temperature. When the temperature was adjusted, the model would often get "stuck" and repeat the same thing over and over.
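
For intuition, top-k sampling keeps only the k highest-scoring next tokens, renormalizes their probabilities, and samples from that reduced set. A minimal pure-Python sketch of one sampling step, with toy logits and a hypothetical `top_k_sample` helper (this is not the Transformers implementation):

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token index from the k highest-scoring logits."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to the kept logits
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(13)
toy_logits = [2.0, 1.5, 0.3, -1.0, -4.0]
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)]
print(sorted(set(samples)))
```

With `k=2`, only the two most likely tokens can ever be drawn, which cuts off the long tail of unlikely tokens while still leaving some randomness.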

In [None]:
# Set the seed for reproducibility
set_seed(13)

# Text prompt for generation
text_prompt = '''Create a song about staying up until midnight for a man who will never show up'''

# Generate text using the fine-tuned model with top-k sampling
generated_text = generator(text_prompt, max_length=200, num_return_sequences=1,
                           do_sample=True, top_k=50)

print(generated_text[0]["generated_text"])

Create a song about staying up until midnight for a man who will never show up And bring upon myself the dignity of my office and my mattress This was a national conversation, one that should last forever And it was a stately love affair That should be celebrated And given the gravity of my crime They might find a movie that would play The beat of your heart, oh-oh Whoa, whoa, it's you, it's me, it's me I'm the only one of me, honey (Oh) and you're the only one of me, sweet (Mmh, I miss you) That should be celebrated (But given the gravity of my crime) And given the gravity of my crime You're the only one of you (Ooh) and you're the only one of you, yeah, yeah Girl, are we out of line? Are we in the clear yet? Are we out of the woods yet? Are we in the clear yet? Are we in the
