In [1]:
#!pip install evaluate

In [2]:
#!pip install nltk

In [3]:
#!pip install rouge_score

This notebook follows the Hugging Face course tutorial on summarization to fine-tune a pre-trained model on a dataset of poems, so that it can generate a title for a given poem.

Tutorial: https://huggingface.co/course/chapter7/5?fw=pt

Step 1: Prepare the corpus for fine-tuning

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv('PoetryFoundationData.csv')

In [6]:
df

Unnamed: 0.1,Unnamed: 0,Title,Poem,Poet,Tags
0,0,\r\r\n Objects Used to Prop...,"\r\r\nDog bone, stapler,\r\r\ncribbage board, ...",Michelle Menting,
1,1,\r\r\n The New Church\r\r\n...,"\r\r\nThe old cupola glinted above the clouds,...",Lucia Cherciu,
2,2,\r\r\n Look for Me\r\r\n ...,\r\r\nLook for me under the hood\r\r\nof that ...,Ted Kooser,
3,3,\r\r\n Wild Life\r\r\n ...,"\r\r\nBehind the silo, the Mother Rabbit\r\r\n...",Grace Cavalieri,
4,4,\r\r\n Umbrella\r\r\n ...,\r\r\nWhen I push your button\r\r\nyou fly off...,Connie Wanek,
...,...,...,...,...,...
13849,13,\r\r\n 1-800-FEAR\r\r\n ...,\r\r\nWe'd like to talk with you about ...,Jody Gladding,"Living,Social Commentaries,Popular Culture"
13850,14,\r\r\n The Death of Atahual...,\r\r\n\r\r\n,William Jay Smith,
13851,15,\r\r\n Poet's Wish\r\r\n ...,\r\r\n\r\r\n,William Jay Smith,
13852,0,\r\r\n 0\r\r\n,\r\r\n Philosophic\r\r\nin its comple...,Hailey Leithauser,"Arts & Sciences,Philosophy"


In [7]:
df.iloc[0]['Title']

'\r\r\n                    Objects Used to Prop Open a Window\r\r\n                '

In [8]:
df = df[['Title', 'Poem']].copy()  # .copy() avoids pandas' SettingWithCopyWarning on the assignments below

In [9]:
# Replace the '\r\r\n' line breaks with spaces and trim leading/trailing whitespace
df['Title'] = df['Title'].apply(lambda x: x.replace('\r\r\n', ' ').strip())
df['Poem'] = df['Poem'].apply(lambda x: x.replace('\r\r\n', ' ').strip())


In [10]:
df

Unnamed: 0,Title,Poem
0,Objects Used to Prop Open a Window,"Dog bone, stapler, cribbage board, garlic pres..."
1,The New Church,"The old cupola glinted above the clouds, shone..."
2,Look for Me,Look for me under the hood of that old Chevrol...
3,Wild Life,"Behind the silo, the Mother Rabbit hunches lik..."
4,Umbrella,When I push your button you fly off the handle...
...,...,...
13849,1-800-FEAR,We'd like to talk with you about fear t...
13850,The Death of Atahuallpa,
13851,Poet's Wish,
13852,0,"Philosophic in its complex, ovoid emptiness, a..."
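The cleanup above only targets the exact '\r\r\n' sequence. As a minimal sketch of a more general alternative (not what this notebook ran), the vectorized `.str` methods can collapse any run of whitespace into a single space. The rows and the helper name `normalize_whitespace` below are hypothetical and only imitate the padding seen in the CSV.

```python
import pandas as pd

# Hypothetical raw values imitating the '\r\r\n' padding seen in the CSV
raw = pd.DataFrame({
    "Title": ["\r\r\n                    A Title\r\r\n                "],
    "Poem": ["\r\r\nfirst line\r\r\nsecond line"],
})

def normalize_whitespace(column):
    # Collapse every run of whitespace (spaces, \r, \n, tabs) into one space, then trim
    return column.str.replace(r"\s+", " ", regex=True).str.strip()

clean = raw.assign(
    Title=normalize_whitespace(raw["Title"]),
    Poem=normalize_whitespace(raw["Poem"]),
)
print(repr(clean.iloc[0]["Title"]))
print(repr(clean.iloc[0]["Poem"]))
```

Because `assign` returns a new DataFrame, this form also sidesteps the chained-assignment warning.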


In [11]:
# Character lengths of each poem and title, used for the filtering below
df = df.assign(Poem_len=df['Poem'].str.len(), Title_len=df['Title'].str.len())


In [12]:
# Remove empty poems and titles, and drop poems of 10,000+ characters and titles of 100+ characters
df = df[df['Poem_len'] > 0]
df = df[df['Title_len'] > 0]
df = df[df['Poem_len'] < 10000]
df = df[df['Title_len'] < 100]

In [13]:
df

Unnamed: 0,Title,Poem,Poem_len,Title_len
0,Objects Used to Prop Open a Window,"Dog bone, stapler, cribbage board, garlic pres...",575,34
1,The New Church,"The old cupola glinted above the clouds, shone...",657,14
2,Look for Me,Look for me under the hood of that old Chevrol...,389,11
3,Wild Life,"Behind the silo, the Mother Rabbit hunches lik...",911,9
4,Umbrella,When I push your button you fly off the handle...,629,8
...,...,...,...,...
13835,!,"Dear Writers, I’m compiling the first in what ...",211,1
13848,1 January 1965,The Wise Men will unlearn your name. Above you...,785,14
13849,1-800-FEAR,We'd like to talk with you about fear t...,661,10
13852,0,"Philosophic in its complex, ovoid emptiness, a...",472,1
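The four filters above can also be written as a single boolean mask, which keeps the bounds in one place. This is only a sketch on a toy DataFrame with hypothetical rows; the thresholds are the ones used in the cell above.

```python
import pandas as pd

# Toy stand-in for the poem DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "Title": ["Umbrella", "", "Vow"],
    "Poem": ["When I push your button", "Some text", ""],
})
toy = toy.assign(Poem_len=toy["Poem"].str.len(), Title_len=toy["Title"].str.len())

# Same bounds as the four separate filters, expressed as one mask
mask = (
    (toy["Poem_len"] > 0) & (toy["Poem_len"] < 10000)
    & (toy["Title_len"] > 0) & (toy["Title_len"] < 100)
)
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```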


We're going to start with a dataset of just 1000 poem/title pairs for testing purposes. 

In [14]:
df = df.sample(1000)
df = df.reset_index(drop=True)

In [15]:
df

Unnamed: 0,Title,Poem,Poem_len,Title_len
0,The Breeder’s Cup,I. TO THE FATESThey cannot keep the peaceor th...,1301,17
1,Canto I,"And then went down to the ship, Set keel to br...",3371,7
2,Vow,It will be windy for a while until it isn’t. T...,1140,3
3,Between Assassinations,Old court. Old chain net hanging in frayed lin...,1978,22
4,The Last Hour,Lean and sane in the last hour of a long fast ...,1339,13
...,...,...,...,...
995,Sketch of a Man on a Platform,Man of absolute physical equilibrium You stand...,1337,29
996,My First Best Friend,My first best friend is Awful Ann— she socked ...,492,20
997,Negative Space,1. I was born on a Tuesday in April. I didn't ...,9311,14
998,Cold Trail,"The feeling of time derives from heat, an agit...",491,10
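The sample above is drawn without a fixed seed, so rerunning the notebook picks a different set of 1000 pairs. A minimal sketch of a reproducible draw, assuming an arbitrary seed of 42 and a toy DataFrame in place of the poems:

```python
import pandas as pd

# Toy stand-in for the filtered poem DataFrame
toy = pd.DataFrame({"Title": [f"title {i}" for i in range(50)]})

SEED = 42  # arbitrary value chosen for illustration; any fixed integer works

first = toy.sample(10, random_state=SEED).reset_index(drop=True)
second = toy.sample(10, random_state=SEED).reset_index(drop=True)

# Both draws contain the same rows in the same order
print(first.equals(second))
```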


In [16]:
from datasets import Dataset

In [17]:
dataset = Dataset.from_pandas(df, split='validation')
dataset = dataset.train_test_split(test_size=0.2, shuffle=True)

In [18]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Title', 'Poem', 'Poem_len', 'Title_len'],
        num_rows: 800
    })
    test: Dataset({
        features: ['Title', 'Poem', 'Poem_len', 'Title_len'],
        num_rows: 200
    })
})

Now that we have our dataset, we choose a pre-trained model and preprocess our data. 
The model I'll use is facebook/bart-base.
See paper for explanation and analysis of why I chose this model. 

Note: After trying BartModel for a while, I switched to BartForConditionalGeneration. BartModel is the bare encoder-decoder without a language-modeling head, so on its own it cannot compute a loss from labels or generate text.

In [19]:
from transformers import BartTokenizer, BartForConditionalGeneration #, BartModel
from transformers import Seq2SeqTrainingArguments
from transformers import DataCollatorForSeq2Seq

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

In [20]:
# from huggingface_hub import notebook_login

# notebook_login()

In [21]:
# from transformers import AutoTokenizer
# 
# model_checkpoint = 'facebook/bart-base'
# tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [22]:
# Testing the tokenizer
inputs = tokenizer("This is a test to see if we can tokenize correctly.")
inputs

{'input_ids': [0, 713, 16, 10, 1296, 7, 192, 114, 52, 64, 19233, 2072, 12461, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [23]:
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['<s>',
 'This',
 'Ġis',
 'Ġa',
 'Ġtest',
 'Ġto',
 'Ġsee',
 'Ġif',
 'Ġwe',
 'Ġcan',
 'Ġtoken',
 'ize',
 'Ġcorrectly',
 '.',
 '</s>']

In [24]:
# Estimate the max token lengths from the longest poem and title (by character count)

# idxmax() returns an index label, so look the row up with .loc
max_poem = df.loc[df['Poem_len'].idxmax(), 'Poem']
max_title = df.loc[df['Title_len'].idxmax(), 'Title']

# Token counts (including <s> and </s>), capped at BART's 1024-token limit
max_poem_length = len(tokenizer(max_poem, max_length=1024, truncation=True).input_ids)
max_title_length = len(tokenizer(max_title, max_length=1024, truncation=True).input_ids)

In [25]:
print("max poem tokens length: " + str(max_poem_length))
print("max title tokens length: " + str(max_title_length))

max poem tokens length: 1024
max title tokens length: 26
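One caveat with the cell above: the longest text in characters is not necessarily the longest in tokens. The sketch below illustrates the difference by measuring every text directly. `count_tokens` is a hypothetical whitespace-based stand-in for `len(tokenizer(text).input_ids)`, used so the example runs without downloading the tokenizer.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer(text).input_ids): whitespace-separated words
    return len(text.split())

texts = [
    "supercalifragilisticexpialidocious antidisestablishmentarianism",  # many characters, 2 tokens
    "a b c d e f g h",                                                   # few characters, 8 tokens
]

longest_by_chars = max(texts, key=len)
longest_by_tokens = max(texts, key=count_tokens)
max_token_length = max(count_tokens(t) for t in texts)

print(longest_by_chars == longest_by_tokens)
print(max_token_length)
```

For the titles in this notebook the same idea would mean taking the maximum token count over the whole 'Title' column rather than tokenizing only the longest string.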


In [26]:
def preprocess_function(data):
    # Tokenize the poems (model inputs), truncating to the poem length found above
    model_inputs = tokenizer(data["Poem"], max_length=max_poem_length, truncation=True)

    # Tokenize the titles (targets) the same way
    labels = tokenizer(data["Title"], max_length=max_title_length, truncation=True)

    # The key must be "labels": that is the argument name BartForConditionalGeneration.forward() expects
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The above code gave me errors at first. The key added to model_inputs must be "labels"; otherwise the tokenized dataset cannot remove the right column and the values stay as strings.
However, when I passed in the pre-processed data, I was told that the 'label' column was missing in the BartConfig.forward() function.

In [27]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [28]:
# Set the training arguments for the Seq2SeqTrainer (a Trainer subclass for sequence-to-sequence models)

batch_size = 8
num_train_epochs = 8

# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = 'bart-base'

# arguments
args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-poems",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)
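As a quick check of the logging arithmetic, the sketch below works out how many optimizer steps one epoch takes with these settings. It assumes a single device and no gradient accumulation, and takes the 800 training examples from the split shown earlier.

```python
num_train_examples = 800   # size of the train split shown earlier
batch_size = 8
num_train_epochs = 8

# One optimizer step per batch (single device, no gradient accumulation assumed)
steps_per_epoch = num_train_examples // batch_size
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 100, so logging_steps logs the loss once per epoch
print(total_steps)      # 800
```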

We now create a metric to evaluate the model during training. For title generation, the metric we use is ROUGE.

In [29]:
# setup evaluation metric for training

import evaluate
import nltk
from nltk.tokenize import sent_tokenize # sentence tokenizer

metric = evaluate.load("rouge")
nltk.download("punkt") # sent_tokenize needs the punkt tokenizer models

[nltk_data] Downloading package punkt to /Users/raunit/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [30]:
# Quick test of the ROUGE metric

generated_title = "I absolutely loved reading the Hunger Games"
reference_title = "I loved reading the Hunger Games"

scores = metric.compute(predictions=[generated_title], references=[reference_title])

scores # by default, evaluate's rouge returns only the aggregated F-measure for each ROUGE type

{'rouge1': 0.923076923076923,
 'rouge2': 0.7272727272727272,
 'rougeL': 0.923076923076923,
 'rougeLsum': 0.923076923076923}

We interpret the above ROUGE scores like this:
- rouge1 is the ...
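To make the numbers above concrete, here is a simplified sketch of the ROUGE-N F-measure: count the n-grams shared by the generated and reference titles, then combine precision and recall. It uses plain lowercase whitespace tokens rather than the tokenizer inside rouge_score, so it is an illustration and not the library's implementation, but for this example it reproduces the rouge1 and rouge2 values.

```python
from collections import Counter

def ngrams(text, n):
    # Lowercased whitespace tokens; the real metric uses its own tokenizer
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(prediction, reference, n):
    pred, ref = ngrams(prediction, n), ngrams(reference, n)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

generated_title = "I absolutely loved reading the Hunger Games"
reference_title = "I loved reading the Hunger Games"

rouge1 = rouge_n_f1(generated_title, reference_title, 1)
rouge2 = rouge_n_f1(generated_title, reference_title, 2)
print(round(rouge1, 4), round(rouge2, 4))  # 0.9231 0.7273
```

For unigrams, 6 of the 7 generated words match all 6 reference words, giving precision 6/7, recall 1 and F-measure 12/13. For bigrams, 4 of 6 generated bigrams match 4 of 5 reference bigrams, giving 8/11.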

In [31]:
def one_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:1])

def evaluate_baseline(dataset, metric):
    summaries = [one_sentence_summary(text) for text in dataset["Poem"]]
    return metric.compute(predictions=summaries, references=dataset["Title"])

score = evaluate_baseline(dataset["train"], metric)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn] * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 7.88, 'rouge2': 3.79, 'rougeL': 7.54, 'rougeLsum': 7.56}

^ We interpret these as follows:
- Firstly, the rouge2 score is much lower... (here's why: ??)

In [32]:
# This function computes the ROUGE metrics for the predictions so they can be reported during training

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    
    # Decode generated titles into text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
    # Decode reference titles into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    
    # Compute ROUGE scores
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    
    # Convert the scores to percentages
    result = {key: value * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}
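The -100 replacement step deserves a small illustration. Label padding uses -100 so that the loss ignores those positions, but -100 is not a valid token id and cannot be decoded. The sketch below uses hypothetical label ids and assumes a padding id of 1, which is the value of `tokenizer.pad_token_id` for BART.

```python
import numpy as np

PAD_TOKEN_ID = 1  # assumed value of tokenizer.pad_token_id for BART

# Two label rows; the second is shorter and padded with -100 (hypothetical ids)
labels = np.array([
    [0, 44728, 1208, 7239, 2],
    [0, 771, 2, -100, -100],
])

# Keep real token ids, swap every -100 for the padding id
decodable = np.where(labels != -100, labels, PAD_TOKEN_ID)
print(decodable)
```

After this step `batch_decode(..., skip_special_tokens=True)` drops the padding along with the other special tokens.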

In [33]:
# This is the data collator to pad the inputs and outputs

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [34]:
# Testing the data collator

tokenized_datasets = tokenized_datasets.remove_columns(dataset["train"].column_names)
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'input_ids': tensor([[   0, 5625,   18,  ...,    1,    1,    1],
        [   0,  100,   91,  ..., 1840,    4,    2]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[    0, 44728,  1208,  7239,     2],
        [    0,   771,  3239, 12521,     2]]), 'decoder_input_ids': tensor([[    2,     0, 44728,  1208,  7239],
        [    2,     0,   771,  3239, 12521]])}
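The decoder_input_ids in the output above are the labels shifted one position to the right, with the decoder start token (id 2 for BART) placed in front. The sketch below is a plain-list re-implementation of that shift for illustration, not the library's tensor code; it reproduces the two rows shown.

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    # Prepend the start token and drop the last label; any -100 padding becomes pad_token_id
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == -100 else tok for tok in shifted]

labels = [
    [0, 44728, 1208, 7239, 2],
    [0, 771, 3239, 12521, 2],
]
decoder_input_ids = [
    shift_tokens_right(row, pad_token_id=1, decoder_start_token_id=2)
    for row in labels
]
print(decoder_input_ids)
# [[2, 0, 44728, 1208, 7239], [2, 0, 771, 3239, 12521]]
```

This is why the collator is given the model: at each position the decoder sees the previous target tokens and is trained to predict the next one.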

In [35]:
# observe the train_dataset

tokenized_datasets['train'].features['labels']

Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)

In [36]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co/rkbulk/bart-base-finetuned-poems into local empty directory.


In [37]:
trainer.train()

***** Running training *****
  Num examples = 800
  Num Epochs = 8
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 800
  Number of trainable parameters = 139420416


  0%|          | 0/800 [00:00<?, ?it/s]

***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'loss': 3.7974, 'learning_rate': 4.9e-05, 'epoch': 1.0}


  0%|          | 0/25 [00:00<?, ?it/s]

{'eval_loss': 3.1509082317352295, 'eval_rouge1': 17.5339, 'eval_rouge2': 8.1625, 'eval_rougeL': 17.2189, 'eval_rougeLsum': 17.2855, 'eval_runtime': 500.4174, 'eval_samples_per_second': 0.4, 'eval_steps_per_second': 0.05, 'epoch': 1.0}


***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'loss': 2.721, 'learning_rate': 4.2e-05, 'epoch': 2.0}


  0%|          | 0/25 [00:00<?, ?it/s]

{'eval_loss': 3.1969778537750244, 'eval_rouge1': 16.9107, 'eval_rouge2': 8.1464, 'eval_rougeL': 16.5554, 'eval_rougeLsum': 16.7396, 'eval_runtime': 487.5616, 'eval_samples_per_second': 0.41, 'eval_steps_per_second': 0.051, 'epoch': 2.0}


KeyboardInterrupt: 
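The learning rates logged above (4.9e-05 after epoch 1, 4.2e-05 after epoch 2) are consistent with a linear decay from 5.6e-5 to zero over the 800 optimization steps. The sketch below assumes that schedule with no warmup, which I believe is the Trainer default when the arguments do not override it; `linear_decay_lr` is a hypothetical helper, not a library function.

```python
def linear_decay_lr(step, initial_lr=5.6e-5, total_steps=800, warmup_steps=0):
    # Linear warmup (none here) followed by linear decay to zero at total_steps
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)

# Steps 100 and 200 correspond to the end of epochs 1 and 2
for step in (100, 200):
    print(step, f"{linear_decay_lr(step):.1e}")
```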

In [None]:
trainer.evaluate()

In [38]:
trainer.push_to_hub(commit_message="Training complete", tags="summarization")

Saving model checkpoint to bart-base-finetuned-poems
Configuration saved in bart-base-finetuned-poems/config.json
Model weights saved in bart-base-finetuned-poems/pytorch_model.bin
tokenizer config file saved in bart-base-finetuned-poems/tokenizer_config.json
Special tokens file saved in bart-base-finetuned-poems/special_tokens_map.json


Upload file pytorch_model.bin:   0%|          | 32.0k/532M [00:00<?, ?B/s]

Upload file training_args.bin: 100%|##########| 3.42k/3.42k [00:00<?, ?B/s]

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/rkbulk/bart-base-finetuned-poems
   a263817..4ddd8a8  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}}
To https://huggingface.co/rkbulk/bart-base-finetuned-poems
   4ddd8a8..4e0ffa7  main -> main



'https://huggingface.co/rkbulk/bart-base-finetuned-poems/commit/4ddd8a85248c8a581a5e60d8f4f1a3c85b91d916'

Now that we've fine-tuned our model, let's use it!

In [39]:
from transformers import pipeline

hub_model_id = "rkbulk/bart-base-finetuned-poems"
summarizer = pipeline("summarization", model=hub_model_id)

Downloading:   0%|          | 0.00/1.74k [00:00<?, ?B/s]

loading configuration file config.json from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/config.json
Model config BartConfig {
  "_name_or_path": "rkbulk/bart-base-finetuned-poems",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2lab

Downloading:   0%|          | 0.00/558M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/pytorch_model.bin
All model checkpoint weights were used when initializing BartForConditionalGeneration.

All the weights of BartForConditionalGeneration were initialized from the model checkpoint at rkbulk/bart-base-finetuned-poems.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BartForConditionalGeneration for predictions without further training.


Downloading:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/999k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/957 [00:00<?, ?B/s]

loading file vocab.json from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/vocab.json
loading file merges.txt from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/merges.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/special_tokens_map.json
loading file tokenizer_config.json from cache at /Users/raunit/.cache/huggingface/hub/models--rkbulk--bart-base-finetuned-poems/snapshots/4e0ffa7b15c7ccae4a4b101e76a8d047cff0253f/tokenizer_config.json


Testing the model here...

In [49]:
df.iloc[754]['Poem']

"it's your 1st year of college & you should be missing home by now but mostly you don't. you read the Chicago newspapers & call family on Sundays. you pick up going to church at a place adjacent to the projects. you're not from the projects & the ones in Chicago seem worse but there's comfort in being around plainspoken folk. the church folk feed you & also cook you food. you take African American studies classes & sleep through Spanish & write poems at night. you read the newspaper. you consider pledging a fraternity. you go to parties to watch people. you don't miss home. you call your ex girl a lot. you imagine her face across the phone line. you stare at the scar on her chin. it is shiny & smooth. you read the newspaper. you text new girls mostly. you invite them to play cards & bet clothes or take them to dinner on your birthday so you don't spend it alone or you share their extra-long twin beds or you just text them. it's your 1st year of college & your nephew is tiny & your niec

In [47]:
summarizer(df.iloc[754]['Poem'])

[{'summary_text': "it's your 1st year of college & you should be missing home by now"}]

In [48]:
df.iloc[754]['Title']

'recycling'