# Transformers examples

## Loading, filtering, splitting datasets

Loading dataset

In [1]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="train")


Splitting into training and test set

In [2]:
billsum = billsum.train_test_split(test_size=0.2)
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 15159
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 3790
    })
})

In [3]:
billsum['train'][0]

{'text': "SECTION 1. FINDINGS.\n\n    Congress finds as follows:\n            (1) Since 1935, the United States has owned a parcel of \n        land in Riverside, California, consisting of approximately 9.5 \n        acres, more specifically described in section 2(a) (in this \n        section referred to as the ``property'').\n            (2) The property is administered by the Department of \n        Agriculture and has been variously used for research and plant \n        materials purposes.\n            (3) Since 1998, the property has been administered by the \n        Natural Resources Conservation Service.\n            (4) Since 2002, the property has been co-managed under a \n        cooperative agreement between the Natural Resources \n        Conservation Service and the Riverside Corona Resource \n        Conservation District, which is a legal subdivision of the \n        State of California under section 9003 of the California Public \n        Resources Code.\n            (

Loading dataset from disk

In [4]:
medium_datasets = load_dataset("csv", data_files="medium-articles.zip")

Let's inspect it

In [5]:
medium_datasets

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags'],
        num_rows: 192368
    })
})

Splitting (and selecting a subset)

In [6]:
datasets_train_test = medium_datasets["train"].train_test_split(test_size=3000)
datasets_train_validation = datasets_train_test["train"].train_test_split(test_size=3000)

medium_datasets["train"] = datasets_train_validation["train"]
medium_datasets["validation"] = datasets_train_validation["test"]
medium_datasets["test"] = datasets_train_test["test"]


medium_datasets["train"] = medium_datasets["train"].shuffle().select(range(10000))
medium_datasets["validation"] = medium_datasets["validation"].shuffle().select(range(1000))
medium_datasets["test"] = medium_datasets["test"].shuffle().select(range(1000))
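The two successive `train_test_split` calls above first hold out the test rows and then hold out the validation rows from what remains; an integer `test_size` is an absolute row count. A minimal sketch of the same bookkeeping over plain row indices follows. `three_way_split` is a hypothetical helper written for illustration, not part of the `datasets` library.

```python
import random

def three_way_split(n_rows, n_validation, n_test, seed=0):
    """Mimic two successive hold-out splits over row indices (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # First split: hold out the test rows.
    test, rest = indices[:n_test], indices[n_test:]
    # Second split: hold out the validation rows from the remainder.
    validation, train = rest[:n_validation], rest[n_validation:]
    return train, validation, test

train, validation, test = three_way_split(192368, n_validation=3000, n_test=3000)
print(len(train), len(validation), len(test))  # 186368 3000 3000
```

The three index sets are disjoint, so no article can end up in more than one split.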

Filtering out examples whose text or title is too short

In [7]:
medium_datasets_cleaned = medium_datasets.filter(
    lambda example: (len(example['text']) >= 500) and
    (len(example['title']) >= 20)
)

Filter: 100%|██████████| 10000/10000 [00:03<00:00, 2928.35 examples/s]
Filter: 100%|██████████| 1000/1000 [00:00<00:00, 3898.10 examples/s]
Filter: 100%|██████████| 1000/1000 [00:00<00:00, 3515.64 examples/s]
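The predicate passed to `filter` keeps an example only when both length checks pass. Note that `len` counts characters here, not tokens. The sketch below applies the same rule to a few made-up records; the names `MIN_TEXT_CHARS`, `MIN_TITLE_CHARS` and `is_long_enough` are illustrative assumptions.

```python
MIN_TEXT_CHARS = 500   # same threshold as the lambda above
MIN_TITLE_CHARS = 20   # same threshold as the lambda above

def is_long_enough(example):
    """Return True when both the text and the title meet the minimum length."""
    return (len(example["text"]) >= MIN_TEXT_CHARS
            and len(example["title"]) >= MIN_TITLE_CHARS)

# Made-up records, for illustration only.
examples = [
    {"title": "Ordering Wine For Beginners", "text": "x" * 600},     # kept
    {"title": "Short title", "text": "x" * 600},                     # title too short
    {"title": "A sufficiently long title here", "text": "too short"},  # text too short
]

kept = [ex for ex in examples if is_long_enough(ex)]
print([ex["title"] for ex in kept])  # ['Ordering Wine For Beginners']
```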


In [8]:
medium_datasets_cleaned

DatasetDict({
    train: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags'],
        num_rows: 8539
    })
    validation: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags'],
        num_rows: 861
    })
    test: Dataset({
        features: ['title', 'text', 'url', 'authors', 'timestamp', 'tags'],
        num_rows: 837
    })
})

In [9]:
medium_datasets_cleaned["train"][0]

{'title': 'Ordering Wine For Beginners',
 'text': '48% of American adults drink wine a few times a month. If you’re not sure of the difference between a Côtes du Rhône and a Château Méaume, though, choosing a wine on a date can seem impossible. From choosing a type, to a vintage, to knowing the correct procedure for tasting, the world of wine can be intimidating. But if you’re worried about picking the wine for your next date, don’t — there are a few simple guidelines that can turn wine from a headache into a relaxing, enjoyable experience.\n\nWhat Are You Eating?\n\nOn a date, it’s likely that you’re having dinner at a restaurant, rather than just drinking wine. Lucky for you, this makes choosing your wine a lot easier — instead of worrying about particular notes, one of the simplest ways to pick a wine is when you’re pairing it with a meal. The easiest way to remember wine and food pairings is by color: red wines pair well with red meat, where lighter meats like fish pair well with w

## Tokenization

Loading the tokenizer

In [10]:
from transformers import AutoTokenizer

checkpoint = "t5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Text preprocessing and preparation

In [11]:
import nltk
nltk.download('punkt')
import string



def clean_text(text):
    sentences = nltk.sent_tokenize(text.strip())
    sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
    sentences_cleaned_no_titles = [sent for sent in sentences_cleaned if len(sent) > 0 and sent[-1] in string.punctuation]
    text_cleaned = "\n".join(sentences_cleaned_no_titles)
    return text_cleaned

[nltk_data] Downloading package punkt to /Users/mdipenta/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
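The last step of `clean_text` keeps only lines whose final character is in `string.punctuation`. The sketch below isolates that step without NLTK (sentence splitting is skipped, which is a simplification), using a made-up sample. It suggests two side effects worth knowing: a heading that ends in a question mark survives, and a line ending in a curly quote is dropped because `string.punctuation` contains ASCII characters only.

```python
import string

def drop_lines_without_final_punctuation(text):
    """Keep only non-empty lines that end with an ASCII punctuation character."""
    lines = text.split("\n")
    kept = [line for line in lines if len(line) > 0 and line[-1] in string.punctuation]
    return "\n".join(kept)

# Made-up sample: a title, a sentence, a heading ending in "?",
# and a sentence ending in a curly closing quote (U+201D).
sample = (
    "Ordering Wine\n"
    "\n"
    "Wine can be intimidating.\n"
    "What Are You Eating?\n"
    "He said \u201crelax\u201d"
)

print(drop_lines_without_final_punctuation(sample))
# Wine can be intimidating.
# What Are You Eating?
```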


Preparing training examples by truncating long inputs and titles and adding the task prefix

In [12]:
max_input_length = 512
max_target_length = 64
prefix = "summarize: "

def preprocess_data(examples):
    texts_cleaned = [clean_text(text) for text in examples["text"]]
    inputs = [prefix + text for text in texts_cleaned]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    labels = tokenizer(examples["title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Performing the preprocessing phase

In [13]:
tokenized_datasets = medium_datasets_cleaned.map(preprocess_data,
                                                   batched=True)

Map: 100%|██████████| 8539/8539 [00:28<00:00, 299.10 examples/s]
Map: 100%|██████████| 861/861 [00:02<00:00, 296.07 examples/s]
Map: 100%|██████████| 837/837 [00:03<00:00, 251.15 examples/s]


## Setting up the model training

Loading the data collator

In [14]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

2024-05-18 11:25:50.428272: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
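`DataCollatorForSeq2Seq` pads each batch dynamically to its longest sequence and, by default, pads the labels with `-100` so that padded positions are ignored by the loss. The following is a simplified plain-Python sketch of that idea, not the library's implementation; `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is assumed (the value used by the `t5-small` tokenizer).

```python
PAD_TOKEN_ID = 0            # assumed pad id of the tokenizer
LABEL_PAD_TOKEN_ID = -100   # label value ignored by the loss

def pad_batch(features):
    """Right-pad inputs and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_input - n_real
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [LABEL_PAD_TOKEN_ID] * label_pad)
    return batch

# Made-up token ids, for illustration only.
batch = pad_batch([
    {"input_ids": [21, 7, 1], "labels": [71, 5, 1]},
    {"input_ids": [21, 1], "labels": [8, 1]},
])
print(batch["input_ids"])  # [[21, 7, 1], [21, 1, 0]]
print(batch["labels"])     # [[71, 5, 1], [8, 1, -100]]
```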


Setting the metrics computation

In [15]:
import evaluate

rouge = evaluate.load("rouge")

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
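Two steps in `compute_metrics` are easy to miss: labels padded with `-100` must be mapped back to the pad token before decoding, and `gen_len` counts the non-pad tokens in each prediction. A small NumPy sketch with made-up token ids, assuming a pad id of 0:

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumed pad id (the value used by t5-small)

# Made-up label ids; -100 marks padded positions.
labels = np.array([[71, 5, 1, -100, -100],
                   [8, 1, -100, -100, -100]])
# -100 is not a valid token id, so replace it before decoding.
restored = np.where(labels != -100, labels, PAD_TOKEN_ID)
print(restored.tolist())  # [[71, 5, 1, 0, 0], [8, 1, 0, 0, 0]]

# Made-up generated ids; T5 generations start with the pad token.
predictions = np.array([[0, 71, 5, 1, 0],
                        [0, 8, 3, 9, 1]])
prediction_lens = [int(np.count_nonzero(pred != PAD_TOKEN_ID)) for pred in predictions]
gen_len = float(np.mean(prediction_lens))
print(prediction_lens, gen_len)  # [3, 4] 3.5
```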

Loading the model

In [16]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Setting the training arguments

In [17]:
training_args = Seq2SeqTrainingArguments(
    output_dir=".",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=10,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    load_best_model_at_end=True,
    push_to_hub=False)



Preparing the training

In [18]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)

## Running the actual training (fine-tuning)

In [19]:
trainer.train()

  0%|          | 3/2136 [00:36<7:01:41, 11.86s/it]

KeyboardInterrupt: 
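The total of 2136 steps in the progress bar follows from the filtered training set size, the batch size and the number of epochs. A quick check, assuming a single device, no gradient accumulation, and a final partial batch that is kept:

```python
import math

train_rows = 8539    # rows in the filtered training split
batch_size = 16      # per_device_train_batch_size
epochs = 4           # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 534 2136

# Rough remaining time at the 11.86 s/it shown after 3 steps.
remaining_hours = (total_steps - 3) * 11.86 / 3600
print(round(remaining_hours, 2))  # 7.03
```

At roughly 7 hours, this matches the estimate shown in the progress bar and explains why the run was interrupted.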

## Loading and using the model

In [45]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [48]:
checkpoint = "t5-small"
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Move the model to the same device the inputs are sent to during inference.
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to(device)
# "betterSummarizingModel" is a saved state_dict; load its weights into the model.
model.load_state_dict(torch.load("betterSummarizingModel", map_location=device))

<All keys matched successfully>
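The notebook does not show how the `betterSummarizingModel` file was written. A file that `load_state_dict` accepts is typically produced with `torch.save(model.state_dict(), path)`. The round trip below is a sketch of that pattern on a tiny stand-in layer; the toy model and the temporary file name are assumptions, not the author's code.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in for the fine-tuned model (assumption, for illustration only).
toy_model = nn.Linear(4, 2)

# Save only the weights (the state_dict), not the whole model object.
path = os.path.join(tempfile.mkdtemp(), "toy_state_dict.pt")
torch.save(toy_model.state_dict(), path)

# Rebuild the same architecture, then load the saved weights into it.
restored = nn.Linear(4, 2)
result = restored.load_state_dict(torch.load(path, map_location="cpu"))
print(result)  # <All keys matched successfully>
```

Because only the weights are stored, the loading side must first rebuild the same architecture, which is why the cell above calls `from_pretrained(checkpoint)` before `load_state_dict`.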

## Performing the inference

In [53]:
def inference(text):
    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: inputs[k].to(device) for k in inputs}
    predicted_abstract_ids = model.generate(**inputs)
    return tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)

In [56]:
with open("abstract.txt", "r") as f:
    text = f.read()

In [57]:
inference(text)

'A Transformer is a deep learning architecture developed by Google and based on the multi-head'

In [42]:
medium_datasets_cleaned['test'][5]['text']



In [51]:
text=medium_datasets_cleaned['test'][5]['text']

In [55]:
inference(text)



'If something makes you feel uncomfortable it’s not right for you.'

In [58]:
prefix = "summarize: "
inputs = tokenizer(prefix + text, return_tensors="pt").input_ids.to(device)

# Beam search with do_sample=False is deterministic, so temperature has no effect and is omitted.
outputs = model.generate(inputs, max_new_tokens=30, num_beams=10, do_sample=False, repetition_penalty=3.0)
tokenizer.decode(outputs[0], skip_special_tokens=True)



'A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism'