# Import:

https://huggingface.co/docs/transformers/notebooks

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

Load data:

In [69]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data/title_text.csv")

In [70]:
df.head()

Unnamed: 0,Sl,ItemID,Title,Text
0,0,4288352,grover’s-like drug eruption in a patient with ...,A 73-year-old man developed Grover&#39;s like ...
1,1,4288295,aromatase inhibitor-induced carpal tunnel synd...,"In a retrospective study of 441&#160;patients,..."
2,2,4288431,paradoxical physiological responses to propran...,A 13-year-old girl developed paradoxical incre...
3,3,4288417,autoimmune inner ear disease in a melanoma pat...,An 82-year-old man developed sensorineural hea...
4,4,4288429,pathological autopsy of a patient that underwe...,A 65-year-old man died due to interstitial pne...


Dependencies:

In [71]:
!pip install datasets transformers rouge-score nltk



Sign up [here](https://huggingface.co/join) and input username and password:

In [72]:
##
from huggingface_hub import notebook_login

notebook_login()


Then install Git-LFS:

In [73]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


Make sure your version of Transformers is at least 4.11.0, since the functionality used here was introduced in that version:

In [74]:
##
import transformers

print(transformers.__version__)

4.13.0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tuning a model on a summarization task

In [75]:
##
model_checkpoint = "facebook/bart-large"

Any model checkpoint from the [Model Hub](https://huggingface.co/models) works here, as long as that model **has a sequence-to-sequence version** in the Transformers library.

## Loading metric

`load_dataset` and `load_metric`:

In [76]:
##
from datasets import load_dataset, load_metric

#raw_datasets = load_dataset("xsum")
metric = load_metric("rouge")


## Preprocessing the data

Instantiate tokenizer with `AutoTokenizer.from_pretrained`:

In [77]:
##
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/facebook/bart-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3f12fb71b844fcb7d591fdd4e55027da90d7b5dd6aa5430ad00ec6d76585f26c.58d5dda9f4e9f44e980adb867b66d9e0cbe3e0c05360cefe3cd86f5db4fff042
Model config BartConfig {
  "_name_or_path": "facebook/bart-large",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpo

To prepare the targets for our model, we need to tokenize them inside the `as_target_tokenizer` context manager. This will make sure the tokenizer uses the special tokens corresponding to the targets.

In [78]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This ensures that an input longer than the selected model can handle is truncated to the maximum length accepted by the model. Padding is dealt with later on (in a data collator), so examples are padded to the longest length in the batch rather than in the whole dataset.
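The difference between truncation and per-batch padding can be sketched in plain Python. This is only an illustration, not the tokenizer's implementation: `truncate`, `pad_batch`, and the `pad_id=1` value are hypothetical names and placeholders, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def truncate(ids, max_length):
    """Keep at most max_length token ids (simplified: special tokens are not preserved)."""
    return ids[:max_length]


def pad_batch(batch, pad_id):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask


# Two toy sequences: the first is cut to 4 ids, the second is padded up to 4.
batch = [truncate(ids, 4) for ids in [[0, 11, 12, 13, 14, 2], [0, 21, 2]]]
input_ids, attention_mask = pad_batch(batch, pad_id=1)
```

Because the padding length is computed per batch, a batch of short texts does not pay for the longest text in the whole dataset.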

In [79]:
##
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6744 entries, 0 to 6824
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Sl      6744 non-null   int64 
 1   ItemID  6744 non-null   int64 
 2   Title   6744 non-null   object
 3   Text    6744 non-null   object
dtypes: int64(2), object(2)
memory usage: 263.4+ KB


In [80]:
##
df = df[:2000].copy()  # copy so later column assignments do not act on a slice of the original frame

In [81]:
##
max_input_length = 400
max_target_length = 20

In [82]:
##
def preprocess(txt):
  soup = BeautifulSoup(txt, 'html.parser')
  txt = soup.get_text()
  txt = txt.encode('ascii', errors='ignore').strip().decode('ascii') # decode ASCII to remove \u***** characters
  return txt
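One side effect of the ASCII step is worth seeing on the kind of text shown in `df.head()`. The sketch below uses the standard library's `html.unescape` as a stand-in for BeautifulSoup's entity decoding (an assumption made to keep the example dependency-free), and `normalize_then_strip` is a hypothetical variant, not part of the original notebook.

```python
import html
import unicodedata


def strip_to_ascii(txt):
    """Decode HTML entities, then drop every non-ASCII character."""
    txt = html.unescape(txt)
    return txt.encode("ascii", errors="ignore").decode("ascii").strip()


def normalize_then_strip(txt):
    """Same as above, but first map compatibility characters (e.g. a
    non-breaking space) to their plain equivalents with NFKC."""
    txt = unicodedata.normalize("NFKC", html.unescape(txt))
    return txt.encode("ascii", errors="ignore").decode("ascii").strip()


apostrophe = strip_to_ascii("Grover&#39;s like")        # entity becomes a plain apostrophe
glued = strip_to_ascii("441&#160;patients")             # non-breaking space is dropped
spaced = normalize_then_strip("441&#160;patients")      # non-breaking space becomes a space
```

Dropping the non-breaking space joins the neighbouring words, so normalizing first may be preferable if word boundaries matter for the summaries.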

In [83]:
##
df['Title'] = df['Title'].apply(preprocess)
df['Text'] = df['Text'].apply(preprocess)

In [84]:
##
def format_text(text, summary):
    #inputs = [doc for doc in text]
    model_inputs = tokenizer(text, max_length=max_input_length, truncation=True, padding=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(summary, max_length=max_target_length, truncation=True, padding=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [85]:
##
#model_inputs = df['Text'].apply(lambda x: format_text(x))
data = []
for idx, row in df.iterrows():
  # Select columns by name rather than position: Text is the input, Title the target
  data.append(format_text(row["Text"], row["Title"]))

In [86]:
##
# create dataframe to use as dataset
df2 = pd.DataFrame(data)
#df2 = pd.read_csv('xlnet_tokens.csv')
#df2.drop(columns=['Unnamed: 0'], inplace=True)
df2

Unnamed: 0,attention_mask,input_ids,labels
0,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 250, 6521, 12, 180, 12, 279, 313, 2226, 74...","[0, 15821, 3697, 12, 3341, 1262, 20945, 11, 10..."
1,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1121, 10, 33777, 892, 9, 204, 4006, 11632,...","[0, 271, 1075, 415, 3175, 34496, 12, 29101, 51..."
2,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 250, 508, 12, 180, 12, 279, 1816, 2226, 30...","[0, 5489, 625, 4325, 3569, 38399, 8823, 7, 175..."
3,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 4688, 7383, 12, 180, 12, 279, 313, 2226, 9...","[0, 39545, 42866, 8725, 5567, 2199, 11, 10, 29..."
4,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 250, 3620, 12, 180, 12, 279, 313, 962, 528...","[0, 22609, 9779, 15103, 9, 10, 3186, 14, 12796..."
...,...,...,...
1995,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1121, 10, 892, 6, 10, 2631, 12, 180, 12, 2...","[0, 3792, 28281, 36557, 32148, 9, 385, 1043, 2..."
1996,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 250, 5545, 12, 180, 12, 279, 313, 2226, 13...","[0, 42866, 21562, 34496, 111, 38838, 12855, 13..."
1997,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 1121, 10, 12429, 41398, 892, 6, 10, 6791, ...","[0, 1588, 718, 40133, 873, 8, 419, 2434, 9, 34..."
1998,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[0, 250, 1718, 12, 180, 12, 279, 693, 2226, 12...","[0, 338, 1594, 3914, 636, 179, 12, 29101, 1169..."


In [None]:
##
#df2.to_csv('t5_tokens.csv')

In [None]:
#df2.iloc[0].to_dict()

In [87]:
import torch
from torch.utils.data import Dataset

In [88]:
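class FineTuneDataset(Dataset):
  """Wraps a list of tokenized examples (one tokenizer output per row)."""

  def __init__(self, data, max_input_length, max_target_length):
    self.data = data
    # Stored for reference only; truncation already happened in format_text.
    self.max_input_length = max_input_length
    self.max_target_length = max_target_length

  def __len__(self):
    return len(self.data)

  def __getitem__(self, item):
    example = self.data[item]
    # Use the key names the model and DataCollatorForSeq2Seq expect:
    # a key such as "mask" is not padded by the collator.
    return {
        "input_ids": example["input_ids"],
        "attention_mask": example["attention_mask"],
        "labels": example["labels"],
    }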


In [89]:
train_dataset = FineTuneDataset(data[:1500], 400, 20)
val_dataset = FineTuneDataset(data[1500:], 400, 20)
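Before training, it can help to check that each item carries exactly the keys the collator and model will look for. The sketch below assumes those keys are `input_ids`, `attention_mask`, and `labels`; `EXPECTED_KEYS`, `unexpected_keys`, and `missing_keys` are hypothetical helpers, and the sample dictionary is made up for illustration.

```python
# Assumed set of inputs for a sequence-to-sequence model and its collator.
EXPECTED_KEYS = {"input_ids", "attention_mask", "labels"}


def unexpected_keys(example):
    """Keys present in the example that the collator would not know how to pad."""
    return sorted(set(example) - EXPECTED_KEYS)


def missing_keys(example):
    """Expected keys that the example does not provide."""
    return sorted(EXPECTED_KEYS - set(example))


# A made-up item that uses "mask" where "attention_mask" is expected.
sample = {"input_ids": [0, 250, 2], "mask": [1, 1, 1], "labels": [0, 15821, 2]}
extra = unexpected_keys(sample)
absent = missing_keys(sample)
```

An extra or misnamed key is one possible source of collation errors, since variable-length lists under an unknown key are left unpadded when the batch is converted to tensors.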

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the **sequence-to-sequence** kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [90]:
##
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
#model = XLNetModel.from_pretrained('xlnet-base-cased')

loading configuration file https://huggingface.co/facebook/bart-large/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3f12fb71b844fcb7d591fdd4e55027da90d7b5dd6aa5430ad00ec6d76585f26c.58d5dda9f4e9f44e980adb867b66d9e0cbe3e0c05360cefe3cd86f5db4fff042
Model config BartConfig {
  "_name_or_path": "facebook/bart-large",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_eos_token_id": 2,
  "gradient_checkpo

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [91]:
##
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-med",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Here we set evaluation to run at the end of each epoch, set the learning rate, use the `batch_size` defined at the top of the cell, and customize the weight decay. Since the `Seq2SeqTrainer` saves checkpoints regularly, we tell it to keep at most three. Lastly, we use the `predict_with_generate` option so that evaluation generates actual summaries. Mixed precision training is left off here (`fp16=False`); setting `fp16=True` on a supported GPU can make training a bit faster.
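These settings also determine how many optimizer steps a run takes. The helper below is a simplified estimate, not the trainer's own code: `optimization_steps` and its parameters are hypothetical, and it assumes a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math


def optimization_steps(num_examples, per_device_batch_size, num_epochs,
                       num_devices=1, grad_accum=1):
    """Rough count of optimizer steps: batches per epoch times epochs."""
    effective_batch = per_device_batch_size * num_devices * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


# 1500 training examples, batch size 16, one epoch
total_steps = optimization_steps(1500, 16, 1)
```

With 1500 examples and a batch size of 16, this gives 94 steps, which agrees with the `Total optimization steps = 94` line in the training log.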

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that differs from the repository it will be pushed to, or if you want to push your model under an organization rather than your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

In [92]:
##
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
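Label padding differs from input padding in the value used: labels are padded with `-100`, the index that the loss function ignores, so padded positions do not contribute to the loss. The sketch below illustrates the idea in plain Python; `pad_labels` and `count_scored_tokens` are hypothetical helpers, not the collator's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def pad_labels(batch_labels, ignore_index=IGNORE_INDEX):
    """Pad each label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [ignore_index] * (longest - len(labels)) for labels in batch_labels]


def count_scored_tokens(padded_labels, ignore_index=IGNORE_INDEX):
    """Number of label positions that would actually count towards the loss."""
    return sum(1 for row in padded_labels for tok in row if tok != ignore_index)


padded = pad_labels([[0, 5, 6, 2], [0, 7, 2]])
scored = count_scored_tokens(padded)
```

This is also why `compute_metrics` has to replace `-100` with the pad token id before decoding the labels back into text.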

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to decode the predictions into texts:

In [93]:
##
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}
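To make the reported numbers easier to read, here is a minimal sketch of what a ROUGE-1 F-measure captures: the overlap of single words between a generated title and its reference. It is a simplification, not the `rouge` metric used above: `rouge1_f` is a hypothetical function that splits on whitespace and skips the stemming enabled by `use_stemmer=True`, and the two example strings are made up.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("drug eruption in a patient",
                 "drug eruption in an elderly patient")
```

Here four of the five predicted words appear in the six-word reference, so precision is 4/5, recall is 4/6, and the F-measure is 8/11 (about 0.73, or 73 on the 0 to 100 scale used in `compute_metrics`).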

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [None]:
"""trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)"""

In [94]:
##
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

/content/bart-large-finetuned-med is already a clone of https://huggingface.co/Leonis/bart-large-finetuned-med. Make sure you pull the latest changes with `repo.git_pull()`.


We can now finetune our model by just calling the `train` method:

In [95]:
##
trainer.train()

***** Running training *****
  Num examples = 1500
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 94


ValueError: ignored

In [67]:
len(data[0]['input_ids'])

375

You can now upload the result of the training to the Hub, just execute this instruction:

In [None]:
##
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```

In [None]:
##
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("Leonis/bart-large-cnn-finetuned-med")