<a href="https://colab.research.google.com/github/pedroconcejero/UTAD_python/blob/master/huggingface_summarisation_con_TF_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [41]:
!pip install transformers datasets evaluate rouge_score



In [42]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

In [43]:
billsum = billsum.train_test_split(test_size=0.2)

In [45]:
type(billsum)

datasets.dataset_dict.DatasetDict

In [44]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 18897 of the Revenue and Taxation Code is amended to read:\n18897.\nAll moneys transferred to the School Supplies for Homeless Children Fund, upon appropriation by the Legislature, shall be allocated as follows:\n(a) To the Franchise Tax Board, the State Department of Social Services, and the Controller for reimbursement of all costs incurred by the Franchise Tax Board, the Controller, and the State Department of Social Services in connection with their duties under this article.\n(b) To the State Department of Social Services as follows:\n(1) For the 2014–15 fiscal year, the Controller shall transfer the funds appropriated to the State Department of Education for this purpose from Budget Items 6110-001-8075 and 6110-101-8075 to the State Department of Social Services. Funds transferred may be used for state operations or local assistance expenditures and for distribution to a nonprofit organi

There are two fields that you’ll want to use:

  - text: the text of the bill, which will be the input to the model.
  - summary: a condensed version of text, which will be the model target.

## Preprocess

The next step is to load a T5 tokenizer to process text and summary:

The preprocessing function you want to create needs to:

  - Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  - Use the keyword text_target argument when tokenizing labels.
  - Truncate sequences to be no longer than the maximum length set by the max_length parameter.

In [46]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-small"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [47]:
prefix = "summarize: "

# NOTE: max_length was changed to 512 for inputs and 64 for labels;
# otherwise the available resources run out.

def preprocess_function(examples):

    inputs = [prefix + doc for doc in examples["text"]]

    model_inputs = tokenizer(inputs,
                             max_length=512,
                             truncation=True)

    labels = tokenizer(text_target=examples["summary"],
                       max_length=64,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs
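The effect of the prefix and of truncation can be seen without downloading a tokenizer. The sketch below is only an illustration: toy_tokenize is a hypothetical stand-in that emits one fake id per whitespace-separated word, the small length limits are chosen for readability, and the summary string is made up.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    """Hypothetical stand-in tokenizer: one fake id per whitespace-separated word."""
    ids = [len(word) for word in text.split()]  # fake ids, for illustration only
    return ids[:max_length]  # mimics truncation=True with a max_length limit


def toy_preprocess(examples, max_input_len=8, max_label_len=4):
    # Prepend the task prefix to every document, as preprocess_function does.
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = {"input_ids": [toy_tokenize(t, max_input_len) for t in inputs]}
    # Summaries are tokenized separately and stored under "labels".
    model_inputs["labels"] = [toy_tokenize(s, max_label_len) for s in examples["summary"]]
    return model_inputs


batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["A short made-up summary of the bill"],  # made-up example
}
out = toy_preprocess(batch)
print(len(out["input_ids"][0]), len(out["labels"][0]))
```

Both lists come back no longer than their limits, and the first input id belongs to the prefix word rather than to the bill text.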

To apply the preprocessing function over the entire dataset, use 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

In [48]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

Map:   0%|          | 0/989 [00:00<?, ? examples/s]

Map:   0%|          | 0/248 [00:00<?, ? examples/s]

Now create a batch of examples using DataCollatorForSeq2Seq. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [49]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
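To make the saving from dynamic padding concrete, here is a minimal pure-Python sketch that counts pad tokens under both strategies. It does not use DataCollatorForSeq2Seq; pad_batch and count_pads are hypothetical helpers, and the token ids are made up, with 0 playing the role of the pad token.

```python
def pad_batch(sequences, pad_id, length=None):
    """Right-pad every sequence to `length` (default: the longest in the batch)."""
    target = length if length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]


def count_pads(padded, pad_id):
    """Count how many positions in the padded batch hold the pad id."""
    return sum(row.count(pad_id) for row in padded)


# Made-up token ids; none of them equals the pad id 0.
batch = [[5, 8, 2], [7, 2], [9, 4, 6, 3, 2]]

dynamic = pad_batch(batch, pad_id=0)            # pad to the longest sequence (5)
fixed = pad_batch(batch, pad_id=0, length=512)  # pad to a fixed maximum length

print(count_pads(dynamic, 0), count_pads(fixed, 0))  # 5 1526
```

Label sequences are padded in the same way, but with -100 instead of the pad token id, which is why compute_metrics replaces -100 before decoding the labels.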

## Evaluate

Including a metric during training is often helpful for evaluating your model’s performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [50]:
import evaluate

rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to rouge.compute to calculate the ROUGE metric:

In [51]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Turn the predicted token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # -100 marks padded label positions; restore the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average length of the generated summaries, ignoring pad tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
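For intuition about what a ROUGE score measures, the sketch below computes a simplified ROUGE-1 F1 from unigram overlap. It is not the implementation behind evaluate.load("rouge"), which uses its own tokenization and optional stemming; the function name rouge1_f1 and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "the act lowers drug costs",
    "the act lowers prescription drug costs",
)
print(round(score, 4))  # 0.9091
```

Here all five predicted words appear in the six-word reference, so precision is 1, recall is 5/6, and the F1 score is 10/11.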


## Train with TensorFlow

To fine-tune a model in TensorFlow, start by setting up an optimizer with a learning rate and a weight decay rate:


In [52]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
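As a rough illustration of what the learning_rate and weight_decay_rate arguments control, here is a minimal sketch of one Adam update with decoupled weight decay on a single scalar parameter. It is a textbook-style simplification and not the transformers implementation; the function name adamw_step and the beta and epsilon defaults are assumptions chosen for the example.

```python
import math


def adamw_step(param, grad, m, v, step, lr=2e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update with decoupled weight decay for a scalar parameter."""
    # Decoupled decay: shrink the parameter directly, independent of the gradient.
    param = param - lr * weight_decay * param
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (step starts at 1).
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient only the weight decay moves the parameter.
decayed, m0, v0 = adamw_step(1.0, 0.0, 0.0, 0.0, step=1)
# With a gradient of 1.0 the Adam term adds a further step of about lr.
stepped, m1, v1 = adamw_step(1.0, 1.0, 0.0, 0.0, step=1)
print(decayed, stepped)
```

With these settings the decay term changes the parameter by lr * weight_decay per step, which is much smaller than the gradient-driven step.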

In [53]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

In [54]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_billsum["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_billsum["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
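The batch size determines how many steps each epoch takes. The arithmetic below uses the split sizes reported by the map output (989 training and 248 test examples). It assumes that the incomplete final batch is dropped when shuffle=True and kept when shuffle=False; treat that as an assumption to verify against the prepare_tf_dataset documentation. The helper batches_per_epoch is hypothetical.

```python
import math


def batches_per_epoch(num_examples, batch_size, drop_remainder):
    """Number of batches per pass, with or without the incomplete last batch."""
    if drop_remainder:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


# Assumption: the training set drops its final partial batch, the test set keeps it.
train_steps = batches_per_epoch(989, 16, drop_remainder=True)
test_steps = batches_per_epoch(248, 16, drop_remainder=False)
print(train_steps, test_steps)  # 61 16
```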

In [55]:
import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are computing the ROUGE score from the predictions and providing a way to push your model to the Hub. Both are done with Keras callbacks.

Pass your compute_metrics function to KerasMetricCallback:

## THIS CALLBACK **DOES NOT WORK IN COLAB**

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

Although we create this callback, **it will not be used in .fit**

In [None]:
callbacks = [metric_callback]

In [56]:
model.fit(x=tf_train_set,
          validation_data=tf_test_set,
          epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7c9216305990>

THE MODEL MUST BE SAVED **with save_pretrained** BEFORE IT CAN BE USED IN A PIPELINE

It can be saved locally or to a Hugging Face hub.

In [58]:
# from_pt is a loading option for from_pretrained, so it is not passed when saving.
model.save_pretrained("modelos/modelo1")

Here is a text to summarize:

In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

We create our pipeline FROM THE SAVED MODEL

In [60]:
from transformers import pipeline

miresumen = pipeline(task="summarization",
                model='modelos/modelo1',
                tokenizer=tokenizer)

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at modelos/modelo1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Inference - Create summaries

Call the pipeline on the text to generate a summary:

In [61]:
miresumen(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up workers and create good-paying, union jobs across the country."}]