# Log in to the Hugging Face Hub

Create a Hugging Face community account, then create an access token:
https://huggingface.co/settings/tokens

With the account and token, you can access pretrained models and open-source datasets, and push your fine-tuned model to the Hub.

In [1]:
from huggingface_hub import notebook_login
notebook_login()


## Hugging Face Example

In [2]:
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)

Found cached dataset billsum (/Users/matthewong/.cache/huggingface/datasets/billsum/default/3.0.0/75cf1719d38d6553aa0e0714c393c74579b083ae6e164b2543684e3e92e0c4cc)


# Data Structure

In [3]:
billsum['train'][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nArticle 3.3 (commencing with Section 20119) is added to Chapter 1 of Part 3 of Division 2 of the Public Contract Code, to read:\nArticle  3.3. Los Angeles Unified School District — Best Value Procurement\n20119.\n(a) It is the intent of the Legislature to enable school districts to use cost-effective options for building and modernizing school facilities. The Legislature has recognized the merits of the best value procurement method process in the past by authorizing its use for projects undertaken by the University of California.\n(b) The Legislature also finds and declares that school districts using the best value procurement method require a clear understanding of the roles and responsibilities of each participant in the best value process. As reflected in the University of California report to the Legislature, the benefits of a best value procurement method include a reduction in contract delays,

In [4]:
billsum

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
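
The row counts above (989 for training, 248 for testing) are consistent with holding out 20% of 1,237 rows. The sketch below reproduces that arithmetic with the standard library only. It assumes the test count is rounded up and the remainder goes to training; `split_indices` is a hypothetical helper for illustration, not the `datasets` implementation.

```python
import math
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and hold out a fraction for testing (illustrative only)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test count is rounded up, the rest is used for training.
    num_test = math.ceil(test_size * num_rows)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(989 + 248, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Because the split is shuffled, which rows land in each part depends on the seed; only the sizes are fixed.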

# Process Data

In [5]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [6]:
prefix = "summarize: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
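
To make the shape of the preprocessing output easier to see, here is a minimal sketch that mirrors `preprocess_function` with a toy tokenizer. Everything in it is an assumption for illustration: `toy_tokenize` splits on whitespace instead of using the T5 vocabulary, the length limits are tiny, and the summary string is a placeholder rather than a real BillSum summary.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a real tokenizer: one token per word."""
    # Mirrors truncation=True: keep only the first max_length tokens.
    return text.split()[:max_length]


def toy_preprocess(examples, max_input_length=8, max_target_length=4):
    # Prepend the task prefix to every document, as in preprocess_function.
    inputs = [PREFIX + doc for doc in examples["text"]]
    model_inputs = {
        "input_ids": [toy_tokenize(doc, max_input_length) for doc in inputs]
    }
    # The tokenized summaries become the labels the model is trained to produce.
    model_inputs["labels"] = [
        toy_tokenize(summary, max_target_length) for summary in examples["summary"]
    ]
    return model_inputs


batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["Placeholder summary text used only for this sketch"],
}
out = toy_preprocess(batch)
print(out["input_ids"][0])
print(out["labels"][0])
```

The prefix counts toward the input length limit, so very long documents lose slightly more of their tail once it is added.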

In [7]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)


In [8]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
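
The collator pads each batch to the length of its longest example at batch time, rather than padding the whole dataset up front. The sketch below illustrates that idea in plain Python. It is not the library's code: `pad_batch` is a hypothetical helper, the token ids are made up, and a pad token id of 0 is assumed. Labels are padded with -100, the value that `compute_metrics` later swaps back to the pad token.

```python
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features, pad_token_id=0):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * label_pad)
    return batch


# Made-up token ids, for illustration only.
features = [
    {"input_ids": [21, 5, 9, 1], "labels": [7, 1]},
    {"input_ids": [21, 8, 1], "labels": [4, 6, 3, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```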

In [9]:
import evaluate

rouge = evaluate.load("rouge")

In [10]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
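
Two steps in `compute_metrics` are easy to miss: the -100 label padding is replaced before decoding, and ROUGE then scores word overlap between prediction and reference. The sketch below shows both in plain Python. It is a simplified illustration, not the `evaluate` implementation: `rouge1_f1` is a hypothetical helper that uses whitespace tokens with no stemming, the token ids are made up, and a pad token id of 0 is assumed.

```python
from collections import Counter


def restore_padding(label_ids, pad_token_id=0, ignore_id=-100):
    """Replace the ignore marker with the pad id so the rows can be decoded."""
    return [[pad_token_id if t == ignore_id else t for t in row] for row in label_ids]


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


restored = restore_padding([[71, 1810, 1, -100, -100]])
score = rouge1_f1(
    "the act lowers drug costs",
    "the act lowers prescription drug costs",
)
print(restored)
print(round(score, 4))
```

In this example every predicted word appears in the reference (precision 1.0), but one reference word is missing from the prediction (recall 5/6), which is why the score falls short of 1.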

# Fine-tune Model on CPU

In [11]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

cpu_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [12]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_billsum_model_cpu",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    no_cuda=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=cpu_model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.816046 | 0.1211 | 0.0321 | 0.1001 | 0.1002 | 19.0 |
| 2 | No log | 2.605204 | 0.1314 | 0.0423 | 0.1076 | 0.1079 | 19.0 |
| 3 | No log | 2.544509 | 0.1306 | 0.0416 | 0.1082 | 0.1083 | 19.0 |
| 4 | No log | 2.528401 | 0.1359 | 0.0451 | 0.1126 | 0.1126 | 19.0 |


TrainOutput(global_step=248, training_loss=3.0199228102161038, metrics={'train_runtime': 2957.6008, 'train_samples_per_second': 1.338, 'train_steps_per_second': 0.084, 'total_flos': 1070824333246464.0, 'train_loss': 3.0199228102161038, 'epoch': 4.0})
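
The numbers in this `TrainOutput` can be checked against the training configuration. The sketch below recomputes the step count and throughput from the 989 training rows, the batch size of 16, and the 4 epochs, assuming a single device, no gradient accumulation, and a final partial batch that is kept.

```python
import math

train_rows = 989
batch_size = 16
epochs = 4
train_runtime = 2957.6008  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The result matches `global_step=248` and the reported throughput. It may also explain the `No log` entries in the Training Loss column: if the logging interval is left at its default of 500 steps, a 248-step run finishes before the first training loss is logged.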

# Fine-tune Model on GPU

In [13]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

gpu_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [14]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_billsum_model_gpu",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=gpu_model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 2.794135 | 0.1216 | 0.033 | 0.1004 | 0.1005 | 19.0 |
| 2 | No log | 2.587568 | 0.1295 | 0.0406 | 0.1062 | 0.1062 | 19.0 |
| 3 | No log | 2.529297 | 0.1353 | 0.0463 | 0.1123 | 0.1122 | 19.0 |
| 4 | No log | 2.512336 | 0.1352 | 0.046 | 0.1125 | 0.1126 | 19.0 |




TrainOutput(global_step=248, training_loss=2.9696047382970012, metrics={'train_runtime': 1799.0295, 'train_samples_per_second': 2.199, 'train_steps_per_second': 0.138, 'total_flos': 1070824333246464.0, 'train_loss': 2.9696047382970012, 'epoch': 4.0})
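
With both runs complete, the two `train_runtime` values can be compared directly. The short calculation below uses only the figures reported in the two `TrainOutput` lines; it says nothing about how other hardware or batch sizes would behave.

```python
cpu_runtime = 2957.6008  # seconds, CPU run
gpu_runtime = 1799.0295  # seconds, GPU run

speedup = cpu_runtime / gpu_runtime
minutes_saved = (cpu_runtime - gpu_runtime) / 60

print(f"{speedup:.2f}x faster, {minutes_saved:.1f} minutes saved")
```

The ROUGE scores and validation losses of the two runs are close, so on this evidence the device mainly affects training time rather than model quality.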

In [22]:
# Push to huggingface hub
#trainer.push_to_hub()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


To https://huggingface.co/mattbeen/my_awesome_billsum_model
   8dd7653..21c9633  main -> main


To https://huggingface.co/mattbeen/my_awesome_billsum_model
   21c9633..e1b46d0  main -> main



'https://huggingface.co/mattbeen/my_awesome_billsum_model/commit/21c96338ae094a6159b6394c36c27773b9e4618b'
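
The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of doing that from Python follows, assuming the cell runs before `transformers` or `tokenizers` is imported so the value is in place when any process is forked.

```python
import os

# Set this early, before the tokenizer library is imported or used.
# "false" disables tokenizer parallelism and avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` trades some tokenization speed for quiet logs; for a dataset of this size the difference is unlikely to be noticeable.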

# Testing

In [16]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [19]:
from transformers import pipeline

my_hub_model = "mattbeen/my_awesome_billsum_model"
summarizer = pipeline("summarization", model=my_hub_model)
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]
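
The warning above suggests `max_length=51` for an input of 103 tokens, which is half the input length rounded down. The sketch below turns that rule of thumb into a small helper. `suggested_max_length` and its `ratio` parameter are hypothetical, introduced only for illustration.

```python
def suggested_max_length(input_length, ratio=0.5):
    """Hypothetical helper: cap the summary at a fraction of the input length."""
    return max(1, int(input_length * ratio))


max_length = suggested_max_length(103)
print(max_length)

# The value could then be passed to the pipeline call, for example:
# summarizer(text, max_length=max_length)
```

A tighter cap may also help with the behavior seen in the output, where the summary largely repeats the first sentences of the input.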
