# Hugging Face Basics
Basic usage of the Hugging Face `transformers` and `datasets` libraries.

In [26]:
from transformers import (
    pipeline,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
)

import datasets

## Sentiment analysis on strings

Download a pretrained model and tokenizer for sentiment analysis.

In [2]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Downloading: 100%|██████████| 629/629 [00:00<00:00, 146kB/s]
Downloading: 100%|██████████| 256M/256M [00:09<00:00, 27.9MB/s] 
2022-05-01 08:48:45.338820: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-01 08:48:45.363206: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint a

Use the classifier on a single example.

In [3]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

Use the classifier on a list of examples.

In [4]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
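The `score` is the probability the model assigns to the label it predicts. As a rough sketch of the idea (not the library's actual code), a two-class classifier produces one logit per label and a softmax turns those logits into probabilities. The logit values and the helper names `softmax` and `to_prediction` below are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits: one confident case and one near-tie.
confident = to_prediction([-4.2, 4.2])
uncertain = to_prediction([0.06, -0.06])
print(confident)
print(uncertain)
```

Read this way, a score close to 0.5, like the 0.5309 above, means the model only barely prefers one label over the other.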


## Sentiment analysis on a dataset

TweetEval consists of seven heterogeneous Twitter tasks, all framed as multi-class tweet classification: irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks are unified into one benchmark, with every dataset in the same format and with fixed training, validation, and test splits. Here we load the training split of the emotion task and run the sentiment classifier on its tweet text.

In [13]:
dataset = datasets.load_dataset("tweet_eval", name='emotion', split="train")

Downloading and preparing dataset tweet_eval/emotion (download: 472.47 KiB, generated: 511.52 KiB, post-processed: Unknown size, total: 984.00 KiB) to /Users/Lauren/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Downloading data: 307kB [00:00, 10.3MB/s]
Downloading data: 6.51kB [00:00, 2.67MB/s]
Downloading data: 133kB [00:00, 10.6MB/s]
Downloading data: 2.84kB [00:00, 1.19MB/s]
Downloading data: 34.6kB [00:00, 6.47MB/s]
Downloading data: 748B [00:00, 307kB/s]
Downloading data files: 100%|██████████| 6/6 [00:01<00:00,  4.06it/s]
Extracting data files: 100%|██████████| 6/6 [00:00<00:00, 1047.70it/s]

Dataset tweet_eval downloaded and prepared to /Users/Lauren/.cache/huggingface/datasets/tweet_eval/emotion/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.




In [16]:
# Classify the first four tweets in the training split
texts = dataset["text"]
classifier(texts[:4])

[{'label': 'NEGATIVE', 'score': 0.9921145439147949},
 {'label': 'NEGATIVE', 'score': 0.9914141297340393},
 {'label': 'POSITIVE', 'score': 0.9987362027168274},
 {'label': 'NEGATIVE', 'score': 0.6745527982711792}]
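When classifying many texts it is often handy to tally the labels and flag low-confidence predictions. The sketch below works on the four predictions shown above; the helper name `summarize_predictions` and the 0.9 threshold are arbitrary choices for illustration.

```python
from collections import Counter

# Predictions copied from the classifier output above.
predictions = [
    {"label": "NEGATIVE", "score": 0.9921145439147949},
    {"label": "NEGATIVE", "score": 0.9914141297340393},
    {"label": "POSITIVE", "score": 0.9987362027168274},
    {"label": "NEGATIVE", "score": 0.6745527982711792},
]

def summarize_predictions(preds, min_score=0.9):
    """Count predicted labels and list the positions of low-confidence predictions."""
    counts = Counter(p["label"] for p in preds)
    low_confidence = [i for i, p in enumerate(preds) if p["score"] < min_score]
    return counts, low_confidence

counts, low_confidence = summarize_predictions(predictions)
print(dict(counts))      # label -> number of predictions
print(low_confidence)    # indices whose score falls below the threshold
```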

# Summarization Example
Fine-tune `t5-small` to summarize bills from the BillSum dataset.

In [34]:
from datasets import load_dataset

from transformers import (
    pipeline,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    TFAutoModelForSeq2SeqLM,
    create_optimizer,
    AdamWeightDecay,
)

## Get Data

In [18]:
billsum = load_dataset("billsum", split="ca_test")

Downloading builder script: 3.62kB [00:00, 370kB/s]                    
Downloading metadata: 1.75kB [00:00, 668kB/s]                  
Using custom data configuration default


Downloading and preparing dataset billsum/default (download: 64.14 MiB, generated: 259.80 MiB, post-processed: Unknown size, total: 323.94 MiB) to /Users/Lauren/.cache/huggingface/datasets/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959...


Downloading data: 100%|██████████| 67.3M/67.3M [00:55<00:00, 1.20MB/s]
                                                                                      

Dataset billsum downloaded and prepared to /Users/Lauren/.cache/huggingface/datasets/billsum/default/3.0.0/d1e95173aed3acb71327864be74ead49b578522e4c7206048b2f2e5351b57959. Subsequent calls will reuse this data.


In [19]:
# Hold out 20% of the examples as a test set
billsum = billsum.train_test_split(test_size=0.2)

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 1938 of the Civil Code is amended to read:\n1938.\n(a) A commercial property owner or lessor shall state on every lease form or rental agreement executed on or after January 1, 2016, whether or not the subject premises have undergone inspection by a Certified Access Specialist (CASp).\n(b) If the subject premises have undergone inspection by a CASp and, to the best of the commercial property owner’s or lessor’s knowledge, there have been no modifications or alterations completed or commenced between the date of the inspection and the date of the lease or rental agreement which have impacted the subject premises’ compliance with construction-related accessibility standards, the commercial property owner or lessor shall provide, prior to execution of the lease or rental agreement, a copy of any report prepared by the CASp with an agreement from the prospective lessee or tenant that information i
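To make `test_size=0.2` concrete, here is a minimal pure-Python sketch of a shuffled train/test split over row indices. It is not the `datasets` implementation: the function name `split_indices`, the fixed seed, and rounding the test portion up are assumptions made for illustration.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and cut off a test portion.

    Assumption: the test portion is rounded up; the library's exact
    rounding and shuffling may differ.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return {"train": indices[n_test:], "test": indices[:n_test]}

# Hypothetical dataset of 10 rows: 8 go to train, 2 to test.
splits = split_indices(n_rows=10, test_size=0.2)
print(len(splits["train"]), len(splits["test"]))
```

Every row lands in exactly one of the two splits, which is what makes the test split usable as held-out validation data during training.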

In [21]:
billsum["train"][0].keys()

dict_keys(['text', 'summary', 'title'])

## Preprocess

In [23]:
# Load the T5 tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")

Downloading: 100%|██████████| 1.17k/1.17k [00:00<00:00, 451kB/s]
Downloading: 100%|██████████| 773k/773k [00:00<00:00, 922kB/s] 
Downloading: 100%|██████████| 1.32M/1.32M [00:00<00:00, 2.05MB/s]


In [24]:
prefix = "summarize: "

def preprocess_function(examples):
    """
    The preprocessing function needs to:
    1. Prefix each input with a prompt so T5 knows this is a summarization task. Models that handle multiple NLP tasks often need a task-specific prompt.
    2. Tokenize the summaries inside the as_target_tokenizer() context manager so they are encoded as targets (labels).
    3. Truncate inputs to at most 1024 tokens and summaries to at most 128 tokens (max_length with truncation=True).
    Note: newer transformers releases deprecate as_target_tokenizer() in favor of passing text_target= to the tokenizer.
    From: https://huggingface.co/docs/transformers/tasks/summarization#preprocess
    """
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
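To see the shape of what `preprocess_function` returns without downloading a tokenizer, the sketch below swaps in a toy whitespace "tokenizer". Everything here is hypothetical: `toy_tokenize` and `toy_preprocess` are stand-ins, the ids are just word lengths, the example summary is invented, and a real T5 tokenizer produces subword ids and appends an end-of-sequence token.

```python
def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a tokenizer: one fake id per whitespace-separated word."""
    words = text.split()[:max_length]          # mimic truncation=True
    input_ids = [len(word) for word in words]  # fake ids, for illustration only
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

def toy_preprocess(examples, prefix="summarize: ", max_input=8, max_target=4):
    """Prefix the inputs, tokenize inputs and summaries, and attach the labels."""
    inputs = [prefix + doc for doc in examples["text"]]
    encoded = [toy_tokenize(text, max_input) for text in inputs]
    targets = [toy_tokenize(summary, max_target) for summary in examples["summary"]]
    return {
        "input_ids": [e["input_ids"] for e in encoded],
        "attention_mask": [e["attention_mask"] for e in encoded],
        "labels": [t["input_ids"] for t in targets],
    }

# A batch is a dict of lists; the summary text is made up.
batch = {
    "text": ["The people of the State of California do enact as follows"],
    "summary": ["A bill about leases and inspections"],
}
features = toy_preprocess(batch)
print(features)
```

The result has one list per column (`input_ids`, `attention_mask`, `labels`), with inputs and labels truncated independently.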

Use the 🤗 Datasets `map` method to apply the preprocessing function over the entire dataset. Setting `batched=True` speeds this up by processing multiple elements of the dataset at once.

In [25]:
tokenized_billsum = billsum.map(preprocess_function, batched=True)

100%|██████████| 1/1 [00:05<00:00,  5.42s/ba]
100%|██████████| 1/1 [00:01<00:00,  1.18s/ba]
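With `batched=True`, the mapped function receives a dict of lists (one list per column) instead of a single row, which is why `preprocess_function` indexes `examples["text"]` as a list. The sketch below imitates that behavior on a plain list of rows; `toy_map` is a hypothetical, simplified stand-in and not the `datasets` implementation.

```python
def toy_map(rows, function, batched=False, batch_size=1000):
    """Hypothetical, simplified stand-in for Dataset.map over a list of dict rows."""
    if not batched:
        # Unbatched: the function sees one row (a dict of values) at a time.
        return [{**row, **function(row)} for row in rows]
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched: the function sees a dict of lists (column name -> values).
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        for i, row in enumerate(chunk):
            new_columns = {key: values[i] for key, values in result.items()}
            out.append({**row, **new_columns})
    return out

calls = []  # records how many rows each call received

def add_length(batch):
    calls.append(len(batch["text"]))
    return {"n_chars": [len(text) for text in batch["text"]]}

rows = [{"text": "a"}, {"text": "bb"}, {"text": "ccc"}]
mapped = toy_map(rows, add_length, batched=True, batch_size=2)
print(calls)
print(mapped)
```

Three rows with a batch size of 2 result in two calls, which is where the speedup comes from when the function (such as a fast tokenizer) handles lists efficiently.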


Load the pretrained `t5-small` checkpoint with a sequence-to-sequence language modeling head.

In [30]:
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

Downloading: 100%|██████████| 231M/231M [00:11<00:00, 20.9MB/s] 
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Use DataCollatorForSeq2Seq to create a batch of examples. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting `padding=True`, dynamic padding is more efficient.
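Dynamic padding can be sketched in a few lines of plain Python: each batch is padded only to the length of its own longest example. The function `pad_batch` is a hypothetical illustration, not the collator's code. It assumes a pad token id of 0 for inputs and -100 for labels, the value conventionally ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest example in this batch only."""
    max_inputs = max(len(f["input_ids"]) for f in features)
    max_labels = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_inputs - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        label_pad = [label_pad_token_id] * (max_labels - len(f["labels"]))
        batch["labels"].append(f["labels"] + label_pad)
    return batch

# Two toy examples with different lengths (ids are made up).
features = [
    {"input_ids": [5, 6, 7], "labels": [9]},
    {"input_ids": [8], "labels": [3, 4]},
]
padded = pad_batch(features)
print(padded)
```

A batch of short examples therefore stays short, whereas padding everything to one global maximum in the tokenizer would waste computation on padding tokens.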

In [31]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

In [32]:
data_collator

DataCollatorForSeq2Seq(tokenizer=PreTrainedTokenizerFast(name_or_path='t5-small', vocab_size=32100, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<ext

## Train

Convert each split to the `tf.data.Dataset` format with `to_tf_dataset`. Specify the input and label columns in `columns`, whether to shuffle the dataset order, the batch size, and the data collator.

In [33]:
tf_train_set = tokenized_billsum["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = tokenized_billsum["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up the optimizer. This example uses `AdamWeightDecay` with a constant learning rate and a weight decay rate; no learning rate schedule is applied.

In [35]:
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
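For intuition about what `weight_decay_rate` does, here is a scalar sketch of one Adam step with decoupled weight decay. It is an illustration under assumptions, not the optimizer's source: the function name `adamw_step`, the beta and epsilon defaults, and the exact order of operations are chosen for the example.

```python
import math

def adamw_step(param, grad, m, v, step, lr=2e-5, weight_decay_rate=0.01,
               beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One scalar Adam update with decoupled weight decay (illustrative sketch)."""
    # Decoupled decay: shrink the parameter directly, independent of the gradient.
    param = param - lr * weight_decay_rate * param
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1 - beta_1 ** step)
    v_hat = v / (1 - beta_2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v

# With a zero gradient only the decay term acts on the parameter.
decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a nonzero gradient the Adam update is applied on top of the decay.
updated, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)
print(decayed, updated)
```

Because the decay is applied to the parameter itself rather than added to the gradient, its strength does not get rescaled by Adam's per-parameter normalization.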

Configure the model for training.

In [36]:
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Call `fit` to fine-tune the model.

In [37]:
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7ff7f8ac7640>

Save the fine-tuned model to a local directory.

In [43]:
model.save_pretrained(save_directory='./my-t5-small')

Load the saved model from the local directory.

In [44]:
pretmodel = TFAutoModelForSeq2SeqLM.from_pretrained("./my-t5-small")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at ./my-t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Predict

In [45]:
summarizer = pipeline("summarization", model=pretmodel, tokenizer=tokenizer, framework="tf")

In [48]:
summarizer("I love the song I Just Wanna Be a Pickle by Natalie Burdick. She sings, I just wanna be a pickle. Get my booty in that brine.")

Your max_length is set to 200, but you input_length is only 45. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)


[{'summary_text': 'I love the song I Just Wanna Be a Pickle by Natalie Burdick . Get my booty in that brine.'}]