# Hugging Face Transformers Cheatsheet with Expanded Functions

## Import Libraries

In [1]:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer, TFAutoModelForSequenceClassification
from transformers import AutoModelForQuestionAnswering, AutoModelForTokenClassification, AutoModelForSeq2SeqLM
from transformers import Trainer, TrainingArguments
import torch


## Text Classification

In [2]:

# Load pre-trained model and tokenizer
classifier = pipeline('sentiment-analysis')

# Use the classifier
result = classifier("Transformers are amazing!")
print("Classification Result:", result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Classification Result: [{'label': 'POSITIVE', 'score': 0.9998725652694702}]


## Question Answering

In [3]:

# Load pre-trained model and tokenizer for question answering
qa_pipeline = pipeline('question-answering')

# Use the pipeline
result = qa_pipeline(question="What is the capital of France?", context="France's capital is Paris.")
print("Answer:", result['answer'])


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Answer: Paris


## Named Entity Recognition (NER)

In [4]:

# Load pre-trained model and tokenizer for NER
ner_pipeline = pipeline('ner')

# Use the pipeline
result = ner_pipeline("My name is John and I live in New York.")
print("Named Entities:", result)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Named Entities: [{'entity': 'I-PER', 'score': 0.9984444, 'index': 4, 'word': 'John', 'start': 11, 'end': 15}, {'entity': 'I-LOC', 'score': 0.9991617, 'index': 9, 'word': 'New', 'start': 30, 'end': 33}, {'entity': 'I-LOC', 'score': 0.9989077, 'index': 10, 'word': 'York', 'start': 34, 'end': 38}]


## Text Generation

In [5]:

# Load pre-trained model and tokenizer for text generation
generator = pipeline('text-generation', model='gpt2')

# Use the generator
result = generator("Once upon a time", max_length=50, num_return_sequences=1)
print("Generated Text:", result[0]['generated_text'])


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text: Once upon a time in The Dark Brotherhood's history, the Dark Brotherhood was called the "Council of the Dark Path," and the Order was known as Council of the Dark Ones. The Dark Brotherhood was led by a powerful man known as the Dark Master


## Text Summarization

In [6]:

# Load pre-trained model and tokenizer for summarization
summarizer = pipeline('summarization')

# Use the summarizer
text = "Transformers are very powerful for natural language processing tasks. They have revolutionized the field with their ability to handle long-range dependencies in text."
result = summarizer(text)
print("Summary:", result[0]['summary_text'])


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Your max_length is set to 142, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)


Summary:  Transformers are very powerful for natural language processing tasks . They have revolutionized the field with their ability to handle long-range dependencies in text . Transformers are more powerful than computers with their long-term dependencies . Transformers can be used to solve complex problems in natural language systems .


## Translation

In [7]:

# Load pre-trained model and tokenizer for translation
translator = pipeline('translation_en_to_fr')

# Use the translator
result = translator("Transformers are very powerful for NLP tasks.")
print("Translation:", result[0]['translation_text'])


No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Translation: Les transformateurs sont très puissants pour les tâches NLP.


## Custom Model and Tokenizer

In [8]:

# Load custom model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize input text
inputs = tokenizer("Transformers are amazing!", return_tensors="pt")
outputs = model(**inputs)

# Get predictions
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print("Predictions:", predictions)


Predictions: tensor([[1.2737e-04, 9.9987e-01]], grad_fn=<SoftmaxBackward0>)
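
The softmax step in the cell above turns the model's two raw logits into probabilities that sum to 1, and the larger one identifies the predicted class. The following is a minimal pure-Python sketch of that computation, without tensors. The logit values and the `ID2LABEL` mapping are illustrative assumptions; with a real model the mapping would normally be read from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a single input; not taken from the run above.
logits = [-4.3, 4.7]

# Assumed label mapping; a loaded model exposes its own as model.config.id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
print(ID2LABEL[best], probs[best])
```

Reading the label from the mapping rather than hard-coding "index 1 is positive" keeps the code correct if a different checkpoint orders its classes differently.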


## Fine-Tuning a Model

In [9]:

# Define model and tokenizer
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=None,  # Replace with your dataset
    eval_dataset=None    # Replace with your dataset
)

# Train the model (needs real datasets in place of the None placeholders above)
# trainer.train()  # Uncomment this line to start training


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
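
The `warmup_steps=500` setting in the `TrainingArguments` above controls how the learning rate ramps up before it starts to decay. The sketch below illustrates a linear warmup followed by linear decay in plain Python. It assumes the default linear scheduler and the default `learning_rate` of 5e-5; `TOTAL_STEPS` is a hypothetical value, since the real number depends on dataset size, batch size, and epochs.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR = 5e-5       # assumed: the documented default learning_rate
WARMUP_STEPS = 500   # from the TrainingArguments above
TOTAL_STEPS = 3000   # hypothetical total number of optimizer steps

# Sample the schedule at a few points: start, mid-warmup, peak, mid-decay, end.
schedule = {
    step: linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    for step in (0, 250, 500, 1750, 3000)
}
for step, lr in schedule.items():
    print(step, lr)
```

One practical consequence: on a small dataset, three epochs at batch size 16 may amount to fewer than 500 optimizer steps, in which case training would end before the warmup finishes and the peak learning rate would never be reached.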



## Using TensorFlow with Transformers

In [12]:

import tensorflow as tf

# Load TensorFlow model and tokenizer
tf_model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(tf_model_name)
tokenizer = AutoTokenizer.from_pretrained(tf_model_name)

# Tokenize input text
inputs = tokenizer("Transformers are amazing!", return_tensors="tf")
outputs = tf_model(inputs)

# Get predictions
predictions = tf.nn.softmax(outputs.logits, axis=-1)
print("Predictions:", predictions)


All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Predictions: tf.Tensor([[1.2737003e-04 9.9987257e-01]], shape=(1, 2), dtype=float32)


## Model for Question Answering

In [13]:

# Load model and tokenizer for question answering
qa_model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
qa_model = AutoModelForQuestionAnswering.from_pretrained(qa_model_name)
tokenizer = AutoTokenizer.from_pretrained(qa_model_name)

# Tokenize input text
inputs = tokenizer("What is the capital of France?", "France's capital is Paris.", return_tensors="pt")
outputs = qa_model(**inputs)

# Get the answer: take the most likely start and end token positions
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1  # +1 because the slice end is exclusive
answer_ids = inputs["input_ids"][0][answer_start:answer_end]
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(answer_ids))
print("Answer:", answer)


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Answer: paris
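
The cell above picks the start and end positions independently with two `argmax` calls. That works when the model is confident, but nothing prevents the chosen end from falling before the chosen start, which would yield an empty answer. The sketch below shows one common variation, not what the cell does: search for the best pair with start ≤ end. The token list and scores are hand-written for illustration, and `best_span` and `max_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end) maximizing start + end score with start <= end < start + max_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Illustrative tokens for the question/context pair; not produced by a tokenizer here.
tokens = ["[CLS]", "what", "is", "the", "capital", "of", "france", "?", "[SEP]",
          "france", "'", "s", "capital", "is", "paris", ".", "[SEP]"]

# Hypothetical scores: the true answer is at index 14, with a distractor
# end score at index 6 that independent argmax would pick instead.
start_scores = [0.1] * len(tokens)
end_scores = [0.1] * len(tokens)
start_scores[14] = 9.0
end_scores[14] = 8.5
end_scores[6] = 8.8

start, end = best_span(start_scores, end_scores)
answer_tokens = tokens[start:end + 1]  # end is inclusive here, so add 1 for the slice
print(answer_tokens)
```

With these scores, two independent `argmax` calls would select start 14 and end 6, and the resulting slice would be empty; the joint search avoids that case at the cost of a nested loop.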


## Model for Token Classification

In [14]:

# Load model and tokenizer for token classification
token_classifier = pipeline('token-classification', model='dbmdz/bert-large-cased-finetuned-conll03-english')

# Use the pipeline
result = token_classifier("Hugging Face Inc. is a company based in New York City.")
print("Token Classification Result:", result)


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Token Classification Result: [{'entity': 'I-ORG', 'score': 0.9992662, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9808882, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.99536246, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9993383, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.9990269, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.9988483, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9991774, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}]
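
The result above is reported per token, so "Hugging Face Inc" arrives as four pieces (including the WordPiece fragment `##gging`) and "New York City" as three. A minimal sketch of merging adjacent tokens of the same entity type back into whole spans follows, using the `index`, `start`, and `end` fields from that output. The helper name `group_entities` and the choice to report the minimum token score per group are assumptions made for illustration; recent versions of the library also accept an `aggregation_strategy` argument on the pipeline that performs a similar grouping.

```python
def group_entities(text, tokens):
    """Merge consecutive tokens that share an entity type into character spans."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # drop the I-/B- prefix
        last = groups[-1] if groups else None
        # Extend the previous group only if the type matches and the token is adjacent.
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            groups.append({"label": label, "start": tok["start"], "end": tok["end"],
                           "last_index": tok["index"], "scores": [tok["score"]]})
    # Recover the surface text from character offsets, which also undoes "##" splitting.
    return [{"label": g["label"], "text": text[g["start"]:g["end"]],
             "score": min(g["scores"])} for g in groups]

text = "Hugging Face Inc. is a company based in New York City."
# Token-level output copied from the result above (the 'word' field is not needed).
tokens = [
    {"entity": "I-ORG", "score": 0.9992662, "index": 1, "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.9808882, "index": 2, "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.99536246, "index": 3, "start": 8, "end": 12},
    {"entity": "I-ORG", "score": 0.9993383, "index": 4, "start": 13, "end": 16},
    {"entity": "I-LOC", "score": 0.9990269, "index": 11, "start": 40, "end": 43},
    {"entity": "I-LOC", "score": 0.9988483, "index": 12, "start": 44, "end": 48},
    {"entity": "I-LOC", "score": 0.9991774, "index": 13, "start": 49, "end": 53},
]

entities = group_entities(text, tokens)
for ent in entities:
    print(ent)
```

Slicing the original string by offsets, rather than concatenating the `word` fields, sidesteps the question of where spaces belong between subword pieces.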


## Model for Sequence-to-Sequence Tasks

In [15]:

# Load model and tokenizer for sequence-to-sequence tasks
seq2seq_model_name = "t5-small"
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained(seq2seq_model_name)
seq2seq_tokenizer = AutoTokenizer.from_pretrained(seq2seq_model_name)

# Tokenize input text
inputs = seq2seq_tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
outputs = seq2seq_model.generate(**inputs)

# Decode the generated sequence
translated_text = seq2seq_tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Translated Text:", translated_text)


Translated Text: Das Haus ist wunderbar.


## Using Pipelines for Multiple Tasks

In [16]:

# Load multiple pipelines
sentiment_pipeline = pipeline('sentiment-analysis')
qa_pipeline = pipeline('question-answering')
ner_pipeline = pipeline('ner')

# Use the pipelines
sentiment_result = sentiment_pipeline("I love using transformers!")
qa_result = qa_pipeline(question="What is the capital of France?", context="France's capital is Paris.")
ner_result = ner_pipeline("Hugging Face Inc. is a company based in New York City.")

print("Sentiment Analysis Result:", sentiment_result)
print("Question Answering Result:", qa_result['answer'])
print("Named Entity Recognition Result:", ner_result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: 

Sentiment Analysis Result: [{'label': 'POSITIVE', 'score': 0.9994327425956726}]
Question Answering Result: Paris
Named Entity Recognition Result: [{'entity': 'I-ORG', 'score': 0.9992662, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9808882, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.99536246, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9993383, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.9990269, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.9988483, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9991774, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}]
