# Hugging Face: Basic

## Transformers library

- The Transformers library provides thousands of pretrained models for tasks across modalities such as text, vision, and audio, with support for PyTorch, TensorFlow, and JAX.

- GitHub: https://github.com/huggingface/transformers

In [None]:
!pip install -q datasets evaluate transformers[sentencepiece]

In [None]:
import transformers

print(f"transformers version: {transformers.__version__}")

transformers version: 4.30.2


## pipeline

- Pipelines offer a simple, high-level API that performs the end-to-end process of using pretrained models for inference.
- There are two categories of pipeline abstractions: the general pipeline() function and task-specific pipelines.
- Pipelines are made of three main steps:
  1. Preprocessing (tokenizer)
  2. Model inference
  3. Post-processing
- Parameters:
  - task: the machine learning task, e.g. "sentiment-analysis"
  - model: a model name on the Hub, a local path, or a pretrained model inheriting from PreTrainedModel (for PyTorch) or TFPreTrainedModel (for TensorFlow)
  - config: the model configuration
  - tokenizer: a tokenizer name or path, or a tokenizer inheriting from PreTrainedTokenizer
  - framework: either "pt" for PyTorch or "tf" for TensorFlow
  - device: -1 for CPU, or a GPU index such as 0; a device string such as "cuda:<device_id>" is also accepted
- doc: https://huggingface.co/docs/transformers/main_classes/pipelines
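
The three steps listed above can be sketched with a toy stand-in written in plain Python. This is not the library's implementation: the vocabulary, the weights, and the function names (preprocess, forward, postprocess, toy_pipeline) are hypothetical and chosen only to mirror the shape of what a text-classification pipeline returns.

```python
import math

# Hypothetical toy vocabulary and per-token weights, for illustration only.
VOCAB = {"[UNK]": 0, "amazing": 1, "terrible": 2, "the": 3, "is": 4}
# One (NEGATIVE, POSITIVE) weight pair per token id.
WEIGHTS = [(0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.0)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids (the tokenizer's job)."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]


def forward(token_ids):
    """Step 2: map token ids to raw scores (logits), one per label."""
    negative = sum(WEIGHTS[i][0] for i in token_ids)
    positive = sum(WEIGHTS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: apply softmax and return the best label with its probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Chain the three steps, as a pipeline object does when it is called."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("The new restaurant in town is amazing!"))
print(toy_pipeline("The customer service was terrible."))
```

A real pipeline replaces each step with a trained tokenizer, a pretrained model, and task-specific post-processing, but the flow of data between the steps is the same.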

In [None]:
from transformers import pipeline

sentiment_analysis = pipeline(task="sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
# Specify the task, model, and tokenizer explicitly

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_analysis = pipeline(task="sentiment-analysis",
                              model=model_name,
                              tokenizer=model_name)


### Explore

In [None]:
# Inspect the returned pipeline object

print(type(sentiment_analysis))
print(dir(sentiment_analysis))

<class 'transformers.pipelines.text_classification.TextClassificationPipeline'>
['__abstractmethods__', '__call__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_batch_size', '_ensure_tensor_on_device', '_forward', '_forward_params', '_num_workers', '_postprocess_params', '_preprocess_params', '_sanitize_parameters', 'binary_output', 'call_count', 'check_model_type', 'default_input_names', 'device', 'device_placement', 'ensure_tensor_on_device', 'feature_extractor', 'forward', 'framework', 'function_to_apply', 'get_inference_context', 'get_iterator', 'image_processor', 'iterate', 'model', 'modelcard', 'postprocess', 'predict', 'preprocess', 'return_all_scores', 'run

In [None]:
# show model detail

sentiment_analysis.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [None]:
# show tokenizer detail

sentiment_analysis.tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

### Infer

In [None]:
# infer

text = "The new restaurant in town is amazing!"
result = sentiment_analysis(text)

print(type(result))
print(result)
print(result[0]['label'])

<class 'list'>
[{'label': 'POSITIVE', 'score': 0.9998883008956909}]
POSITIVE


In [None]:
# infer list

text = ["The new restaurant in town is amazing!", "The customer service was terrible. "]
sentiment_analysis(text)

[{'label': 'POSITIVE', 'score': 0.9998883008956909},
 {'label': 'NEGATIVE', 'score': 0.9997467398643494}]
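
When a list is passed in, the predictions come back in the same order as the inputs, so they can be zipped back onto the original texts. The sketch below reuses the output shown above instead of calling the model; pair_predictions and its min_score cutoff are hypothetical helpers, not part of the Transformers API.

```python
texts = ["The new restaurant in town is amazing!", "The customer service was terrible. "]

# Predictions copied from the output above; in a live session they would
# come from sentiment_analysis(texts).
results = [{'label': 'POSITIVE', 'score': 0.9998883008956909},
           {'label': 'NEGATIVE', 'score': 0.9997467398643494}]


def pair_predictions(texts, results, min_score=0.5):
    """Attach each prediction to its input text.

    min_score is a hypothetical cutoff: predictions below it are dropped.
    """
    paired = []
    for text, pred in zip(texts, results):
        if pred["score"] >= min_score:
            paired.append((text.strip(), pred["label"], round(pred["score"], 4)))
    return paired


for row in pair_predictions(texts, results):
    print(row)
```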

### Load and save

In [None]:
# Save model to local path

model_path = "/content/model"
sentiment_analysis.model.save_pretrained(model_path)
print("Model saved to:", model_path)

Model saved to: /content/model


In [None]:
# Save tokenizer to local path

tokenizer_path = "/content/tokenizer"
sentiment_analysis.tokenizer.save_pretrained(tokenizer_path)
print("Tokenizer saved to:", tokenizer_path)

Tokenizer saved to: /content/tokenizer


In [None]:
# Load model from local path
# Need both model and tokenizer

model_path     = "/content/model"
tokenizer_path = "/content/tokenizer"

# Load the sentiment analysis pipeline with the local model
sentiment_analysis = pipeline("sentiment-analysis",
                              model=model_path,
                              tokenizer=tokenizer_path)

# Test
sentiment_analysis("The new restaurant in town is amazing!")

[{'label': 'POSITIVE', 'score': 0.9998883008956909}]

### Available pipelines

#### Zero-shot classification

- Classifies text against candidate labels supplied at inference time, so the model does not need to have been fine-tuned on those labels.

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}
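
In the output above, the labels are listed from highest to lowest score, and the three scores add up to roughly 1. A small sketch of reading the result, using the dictionary copied from the cell above instead of a live call:

```python
# Result copied from the output above; in a live session it would come
# from classifier(text, candidate_labels=[...]).
result = {'sequence': 'This is a course about the Transformers library',
          'labels': ['education', 'business', 'politics'],
          'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}

# Pair each label with its score and pick the highest-scoring one.
label_scores = dict(zip(result["labels"], result["scores"]))
best_label = max(label_scores, key=label_scores.get)

print(best_label, round(label_scores[best_label], 3))
print("sum of scores:", round(sum(result["scores"]), 3))
```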

#### Text generation

- Auto-completes a prompt by generating the remaining text.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to manage virtual machines to ensure they are fully safe. To understand how you can use Virtual Machines and other virtualized services, we must first understand whether virtual machines are actually safer.\n\nVirtual machines as'}]

#### Mask filling

- Fills in the blanks in a given text.

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.19619806110858917,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052723944187164,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

#### Named entity recognition

- Finds which parts of the input text correspond to entities such as persons, locations, or organizations.

In [None]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
ner("Apple Inc. announced its plans to open a new store in downtown Seattle.")

[{'entity_group': 'ORG',
  'score': 0.9996147,
  'word': 'Apple Inc',
  'start': 0,
  'end': 9},
 {'entity_group': 'LOC',
  'score': 0.99673444,
  'word': 'Seattle',
  'start': 63,
  'end': 70}]
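
Each grouped entity is a dictionary, so the list can be reorganized by entity type with ordinary Python. A minimal sketch using the output copied from the cell above (the name by_type is only illustrative):

```python
from collections import defaultdict

# Entities copied from the output above; in a live session they would
# come from ner(text).
entities = [{'entity_group': 'ORG', 'score': 0.9996147, 'word': 'Apple Inc',
             'start': 0, 'end': 9},
            {'entity_group': 'LOC', 'score': 0.99673444, 'word': 'Seattle',
             'start': 63, 'end': 70}]

# Collect the recognized words under their entity type.
by_type = defaultdict(list)
for entity in entities:
    by_type[entity["entity_group"]].append(entity["word"])

print(dict(by_type))
```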

#### Question answering

- Answers questions using information from a given context.

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6949767470359802, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
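
The start and end values in this output are character offsets into the context, so the answer can be recovered by slicing the context string. A short sketch using the values copied from the output above:

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the output above; in a live session it would come
# from question_answerer(question=..., context=context).
result = {'score': 0.6949767470359802, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

# Slice the context with the returned character offsets.
start, end = result["start"], result["end"]
span = context[start:end]

# Mark the answer inside the original context.
highlighted = context[:start] + "[" + span + "]" + context[end:]

print(span)
print(highlighted)
```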

#### Summarization

- Creates a shorter text while keeping the important aspects of the original.

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
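
The summary above carries stray whitespace: a leading space, spaces before periods, and a run of spaces inherited from the indented input. One possible cleanup is sketched below with regular expressions; tidy is a hypothetical helper, not a Transformers function, and it works on the summary text copied from the output above.

```python
import re

# Summary text copied from the output above.
summary = (
    " America has changed dramatically during recent years . The number of "
    "engineering graduates in the U.S. has declined in traditional engineering "
    "disciplines such as mechanical, civil,    electrical, chemical, and "
    "aeronautical engineering . Rapidly developing economies such as China and "
    "India continue to encourage and advance the teaching of engineering ."
)


def tidy(text):
    """Collapse repeated whitespace and remove spaces before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


print(tidy(summary))
```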

#### Translation

- Translates text from one language to another.

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Bonjour, comment ça va?")

[{'translation_text': 'Hello, how are you?'}]

## Reference:

- Hugging Face NLP Course
  - https://huggingface.co/learn/nlp-course/chapter1/1

- DataCamp: An Introduction to Using Transformers and Hugging Face
  - https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face