<a href="https://colab.research.google.com/github/hardiksahi/MachineLearning/blob/HS-hf_transformers_course/courses/huggingface_transformers_course/notebooks/1_Chapter1_Section3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Follow: https://huggingface.co/learn/llm-course/en/chapter1/3?fw=pt
## NB: https://github.com/huggingface/notebooks/blob/main/course/en/chapter1/section3.ipynb
## Name: Transformers, what can they do?

Notes:
1. The 🤗 Transformers library provides the functionality to create and use the pretrained models shared on the Model Hub.
2. The Model Hub contains millions of pretrained models that anyone can download and use.

Pipeline API:
1. GitHub: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py
2. Encapsulates the pre-processing, model inference, and post-processing steps for a chosen NLP use case such as classification or summarization.
3. List of pretrained models for different use cases: https://huggingface.co/models
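
To make the three stages concrete, here is a minimal pure-Python sketch. The tokenizer, the "model" and its weights are toy stand-ins invented for illustration (they are not the library's implementation); only the overall shape, pre-process then forward pass then post-process, mirrors what `pipeline` wraps.

```python
import math

# Toy per-token weights: hypothetical values, for illustration only.
TOY_WEIGHTS = {"happy": 2.0, "great": 1.5, "sad": -2.0, "terrible": -2.5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: turn raw text into model inputs (here: lowercase word tokens)."""
    return text.lower().split()


def forward(tokens):
    """Stage 2: map inputs to raw scores (logits), one per label."""
    total = sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)
    return [-total, total]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Stage 3: softmax the logits and return the best label with its probability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as the real pipeline object does internally."""
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("I am very happy today"))
```

The returned dictionary has the same `label`/`score` shape as the real text-classification output further down, which is why that output can be read as "most probable class and its softmax probability".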

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [3]:
from transformers import pipeline

### Task1: Text classification
- Model list: https://huggingface.co/models?pipeline_tag=text-classification&sort=trending
- Default model: "distilbert/distilbert-base-uncased-finetuned-sst-2-english", as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L200
- Background: DistilBERT base (uncased) is a distilled version of BERT base; this checkpoint is fine-tuned on the SST-2 binary sentiment classification dataset (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

In [4]:
classification_pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")


Device set to use cuda:0


In [7]:
classification_pipeline(["I am very happy today", "My vacation starts from next week", "Gruesome attack in Kashmir"])

[{'label': 'POSITIVE', 'score': 0.9998797178268433},
 {'label': 'POSITIVE', 'score': 0.9682051539421082},
 {'label': 'NEGATIVE', 'score': 0.9973526000976562}]

### Task2: Zero shot classification
1. Model list: https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=trending
2. Default model: facebook/bart-large-mnli, as listed at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L313C25-L313C49
3. Background:
- BART is a seq2seq model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder (https://huggingface.co/facebook/bart-large)
- This checkpoint is BART large fine-tuned on the MNLI dataset (https://huggingface.co/facebook/bart-large-mnli)
4. MNLI pairs a premise with a hypothesis and labels each pair as contradiction, neutral, or entailment.
5. Zero-shot classification framed as an NLI task: https://joeddav.github.io/blog/2020/05/29/ZSL.html
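
The NLI framing can be sketched as follows. Each candidate label is turned into a hypothesis, the model scores how strongly the input (the premise) entails it, and the entailment scores are normalized across labels. The hypothesis template matches the pipeline's documented default (`"This example is {}."`); the entailment logits are made-up numbers standing in for real model outputs.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Create one NLI hypothesis per candidate label."""
    return [template.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score (single-label case)."""
    top = max(entailment_logits)
    exps = [math.exp(x - top) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(candidate_labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["education", "business", "politics", "society"]
print(build_hypotheses(labels))

# Hypothetical entailment logits, one per label (not produced by a real model).
toy_logits = [-1.0, -0.5, 0.2, 2.5]
for label, score in rank_labels(labels, toy_logits):
    print(f"{label}: {score:.3f}")
```

Because of the softmax across labels, the scores for one input sum to 1, and the labels come back sorted from most to least likely.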


In [9]:
zero_shot_classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")


Device set to use cuda:0


In [12]:
zero_shot_classifier(["what mahatama gandhi faced in south africa is a system of apartheid", "Israel is committing a genocide in Gaza"], candidate_labels=["education", "business", "politics", "society"])

[{'sequence': 'what mahatama gandhi faced in south africa is a system of apartheid',
  'labels': ['society', 'politics', 'business', 'education'],
  'scores': [0.8773439526557922,
   0.05702158436179161,
   0.0399077832698822,
   0.025726648047566414]},
 {'sequence': 'Israel is committing a genocide in Gaza',
  'labels': ['society', 'politics', 'business', 'education'],
  'scores': [0.46537885069847107,
   0.23787003755569458,
   0.2188195139169693,
   0.07793162763118744]}]

### Task3: Text generation
1. Model list: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending
2. Default model: openai-community/gpt2, as listed at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L304C39-L304C60
3. GPT-2 was trained and released by OpenAI; the checkpoint is hosted on the Hub under the openai-community organization (https://huggingface.co/openai-community/gpt2)
4. It is a decoder-only autoregressive model.


In [32]:
generation_pipeline = pipeline(task="text-generation", model="openai-community/gpt2")

Device set to use cuda:0


In [33]:
generation_pipeline("Sometimes I wonder that the world is moving fast towards fascism", num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Sometimes I wonder that the world is moving fast towards fascism as its own version of the Russian revolution. Of course the US has a record of its own taking extreme positions but the international community has refused to take them seriously. What was once just a fringe'},
 {'generated_text': 'Sometimes I wonder that the world is moving fast towards fascism," she said. "And if you are serious, would you like to see us back to the day with a national security policy focused on Russia to defeat the country, and not to win us'},
 {'generated_text': 'Sometimes I wonder that the world is moving fast towards fascism, so quickly that we have to try it out. The whole "war on terror" against those whose ideologies are more closely tied to the Third Reich.\n\nIf the people at NATO want'},
 {'generated_text': 'Sometimes I wonder that the world is moving fast towards fascism, the state is not so far from that but it can\'t do it much better".\n\nMeanwhile, the EU\'s economic

In [25]:
generation_pipeline2 = pipeline(task="text-generation", model="HuggingFaceTB/SmolLM2-360M")

Device set to use cuda:0


In [30]:
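generation_pipeline2(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
    do_sample=True,  # sampling is required to return more than one sequence
)

# Without `do_sample=True`, the call above fails with:
# ValueError: Greedy methods without beam search do not support `num_return_sequences` different than 1 (got 2).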
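
The restriction on `num_return_sequences` comes from how greedy decoding works: it always takes the most likely next token, so every returned sequence would be identical. Sampling draws from the distribution instead, so repeated draws can differ. The toy next-token table below is invented for illustration and stands in for a real language model.

```python
import random

# Hypothetical next-token distributions: token -> [(next_token, probability), ...]
TOY_LM = {
    "<s>": [("the", 0.6), ("a", 0.4)],
    "the": [("cat", 0.5), ("dog", 0.5)],
    "a": [("cat", 0.7), ("dog", 0.3)],
    "cat": [("sleeps", 0.9), ("runs", 0.1)],
    "dog": [("runs", 0.8), ("sleeps", 0.2)],
}


def generate(do_sample, rng=None, max_new_tokens=3):
    """Generate tokens greedily (argmax) or by sampling from the distribution."""
    token, output = "<s>", []
    for _ in range(max_new_tokens):
        choices = TOY_LM.get(token)
        if choices is None:  # no known continuation for this token
            break
        if do_sample:
            tokens, probs = zip(*choices)
            token = rng.choices(tokens, weights=probs, k=1)[0]
        else:
            token = max(choices, key=lambda pair: pair[1])[0]
        output.append(token)
    return " ".join(output)


# Greedy decoding is deterministic: asking for several sequences yields duplicates.
greedy = {generate(do_sample=False) for _ in range(5)}
print(greedy)

# Sampling can yield distinct sequences across draws.
rng = random.Random(0)
sampled = {generate(do_sample=True, rng=rng) for _ in range(20)}
print(sampled)
```

With the real pipeline, the equivalent switch is passing `do_sample=True` in the call. The earlier GPT-2 call likely succeeded with `num_return_sequences=5` because that checkpoint's configuration enables sampling for text generation by default.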

### Task4: Fill mask
1. Model list: https://huggingface.co/models?pipeline_tag=fill-mask&sort=trending
2. Default model: distilbert/distilroberta-base, as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L266C25-L266C54
3. Background: RoBERTa base is pretrained with the masked language modeling (MLM) objective; distilbert/distilroberta-base is a distilled version of it.


In [31]:
fill_mask_pipeline = pipeline(task="fill-mask", model="distilbert/distilroberta-base")


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [35]:
fill_mask_pipeline(["The world is a <mask> place to be"], top_k=2)

[{'score': 0.1633252501487732,
  'token': 4613,
  'token_str': ' wonderful',
  'sequence': 'The world is a wonderful place to be'},
 {'score': 0.1583014577627182,
  'token': 6587,
  'token_str': ' terrible',
  'sequence': 'The world is a terrible place to be'}]
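
`top_k` controls how many of the highest-probability vocabulary entries are returned for the masked position. Below is a minimal sketch of that selection step using a tiny hand-written distribution: the scores for ' wonderful' and ' terrible' are rounded from the output above, and every other entry is hypothetical filler.

```python
import heapq

# Toy probability distribution over candidate tokens for the masked position.
# Only ' wonderful' and ' terrible' come (rounded) from the real output;
# the remaining entries are hypothetical.
mask_probs = {
    " wonderful": 0.1633,
    " terrible": 0.1583,
    " beautiful": 0.09,
    " dangerous": 0.07,
    " strange": 0.03,
}


def top_k_fills(template, probs, k, mask_token="<mask>"):
    """Return the k most probable fills, each with its completed sequence."""
    best = heapq.nlargest(k, probs.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            # Each toy token carries its own leading space (as in the output above),
            # so the space before the mask is replaced together with the mask.
            "sequence": template.replace(" " + mask_token, token),
        }
        for token, score in best
    ]


for fill in top_k_fills("The world is a <mask> place to be", mask_probs, k=2):
    print(fill)
```

Note that the mask token is model-specific: the RoBERTa-based default expects `<mask>`, while BERT checkpoints expect `[MASK]`.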

In [36]:
fill_mask_pipeline2 = pipeline(task="fill-mask", model="google-bert/bert-base-cased")


Device set to use cuda:0


In [37]:
fill_mask_pipeline2(["The world is a [MASK] place to be"], top_k=2)

[{'score': 0.17171472311019897,
  'token': 1363,
  'token_str': 'good',
  'sequence': 'The world is a good place to be'},
 {'score': 0.12640348076820374,
  'token': 1632,
  'token_str': 'great',
  'sequence': 'The world is a great place to be'}]

In [39]:
fill_mask_pipeline2(["I have a wierd feeling when I [MASK] in dark"], top_k=2)

[{'score': 0.4612177908420563,
  'token': 2647,
  'token_str': 'walk',
  'sequence': 'I have a wierd feeling when I walk in dark'},
 {'score': 0.09169616550207138,
  'token': 1301,
  'token_str': 'go',
  'sequence': 'I have a wierd feeling when I go in dark'}]

### Task5: Named Entity Recognition + POS tagging
1. Model list: https://huggingface.co/models?pipeline_tag=token-classification&sort=trending
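2. Default model: dbmdz/bert-large-cased-finetuned-conll03-english, as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L212
3. No model card is available for the default; the checkpoint name suggests BERT large (cased) fine-tuned on the CoNLL-2003 English NER dataset.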


In [42]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner_pipeline = pipeline("token-classification", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

Device set to use cuda:0


In [43]:
ner_pipeline("My name is Hardik Sahi. I am currently in India but I live in York, Canada.")

[{'entity_group': 'PER',
  'score': np.float32(0.99887145),
  'word': 'Hardik Sahi',
  'start': 11,
  'end': 22},
 {'entity_group': 'LOC',
  'score': np.float32(0.9996455),
  'word': 'India',
  'start': 42,
  'end': 47},
 {'entity_group': 'LOC',
  'score': np.float32(0.9938917),
  'word': 'York',
  'start': 62,
  'end': 66},
 {'entity_group': 'LOC',
  'score': np.float32(0.9995478),
  'word': 'Canada',
  'start': 68,
  'end': 74}]
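
Entity grouping merges consecutive tokens that belong to the same entity into one span, which is why "Hardik Sahi" comes back as a single `PER` entry. The sketch below shows the idea on hand-written token-level tags in the common B-/I- scheme. The tags are assumed for illustration, and the function simplifies what the pipeline's aggregation actually does (it ignores sub-word pieces, scores, and character offsets).

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens sharing an entity type into (type, text) groups."""
    groups = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, entity_type = tag.split("-", 1)
        # A new group starts on a B- tag or whenever the entity type changes.
        starts_new = entity_type is not None and (
            prefix == "B" or entity_type != current_type
        )
        # Close the open group when leaving an entity or starting another one.
        if (entity_type is None or starts_new) and current_words:
            groups.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if entity_type is not None:
            current_type = entity_type
            current_words.append(token)
    if current_words:
        groups.append((current_type, " ".join(current_words)))
    return groups


tokens = ["My", "name", "is", "Hardik", "Sahi", ".", "I", "live", "in", "York", ",", "Canada", "."]
# Hypothetical token-level tags, written by hand for this example.
tags = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]
print(group_entities(tokens, tags))
```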

In [45]:
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")


Device set to use cuda:0


In [47]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
pos_pipeline("My name is Hardik Sahi. I am currently in India but I live in York, Canada.", aggregation_strategy="simple")



[{'entity_group': 'PRON',
  'score': np.float32(0.9994789),
  'word': 'my',
  'start': 0,
  'end': 2},
 {'entity_group': 'NOUN',
  'score': np.float32(0.9953603),
  'word': 'name',
  'start': 3,
  'end': 7},
 {'entity_group': 'AUX',
  'score': np.float32(0.9956742),
  'word': 'is',
  'start': 8,
  'end': 10},
 {'entity_group': 'PROPN',
  'score': np.float32(0.9973943),
  'word': 'hardik sahi',
  'start': 11,
  'end': 22},
 {'entity_group': 'PUNCT',
  'score': np.float32(0.99966085),
  'word': '.',
  'start': 22,
  'end': 23},
 {'entity_group': 'PRON',
  'score': np.float32(0.9994947),
  'word': 'i',
  'start': 24,
  'end': 25},
 {'entity_group': 'AUX',
  'score': np.float32(0.9940321),
  'word': 'am',
  'start': 26,
  'end': 28},
 {'entity_group': 'ADV',
  'score': np.float32(0.9992065),
  'word': 'currently',
  'start': 29,
  'end': 38},
 {'entity_group': 'ADP',
  'score': np.float32(0.99933004),
  'word': 'in',
  'start': 39,
  'end': 41},
 {'entity_group': 'PROPN',
  'score': np.flo

### Task6: Question answering
1. Model list: https://huggingface.co/models?pipeline_tag=question-answering&sort=trending
2. Default model: distilbert/distilbert-base-cased-distilled-squad, as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L224
3. Background: DistilBERT base (cased), a distilled version of BERT base, fine-tuned on the SQuAD v1.1 dataset (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad)


In [48]:
qa_pipeline = pipeline(task="question-answering", model="distilbert/distilbert-base-cased-distilled-squad")


Device set to use cuda:0


In [51]:
qa_pipeline(question="Where do I reside", context="The name that hangs outside my door is Hardik Sahi, Indian citizen and resident of Canada")

{'score': 0.972436785697937, 'start': 83, 'end': 89, 'answer': 'Canada'}
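
The `start` and `end` fields are character offsets into `context`, with `end` exclusive, so the answer can be recovered by slicing. The check below reuses the context and the result printed above; `highlight` is a small hypothetical helper, not part of the library.

```python
context = (
    "The name that hangs outside my door is Hardik Sahi, "
    "Indian citizen and resident of Canada"
)
# Result copied from the output above.
result = {"score": 0.972436785697937, "start": 83, "end": 89, "answer": "Canada"}

# Slicing the context with the returned offsets reproduces the answer text.
span = context[result["start"]:result["end"]]
print(span)


def highlight(text, start, end, marker="**"):
    """Wrap the answer span in markers, e.g. for display in a report."""
    return text[:start] + marker + text[start:end] + marker + text[end:]


print(highlight(context, result["start"], result["end"]))
```

This also shows that the model is extractive: the answer is a span copied from the context, not newly generated text.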

### Task7: Summarization
1. Model list: https://huggingface.co/models?pipeline_tag=summarization&sort=trending
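2. Default model: sshleifer/distilbart-cnn-12-6, as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L277
3. Background: a distilled BART checkpoint for summarization; the `cnn` in the name refers to the CNN/DailyMail dataset (https://huggingface.co/sshleifer/distilbart-cnn-12-6)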


In [52]:
summarizer = pipeline(task="summarization", model="sshleifer/distilbart-cnn-12-6")


Device set to use cuda:0


In [56]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]
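
The run of spaces inside the summary (`civil,    electrical`) appears to be carried over from the indentation of the triple-quoted input. One option, sketched here with the standard library only, is to collapse whitespace before passing the text to the summarizer. `normalize_whitespace` is a hypothetical helper name, and the sample text is an excerpt of the input above.

```python
def normalize_whitespace(text):
    """Collapse newlines, indentation and repeated spaces into single spaces."""
    return " ".join(text.split())


raw = """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined
"""
clean = normalize_whitespace(raw)
print(clean)
```

The cleaned string could then be passed as `summarizer(normalize_whitespace(article))`, where `article` stands for the full input text; the generated summary may differ slightly from the one shown above.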

### Task8: Translation
1. Model list: https://huggingface.co/models?pipeline_tag=translation&sort=trending
2. Default model: google-t5/t5-base, as listed in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L287
3. Background: T5 is an encoder-decoder (seq2seq) model that casts every task as text-to-text (https://huggingface.co/google-t5/t5-base)


In [59]:
translation_pipeline = pipeline(task="translation_en_to_fr", model="google-t5/t5-base")

Device set to use cuda:0


In [63]:
translation_pipeline("My name is hardik Sahi and i live in canada")

[{'translation_text': "Je m'appelle Hardik Sahi et je réside au Canada."}]
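
The task string `translation_en_to_fr` encodes the language pair. Since T5 is a text-to-text model, the request is ultimately expressed as a text prefix on the input (for this pair, `translate English to French: `). The sketch below illustrates that mapping in plain Python; the language table and helper names are assumptions for illustration and not the library's code.

```python
# Hypothetical lookup of language codes to names (only a few entries).
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German", "ro": "Romanian"}


def parse_translation_task(task):
    """Split 'translation_xx_to_yy' into (source_code, target_code)."""
    parts = task.split("_")
    if len(parts) != 4 or parts[0] != "translation" or parts[2] != "to":
        raise ValueError(f"Expected 'translation_XX_to_YY', got {task!r}")
    return parts[1], parts[3]


def t5_style_input(task, text):
    """Prepend a T5-style instruction prefix for the requested language pair."""
    src, tgt = parse_translation_task(task)
    return f"translate {LANGUAGE_NAMES[src]} to {LANGUAGE_NAMES[tgt]}: {text}"


print(t5_style_input("translation_en_to_fr", "My name is hardik Sahi and i live in canada"))
```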