<a href="https://colab.research.google.com/github/hardiksahi/MachineLearning/blob/HS-hf_transformers_course/courses/huggingface_transformers_course/notebooks/1_Chapter1_Section3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Follow: https://huggingface.co/learn/llm-course/en/chapter1/3?fw=pt
## NB: https://github.com/huggingface/notebooks/blob/main/course/en/chapter1/section3.ipynb
## Name: Transformers, what can they do?

Notes:
1. The 🤗 Transformers library provides the functionality to create and use the pretrained models shared on the Model Hub.
2. The Model Hub contains millions of pretrained models that anyone can download and use.

Pipeline API:
1. Github: https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py
2. Encapsulates the pre-processing, model, and post-processing steps for a chosen NLP use case such as classification or summarization.
3. List of pretrained models for different use cases: https://huggingface.co/models
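
The three stages that a pipeline encapsulates can be sketched with a toy example. This is only an illustration: the vocabulary, weights, and function names below are hypothetical stand-ins, not the transformers implementation.

```python
import math

# Toy stand-ins for the three stages; none of this is the transformers API.
VOCAB = {"happy": 0, "sad": 1, "today": 2}  # hypothetical vocabulary
# Hypothetical per-token contributions to the (NEGATIVE, POSITIVE) logits.
WEIGHTS = {0: (-2.0, 2.0), 1: (2.0, -2.0), 2: (0.0, 0.0)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: turn raw text into token ids, dropping unknown words."""
    return [VOCAB[word] for word in text.lower().split() if word in VOCAB]


def model(token_ids):
    """Stage 2: map token ids to raw scores (logits), one per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        logits[0] += WEIGHTS[token_id][0]
        logits[1] += WEIGHTS[token_id][1]
    return logits


def postprocess(logits):
    """Stage 3: softmax the logits and report the best label with its score."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a real pipeline does internally."""
    return postprocess(model(preprocess(text)))


result = toy_pipeline("I am very happy today")
```

The returned dictionary mirrors the `label`/`score` shape that the text-classification pipeline reports, but the numbers here come from the made-up weights above.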

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import pipeline

### Task1: Text classification
- Model list: https://huggingface.co/models?pipeline_tag=text-classification&sort=trending
- Default: "distilbert/distilbert-base-uncased-finetuned-sst-2-english" as mentioned in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L200
- Default: DistilBERT base (uncased), a distilled version of the BERT base model, fine-tuned on the SST-2 binary sentiment classification dataset (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

In [None]:
classification_pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
classification_pipeline(["I am very happy today", "My vacation starts from next week", "Gruesome attack in Kashmir"])

### Task2: Zero shot classification
1. Model list: https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=trending
2. Default: facebook/bart-large-mnli as mentioned at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L313C25-L313C49
3. Default:
- Seq2seq model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder (https://huggingface.co/facebook/bart-large)
- Fine-tuned on the MNLI dataset (https://huggingface.co/facebook/bart-large-mnli)
4. Default: MNLI pairs a premise with a hypothesis and labels each pair as contradiction, neutral, or entailment
5. Zero shot classification as an NLI task: https://joeddav.github.io/blog/2020/05/29/ZSL.html
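
A minimal sketch of the NLI framing, assuming the input text is used as the premise and each candidate label is turned into a hypothesis via a template. The helper names and the entailment logits are made up for illustration; a real run would get the logits from the NLI model.

```python
import math


def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the input (premise) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score."""
    peak = max(entailment_logits)
    exps = [math.exp(value - peak) for value in entailment_logits]
    total = sum(exps)
    scored = [(label, value / total) for label, value in zip(candidate_labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["education", "business", "politics"]
pairs = build_nli_pairs("The parliament passed a new budget.", labels)

# Made-up entailment logits standing in for the NLI model's output.
ranking = rank_labels(labels, [0.1, 1.2, 3.0])
```

Because the labels only appear inside the hypothesis text, no label-specific training is needed, which is what makes the approach "zero shot".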


In [None]:
zero_shot_classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")

In [None]:
zero_shot_classifier(["what Mahatma Gandhi faced in South Africa is a system of apartheid", "Israel is committing a genocide in Gaza"], candidate_labels=["education", "business", "politics", "society"])

### Task3: Text generation
1. Model list: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending
2. Default: openai-community/gpt2 as mentioned at https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L304C39-L304C60
3. Default: GPT-2 model trained by OpenAI and hosted on the Hub under the openai-community organization (https://huggingface.co/openai-community/gpt2)
4. Default: It is a decoder-only autoregressive model
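
"Autoregressive" means the model produces one token at a time and feeds each new token back in as input. A toy sketch of that loop follows; the lookup table is a hypothetical stand-in for a trained decoder and, unlike a real model, it only looks at the last token rather than the whole sequence.

```python
# Hypothetical next-token table standing in for a trained decoder.
NEXT_TOKEN = {
    "the": "world",
    "world": "is",
    "is": "moving",
    "moving": "fast",
    "fast": "<eos>",
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive loop: predict, append, repeat until <eos> or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens


generated = generate(["the"])
truncated = generate(["the"], max_new_tokens=2)
```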


In [None]:
generation_pipeline = pipeline(task="text-generation", model="openai-community/gpt2")

In [None]:
generation_pipeline("Sometimes I wonder that the world is moving fast towards fascism", num_return_sequences=5)

In [None]:
generation_pipeline2 = pipeline(task="text-generation", model="HuggingFaceTB/SmolLM2-360M")

In [None]:
generation_pipeline2(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
    do_sample=True,  # sampling is needed to return more than one distinct sequence
)

### Task4: Fill mask
1. Model list: https://huggingface.co/models?pipeline_tag=fill-mask&sort=trending
2. Default: distilbert/distilroberta-base as mentioned in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L266C25-L266C54
3. Default: RoBERTa base model (pretrained with the masked language modeling objective), distilled to get distilbert/distilroberta-base
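
Different checkpoints expect different mask tokens: the RoBERTa-based default uses `<mask>`, while the BERT checkpoint used later in this section uses `[MASK]`. A small sketch of keeping prompts model-agnostic; the lookup table and helper name are hypothetical, and with a loaded pipeline the token can be read from `tokenizer.mask_token` instead.

```python
# Mask tokens for the two checkpoints used in this section.
MASK_TOKENS = {
    "distilbert/distilroberta-base": "<mask>",
    "google-bert/bert-base-cased": "[MASK]",
}


def make_masked_prompt(template, model_name):
    """Fill the {mask} placeholder with the mask token the given model expects."""
    return template.format(mask=MASK_TOKENS[model_name])


roberta_prompt = make_masked_prompt(
    "The world is a {mask} place to be", "distilbert/distilroberta-base"
)
bert_prompt = make_masked_prompt(
    "The world is a {mask} place to be", "google-bert/bert-base-cased"
)
```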


In [None]:
fill_mask_pipeline = pipeline(task="fill-mask", model="distilbert/distilroberta-base")

In [None]:
fill_mask_pipeline(["The world is a <mask> place to be"], top_k=2)

In [None]:
fill_mask_pipeline2 = pipeline(task="fill-mask", model="google-bert/bert-base-cased")

In [None]:
fill_mask_pipeline2(["The world is a [MASK] place to be"], top_k=2)

In [None]:
fill_mask_pipeline2(["I have a weird feeling when I [MASK] in the dark"], top_k=2)

### Task5: Named Entity Recognition + POS tagging
1. Model list: https://huggingface.co/models?pipeline_tag=token-classification&sort=trending
2. Default: dbmdz/bert-large-cased-finetuned-conll03-english as mentioned in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L212
3. Default: No model card is available
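
Token classification assigns a tag to every token; grouping then merges adjacent tokens that belong to the same entity into one span. A toy sketch of that merge, assuming BIO-style tags (`B-PER`, `I-PER`, `O`) and hand-written predictions; this is an illustration, not the library's aggregation code.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens of the same entity type into a single span."""
    groups = []
    previous_type = None
    for word, tag in tagged_tokens:
        if tag == "O":
            previous_type = None  # a non-entity token ends the current span
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and previous_type == entity_type:
            groups[-1]["word"] += " " + word  # continue the open span
        else:
            groups.append({"entity_group": entity_type, "word": word})
        previous_type = entity_type
    return groups


# Hand-written tags standing in for model predictions.
tagged = [
    ("Hardik", "B-PER"), ("Sahi", "I-PER"), ("lives", "O"), ("in", "O"),
    ("York", "B-LOC"), (",", "O"), ("Canada", "B-LOC"),
]
entities = group_entities(tagged)
```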


In [None]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner_pipeline = pipeline("token-classification", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")

In [None]:
ner_pipeline("My name is Hardik Sahi. I am currently in India but I live in York, Canada.")

In [None]:
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

In [None]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
pos_pipeline("My name is Hardik Sahi. I am currently in India but I live in York, Canada.", aggregation_strategy="simple")

### Task6: Question answering
1. Model list: https://huggingface.co/models?pipeline_tag=question-answering&sort=trending
2. Default: distilbert/distilbert-base-cased-distilled-squad as mentioned in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L224
3. Default: DistilBERT base (cased), a distilled version of the BERT base model, fine-tuned on the SQuAD v1.1 dataset (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad)


In [None]:
qa_pipeline = pipeline(task="question-answering", model="distilbert/distilbert-base-cased-distilled-squad")

In [None]:
qa_pipeline(question="Where do I reside?", context="The name that hangs outside my door is Hardik Sahi, Indian citizen and resident of Canada")

### Task7: Summarization
1. Model list: https://huggingface.co/models?pipeline_tag=summarization&sort=trending
2. Default: sshleifer/distilbart-cnn-12-6 as mentioned in https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L277
3. Default: https://huggingface.co/sshleifer/distilbart-cnn-12-6


In [None]:
summarizer = pipeline(task="summarization", model="sshleifer/distilbart-cnn-12-6")

In [None]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

### Task8: Translation
1. Model list: https://huggingface.co/models?pipeline_tag=translation&sort=trending
2. Default: google-t5/t5-base from https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L287
3. Default: Seq2Seq T5 model (https://huggingface.co/google-t5/t5-base)
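
T5 is a text-to-text model, so the task itself is passed as a text prefix on the input (for example `translate English to French: `). A rough sketch of how a task name such as `translation_en_to_fr` could map to that prefix; the helper and the language table are hypothetical, and the pipeline normally adds the prefix for you.

```python
# Hypothetical mapping from language codes to the names used in T5 prefixes.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German", "ro": "Romanian"}


def t5_translation_prompt(text, source, target):
    """Prepend the T5-style task prefix to the text to be translated."""
    return f"translate {LANGUAGE_NAMES[source]} to {LANGUAGE_NAMES[target]}: {text}"


prompt = t5_translation_prompt("My name is Hardik Sahi", "en", "fr")
```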


In [None]:
translation_pipeline = pipeline(task="translation_en_to_fr", model="google-t5/t5-base")

In [None]:
translation_pipeline("My name is hardik Sahi and i live in canada")