# Hugging Face - Pipeline

- Source of knowledge: [YouTube video by AssemblyAI](https://www.youtube.com/watch?v=QEaBAZQCtwE)
- [Hugging Face official documentation](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task)

**Pipelines** are an easy way to use models for inference. They are objects that abstract most of the library's complex code behind a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering.

A pipeline does four things for us:

1. Preprocessing: it tokenizes the input text
2. Feeds the model with the data we pass in (a `str`, a list, or a dataset)
3. Applies the model
4. Postprocessing: returns the result in the form we would expect
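
The four steps above can be sketched by hand. This is only an illustration, not the library's actual implementation: the function names `postprocess` and `manual_sentiment` are made up for this note, and the model name is the default one reported by the sentiment-analysis pipeline.

```python
import math


def postprocess(logits, id2label):
    """Step 4: turn raw logits into the {'label', 'score'} dict a pipeline returns."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]           # softmax
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def manual_sentiment(text, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the pipeline stages by hand (downloads the model on first use)."""
    # Imported lazily so postprocess() can be tried without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")      # 1. preprocessing
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()    # 2-3. feed and apply the model
    return postprocess(logits, model.config.id2label)  # 4. postprocessing
```

Calling `manual_sentiment("some text")` should give a result of the same shape as one entry of the pipeline output, which is the point of the abstraction: one call to `pipeline(...)` replaces all of the code above.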


#### Examples:
1. Sentiment Analysis
1. Text classification
1. Text generation
1. Zero Shot Classification
1. With Dataset

See all parameters `pipeline()` accepts [here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.pipeline.task).

### Sentiment Analysis

In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# with a str
res = classifier("I have been waiting for a HuggingFace course my whole life!")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [17]:
print(res)
print("\nLabel:", res[0]['label'])

[{'label': 'POSITIVE', 'score': 0.9899705052375793}]

Label: POSITIVE
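
The warning printed above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the model name and revision are exactly the ones shown in the warning; `build_classifier` and `keep_confident` are hypothetical helper names, and the 0.9 threshold is an arbitrary choice for illustration.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"  # revision reported in the warning; pin whichever one you validated


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    from transformers import pipeline  # lazy import: only needed when the model is built

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=REVISION)


def keep_confident(results, threshold=0.9):
    """Keep only predictions whose score reaches the threshold."""
    return [r for r in results if r["score"] >= threshold]


# The pipeline returns a list of {'label', 'score'} dicts, like the output above.
sample = [{"label": "POSITIVE", "score": 0.9899705052375793}]
print(keep_confident(sample))
```

Pinning both values means a later change to the library's default model should not silently change the results.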


### Text classification
With the default model:

In [18]:
# with a list
reviews = ["This restaurant is awesome", "I think I won't be back to this restaurant"]

pipe = pipeline("text-classification")
pipe(reviews)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998743534088135},
 {'label': 'NEGATIVE', 'score': 0.9965842962265015}]

The cell above uses the default text classification model from Hugging Face. As the warning shows, it is `distilbert-base-uncased-finetuned-sst-2-english`. The Hub hosts many other text classification models (see [here](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending)), so next we will specify a couple of them.

Examples with specific text-classification models:

In [20]:
# If you want to use a specific model from the hub you can ignore the task if the model on the hub already defines it:

pipe = pipeline(model="microsoft/deberta-large-mnli")
pipe(reviews)

Some weights of the model checkpoint at microsoft/deberta-large-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

[{'label': 'NEUTRAL', 'score': 0.9744240641593933},
 {'label': 'NEUTRAL', 'score': 0.9822401404380798}]

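Both reviews come back as `NEUTRAL` because `microsoft/deberta-large-mnli` is a natural language inference model: its labels are `CONTRADICTION`, `NEUTRAL` and `ENTAILMENT`, not sentiment classes. Now let's use a second text classification model: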

In [21]:
# If you want to use a specific model from the hub you can ignore the task if the model on the hub already defines it:

pipe = pipeline(model="ProsusAI/finbert")
pipe(reviews)

[{'label': 'neutral', 'score': 0.8923954963684082},
 {'label': 'neutral', 'score': 0.5531617999076843}]
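
Different models can return different label sets for the same input, so it helps to look at every label a model can emit rather than only the top one. The sketch below assumes a reasonably recent `transformers` release in which passing `top_k=None` to a text classification pipeline returns a score for every label; `full_scores` and `label_set` are hypothetical helper names, and the numbers in `example` are invented purely to show the shape of the data.

```python
def full_scores(model_name, texts):
    """Return, for each text, the scores of all labels the model knows."""
    from transformers import pipeline  # lazy import: downloads the model on first use

    pipe = pipeline(model=model_name)
    return pipe(texts, top_k=None)


def label_set(all_scores):
    """Collect the label names from one text's full score list."""
    return sorted(entry["label"] for entry in all_scores)


# Made-up scores for a single text, only to illustrate the expected structure.
example = [
    {"label": "neutral", "score": 0.89},
    {"label": "positive", "score": 0.07},
    {"label": "negative", "score": 0.04},
]
print(label_set(example))
```

Checking the label set first makes it easier to judge whether a model was trained for the kind of text being classified, here restaurant reviews.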

### Text generation

With the default model, `gpt2`, as shown in the warning:

In [22]:
course_description = "In this course, you will learn the principles"

generator = pipeline("text-generation")

response = generator(course_description)

print(response)

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, you will learn the principles of the art of the drawing and draw as well as learn important concepts of drawing. Finally, you will develop valuable new drawing techniques.\n\nOur course covers the basics of drawing for children. For all'}]


Now with a chosen model (`distilgpt2`):

In [35]:
generator = pipeline("text-generation", model="distilgpt2")

response = generator(
    course_description,
    max_length=30,
    num_return_sequences=2,  # number of sequences to generate
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
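
The text-generation pipeline returns a list with one `{'generated_text': ...}` dict per requested sequence, and each text starts with the prompt. A small sketch for separating the new text from the prompt; `continuations` is a hypothetical helper, and the sample output is shortened from the `gpt2` result shown earlier.

```python
def continuations(prompt, outputs):
    """Return only the newly generated part of each sequence, without the prompt."""
    texts = []
    for out in outputs:
        text = out["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


prompt = "In this course, you will learn the principles"
sample = [{"generated_text": prompt + " of the art of the drawing"}]
for i, text in enumerate(continuations(prompt, sample), start=1):
    print(f"Option {i}: {text}")
```

The pipeline also accepts `return_full_text=False`, which should have a similar effect at generation time; the helper above is just a way to do it after the fact.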


### Zero Shot Classification

We will take the output generated by the previous model and check its domain with a different model: the default `zero-shot-classification` model, which is `facebook/bart-large-mnli`:

In [37]:
course_description = response[0]['generated_text']  # reuse the first generated sequence

classifier = pipeline("zero-shot-classification")

response = classifier(course_description,
                      candidate_labels=["education", "programming", "law", "finance"])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [38]:
print(response)

{'sequence': 'In this course, you will learn the principles of the theory of the universe. Our universe is not built upon the fundamental universe: it evolves with all', 'labels': ['education', 'law', 'programming', 'finance'], 'scores': [0.8442307114601135, 0.11178430914878845, 0.02998129092156887, 0.014003695920109749]}
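
The result is a dict in which `labels` and `scores` are parallel lists sorted from most to least likely. A short sketch for pairing them up, using the scores from the output above; `ranked_labels` is a hypothetical helper and the 0.1 cut-off is an arbitrary choice.

```python
def ranked_labels(result, min_score=0.0):
    """Pair each candidate label with its score, keeping those above min_score."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= min_score
    ]


result = {
    "labels": ["education", "law", "programming", "finance"],
    "scores": [0.8442307114601135, 0.11178430914878845,
               0.02998129092156887, 0.014003695920109749],
}
for label, score in ranked_labels(result, min_score=0.1):
    print(f"{label}: {score:.2f}")
```

By default the scores are normalized across the candidate labels so they sum to 1. If several labels can apply at once, the pipeline call accepts `multi_label=True`, which scores each label independently.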


### With Dataset (to be done)

To iterate over full datasets it is recommended to use a dataset directly. This means you don’t need to allocate the whole dataset at once, nor do you need to do batching yourself. This should work just as fast as custom loops on GPU. If it doesn’t, don’t hesitate to create an issue.

Note: running the cell below takes a long time, so the code is kept commented out for future reference.

In [16]:
import datasets
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

# pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
# dataset = datasets.load_dataset("superb", name="asr", split="test")

# # KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item
# # as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset
# for out in tqdm(pipe(KeyDataset(dataset, "file"))):
#     print(out)
#     # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
#     # {"text": ....}
#     # ....
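
A lighter way to try the same idea is to stream plain text through a text classification pipeline with a generator instead of the speech dataset above. This is a sketch under assumptions: `iter_texts` and `classify_stream` are hypothetical names, the records are invented, and `batch_size=8` is an arbitrary value.

```python
def iter_texts(records, key="text"):
    """Yield one field from each record, so the data is never held in a second list."""
    for record in records:
        yield record[key]


def classify_stream(records, batch_size=8):
    """Lazily classify records; the model is loaded only when iteration starts."""
    from transformers import pipeline  # lazy import: downloads the default model

    pipe = pipeline("text-classification")
    # A pipeline accepts a generator and yields results one by one,
    # handling the batching internally.
    for out in pipe(iter_texts(records), batch_size=batch_size):
        yield out


records = [
    {"text": "This restaurant is awesome"},
    {"text": "I think I won't be back to this restaurant"},
]
print(list(iter_texts(records)))
```

Iterating over `classify_stream(records)` should then produce one prediction per record without building the full input list up front.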