# Working with Hugging Face Pipelines

    *M-Zia-Rasa*


* **Hugging Face** is a leading company in artificial intelligence, known particularly for its contributions to natural language processing (NLP). It develops the Transformers library, one of the most popular libraries for NLP tasks. Here is an overview of what Hugging Face offers:

    * **Pre-trained models**: Thousands of pre-trained models for NLP, vision, and audio tasks, covering a wide range of architectures such as BERT, GPT, and T5.

    * **Datasets**: A library that offers access to thousands of datasets across various domains. It provides tools to load, preprocess, and manipulate datasets efficiently.

    * **Inference API**: Allows users to deploy models for inference without managing the infrastructure. It provides an easy way to integrate models into applications using simple API calls.

    * **Hugging Face Spaces**: A service that allows developers to create and share AI applications powered by Gradio or Streamlit.


* A **pipeline** automates a workflow and ensures that data flows smoothly through the successive stages of processing. In NLP, a pipeline covers pre-processing, running the model, and post-processing of the text.

    * **Pre-processing**: prepares the raw text, for example by cleaning and tokenizing it, so that the model can consume it.
    * **Model instantiation**: loads the model that performs a specific task, such as text generation or question answering.
    * **Post-processing**: converts the model output into a human-readable result.
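
The three stages can be sketched with plain Python functions. This is only a toy illustration, not how the Transformers library implements its pipelines: the function names and the `WEIGHTS` table are hypothetical stand-ins for a real tokenizer and a trained model.

```python
import math

# Hypothetical word weights standing in for a trained model's parameters.
WEIGHTS = {"love": 2.0, "great": 1.5, "boring": -2.0, "hate": -2.5}

def preprocess(text):
    """Pre-processing: lowercase the text and split it into word tokens."""
    return text.lower().split()

def model(tokens):
    """Model: sum the word weights into a single raw score (a logit)."""
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)

def postprocess(logit):
    """Post-processing: turn the logit into a label and a probability."""
    prob_positive = 1.0 / (1.0 + math.exp(-logit))
    if prob_positive >= 0.5:
        return {"label": "POSITIVE", "score": prob_positive}
    return {"label": "NEGATIVE", "score": 1.0 - prob_positive}

def toy_pipeline(text):
    """Chain the three stages, as a pipeline does behind one call."""
    return postprocess(model(preprocess(text)))

print(toy_pipeline("I love this great course"))
print(toy_pipeline("This movie is boring"))
```

The caller only ever sees `toy_pipeline`; the intermediate tokens and the raw score stay hidden, which is the convenience a real pipeline offers.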

* Now we will learn how to use a pipeline to accomplish the following tasks:

    * **Text-Classification**
    * **Zero-Shot-Classification**
    * **Token-Classification (Group-Entity)**
    * **Text-Completion(Mask-Filling)** 
    * **Text-Generation**
    * **Text-Summarization**
    * **Question-Answering**

* **What really happens inside a pipeline?**
   

## Installing important packages

In [35]:
%pip install transformers
%pip install torch
%pip install huggingface-hub
%pip install tf-keras

Note: you may need to restart the kernel to use updated packages.


## Text-Classification

In [36]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I have been waiting for hugging-face course in my entire life")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9991065859794617}]

In [37]:
from transformers import pipeline

classifier = pipeline ("sentiment-analysis")
classifier ([
    "I have been waiting for huggingface course in my entire life",
    "I really love these glasses",
    "I don't want to use these sunglasses",
    "A few people go to the movies nowadays because it is boring somehow"
    ])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9932845830917358},
 {'label': 'POSITIVE', 'score': 0.9998409748077393},
 {'label': 'NEGATIVE', 'score': 0.9934645295143127},
 {'label': 'NEGATIVE', 'score': 0.9972587823867798}]

## Zero-Shot-Classification

In [38]:
from transformers import pipeline

classifier = pipeline ("zero-shot-classification")
classifier(
    "I have been waiting for huggingface course in my entire life",
    candidate_labels = ["business", "education", "politics"],
    )

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'I have been waiting for huggingface course in my entire life',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.4752252995967865, 0.27590247988700867, 0.2488723248243332]}
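
The result is a plain dictionary whose `labels` list is sorted from most to least likely, with `scores` in the same order. A small sketch of reading it, using the output recorded above (the variable names are only illustrative):

```python
# Result copied from the zero-shot output above.
result = {
    "sequence": "I have been waiting for huggingface course in my entire life",
    "labels": ["education", "business", "politics"],
    "scores": [0.4752252995967865, 0.27590247988700867, 0.2488723248243332],
}

# The first entry is the best match because labels are sorted by score.
best_label = result["labels"][0]
best_score = result["scores"][0]

# Pair every label with its score for easier reporting.
ranked = list(zip(result["labels"], result["scores"]))

# In this output the scores form a distribution over the candidate labels.
total = sum(result["scores"])

print(f"Best label: {best_label} ({best_score:.2%})")
for label, score in ranked:
    print(f"{label:10s} {score:.4f}")
print(f"Sum of scores: {total:.4f}")
```

Because the scores are shared among the candidate labels, adding or removing a candidate changes every score, so they are best read as relative preferences among the labels supplied.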

In [39]:
from transformers import pipeline

classifier = pipeline ("zero-shot-classification")
classifier([
    "Elon Mask's neat benefits is four hundred million dollars",
    "Learning English very important nowadays",
    "President Jue Biden was killed yesterday"
    ],

    candidate_labels = ["business", "education", "politics"],
    )

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'sequence': "Elon Mask's neat benefits is four hundred million dollars",
  'labels': ['business', 'politics', 'education'],
  'scores': [0.9523494839668274, 0.03159298002719879, 0.01605755090713501]},
 {'sequence': 'Learning English very important nowadays',
  'labels': ['education', 'business', 'politics'],
  'scores': [0.9403640627861023, 0.05317789688706398, 0.006458061747252941]},
 {'sequence': 'President Jue Biden was killed yesterday',
  'labels': ['politics', 'business', 'education'],
  'scores': [0.9040516018867493, 0.07048234343528748, 0.025466060265898705]}]

## Token-Classification (Group-Entity)

In [40]:
from transformers import pipeline

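# aggregation_strategy="simple" groups sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True
entity = pipeline("ner", aggregation_strategy="simple")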

entity ("John Smith is working in a Microsoft in Untied State")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.99941146,
  'word': 'John Smith',
  'start': 0,
  'end': 10},
 {'entity_group': 'ORG',
  'score': 0.9937976,
  'word': 'Microsoft',
  'start': 27,
  'end': 36}]
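
The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing. A short sketch using the output above (scores omitted for brevity):

```python
text = "John Smith is working in a Microsoft in Untied State"

# Entities copied from the NER output above.
entities = [
    {"entity_group": "PER", "word": "John Smith", "start": 0, "end": 10},
    {"entity_group": "ORG", "word": "Microsoft", "start": 27, "end": 36},
]

# Slice the original text with the offsets to recover each entity span.
spans = [text[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(f"{entity['entity_group']}: {span!r}")
```

Working with offsets rather than the `word` field is useful when the original casing and spacing must be preserved, for example when highlighting entities in the source text.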

## Text-Completion(Mask-Filling)

In [41]:
from transformers import pipeline

mask = pipeline("fill-mask")

mask("There are many open-source <mask> models available for text-gner", top_k=3)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.06788498163223267,
  'token': 2788,
  'token_str': ' text',
  'sequence': 'There are many open-source text models available for text-gner'},
 {'score': 0.02523723430931568,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'There are many open-source mathematical models available for text-gner'},
 {'score': 0.019201966002583504,
  'token': 41806,
  'token_str': ' graphical',
  'sequence': 'There are many open-source graphical models available for text-gner'}]

## Text-Generation

In [42]:
from transformers import pipeline

generator = pipeline("text-generation")

generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "In this course, we will teach you how to build the network, how to use your data to create web apps, and how to create your own websites using Python. We will use these resources to create content that's fast, easy, and beautiful"}]

In [43]:
from transformers import pipeline

generator = pipeline("text-generation" , model="distilgpt2")

generator("In this course, we will teach you how to",
          max_length=32,
          num_return_sequences=2,
          )

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'In this course, we will teach you how to successfully use the Google Play Store and Google Play Store on-time for more advanced, mobile app features, and'},
 {'generated_text': "In this course, we will teach you how to create one of the top videos you will find online. If you don't have the material, don't hesitate"}]

## Text-Summarization

In [44]:
from transformers import pipeline

tex = pipeline("summarization")

tex (
    
    """ 
    A Hugging Face pipeline is a predefined sequence of tasks that can be applied to text, audio, vision, or multi-modal inputs.
    These pipelines are part of the Hugging Face Transformers library, which provides a wide range of pre-trained models and tools for natural language processing (NLP).
    
    Key Features of Hugging Face Pipelines
        Predefined Tasks: Pipelines are designed for specific tasks such as text classification, sentiment analysis, image classification, and more. They simplify the process of applying complex models to real-world data1.
        Ease of Use: Pipelines abstract away much of the complexity involved in setting up and running models. Users can easily load a pipeline and start using it with minimal setup1.
        Modularity: Pipelines are modular, meaning you can easily swap out different models or components as needed.
        Integration: They integrate seamlessly with other Hugging Face tools and libraries, making it easy to build end-to-end workflows.
        """
        )

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' A Hugging Face pipeline is a predefined sequence of tasks that can be applied to text, audio, vision, or multi-modal inputs . Pipelines are designed for specific tasks such as text classification, sentiment analysis, image classification, and more . They simplify the process of applying complex models to real-world data .'}]

## Question-Answering

In [45]:
from transformers import pipeline

question_answering = pipeline("question-answering")

question_answering(
    question="Where do you work?",
    context="I work in Microsoft."
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9543827176094055, 'start': 10, 'end': 19, 'answer': 'Microsoft'}

## What really happens inside a pipeline?

Now let's look at what happens inside a pipeline and how it automates all the processing stages:

    * **Pre-processing**
    * **Model-Instantiating**
    * **Post-Processing**

Here we take the **sentiment-analysis** pipeline below as an example and expand it step by step to see how each stage works.

In [46]:
from transformers import pipeline

classifier = pipeline ("sentiment-analysis")
classifier ([
    "She is very interested in learning machine learning with hugging face.",
    "She hates learning English language"   
])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.998216450214386},
 {'label': 'NEGATIVE', 'score': 0.9995666146278381}]

## Step 1:

**Pre-processing** converts the raw text into input that the model can understand. It does this with a technique called tokenizing.

    **Tokenizing** breaks the text down into smaller units called tokens. Depending on the tokenization approach, these tokens can be words, sub-words, or characters. Each token is then converted into a numeric ID so that machine learning models can process and analyze the text.
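
To make the idea concrete, here is a minimal sketch of sub-word tokenization using greedy longest-match-first splitting, the general approach behind WordPiece-style tokenizers. The tiny `VOCAB` and its IDs are hypothetical; a real tokenizer ships a vocabulary with tens of thousands of entries and will split the same text differently.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "she": 4, "hates": 5, "learn": 6, "##ing": 7, "english": 8}

def split_word(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab=VOCAB):
    """Return the tokens and their numeric IDs, wrapped in special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]

tokens, ids = encode("She hates learning English")
print(tokens)
print(ids)
```

With this toy vocabulary "learning" is split into `learn` and `##ing`, which shows how a tokenizer can represent words it has never seen as a whole.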

In [47]:
from transformers import AutoTokenizer

tokenizer_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

# Example texts
text1 = "She is very interested in learning machine learning with hugging face." 
text2 = "She hates learning English language"


inputs1 = tokenizer (text1, padding=True, truncation=True, return_tensors='pt')
inputs2 = tokenizer (text2, padding=True, truncation=True, return_tensors='pt')

print(f"Tokens for text1: {inputs1}")
print(f"Tokens for text2: {inputs2}")


Tokens for text1: {'input_ids': tensor([[  101,  2016,  2003,  2200,  4699,  1999,  4083,  3698,  4083,  2007,
         17662,  2227,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Tokens for text2: {'input_ids': tensor([[  101,  2016, 16424,  4083,  2394,  2653,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
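
Because each text is tokenized in its own call above, `padding=True` has nothing to pad, and every attention mask is all ones. When several texts are tokenized together, the shorter ones are padded to the longest length and the mask marks the padding with zeros. The sketch below reproduces that idea by hand with the IDs printed above; `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption based on BERT-style vocabularies.

```python
# Token IDs copied from the tokenizer output above.
ids_text1 = [101, 2016, 2003, 2200, 4699, 1999, 4083, 3698, 4083, 2007,
             17662, 2227, 1012, 102]
ids_text2 = [101, 2016, 16424, 4083, 2394, 2653, 102]

PAD_ID = 0  # assumption: ID of the [PAD] token in a BERT-style vocabulary

def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([ids_text1, ids_text2], PAD_ID)

for row_ids, row_mask in zip(input_ids, attention_mask):
    print(row_ids)
    print(row_mask)
```

The mask tells the model which positions hold real tokens, so the padding does not influence the prediction.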


## Step 2:

**Model instantiation**: After the raw text has been pre-processed, it is passed to a large language model or a fine-tuned model to perform a specific task such as sentiment analysis, text generation, or question answering.

### Model instantiating with PyTorch

In [48]:
from transformers import AutoModel

# AutoModel loads only the base transformer, without the classification head
llm_model = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(llm_model)

outputs1 = model(**inputs1)
outputs2 = model(**inputs2)

# The shape is (batch size, number of tokens, hidden size)
print(outputs1.last_hidden_state.shape)
print(outputs2.last_hidden_state.shape)

print(f"Model outputs for text1: {outputs1}")
print(f"Model outputs for text2: {outputs2}")

torch.Size([1, 14, 768])
torch.Size([1, 7, 768])
Model outputs for text1: BaseModelOutput(last_hidden_state=tensor([[[ 0.1052,  0.1038,  0.4744,  ...,  0.2579,  0.7159, -0.3149],
         [ 0.3117, -0.0781,  0.2464,  ...,  0.2481,  0.6948, -0.4315],
         [ 0.2061,  0.0306,  0.3996,  ...,  0.2917,  0.7380, -0.2845],
         ...,
         [ 0.3097,  0.1896,  0.7819,  ...,  0.1903,  0.7685, -0.0713],
         [ 0.2798,  0.2286,  0.6623,  ...,  0.5312,  0.6744, -0.5555],
         [ 0.5922,  0.4073,  0.4609,  ...,  0.6287,  0.5652, -0.7413]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)
Model outputs for text2: BaseModelOutput(last_hidden_state=tensor([[[-0.3099,  0.6556, -0.2782,  ...,  0.0112, -1.0256, -0.2191],
         [-0.3476,  0.4199, -0.4006,  ...,  0.0151, -0.7378, -0.2960],
         [-0.4644,  0.8465, -0.1417,  ..., -0.0858, -0.9849, -0.2328],
         ...,
         [-1.1499,  0.8378, -0.6173,  ..., -0.3606, -0.2767,  0.0499],
         [-0

In [49]:
from transformers import AutoModelForSequenceClassification

# AutoModelForSequenceClassification adds the classification head,
# so the output holds one logit per label instead of hidden states
llm_model = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(llm_model)

outputs1 = model(**inputs1)
outputs2 = model(**inputs2)

print(outputs1.logits)
print(outputs2.logits)

tensor([[-3.0720,  3.2554]], grad_fn=<AddmmBackward0>)
tensor([[ 4.3080, -3.4354]], grad_fn=<AddmmBackward0>)


## Step 3:

**Post-processing** converts the model output into a human-readable result.

In [50]:
import torch

# softmax turns the logits into probabilities that sum to 1
result1 = torch.nn.functional.softmax(outputs1.logits, dim=-1)
print(result1)

result2 = torch.nn.functional.softmax(outputs2.logits, dim=-1)
print(result2)

tensor([[0.0018, 0.9982]], grad_fn=<SoftmaxBackward0>)
tensor([[9.9957e-01, 4.3341e-04]], grad_fn=<SoftmaxBackward0>)
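
The softmax step can also be checked by hand. The sketch below recomputes the probabilities in plain Python from the logits printed earlier (rounded to four decimals there, so the last digits may differ slightly from the tensors above); the `softmax` helper is written here for illustration and is not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() from overflowing.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output (index 0 = negative, index 1 = positive).
probs1 = softmax([-3.0720, 3.2554])
probs2 = softmax([4.3080, -3.4354])

print([round(p, 4) for p in probs1])  # [0.0018, 0.9982]
print([round(p, 4) for p in probs2])  # [0.9996, 0.0004]
```

Softmax depends only on the differences between logits, which is why a gap of about 6 between the two scores already yields a probability above 0.99.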


In [51]:
# Get the labels and scores 
labels = ["negative", "positive"] 
label1 = labels[torch.argmax(result1).item()] 
label2 = labels[torch.argmax(result2).item()] 
score1 = result1[0][torch.argmax(result1)].item() 
score2 = result2[0][torch.argmax(result2)].item() 

# Print the results 
print(f"Text 1: '{text1}'") 
print(f"Sentiment: {label1}, Score: {score1:.4f}\n") 
print(f"Text 2: '{text2}'") 
print(f"Sentiment: {label2}, Score: {score2:.4f}")

Text 1: 'She is very interested in learning machine learning with hugging face.'
Sentiment: positive, Score: 0.9982

Text 2: 'She hates learning English language'
Sentiment: negative, Score: 0.9996


## Sentiment analysis with more concise code and the same result

In [52]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Step 1: load the tokenizer
tokenizer_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)
raw_text = [
    "She is very interested in learning machine learning with hugging face.",
    "She hates learning English language"
]

# Step 2: load the fine-tuned classification model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Step 3: hand both to the pipeline, which tokenizes the text,
# runs the model, and post-processes the logits in one call
sentiment_analyzer = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
results = sentiment_analyzer(raw_text)

# Print the results
for text, result in zip(raw_text, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")


Text: She is very interested in learning machine learning with hugging face.
Sentiment: POSITIVE, Score: 0.9982

Text: She hates learning English language
Sentiment: NEGATIVE, Score: 0.9996



## Sentiment analysis with a Hugging Face pipeline, with the same results

In [53]:
from transformers import pipeline

classifier = pipeline ("sentiment-analysis")
classifier ([
    "She is very interested in learning machine learning with hugging face.",
    "She hates learning English language"   
])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.998216450214386},
 {'label': 'NEGATIVE', 'score': 0.9995666146278381}]

## 😍😍 Thanks to Hugging Face pipelines for making the coding easier ❤️❤️

### Thank You!

Please contribute to my GitHub repository: https://github.com/m-zia-rasa/NLP-Hugging_face-Pipelines.git