# **Loading Models and Running Inference with Hugging Face**


This lab explores how to leverage the Hugging Face `transformers` library for various natural language processing (NLP) tasks. It begins by demonstrating text classification and text generation using pretrained models like DistilBERT and GPT-2 without using the `pipeline()` function, covering the steps involved in loading models, tokenizing input, performing inference, and processing outputs. The lab then showcases the simplicity and efficiency of using the `pipeline()` function to accomplish the same tasks with minimal code. By comparing both approaches, the lab illustrates how the `pipeline()` function streamlines the process, making it easier and faster to implement NLP solutions.


## Setup


### Installing required libraries


In [1]:
!pip install torch
!pip install transformers



### Installing PyTorch with CUDA support (local or cloud)

In [3]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.7.1%2Bcu118-cp313-cp313-win_amd64.whl.metadata (6.8 kB)
INFO: pip is looking at multiple versions of torchaudio to determine which version is compatible with other requirements. This could take a while.
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.7.0%2Bcu118-cp313-cp313-win_amd64.whl.metadata (6.8 kB)
  Downloading https://download.pytorch.org/whl/cu118/torchaudio-2.6.0%2Bcu118-cp313-cp313-win_amd64.whl.metadata (6.8 kB)
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.22.1%2Bcu118-cp313-cp313-win_amd64.whl.metadata (6.3 kB)
Collecting torch
  Downloading https://download.pytorch.org/whl/cu118/torch-2.7.1%2Bcu118-cp313-cp313-win_amd64.whl.metadata (27 kB)
Downloading https://download.pytorch.org/whl/cu118/torchvision-0.22.1%2Bcu118-cp313-cp313-win_amd64.whl (5.5 MB)
   ----------

In [6]:
pip show transformers

Name: transformers
Version: 4.57.0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: d:\OneDrive - Imperial College London\Langchain_LLM\Huggingface_Fine_Tuning\.venv\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.
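
After installing, it can help to confirm that PyTorch is importable and whether it can see a GPU. The following is a minimal sketch; the helper name `describe_torch_install` is hypothetical, and the only PyTorch calls it relies on are `torch.__version__` and `torch.cuda.is_available()`.

```python
def describe_torch_install():
    """Report whether PyTorch is importable and whether it can see a CUDA GPU."""
    info = {"torch_installed": False, "version": None, "cuda_available": False}
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return info
    info["torch_installed"] = True
    info["version"] = torch.__version__
    info["cuda_available"] = bool(torch.cuda.is_available())
    return info


report = describe_torch_install()
print(report)
```

If `cuda_available` is `False` on a machine with a GPU, the installed wheel is probably a CPU-only build or does not match the local CUDA driver.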


### Importing required libraries

_It is recommended that you import all required libraries in one place (here):_


In [7]:
from transformers import pipeline
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')



# Text classification with DistilBERT


## Load the model and tokenizer

First, let's initialize a tokenizer and a model for sentiment analysis using DistilBERT fine-tuned on the SST-2 dataset. This setup is useful for tasks where you need to quickly classify the sentiment of a piece of text with a pretrained, efficient transformer model.




In [11]:
# Load the tokenizer and model
# "distilbert-base-uncased-finetuned-sst-2-english" breakdown:
# - distilbert: A smaller, faster version of BERT (50% fewer parameters, 97% of BERT's performance)
# - base: The base model size (as opposed to large)
# - uncased: Text is converted to lowercase (no distinction between "Hello" and "hello")
# - finetuned: The model has been further trained on a specific task
# - sst-2: Stanford Sentiment Treebank v2 dataset (binary sentiment classification)
# - english: The model is trained on English text

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")


# Load the pre-trained DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

## Preprocess the input text
Tokenize the input text and convert it to a format suitable for the model:


In [14]:
# Sample text
text = "Congratulations! You've won a free ticket to the Bahamas. Reply WIN to claim."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[  101, 23156,   999,  2017,  1005,  2310,  2180,  1037,  2489,  7281,
          2000,  1996, 17094,  1012,  7514,  2663,  2000,  4366,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


The `input_ids` are the indexes of the tokens in the tokenizer's vocabulary. The `attention_mask` marks which positions hold real tokens (1) and which hold padding (0), so the model can ignore padding when sequences of different lengths are batched together. Here there is a single unpadded sequence, so every mask value is 1.
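
To see where zeros in the mask come from, here is a minimal pure-Python sketch of padding a batch. The helper `pad_batch` is hypothetical, the token ids are illustrative, and `pad_id=0` is an assumption; in practice the tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)            # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (illustrative ids, not real tokenizer output)
batch = pad_batch([[101, 2023, 2003, 2307, 102], [101, 2204, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```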


## Perform inference
The `torch.no_grad()` context manager disables gradient calculation. This reduces memory consumption and speeds up computation, because gradients are not needed for inference (that is, when you are not training the model). The `**inputs` syntax unpacks a dictionary into keyword arguments: in `model(**inputs)`, each key of `inputs` (`input_ids` and `attention_mask`) is passed to the model as a named argument.


In [18]:
# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-3.9954,  4.3336]]), hidden_states=None, attentions=None)

Alternatively, you can pass `input_ids` and `attention_mask` explicitly as separate keyword arguments:


In [None]:
#model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
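
The two call styles are equivalent. This can be checked without loading a model by using a stand-in function; `toy_model` below is hypothetical and only mimics the parameter names of the real model.

```python
def toy_model(input_ids, attention_mask=None):
    """Stand-in for a model call: reports what it received."""
    attended = sum(attention_mask) if attention_mask is not None else len(input_ids)
    return {"n_tokens": len(input_ids), "n_attended": attended}


inputs = {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]}

# Dictionary unpacking: each key becomes a keyword argument
unpacked = toy_model(**inputs)

# The same call written out explicitly
explicit = toy_model(input_ids=inputs["input_ids"],
                     attention_mask=inputs["attention_mask"])

print(unpacked == explicit)
```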

### Get the logits
The logits are the raw, unnormalized predictions of the model. Let's extract the logits from the model's outputs to perform further processing, such as determining the predicted class or calculating probabilities.


In [20]:
logits = outputs.logits

print(logits,logits.shape)

tensor([[-3.9954,  4.3336]]) torch.Size([1, 2])


## Post-process the output
Convert the logits to probabilities and get the predicted class:


In [23]:
# Convert logits to probabilities using softmax
# The softmax function transforms raw logits (which can be any real numbers) into a probability distribution 
# where all values are between 0 and 1 and sum to 1. It does this by exponentiating each logit and then 
# normalizing by the sum of all exponentiated values, ensuring the output represents valid probabilities.
# The argmax function then finds the index of the highest probability, which corresponds to the model's 
# most confident prediction and represents the predicted class.

# dim=-1 applies softmax along the last dimension (columns), normalizing probabilities across all classes for each sample
probs = torch.softmax(logits,dim = -1)

predicted_class = torch.argmax(probs, dim = -1)
predicted_class


tensor([1])
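
For intuition, the same softmax and argmax steps can be worked through by hand on the two logits printed earlier. This is a plain-Python sketch, not what `torch.softmax` runs internally; the `softmax` helper is written here for illustration.

```python
import math


def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [-3.9954, 4.3336]  # values from the model output above
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(probs, predicted_class)
```

The gap of roughly 8.3 between the two logits is what pushes the probability of class 1 so close to 1.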

In [25]:
labels = ['Negative', 'Positive']
# .item() converts the single-element tensor into a plain Python int for list indexing
predicted_answer = labels[predicted_class.item()]
predicted_answer

'Positive'

# Text generation with GPT-2 


## Load tokenizer
 Load the pretrained GPT-2 tokenizer. The tokenizer is responsible for converting text into tokens that the model can understand.


In [28]:
# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Load the pretrained GPT-2 model with a language modeling head. The model generates text based on the input tokens.


In [29]:
# Load the model

model = GPT2LMHeadModel.from_pretrained("gpt2")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
'(ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None)), '(Request ID: 3bb0f2ab-eecf-4d4c-b2a4-26432f42ee8b)')' thrown while requesting GET https://huggingface.co/gpt2/resolve/main/model.safetensors
Retrying in 1s [Retry 1/5].


## Preprocess the input text  
Tokenize the input text and convert it to a format suitable for the model. As before, the result (`inputs`) contains the token indexes and the attention mask.


In [30]:
# Prompt
prompt = "Once upon a time"

# Tokenize the input text
# 'pt' stands for PyTorch tensors - this returns the tokenized input as PyTorch tensors instead of lists
inputs = tokenizer(prompt,return_tensors='pt')

inputs


{'input_ids': tensor([[7454, 2402,  257,  640]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

## Perform inference

**What is inference performance?**

Inference performance describes how quickly and cheaply a model produces its output: generation speed, memory usage, and resource utilization during text generation. It matters most when models are deployed in production environments, where response time and computational efficiency are real constraints.

**Why does it matter?**
1. **Speed**: Faster text generation improves user experience and allows handling more requests per second.
2. **Memory Efficiency**: Lower GPU/CPU memory usage allows larger batch sizes or running on smaller hardware.

Common optimization techniques include using `torch.no_grad()` to disable gradient computation, model quantization, caching mechanisms, and specialized inference engines such as ONNX Runtime or TensorRT.


In [32]:
# Generate text using the model.generate() method
# generate() wraps the model's forward pass in a decoding loop:
# 1. Autoregressive generation: produces tokens one by one, feeding previous outputs back in as inputs
# 2. Decoding strategies: greedy search (used here by default), beam search, top-k and top-p sampling
# 3. KV-cache: reuses the attention keys and values of previous tokens instead of recomputing them
# 4. Stopping criteria: stops at the EOS token or when max_length is reached
# Note that max_length=50 counts the prompt tokens as well as the generated ones.
# A plain forward pass, model(**inputs), performs only one step of this loop
# and returns logits rather than generated token ids.

output_ids = model.generate(
    inputs.input_ids, 
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    max_length=50, 
    num_return_sequences=1
)

output_ids

tensor([[7454, 2402,  257,  640,   11,  262,  995,  373,  257, 1295,  286, 1049,
         8737,  290, 1049, 3514,   13,  383,  995,  373,  257, 1295,  286, 1049,
         3514,   11,  290,  262,  995,  373,  257, 1295,  286, 1049, 3514,   13,
          383,  995,  373,  257, 1295,  286, 1049, 3514,   11,  290,  262,  995,
          373,  257]])
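
The decoding loop described above can be sketched with a toy model. Everything here is hypothetical: the score table stands in for GPT-2, and it looks only at the last token, whereas a real model conditions on the whole sequence. The loop structure (pick the best next token, append it, stop at EOS or `max_length`) is the part that carries over.

```python
EOS = "<eos>"

# Hypothetical next-token scores keyed by the previous token
NEXT_TOKEN_SCORES = {
    "time": {",": 0.6, EOS: 0.4},
    ",": {"the": 0.9, "a": 0.1},
    "the": {"end": 0.6, "world": 0.4},
    "end": {EOS: 1.0},
}


def greedy_generate(prompt_tokens, max_length):
    """Greedy decoding: always append the highest-scoring next token."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:            # max_length includes the prompt
        scores = NEXT_TOKEN_SCORES.get(tokens[-1], {EOS: 1.0})
        next_token = max(scores, key=scores.get)
        if next_token == EOS:                  # stopping criterion
            break
        tokens.append(next_token)
    return tokens


prompt = ["Once", "upon", "a", "time"]
full = greedy_generate(prompt, max_length=50)
capped = greedy_generate(prompt, max_length=6)
print(full)
print(capped)
```

With `max_length=6` and a four-token prompt, only two new tokens are produced, which mirrors how the 50 ids in the GPT-2 output above include the 4 prompt ids.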

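Alternatively, a single forward pass returns the logits for each input position rather than generated token ids. The logits at the last position score the candidates for the next token:

```python
with torch.no_grad():
    outputs = model(**inputs)

outputs
```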

## Post-process the output  
Decode the generated tokens to get the text:


In [35]:
# Decode the generated text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generated_text)

Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a
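
The repetition in this output is typical of greedy decoding, which always picks the single most likely token. Sampling is a common alternative; in `generate()` it is enabled with `do_sample=True`, optionally with `top_k`. The sketch below illustrates the idea of top-k sampling on a hypothetical score table; `top_k_sample` is an illustrative helper, not a library function.

```python
import random


def top_k_sample(scores, k, rng):
    """Keep the k highest-scoring tokens and sample one in proportion to its score."""
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [weight for _, weight in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores after "a place of great"
scores = {"danger": 0.5, "beauty": 0.3, "wonder": 0.15, "xylophone": 0.05}

rng = random.Random(0)  # seeded for reproducibility
samples = [top_k_sample(scores, k=3, rng=rng) for _ in range(20)]
print(samples)
```

With `k=3` the lowest-scoring token can never be drawn, while the remaining three still vary from call to call.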


# Hugging Face `pipeline()` function

The `pipeline()` function from the Hugging Face `transformers` library is a high-level API designed to simplify the usage of pretrained models for various natural language processing (NLP) tasks. It abstracts the complexities of model loading, tokenization, inference, and post-processing, allowing users to perform complex NLP tasks with just a few lines of code.

## Definition

```python
transformers.pipeline(
    task: str,
    model: Optional = None,
    config: Optional = None,
    tokenizer: Optional = None,
    feature_extractor: Optional = None,
    framework: Optional = None,
    revision: str = 'main',
    use_fast: bool = True,
    model_kwargs: Dict[str, Any] = None,
    **kwargs
)
```

## Parameters

- **task**: `str`
  - The task to perform, such as "text-classification", "text-generation", "question-answering", etc.
  - Example: `"text-classification"`

- **model**: `Optional`
  - The model to use. This can be a string (model identifier from Hugging Face model hub), a path to a directory containing model files, or a pre-loaded model instance.
  - Example: `"distilbert-base-uncased-finetuned-sst-2-english"`

- **config**: `Optional`
  - The configuration to use. This can be a string (model identifier), a path to a directory, or a pre-loaded config object.
  - Example: `AutoConfig.from_pretrained("distilbert-base-uncased")`

- **tokenizer**: `Optional`
  - The tokenizer to use. This can be a string, a path to a directory, or a pre-loaded tokenizer instance.
  - Example: `"bert-base-uncased"`

- **feature_extractor**: `Optional`
  - The feature extractor to use for tasks that require it (e.g., audio or image processing).
  - Example: `"facebook/wav2vec2-base-960h"`

- **framework**: `Optional`
  - The framework to use, either `"pt"` for PyTorch or `"tf"` for TensorFlow. If not specified, it will be inferred.
  - Example: `"pt"`

- **revision**: `str`, default `'main'`
  - The specific model version to use (branch, tag, or commit hash).
  - Example: `"v1.0"`

- **use_fast**: `bool`, default `True`
  - Whether to use the fast version of the tokenizer if available.
  - Example: `True`

- **model_kwargs**: `Dict[str, Any]`, default `None`
  - Additional keyword arguments passed to the model during initialization.
  - Example: `{"output_hidden_states": True}`

- **kwargs**: `Any`
  - Additional keyword arguments passed to the pipeline components.
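
As an illustration of the `model` and `tokenizer` parameters, the sketch below passes pre-loaded instances instead of a model identifier. The function name `build_classifier` is hypothetical; `AutoTokenizer`, `AutoModelForSequenceClassification`, and `pipeline` are the library calls. Calling the function downloads the checkpoint on first use, so it is defined here without being run.

```python
def build_classifier(model_id="distilbert-base-uncased-finetuned-sst-2-english"):
    """Build a text-classification pipeline from pre-loaded model and tokenizer objects."""
    # Imported lazily so that defining the function does not require the library
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    # When a model instance is passed, the tokenizer should be passed explicitly too
    return pipeline("text-classification", model=model, tokenizer=tokenizer)


# Usage (requires network access or a local cache):
# classifier = build_classifier()
# classifier("I really enjoyed this lab.")
```

Loading the objects yourself is useful when you want to inspect or modify the model before wrapping it in a pipeline.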

## Task types

The `pipeline()` function supports a wide range of NLP tasks. Here are some of the common tasks:

1. **Text Classification**: `text-classification`
   - **Purpose**: Classify text into predefined categories.
   - **Use Cases**: Sentiment analysis, spam detection, topic classification.

2. **Text Generation**: `text-generation`
   - **Purpose**: Generate coherent text based on a given prompt.
   - **Use Cases**: Creative writing, dialogue generation, story completion.

3. **Question Answering**: `question-answering`
   - **Purpose**: Answer questions based on a given context.
   - **Use Cases**: Building Q&A systems, information retrieval from documents.

4. **Named Entity Recognition (NER)**: `ner` (or `token-classification`)
   - **Purpose**: Identify and classify named entities (like people, organizations, locations) in text.
   - **Use Cases**: Extracting structured information from unstructured text.

5. **Summarization**: `summarization`
   - **Purpose**: Summarize long pieces of text into shorter, coherent summaries.
   - **Use Cases**: Document summarization, news summarization.

6. **Translation**: `translation_xx_to_yy` (e.g., `translation_en_to_fr`)
   - **Purpose**: Translate text from one language to another.
   - **Use Cases**: Language translation, multilingual applications.

7. **Fill-Mask**: `fill-mask`
   - **Purpose**: Predict masked words in a sentence (useful for masked language modeling).
   - **Use Cases**: Language modeling tasks, understanding model predictions.

8. **Zero-Shot Classification**: `zero-shot-classification`
   - **Purpose**: Classify text into categories without needing training data for those categories.
   - **Use Cases**: Flexible and adaptable classification tasks.

9. **Feature Extraction**: `feature-extraction`
   - **Purpose**: Extract hidden state features from text.
   - **Use Cases**: Downstream tasks requiring text representations, such as clustering, similarity, or further custom model training.


### Example 1: Text classification using `pipeline()`

In this example, you will use the `pipeline()` function to perform text classification. You will load a pretrained text classification model and use it to classify a sample text.

#### Load the text classification model:
We initialize the pipeline for the `text-classification` task, specifying the model `"distilbert-base-uncased-finetuned-sst-2-english"`. This model is fine-tuned for sentiment analysis.

#### Classify the sample text:
We use the classifier to classify a sample text: "Congratulations! You've won a free ticket to the Bahamas. Reply WIN to claim." The `classifier` function returns the classification result, which is then printed.


In [37]:
# Load a general text classification model
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# Classify a sample text
result = classifier("Congratulations! You've won a free ticket to the Bahamas. Reply WIN to claim.")
print(result)

Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9997586607933044}]


#### Output

The output will be a list of dictionaries, where each dictionary contains:

- `label`: The predicted label (e.g., "POSITIVE" or "NEGATIVE").
- `score`: The confidence score for the prediction.
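
Because the result is a plain list of dictionaries, it is easy to post-process. The sketch below applies a confidence threshold; `confident_label`, the threshold value, and the second (low-scoring) result are illustrative assumptions, not part of the library or of this lab's outputs.

```python
def confident_label(results, threshold=0.95):
    """Return the top label, or 'UNCERTAIN' when its score is below the threshold."""
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= threshold else "UNCERTAIN"


# Result printed by the classifier above
sentiment = [{'label': 'POSITIVE', 'score': 0.9997586607933044}]

# Made-up low-confidence result, for contrast
borderline = [{'label': 'NEGATIVE', 'score': 0.62}]

print(confident_label(sentiment))
print(confident_label(borderline))
```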


### Example 2: Language detection using `pipeline()`

In this example, you will use the `pipeline()` function to perform language detection. You will load a pretrained language detection model and use it to identify the language of a sample text.

#### Load the language detection model:
We initialize the pipeline for the `text-classification` task, specifying the model `"papluca/xlm-roberta-base-language-detection"`. This model is fine-tuned for language detection.

#### Classify the sample text:
We use the classifier to detect the language of a sample text: "Bonjour, comment ça va?" The `classifier` function returns the classification result, which is then printed.


In [39]:
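# The sentiment classifier from Example 1 is still loaded at this point,
# so it returns a sentiment label for the French text, not a language.
text2 = "Bonjour, comment ça va?"

result2 = classifier(text2)
result2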

[{'label': 'NEGATIVE', 'score': 0.8884902000427246}]

In [40]:
from transformers import pipeline

classifier = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")
result = classifier("Bonjour, comment ça va?")
print(result)

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cuda:0


[{'label': 'fr', 'score': 0.9934879541397095}]


#### Output
The output will be a list of dictionaries, where each dictionary contains:

- `label`: The predicted language label (e.g., "fr" for French).
- `score`: The confidence score for the prediction.


### Example 3: Text generation using `pipeline()`

In this example, you will use the `pipeline()` function to perform text generation. You will load a pretrained text generation model and use it to generate text based on a given prompt.

#### Initialize the text generation model:
We initialize the pipeline for the `text-generation` task, specifying the model `"gpt2"`. GPT-2 is a well-known model for text generation tasks.


In [None]:
# Initialize the text generation pipeline with GPT-2
generator = pipeline("text-generation", model="gpt2")

#### Generate text based on a given prompt:
We use the generator to generate text from the prompt "Once upon a time". Setting `max_length=50` caps the total length, prompt included, at 50 tokens; `truncation=True` truncates the prompt if it is too long for the model; and `num_return_sequences=1` requests a single sequence. The `generator` call returns the generated text, which is then printed.


In [None]:
# Generate text based on a given prompt
prompt = "Once upon a time"
result = generator(prompt, max_length=50, num_return_sequences=1, truncation=True)

# Print the generated text
print(result[0]['generated_text'])

#### Output
The output will be a list of dictionaries, where each dictionary contains:

- `generated_text`: The generated text based on the input prompt.


### Example 4: Text generation using T5 with `pipeline()`

In this example, you will use the `pipeline()` function to perform text-to-text generation with the T5 model. You will load a pretrained T5 model and use it to translate a sentence from English to French based on a given prompt.

#### Initialize the text generation model:
We initialize the pipeline for the `text2text-generation` task, specifying the model `"t5-small"`. T5 is a versatile model that can perform various text-to-text generation tasks, including translation.


In [None]:
# Initialize the text generation pipeline with T5
generator = pipeline("text2text-generation", model="t5-small")

#### Generate text based on a given prompt:
We use the generator to translate a sentence from English to French based on the prompt: "translate English to French: How are you?". Let's specify `max_length=50` to limit the generated text to 50 tokens and `num_return_sequences=1` to generate one sequence. The `generator` function returns the translated text, which is then printed.


In [None]:
# Generate text based on a given prompt
prompt = "translate English to French: How are you?"
result = generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(result[0]['generated_text'])

#### Output
The output will be a list of dictionaries, where each dictionary contains:

- `generated_text`: The generated text based on the input prompt.


### Example 5: Fill-mask task using BERT with `pipeline()`

In this example, you will use the `pipeline()` function to perform a fill-mask task using the BERT model. You will load a pretrained BERT model and use it to predict the masked word in a given sentence.


#### Instructions

1. **Initialize the fill-mask pipeline** with the BERT model.
2. **Create a prompt** with a masked token.
3. **Generate text** by filling in the masked token.
4. **Print the generated text** with the predictions.


In [56]:
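# Sentence with a masked token; BERT's mask token is [MASK]
masked_text = "The [MASK] of Australia is Canberra"

mask_filler = pipeline("fill-mask", model="bert-base-uncased")
result3 = mask_filler(masked_text)
# The pipeline returns candidates sorted by score; print the best one
print(result3[0])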

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


{'score': 0.9584370255470276, 'token': 3007, 'token_str': 'capital', 'sequence': 'the capital of australia is canberra'}
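
The mask placeholder depends on the tokenizer: BERT uses `[MASK]`, while RoBERTa-style models use `<mask>`. A small sketch of building the input independently of the model follows; `make_masked_sentence` is a hypothetical helper. With a loaded pipeline, the token is available as `mask_filler.tokenizer.mask_token`.

```python
def make_masked_sentence(template, mask_token):
    """Insert the model-specific mask token into a template with a {mask} slot."""
    return template.format(mask=mask_token)


template = "The {mask} of Australia is Canberra"

bert_text = make_masked_sentence(template, "[MASK]")     # BERT-style mask token
roberta_text = make_masked_sentence(template, "<mask>")  # RoBERTa-style mask token

print(bert_text)
print(roberta_text)
```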


# Conclusion
## Benefits of using `pipeline()`

- **Reduced Boilerplate Code**: Simplifies the code required to perform NLP tasks.
- **Improved Readability**: Makes code more readable and expressive.
- **Time Efficiency**: Saves time by handling model loading, tokenization, inference, and post-processing automatically.
- **Consistent API**: Provides a consistent API across different tasks, allowing for easy experimentation and quick prototyping.
- **Automatic Framework Handling**: Automatically handles the underlying framework (TensorFlow or PyTorch).

## When to use `pipeline()`

- **Quick Prototyping**: When you need to quickly prototype an NLP application or experiment with different models.
- **Simple Tasks**: When performing simple or common NLP tasks that are well-supported by the `pipeline()` function.
- **Deployment**: When deploying NLP models in environments where simplicity and ease of use are crucial.

## When to avoid `pipeline()`

- **Custom Tasks**: When you need to perform highly customized tasks that are not well-supported by the `pipeline()` function.
- **Performance Optimization**: When you need fine-grained control over the model and tokenization process for performance optimization or specific use cases.
