# Zero-Shot Question Answering (QA) with Hugging Face Models

This tutorial covers **running zero-shot Question Answering (QA)** with Hugging Face models, including:

* What is zero-shot QA?
* How it differs from classic extractive QA
* Pipelines and direct model usage
* Best model options (including `facebook/bart-large-mnli`, NLI models, and zero-shot QG/QA)
* End-to-end code with explanations
* Bonus: Using Multilingual Zero-Shot QA

## 1. **What is Zero-Shot QA?**

* **Classic (Extractive) QA:** Given a **context** and a **question**, the model extracts the answer from the context. Needs the answer to be *explicitly present* in the text.
* **Zero-Shot QA:** The model answers a question based on *world knowledge* or *generalization*, even if it hasn’t seen a similar Q\&A pair during training, and often *without being fine-tuned for QA*.

  * **Use Cases:** Open-domain QA, knowledge base QA, document search where context may not directly include the answer.
  * **How?** A common method is to reframe QA as NLI (Natural Language Inference): treat the context as the premise, turn the question plus a candidate answer into a declarative hypothesis, then check whether the premise entails that hypothesis.

## 2. **Approaches for Zero-Shot QA**

### - **A. NLI-Based (Entailment) Approach**

* **NLI (Natural Language Inference)** is a task in NLP that determines whether a hypothesis is true (entailed), false (contradicted), or uncertain (neutral) based on a given context.
* Use a model like `facebook/bart-large-mnli` or `roberta-large-mnli`.
* Check whether a candidate answer is supported (“entailed”) by the context.
* **Workflow:**

  1. Generate answer candidates (retrieval, text spans, or generation).
  2. For each candidate, use an NLI model to check whether the context entails the statement formed from the question and that candidate.
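The two-step workflow above can be sketched in plain Python. This is a minimal sketch rather than a library API: `build_hypotheses`, `select_entailed`, and the `entail_score` callable are hypothetical names, the 0.5 threshold is an arbitrary assumption, and `toy_score` is only a placeholder for a real NLI model so the control flow can run without downloading anything.

```python
from typing import Callable, List, Tuple


def build_hypotheses(template: str, candidates: List[str]) -> List[str]:
    """Turn each candidate answer into a full declarative statement."""
    return [template.format(answer=c) for c in candidates]


def select_entailed(
    context: str,
    candidates: List[str],
    template: str,
    entail_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep candidates whose hypothesis scores at or above the threshold."""
    hypotheses = build_hypotheses(template, candidates)
    scored = [(c, entail_score(context, h)) for c, h in zip(candidates, hypotheses)]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def toy_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (NOT an NLI model): 1.0 if the last word of the
    hypothesis appears in the premise, else 0.0."""
    last_word = hypothesis.rstrip(".").split()[-1]
    return 1.0 if last_word in premise else 0.0


context = "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh."
answers = select_entailed(
    context,
    ["India", "Bangladesh", "Nepal", "China"],
    "The Ganges flows through {answer}.",
    toy_score,
)
print(answers)
```

In a real setup, `entail_score` would wrap an NLI model call; keeping it as a parameter makes the selection logic easy to test on its own.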

### - **B. Generative Zero-Shot QA**

* Use models like `google/flan-t5-xxl`, `tiiuae/falcon-7b-instruct`, or `mistralai/Mistral-7B-Instruct` for direct question answering without fine-tuning.


## 3. **A. Zero-Shot QA using NLI Models (Pipeline)**

#### **Example 1: QA with NLI as Zero-Shot Classifier**


In [1]:
from transformers import pipeline

# BART-large MNLI is a strong zero-shot model for natural language inference tasks
nli_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define the context, which provides information about the Ganges river
context = "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh."
# Define the question for which we need to generate the answer
question = "Which countries does the Ganges flow through?"

# Candidate answers (these can be improved via retrieval/heuristics based on the context)
candidate_answers = ["India", "Bangladesh", "Nepal", "China"]

# We turn the question and each candidate answer into "hypotheses" for classification
hypotheses = [f"The Ganges flows through {ans}." for ans in candidate_answers]  # Creating hypotheses

# Use the zero-shot classification pipeline to evaluate the hypotheses against the context
result = nli_pipeline(
    sequences=context,  # The context provided to the model
    candidate_labels=hypotheses,  # The hypotheses created from candidate answers
    multi_label=True  # Allowing for multiple valid labels (answers)
)

# Print the predicted labels (answers) that the model found most likely
print(result["labels"])

# Print the confidence scores for each hypothesis (how strongly the model supports each answer)
print(result["scores"])




['The Ganges flows through India.', 'The Ganges flows through Bangladesh.', 'The Ganges flows through Nepal.', 'The Ganges flows through China.']
[0.964613676071167, 0.9313793778419495, 0.0002740573254413903, 0.00010426752851344645]


This code uses a **Natural Language Inference (NLI)** approach to find answers to a question based on a given context. Specifically, it uses a zero-shot classification model (`facebook/bart-large-mnli`) to check which candidate answers the context supports.

**Think of it as:**

* Context: The information you already have.

* Question: What you want to know.

* Candidate answers: Possible answers you want the model to check.

* NLI model: Decides if each candidate answer is supported by the context.

### **Code Explanation:**

* `pipeline` is a helper function from the transformers library. It simplifies using pre-trained models for tasks like question answering, text generation, or classification.

* `"zero-shot-classification"`: Allows the model to classify text without training on specific labels.

* `facebook/bart-large-mnli`: A strong model trained for Natural Language Inference, meaning it can check whether a statement is entailed by, contradicted by, or neutral with respect to a context.

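* NLI models need statements (hypotheses) to check, not just single words. Note that the pipeline wraps each label in its default template, `This example is {}.`, before scoring; passing `hypothesis_template="{}"` makes it use the sentences verbatim, which may change the scores shown above.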

* `sequences=context`: The text the model uses as reference.

* `candidate_labels=hypotheses`: The statements (hypotheses) the model evaluates.

* `multi_label=True`: Allows more than one answer to be correct at the same time.

* The model gives a score for each hypothesis, showing how strongly the context supports it.
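To make the role of the hypothesis template concrete, the snippet below reproduces only the string formatting step. It assumes the pipeline's default template is `This example is {}.` and that labels are inserted with `str.format`, which matches the `transformers` releases I am aware of; the constant names are illustrative.

```python
# Default template of the zero-shot-classification pipeline (assumed, see lead-in).
DEFAULT_TEMPLATE = "This example is {}."
# Pass-through template: each label is used as the hypothesis unchanged.
VERBATIM_TEMPLATE = "{}"

candidate_answers = ["India", "Bangladesh"]
labels = [f"The Ganges flows through {ans}." for ans in candidate_answers]

# What the model actually receives as the hypothesis in each case.
wrapped = [DEFAULT_TEMPLATE.format(label) for label in labels]
verbatim = [VERBATIM_TEMPLATE.format(label) for label in labels]

print(wrapped[0])
print(verbatim[0])

# With a loaded pipeline, the template is passed at call time, for example:
# result = nli_pipeline(context, candidate_labels=labels,
#                       hypothesis_template="{}", multi_label=True)
```

With the default template, a full-sentence label ends up embedded in a second sentence, so the pass-through template is usually the cleaner choice when the labels are already complete statements.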

### **Expected Output**:

* The model predicts India and Bangladesh as the correct answers because the context explicitly mentions them.

* Nepal and China get very low scores because the context does not support them.

* The labels are ranked from most likely true to least likely true, and the scores show the confidence (0 to 1).
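Turning the ranked scores into a final answer set needs a cutoff. The sketch below filters the result dictionary by score; `select_answers` is a hypothetical helper and the 0.5 threshold is an assumption that should be tuned per task. The sample data is the output of Example 1, with scores rounded.

```python
from typing import Dict, List, Tuple


def select_answers(result: Dict, threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (label, score) pairs whose score is at least `threshold`."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Output of Example 1, copied from the run above (scores rounded).
result = {
    "labels": [
        "The Ganges flows through India.",
        "The Ganges flows through Bangladesh.",
        "The Ganges flows through Nepal.",
        "The Ganges flows through China.",
    ],
    "scores": [0.9646, 0.9314, 0.0003, 0.0001],
}

for label, score in select_answers(result):
    print(f"{score:.2f}  {label}")
```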

## 4. **B. Zero-Shot QA with Generative LLMs**

#### **Example 2: Using FLAN-T5 for Open QA**

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model for FLAN-T5 (other sizes such as flan-t5-xl load the same way)
model_name = "google/flan-t5-large"  # Decoder-only models such as Mistral need AutoModelForCausalLM instead
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Initialize the tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # Initialize the pre-trained model

# Define the question and context
question = "Which countries does the Ganges flow through?"  # The question to be answered
context = "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh."  # The context providing information about the Ganges

# Format the input prompt by combining the question and context
prompt = f"Question: {question}\nContext: {context}\nAnswer:"  # The formatted prompt for the model

# Tokenize the input prompt to prepare it for the model
inputs = tokenizer(prompt, return_tensors="pt")  # Convert the prompt into token format suitable for the model

# Generate an answer from the model based on the tokenized input
outputs = model.generate(**inputs, max_new_tokens=32)  # Generate an answer with a limit of 32 new tokens

# Decode and print the generated answer (removes special tokens used by the model)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # Convert token IDs back to human-readable text and print the answer



India and Bangladesh


* **For true open-domain QA,** just use the question (no context) and see how the model generalizes.
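Switching between the two modes only changes the prompt. The helper below is a small sketch (`build_prompt` is a hypothetical name) that reproduces the prompt format of Example 2 and drops the context line when no context is given.

```python
from typing import Optional


def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Build the Example 2 prompt; omit the context line for closed-book QA."""
    lines = [f"Question: {question}"]
    if context:
        lines.append(f"Context: {context}")
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which countries does the Ganges flow through?"
context = "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh."

print(build_prompt(question, context))  # same format as the prompt in Example 2
print(build_prompt(question))           # closed-book: question only
```

Without context, the answer depends entirely on what the model memorized during training, so it is worth verifying against a trusted source.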

**Flan-T5** is a variant of the T5 (Text-to-Text Transfer Transformer) model, developed by Google. The "Flan" part refers to the fine-tuning methodology used on T5, where the model is trained on a broad set of instruction-based datasets. This fine-tuning process makes Flan-T5 more effective at following instructions, such as answering questions, completing tasks, or summarizing information.

### **Code Explanation**:

* `AutoTokenizer`: Converts text into tokens (numbers) that the model understands.

* `AutoModelForSeq2SeqLM`: Loads a sequence-to-sequence model, which can generate text as output.

* `model_name`: Specifies the pre-trained model to use (FLAN-T5-large).

* `tokenizer`: Prepares text for the model (splits it into tokens).

* `model`: Loads the pre-trained FLAN-T5 model for text generation.

* `return_tensors="pt"`: Returns the token IDs as PyTorch tensors, the input format the model expects.

* `max_new_tokens=32`: Limits the generated answer to at most 32 new tokens.

* `skip_special_tokens=True`: Drops special tokens (such as padding and end-of-sequence markers) when decoding the output IDs back into text.

### **Expected Output:**

* The script prints `India and Bangladesh`, the answer FLAN-T5 generates from the question and context.

* No QA-specific fine-tuning or hand-written extraction logic is involved; the instruction-tuned model answers directly from the prompt.

## 5. **C. End-to-End Zero-Shot QA Pipeline (Custom Function)**

Suppose you want to get an answer **and** check with NLI if it’s supported by the context:

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Step 1: Generate an answer using FLAN-T5 (or similar models like Mistral)
qa_model = "google/flan-t5-large"  # The model name for FLAN-T5 large version
tokenizer = AutoTokenizer.from_pretrained(qa_model)  # Load the tokenizer for the model
model = AutoModelForSeq2SeqLM.from_pretrained(qa_model)  # Load the pre-trained FLAN-T5 model

# Define the question and context for the model
question = "Which countries does the Ganges flow through?"  # The question to be answered
context = "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh."  # The context providing information

# Create the input prompt by combining the question and context
prompt = f"Question: {question}\nContext: {context}\nAnswer:"  # Formatting the input for the model

# Tokenize the input prompt to prepare it for the model
inputs = tokenizer(prompt, return_tensors="pt")  # Convert the prompt into tensor format suitable for the model

# Generate the answer from the model based on the tokenized input
outputs = model.generate(**inputs, max_new_tokens=32)  # Limit the output to 32 new tokens for the answer

# Decode the generated output (token IDs) into human-readable text
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Remove special tokens used by the model

# Print the generated answer
print("Generated answer:", answer)  # Output the generated answer

# Step 2: Check if the generated answer is entailed by the context using NLI (Natural Language Inference)
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # Load the NLI pipeline with BART-large-MNLI
hypothesis = answer  # The generated answer (e.g., "India and Bangladesh") serves as the candidate label

# Run zero-shot classification to check if the answer is supported by the context
result = nli(sequences=context, candidate_labels=[hypothesis])  # Evaluate the hypothesis against the context

# Print the entailment score, which indicates how much the hypothesis aligns with the context
print("Entailment score:", result["scores"][0])  # Output the entailment score (confidence level)


Generated answer: India and Bangladesh




Entailment score: 0.9881207942962646


This code performs two tasks:

1. **Generate an answer (using FLAN-T5):**

   * It combines the question and context into a prompt and feeds it to FLAN-T5 to generate an answer.

   * Example: For the question "Which countries does the Ganges flow through?" and the context above, it generates "India and Bangladesh".

2. **Check entailment (using BART-large-MNLI):**

   * It checks whether the generated answer is supported by the context using a zero-shot classification model.

   * The entailment score indicates how confident the NLI model is that the context supports the answer.

| **Function / Method**                          | **Purpose**                                                                                 |
|:---------------------------------------------|:-------------------------------------------------------------------------------------------|
| `AutoTokenizer.from_pretrained()`            | Loads the tokenizer for the model to convert text into tokens (numerical representations). |
| `AutoModelForSeq2SeqLM.from_pretrained()`    | Loads the pre-trained sequence-to-sequence language model used to generate answers.        |
| `tokenizer()`                                | Processes and converts input text into tokens that the model can understand.                |
| `model.generate()`                           | Generates a sequence of tokens (the answer) based on the processed input tokens.           |
| `tokenizer.decode()`                         | Converts the generated tokens from the model back into human-readable text.                |
| `pipeline()`                                 | A high-level utility from Hugging Face to easily load models for specific tasks.           |
| `nli()` (as part of a pipeline)              | Uses Natural Language Inference to check if an answer is supported by context.             |
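The two steps can be wrapped into one reusable function. The sketch below is an assumption about how such a wrapper might look, not code from the notebook: `answer_and_verify`, `generate_fn`, and `entail_fn` are hypothetical names, the threshold is arbitrary, and the two `fake_*` functions are stand-ins so the example runs without loading any model.

```python
from typing import Callable, Dict


def answer_and_verify(
    question: str,
    context: str,
    generate_fn: Callable[[str], str],
    entail_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Dict[str, object]:
    """Generate an answer, then score it against the context."""
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"
    answer = generate_fn(prompt).strip()
    score = entail_fn(context, answer)
    return {"answer": answer, "score": score, "supported": score >= threshold}


# Stand-in callables so the sketch runs without downloading models.
# In practice, generate_fn would wrap tokenizer + model.generate + decode,
# and entail_fn would wrap the NLI pipeline and return result["scores"][0].
def fake_generate(prompt: str) -> str:
    return "India and Bangladesh"


def fake_entail(premise: str, hypothesis: str) -> float:
    # Crude word-overlap check, used only as a placeholder for an NLI model.
    return 1.0 if all(word in premise for word in hypothesis.split()) else 0.0


out = answer_and_verify(
    "Which countries does the Ganges flow through?",
    "The Ganges is a trans-boundary river of Asia which flows through India and Bangladesh.",
    fake_generate,
    fake_entail,
)
print(out)
```

Passing the model calls in as functions keeps the generator and the verifier swappable, for example to try a different NLI checkpoint without touching the control flow.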

### **Expected Output:**

* `Generated answer`: The model generates the answer based on the given context.

* `Entailment score`: The NLI model provides a high score (e.g., 0.99) indicating that the hypothesis ("India and Bangladesh") is strongly supported by the context.
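It can help to see where such a score comes from. The sketch below mirrors, in plain Python, how I understand the zero-shot pipeline to convert NLI logits into scores: with `multi_label=True` or a single label, each hypothesis gets a softmax over its contradiction and entailment logits; otherwise the entailment logits of all hypotheses compete in one softmax. The function names are hypothetical, the logits are made up for illustration, and the (contradiction, neutral, entailment) order is an assumption matching `facebook/bart-large-mnli`.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def independent_scores(
    logits: List[Sequence[float]], contra: int = 0, entail: int = 2
) -> List[float]:
    """multi_label style: per hypothesis, softmax over (contradiction, entailment)."""
    return [softmax([row[contra], row[entail]])[1] for row in logits]


def competing_scores(logits: List[Sequence[float]], entail: int = 2) -> List[float]:
    """Single-label style: softmax over the entailment logits of all hypotheses."""
    return softmax([row[entail] for row in logits])


# Made-up logits in (contradiction, neutral, entailment) order, for illustration only.
logits = [
    [-2.0, 0.0, 3.0],  # hypothesis 1: strongly entailed
    [-1.5, 0.0, 2.5],  # hypothesis 2: strongly entailed
    [3.0, 0.0, -3.0],  # hypothesis 3: contradicted
]

print([round(s, 3) for s in independent_scores(logits)])
print([round(s, 3) for s in competing_scores(logits)])
```

The independent scores do not sum to 1, which is why several answers can score highly at once, whereas the competing scores always sum to 1 and force the hypotheses to share the probability mass.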

## 6. **D. Multilingual Zero-Shot QA**

Use models like `joeddav/xlm-roberta-large-xnli` for cross-lingual entailment:

In [4]:
from transformers import pipeline

# Load the zero-shot classification pipeline with the XLM-RoBERTa-large-XNLI model for multilingual tasks
nli = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

# Define the context in French, describing the Seine river
context = "La Seine est un fleuve français, long de 777 kilomètres."  # The context information about the Seine river

# Define the question related to the context
question = "Quel est le fleuve qui traverse Paris ?"  # The question asking about the river that crosses Paris

# Define the hypothesis which we want to check against the context
hypothesis = "La Seine traverse Paris."  # Hypothesis statement asserting that the Seine crosses Paris

# Use the NLI model to check if the hypothesis is entailed by the context
result = nli(sequences=context, candidate_labels=[hypothesis])  # Run the zero-shot classification

# Print the result, which includes the labels (hypothesis) and the corresponding entailment scores
print(result)  # Output the result showing the labels and their respective scores




{'sequence': 'La Seine est un fleuve français, long de 777 kilomètres.', 'labels': ['La Seine traverse Paris.'], 'scores': [0.70330810546875]}


This code uses the **XLM-RoBERTa-large-XNLI model** for zero-shot classification to check if a hypothesis is supported by a given context.

**XLM-RoBERTa-large-XNLI** is a cross-lingual XLM-RoBERTa model fine-tuned on multilingual NLI data (XNLI), which lets the zero-shot classification pipeline use it to classify text in multiple languages.

### **Code Explanation**

* `"zero-shot-classification"`: Specifies the task the pipeline will perform. In this case, it classifies text without any prior training on the specific labels.

* `"joeddav/xlm-roberta-large-xnli"`: Specifies the XLM-RoBERTa-large-XNLI model, which is trained for multilingual tasks (XNLI stands for Cross-lingual Natural Language Inference). It can be used to check if a hypothesis holds true for a given context across various languages.

* `hypothesis`: The statement being tested to see if it is true based on the provided context. Here, the hypothesis asserts that the Seine crosses Paris.

* `sequences=context`: The context to evaluate the hypothesis against.

* `candidate_labels=[hypothesis]`: The hypothesis that needs to be checked.

* `nli()`: The zero-shot classification model runs inference to see if the hypothesis is entailed (i.e., supported) by the context.

### **Expected Output:**

The output shows the hypothesis ("La Seine traverse Paris.") with a moderate score of about 0.70 in the run above. The context says only that the Seine is a French river 777 kilometres long and never mentions Paris, so the model cannot fully confirm the hypothesis from the context alone.

* `labels`: The hypothesis that was evaluated, here "La Seine traverse Paris."

* `scores`: The confidence score (between 0 and 1). A score near 1 means the context strongly supports the hypothesis.

## 7. **E. Model Options for Zero-Shot QA**

| Approach         | Model Examples                                      | Hugging Face Model Card                                                         |
| ---------------- | --------------------------------------------------- | ------------------------------------------------------------------------------- |
| NLI (Entailment) | facebook/bart-large-mnli, roberta-large-mnli        | [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli)              |
| Generative QA    | google/flan-t5-large, mistralai/Mistral-7B-Instruct | [flan-t5-large](https://huggingface.co/google/flan-t5-large)                    |
| Multilingual NLI | joeddav/xlm-roberta-large-xnli                      | [xlm-roberta-large-xnli](https://huggingface.co/joeddav/xlm-roberta-large-xnli) |
| Classic QA       | deepset/roberta-base-squad2 (not zero-shot)         | [squad2](https://huggingface.co/deepset/roberta-base-squad2)                    |

---