# **RAG Evaluation**
_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_

This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the accuracy of your system.

For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!

RAG systems are complex: here is a RAG diagram, where we noted in blue all possibilities for system enhancement:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance!
So let's see how to evaluate our RAG system.

### Evaluating RAG performance

Since there are so many moving parts to tune, each with a big impact on performance, benchmarking the RAG system is crucial.

For our evaluation pipeline, we will need:
1. An evaluation dataset of question-answer pairs (QA couples)
2. An evaluator to compute the accuracy of our system on the above evaluation dataset.

➡️ It turns out, we can use LLMs to help us all along the way!
1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖
2. An [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will then perform the evaluation on this synthetic dataset.
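
Before building each piece, here is a minimal sketch of how the two components above could fit together. It is an illustration under stated assumptions, not the pipeline built later in this notebook: `evaluate_rag`, `answer_fn`, `judge_fn`, and the 1-to-5 scoring scale are all hypothetical names and choices, and the toy functions stand in for a real RAG system and a real LLM judge.

```python
from typing import Callable, Dict, List


def evaluate_rag(
    qa_pairs: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], int],
    max_score: int = 5,
) -> float:
    """Return the mean judge score, rescaled from [1, max_score] to [0, 1].

    Assumption: the judge returns an integer between 1 and max_score.
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    scores = []
    for pair in qa_pairs:
        # 1. Ask the system under test to answer the question.
        generated = answer_fn(pair["question"])
        # 2. Ask the judge to grade the answer against the reference.
        score = judge_fn(pair["question"], generated, pair["answer"])
        scores.append((score - 1) / (max_score - 1))
    return sum(scores) / len(scores)


# Toy stand-ins so the sketch runs without any model.
toy_qa = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
toy_answers = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Lyon",
}


def toy_answer_fn(question: str) -> str:
    return toy_answers[question]


def toy_judge_fn(question: str, generated: str, reference: str) -> int:
    # A real judge would be an LLM call; this one only checks exact match.
    return 5 if generated == reference else 1


accuracy = evaluate_rag(toy_qa, toy_answer_fn, toy_judge_fn)
print(accuracy)
```

Keeping the system under test and the judge as plain callables means either one can be swapped out without touching the scoring loop.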

__Let's dig into it and start building our evaluation pipeline!__ First, we install the required dependencies.

In [1]:
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille
!pip install -U langchain-text-splitters langchain-community langchain
!pip install -U langchain-huggingface
!pip install -U langchain langchain-openai


In [2]:
# %reload_ext autoreload
# %autoreload 2

In [3]:
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets

pd.set_option("display.max_colwidth", None)  # Show the full text of each cell instead of truncating it

In [4]:
from huggingface_hub import notebook_login

notebook_login()


### **Load your knowledge base**

In [6]:
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

In [7]:
# Check dataset
print(ds)

Dataset({
    features: ['text', 'source'],
    num_rows: 2647
})
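
Each row has a `text` field and a `source` field, and the `source` values shown in this notebook start with the `org/repo` the document came from. As a rough sketch of how one might summarize the knowledge base, the snippet below counts documents per repository and finds the longest one. The `rows` list is a hypothetical stand-in for iterating over `ds` (the third entry is invented for illustration), and `repo_of` is a helper name introduced here, not part of the `datasets` library.

```python
from collections import Counter

# Hypothetical stand-in for rows of `ds`; each row is a dict with the
# same two keys as the dataset: "text" and "source".
rows = [
    {
        "text": "Create an Endpoint ...",
        "source": "huggingface/hf-endpoints-documentation/blob/main/docs/source/guides/create_endpoint.mdx",
    },
    {
        "text": "Choosing a metric for your task ...",
        "source": "huggingface/evaluate/blob/main/docs/source/choosing_a_metric.mdx",
    },
    {
        # Invented row, used only to show two documents from one repository.
        "text": "Short doc",
        "source": "huggingface/evaluate/blob/main/docs/source/hypothetical_page.mdx",
    },
]


def repo_of(source: str) -> str:
    """Keep the first two path components, i.e. 'org/repo'."""
    return "/".join(source.split("/")[:2])


# Number of documents contributed by each repository.
docs_per_repo = Counter(repo_of(row["source"]) for row in rows)

# The row with the longest text, measured in characters.
longest = max(rows, key=lambda row: len(row["text"]))

print(docs_per_repo.most_common())
print(len(longest["text"]), longest["source"])
```

With the real dataset, replacing `rows` with `ds` should work the same way, since iterating over a `datasets.Dataset` yields one dict per row.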


In [8]:
import pandas as pd

df_view = pd.DataFrame(ds.select(range(5)))
display(df_view)

Unnamed: 0,text,source
0,"Create an Endpoint\n\nAfter your first login, you will be directed to the [Endpoint creation page](https://ui.endpoints.huggingface.co/new). As an example, this guide will go through the steps to deploy [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) for text classification. \n\n## 1. Enter the Hugging Face Repository ID and your desired endpoint name:\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_repository.png"" alt=""select repository"" />\n\n## 2. Select your Cloud Provider and region. Initially, only AWS will be available as a Cloud Provider with the `us-east-1` and `eu-west-1` regions. We will add Azure soon, and if you need to test Endpoints with other Cloud Providers or regions, please let us know.\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_region.png"" alt=""select region"" />\n\n## 3. Define the [Security Level](security) for the Endpoint:\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_security.png"" alt=""define security"" />\n\n## 4. Create your Endpoint by clicking **Create Endpoint**. By default, your Endpoint is created with a medium CPU (2 x 4GB vCPUs with Intel Xeon Ice Lake) The cost estimate assumes the Endpoint will be up for an entire month, and does not take autoscaling into account.\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_create_cost.png"" alt=""create endpoint"" />\n\n## 5. Wait for the Endpoint to build, initialize and run which can take between 1 to 5 minutes.\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/overview.png"" alt=""overview"" />\n\n## 6. 
Test your Endpoint in the overview with the Inference widget 🏁 🎉!\n\n<img src=""https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/assets/1_inference.png"" alt=""run inference"" />\n",huggingface/hf-endpoints-documentation/blob/main/docs/source/guides/create_endpoint.mdx
1,"Choosing a metric for your task\n\n**So you've trained your model and want to see how well it‚Äôs doing on a dataset of your choice. Where do you start?**\n\nThere is no ‚Äúone size fits all‚Äù approach to choosing an evaluation metric, but some good guidelines to keep in mind are:\n\n## Categories of metrics\n\nThere are 3 high-level categories of metrics:\n\n1. *Generic metrics*, which can be applied to a variety of situations and datasets, such as precision and accuracy.\n2. *Task-specific metrics*, which are limited to a given task, such as Machine Translation (often evaluated using metrics [BLEU](https://huggingface.co/metrics/bleu) or [ROUGE](https://huggingface.co/metrics/rouge)) or Named Entity Recognition (often evaluated with [seqeval](https://huggingface.co/metrics/seqeval)).\n3. *Dataset-specific metrics*, which aim to measure model performance on specific benchmarks: for instance, the [GLUE benchmark](https://huggingface.co/datasets/glue) has a dedicated [evaluation metric](https://huggingface.co/metrics/glue).\n\nLet's look at each of these three cases:\n\n### Generic metrics\n\nMany of the metrics used in the Machine Learning community are quite generic and can be applied in a variety of tasks and datasets.\n\nThis is the case for metrics like [accuracy](https://huggingface.co/metrics/accuracy) and [precision](https://huggingface.co/metrics/precision), which can be used for evaluating labeled (supervised) datasets, as well as [perplexity](https://huggingface.co/metrics/perplexity), which can be used for evaluating different kinds of (unsupervised) generative tasks.\n\nTo see the input structure of a given metric, you can look at its metric card. 
For example, in the case of [precision](https://huggingface.co/metrics/precision), the format is:\n```\n>>> precision_metric = evaluate.load(""precision"")\n>>> results = precision_metric.compute(references=[0, 1], predictions=[0, 1])\n>>> print(results)\n{'precision': 1.0}\n```\n\n### Task-specific metrics\n\nPopular ML tasks like Machine Translation and Named Entity Recognition have specific metrics that can be used to compare models. For example, a series of different metrics have been proposed for text generation, ranging from [BLEU](https://huggingface.co/metrics/bleu) and its derivatives such as [GoogleBLEU](https://huggingface.co/metrics/google_bleu) and [GLEU](https://huggingface.co/metrics/gleu), but also [ROUGE](https://huggingface.co/metrics/rouge), [MAUVE](https://huggingface.co/metrics/mauve), etc.\n\nYou can find the right metric for your task by:\n\n- **Looking at the [Task pages](https://huggingface.co/tasks)** to see what metrics can be used for evaluating models for a given task.\n- **Checking out leaderboards** on sites like [Papers With Code](https://paperswithcode.com/) (you can search by task and by dataset).\n- **Reading the metric cards** for the relevant metrics and see which ones are a good fit for your use case. For example, see the [BLEU metric card](https://github.com/huggingface/evaluate/tree/main/metrics/bleu) or [SQuaD metric card](https://github.com/huggingface/evaluate/tree/main/metrics/squad).\n- **Looking at papers and blog posts** published on the topic and see what metrics they report. 
This can change over time, so try to pick papers from the last couple of years!\n\n### Dataset-specific metrics\n\nSome datasets have specific metrics associated with them -- this is especially in the case of popular benchmarks like [GLUE](https://huggingface.co/metrics/glue) and [SQuAD](https://huggingface.co/metrics/squad).\n\n<Tip warning={true}>\nüí°\nGLUE is actually a collection of different subsets on different tasks, so first you need to choose the one that corresponds to the NLI task, such as mnli, which is described as ‚Äúcrowdsourced collection of sentence pairs with textual entailment annotations‚Äù\n</Tip>\n\n\nIf you are evaluating your model on a benchmark dataset like the ones mentioned above, you can use its dedicated evaluation metric. Make sure you respect the format that they require. For example, to evaluate your model on the [SQuAD](https://huggingface.co/datasets/squad) dataset, you need to feed the `question` and `context` into your model and return the `prediction_text`, which should be compared with the `references` (based on matching the `id` of the question) :\n\n```\n>>> from evaluate import load\n>>> squad_metric = load(""squad"")\n>>> predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]\n>>> references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]\n>>> results = squad_metric.compute(predictions=predictions, references=references)\n>>> results\n{'exact_match': 100.0, 'f1': 100.0}\n```\n\nYou can find examples of dataset structures by consulting the ""Dataset Preview"" function or the dataset card for a given dataset, and you can see how to use its dedicated evaluation function based on the metric card.\n",huggingface/evaluate/blob/main/docs/source/choosing_a_metric.mdx
2,"‰∏ªË¶ÅÁâπÁÇπ\n\nËÆ©Êàë‰ª¨Êù•‰ªãÁªç‰∏Ä‰∏ã Gradio ÊúÄÂèóÊ¨¢ËøéÁöÑ‰∏Ä‰∫õÂäüËÉΩÔºÅËøôÈáåÊòØ Gradio ÁöÑ‰∏ªË¶ÅÁâπÁÇπÔºö\n\n1. [Ê∑ªÂä†Á§∫‰æãËæìÂÖ•](#example-inputs)\n2. [‰º†ÈÄíËá™ÂÆö‰πâÈîôËØØÊ∂àÊÅØ](#errors)\n3. [Ê∑ªÂä†ÊèèËø∞ÂÜÖÂÆπ](#descriptive-content)\n4. [ËÆæÁΩÆÊóóÊ†á](#flagging)\n5. [È¢ÑÂ§ÑÁêÜÂíåÂêéÂ§ÑÁêÜ](#preprocessing-and-postprocessing)\n6. [Ê†∑ÂºèÂåñÊºîÁ§∫](#styling)\n7. [ÊéíÈòüÁî®Êà∑](#queuing)\n8. [Ëø≠‰ª£ËæìÂá∫](#iterative-outputs)\n9. [ËøõÂ∫¶Êù°](#progress-bars)\n10. [ÊâπÂ§ÑÁêÜÂáΩÊï∞](#batch-functions)\n11. [Âú®Âçè‰ΩúÁ¨îËÆ∞Êú¨‰∏äËøêË°å](#colab-notebooks)\n\n## Á§∫‰æãËæìÂÖ•\n\nÊÇ®ÂèØ‰ª•Êèê‰æõÁî®Êà∑ÂèØ‰ª•ËΩªÊùæÂä†ËΩΩÂà∞ ""Interface"" ‰∏≠ÁöÑÁ§∫‰æãÊï∞ÊçÆ„ÄÇËøôÂØπ‰∫éÊºîÁ§∫Ê®°ÂûãÊúüÊúõÁöÑËæìÂÖ•Á±ªÂûã‰ª•ÂèäÊºîÁ§∫Êï∞ÊçÆÈõÜÂíåÊ®°Âûã‰∏ÄËµ∑Êé¢Á¥¢ÁöÑÊñπÂºèÈùûÂ∏∏ÊúâÂ∏ÆÂä©„ÄÇË¶ÅÂä†ËΩΩÁ§∫‰æãÊï∞ÊçÆÔºåÊÇ®ÂèØ‰ª•Â∞ÜÂµåÂ•óÂàóË°®Êèê‰æõÁªô Interface ÊûÑÈÄ†ÂáΩÊï∞ÁöÑ `examples=` ÂÖ≥ÈîÆÂ≠óÂèÇÊï∞„ÄÇÂ§ñÈÉ®ÂàóË°®‰∏≠ÁöÑÊØè‰∏™Â≠êÂàóË°®Ë°®Á§∫‰∏Ä‰∏™Êï∞ÊçÆÊ†∑Êú¨ÔºåÂ≠êÂàóË°®‰∏≠ÁöÑÊØè‰∏™ÂÖÉÁ¥†Ë°®Á§∫ÊØè‰∏™ËæìÂÖ•ÁªÑ‰ª∂ÁöÑËæìÂÖ•„ÄÇÊúâÂÖ≥ÊØè‰∏™ÁªÑ‰ª∂ÁöÑÁ§∫‰æãÊï∞ÊçÆÊ†ºÂºèÂú®[Docs](https://gradio.app/docs#components)‰∏≠ÊúâËØ¥Êòé„ÄÇ\n\n$code_calculator\n$demo_calculator\n\nÊÇ®ÂèØ‰ª•Â∞ÜÂ§ßÂûãÊï∞ÊçÆÈõÜÂä†ËΩΩÂà∞Á§∫‰æã‰∏≠ÔºåÈÄöËøá Gradio ÊµèËßàÂíå‰∏éÊï∞ÊçÆÈõÜËøõË°å‰∫§‰∫í„ÄÇÁ§∫‰æãÂ∞ÜËá™Âä®ÂàÜÈ°µÔºàÂèØ‰ª•ÈÄöËøá Interface ÁöÑ `examples_per_page` ÂèÇÊï∞ËøõË°åÈÖçÁΩÆÔºâ„ÄÇ\n\nÁªßÁª≠‰∫ÜËß£Á§∫‰æãÔºåËØ∑ÂèÇÈòÖ[Êõ¥Â§öÁ§∫‰æã](https://gradio.app/more-on-examples)ÊåáÂçó„ÄÇ\n\n## ÈîôËØØ\n\nÊÇ®Â∏åÊúõÂêëÁî®Êà∑‰º†ÈÄíËá™ÂÆö‰πâÈîôËØØÊ∂àÊÅØ„ÄÇ‰∏∫Ê≠§Ôºåwith `gr.Error(""custom message"")` Êù•ÊòæÁ§∫ÈîôËØØÊ∂àÊÅØ„ÄÇÂ¶ÇÊûúÂú®‰∏äÈù¢ÁöÑËÆ°ÁÆóÂô®Á§∫‰æã‰∏≠Â∞ùËØïÈô§‰ª•Èõ∂ÔºåÂ∞ÜÊòæÁ§∫Ëá™ÂÆö‰πâÈîôËØØÊ∂àÊÅØÁöÑÂºπÂá∫Ê®°ÊÄÅÁ™óÂè£„ÄÇ‰∫ÜËß£ÊúâÂÖ≥ÈîôËØØÁöÑÊõ¥Â§ö‰ø°ÊÅØÔºåËØ∑ÂèÇÈòÖ[ÊñáÊ°£](https://gradio.app/docs#error)„ÄÇ\n\n## ÊèèËø∞ÊÄßÂÜÖÂÆπ\n\nÂú®ÂâçÈù¢ÁöÑÁ§∫‰æã‰∏≠ÔºåÊÇ®ÂèØËÉΩÂ∑≤ÁªèÊ≥®ÊÑèÂà∞ Interface ÊûÑÈÄ†ÂáΩÊï∞‰∏≠ÁöÑ `title=` Âíå `description=` ÂÖ≥ÈîÆÂ≠óÂèÇÊï∞ÔºåÂ∏ÆÂä©Áî®Êà∑‰∫ÜËß£ÊÇ®ÁöÑÂ∫îÁî®Á®ãÂ∫è„ÄÇ\n\nInterface 
ÊûÑÈÄ†ÂáΩÊï∞‰∏≠Êúâ‰∏â‰∏™ÂèÇÊï∞Áî®‰∫éÊåáÂÆöÊ≠§ÂÜÖÂÆπÂ∫îÊîæÁΩÆÂú®Âì™ÈáåÔºö\n\n- `title`ÔºöÊé•ÂèóÊñáÊú¨ÔºåÂπ∂ÂèØ‰ª•Â∞ÜÂÖ∂ÊòæÁ§∫Âú®ÁïåÈù¢ÁöÑÈ°∂ÈÉ®Ôºå‰πüÂ∞ÜÊàê‰∏∫È°µÈù¢Ê†áÈ¢ò„ÄÇ\n- `description`ÔºöÊé•ÂèóÊñáÊú¨„ÄÅMarkdown Êàñ HTMLÔºåÂπ∂Â∞ÜÂÖ∂ÊîæÁΩÆÂú®Ê†áÈ¢òÊ≠£‰∏ãÊñπ„ÄÇ\n- `article`Ôºö‰πüÊé•ÂèóÊñáÊú¨„ÄÅMarkdown Êàñ HTMLÔºåÂπ∂Â∞ÜÂÖ∂ÊîæÁΩÆÂú®ÁïåÈù¢‰∏ãÊñπ„ÄÇ\n\n![annotated](/assets/guides/annotated.png)\n\nÂ¶ÇÊûúÊÇ®‰ΩøÁî®ÁöÑÊòØ `Blocks` APIÔºåÂàôÂèØ‰ª• with `gr.Markdown(...)` Êàñ `gr.HTML(...)` ÁªÑ‰ª∂Âú®‰ªª‰Ωï‰ΩçÁΩÆÊèíÂÖ•ÊñáÊú¨„ÄÅMarkdown Êàñ HTMLÔºåÂÖ∂‰∏≠ÊèèËø∞ÊÄßÂÜÖÂÆπ‰Ωç‰∫é `Component` ÊûÑÈÄ†ÂáΩÊï∞ÂÜÖÈÉ®„ÄÇ\n\nÂè¶‰∏Ä‰∏™ÊúâÁî®ÁöÑÂÖ≥ÈîÆÂ≠óÂèÇÊï∞ÊòØ `label=`ÔºåÂÆÉÂ≠òÂú®‰∫éÊØè‰∏™ `Component` ‰∏≠„ÄÇËøô‰øÆÊîπ‰∫ÜÊØè‰∏™ `Component` È°∂ÈÉ®ÁöÑÊ†áÁ≠æÊñáÊú¨„ÄÇËøòÂèØ‰ª•‰∏∫ËØ∏Â¶Ç `Textbox` Êàñ `Radio` ‰πãÁ±ªÁöÑË°®ÂçïÂÖÉÁ¥†Ê∑ªÂä† `info=` ÂÖ≥ÈîÆÂ≠óÂèÇÊï∞Ôºå‰ª•Êèê‰æõÊúâÂÖ≥ÂÖ∂Áî®Ê≥ïÁöÑËøõ‰∏ÄÊ≠•‰ø°ÊÅØ„ÄÇ\n\n```python\ngr.Number(label='Âπ¥ÈæÑ', info='‰ª•Âπ¥‰∏∫Âçï‰ΩçÔºåÂøÖÈ°ªÂ§ß‰∫é0')\n```\n\n## ÊóóÊ†á\n\nÈªòËÆ§ÊÉÖÂÜµ‰∏ãÔºå""Interface"" Â∞ÜÊúâ‰∏Ä‰∏™ ""Flag"" ÊåâÈíÆ„ÄÇÂΩìÁî®Êà∑ÊµãËØïÊÇ®ÁöÑ `Interface` Êó∂ÔºåÂ¶ÇÊûúÁúãÂà∞ÊúâË∂£ÁöÑËæìÂá∫Ôºå‰æãÂ¶ÇÈîôËØØÊàñÊÑèÂ§ñÁöÑÊ®°ÂûãË°å‰∏∫Ôºå‰ªñ‰ª¨ÂèØ‰ª•Â∞ÜËæìÂÖ•Ê†áËÆ∞‰∏∫ÊÇ®ËøõË°åÊü•Áúã„ÄÇÂú®Áî± `Interface` ÊûÑÈÄ†ÂáΩÊï∞ÁöÑ `flagging_dir=` ÂèÇÊï∞Êèê‰æõÁöÑÁõÆÂΩï‰∏≠ÔºåÂ∞ÜËÆ∞ÂΩïÊ†áËÆ∞ÁöÑËæìÂÖ•Âà∞‰∏Ä‰∏™ CSV Êñá‰ª∂‰∏≠„ÄÇÂ¶ÇÊûúÁïåÈù¢Ê∂âÂèäÊñá‰ª∂Êï∞ÊçÆÔºå‰æãÂ¶ÇÂõæÂÉèÂíåÈü≥È¢ëÁªÑ‰ª∂ÔºåÂ∞ÜÂàõÂª∫Êñá‰ª∂Â§πÊù•Â≠òÂÇ®Ëøô‰∫õÊ†áËÆ∞ÁöÑÊï∞ÊçÆ„ÄÇ\n\n‰æãÂ¶ÇÔºåÂØπ‰∫é‰∏äÈù¢ÊòæÁ§∫ÁöÑËÆ°ÁÆóÂô®ÁïåÈù¢ÔºåÊàë‰ª¨Â∞ÜÂú®‰∏ãÈù¢ÁöÑÊóóÊ†áÁõÆÂΩï‰∏≠Â≠òÂÇ®Ê†áËÆ∞ÁöÑÊï∞ÊçÆÔºö\n\n```directory\n+-- calculator.py\n+-- flagged/\n| +-- logs.csv\n```\n\n_flagged/logs.csv_\n\n```csv\nnum1,operation,num2,Output\n5,add,7,12\n6,subtract,1.5,4.5\n```\n\n‰∏éÊó©ÊúüÊòæÁ§∫ÁöÑÂÜ∑Ëâ≤ÁïåÈù¢Áõ∏ÂØπÂ∫îÔºåÊàë‰ª¨Â∞ÜÂú®‰∏ãÈù¢ÁöÑÊóóÊ†áÁõÆÂΩï‰∏≠Â≠òÂÇ®Ê†áËÆ∞ÁöÑÊï∞ÊçÆÔºö\n\n```directory\n+-- sepia.py\n+-- flagged/\n| +-- logs.csv\n| +-- im/\n| | +-- 0.png\n| | +-- 1.png\n| +-- Output/\n| | +-- 0.png\n| | +-- 
1.png\n```\n\n_flagged/logs.csv_\n\n```csv\nim,Output\nim/0.png,Output/0.png\nim/1.png,Output/1.png\n```\n\nÂ¶ÇÊûúÊÇ®Â∏åÊúõÁî®Êà∑Êèê‰æõÊóóÊ†áÂéüÂõ†ÔºåÂèØ‰ª•Â∞ÜÂ≠óÁ¨¶‰∏≤ÂàóË°®‰º†ÈÄíÁªô Interface ÁöÑ `flagging_options` ÂèÇÊï∞„ÄÇÁî®Êà∑Âú®ËøõË°åÊóóÊ†áÊó∂ÂøÖÈ°ªÈÄâÊã©ÂÖ∂‰∏≠‰∏Ä‰∏™Â≠óÁ¨¶‰∏≤ÔºåËøôÂ∞Ü‰Ωú‰∏∫ÈôÑÂä†Âàó‰øùÂ≠òÂà∞ CSV ‰∏≠„ÄÇ\n\n## È¢ÑÂ§ÑÁêÜÂíåÂêéÂ§ÑÁêÜ (Preprocessing and Postprocessing)\n\n![annotated](/assets/img/dataflow.svg)\n\nÂ¶ÇÊÇ®ÊâÄËßÅÔºåGradio ÂåÖÊã¨ÂèØ‰ª•Â§ÑÁêÜÂêÑÁßç‰∏çÂêåÊï∞ÊçÆÁ±ªÂûãÁöÑÁªÑ‰ª∂Ôºå‰æãÂ¶ÇÂõæÂÉè„ÄÅÈü≥È¢ëÂíåËßÜÈ¢ë„ÄÇÂ§ßÂ§öÊï∞ÁªÑ‰ª∂ÈÉΩÂèØ‰ª•Áî®‰ΩúËæìÂÖ•ÊàñËæìÂá∫„ÄÇ\n\nÂΩìÁªÑ‰ª∂Áî®‰ΩúËæìÂÖ•Êó∂ÔºåGradio Ëá™Âä®Â§ÑÁêÜ*È¢ÑÂ§ÑÁêÜ*ÔºåÂ∞ÜÊï∞ÊçÆ‰ªéÁî®Êà∑ÊµèËßàÂô®ÂèëÈÄÅÁöÑÁ±ªÂûãÔºà‰æãÂ¶ÇÁΩëÁªúÊëÑÂÉèÂ§¥Âø´ÁÖßÁöÑ base64 Ë°®Á§∫ÔºâËΩ¨Êç¢‰∏∫ÊÇ®ÁöÑÂáΩÊï∞ÂèØ‰ª•Êé•ÂèóÁöÑÂΩ¢ÂºèÔºà‰æãÂ¶Ç `numpy` Êï∞ÁªÑÔºâ„ÄÇ\n\nÂêåÊ†∑ÔºåÂΩìÁªÑ‰ª∂Áî®‰ΩúËæìÂá∫Êó∂ÔºåGradio Ëá™Âä®Â§ÑÁêÜ*ÂêéÂ§ÑÁêÜ*ÔºåÂ∞ÜÊï∞ÊçÆ‰ªéÂáΩÊï∞ËøîÂõûÁöÑÂΩ¢ÂºèÔºà‰æãÂ¶ÇÂõæÂÉèË∑ØÂæÑÂàóË°®ÔºâËΩ¨Êç¢‰∏∫ÂèØ‰ª•Âú®Áî®Êà∑ÊµèËßàÂô®‰∏≠ÊòæÁ§∫ÁöÑÂΩ¢ÂºèÔºà‰æãÂ¶Ç‰ª• base64 Ê†ºÂºèÊòæÁ§∫ÂõæÂÉèÁöÑ `Gallery`Ôºâ„ÄÇ\n\nÊÇ®ÂèØ‰ª•‰ΩøÁî®ÊûÑÂª∫ÂõæÂÉèÁªÑ‰ª∂Êó∂ÁöÑÂèÇÊï∞ÊéßÂà∂*È¢ÑÂ§ÑÁêÜ*„ÄÇ‰æãÂ¶ÇÔºåÂ¶ÇÊûúÊÇ®‰ΩøÁî®‰ª•‰∏ãÂèÇÊï∞ÂÆû‰æãÂåñ `Image` ÁªÑ‰ª∂ÔºåÂÆÉÂ∞ÜÂ∞ÜÂõæÂÉèËΩ¨Êç¢‰∏∫ `PIL` Á±ªÂûãÔºåÂπ∂Â∞ÜÂÖ∂ÈáçÂ°ë‰∏∫`(100, 100)`ÔºåËÄå‰∏çÁÆ°Êèê‰∫§Êó∂ÁöÑÂéüÂßãÂ§ßÂ∞èÂ¶Ç‰ΩïÔºö\n\n```py\nimg = gr.Image(shape=(100, 100), type=""pil"")\n```\n\nÁõ∏ÂèçÔºåËøôÈáåÊàë‰ª¨‰øùÁïôÂõæÂÉèÁöÑÂéüÂßãÂ§ßÂ∞èÔºå‰ΩÜÂú®Â∞ÜÂÖ∂ËΩ¨Êç¢‰∏∫ numpy Êï∞ÁªÑ‰πãÂâçÂèçËΩ¨È¢úËâ≤Ôºö\n\n```py\nimg = gr.Image(invert_colors=True, type=""numpy"")\n```\n\nÂêéÂ§ÑÁêÜË¶ÅÂÆπÊòìÂæóÂ§öÔºÅGradio Ëá™Âä®ËØÜÂà´ËøîÂõûÊï∞ÊçÆÁöÑÊ†ºÂºèÔºà‰æãÂ¶Ç `Image` ÊòØ `numpy` Êï∞ÁªÑËøòÊòØ `str` Êñá‰ª∂Ë∑ØÂæÑÔºüÔºâÔºåÂπ∂Â∞ÜÂÖ∂ÂêéÂ§ÑÁêÜ‰∏∫ÂèØ‰ª•Áî±ÊµèËßàÂô®ÊòæÁ§∫ÁöÑÊ†ºÂºè„ÄÇ\n\nËØ∑Êü•Áúã[ÊñáÊ°£](https://gradio.app/docs)Ôºå‰∫ÜËß£ÊØè‰∏™ÁªÑ‰ª∂ÁöÑÊâÄÊúâ‰∏éÈ¢ÑÂ§ÑÁêÜÁõ∏ÂÖ≥ÁöÑÂèÇÊï∞„ÄÇ\n\n## Ê†∑Âºè (Styling)\n\nGradio 
‰∏ªÈ¢òÊòØËá™ÂÆö‰πâÂ∫îÁî®Á®ãÂ∫èÂ§ñËßÇÂíåÊÑüËßâÁöÑÊúÄÁÆÄÂçïÊñπÊ≥ï„ÄÇÊÇ®ÂèØ‰ª•ÈÄâÊã©Â§öÁßç‰∏ªÈ¢òÊàñÂàõÂª∫Ëá™Â∑±ÁöÑ‰∏ªÈ¢ò„ÄÇË¶ÅËøôÊ†∑ÂÅöÔºåËØ∑Â∞Ü `theme=` ÂèÇÊï∞‰º†ÈÄíÁªô `Interface` ÊûÑÈÄ†ÂáΩÊï∞„ÄÇ‰æãÂ¶ÇÔºö\n\n```python\ndemo = gr.Interface(..., theme=gr.themes.Monochrome())\n```\n\nGradio Â∏¶Êúâ‰∏ÄÁªÑÈ¢ÑÂÖàÊûÑÂª∫ÁöÑ‰∏ªÈ¢òÔºåÊÇ®ÂèØ‰ª•‰ªé `gr.themes.*` Âä†ËΩΩ„ÄÇÊÇ®ÂèØ‰ª•Êâ©Â±ïËøô‰∫õ‰∏ªÈ¢òÊàñ‰ªéÂ§¥ÂºÄÂßãÂàõÂª∫Ëá™Â∑±ÁöÑ‰∏ªÈ¢ò - ÊúâÂÖ≥Êõ¥Â§öËØ¶ÁªÜ‰ø°ÊÅØÔºåËØ∑ÂèÇÈòÖ[‰∏ªÈ¢òÊåáÂçó](https://gradio.app/theming-guide)„ÄÇ\n\nË¶ÅÂ¢ûÂä†È¢ùÂ§ñÁöÑÊ†∑ÂºèËÉΩÂäõÔºåÊÇ®ÂèØ‰ª• with `css=` ÂÖ≥ÈîÆÂ≠óÂ∞Ü‰ªª‰Ωï CSS ‰º†ÈÄíÁªôÊÇ®ÁöÑÂ∫îÁî®Á®ãÂ∫è„ÄÇ\nGradio Â∫îÁî®Á®ãÂ∫èÁöÑÂü∫Á±ªÊòØ `gradio-container`ÔºåÂõ†Ê≠§‰ª•‰∏ãÊòØ‰∏Ä‰∏™Êõ¥Êîπ Gradio Â∫îÁî®Á®ãÂ∫èËÉåÊôØÈ¢úËâ≤ÁöÑÁ§∫‰æãÔºö\n\n```python\nwith `gr.Interface(css="".gradio-container {background-color: red}"") as demo:\n ...\n```\n\n## ÈòüÂàó (Queuing)\n\nÂ¶ÇÊûúÊÇ®ÁöÑÂ∫îÁî®Á®ãÂ∫èÈ¢ÑËÆ°‰ºöÊúâÂ§ßÈáèÊµÅÈáèÔºåËØ∑ with `queue()` ÊñπÊ≥ïÊù•ÊéßÂà∂Â§ÑÁêÜÈÄüÁéá„ÄÇËøôÂ∞ÜÊéíÈòüÂ§ÑÁêÜË∞ÉÁî®ÔºåÂõ†Ê≠§‰∏ÄÊ¨°Âè™Â§ÑÁêÜ‰∏ÄÂÆöÊï∞ÈáèÁöÑËØ∑Ê±Ç„ÄÇÈòüÂàó‰ΩøÁî® WebsocketsÔºåËøòÂèØ‰ª•Èò≤Ê≠¢ÁΩëÁªúË∂ÖÊó∂ÔºåÂõ†Ê≠§Â¶ÇÊûúÊÇ®ÁöÑÂáΩÊï∞ÁöÑÊé®ÁêÜÊó∂Èó¥ÂæàÈïøÔºà> 1 ÂàÜÈíüÔºâÔºåÂ∫î‰ΩøÁî®ÈòüÂàó„ÄÇ\n\nwith `Interface`Ôºö\n\n```python\ndemo = gr.Interface(...).queue()\ndemo.launch()\n```\n\nwith `Blocks`Ôºö\n\n```python\nwith gr.Blocks() as demoÔºö\n #...\ndemo.queue()\ndemo.launch()\n```\n\nÊÇ®ÂèØ‰ª•ÈÄöËøá‰ª•‰∏ãÊñπÂºèÊéßÂà∂‰∏ÄÊ¨°Â§ÑÁêÜÁöÑËØ∑Ê±ÇÊï∞ÈáèÔºö\n\n```python\ndemo.queue(concurrency_count=3)\n```\n\nÊü•ÁúãÊúâÂÖ≥ÈÖçÁΩÆÂÖ∂‰ªñÈòüÂàóÂèÇÊï∞ÁöÑ[ÈòüÂàóÊñáÊ°£](/docs/#queue)„ÄÇ\n\nÂú® Blocks ‰∏≠ÊåáÂÆö‰ªÖÂØπÊüê‰∫õÂáΩÊï∞ËøõË°åÊéíÈòüÔºö\n\n```python\nwith gr.Blocks() as demo2Ôºö\n num1 = gr.Number()\n num2 = gr.Number()\n output = gr.Number()\n gr.Button(""Add"").click(\n lambda a, b: a + b, [num1, num2], output)\n gr.Button(""Multiply"").click(\n lambda a, b: a * b, [num1, num2], output, queue=True)\ndemo2.launch()\n```\n\n## Ëø≠‰ª£ËæìÂá∫ (Iterative 
Outputs)\n\nÂú®Êüê‰∫õÊÉÖÂÜµ‰∏ãÔºåÊÇ®ÂèØËÉΩÈúÄË¶Å‰º†Ëæì‰∏ÄÁ≥ªÂàóËæìÂá∫ËÄå‰∏çÊòØ‰∏ÄÊ¨°ÊòæÁ§∫Âçï‰∏™ËæìÂá∫„ÄÇ‰æãÂ¶ÇÔºåÊÇ®ÂèØËÉΩÊúâ‰∏Ä‰∏™ÂõæÂÉèÁîüÊàêÊ®°ÂûãÔºåÂ∏åÊúõÊòæÁ§∫ÁîüÊàêÁöÑÊØè‰∏™Ê≠•È™§ÁöÑÂõæÂÉèÔºåÁõ¥Âà∞ÊúÄÁªàÂõæÂÉè„ÄÇÊàñËÄÖÊÇ®ÂèØËÉΩÊúâ‰∏Ä‰∏™ËÅäÂ§©Êú∫Âô®‰∫∫ÔºåÂÆÉÈÄêÂ≠óÈÄêÂè•Âú∞ÊµÅÂºè‰º†ËæìÂìçÂ∫îÔºåËÄå‰∏çÊòØ‰∏ÄÊ¨°ËøîÂõûÂÖ®ÈÉ®ÂìçÂ∫î„ÄÇ\n\nÂú®ËøôÁßçÊÉÖÂÜµ‰∏ãÔºåÊÇ®ÂèØ‰ª•Â∞Ü**ÁîüÊàêÂô®**ÂáΩÊï∞Êèê‰æõÁªô GradioÔºåËÄå‰∏çÊòØÂ∏∏ËßÑÂáΩÊï∞„ÄÇÂú® Python ‰∏≠ÂàõÂª∫ÁîüÊàêÂô®ÈùûÂ∏∏ÁÆÄÂçïÔºöÂáΩÊï∞‰∏çÂ∫îËØ•Êúâ‰∏Ä‰∏™ÂçïÁã¨ÁöÑ `return` ÂÄºÔºåËÄåÊòØÂ∫îËØ• with `yield` ËøûÁª≠ËøîÂõû‰∏ÄÁ≥ªÂàóÂÄº„ÄÇÈÄöÂ∏∏Ôºå`yield` ËØ≠Âè•ÊîæÁΩÆÂú®ÊüêÁßçÂæ™ÁéØ‰∏≠„ÄÇ‰∏ãÈù¢ÊòØ‰∏Ä‰∏™ÁÆÄÂçïÁ§∫‰æãÔºåÁîüÊàêÂô®Âè™ÊòØÁÆÄÂçïËÆ°Êï∞Âà∞ÁªôÂÆöÊï∞Â≠óÔºö\n\n```python\ndef my_generator(x):\n for i in range(x):\n yield i\n```\n\nÊÇ®‰ª•‰∏éÂ∏∏ËßÑÂáΩÊï∞Áõ∏ÂêåÁöÑÊñπÂºèÂ∞ÜÁîüÊàêÂô®Êèê‰æõÁªô Gradio„ÄÇ‰æãÂ¶ÇÔºåËøôÊòØ‰∏Ä‰∏™ÔºàËôöÊãüÁöÑÔºâÂõæÂÉèÁîüÊàêÊ®°ÂûãÔºåÂÆÉÂú®ËæìÂá∫ÂõæÂÉè‰πãÂâçÁîüÊàêÊï∞‰∏™Ê≠•È™§ÁöÑÂô™Èü≥Ôºö\n\n$code_fake_diffusion\n$demo_fake_diffusion\n\nËØ∑Ê≥®ÊÑèÔºåÊàë‰ª¨Âú®Ëø≠‰ª£Âô®‰∏≠Ê∑ªÂä†‰∫Ü `time.sleep(1)`Ôºå‰ª•ÂàõÂª∫Ê≠•È™§‰πãÈó¥ÁöÑ‰∫∫Â∑•ÊöÇÂÅúÔºå‰ª•‰æøÊÇ®ÂèØ‰ª•ËßÇÂØüËø≠‰ª£Âô®ÁöÑÊ≠•È™§ÔºàÂú®ÁúüÂÆûÁöÑÂõæÂÉèÁîüÊàêÊ®°Âûã‰∏≠ÔºåËøôÂèØËÉΩÊòØ‰∏çÂøÖË¶ÅÁöÑÔºâ„ÄÇ\n\nÂ∞ÜÁîüÊàêÂô®Êèê‰æõÁªô Gradio **ÈúÄË¶Å**Âú®Â∫ïÂ±Ç Interface Êàñ Blocks ‰∏≠ÂêØÁî®ÈòüÂàóÔºàËØ∑ÂèÇÈòÖ‰∏äÈù¢ÁöÑÈòüÂàóÈÉ®ÂàÜÔºâ„ÄÇ\n\n## ËøõÂ∫¶Êù°\n\nGradio ÊîØÊåÅÂàõÂª∫Ëá™ÂÆö‰πâËøõÂ∫¶Êù°Ôºå‰ª•‰æøÊÇ®ÂèØ‰ª•Ëá™ÂÆö‰πâÂíåÊéßÂà∂ÂêëÁî®Êà∑ÊòæÁ§∫ÁöÑËøõÂ∫¶Êõ¥Êñ∞„ÄÇË¶ÅÂêØÁî®Ê≠§ÂäüËÉΩÔºåÂè™ÈúÄ‰∏∫ÊñπÊ≥ïÊ∑ªÂä†‰∏Ä‰∏™ÈªòËÆ§ÂÄº‰∏∫ `gr.Progress` ÂÆû‰æãÁöÑÂèÇÊï∞Âç≥ÂèØ„ÄÇÁÑ∂ÂêéÔºåÊÇ®ÂèØ‰ª•Áõ¥Êé•Ë∞ÉÁî®Ê≠§ÂÆû‰æãÂπ∂‰º†ÂÖ• 0 Âà∞ 1 ‰πãÈó¥ÁöÑÊµÆÁÇπÊï∞Êù•Êõ¥Êñ∞ËøõÂ∫¶Á∫ßÂà´ÔºåÊàñËÄÖ with `Progress` ÂÆû‰æãÁöÑ `tqdm()` ÊñπÊ≥ïÊù•Ë∑üË∏™ÂèØËø≠‰ª£ÂØπË±°‰∏äÁöÑËøõÂ∫¶ÔºåÂ¶Ç‰∏ãÊâÄÁ§∫„ÄÇÂøÖÈ°ªÂêØÁî®ÈòüÂàó‰ª•ËøõË°åËøõÂ∫¶Êõ¥Êñ∞„ÄÇ\n\n$code_progress_simple\n$demo_progress_simple\n\nÂ¶ÇÊûúÊÇ® with `tqdm` Â∫ìÔºåÂπ∂‰∏îÂ∏åÊúõ‰ªéÂáΩÊï∞ÂÜÖÈÉ®ÁöÑ‰ªª‰Ωï `tqdm.tqdm` Ëá™Âä®Êä•ÂëäËøõÂ∫¶Êõ¥Êñ∞ÔºåËØ∑Â∞ÜÈªòËÆ§ÂèÇÊï∞ËÆæÁΩÆ‰∏∫ 
`gr.Progress(track_tqdm=True)`ÔºÅ\n\n## ÊâπÂ§ÑÁêÜÂáΩÊï∞ (Batch Functions)\n\nGradio ÊîØÊåÅ‰º†ÈÄí*ÊâπÂ§ÑÁêÜ*ÂáΩÊï∞„ÄÇÊâπÂ§ÑÁêÜÂáΩÊï∞Âè™ÊòØÊé•ÂèóËæìÂÖ•ÂàóË°®Âπ∂ËøîÂõûÈ¢ÑÊµãÂàóË°®ÁöÑÂáΩÊï∞„ÄÇ\n\n‰æãÂ¶ÇÔºåËøôÊòØ‰∏Ä‰∏™ÊâπÂ§ÑÁêÜÂáΩÊï∞ÔºåÂÆÉÊé•Âèó‰∏§‰∏™ËæìÂÖ•ÂàóË°®Ôºà‰∏Ä‰∏™ÂçïËØçÂàóË°®Âíå‰∏Ä‰∏™Êï¥Êï∞ÂàóË°®ÔºâÔºåÂπ∂ËøîÂõû‰øÆÂâ™ËøáÁöÑÂçïËØçÂàóË°®‰Ωú‰∏∫ËæìÂá∫Ôºö\n\n```python\nimport time\n\ndef trim_words(words, lens):\n trimmed_words = []\n time.sleep(5)\n for w, l in zip(words, lens):\n trimmed_words.append(w[:int(l)])\n return [trimmed_words]\n for w, l in zip(words, lens):\n```\n\n‰ΩøÁî®ÊâπÂ§ÑÁêÜÂáΩÊï∞ÁöÑ‰ºòÁÇπÊòØÔºåÂ¶ÇÊûúÂêØÁî®‰∫ÜÈòüÂàóÔºåGradio ÊúçÂä°Âô®ÂèØ‰ª•Ëá™Âä®*ÊâπÂ§ÑÁêÜ*‰º†ÂÖ•ÁöÑËØ∑Ê±ÇÂπ∂Âπ∂Ë°åÂ§ÑÁêÜÂÆÉ‰ª¨Ôºå‰ªéËÄåÂèØËÉΩÂä†Âø´ÊºîÁ§∫ÈÄüÂ∫¶„ÄÇ‰ª•‰∏ãÊòØ Gradio ‰ª£Á†ÅÁöÑÁ§∫‰æãÔºàËØ∑Ê≥®ÊÑè `batch=True` Âíå `max_batch_size=16` - Ëøô‰∏§‰∏™ÂèÇÊï∞ÈÉΩÂèØ‰ª•‰º†ÈÄíÁªô‰∫ã‰ª∂Ëß¶ÂèëÂô®Êàñ `Interface` Á±ªÔºâ\n\nwith `Interface`Ôºö\n\n```python\ndemo = gr.Interface(trim_words, [""textbox"", ""number""], [""output""],\n batch=True, max_batch_size=16)\ndemo.queue()\ndemo.launch()\n```\n\nwith `Blocks`Ôºö\n\n```python\nimport gradio as gr\n\nwith gr.Blocks() as demo:\n with gr.Row():\n word = gr.Textbox(label=""word"")\n leng = gr.Number(label=""leng"")\n output = gr.Textbox(label=""Output"")\n with gr.Row():\n run = gr.Button()\n\n event = run.click(trim_words, [word, leng], output, batch=True, max_batch_size=16)\n\ndemo.queue()\ndemo.launch()\n```\n\nÂú®‰∏äÈù¢ÁöÑÁ§∫‰æã‰∏≠ÔºåÂèØ‰ª•Âπ∂Ë°åÂ§ÑÁêÜ 16 ‰∏™ËØ∑Ê±ÇÔºàÊÄªÊé®ÁêÜÊó∂Èó¥‰∏∫ 5 ÁßíÔºâÔºåËÄå‰∏çÊòØÂàÜÂà´Â§ÑÁêÜÊØè‰∏™ËØ∑Ê±ÇÔºàÊÄªÊé®ÁêÜÊó∂Èó¥‰∏∫ 80 ÁßíÔºâ„ÄÇËÆ∏Â§ö Hugging Face ÁöÑ `transformers` Âíå `diffusers` Ê®°ÂûãÂú® Gradio ÁöÑÊâπÂ§ÑÁêÜÊ®°Âºè‰∏ãËá™ÁÑ∂Â∑•‰ΩúÔºöËøôÊòØ[‰ΩøÁî®ÊâπÂ§ÑÁêÜÁîüÊàêÂõæÂÉèÁöÑÁ§∫‰æãÊºîÁ§∫](https://github.com/gradio-app/gradio/blob/main/demo/diffusers_with_batching/run.py)\n\nÊ≥®ÊÑèÔºö‰ΩøÁî® Gradio ÁöÑÊâπÂ§ÑÁêÜÂáΩÊï∞ **requires** Âú®Â∫ïÂ±Ç Interface Êàñ Blocks 
‰∏≠ÂêØÁî®ÈòüÂàóÔºàËØ∑ÂèÇÈòÖ‰∏äÈù¢ÁöÑÈòüÂàóÈÉ®ÂàÜÔºâ„ÄÇ\n\n## Gradio Á¨îËÆ∞Êú¨ (Colab Notebooks)\n\nGradio ÂèØ‰ª•Âú®‰ªª‰ΩïËøêË°å Python ÁöÑÂú∞ÊñπËøêË°åÔºåÂåÖÊã¨Êú¨Âú∞ Jupyter Á¨îËÆ∞Êú¨ÂíåÂçè‰ΩúÁ¨îËÆ∞Êú¨ÔºåÂ¶Ç[Google Colab](https://colab.research.google.com/)„ÄÇÂØπ‰∫éÊú¨Âú∞ Jupyter Á¨îËÆ∞Êú¨Âíå Google Colab Á¨îËÆ∞Êú¨ÔºåGradio Âú®Êú¨Âú∞ÊúçÂä°Âô®‰∏äËøêË°åÔºåÊÇ®ÂèØ‰ª•Âú®ÊµèËßàÂô®‰∏≠‰∏é‰πã‰∫§‰∫í„ÄÇÔºàÊ≥®ÊÑèÔºöÂØπ‰∫é Google ColabÔºåËøôÊòØÈÄöËøá[ÊúçÂä°Â∑•‰ΩúÂô®ÈößÈÅì](https://github.com/tensorflow/tensorboard/blob/master/docs/design/colab_integration.md)ÂÆûÁé∞ÁöÑÔºåÊÇ®ÁöÑÊµèËßàÂô®ÈúÄË¶ÅÂêØÁî® cookies„ÄÇÔºâÂØπ‰∫éÂÖ∂‰ªñËøúÁ®ãÁ¨îËÆ∞Êú¨ÔºåGradio ‰πüÂ∞ÜÂú®ÊúçÂä°Âô®‰∏äËøêË°åÔºå‰ΩÜÊÇ®ÈúÄË¶Å‰ΩøÁî®[SSH ÈößÈÅì](https://coderwall.com/p/ohk6cg/remote-access-to-ipython-notebooks-via-ssh)Âú®Êú¨Âú∞ÊµèËßàÂô®‰∏≠Êü•ÁúãÂ∫îÁî®Á®ãÂ∫è„ÄÇÈÄöÂ∏∏ÔºåÊõ¥ÁÆÄÂçïÁöÑÈÄâÊã©ÊòØ‰ΩøÁî® Gradio ÂÜÖÁΩÆÁöÑÂÖ¨ÂÖ±ÈìæÊé•Ôºå[Âú®‰∏ã‰∏ÄÁØáÊåáÂçó‰∏≠ËÆ®ËÆ∫](/sharing-your-app/#sharing-demos)„ÄÇ\n",gradio-app/gradio/blob/main/guides/cn/01_getting-started/02_key-features.md
3,"!--Copyright 2023 The HuggingFace Team. All rights reserved.\n\nLicensed under the Apache License, Version 2.0 (the ""License""); you may not use this file except in compliance with\nthe License. You may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\nan ""AS IS"" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\n\n‚ö†Ô∏è Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\nrendered properly in your Markdown viewer.\n\n-->\n\n# Training on TPU with TensorFlow\n\n<Tip>\n\nIf you don't need long explanations and just want TPU code samples to get started with, check out [our TPU example notebook!](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)\n\n</Tip>\n\n### What is a TPU?\n\nA TPU is a **Tensor Processing Unit.** They are hardware designed by Google, which are used to greatly speed up the tensor computations within neural networks, much like GPUs. They can be used for both network training and inference. They are generally accessed through Google‚Äôs cloud services, but small TPUs can also be accessed directly for free through Google Colab and Kaggle Kernels.\n\nBecause [all TensorFlow models in ü§ó Transformers are Keras models](https://huggingface.co/blog/tensorflow-philosophy), most of the methods in this document are generally applicable to TPU training for any Keras model! However, there are a few points that are specific to the HuggingFace ecosystem (hug-o-system?) of Transformers and Datasets, and we‚Äôll make sure to flag them up when we get to them.\n\n### What kinds of TPU are available?\n\nNew users are often very confused by the range of TPUs, and the different ways to access them. 
The first key distinction to understand is the difference between **TPU Nodes** and **TPU VMs.**\n\nWhen you use a **TPU Node**, you are effectively indirectly accessing a remote TPU. You will need a separate VM, which will initialize your network and data pipeline and then forward them to the remote node. When you use a TPU on Google Colab, you are accessing it in the **TPU Node** style.\n\nUsing TPU Nodes can have some quite unexpected behaviour for people who aren‚Äôt used to them! In particular, because the TPU is located on a physically different system to the machine you‚Äôre running your Python code on, your data cannot be local to your machine - any data pipeline that loads from your machine‚Äôs internal storage will totally fail! Instead, data must be stored in Google Cloud Storage where your data pipeline can still access it, even when the pipeline is running on the remote TPU node.\n\n<Tip>\n\nIf you can fit all your data in memory as `np.ndarray` or `tf.Tensor`, then you can `fit()` on that data even when using Colab or a TPU Node, without needing to upload it to Google Cloud Storage.\n\n</Tip>\n\n<Tip>\n\n**ü§óSpecific Hugging Face Tipü§ó:** The methods `Dataset.to_tf_dataset()` and its higher-level wrapper `model.prepare_tf_dataset()` , which you will see throughout our TF code examples, will both fail on a TPU Node. The reason for this is that even though they create a `tf.data.Dataset` it is not a ‚Äúpure‚Äù `tf.data` pipeline and uses `tf.numpy_function` or `Dataset.from_generator()` to stream data from the underlying HuggingFace `Dataset`. This HuggingFace `Dataset` is backed by data that is on a local disc and which the remote TPU Node will not be able to read.\n\n</Tip>\n\nThe second way to access a TPU is via a **TPU VM.** When using a TPU VM, you connect directly to the machine that the TPU is attached to, much like training on a GPU VM. TPU VMs are generally easier to work with, particularly when it comes to your data pipeline. 
All of the above warnings do not apply to TPU VMs!\n\nThis is an opinionated document, so here‚Äôs our opinion: **Avoid using TPU Node if possible.** It is more confusing and more difficult to debug than TPU VMs. It is also likely to be unsupported in future - Google‚Äôs latest TPU, TPUv4, can only be accessed as a TPU VM, which suggests that TPU Nodes are increasingly going to become a ‚Äúlegacy‚Äù access method. However, we understand that the only free TPU access is on Colab and Kaggle Kernels, which uses TPU Node - so we‚Äôll try to explain how to handle it if you have to! Check the [TPU example notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) for code samples that explain this in more detail.\n\n### What sizes of TPU are available?\n\nA single TPU (a v2-8/v3-8/v4-8) runs 8 replicas. TPUs exist in **pods** that can run hundreds or thousands of replicas simultaneously. When you use more than a single TPU but less than a whole pod (for example, a v3-32), your TPU fleet is referred to as a **pod slice.**\n\nWhen you access a free TPU via Colab, you generally get a single v2-8 TPU.\n\n### I keep hearing about this XLA thing. What‚Äôs XLA, and how does it relate to TPUs?\n\nXLA is an optimizing compiler, used by both TensorFlow and JAX. In JAX it is the only compiler, whereas in TensorFlow it is optional (but mandatory on TPU!). The easiest way to enable it when training a Keras model is to pass the argument `jit_compile=True` to `model.compile()`. If you don‚Äôt get any errors and performance is good, that‚Äôs a great sign that you‚Äôre ready to move to TPU!\n\nDebugging on TPU is generally a bit harder than on CPU/GPU, so we recommend getting your code running on CPU/GPU with XLA first before trying it on TPU. 
You don‚Äôt have to train for long, of course - just for a few steps to make sure that your model and data pipeline are working like you expect them to.\n\n<Tip>\n\nXLA compiled code is usually faster - so even if you‚Äôre not planning to run on TPU, adding `jit_compile=True` can improve your performance. Be sure to note the caveats below about XLA compatibility, though!\n\n</Tip>\n\n<Tip warning={true}>\n\n**Tip born of painful experience:** Although using `jit_compile=True` is a good way to get a speed boost and test if your CPU/GPU code is XLA-compatible, it can actually cause a lot of problems if you leave it in when actually training on TPU. XLA compilation will happen implicitly on TPU, so remember to remove that line before actually running your code on a TPU!\n\n</Tip>\n\n### How do I make my model XLA compatible?\n\nIn many cases, your code is probably XLA-compatible already! However, there are a few things that work in normal TensorFlow that don‚Äôt work in XLA. We‚Äôve distilled them into three core rules below:\n\n<Tip>\n\n**ü§óSpecific HuggingFace Tipü§ó:** We‚Äôve put a lot of effort into rewriting our TensorFlow models and loss functions to be XLA-compatible. Our models and loss functions generally obey rule #1 and #2 by default, so you can skip over them if you‚Äôre using `transformers` models. Don‚Äôt forget about these rules when writing your own models and loss functions, though!\n\n</Tip>\n\n#### XLA Rule #1: Your code cannot have ‚Äúdata-dependent conditionals‚Äù\n\nWhat that means is that any `if` statement cannot depend on values inside a `tf.Tensor`. For example, this code block cannot be compiled with XLA!\n\n```python\nif tf.reduce_sum(tensor) > 10:\n tensor = tensor / 2.0\n```\n\nThis might seem very restrictive at first, but most neural net code doesn‚Äôt need to do this. 
You can often get around this restriction by using `tf.cond` (see the documentation [here](https://www.tensorflow.org/api_docs/python/tf/cond)) or by removing the conditional and finding a clever math trick with indicator variables instead, like so:\n\n```python\nsum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32)\ntensor = tensor / (1.0 + sum_over_10)\n```\n\nThis code has exactly the same effect as the code above, but by avoiding a conditional, we ensure it will compile with XLA without problems!\n\n#### XLA Rule #2: Your code cannot have ‚Äúdata-dependent shapes‚Äù\n\nWhat this means is that the shape of all of the `tf.Tensor` objects in your code cannot depend on their values. For example, the function `tf.unique` cannot be compiled with XLA, because it returns a `tensor` containing one instance of each unique value in the input. The shape of this output will obviously be different depending on how repetitive the input `Tensor` was, and so XLA refuses to handle it!\n\nIn general, most neural network code obeys rule #2 by default. However, there are a few common cases where it becomes a problem. One very common one is when you use **label masking**, setting your labels to a negative value to indicate that those positions should be ignored when computing the loss. If you look at NumPy or PyTorch loss functions that support label masking, you will often see code like this that uses [boolean indexing](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing):\n\n```python\nlabel_mask = labels >= 0\nmasked_outputs = outputs[label_mask]\nmasked_labels = labels[label_mask]\nloss = compute_loss(masked_outputs, masked_labels)\nmean_loss = torch.mean(loss)\n```\n\nThis code is totally fine in NumPy or PyTorch, but it breaks in XLA! Why? 
Because the shape of `masked_outputs` and `masked_labels` depends on how many positions are masked - that makes it a **data-dependent shape.** However, just like for rule #1, we can often rewrite this code to yield exactly the same output without any data-dependent shapes.\n\n```python\nlabel_mask = tf.cast(labels >= 0, tf.float32)\nloss = compute_loss(outputs, labels)\nloss = loss * label_mask # Set negative label positions to 0\nmean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask)\n```\n\nHere, we avoid data-dependent shapes by computing the loss for every position, but zeroing out the masked positions in both the numerator and denominator when we calculate the mean, which yields exactly the same result as the first block while maintaining XLA compatibility. Note that we use the same trick as in rule #1 - converting a `tf.bool` to `tf.float32` and using it as an indicator variable. This is a really useful trick, so remember it if you need to convert your own code to XLA!\n\n#### XLA Rule #3: XLA will need to recompile your model for every different input shape it sees\n\nThis is the big one. What this means is that if your input shapes are very variable, XLA will have to recompile your model over and over, which will create huge performance problems. This commonly arises in NLP models, where input texts have variable lengths after tokenization. In other modalities, static shapes are more common and this rule is much less of a problem.\n\nHow can you get around rule #3? The key is **padding** - if you pad all your inputs to the same length, and then use an `attention_mask`, you can get the same results as you‚Äôd get from variable shapes, but without any XLA issues. However, excessive padding can cause severe slowdown too - if you pad all your samples to the maximum length in the whole dataset, you might end up with batches consisting endless padding tokens, which will waste a lot of compute and memory!\n\nThere isn‚Äôt a perfect solution to this problem. 
However, you can try some tricks. One very useful trick is to **pad batches of samples up to a multiple of a number like 32 or 64 tokens.** This often only increases the number of tokens by a small amount, but it hugely reduces the number of unique input shapes, because every input shape now has to be a multiple of 32 or 64. Fewer unique input shapes means fewer XLA compilations!\n\n<Tip>\n\n**ü§óSpecific HuggingFace Tipü§ó:** Our tokenizers and data collators have methods that can help you here. You can use `padding=""max_length""` or `padding=""longest""` when calling tokenizers to get them to output padded data. Our tokenizers and data collators also have a `pad_to_multiple_of` argument that you can use to reduce the number of unique input shapes you see!\n\n</Tip>\n\n### How do I actually train my model on TPU?\n\nOnce your training is XLA-compatible and (if you‚Äôre using TPU Node / Colab) your dataset has been prepared appropriately, running on TPU is surprisingly easy! All you really need to change in your code is to add a few lines to initialize your TPU, and to ensure that your model and dataset are created inside a `TPUStrategy` scope. 
Take a look at [our TPU example notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) to see this in action!\n\n### Summary\n\nThere was a lot in here, so let‚Äôs summarize with a quick checklist you can follow when you want to get your model ready for TPU training:\n\n- Make sure your code follows the three rules of XLA\n- Compile your model with `jit_compile=True` on CPU/GPU and confirm that you can train it with XLA\n- Either load your dataset into memory or use a TPU-compatible dataset loading approach (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))\n- Migrate your code either to Colab (with accelerator set to ‚ÄúTPU‚Äù) or a TPU VM on Google Cloud\n- Add TPU initializer code (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))\n- Create your `TPUStrategy` and make sure dataset loading and model creation are inside the `strategy.scope()` (see [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))\n- Don‚Äôt forget to take `jit_compile=True` out again when you move to TPU!\n- üôèüôèüôèü•∫ü•∫ü•∫\n- Call model.fit()\n- You did it!",huggingface/transformers/blob/main/docs/source/en/perf_train_tpu_tf.md
4,"Gradio Demo: blocks_random_slider\n\n\n```\n!pip install -q gradio \n```\n\n\n```\n\nimport gradio as gr\n\n\ndef func(slider_1, slider_2):\n return slider_1 * 5 + slider_2\n\n\nwith gr.Blocks() as demo:\n slider = gr.Slider(minimum=-10.2, maximum=15, label=""Random Slider (Static)"", randomize=True)\n slider_1 = gr.Slider(minimum=100, maximum=200, label=""Random Slider (Input 1)"", randomize=True)\n slider_2 = gr.Slider(minimum=10, maximum=23.2, label=""Random Slider (Input 2)"", randomize=True)\n slider_3 = gr.Slider(value=3, label=""Non random slider"")\n btn = gr.Button(""Run"")\n btn.click(func, inputs=[slider_1, slider_2], outputs=gr.Number())\n\nif __name__ == ""__main__"":\n demo.launch()\n\n```\n",gradio-app/gradio/blob/main/demo/blocks_random_slider/run.ipynb


# **1. Build a synthetic dataset for evaluation**
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base, and ask an LLM to generate questions based on these documents.

Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them filters for one specific flaw.

### **1.1. Prepare source documents**

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document as LangchainDocument

langchain_docs = [
    LangchainDocument(
        page_content=doc["text"],
        metadata={"source": doc["source"]}
    )
    for doc in tqdm(ds)
]


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])

print(len(docs_processed))

  0%|          | 0/2647 [00:00<?, ?it/s]

28308


In [10]:
# Check (ds --> langchain_docs --> docs_processed)
import random
i = random.randint(1,1000)
print("\n===== DS =====")
print(ds[i])

print("\n===== LANGCHAIN DOCUMENT =====")
print(langchain_docs[i])

print("\n===== PROCESSED DOCUMENT (CHUNK) =====")
print(docs_processed[i])



===== DS =====
{'text': ' Archived Changes\n\n### Nov 22, 2021\n* A number of updated weights anew new model defs\n  * `eca_halonext26ts` - 79.5 @ 256\n  * `resnet50_gn` (new) - 80.1 @ 224, 81.3 @ 288\n  * `resnet50` - 80.7 @ 224, 80.9 @ 288 (trained at 176, not replacing current a1 weights as default since these don\'t scale as well to higher res, [weights](https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-rsb-weights/resnet50_a1h2_176-001a1197.pth))\n  * `resnext50_32x4d` - 81.1 @ 224, 82.0 @ 288\n  * `sebotnet33ts_256` (new) - 81.2 @ 224\n  * `lamhalobotnet50ts_256` - 81.5 @ 256\n  * `halonet50ts` - 81.7 @ 256\n  * `halo2botnet50ts_256` - 82.0 @ 256\n  * `resnet101` - 82.0 @ 224, 82.8 @ 288\n  * `resnetv2_101` (new) - 82.1 @ 224, 83.0 @ 288\n  * `resnet152` - 82.8 @ 224, 83.5 @ 288\n  * `regnetz_d8` (new) - 83.5 @ 256, 84.0 @ 320\n  * `regnetz_e8` (new) - 84.5 @ 256, 85.0 @ 320\n* `vit_base_patch8_224` (85.8 top-1) & `in21k` variant weights added thanks [Mart

### **1.2. Set up agents for question generation**

[Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) is a good choice for QA couple generation because it has excellent performance on leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard). The cell below keeps it as a commented-out option and runs the smaller `meta-llama/Llama-3.2-3B-Instruct` instead.

In [11]:
from huggingface_hub import InferenceClient


# repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# repo_id = "HuggingFaceH4/zephyr-7b-beta"
repo_id = "meta-llama/Llama-3.2-3B-Instruct"

llm_client = InferenceClient(model=repo_id, timeout=120)


def call_llm(inference_client: InferenceClient, prompt: str):

    messages1 = [{"role": "user", "content": prompt}]

    response = inference_client.chat_completion(
        messages=messages1,
        max_tokens=200,
        temperature=0.1
    )
    return response.choices[0].message.content

# Run a quick sanity check
try:
    result = call_llm(llm_client, "How many provinces and cities does Vietnam have?")
    print(result)
except Exception as e:
    print(f"Error: {e}")

Vietnam currently has 63 provinces and cities.


In [12]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""

In [13]:

import random

idx = random.randint(0, len(docs_processed)-1)
sample_context = docs_processed[idx].page_content


print(f"--- ORIGINAL PASSAGE ---\n{sample_context}\n")
result = call_llm(llm_client, QA_generation_prompt.format(context=sample_context))
print(f"--- AI-GENERATED QUESTION ---\n{result}")

--- ORIGINAL PASSAGE ---
Here's an example of how to use these attributes to create a Gradio app that does not lazy load and has an initial height of 0px.

```html
<gradio-app
	space="gradio/Echocardiogram-Segmentation"
	eager="true"
	initial_height="0px"
></gradio-app>
```

Here's another example of how to use the `render` event. An event listener is used to capture the `render` event and will call the `handleLoadComplete()` function once rendering is complete. 

```html
<script>
	function handleLoadComplete() {
		console.log("Embedded space has finished rendering");
	}

	const gradioApp = document.querySelector("gradio-app");
	gradioApp.addEventListener("render", handleLoadComplete);
</script>
```

--- AI-GENERATED QUESTION ---
Factoid question: What is the default value of the `eager` attribute in the Gradio app tag?
Answer: true


Now let's generate our QA couples.
For this example, we generate only 5 QA couples and will load the rest from the Hub.

But for your specific knowledge base, given that you want at least ~100 test samples, and accounting for the fact that our critique agents will later filter out around half of them, you should generate many more: over 200 samples.

In [26]:
import random
import re

N_GENERATIONS = 5  # We intentionally generate only 5 QA couples here for cost and time considerations

print(f"Generating {N_GENERATIONS} QA couples...")

outputs = []
for idx in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    raw_output = call_llm(
        llm_client, QA_generation_prompt.format(context=idx.page_content)
    )

    try:

        question_match = re.search(r"(?:Question|Factoid question):\s*(.+?)\n", raw_output, re.IGNORECASE)
        answer_match = re.search(r"Answer:\s*(.+)", raw_output, re.IGNORECASE)

        if question_match and answer_match:
            question = question_match.group(1).split("Answer:")[0].strip()
            answer = answer_match.group(1).strip()

            if 10 < len(answer) < 500:
                outputs.append({
                    "context": idx.page_content,
                    "question": question,
                    "answer": answer,
                    "source_doc": idx.metadata["source"],
                })
    except Exception as e:
        continue

Generating 5 QA couples...


  0%|          | 0/5 [00:00<?, ?it/s]
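
To make the parsing step in the loop above easier to follow, here is a minimal, self-contained sketch that applies the same two regular expressions to a hand-written model output. The helper name `parse_qa_output` and the sample string are assumptions made for illustration; they are not part of the notebook.

```python
import re
from typing import Optional, Tuple


def parse_qa_output(raw_output: str) -> Optional[Tuple[str, str]]:
    """Extract (question, answer) from a generation, or None if the format is off."""
    question_match = re.search(
        r"(?:Question|Factoid question):\s*(.+?)\n", raw_output, re.IGNORECASE
    )
    answer_match = re.search(r"Answer:\s*(.+)", raw_output, re.IGNORECASE)
    if not (question_match and answer_match):
        return None
    # Guard against the answer leaking onto the question line
    question = question_match.group(1).split("Answer:")[0].strip()
    answer = answer_match.group(1).strip()
    return question, answer


# Hand-written sample (an assumption for illustration, not a real generation)
sample_output = (
    "Factoid question: Which method saves a compiled Neuron model?\n"
    "Answer: The `save_pretrained` method."
)
parsed = parse_qa_output(sample_output)
print(parsed)
```

Returning `None` for malformed outputs, rather than raising, mirrors the loop above, which silently skips generations that do not follow the requested format.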

In [27]:
# Check
print(outputs)
print(f"Number of outputs kept after filtering: {len(outputs)}")

[{'context': '*Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images', 'question': 'What are the three major issues tackled in the training and application of large vision models?', 'answer': 'Tra

In [28]:

display(pd.DataFrame(outputs).head(5))

Unnamed: 0,context,question,answer,source_doc
0,"*Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images",What are the three major issues tackled in the training and application of large vision models?,"Training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data.",huggingface/transformers/blob/main/docs/source/en/model_doc/swinv2.md
1,"```bash\ngit lfs install\nmkdir data\ngit -C ""./data"" clone https://huggingface.co/datasets/codeparrot/codeparrot-clean-train\ngit -C ""./data"" clone https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid\n```\n\nAnd then pass the paths to the datasets when we run the training script:\n\n```bash\naccelerate launch scripts/codeparrot_training.py \\n--model_ckpt codeparrot/codeparrot-small \\n--dataset_name_train ./data/codeparrot-clean-train \\n--dataset_name_valid ./data/codeparrot-clean-valid \\n--train_batch_size 12 \\n--valid_batch_size 12 \\n--learning_rate 5e-4 \\n--num_warmup_steps 2000 \\n--gradient_accumulation 1 \\n--gradient_checkpointing False \\n--max_train_steps 150000 \\n--save_checkpoint_steps 15000\n```",What is the name of the pre-trained model checkpoint used in the training script?,codeparrot/codeparrot-small,huggingface/transformers/blob/main/examples/research_projects/codeparrot/README.md
2,"Fortunately, you will need to do this only once because you can save your model and reload it later.\n\n```\n>>> model.save_pretrained(""a_local_path_for_compiled_neuron_model"")\n```\n\nEven better, you can push it to the [Hugging Face hub](https://huggingface.co/models).\n\n```\n>>> model.push_to_hub(\n ""a_local_path_for_compiled_neuron_model"",\n repository_id=""aws-neuron/Llama-2-7b-hf-neuron-latency"")\n```\n\n## Generate Text using Llama 2 on AWS Inferentia2\n\nOnce your model has been exported, you can generate text using the transformers library, as it has been described in [detail in this previous post](https://huggingface.co/blog/how-to-generate).\n\n```\n>>> from optimum.neuron import NeuronModelForCausalLM\n>>> from transformers import AutoTokenizer\n\n>>> model = NeuronModelForCausalLM.from_pretrained('aws-neuron/Llama-2-7b-hf-neuron-latency')\n>>> tokenizer = AutoTokenizer.from_pretrained(""aws-neuron/Llama-2-7b-hf-neuron-latency"")",How do you save a compiled Neuron model?,"You can save a compiled Neuron model using the `save_pretrained` method and specify a local path, or push it to the Hugging Face hub.",huggingface/blog/blob/main/inferentia-llama2.md
3,"Every time the pipeline is run, [`torch.randn`](https://pytorch.org/docs/stable/generated/torch.randn.html) uses a different random seed to create Gaussian noise which is denoised stepwise. This leads to a different result each time it is run, which is great for diffusion pipelines since it generates a different random image each time.\n\nBut if you need to reliably generate the same image, that'll depend on whether you're running the pipeline on a CPU or GPU.\n\n### CPU\n\nTo generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed:\n\n```python\nimport torch\nfrom diffusers import DDIMPipeline\nimport numpy as np\n\nmodel_id = ""google/ddpm-cifar10-32""\n\n# load model and scheduler\nddim = DDIMPipeline.from_pretrained(model_id, use_safetensors=True)\n\n# create a generator for reproducibility\ngenerator = torch.Generator(device=""cpu"").manual_seed(0)",How can you generate reproducible results with a CPU?,You can generate reproducible results with a CPU by using a PyTorch Generator and setting a seed.,huggingface/diffusers/blob/main/docs/source/en/using-diffusers/reproducibility.md


### **1.3. Set up critique agents**

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):
- **Groundedness:** can the question be answered from the given context?
- **Relevance:** is the question relevant to users? For instance, `"What is the date when transformers 4.29.1 was released?"` is not relevant for ML practitioners.

One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but undecipherable by itself, like `"What is the name of the function used in this guide?"`.
We also build a critique agent for this criterion:
- **Stand-alone**: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be `What is the function used in this article?` for a question generated from a specific blog article.

We systematically score questions with all these agents, and whenever the score from any one of the agents is too low, we eliminate the question from our eval dataset.

💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This will help us verify scores, but most importantly, asking a model to output its rationale first gives it more tokens to think and elaborate an answer before summarizing it into a single score token.___

We now build and run these critique agents.

In [17]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """

question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """

In [19]:

import re

def parse_llm_judge_output(text):
    score_match = re.search(r"Total rating:\s*([1-5])", text, re.IGNORECASE)
    eval_match = re.search(
        r"Evaluation:\s*(.*?)(?:\n\s*Total rating:|$)",
        text,
        re.IGNORECASE | re.DOTALL
    )

    if not score_match or not eval_match:
        raise ValueError("Cannot parse judge output")

    score = int(score_match.group(1))
    evaluation = eval_match.group(1).strip()

    return score, evaluation
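

As a quick check of the rationale-then-score format, the sketch below restates the parser from the cell above so that it runs on its own, then applies it to two hand-written judge outputs. Both sample strings are assumptions made for illustration, not real model outputs.

```python
import re


def parse_llm_judge_output(text):
    # Same logic as the notebook cell: a 1-5 score plus the free-text rationale
    score_match = re.search(r"Total rating:\s*([1-5])", text, re.IGNORECASE)
    eval_match = re.search(
        r"Evaluation:\s*(.*?)(?:\n\s*Total rating:|$)",
        text,
        re.IGNORECASE | re.DOTALL,
    )
    if not score_match or not eval_match:
        raise ValueError("Cannot parse judge output")
    return int(score_match.group(1)), eval_match.group(1).strip()


# Well-formed sample: rationale first, then the score
good_output = (
    "Evaluation: The context states the checkpoint name explicitly.\n"
    "Total rating: 4"
)
score, rationale = parse_llm_judge_output(good_output)
print(score, "|", rationale)

# Malformed sample: the judge forgot the 'Total rating:' line
try:
    parse_llm_judge_output("Evaluation: Looks fine to me.")
    parse_failed = False
except ValueError:
    parse_failed = True
print("Malformed output rejected:", parse_failed)
```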


In [29]:
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(
                context=output["context"], question=output["question"]
            ),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
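    try:
        for criterion, evaluation in evaluations.items():
            score, eval_text = parse_llm_judge_output(evaluation)
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval_text,
                }
            )
    except Exception:
        # If a judge output cannot be parsed, the remaining scores stay unset
        # for this QA couple, so it will not pass the score filter later on.
        continue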


Generating critique for each QA couple...


  0%|          | 0/4 [00:00<?, ?it/s]

Now let us filter out bad questions based on our critique agent scores:

In [30]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

generated_questions = pd.DataFrame.from_dict(outputs)

print("Evaluation dataset before filtering:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)
generated_questions = generated_questions.loc[
    (generated_questions["groundedness_score"] >= 4)
    & (generated_questions["relevance_score"] >= 4)
    & (generated_questions["standalone_score"] >= 4)
]
print("============================================")
print("Final evaluation dataset:")
display(
    generated_questions[
        [
            "question",
            "answer",
            "groundedness_score",
            "relevance_score",
            "standalone_score",
        ]
    ]
)

eval_dataset = datasets.Dataset.from_pandas(
    generated_questions, split="train", preserve_index=False
)

Evaluation dataset before filtering:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What are the three major issues tackled in the training and application of large vision models? | Training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. | 5 | 4 | 5 |
| 1 | What is the name of the pre-trained model checkpoint used in the training script? | codeparrot/codeparrot-small | 4 | 4 | 4 |
| 2 | How do you save a compiled Neuron model? | You can save a compiled Neuron model using the `save_pretrained` method and specify a local path, or push it to the Hugging Face hub. | 5 | 4 | 1 |
| 3 | How can you generate reproducible results with a CPU? | You can generate reproducible results with a CPU by using a PyTorch Generator and setting a seed. | 4 | 4 | 5 |


Final evaluation dataset:


| | question | answer | groundedness_score | relevance_score | standalone_score |
|---|---|---|---|---|---|
| 0 | What are the three major issues tackled in the training and application of large vision models? | Training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. | 5 | 4 | 5 |
| 1 | What is the name of the pre-trained model checkpoint used in the training script? | codeparrot/codeparrot-small | 4 | 4 | 4 |
| 3 | How can you generate reproducible results with a CPU? | You can generate reproducible results with a CPU by using a PyTorch Generator and setting a seed. | 4 | 4 | 5 |
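
The filtering rule can also be written without pandas. The sketch below is a plain-Python restatement, using the scores shown in the "before filtering" output above. The helper name `passes_all_critiques` and the `id` field are assumptions for illustration. Treating a missing score as a failure is also a choice made in this sketch.

```python
from typing import Dict, List

SCORE_COLUMNS = ["groundedness_score", "relevance_score", "standalone_score"]
MIN_SCORE = 4  # same threshold as in the filtering cell


def passes_all_critiques(row: Dict, min_score: int = MIN_SCORE) -> bool:
    # A question is kept only if every critique agent gave it at least `min_score`;
    # a missing score (for example an unparsable judge output) counts as a failure.
    return all(
        row.get(column) is not None and row[column] >= min_score
        for column in SCORE_COLUMNS
    )


# Scores copied from the output above; `id` is the row index
rows: List[Dict] = [
    {"id": 0, "groundedness_score": 5, "relevance_score": 4, "standalone_score": 5},
    {"id": 1, "groundedness_score": 4, "relevance_score": 4, "standalone_score": 4},
    {"id": 2, "groundedness_score": 5, "relevance_score": 4, "standalone_score": 1},
    {"id": 3, "groundedness_score": 4, "relevance_score": 4, "standalone_score": 5},
]

kept_ids = [row["id"] for row in rows if passes_all_critiques(row)]
print(kept_ids)  # [0, 1, 3]
```

Row 2 is dropped because its stand-alone score is 1, even though its other two scores pass.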


Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.

We have generated only a few QA couples here to reduce time and cost. But let's kickstart the next part by loading a pre-generated dataset:


In [31]:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

README.md:   0%|          | 0.00/893 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/289k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/65 [00:00<?, ? examples/s]

# **2. Build our RAG System**

### **2.1. Preprocessing documents to build our vector database**

- In this part, __we split the documents from our knowledge base into smaller chunks__: these will be the snippets that are picked by the Retriever, to then be ingested by the Reader LLM as supporting elements for its answer.
- The goal is to build semantically relevant snippets: large enough to support an answer, yet small enough to avoid diluting individual ideas.

Many options exist for text splitting:
- split every `n` words / characters, at the risk of cutting paragraphs or even sentences in half
- split after `n` words / characters, but only on sentence boundaries
- **recursive split** tries to preserve even more of the document structure by processing it in a tree-like way: it splits first on the largest units (chapters), then recursively on smaller units (paragraphs, sentences).
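
To illustrate the recursive idea from the last option, here is a minimal pure-Python sketch. It is a simplified stand-in, not LangChain's implementation: the function name `recursive_split` is hypothetical, lengths are counted in characters, and, unlike `RecursiveCharacterTextSplitter`, it neither merges small neighbouring pieces back together nor adds overlap.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split `text` into pieces of at most `chunk_size` characters,
    trying the coarsest separator first and recursing with finer ones."""
    if len(text) <= chunk_size:
        # Small enough already: keep it (unless it is only whitespace)
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every `chunk_size` characters
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer_separators = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(separator):
        chunks.extend(recursive_split(piece, chunk_size, finer_separators))
    return chunks


text = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
chunks = recursive_split(text, chunk_size=30, separators=["\n\n", ". "])
for chunk in chunks:
    print(repr(chunk))
```

Here the first paragraph is too long for one chunk, so it is split again at the sentence boundary, while the second paragraph fits and is kept whole.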

To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.

[This space](https://huggingface.co/spaces/m-ric/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.

> In the following, we use Langchain's `RecursiveCharacterTextSplitter`.

💡 _To measure chunk length in our Text Splitter, our length function will not be the count of characters, but the count of tokens in the tokenized text: since the downstream embedder processes tokens, measuring length in tokens is more relevant and empirically performs better._

In [32]:
# from langchain.docstore.document import Document as LangchainDocument
from langchain_core.documents import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = []

for doc in ds:
    lang_doc = LangchainDocument(
        page_content=doc["text"],
        metadata={"source": doc["source"]}
    )
    RAW_KNOWLEDGE_BASE.append(lang_doc)


In [33]:
#Check
print(len(RAW_KNOWLEDGE_BASE))
print(type(RAW_KNOWLEDGE_BASE[0]))
print(RAW_KNOWLEDGE_BASE[0].metadata)


2647
<class 'langchain_core.documents.base.Document'>
{'source': 'huggingface/hf-endpoints-documentation/blob/main/docs/source/guides/create_endpoint.mdx'}


In [34]:
from typing import List

from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of at most `chunk_size` tokens and return a list of documents.
    """

    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique

### **2.2. Retriever - embeddings 🗂️**
The __retriever acts like an internal search engine__: given the user query, it returns the most relevant documents from your knowledge base.

> For the knowledge base, we use Langchain vector databases since __they offer a convenient [FAISS](https://github.com/facebookresearch/faiss) index and allow us to keep document metadata throughout the processing__.
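
To make the "internal search engine" idea concrete, here is a minimal pure-Python sketch of cosine-similarity retrieval. The 2-D vectors are toy values chosen for illustration; in the notebook, real embeddings come from the embedding model, and FAISS performs the search. The helper names `normalize` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length, so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(
    query: Sequence[float], doc_vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return the (index, cosine similarity) of the k documents closest to the query."""
    unit_query = normalize(query)
    scores = []
    for index, doc_vector in enumerate(doc_vectors):
        unit_doc = normalize(doc_vector)
        scores.append((index, sum(q * d for q, d in zip(unit_query, unit_doc))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (assumed values, for illustration only)
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 0.1]

results = top_k(query_vector, doc_vectors, k=2)
print([index for index, _ in results])  # [0, 2]
```

Normalizing the vectors first is what lets a plain dot product act as cosine similarity, which is why the embedding model below is configured with `normalize_embeddings=True` alongside `DistanceStrategy.COSINE`.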

🛠️ __Options included:__

- Tune the chunking method:
    - Size of the chunks
    - Method: split on different separators, use [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...
- Change the embedding model

In [56]:
from typing import List, Optional

from langchain_community.vectorstores import FAISS  # FAISS-backed vector store
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os

def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # load embedding_model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={
            "normalize_embeddings": True
        },
    )


    index_name = (
        f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    )
    index_folder_path = f"./data/indexes/{index_name}/"
    # Check if embeddings already exist on disk
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
            allow_dangerous_deserialization=True,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )

        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index

### 2.3. Reader - LLM 💬

In this part, the __LLM Reader reads the retrieved documents to formulate its answer.__

🛠️ Here we tried the following options to improve results:
- Switch reranking on/off
- Change the reader model

In [37]:
RAG_PROMPT_TEMPLATE = """[INST]
You are a helpful assistant.
Use ONLY the information in the context to answer the question.
If the answer cannot be deduced from the context, say "I don't know".

Context:
{context}

Question:
{question}
[/INST]
"""


In [58]:

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from google.colab import userdata
HF_API_TOKEN  = userdata.get('HF_KEY')


# Reader model repository, defined once and reused for the endpoint and the settings name
repo_id = "HuggingFaceH4/zephyr-7b-beta"
READER_MODEL_NAME = repo_id.split("/")[-1]

# Base LLM (text-generation)
base_llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    task="text-generation",
    huggingfacehub_api_token=HF_API_TOKEN,
    max_new_tokens=512,
    temperature=0.1,
    repetition_penalty=1.03,
)

# Wrap the endpoint as a chat model
READER_LLM = ChatHuggingFace(llm=base_llm)

In [43]:
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Using the information contained in the context, "
     "give a comprehensive answer to the question. "
     "Respond only to the question asked. "
     "If the answer cannot be deduced from the context, say you don't know."),
    ("human",
     "Context:\n{context}\n\nQuestion: {question}")
])


In [44]:
def answer_with_rag(
    question: str,
    llm,
    knowledge_index,
    reranker=None,
    num_retrieved_docs=30,
    num_docs_final=7,
):
    # Retrieve
    docs = knowledge_index.similarity_search(question, k=num_retrieved_docs)
    docs = [d.page_content for d in docs]

    # Rerank (optional)
    if reranker:
        docs = reranker.rerank(question, docs, k=num_docs_final)
        docs = [d["content"] for d in docs]

    docs = docs[:num_docs_final]

    context = "\n\n".join(
        [f"Document {i}:\n{doc}" for i, doc in enumerate(docs)]
    )

    # Build the chat prompt: a [SystemMessage, HumanMessage] pair
    messages = rag_prompt.format_messages(
        question=question,
        context=context
    )

    # Invoke chat model
    response = llm.invoke(messages)

    return response.content, docs
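
The two-stage flow in `answer_with_rag` (retrieve a broad shortlist, then optionally rerank it and keep the best few) can be sketched without any model. The scorers `word_overlap` and `overlap_ratio` below are hypothetical stand-ins for the embedding search and the ColBERT reranker, chosen only so the example runs on its own.

```python
from typing import Callable, List


def retrieve_then_rerank(
    question: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    num_retrieved: int = 3,
    num_final: int = 2,
) -> List[str]:
    """Stage 1: shortlist with a cheap scorer. Stage 2: reorder the shortlist with a costlier one."""
    shortlist = sorted(corpus, key=lambda d: cheap_score(question, d), reverse=True)[:num_retrieved]
    reranked = sorted(shortlist, key=lambda d: precise_score(question, d), reverse=True)
    return reranked[:num_final]


def word_overlap(question: str, doc: str) -> float:
    """Hypothetical cheap scorer: number of shared lowercase words."""
    return float(len(set(question.lower().split()) & set(doc.lower().split())))


def overlap_ratio(question: str, doc: str) -> float:
    """Hypothetical 'precise' scorer: shared words divided by document length."""
    words = doc.lower().split()
    return word_overlap(question, doc) / len(words) if words else 0.0


corpus = [
    "the tokenizer splits text into tokens",
    "a tokenizer is a tool and the pipeline uses the tokenizer for many text tasks",
    "gradient descent updates model weights",
    "the pipeline API wraps a model",
]
top_docs = retrieve_then_rerank("what does the tokenizer do", corpus, word_overlap, overlap_ratio)
print(top_docs)
```

In this toy run the long second document survives the first stage but is pushed out of the final selection by the second scorer, which is the kind of reordering a reranker is meant to provide.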


# **3. Benchmarking the RAG system**

The RAG system and the evaluation datasets are now ready. The last step is to judge the RAG system's output on this evaluation dataset.

To this end, __we set up a judge agent__. ⚖️🤖

Out of [the different RAG evaluation metrics](https://docs.ragas.io/en/latest/concepts/metrics/index.html), we choose to focus only on Answer Correctness since it is the best end-to-end metric of our system's performance.

> We use a GPT-4-class model as a judge (GPT-4.1 in the code of this section) for its empirically good performance, but you could try other models such as [kaist-ai/prometheus-13b-v1.0](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) or [BAAI/JudgeLM-33B-v1.0](https://huggingface.co/BAAI/JudgeLM-33B-v1.0).

💡 _In the evaluation prompt, we give a detailed description of each score on the 1-5 scale, as is done in [Prometheus's prompt template](https://huggingface.co/kaist-ai/prometheus-13b-v1.0): this helps the model ground its metric precisely. If you instead give the judge LLM a vague scale to work with, its outputs will not be consistent enough across examples._

💡 _Again, prompting the LLM to output its rationale before giving its final score gives it more tokens to formalize and elaborate a judgment._

In [45]:
print(len(eval_dataset))

65


In [49]:
from langchain_core.vectorstores import VectorStore
from ragatouille import RAGPretrainedModel
from langchain_core.language_models import BaseChatModel

def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):  # no usable checkpoint yet
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(
            question, llm, knowledge_index, reranker=reranker
        )
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
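
The resume-from-checkpoint behavior of `run_rag_tests` can be checked in isolation. The sketch below is a simplified, hypothetical variant (`run_with_checkpoint` and `fake_answer` are names invented for this illustration): it replaces the RAG call with a stub and tracks answered questions in a set, which avoids rebuilding a list on every iteration.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def run_with_checkpoint(
    questions: List[str],
    answer_fn: Callable[[str], str],
    output_file: str,
) -> List[Dict[str, str]]:
    """Answer each question once, persisting results after every new answer."""
    try:
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except FileNotFoundError:
        outputs = []

    done = {o["question"] for o in outputs}  # questions already answered in a previous run
    for question in questions:
        if question in done:
            continue
        outputs.append({"question": question, "generated_answer": answer_fn(question)})
        done.add(question)
        with open(output_file, "w") as f:
            json.dump(outputs, f)
    return outputs


calls = []


def fake_answer(question: str) -> str:
    """Stand-in for the RAG call; records which questions were actually processed."""
    calls.append(question)
    return question.upper()


path = os.path.join(tempfile.mkdtemp(), "answers.json")
run_with_checkpoint(["a", "b"], fake_answer, path)
final_outputs = run_with_checkpoint(["a", "b", "c"], fake_answer, path)
print(calls)  # each question is processed only once across both runs
```

Writing the file after every answer costs some I/O, but it means an interrupted run loses at most one generation.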


In [50]:
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

###Feedback:"""

from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)

In [51]:
from langchain_openai import ChatOpenAI
from google.colab import userdata
OPENAI_API_KEY =  userdata.get('GPT_KEY')

eval_chat_model = ChatOpenAI(model="gpt-4.1", temperature=0, openai_api_key=OPENAI_API_KEY, base_url="https://llm.ptnglobalcorp.com")
evaluator_name = "GPT4.1"


def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        with open(answer_path, "r") as f:
            answers = json.load(f)
    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [
            item.strip() for item in eval_result.content.split("[RESULT]")
        ]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
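
Splitting on `[RESULT]` as done in `evaluate_answers` assumes the judge follows the requested format exactly. As a hedged alternative, the sketch below shows one way to parse the reply more defensively; `parse_judge_output` is a hypothetical helper, not part of the notebook or of LangChain, and returning `None` for malformed replies is a design assumption.

```python
import re
from typing import Optional, Tuple


def parse_judge_output(text: str) -> Tuple[str, Optional[int]]:
    """Split a judge reply into (feedback, score).

    The score is None when the [RESULT] marker is missing or is not
    followed by a standalone integer between 1 and 5.
    """
    feedback, marker, tail = text.rpartition("[RESULT]")
    if not marker:
        return text.strip(), None
    match = re.search(r"\b([1-5])\b", tail)
    score = int(match.group(1)) if match else None
    return feedback.strip(), score


well_formed = parse_judge_output("Feedback: Matches the reference. [RESULT] 4")
missing_marker = parse_judge_output("The response looks fine to me.")
out_of_range = parse_judge_output("Feedback: Great. [RESULT] 10")
print(well_formed, missing_marker, out_of_range)
```

Keeping malformed replies as `None` instead of raising lets a long evaluation run continue, and the unparsed cases can be inspected afterwards.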

🚀 Let's run the tests and evaluate answers! 👇

In [59]:
# ===== CELL 1: RUN RAG ONLY =====

os.makedirs("./output", exist_ok=True)

for chunk_size in [200]:
    for embeddings in ["thenlper/gte-small"]:
        for rerank in [True, False]:

            settings_name = (
                f"chunk:{chunk_size}_"
                f"embeddings:{embeddings.replace('/', '~')}_"
                f"rerank:{rerank}_"
                f"reader-model:{READER_MODEL_NAME}"
            )

            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running RAG for {settings_name}")

            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            reranker = (
                RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
                if rerank else None
            )

            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )


Running RAG for chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta



Running RAG for chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta



In [60]:
# ===== CELL 2: RUN EVALUATION ONLY =====

for chunk_size in [200]:
    for embeddings in ["thenlper/gte-small"]:
        for rerank in [True, False]:

            settings_name = (
                f"chunk:{chunk_size}_"
                f"embeddings:{embeddings.replace('/', '~')}_"
                f"rerank:{rerank}_"
                f"reader-model:{READER_MODEL_NAME}"
            )

            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Evaluating results for {settings_name}")

            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )


Evaluating results for chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta



Evaluating results for chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta



### Inspect results

In [61]:
import glob

outputs = []
for file in glob.glob("./output/*.json"):
    with open(file, "r") as f:  # close each results file after reading
        output = pd.DataFrame(json.load(f))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)

In [62]:
result["eval_score_GPT4.1"] = result["eval_score_GPT4.1"].apply(
    lambda x: int(x) if isinstance(x, str) else 1
)
result["eval_score_GPT4.1"] = (result["eval_score_GPT4.1"] - 1) / 4
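
The rescaling above maps the judge's 1-5 score onto the 0-1 range with `(score - 1) / 4`, so that the mean can be read as an accuracy. A small worked example, using made-up scores and a hypothetical helper `normalize_score`:

```python
def normalize_score(raw: int, low: int = 1, high: int = 5) -> float:
    """Map a score from the [low, high] scale onto [0, 1]."""
    return (raw - low) / (high - low)


raw_scores = [1, 3, 5, 4]  # illustrative judge scores, not real results
normalized = [normalize_score(s) for s in raw_scores]
mean_accuracy = sum(normalized) / len(normalized)

print(normalized)     # [0.0, 0.5, 1.0, 0.75]
print(mean_accuracy)  # 0.5625
```

By the same formula, a normalized mean of 0.75 corresponds to an average raw score of 4 out of 5.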

In [64]:
average_scores = result.groupby("settings")["eval_score_GPT4.1"].mean()
average_scores.sort_values()

| settings | eval_score_GPT4.1 |
|---|---|
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json | 0.696154 |
| ./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json | 0.711538 |


## Example results

Let us load the results that I obtained by tweaking the different options available in this notebook.
For more detail on why these options could work or not, see the notebook on [advanced_RAG](advanced_rag).

As you can see in the graph below, some tweaks bring no improvement, while others give huge performance boosts.

➡️ ___There is no single good recipe: you should try several different directions when tuning your RAG systems.___


In [65]:
import plotly.express as px

scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])


In [66]:
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_settings_accuracy.png" height="500" width="800">

As you can see, these tweaks had varying impact on performance. In particular, tuning the chunk size is both easy and very impactful.

But this holds for our setup only, and your results could be very different. Now that you have a robust evaluation pipeline, you can set out to explore other options! 🗺️