<img align="left" src="../All-sample-files/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 3: 🤗 Hugging Face with RAG and Open AI Web Search

**Description:** 

Learners will use 🤗 Hugging Face Inference Client combined with Llama Index to create a basic Retrieval Augmented Generation (RAG) system. They will also use the Open AI Responses API in order to answer prompts with web data.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics

**Knowledge Recommended:** 
* Python Intermediate

**Data Format:** None

**Libraries Used:** 
* [🤗 Transformers](https://huggingface.co/docs/transformers/index)- provides APIs and tools to easily download and train pretrained models
* [Llama_index](https://docs.llamaindex.ai/en/stable/)- helps index our documents
* [Open AI Responses API](https://platform.openai.com/docs/api-reference/responses)- allows us to combine LLM completions with web search

**Research Pipeline:** None
___

# Introduction to Retrieval Augmented Generation

Large Language Models (LLMs) are trained on an enormous variety of content, including books, Wikipedia, and social media. They are often able to answer basic questions in a wide variety of contexts. Researchers, on the other hand, tend to specialize in their research area, going deep rather than wide. Researchers also tend to be interested in the latest articles and research in their field, while a language model's “knowledge” is frozen in time once it is trained. (There are some ways to update the knowledge in a language model, but they can be impractical.) Finally, researchers are concerned with citation and reference. In brief, *LLMs often lack knowledge that is specialized, current, and citable*: the type of knowledge researchers want most.

Retrieval Augmented Generation (RAG), formalized in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” ([Lewis et al. 2020](https://arxiv.org/abs/2005.11401)), has emerged as a solution for these problems. While other methods have focused on re-training an existing model, RAG introduces a new step into the process: retrieval.

In the retrieval step, the user’s query is matched with a vector database of reference documents (called a “knowledge base”) in order to find document chunks that are likely to contain the answer. Once the document chunks have been retrieved, they can be submitted as context with the user’s query to the LLM. 

![The steps of RAG described below in visual form.](../All-sample-files/rag-process.png)

While RAG systems can be quite sophisticated, the basic steps remain the same:

1. User submits a query
2. Relevant document chunks are returned from the vector database
3. A prompt containing the chunks is submitted to the LLM with the user’s query
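
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: simple word overlap stands in for the vector database, the three-sentence `knowledge_base` is made up for the example, and the function names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical helpers rather than part of any library.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, chunks, top_k=2):
    """Rank chunks by word overlap (Jaccard score) with the query and keep the best."""
    query_words = tokenize(query)
    scored = []
    for chunk in chunks:
        chunk_words = tokenize(chunk)
        union = query_words | chunk_words
        score = len(query_words & chunk_words) / len(union) if union else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the query
    return [chunk for score, chunk in scored[:top_k] if score > 0]

def build_prompt(query, retrieved_chunks):
    """Combine the retrieved chunks and the user's query into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Made-up knowledge base; a real system would store embeddings instead
knowledge_base = [
    "Gemma 3 was released by Google in March 2025.",
    "DeepSeek V3 is a mixture-of-experts language model.",
    "Bananas are rich in potassium.",
]

user_query = "When was Gemma 3 released?"          # Step 1: user submits a query
top_chunks = retrieve(user_query, knowledge_base)  # Step 2: relevant chunks are returned
rag_prompt = build_prompt(user_query, top_chunks)  # Step 3: prompt with chunks and query
print(rag_prompt)
```

A real RAG system replaces the word-overlap score with similarity between embedding vectors, which can match text that uses different words for the same idea.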


## What about transfer learning, fine-tuning, parameter-efficient fine-tuning, etc.?

RAG can be combined with fine-tuning and other techniques to improve outputs, and an ideal solution may do exactly that. At the current time, some research suggests RAG has a more profound effect than fine-tuning: RAG may improve LLM benchmark scores by a greater degree than other techniques, but the highest scores usually come from a combination of techniques.

![Table showing RAG has a greater effect than fine-tuning](../All-sample-files/ragvsfinetune.png)

From "Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge" ([Soudani et al. 2024](https://arxiv.org/abs/2403.01432))


# Building a basic RAG system

By combining our knowledge of working with Hugging Face with a vector database, we can create a basic RAG system. First, we will need to create a knowledge base, including the following steps:

1. Curate a body of relevant documents
2. Extract the texts and chunk them
3. Embed the chunks
4. Create vector database from embeddings

## Installations

In [None]:
# Install transformers and llama-index libraries
!pip install transformers
!pip install llama-index
!pip install llama-index-embeddings-huggingface

## Import Libraries

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import pipeline
from huggingface_hub import login
from huggingface_hub import InferenceClient
import urllib.request
from pathlib import Path

## Gather documents for knowledge base

Let's create a knowledge base that relies on recent, specialized knowledge. Our LLM for this system will be Meta's [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct), released July 23, 2024. We can find the "freshness" of the model from the model card:

>Data Freshness: The pretraining data has a cutoff of December 2023.

Let's include some specialized, recent knowledge that the model could not have been trained on. Even in early 2025, Llama 3.1 405B is an enormous model that ranks among the top 30 in the world on key benchmarks. In our knowledge base, we'll include technical reports for some more recent models:

1. [Google's Gemma 3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d)- Released March 10, 2025
2. [Deepseek V3](https://github.com/deepseek-ai/DeepSeek-V3)- Released December 26, 2024
3. [Microsoft's Phi-4](https://huggingface.co/microsoft/phi-4)- Released December 12, 2024
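
As a quick sanity check, we can compare each release date in the list above against the model's training cutoff. This is a minimal sketch; it assumes that "December 2023" can be read as the last day of that month.

```python
from datetime import date

# Assumption: "December 2023" is treated as December 31, 2023
training_cutoff = date(2023, 12, 31)

# Release dates taken from the list above
release_dates = {
    "Gemma 3": date(2025, 3, 10),
    "DeepSeek V3": date(2024, 12, 26),
    "Phi-4": date(2024, 12, 12),
}

# Report how long after the cutoff each model was released
for model_name, released in release_dates.items():
    days_after = (released - training_cutoff).days
    print(f"{model_name}: released {days_after} days after the cutoff")

# Keep only the models the LLM could not have seen during pretraining
newer_than_cutoff = [
    name for name, released in release_dates.items() if released > training_cutoff
]
```

All three models appear in `newer_than_cutoff`, so questions about them are a reasonable test of whether retrieval adds knowledge the LLM lacks.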

In [None]:
# Download the documents and put them in a directory called "documents"
dir_path = Path.cwd() / "documents"
dir_path.mkdir(exist_ok=True)

files = {
    "phi-4-technical-report.pdf" : 'https://arxiv.org/pdf/2412.08905',
    "gemma-3-technical-report.pdf" : 'https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf',
    "deepseek-v3-technical-report.pdf" : 'https://arxiv.org/pdf/2412.19437'
}
    
for file_name, url in files.items():
    urllib.request.urlretrieve(url, dir_path / file_name)

## Simple Directory Reader

The simple directory reader gathers all the files in a directory and turns them into a list of document objects. It can parse many kinds of files, including PDFs, text files, and markdown files. It selects a reader suited to each file type and processes each type differently. For example, a text file is treated as a single document, whereas a markdown file is broken down by headings.
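
To make the idea of "a different rule per file type" concrete, here is a simplified stand-in written with only the standard library. It is not how `SimpleDirectoryReader()` is implemented; `load_documents` is a hypothetical helper that keeps text files whole and splits markdown files at headings.

```python
import re
import tempfile
from pathlib import Path

def load_documents(directory):
    """Return a list of text documents, using a different rule per file type."""
    documents = []
    for path in sorted(Path(directory).iterdir()):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".md":
            # Split just before every line that starts with a markdown heading
            sections = re.split(r"(?m)^(?=#{1,6} )", text)
            documents.extend(s.strip() for s in sections if s.strip())
        elif path.suffix == ".txt":
            # A plain text file stays a single document
            documents.append(text.strip())
    return documents

# Build a throwaway directory with one text file and one markdown file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("First line.\nSecond line.", encoding="utf-8")
    Path(tmp, "guide.md").write_text(
        "# Intro\nHello.\n\n## Setup\nInstall it.", encoding="utf-8"
    )
    toy_docs = load_documents(tmp)

print(len(toy_docs))
for doc in toy_docs:
    print(repr(doc))
```

Two files produce three documents here, because the markdown file has two headings while the text file is kept as one piece.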

### Using other files for the knowledge base

You don't need to use our example documents. Our code is creating the knowledge base from the documents in a directory called documents. You can create this directory and put any kind of files you would like in there for your own knowledge base. We recommend using text or markdown files for this example, but you can consult [the documentation](https://docs.llamaindex.ai/en/stable/examples/data_connectors/simple_directory_reader/) if you're curious about how `SimpleDirectoryReader()` interacts with other kinds of files.


In [None]:
# Collect documents into a list
docs = SimpleDirectoryReader("documents").load_data()

The files we downloaded are PDFs rather than text files (.txt), so we should not expect one document object per file. By default, the PDF reader used by `SimpleDirectoryReader()` creates a separate document object for each page, so the count below will be much larger than 3. Other formats behave differently: a text file becomes a single document object, while a markdown file (.md) is split into smaller documents based on its structure. We will do some basic chunking ourselves, but this kind of structure-aware chunking may give better results. How many documents do our three reports generate?

In [None]:
print(len(docs))

## Embedding Settings

We will use [LlamaIndex](https://docs.llamaindex.ai/en/stable/) to create our vector database. We will select an embedding model from Hugging Face. We are free to choose any embedding model, since the embedding model *does not* have to match our LLM. We have chosen a popular embedding model from Hugging Face, but feel free to update or change it.

In [None]:
# Log in to Hugging Face using our API token
login()

In [None]:
# Choose the Embedding Model from Hugging Face
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

We will create some additional settings:
1. Not specifying an LLM model
2. Choosing a chunk size
3. Choosing a chunk overlap

## Chunking documents
The chunk size is important for the performance of the vector database and the LLM. There are many ways to chunk, including fixed sizes, random chunk sizes, sliding windows, and context-aware chunking. The right size and method will take into consideration the documents, the LLM's context window, and other factors.
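
The effect of chunk size and chunk overlap is easiest to see on a tiny example. The sketch below is a fixed-size, sliding-window chunker that counts words; this is a simplification, since LlamaIndex measures `chunk_size` and `chunk_overlap` in tokens, and `chunk_words` is a hypothetical helper rather than a library function.

```python
def chunk_words(text, chunk_size=8, chunk_overlap=2):
    """Split text into word-based chunks where neighboring chunks share words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Twenty placeholder words: w1 w2 ... w20
sample_text = " ".join(f"w{i}" for i in range(1, 21))
sample_chunks = chunk_words(sample_text)

for chunk in sample_chunks:
    print(chunk)
```

Each chunk begins with the last two words of the previous chunk. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.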

In [None]:
# Do not set an LLM here; choose the chunk size and chunk overlap
Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

In [None]:
# Create a vector database from docs object
index = VectorStoreIndex.from_documents(docs)

## Search function
Now we can set up our retrieval system. The most significant thing we can adjust here is how many chunks to retrieve, which we set with the variable `top_k`. We can also change the similarity cutoff using `similarity_cutoff`, which controls how similar a chunk needs to be to the query in order to be included. Both of these are worth experimenting with. Keep in mind that there is a limit on the context that can be supplied to the model. More is not always better.
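
To see how these two settings interact, here is a minimal sketch using hand-made two-dimensional vectors in place of real embeddings. The `search` function and the toy vectors are illustrative assumptions; LlamaIndex performs the equivalent steps internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vector, chunk_vectors, top_k=3, similarity_cutoff=0.5):
    """Keep the top_k most similar chunks, then drop any below the cutoff."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in scored[:top_k] if score >= similarity_cutoff]

# Toy "embeddings" with only two dimensions
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [1.0, 1.0],
    "chunk_c": [0.0, 1.0],
    "chunk_d": [-1.0, 0.0],
}

results = search([1.0, 0.0], toy_vectors, top_k=3, similarity_cutoff=0.5)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Even though `top_k` is 3, only two chunks pass the cutoff. The number of retrieved chunks can therefore be smaller than `top_k`.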


In [None]:
# Documents to retrieve
top_k = 3

# Retriever configuration
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k
)

In [None]:
# Query Engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)

## Query
Here we craft our query and receive a response from the vector database.

In [None]:
# Query
query = 'How does DeepSeek-V2-Base compare to DeepSeek-V3-Base?'
response = query_engine.query(query)

In [None]:
# Print the responses
print(response)

## Create LLM prompt (without RAG context)

First, let's create a prompt to pass to the LLM. We'll automatically insert the query.

In [None]:
# Create some instructions for the model

ragless_prompt = f"""
[INST] ResearchBuddy, a virtual consultant for research tasks, communicates in clear, accessible language, helping answer technical questions on documentation.

Please respond to the following comment.
{query}

[/INST]
"""

## Add RAG context to our LLM Prompt
Now let's create a context string from our responses received above.

In [None]:
# Create a context string from response
# Loop over the nodes actually returned; the similarity cutoff may leave fewer than top_k
context = "Context:\n"
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

In [None]:
# Create a RAG prompt with the context
ragful_prompt = ragless_prompt + context

Now we have two versions of LLM prompt:

* `ragless_prompt`- Has basic instructions with our query
* `ragful_prompt`- Has basic instructions, our query, and the context from our vector database

We are ready to pass these prompts to the LLM.
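
Because the amount of context a model can accept is limited, it can help to cap how much retrieved text goes into the prompt. The sketch below is one possible approach, not the method used in this notebook: `build_context` and `build_prompts` are hypothetical helpers, and the budget is counted in characters for simplicity, whereas real limits are measured in tokens.

```python
def build_context(chunks, max_chars=200):
    """Join chunks into a context string, stopping before the budget is exceeded."""
    context = "Context:\n"
    for chunk in chunks:
        if len(context) + len(chunk) > max_chars:
            break  # adding this chunk would go over the budget
        context += chunk + "\n\n"
    return context

def build_prompts(query, chunks):
    """Return a prompt without context and a prompt with context."""
    instructions = f"Please answer the following question.\n{query}\n\n"
    return instructions, instructions + build_context(chunks)

# Placeholder chunks; the last one is too long to fit in the budget
example_chunks = ["Chunk one about model A.", "Chunk two about model B.", "x" * 500]
plain_prompt, context_prompt = build_prompts(
    "How do model A and model B differ?", example_chunks
)
print(context_prompt)
```

The oversized third chunk is left out, so the prompt stays within the budget while keeping the highest-ranked chunks.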

## Pass the prompts to the LLM
We can choose to pass these prompts to the LLM of our choice. In this case, we are using Llama 3.1-405B-Instruct, but we could easily choose another model using the `InferenceClient()`.

In [None]:
# Choose the model
client = InferenceClient(model = "meta-llama/Llama-3.1-405B-Instruct", provider = 'nebius')

In [None]:
# Ask the model without RAG context
completion = client.chat.completions.create(
	messages=[{"role": "user", "content": ragless_prompt}],
)

print(completion.choices[0].message.content)

In [None]:
# Ask the model with RAG context
completion = client.chat.completions.create(
	messages=[{"role": "user", "content": ragful_prompt}],
)

print(completion.choices[0].message.content)

# LLM Agents Basics with Open AI

The concept of AI agents gained significant traction in 2024. This sea change compelled Open AI to release a new API, [Responses](https://platform.openai.com/docs/api-reference/responses), which will be the eventual successor to the [Chat Completions](https://platform.openai.com/docs/api-reference/chat) API. With this shift, Open AI is planning to create more built-in tools for web search, file search, and computer use. The differences between the two APIs are detailed in the [Open AI documentation](https://platform.openai.com/docs/guides/responses-vs-chat-completions).

Let's install the Open AI library and enter an API token in order to compare.

In [None]:
# Install Open AI 
!pip install --upgrade openai

In [None]:
# Import getpass and Open AI
from getpass import getpass
from openai import OpenAI

In [None]:
# Store Open AI API key in a variable
# The getpass function obscures your key if you share the notebook with others

openai_api_key = getpass('Enter your Open AI API Key: ')

In [None]:
# Create the Open AI client with our API key
# (this replaces the Hugging Face `client` created earlier)
client = OpenAI(api_key=openai_api_key)

# Chat Completions API
The older Chat Completions API is very similar to the Hugging Face Inference API. We create a variable to store the output of `client.chat.completions.create` while specifying:

* `model`- the Open AI model we would like to use
* `messages`- a list containing a dictionary with the `role` and `content` of the conversation

In [None]:
# Chat Completions API
completion = client.chat.completions.create(
  model="gpt-4o",
  messages=[
      {
          "role": "user",
          "content": "What is a language model?"
      }
  ]
)

print(completion.choices[0].message.content)

The completion object is quite complex to navigate to get to the response text. See the [full documentation for details](https://platform.openai.com/docs/api-reference/chat/list-object). At each level of the object, there are many attributes that we need to drill down into in order to get the response at: `completion.choices[0].message.content`. Compare the original API syntax with the streamlined version in the Responses API.

In [None]:
# New Responses API
response = client.responses.create(
    model="gpt-4o", 
    input="What is a language model?"
)

print(response.output_text)

We can now simply include a string for the `input` parameter. With the older **Chat Completions** API, the messages list must be manually updated each time, adding what was said to a growing list that forms a conversation. The **Responses** API automatically assigns an ID to each response and stores it. (If we do not want it stored, we can set the parameter `store=False`.)
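
The manual bookkeeping required by the Chat Completions style can be sketched without calling any API. In the example below, `fake_llm` is a stand-in function invented for illustration; with a real client, its place would be taken by a call such as `client.chat.completions.create`.

```python
def fake_llm(messages):
    """Stand-in for a chat model: reports how many user turns it has seen."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"I have received {len(user_turns)} user message(s)."

# The conversation is a list that we must grow ourselves
messages = []

for user_text in ["What is a language model?", "Can you give an example?"]:
    # Add the new user turn to the history
    messages.append({"role": "user", "content": user_text})
    # Send the whole history to the model each time
    reply = fake_llm(messages)
    # Add the model's reply so the next turn can see it
    messages.append({"role": "assistant", "content": reply})

for message in messages:
    print(f'{message["role"]}: {message["content"]}')
```

Every request carries the full history, so the list (and the cost of each call) grows with the length of the conversation.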

So we see that the API has been significantly streamlined, with the most important data being easier to surface. This is also clear if we take a look at the difference between our `completion` with the older Chat Completions API and our `response` with the newer Responses API.

In [None]:
# Chat completions object
print(completion)

# Responses object
print(response)

The syntax for streaming is a little different. The stream comes through in a series of events and we check to see if there is a delta (or difference) in each event. If there is, we print it out here.
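
The same loop can be tried offline first. The sketch below uses simulated events built with `SimpleNamespace`; the event objects and their text are made up for illustration, and only some of them carry a `delta` attribute, as in a real stream.

```python
from types import SimpleNamespace

# Simulated stream: only the text events have a `delta` attribute
simulated_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="A language "),
    SimpleNamespace(type="response.output_text.delta", delta="model predicts text."),
    SimpleNamespace(type="response.completed"),
]

collected = []
for event in simulated_events:
    # Skip events that carry no new text
    if hasattr(event, "delta"):
        print(event.delta, end="")
        collected.append(event.delta)
print()

# Joining the deltas recovers the complete response text
full_text = "".join(collected)
```

Collecting the deltas in a list, as done here, is a simple way to keep the full text once the stream has finished.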

In [None]:
# New Responses API with streaming response
response = client.responses.create(
    model="gpt-4o", 
    input="What is a language model?",
    stream=True
)

for event in response: # Unfortunately, this still has a bug in Jupyter where it overwrites characters
    if hasattr(event, 'delta'):
        print(event.delta, end='')

# The Building Blocks of Agentic Workflows

The Responses API is designed with agent tasks and workflows in mind, prioritizing web search alongside new tools such as file search, computer use, and a forthcoming code interpreter. The concept is that agent models do not respond simply by providing text, images, or content, but that they work actively to solve problems. This could require multiple steps or multiple models working together on a task. 

A recent post by Anthropic, [Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents), describes the recent shift toward agentic systems using two categories:

* **Workflows** are systems where LLMs and tools are orchestrated through predefined code paths.
* **Agents**, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

For well-defined tasks, workflows are usually the better choice since they define a clear process for using particular tools at particular times with clear end states. Some workflows include:

* **Building Block**- An LLM is augmented with additional tools and retrieval such as Retrieval Augmented Generation
* **Prompt Chaining**- A series of LLM prompts are chained together with a "gate" to check progress after each step is completed
* **Routing**- A specialized router or triage agent routes the task through to a particular agent designed for the task
* **Parallelization**- Multiple LLMs run similar tasks at the same time. The results are then either combined into a single result or voted on for the best solution
* **Orchestrator-workers**- A central LLM breaks down a task and then assigns it to worker LLMs before synthesizing the results
* **Evaluator-optimizer**- One LLM generates solutions and works in tandem with an evaluator/optimizer, improving a solution through successive iterations

These workflows can be connected in sophisticated pathways to accomplish difficult tasks.
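
As one illustration, prompt chaining with a gate can be sketched in plain Python. The functions `write_outline` and `write_draft` are hypothetical stand-ins for LLM calls; what matters is the predefined code path, where the second step only runs if the gate approves the output of the first.

```python
def write_outline(topic):
    """Stand-in for a first LLM call that produces an outline."""
    return [f"Define {topic}", f"Give an example of {topic}"]

def outline_gate(outline):
    """Gate: accept the outline only if it has at least two non-empty points."""
    return len(outline) >= 2 and all(point.strip() for point in outline)

def write_draft(outline):
    """Stand-in for a second LLM call that expands the outline."""
    return " ".join(f"{point}." for point in outline)

def run_chain(topic):
    """Run the steps in a fixed order, stopping early if the gate fails."""
    outline = write_outline(topic)   # Step 1
    if not outline_gate(outline):    # Gate between the steps
        return None
    return write_draft(outline)      # Step 2

draft = run_chain("retrieval augmented generation")
print(draft)
```

Because the order of the steps is fixed in code, the outcome is predictable: either the gate passes and a draft is produced, or the chain stops and returns `None`.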

In contrast, **Agents** "...are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks." Since the agent is making strategic decisions on the process, tools, and solution-state, the results of agents can be highly variable. 

# File Search, Computer Use, and Web Search

The Responses API offers basic capabilities with File Search, Computer Use, and Web Search. There are additional tools planned, including a code interpreter.

## File Search

File Search is an implementation of RAG, where you create a vector store which can be stored on Open AI infrastructure. The data storage does have a small cost: "Your first GB is free and beyond that, usage is billed at $0.10/GB/day of vector storage. There are no other costs associated with vector store operations" ([File Search Documentation](https://platform.openai.com/docs/assistants/tools/file-search)). There are arguably better players in this space, and it is not too hard to build a RAG system (as we demonstrated above). However, the nice thing about this service is it allows Open AI to be a one-stop shop for your models and your data.
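
The quoted pricing can be turned into a quick estimate. This sketch assumes the rates in the quotation above still apply and that partial gigabytes are billed proportionally; both are assumptions, so check the current pricing before relying on the numbers. The function `daily_vector_storage_cost` is a hypothetical helper.

```python
def daily_vector_storage_cost(stored_gb, free_gb=1.0, price_per_gb_day=0.10):
    """Estimate the daily cost in dollars: the first GB is free, the rest is billed."""
    billable_gb = max(0.0, stored_gb - free_gb)
    return billable_gb * price_per_gb_day

for size_gb in [0.5, 1.0, 5.0]:
    cost = daily_vector_storage_cost(size_gb)
    print(f"{size_gb} GB stored -> ${cost:.2f} per day")
```

For example, 5 GB of stored vectors leaves 4 GB billable, which comes to about $0.40 per day under these assumptions.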

## Computer Use

It is still very early days for computer use models, but Open AI's Computer-Using Agent (CUA) model shows promise. Basically, the CUA model operates in a loop. An image is sent to the model and then it can take an action such as clicking, typing, or scrolling. After each action, a new image is sent to the model, so it can decide what action to take next. The CUA model can use a browser automation framework such as [Playwright](https://playwright.dev/) or [Selenium](https://www.selenium.dev/) to control a web browser on your machine. The CUA model can also control a local virtual machine through [Docker](https://www.docker.com/).
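
The loop described above can be sketched with stand-ins for both the model and the computer. Everything here is simulated: `take_screenshot` returns a text description instead of an image, `fake_model` uses fixed rules instead of a real CUA model, and the action format is invented for illustration.

```python
def take_screenshot(state):
    """Stand-in for a real screenshot: describes the screen as text."""
    return f"page={state['page']} typed={state['typed']!r}"

def fake_model(screenshot):
    """Stand-in for the model: picks the next action from the screenshot."""
    if "typed=''" in screenshot:
        return {"type": "type", "text": "language models"}
    if "page=home" in screenshot:
        return {"type": "click", "target": "search button"}
    return {"type": "done"}

def apply_action(state, action):
    """Stand-in for browser automation: updates the simulated screen."""
    if action["type"] == "type":
        state["typed"] = action["text"]
    elif action["type"] == "click":
        state["page"] = "results"

state = {"page": "home", "typed": ""}
history = []

for _ in range(10):  # cap the number of steps so the loop always ends
    screenshot = take_screenshot(state)  # 1. send an image to the model
    action = fake_model(screenshot)      # 2. the model chooses an action
    history.append(action["type"])
    if action["type"] == "done":
        break
    apply_action(state, action)          # 3. perform it, then loop again

print(history)
```

The step cap is a design choice worth keeping in a real system too, since a model that never decides it is finished would otherwise loop indefinitely.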

## Web Search

The web search tool enables the LLM to search the web in order to find relevant information.

In [None]:
# Web Search Tool

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],
    input="What happened in AI news today?"
)

print(response.output_text)

Notice we did not ask the model to search the web. We can make the web search tool available, and the model *decides* whether it should check the web based on the prompt. 
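
One way to check whether a search actually happened is to inspect the items in the response output. The sketch below works on simulated data: it assumes that a web search appears as an output item whose type is `web_search_call`, and it uses plain dictionaries where a real `response.output` would contain objects. The helper `used_web_search` is hypothetical.

```python
def used_web_search(output_items):
    """Return True if any output item records a web search call."""
    return any(item.get("type") == "web_search_call" for item in output_items)

# Simulated outputs (assumed shapes, not real API responses)
simulated_with_search = [
    {"type": "web_search_call", "status": "completed"},
    {"type": "message", "role": "assistant"},
]
simulated_without_search = [
    {"type": "message", "role": "assistant"},
]

print(used_web_search(simulated_with_search))
print(used_web_search(simulated_without_search))
```

With a real response, printing `response.output` is a quick way to see which item types are present before writing a check like this one.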

In [None]:
# The model does not use web search to answer
response = client.responses.create(
    model="gpt-4o",
    tools=[{
        "type": "web_search_preview"
        
        }],
    input="What is a unicorn?"
)

print(response.output_text)

We can force the model to search the web using the `tool_choice` parameter: `tool_choice={"type": "web_search_preview"}`. We can also set `search_context_size` to `low`, `medium`, or `high` to control how much context is retrieved from the web; the example below uses `low`. Using more context will impact the quality, cost, and speed of the response. 
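
When experimenting with these options, it can be convenient to build the arguments in one place. The helper below, `web_search_arguments`, is a hypothetical convenience function and not part of the Open AI library; it assumes the accepted sizes are `low`, `medium`, and `high`.

```python
# Assumed set of accepted values for search_context_size
ALLOWED_CONTEXT_SIZES = {"low", "medium", "high"}

def web_search_arguments(context_size="medium", force_search=False):
    """Build the `tools` (and optionally `tool_choice`) keyword arguments."""
    if context_size not in ALLOWED_CONTEXT_SIZES:
        raise ValueError(f"context_size must be one of {sorted(ALLOWED_CONTEXT_SIZES)}")
    arguments = {
        "tools": [{
            "type": "web_search_preview",
            "search_context_size": context_size,
        }]
    }
    if force_search:
        # Require the model to use the web search tool
        arguments["tool_choice"] = {"type": "web_search_preview"}
    return arguments

forced_high = web_search_arguments("high", force_search=True)
print(forced_high)
```

The result can be unpacked into a request, for example `client.responses.create(model="gpt-4o", input="What is a unicorn?", **forced_high)`.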

In [None]:
# The model is forced to use the web browser search to help answer
response = client.responses.create(
    model="gpt-4o",
    tools=[{
        "type": "web_search_preview",
        "search_context_size": "low"
        
    }],
    tool_choice={"type": "web_search_preview"},
    input="What is a unicorn?",
)

print(response.output_text)

We can also set a user location in our tools in order to influence the output.

In [None]:
# The model is given a location in order to answer the question
response = client.responses.create(
    model="gpt-4o",
    tools=[{
        "type": "web_search_preview",
        "user_location": {
            "type": "approximate",
            "country": "US",
            "city": "Detroit",
            "region": "Michigan",
        }
    }],
    tool_choice={"type": "web_search_preview"},
    input="What is a burger place open tomorrow for lunch?",
)

print(response.output_text)