# Retrieval-augmented Generation (RAG)

In this example, we will use RAG to generate code based on an external API defined in a Swagger file.

## Dependencies

### LangChain

Here, we will use the LangChain libraries.

In [None]:
import sys
!{sys.executable} -m pip install langchain jq langchain-community langchain-openai

### Model

We will use OpenAI's models, so we need an API key.


In [None]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

## Target

Let's work with the OpenDataHub API to get tourism data from South Tyrol.

To do that, let's download the JSON file with the definitions of the API calls.

In [None]:
!wget https://raw.githubusercontent.com/melegati/genai4se-course/refs/heads/master/opendatahub.json

## Retrieval

First, we have to load the documents that will be stored in the database.

To this end, we will read the JSON file describing the API and turn each API operation into one document.

In [None]:
from langchain_community.document_loaders import JSONLoader
import json
import jq

file_path = './opendatahub.json'

# Read the base URL of the API from the "servers" section of the file.
with open(file_path) as f:
    data = json.load(f)
    api_url = jq.compile('.servers[0].url').input(data).first()
    print(api_url)

# One document per (path, HTTP method) pair, keeping only the fields useful for retrieval.
jq_schema = (
    '.paths | to_entries[] | .key as $path | .value | to_entries[] | '
    '{ path: $path, method: .key, tag: .value.tags[0], summary: .value.summary, '
    'parameters: [ { name: .value.parameters[]?.name } ] }'
)

loader = JSONLoader(
    file_path=file_path,
    jq_schema=jq_schema,
    text_content=False)

docs = loader.load()
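
To make the `jq_schema` above easier to read, the same flattening can be sketched in plain Python. This is only an illustrative equivalent, not what `JSONLoader` runs internally; the `flatten_paths` name and the tiny `sample_spec` are made up for this example and are not taken from the real OpenDataHub file.

```python
def flatten_paths(spec):
    """Return one record per (path, HTTP method) pair, mirroring the jq schema."""
    records = []
    for path, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            tags = operation.get("tags") or [None]
            records.append({
                "path": path,
                "method": method,
                "tag": tags[0],
                "summary": operation.get("summary"),
                # Keep only the parameter names, as the jq expression does.
                "parameters": [
                    {"name": p.get("name")} for p in operation.get("parameters", [])
                ],
            })
    return records


# Hypothetical, heavily simplified excerpt of a Swagger/OpenAPI file.
sample_spec = {
    "paths": {
        "/items": {
            "get": {
                "tags": ["Item"],
                "summary": "List items",
                "parameters": [{"name": "page"}, {"name": "since"}],
            }
        }
    }
}

records = flatten_paths(sample_spec)
print(records)
```

Each record is small and self-describing, which is what we want to embed: the retriever only needs enough text to match a task to an endpoint.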

Checking the number of documents:

In [None]:
len(docs)

Let's take a look at the first document.

In [None]:
docs[0]

Let's use OpenAI's text-embedding-3-large model for creating the embeddings.

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

For this didactic example, we will use LangChain's InMemoryVectorStore. According to the [documentation](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html), it keeps the vectors in a dictionary and ranks them by cosine similarity.

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
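
To give an intuition for what "a dictionary plus cosine similarity" means, here is a minimal sketch of such a store in plain Python. It is a toy written for this explanation, not LangChain's actual implementation: the `ToyVectorStore` class, the two-dimensional vectors, and the texts are all invented, and real embeddings have thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical store: a dict mapping an id to (vector, text)."""

    def __init__(self):
        self.store = {}

    def add(self, doc_id, vector, text):
        self.store[doc_id] = (vector, text)

    def search(self, query_vector, k=2):
        # Score every stored vector against the query, then keep the k best.
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for vector, text in self.store.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


toy = ToyVectorStore()
toy.add("a", [1.0, 0.0], "list events")
toy.add("b", [0.0, 1.0], "weather forecast")
toy.add("c", [0.9, 0.1], "event details")

top = toy.search([1.0, 0.0], k=2)
print(top)
```

A linear scan like this is fine for a few hundred API operations; larger collections usually call for a dedicated vector database.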

Now let's add the documents to the vector store. In this step, the embedding of each document is computed with the model defined above.

In [None]:
vector_store.add_documents(documents=docs)

Let's create a function that retrieves the relevant documents given the textual description of a task.

In [None]:
def retrieve(task):
    retrieved_docs = vector_store.similarity_search(task)
    return "\n\n".join(doc.page_content for doc in retrieved_docs)
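
The number of retrieved documents determines how much API information ends up in the prompt. LangChain's `similarity_search` accepts a `k` argument (4 by default), so a variant of `retrieve` could expose it. The sketch below assumes that interface and uses a hypothetical `StubStore` in place of the real vector store so that it runs without an API key; `retrieve_top_k` is likewise a name introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Hypothetical stand-in for a LangChain document."""
    page_content: str


class StubStore:
    """Hypothetical stand-in exposing similarity_search(query, k)."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        # A real store would rank by similarity; the stub just returns the first k.
        return [StubDoc(text) for text in self.texts[:k]]


def retrieve_top_k(task, store, k=4):
    """Join the contents of the k most similar documents into one context string."""
    retrieved_docs = store.similarity_search(task, k=k)
    return "\n\n".join(doc.page_content for doc in retrieved_docs)


stub = StubStore(["doc A", "doc B", "doc C"])
context = retrieve_top_k("any task", stub, k=2)
print(context)
```

A smaller `k` keeps the prompt short but risks missing the right endpoint; a larger `k` does the opposite.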

Let's check if it works:

In [None]:
task = "Write a piece of code to list the events in the last year."
retrieve(task)

## Generation

Let's use OpenAI's GPT-4o-mini to generate the answer.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


llm = ChatOpenAI(model="gpt-4o-mini")

Here we define the prompt template, leaving placeholders for the retrieved API information and the task. We also pass the API URL so that the generated code is runnable. The result is saved to a Python file.

In [None]:
def generate(task, api_info, api_url):

    prompt_template = ChatPromptTemplate([
        ("system", "You are a developer using an API to implement a solution in Python."),
        ("user", "Below, there is the information about the API you need to use {apiInfo}. Your task is: {task}. Just return the code without anything else. The API URL is {apiUrl}")
    ])

    prompt = prompt_template.invoke({"apiInfo": api_info, "task": task, "apiUrl": api_url})
    answer = llm.invoke(prompt)
    result = answer.content

    answer.pretty_print()

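    # The model may wrap its answer in a Markdown code fence, possibly with text
    # before or after it; keep only the code inside the fence.
    if "```python" in result:
        result = result.split("```python", 1)[1].split("```", 1)[0].lstrip("\n")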

    output_path = "output/main.py"
    with open(output_path, "w") as f:
        f.write(result)
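
Stripping the code fence is the most fragile step of `generate`, so it can help to check it in isolation, without calling the model. The following is a minimal sketch of a more tolerant extraction, assuming the answer contains at most one relevant fenced block; `extract_code` is a hypothetical helper, not part of LangChain.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
CODE_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_code(answer_text):
    """Return the contents of the first fenced block, or the text unchanged if unfenced."""
    match = CODE_BLOCK.search(answer_text)
    if match:
        return match.group(1)
    return answer_text


# Made-up model answers covering the two cases we care about.
fenced = "Here is the code:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
plain = "print('hi')\n"

print(extract_code(fenced))
print(extract_code(plain))
```

Returning the text unchanged when no fence is found means an answer that already is pure code still gets written to the file as is.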

We need a folder to save the output:

In [None]:
!mkdir -p output

In [None]:
api_info = retrieve(task)
generate(task, api_info, api_url)

Let's execute the generated code to check if it works!

In [None]:
!python output/main.py
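
Generated code is untrusted, so outside a notebook it may be preferable to run it with a timeout and capture its output. Here is a minimal sketch using the standard library; `run_script` is a hypothetical helper, and a temporary file stands in for `output/main.py` so that the example runs on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(path, timeout_seconds=30):
    """Run a Python file with the current interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, str(path)],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )


# Stand-in for output/main.py so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "main.py"
    script.write_text("print('hello from generated code')\n")
    completed = run_script(script)

print(completed.returncode)
print(completed.stdout)
```

A non-zero `returncode` or a non-empty `completed.stderr` indicates that the generated program failed, and that message could be fed back to the model in a follow-up prompt.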