# Scientific Paper Recommendation System with RAG
## Project Description
This project implements a Retrieval Augmented Generation (RAG) system for scientific paper recommendations. The system allows users to input a document or query and receive recommendations for relevant scientific papers from the ArXiv database.

## Key Components
- Vector Database: Uses a ChromaDB vector database containing embeddings of 10,000 ArXiv research papers
- Document Processing: Extracts and processes PDF content using PyMuPDF
- Semantic Search: Performs similarity searches based on document content

## GenAI Functionalities
- Embeddings: Generates semantic embeddings using Google's text-embedding-004 model
- Prompt Engineering: Utilizes carefully crafted prompts to guide the AI's behavior
- RAG Implementation: Combines vector search results with generative AI responses
- Document Understanding: Processes and interprets PDF research papers
- Vector Embedding and Vector Search: Performs semantic similarity searches in high-dimensional vector space

The system orchestrates these components through a chatbot interface that processes user queries, searches for relevant papers, and generates comprehensive responses that include paper details like authors and publication dates.

## Database
ArXiv serves as an excellent data source for our recommendation system for several key reasons:

- **Rich Scientific Content**: Contains over 2 million scholarly articles across multiple disciplines
- **Well-Structured Metadata**: Includes titles, abstracts, authors, and categories in a consistent format
- **Embedding-Friendly**: Abstracts provide concise, information-dense text that produces meaningful vector embeddings
- **Research Relevance**: Widely used by the scientific community, ensuring practical utility
- **Semantic Search Compatibility**: Content structure works effectively with our embedding model (text-embedding-004)

Our implementation uses 10,000 ArXiv papers converted to vector embeddings and stored in ChromaDB, enabling semantic similarity searches to retrieve relevant scientific literature for user queries. The ArXiv dataset is available on Kaggle: [Arxiv_dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv).

## Use Case
Researchers can upload a scientific paper and ask questions to find related work in the ArXiv database, facilitating literature reviews and discovery of relevant research.

````
chatbot = RAG_Scientific_chatbot()
answer = chatbot.chat("Find me related papers", "/path/to/document.pdf")
display(Markdown(answer))
```` 


## Setup
Install and import the necessary libraries.

In [1]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"
!pip install --upgrade pymupdf

from google import genai
from google.genai import types

import json


In [2]:
# API keys
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
client = genai.Client(api_key=GOOGLE_API_KEY)

In [3]:
# Define a retry policy. The model might make multiple consecutive calls automatically
# for a complex query, this ensures the client retries if it hits quota limits.
from google.api_core import retry

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

if not hasattr(genai.models.Models.generate_content, '__wrapped__'):
  genai.models.Models.generate_content = retry.Retry(
      predicate=is_retriable)(genai.models.Models.generate_content)
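
The retry wrapper above delegates to `google.api_core.retry`. As a rough illustration of what such a policy does, the sketch below retries a call with exponential backoff using only the standard library. The names `RetriableError`, `retry_with_backoff`, and `flaky_call` are hypothetical and do not belong to any Google library; the delays and attempt limit are assumed values.

```python
import time


class RetriableError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"API error {code}")
        self.code = code


def retry_with_backoff(func, retriable_codes=(429, 503), max_attempts=5,
                       initial_delay=1.0, multiplier=2.0, sleep=time.sleep):
    """Call func(), retrying on retriable errors with growing delays."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetriableError as err:
            # Give up on non-retriable codes or once attempts are exhausted.
            if err.code not in retriable_codes or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= multiplier


# Usage: a fake call that fails twice with 429 and then succeeds.
attempts = {"count": 0}
delays = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetriableError(429)
    return "ok"


# Record the delays instead of actually sleeping.
result = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, attempts["count"], delays)
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real delays.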

## Upload of Arxiv papers
First, the ArXiv dataset is imported and a random sample of papers is drawn. The sampled papers are then converted to vector embeddings and stored in a ChromaDB vector database. The dataset import is shown below.

RANDOM SAMPLING

In [4]:
import random

amount_papers = 10000
papers = []

# The snapshot stores one JSON record per line
with open('/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json', 'r') as file:
    for line in file:
        papers.append(json.loads(line))

# Keep a random sample of papers (sampling without replacement)
papers = random.sample(papers, amount_papers)

# Now papers is a list of dictionaries
print("Headers:", list(papers[0].keys()))


Headers: ['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed']
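
The cell above reads the whole snapshot into memory before sampling. A streaming alternative could be sketched as follows, assuming one JSON record per line. The helper `reservoir_sample` is hypothetical and uses reservoir sampling (Algorithm R), which is a different technique from the `random.sample` call used in the notebook.

```python
import io
import json
import random


def reservoir_sample(lines, k, rng=None):
    """Return k records sampled uniformly from an iterable of JSON lines.

    Uses reservoir sampling (Algorithm R), so only k records are kept
    in memory at any time.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(json.loads(line))
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = json.loads(line)
    return reservoir


# Usage with an in-memory stand-in for the snapshot file.
fake_file = io.StringIO("\n".join(json.dumps({"id": n}) for n in range(1000)))
sample = reservoir_sample(fake_file, k=10, rng=random.Random(0))
print(len(sample))
```

With the real dataset, the open file handle would take the place of `fake_file`.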


Only the title and the abstract of each paper will be embedded. The code below implements this preprocessing of the papers.

In [5]:
def remove_newlines(obj):
    # Replace newlines in strings; any other value is returned unchanged
    if isinstance(obj, str):
        return obj.replace('\n', ' ')
    return obj

preprocessed_papers = []
for paper in papers:
    preprocessed_papers.append("PAPER TITLE: " + remove_newlines(paper["title"]) + "\nPAPER CONTENT: "+ remove_newlines(paper["abstract"]))
print("SUCCESSFULLY PREPROCESSED "+ str(len(preprocessed_papers)) + " PAPERS")
print("--- EXAMPLE OF PREPROCESSED PAPER ---")
print(preprocessed_papers[0])

SUCCESSFULLY PREPROCESSED 10000 PAPERS
--- EXAMPLE OF PREPROCESSED PAPER ---
PAPER TITLE: Lorentzian Lie 3-algebras and their Bagger-Lambert moduli space
PAPER CONTENT:   We classify Lie 3-algebras possessing an invariant lorentzian inner product. The indecomposable objects are in one-to-one correspondence with compact real forms of metric semisimple Lie algebras. We analyse the moduli space of classical vacua of the Bagger-Lambert theory corresponding to these Lie 3-algebras. We establish a one-to-one correspondence between one branch of the moduli space and compact riemannian symmetric spaces. We analyse the asymptotic behaviour of the moduli space and identify a large class of models with moduli branches exhibiting the desired N^{3/2} behaviour. 


Now the preprocessed papers are transformed into vector embeddings.

In [6]:
def batch(iterable, n=100):
    for i in range(0, len(iterable), n):
        yield iterable[i:i + n]

papers_embedded = []  
papers_batches = list(batch(preprocessed_papers, 100)) #limit of 100 embeddings per call

# Use a loop variable that does not shadow the batch() helper defined above
for paper_batch in papers_batches:
    batch_embedded = client.models.embed_content(
        model='models/text-embedding-004',
        contents=paper_batch,
        config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY'))
    list_batch_embedded = [e.values for e in batch_embedded.embeddings]
    papers_embedded += list_batch_embedded

print("SUCCESSFULLY EMBEDDED "+ str(len(papers_embedded)) + " PAPERS")

SUCCESSFULLY EMBEDDED 10000 PAPERS


Once computed, the vector embeddings of the papers are stored in the ChromaDB database.

In [7]:
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
def batch(iterable, batch_size):
    for i in range(0, len(iterable), batch_size):
        yield iterable[i:i + batch_size]


# Start ChromaDB client
chromadb_client = chromadb.Client()

# Create or get a collection
collection = chromadb_client.get_or_create_collection(name="papers")

# Add the documents + embeddings to Chroma
emb_batches = list(batch(papers_embedded, 41000))
papers_batches = list(batch(preprocessed_papers, 41000))
for i in range(len(emb_batches)):
    ids_batch = [f"doc_{j + i * 41000}" for j in range(len(emb_batches[i]))]
    collection.add(
        documents=papers_batches[i],
        embeddings=emb_batches[i],
        ids=ids_batch,
    )
print("SUCCESSFULLY UPLOADED "+ str(len(papers_embedded)) + " PAPERS")

SUCCESSFULLY UPLOADED 10000 PAPERS
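
The `doc_<n>` IDs assigned above encode each paper's position in the `papers` list, which is what later allows the metadata of a search hit to be looked up by index. A minimal sketch of that round trip is shown below; the helpers `make_id_batches` and `id_to_index` are hypothetical names introduced only for illustration.

```python
def make_id_batches(n_items, batch_size):
    """Yield lists of IDs of the form 'doc_<index>', one list per batch."""
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        yield [f"doc_{index}" for index in range(start, stop)]


def id_to_index(doc_id):
    """Recover the integer list index from an ID such as 'doc_2287'."""
    prefix, _, number = doc_id.partition("_")
    if prefix != "doc" or not number.isdigit():
        raise ValueError(f"Unexpected document ID: {doc_id!r}")
    return int(number)


# Usage: 10 items in batches of 4 give batches of sizes 4, 4 and 2.
id_batches = list(make_id_batches(10, 4))
print(id_batches[-1])
print(id_to_index("doc_2287"))
```

Validating the ID format before converting it makes a mismatch between the vector database and the `papers` list fail loudly rather than silently returning the wrong paper.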


## Vector database search example

Now an example paper is used to search for similar papers in the database. If the same paper is obtained, the queried paper was in the database.

In [8]:
query_input = "Statistical modeling of experimental physical laws is based on the probability density function of measured variables. It is expressed by experimental data via a kernel estimator. The kernel is determined objectively by the scattering of data during calibration of experimental setup. A physical law, which relates measured variables, is optimally extracted from experimental data by the conditional average estimator. It is derived directly from the kernel estimator and corresponds to a general nonparametric regression. T"
#query_input = pdf_text

query_embedding = client.models.embed_content(
        model='models/text-embedding-004',
        contents=query_input,
        config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY'))

In [9]:
results = collection.query(
    query_embeddings=[query_embedding.embeddings[0].values],
    n_results=5  # Number of similar docs to return
)

for doc, doc_id in zip(results["documents"][0], results["ids"][0]):
    print(f"ID: {doc_id}")
    print(f"{doc}\n")

ID: doc_2287
PAPER TITLE: Trapezoidal rule and sampling designs for the nonparametric estimation   of the regression function in models with correlated errors
PAPER CONTENT:   The problem of estimating the regression function in a fixed design models with correlated observations is considered. Such observations are obtained from several experimental units, each of them forms a time series. Based on the trapezoidal rule, we propose a simple kernel estimator and we derive the asymptotic expression of its integrated mean squared error IMSE and its asymptotic normality. The problems of the optimal bandwidth and the optimal design with respect to the asymptotic IMSE are also investigated. Finally, a simulation study is conducted to study the performance of the new estimator and to compare it with the classical estimator of Gasser and M\"uller in a finite sample set. In addition, we study the robustness of the optimal design with respect to the misspecification of the autocovariance function
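
To make the idea of "nearby vectors" concrete, the sketch below ranks a few toy vectors by cosine similarity to a query using brute force. It is only an illustration of the principle: ChromaDB relies on its own index and configured distance metric, and the functions `cosine_similarity` and `top_n` are hypothetical helpers, not part of its API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query, vectors, n):
    """Return the n (id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy 3-dimensional "embeddings"; real ones have many more dimensions.
toy_vectors = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.9, 0.1, 0.0],
    "doc_2": [0.0, 1.0, 0.0],
}
nearest = top_n([1.0, 0.0, 0.1], toy_vectors, n=2)
print([doc_id for doc_id, _ in nearest])
```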

## Retrieval Augmented Generation (RAG)
For retrieval augmented generation, the user's question and document are used together to search for relevant papers. An answer is then generated from the retrieved papers and the user input. The steps are as follows:
1) Use an LLM to embed the user's question and input document for vector search.
2) Obtain the original documents from the vector search in the database.
3) Use the input question, the input document, and the original documents from the database to generate a response with an LLM.
4) Show the answer to the user.

### Orchestration functions
The following functions orchestrate the RAG:

- create_embedding(text): For a given text, generates the corresponding vector embedding.
- search_embedded_documents(query_embedding, n): For a given vector, searches for the n nearest vectors in the vector embeddings database.
- retrieve_documents(doc_ids): For a given list of document IDs, returns the extended information of each paper.

In [10]:
from google.genai import types

# === Tools ===
def create_embedding(text)-> list:
    print(f' - CALL: create_embedding({text[:20]})')
    vector_embedding = client.models.embed_content(
        model='models/text-embedding-004',
        contents=text,
        config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY')
    )
    return vector_embedding.embeddings[0].values

def search_embedded_documents(query_embedding: list[float], n: int) -> dict:
    print(f' - CALL: search_embedded_documents(n = {n})')
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n
    )
    return results

def retrieve_documents(doc_ids:list[int])-> list[dict]:
    print(f' - CALL: retrieve_documents(IDS = {doc_ids})')
    papers_retrieved = []
    for doc_id in doc_ids:
        papers_retrieved.append(papers[doc_id])
        
    return papers_retrieved

In [11]:
import pymupdf

def extract_text_from_pdf(path):
    text = ""
    with pymupdf.open(path) as doc:
        for page in doc:
            text += page.get_text()
    return text


pdf_text = extract_text_from_pdf("/kaggle/input/unc-paper/2409.10655v2.pdf")

In [12]:

user_document = pdf_text[:1000]
user_message = "Find me related papers."

instruction = """
You are a helpful chatbot that processes inputs from users and generates an output JSON for vector search. 

Given a user message and a document, return:
{
  "embedding_query": "<summarized embedding query based on user message and document>",
  "num_documents": <integer number of documents to retrieve>
}

Ensure the output is a valid JSON object. 'embedding_query' should be a concise string that captures the main topic or keywords for semantic search. 'num_documents' should be inferred from the user message, defaulting to 5 if unspecified.
"""

contents = [
    types.Content(
        role="user", parts=[types.Part(text=user_message),types.Part(text=pdf_text)]
    )
]

response_init = client.models.generate_content(
    model="gemini-2.0-flash", 
    config=types.GenerateContentConfig(
        system_instruction=instruction,
        #tools=[orchestration_tools]
    ),
    contents = contents
)


In [13]:
import pymupdf
import re

class RAG_Scientific_chatbot:
        
    def chat(self, question:str, document_path:str):
               
        processed_input, user_document = self._process_input(question, document_path)
    
        embedding = self._create_embedding(processed_input["embedding_query"])
        
        search_output = self._search_embedded_documents(embedding, int(processed_input["num_documents"]))
        
        doc_ids = search_output['ids'][0]  
        numeric_ids = [int(doc.split('_')[1]) for doc in doc_ids]
        extended_info = self._retrieve_documents(numeric_ids)
        answer = self._generate_final_answer(question, user_document, search_output, extended_info)

        return answer


    # === Tools ===
    def _process_input(self, user_message: str, document_path:str):
        instruction = """
            You are a scientific research assistant specializing in analyzing academic papers and research questions.
            
            Your task is to analyze the user's question and their uploaded document to create:
            1. An optimal embedding query for retrieving the most relevant scientific papers
            2. A recommendation for how many papers to retrieve
            
            INSTRUCTIONS:
            - Identify key scientific concepts, methodologies, domain-specific terminology, and research areas
            - Extract specific technical terms that would appear in related papers
            - Consider both the user's explicit question and the implicit research goals from their document
            - Focus on scientific significance rather than general terms
            - For empirical research questions, include methodology terms and measurement concepts
            - For theoretical questions, include relevant frameworks and paradigms
            
            For number of documents:
            - Suggest 3-5 papers for focused questions with specific methodology/technology
            - Suggest 5-8 papers for broader research areas requiring multiple perspectives
            - Suggest 8-12 papers for literature reviews or comparative analyses
            
            Return this string output:
            "{"embedding_query": "<your optimized embedding query>",
             "num_documents": "<number of papers to retrieve>"}"
            
            EXAMPLES:
            Poor embedding query: "machine learning effects"
            Good embedding query: "transformer neural networks performance metrics BERT GPT comparative analysis NLP benchmarks"
            """

        pdf_text = self._extract_text_from_pdf(document_path)
        
        contents = [
            types.Content(
                role="user", parts=[types.Part(text=user_message),types.Part(text=pdf_text)]
            )
        ]
        
        processed_input = client.models.generate_content(
            model="gemini-2.0-flash", 
            config=types.GenerateContentConfig(
                system_instruction=instruction,
            ),
            contents = contents
        )
        
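        # Extract the JSON object from the model reply (it may be wrapped in extra text)
        match = re.search(r'\{.*\}', processed_input.text, re.DOTALL)
        if not match:
            raise ValueError("The model reply did not contain a JSON object")
        processed_input = json.loads(match.group(0))

        return processed_input, pdf_text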

    
    def _extract_text_from_pdf(self, path):
        text = ""
        with pymupdf.open(path) as doc:
            for page in doc:
                text += page.get_text()
        return text
        
        
    def _create_embedding(self, text)-> list:
        print(f' - CALL: create_embedding({text[:20]}...)')
        vector_embedding = client.models.embed_content(
            model='models/text-embedding-004',
            contents=text,
            config=types.EmbedContentConfig(task_type='SEMANTIC_SIMILARITY')
        )
        return vector_embedding.embeddings[0].values
    
    def _search_embedded_documents(self, query_embedding: list[float], n: int) -> dict:
        print(f' - CALL: search_embedded_documents(n = {n})')
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=n
        )
        return results
    
    def _retrieve_documents(self, doc_ids:list[int])-> list[dict]:
        print(f' - CALL: retrieve_documents(IDS = {doc_ids})')
        papers_retrieved = []
        for doc_id in doc_ids:
            papers_retrieved.append(papers[doc_id])
            
        return papers_retrieved
        
    def _generate_final_answer(self, question: str, user_document: str, search_output: dict, extended_info: list[dict]):
        instruction = """
            You are an advanced scientific research assistant tasked with providing comprehensive answers based on retrieved academic papers.
            
            CONTEXT:
            - The user has asked a QUESTION about a scientific topic
            - They've provided their own INPUT_DOCUMENT (a scientific paper or research proposal)
            - You've retrieved relevant papers from a scientific database (EMBED_DATA and EXTENDED_PAPER_INFO)
            
            YOUR TASK:
            1. Analyze the retrieved papers and determine their relevance to the question
            2. Provide an answer based on the retrieved papers from EMBED_DATA and EXTENDED_PAPER_INFO. 
            3. You may use information from the INPUT_DOCUMENT if the papers from EMBED_DATA and EXTENDED_PAPER_INFO are not relevant. Always mention where the information is obtained from.
            
            IMPORTANT GUIDELINES:
            - SKIP the user's own paper if it appears in the results
            - Prioritize recent papers and high-impact findings
            - Compare and contrast contradictory findings when present
            - Always provide authors and publication dates
            - Focus on scientific significance rather than general summaries
            - For methodology questions, emphasize technical details and implementation
            - Use objective, academically-appropriate language
            - Provide the answer in a Markdown format
            - You do not need to show all the papers from EMBED_DATA and EXTENDED_PAPER_INFO, only the most relevant and important
            - For the answer, use only papers from the database. You may include some suggestions to other papers but don't make it too extensive.
            """


        # Assemble the prompt from the method arguments
        prompt = f"""
                QUESTION:{question}
                INPUT_DOCUMENT:{user_document}
                EMBED_DATA: {search_output}
                EXTENDED_PAPER_INFO: {extended_info}
                """
        
        contents = []
        contents.append(types.Content(role="user", parts=[types.Part(text = prompt)]))
        response_final = client.models.generate_content(
            model="gemini-2.0-flash", 
            config=types.GenerateContentConfig(
                system_instruction=instruction
            ),
            contents = contents
        )
        return response_final.text
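
The `_process_input` step depends on the model reply containing a JSON object with the expected keys. A more defensive variant could be sketched as below, falling back to the raw question and a default count when parsing fails. The function `parse_query_plan`, the default of 5, and the upper bound of 12 are assumptions made for illustration (the bound loosely mirrors the ranges suggested in the prompt).

```python
import json
import re


def parse_query_plan(reply_text, fallback_query, default_n=5, max_n=12):
    """Extract 'embedding_query' and 'num_documents' from a model reply.

    Falls back to the provided query and default count when the reply
    contains no usable JSON object.
    """
    plan = {}
    match = re.search(r"\{.*\}", reply_text, re.DOTALL)
    if match:
        try:
            plan = json.loads(match.group(0))
        except json.JSONDecodeError:
            plan = {}
    if not isinstance(plan, dict):
        plan = {}

    query = plan.get("embedding_query") or fallback_query
    try:
        n = int(plan.get("num_documents", default_n))
    except (TypeError, ValueError):
        n = default_n
    # Keep the count within a sane range.
    n = max(1, min(n, max_n))
    return {"embedding_query": query, "num_documents": n}


# Usage: a reply with surrounding text, and a reply with no JSON at all.
wrapped = 'Here is the plan: {"embedding_query": "uncertainty estimation", "num_documents": "7"} Hope it helps.'
good_plan = parse_query_plan(wrapped, fallback_query="Find me related papers")
fallback_plan = parse_query_plan("Sorry, I cannot help.", fallback_query="Find me related papers")
print(good_plan)
print(fallback_plan)
```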

In [14]:
from IPython.display import display, Markdown, Latex


document = "/kaggle/input/unc-paper/2409.10655v2.pdf"
user_message = "Find me related papers with emphasis in uncertainty estimation."

chatbot = RAG_Scientific_chatbot()
answer = chatbot.chat(user_message, document)
display(Markdown(answer))

 - CALL: create_embedding(deep reinforcement l...)
 - CALL: search_embedded_documents(n = 7)
 - CALL: retrieve_documents(IDS = [4064, 3257, 6329, 5726, 4074, 5484, 9038])


```markdown
Here are some papers that focus on uncertainty estimation in the context of robot navigation and reinforcement learning.

*   **LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty (Danesh et al., 2022)**

    *   Addresses challenges in autonomous driving caused by uncertainty in human behaviors.
    *   Introduces a method called LEarning Attention over Driving bEhavioRs (LEADER) that uses a neural network to focus on critical human behaviors during planning.
    *   LEADER integrates attention into a belief-space planner, using importance sampling to bias reasoning towards critical events and learns risk-aware planning without human labeling by formulating a min-max game between the attention generator and the planner.

*   **Decentralized Multi-Robot Navigation for Autonomous Surface Vehicles with Distributional Reinforcement Learning (Lin et al., 2024)**

    *   Proposes a decentralized multi-ASV collision avoidance policy using Distributional Reinforcement Learning, which accounts for interactions among ASVs, static obstacles, and current flows.
    *   A variant of the framework automatically adapts its risk sensitivity to improve ASV safety in congested environments.
    
*   **n-MeRCI: A new Metric to Evaluate the Correlation Between Predictive Uncertainty and True Error (Moukari et al., 2019)**

    *   Proposes a novel metric for evaluating relative uncertainty assessment, applicable to regression with deep neural networks.
    *   The metric is designed to assess the quality of estimated uncertainty, which is important in robotics where actions depend on the confidence in perceived information.

*   **Safe Reinforcement Learning using Data-Driven Predictive Control (Selim et al., 2022)**

    *   Introduces a data-driven safety layer that filters unsafe actions in reinforcement learning.
    *   The safety layer uses a data-driven predictive controller to enforce safety guarantees for RL policies during training and deployment, using reachability analysis to verify proposed actions.

*   **Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty (Russel et al., 2020)**

    *   Focuses on making reinforcement learning algorithms more robust to model uncertainties.
    *   Merges constrained Markov decision processes (CMDP) with robust Markov decision processes (RMDP), leading to a formulation of robust constrained-MDPs (RCMDP).
    *   The formulation provides constraint satisfaction guarantees with respect to uncertainties in the system's state transition probabilities.
```


In [15]:
if answer and isinstance(answer, str):
    display(Markdown(answer))
else:
    print("Received non-string response:", type(answer))




## Appendix: Partial code for an AI Agent 

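The code below is an incomplete attempt at an AI agent that calls the orchestration functions as tools. It did not work and is kept only as a starting point for future improvement.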

In [16]:
# === Tool declarations ===
create_embedding_tool = {
    "name" : "create_embedding",
    "description" : "For a given text, generate the corresponding vector embedding.",
    "parameters" : {
        "type": "OBJECT",
        "properties": {
            "text": {
                "type": "STRING",
                "description": "The input text to embed."
            }
        },
        "required": ["text"]
    }
}

search_embedded_documents_tool = {
    "name" : "search_embedded_documents",
    "description" : "Search for similar documents using a query embedding.",
    "parameters" : {
        "type": "OBJECT",
        "properties": {
            "query_embedding": {
                "type": "ARRAY",
                "items": {
                    "type": "NUMBER"  # Or "INTEGER" if your vectors are ints (they are normally floats)
                },
                "description": "The vector embedding of the input query."
            },
            "n": {
                "type": "INTEGER",
                "description": "Number of top similar documents to retrieve."
            }
        },
        "required": ["query_embedding", "n"]
    }
}

retrieve_documents_tool = {
    "name" : "retrieve_documents",
    "description" : "Retrieve detailed information about a document using its ID.",
    "parameters" : {
        "type": "OBJECT",
        "properties": {
            "doc_id": {
                "type": "INTEGER",
                "description": "The ID of the paper/document."
            }
        },
        "required": ["doc_id"]
    }
}



In [17]:
orchestration_tools  = types.Tool(function_declarations=[create_embedding_tool, search_embedded_documents_tool])


instruction = """You are a helpful chatbot that can interact with a database of vector embeddings
of scientific papers and a database with the papers' extended information. You will take the user's questions and documents and generate

Use the following tools:
    - create_embedding(text) to convert text into vector embeddings 
    - search_embedded_documents(query_embedding, n) to obtain n papers that are similar to the embedded query 
from the database of vector embeddings.

"""

In [18]:
#tool_call = response.candidates[0].content.parts[0].function_call

#if tool_call.name == "create_embedding":
#    result = create_embedding(**tool_call.args)

#function_response_part = types.Part.from_function_response(
#    name=tool_call.name,
#    response={"result": result},
#)

#contents.append(types.Content(role="model", parts=[types.Part(function_call=tool_call)])) # Append the model's function call message
#contents.append(types.Content(role="user", parts=[function_response_part])) # Append the function response
#response = client.models.generate_content(
#    model="gemini-2.0-flash", 
#    config=types.GenerateContentConfig(
#        system_instruction=instruction,
#        tools=[orchestration_tools]
#    ),
#    contents = contents
#)
#print(response)
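
The commented-out cell above handles only `create_embedding`. One way the dispatch step might be generalized is a lookup table from tool names to Python callables, sketched below. It assumes the model's function call exposes `name` and `args` as in the commented code; `SimpleNamespace` stands in for that object, and the two tool implementations are stubs rather than the real embedding and search functions.

```python
from types import SimpleNamespace


def fake_create_embedding(text):
    """Stub standing in for the real embedding call."""
    return [float(len(text)), 0.0]


def fake_search_embedded_documents(query_embedding, n):
    """Stub standing in for the real vector search."""
    return [f"doc_{i}" for i in range(n)]


# Map tool names (as declared to the model) to Python callables.
TOOL_REGISTRY = {
    "create_embedding": fake_create_embedding,
    "search_embedded_documents": fake_search_embedded_documents,
}


def dispatch_tool_call(tool_call, registry=TOOL_REGISTRY):
    """Run the function requested by the model and wrap its result."""
    func = registry.get(tool_call.name)
    if func is None:
        return {"error": f"Unknown tool: {tool_call.name}"}
    return {"result": func(**tool_call.args)}


# Usage with hand-built objects mimicking model function calls.
call = SimpleNamespace(name="search_embedded_documents",
                       args={"query_embedding": [0.1, 0.2], "n": 3})
unknown = SimpleNamespace(name="retrieve_documents", args={"doc_id": 1})
search_response = dispatch_tool_call(call)
unknown_response = dispatch_tool_call(unknown)
print(search_response)
print(unknown_response)
```

Returning an error payload for unknown tools, instead of raising, would let the model see the failure in the function response and possibly recover.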