<a href="https://colab.research.google.com/github/kmkarakaya/Deep-Learning-Tutorials/blob/master/Simple_Rag_with_chromaDB_Gemini_PartC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PART C: CODE WITH CHROMADB FOR PERSISTENT VECTOR DB   
    
    
    
    
    

In this notebook we will develop a Retrieval Augmented Generation (RAG) application.

The parts are:

* PART A: AN INTRO TO GEMINI API FOR TEXT GENERATION & CHAT
* PART B: CODE WITH CHROMADB FOR VECTOR STORAGE & SIMILARITY SEARCH
* PART C: CODE WITH CHROMADB FOR PERSISTENT VECTOR DB
* PART D: A SIMPLE RAG BASED ON GEMINI & CHROMADB
* PART E: ADVANCED TECHNIQUES FOR RAG BASED ON GEMINI & CHROMADB

# WHAT IS RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique that combines large language models (LLMs) with external knowledge sources to improve the accuracy and reliability of AI-generated text.

## How Does RAG Work? Unveiling the Power of External Knowledge

Before we start the core RAG process, we need to provide a foundation as follows:

* **Building the Knowledge Base:** The system starts by transforming documents and information within the external knowledge base (like Wikipedia or a company database) into a special format called **vector representations**. These condense the meaning of each document into a series of **numbers**, capturing the essence of the content.

* **Vector Database for Speedy Retrieval**: These vector representations are then stored in a specialized database called a vector database. This database is optimized for efficiently **searching and retrieving** information based on **semantic similarity**. Imagine it as a super-powered library catalog that **understands the meaning** of documents, **not just keywords**.
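To make "semantic similarity" concrete, here is a minimal sketch that ranks a few documents by cosine similarity to a query vector. The 3-dimensional vectors and the document names are made up for illustration; a real embedding model outputs vectors with hundreds of dimensions, and a vector database uses an index rather than the brute-force loop shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, vectors):
    """Return document names sorted from most to least similar."""
    return sorted(
        vectors,
        key=lambda name: cosine_similarity(query_vector, vectors[name]),
        reverse=True,
    )


# Hypothetical toy "embeddings" (assumption: invented values, not model output)
toy_vectors = {
    "dinosaur extinction": [0.9, 0.1, 0.0],
    "asteroid impact": [0.8, 0.3, 0.1],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
}

query_vector = [0.85, 0.2, 0.05]
print(rank_by_similarity(query_vector, toy_vectors))
```

The recipe vector points in a very different direction from the query, so it is ranked last even though no keyword matching is involved.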

Now, let's explore how RAG leverages this foundation:

* **User Input**: The RAG process begins with a question or **prompt** from the user. This could be anything from "What caused the extinction of the dinosaurs?" to a more open-ended request like "Write a creative story."

* **Intelligent Retrieval**: RAG doesn't rely solely on the **LLM's internal knowledge**. It employs an information retrieval component that acts like a super-powered search engine. This component scans the vast external knowledge base – like a company's internal database for specific domains – to find information **directly relevant** to the user's input. Unlike a traditional **search engine** that relies on **keywords**, RAG leverages the power of vector representations to understand the **semantic meaning** of the user's prompt and identify the most relevant documents.

* **Enriched Context Creation**: The retrieved information isn't just shown alongside the prompt. RAG cleverly **merges the user input with the relevant snippets** from the knowledge base. This creates a ***richer context*** for the LLM to understand the **user's intent** and formulate a well-informed response.

* **LLM Powered Response Generation**: Finally, the **enriched context** is fed to the Large Language Model (LLM). The LLM, along with its ability to process language patterns, now has a strong **foundation of factual** information to draw upon. This empowers it to generate a response that is both comprehensive and accurate, addressing the specific needs of the user's prompt.
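The "enriched context" step can be sketched as plain string assembly. The function name `build_rag_prompt` and the prompt wording are assumptions made for this illustration; they are not part of Gemini or ChromaDB, and Part D may build its prompt differently.

```python
def build_rag_prompt(question, snippets):
    """Merge the user question with retrieved snippets into one prompt string."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical snippets standing in for documents returned by the retriever
example_snippets = [
    "In static route planning, routes do not change during the mission.",
    "In dynamic route planning, existing routes are updated when targets change.",
]
print(build_rag_prompt("How do static and dynamic route planning differ?", example_snippets))
```

The resulting string is what would be sent to the LLM instead of the bare question.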

In this part, we will learn how to build a persistent ChromaDB Vector Database for speedy retrieval in a Knowledge Base.

https://www.trychroma.com/
https://github.com/chroma-core/chroma

# CONTENT

In this tutorial series, we are developing a Retrieval Augmented Generation (RAG) application. If you missed the first two parts, where we covered how to code the Gemini API for text generation and chat and how to code ChromaDB to store and retrieve vectors, be sure to check them out.

In this third part (Part C), we will code with ChromaDB to build a persistent database.

In this tutorial, we will learn:

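* How Does RAG Work? – Understand the fundamentals of Retrieval Augmented Generation.
* Why Do We Need a Persistent ChromaDB? – See why durable storage matters in a RAG system.
* Creating & Saving a Persistent ChromaDB – Store the vector database on Google Drive with `PersistentClient`.
* Loading a Persistent ChromaDB – Reconnect to the saved database in a fresh runtime and query it.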

All the above steps will be implemented and coded in Python on Google Colab.

Follow along step-by-step to master these techniques and enhance your data processing capabilities.

# WHY DO WE NEED A PERSISTENT CHROMADB?

In the context of a Retrieval-Augmented Generation (RAG) approach, saving and loading a persistent ChromaDB is particularly important for several reasons:

1. **Enhanced Data Durability**:
   - **Importance**: Ensures the retrieval database used for augmenting generative models is not lost between sessions or system restarts.
   - **RAG Relevance**: Maintains a consistent and reliable knowledge base that the generative model can reference, leading to more accurate and relevant responses.

2. **Operational Continuity**:
   - **Importance**: Allows seamless continuation of operations without needing to re-index or re-import data, saving time and computational resources.
   - **RAG Relevance**: Ensures that the generative model has continuous access to the same set of documents, which is essential for generating consistent and coherent responses over time.

3. **Facilitating Collaboration**:
   - **Importance**: Enables multiple users or systems to share and access the same dataset.
   - **RAG Relevance**: Supports collaborative development and usage of the RAG system, allowing different teams to work on improving the retrieval and generation processes simultaneously.

4. **Scalability**:
   - **Importance**: Provides a stable and persistent backend, enabling efficient handling of large datasets.
   - **RAG Relevance**: Essential for scaling the RAG system to handle more extensive and diverse knowledge bases, ensuring that the system can manage increased loads and deliver prompt, relevant information.


In a RAG system, the retriever (like ChromaDB) provides the generative model with relevant context from a knowledge base to generate informed and accurate responses. Persistent storage ensures that this knowledge base is durable, continuously available, and scalable, which is critical for the reliability, consistency, and performance of the RAG system.



# CREATING & SAVING A PERSISTENT CHROMADB

To make ChromaDB durable (persistent) rather than temporary on Google Colab, you can use external storage services like Google Drive or set up a cloud-based database. Google Colab provides temporary storage that resets after each session, so to maintain persistence across sessions, you'll need to save your data and configurations externally.

## Install required libraries

Install all the required libraries and define the helper functions.

In [None]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet
%pip install pypdf --quiet
%pip install langchain --quiet
%pip install tqdm --quiet

import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter


from pypdf import PdfReader

from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions

import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

## 1. Initialize ChromaDB client with Google Drive connection

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd drive/MyDrive/'Colab Notebooks'

/content/drive/MyDrive/Colab Notebooks


In [None]:
# Initialize ChromaDB client with Google Drive connection
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'


In [None]:
# prompt: Check if the chromadb_path exists or not. If so, delete all the files and folders in chromadb_path. But before deleting get the permission from the user.

import os
import shutil

if os.path.exists(chromaDB_path):
  print(f"The directory '{chromaDB_path}' already exists.")
  permission = input("Do you want to delete all the files and folders in this directory? (y/n): ")
  if permission == "y":
    shutil.rmtree(chromaDB_path)
    print(f"All files and folders in '{chromaDB_path}' have been deleted.")
  else:
    print("No action taken.")
else:
  print(f"The directory '{chromaDB_path}' does not exist.")


The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' does not exist.
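The same check can be wrapped in a reusable function. This is a minimal sketch; the name `reset_directory` and the injectable `confirm` callback are assumptions introduced here so that the prompt can be replaced in automated runs. It also accepts answers such as "Y" or " yes ".

```python
import os
import shutil


def reset_directory(path, confirm=input):
    """Delete `path` and its contents after confirmation.

    Returns True if the directory was deleted, False otherwise.
    `confirm` is any callable that takes a prompt string and returns the answer.
    """
    if not os.path.isdir(path):
        return False
    answer = confirm(f"Delete all files and folders in '{path}'? (y/n): ")
    if answer.strip().lower() not in ("y", "yes"):
        return False
    shutil.rmtree(path)
    return True
```

In the notebook it could be called as `reset_directory(chromaDB_path)`, which asks interactively just like the cell above.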


## 2. Define PersistentClient

Let's re-define the **create_chroma_client** function from the previous part so that this time we initialize a **persistent** ChromaDB client:

In [None]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient


In [None]:
def create_chroma_client(collection_name, embedding_function, chromaDB_path ):
  if chromaDB_path is not None:
    chroma_client = PersistentClient(path=chromaDB_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)
  else:
    chroma_client = Client()

  chroma_collection = chroma_client.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

  return chroma_client, chroma_collection

## 3. Create a collection as usual

In [None]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)




In [None]:
chroma_client, chroma_collection = create_chroma_client(collection_name,
                                                        embedding_function,
                                                        chromaDB_path)
print(chroma_collection.count())
print(chroma_client.list_collections())

0
[Collection(name=Papers)]


## Define helper functions

In [None]:
from google.colab import files
def upload_multiple_files():
  uploaded = files.upload()
  file_names = list()
  for fn in uploaded.keys():
    #print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    file_names.append(fn)
  return file_names

In [None]:
def convert_PDF_Text(pdf_path):
  reader = PdfReader(pdf_path)
  pdf_texts = [p.extract_text().strip() for p in reader.pages]
  # Filter the empty strings
  pdf_texts = [text for text in pdf_texts if text]
  print("Document: ",pdf_path," chunk size: ", len(pdf_texts))
  return pdf_texts

In [None]:
def convert_Page_ChunkinChar(pdf_texts, chunk_size=1500, chunk_overlap=0):
  # Split the joined page texts into chunks of at most chunk_size characters.
  # The arguments are now passed through instead of being hard-coded.
  character_splitter = RecursiveCharacterTextSplitter(
      separators=["\n\n", "\n", ". ", " ", ""],
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )
  character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
  print(f"\nTotal number of chunks (document splited by max char = {chunk_size}): \
        {len(character_split_texts)}")
  return character_split_texts
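
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is only an illustration of the two parameters: unlike `RecursiveCharacterTextSplitter`, it ignores separators and cuts at exact character positions. The name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With an overlap, the last character of one chunk reappears at the start of the next, which helps when a sentence straddles a chunk boundary.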

In [None]:
def convert_Chunk_Token(text_chunksinChar, sentence_transformer_model, chunk_overlap=0, tokens_per_chunk=128):
  # Re-split the character chunks into pieces of at most tokens_per_chunk tokens.
  # The arguments are now passed through instead of being hard-coded.
  token_splitter = SentenceTransformersTokenTextSplitter(
      chunk_overlap=chunk_overlap,
      model_name=sentence_transformer_model,
      tokens_per_chunk=tokens_per_chunk)

  text_chunksinTokens = []
  for text in text_chunksinChar:
      text_chunksinTokens += token_splitter.split_text(text)
  print(f"\nTotal number of chunks (document splited by {tokens_per_chunk} tokens per chunk):\
       {len(text_chunksinTokens)}")
  return text_chunksinTokens

In [None]:
def add_meta_data(text_chunksinTokens, title, category, initial_id):
  ids = [str(i+initial_id) for i in range(len(text_chunksinTokens))]
  metadata = {
      'document': title,
      'category': category
  }
  metadatas = [ metadata for i in range(len(text_chunksinTokens))]
  return ids, metadatas
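
The ID scheme matters because IDs must stay unique when several documents are added one after another. The sketch below mirrors `add_meta_data` with a hypothetical variant, `make_ids_and_metadatas`, that takes a chunk count instead of the chunk list and gives every chunk its own metadata dictionary rather than sharing one object.

```python
def make_ids_and_metadatas(n_chunks, title, category, initial_id):
    """Build consecutive string IDs and one metadata dict per chunk."""
    ids = [str(initial_id + i) for i in range(n_chunks)]
    metadatas = [{"document": title, "category": category} for _ in range(n_chunks)]
    return ids, metadatas


# Two hypothetical documents: the second starts where the first one ended
ids_a, metas_a = make_ids_and_metadatas(3, "paper_a.pdf", "Journal Paper", initial_id=0)
ids_b, metas_b = make_ids_and_metadatas(2, "paper_b.pdf", "Journal Paper", initial_id=len(ids_a))
print(ids_a, ids_b)
```

Using the current collection size as the starting ID works as long as nothing is ever deleted from the collection; after a deletion the count shrinks and new IDs could collide with existing ones.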

In [None]:
def add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection):
  print("Before inserting, the size of the collection: ", chroma_collection.count())
  chroma_collection.add(ids=ids, metadatas= metadatas, documents=text_chunksinTokens)
  print("After inserting, the size of the collection: ", chroma_collection.count())
  return chroma_collection

In [None]:
def retrieveDocs(chroma_collection, query, n_results=5, return_only_docs=False):
    results = chroma_collection.query(query_texts=[query],
                                      include= [ "documents","metadatas",'distances' ],
                                      n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
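
The query result is a dictionary of nested lists: one inner list per query text, which is why the code above indexes `[0]`. The sketch below flattens such a result into one record per hit. The sample dictionary is hand-written to mimic that shape (its texts and distances are invented), and `flatten_query_result` is a hypothetical helper name.

```python
def flatten_query_result(results):
    """Turn the nested result of a single-query search into a list of records."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [
        {"text": doc, "source": meta.get("document"), "distance": dist}
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]


# Hand-written stand-in for a query result (assumption: values are invented)
sample_results = {
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"document": "a.pdf", "category": "Journal Paper"},
                   {"document": "b.pdf", "category": "Journal Paper"}]],
    "distances": [[1.32, 1.41]],
}

for hit in flatten_query_result(sample_results):
    print(hit)
```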

In [None]:
def show_results(results, return_only_docs=False):

  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc));
  else:

      retrieved_documents = results['documents'][0]
      if len(retrieved_documents) == 0:
          print("No results found.")
          return
      retrieved_documents_metadata = results['metadatas'][0]
      retrieved_documents_distances = results['distances'][0]
      print("------- retreived documents -------\n")

      for i, doc in enumerate(retrieved_documents):
          print(f"Document {i+1}:")
          print("\tDocument Text: ")
          display(to_markdown(doc));
          print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
          print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
          print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


## 4. Revise the load_multiple_pdfs_to_ChromaDB() to include persistentClient  

We need to update the function as well:

In [None]:
def load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model,
                                   chromaDB_path):

  category = "Journal Paper"
  embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(model_name=sentence_transformer_model)
  chroma_client, chroma_collection = create_chroma_client(collection_name, embedding_function, chromaDB_path)
  current_id = chroma_collection.count()
  file_names = upload_multiple_files()
  for file_name in file_names:
    print(f"Document: {file_name} is being processed to be added to the {chroma_collection.name} {chroma_collection.count()}")
    print(f"current_id: {current_id} ")
    pdf_texts = convert_PDF_Text(file_name)
    text_chunksinChar = convert_Page_ChunkinChar(pdf_texts)
    text_chunksinTokens = convert_Chunk_Token(text_chunksinChar,sentence_transformer_model)
    ids,metadatas = add_meta_data(text_chunksinTokens,file_name,category, current_id)
    current_id = current_id + len(text_chunksinTokens)
    chroma_collection = add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection)
    print(f"Document: {file_name} added to the collection: {chroma_collection.count()}")
  return  chroma_client, chroma_collection

## 5. Run load_multiple_pdfs_to_ChromaDB() to fill in the collection

In [None]:
chroma_client, chroma_collection= load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model, chromaDB_path)

ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet


Saving 15 UAV Route Planning For Maximum Target Coverage.pdf to 15 UAV Route Planning For Maximum Target Coverage (8).pdf
Saving 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf to 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf
Saving 22 ISEAIA Risk Sensetive Routing Abstract.pdf to 22 ISEAIA Risk Sensetive Routing Abstract (7).pdf
Saving 70 Biometric Verification.pdf to 70 Biometric Verification (7).pdf
Document: 15 UAV Route Planning For Maximum Target Coverage (8).pdf is being processed to be added to the Papers 0
current_id: 0 


ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet
ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet


Document:  15 UAV Route Planning For Maximum Target Coverage (8).pdf  chunk size:  8

Total number of chunks (document splited by max char = 1500):         14

Total number of chunks (document splited by 128 tokens per chunk):       41
Before inserting, the size of the collection:  0
After inserting, the size of the collection:  41
Document: 15 UAV Route Planning For Maximum Target Coverage (8).pdf added to the collection: 41
Document: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf is being processed to be added to the Papers 41
current_id: 41 
Document:  16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf  chunk size:  1

Total number of chunks (document splited by max char = 1500):         2

Total number of chunks (document splited by 128 tokens per chunk):       6
Before inserting, the size of the collection:  41
After inserting, the size of the collection:  47
Document: 16 A Local Optimization Technique for Assigning New Targe

## 6. Test the ChromaDB client and collection

In [None]:
query = "What are the main difference in active and passive path scheduling?"

'''
In 16 A Local Optimization Technique for Assigning New Targets ABSTRACT:

Route planning can be static or dynamic. In static route planning, routes are
constructed according to given UAVs and targets and do not change during
the mission. However, in dynamic route planning, number of routes or UAVs
can alter which requires the update of existing routes to adopt these changes.

'''

'\nIn 16 A Local Optimization Technique for Assigning New Targets ABSTRACT:\n\nRoute planning can be static or dynamic. In static route planning, routes are \nconstructed according to given UAVs and targets and do not change during \nthe mission. However, in dynamic route planning, number of routes or UAVs \ncan alter which requires the update of existing routes to adopt these changes.\n\n'

In [None]:
retrieved_documents=retrieveDocs(chroma_collection, query, 10)
show_results(retrieved_documents)

------- retreived documents -------

Document 1:
	Document Text: 


> , routes are constructed according to given UAVs and targets and do not change during the mission. However, in dynamic route planning, number of routes or UAVs can alter which requires the update of existing routes to adopt these changes. For example, some of the UAVs can be lost during the mission or new targets might pop up after the take - off. This article proposes an iterative local optimization for the distribution of new targets to the existing routes in dynamic route planning. In the proposed solution, it is supposed that all UAVs have the same flight ranges, their initial routes are planned, and

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3230979240264826
Document 2:
	Document Text: 


> Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 34The prelimin ary results show the effectiveness of the MMAS in route planning. We would like to extend the work by defining different performance metrics and executing the experiments with different location set ups. REFERENCES [ 1 ] Bektas, T. ( 2006 ). The multiple trav eling salesman problem : an overview of formulations and solution procedures. Omega, 34 ( 3 ), 209 - 219. [ 2 ] Dorigo, M.

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4107952383887015
Document 3:
	Document Text: 


> with less cost, that is covering more targets, leave more pheromone on the paths to provide positive feedback for the other ants. 4. 4. Calculating Heuristic Value The heuristic value ( ηij ) between two locations is defined as ijijd1 =, whereijdis the distance between the locations. 4. 5. Algorithm Using the steps defined above an implementation of the MMAS is given in Table 1. We input the target list ( H ), the distances between the targets ( dij ), the flight ra nge ( FR ), and the number of UAVs

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.430447012601255
Document 4:
	Document Text: 


> follows. In the first phase of the algorithm, a n UAV with the highest slack range is picked and its route is modified by inserting a new target at a time. Adding a new target to an existing route causes an increase in the route distance, which is called update cost. If the update cost is not greater than the slack range, the new target is insert ed to the route. After finishing attempts with all new targets, if any of them is left over, insertion process is execute d with the UAV having the next highest slack range as described above until either all UAV

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4979866422014727
Document 5:
	Document Text: 


> UAV _ used < UAV ) { next = find _ Next _ Target ( ) ; if ( base _ Reachable ( next ) ) { move ( next ) ; remaining _ Range - = dcurrent, next ; target _ Number + + ; } else { move ( base ) ; UAV _ used + + ; remaining _ Range = FR ; } } / / end _ while evoporate _ Pheromone ( ) ; update _ Pheromone ( ) ; update _ Best _ Solution ( ) ; } / / end _ for _ each _ ant } return ( Best _ Solution ) ;

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5068370839867697
Document 6:
	Document Text: 


> ] and the Vehicle Routing Problem ( VRP ) [ 6 ]. In these well - defined problems, it is mostly assume d that travelling salesmen or vehicles should visit all the targets and the target function is defined as to find a minimum - distant route. Even, in the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depot s, etc. ) are included ; it is still assumed that there exists enough number of travelling salesmen or vehicles to cover all the given locations. However, in reality the number and flight range of UAVs might be ins

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5119848245505678
Document 7:
	Document Text: 


> Thus, if a routing plan can lead to visit all the targets, its cos t will be zero. The initial solution is constructed using Nearest Neighbors heuristic. The minimum pheromone value is defined as max10 min * ) 1 ( iteration p - = ( 6 ) As a result of Eq. ( 6 ), any edge would have pheromone at least ten times evaporat ed value of the maximum pheromone value. Thus, we do not allow unvisited edges get very low pheromone values which otherwise would decrease their probability. 4. 3. Updating Ph

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5183392136370448
Document 8:
	Document Text: 


> , 4 UAVs are successfully routed by the MMAS to cover all the targets while the NN prepares a routing plan for the same number of UAVs missing 4 % of the targets. Table 5. The target coverage ratios for the heuristics when FR = CD * 2. UAV TCNN TCMMAS 1 11 % 12 % 3 20 % 29 % 5 30 % 35 % 7 34 % 38 % 9 36 % 40 % 11 38 % 41 % 13 40 % 41 % 14 41 % 41 % 1 11 % 12 % 3. CONCLUSIONS In this work, we define

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.519430926832794
Document 9:
	Document Text: 


> they have already visited some of the targets accor ding to these routes. Furthermore, for each UAV, the slack range which is the difference between the flight range and initial route distance is calculated. Whenever some new targets appear, t he proposed iterative insertion algorithm executes as

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.520650897177158
Document 10:
	Document Text: 


> Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 28In the proposed solution, each ant constructs routes for the given number of UAVs using pheromone and heuristic information. After each iteration, the solution which covers more targets with less route distance is selected as the iteration - best solution and the pheromone values of the edges on that route are increased. According to the termination condition, the algorithm stops and outputs the best route found so far as the result. To evaluate the success of the proposed method,

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5302650166212748


## 7. Observe the ChromaDB saved to the provided path

List the folders and files in the chromaDB_path

In [None]:
!ls "{chromaDB_path}"

86c85359-5f78-4a0c-94ac-f734ca1ac34f  chroma.sqlite3


### YES WE DID IT!

# LOAD PERSISTENT CHROMADB



Let's kill the kernel to ensure that nothing from the ChromaDB instances above remains in memory.

In [None]:
# Kill every process in the runtime so that the kernel restarts with empty memory
!kill -9 -1

## 1. Connect to source directory

First, connect to the ChromaDB directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# change directory to chromaDB folder
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'
%cd {chromaDB_path}


/content/drive/MyDrive/Colab Notebooks/ChromaDBData


### Check whether chromaDB_path exists and, if so, whether it contains ChromaDB files and folders

In [None]:
import os
# Use chromaDB_path, the variable defined in the previous cell
if os.path.exists(chromaDB_path):
    print(f"The directory '{chromaDB_path}' exists.")

    # Check if the directory contains ChromaDB files and folders
    chromadb_files_and_folders = os.listdir(chromaDB_path)
    if any(file_or_folder.startswith('chroma') for file_or_folder in chromadb_files_and_folders):
        print("The directory contains ChromaDB files and folders.")
    else:
        print("The directory does not contain ChromaDB files and folders.")
else:
    print(f"The directory '{chromaDB_path}' does not exist.")


The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' exists.
The directory contains ChromaDB files and folders.
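
A slightly stricter check looks for the `chroma.sqlite3` file that appeared in the directory listing earlier, instead of any name starting with "chroma". This is a sketch under that assumption; other ChromaDB versions may lay out their files differently, and `looks_like_chroma_dir` is a hypothetical helper name.

```python
from pathlib import Path


def looks_like_chroma_dir(path):
    """Return True if `path` contains a chroma.sqlite3 file."""
    return (Path(path) / "chroma.sqlite3").is_file()
```

In the notebook it could be called as `looks_like_chroma_dir(chromaDB_path)` before creating the client.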


## 2. Install required libraries

Second, install the required libraries and define the helper functions.

In [None]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet

In [None]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions


In [None]:
import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
def retrieveDocs(chroma_collection, query, n_results=5,
                 return_only_docs=False, filterType=None, filterValue=None):
    if filterType is not None and filterValue is not None:
        results = chroma_collection.query(
            query_texts=[query],
            include=["documents", "metadatas", "distances"],
            where={filterType: filterValue},
            n_results=n_results)

    else:
        results = chroma_collection.query(
            query_texts=[query],
            include= [ "documents","metadatas",'distances' ],
            n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
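
The two branches above differ only in the `where` argument, so the keyword arguments can be built once and the filter added when it is given. This is a sketch of that refactoring; `build_query_kwargs` is a hypothetical helper, and the keys it produces are exactly the arguments already passed to `chroma_collection.query` above.

```python
def build_query_kwargs(query, n_results=5, filterType=None, filterValue=None):
    """Assemble keyword arguments for a collection query, with an optional metadata filter."""
    kwargs = {
        "query_texts": [query],
        "include": ["documents", "metadatas", "distances"],
        "n_results": n_results,
    }
    # Add the metadata filter only when both parts are provided
    if filterType is not None and filterValue is not None:
        kwargs["where"] = {filterType: filterValue}
    return kwargs


print(build_query_kwargs("route planning", 3, "category", "Journal Paper"))
```

Inside `retrieveDocs`, the call would then become `chroma_collection.query(**build_query_kwargs(query, n_results, filterType, filterValue))`.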

In [None]:
def show_results(results, return_only_docs=False):

  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc));
  else:

      retrieved_documents = results['documents'][0]
      if len(retrieved_documents) == 0:
          print("No results found.")
          return
      retrieved_documents_metadata = results['metadatas'][0]
      retrieved_documents_distances = results['distances'][0]
      print("------- retreived documents -------\n")

      for i, doc in enumerate(retrieved_documents):
          print(f"Document {i+1}:")
          print("\tDocument Text: ")
          display(to_markdown(doc));
          print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
          print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
          print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


## 3. Initializing

Now we can load the persistent ChromaDB from its saved location by initializing:
*  the ChromaDB client
*  the ChromaDB collections

In [None]:
# Initialize ChromaDB client with Google Drive connection
drive_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'

In [None]:
chroma_client2 = PersistentClient(path=drive_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)

In [None]:
chroma_client2.list_collections()

[Collection(name=Papers)]

In [None]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)



In [None]:
chroma_collection2 = chroma_client2.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

##4 Test

Test the loaded ChromaDB client and the collection.

In [None]:
chroma_collection2.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '15 UAV Route Planning For Maximum Target Coverage (8).pdf'}],
 'documents': ['Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively tasking these expensive vehicles is planning'],
 'uris': None,
 'data': None}

```python
chroma_collection.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'}],
 'documents': ['Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1,
 February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1
 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial
 Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively
  tasking these expensive vehicles is planning'],
 'uris': None,
 'data': None}
```

In [None]:
query = "What is Target Coverage?"

In [None]:
retrieved_documents=retrieveDocs(chroma_collection2, query, 10)
show_results(retrieved_documents)

------- retrieved documents -------

Document 1:
	Document Text: 


> distance of the farthest target from the selected base. We test threeFRs with respect to the CDas Case 1 : FR = CD, Case 2 : FR = CD / 2, and Case 3 : FR = CD * 2. The main performance metric, Target Coverage ( TC ), is the ratio of the number of the targets visited by all the UAVs to the existing targets as formulated below : 100 * allvisited TTTC = ( 9 ) To obtain the results, we run each simulation 10 times and get the averages of these results to find the mean values. 5. 1. Results

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.303858609999323
Document 2:
	Document Text: 


> , 4 UAVs are successfully routed by the MMAS to cover all the targets while the NN prepares a routing plan for the same number of UAVs missing 4 % of the targets. Table 5. The target coverage ratios for the heuristics when FR = CD * 2. UAV TCNN TCMMAS 1 11 % 12 % 3 20 % 29 % 5 30 % 35 % 7 34 % 38 % 9 36 % 40 % 11 38 % 41 % 13 40 % 41 % 14 41 % 41 % 1 11 % 12 % 3. CONCLUSIONS In this work, we define

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3110774849579452
Document 3:
	Document Text: 


> on the maximum and minimum values of the pheromone values that can be compiled on an edge. We apply MMAS to find a route planning to cover most of the targets as explained belo w. 4. APPLYINGMMASTOTARGETCOVERAGE PROBLEM Below, we first explain the MMAS basics and then provide the algorithm to generate a solution to cover maximum number of targets. 2. 1. Selecting Next Target In MMAS, each artificial ant tries to create a route planning for all the UAVs by visiting targets considering the given problem constraints. Beginning

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.323723876045269
Document 4:
	Document Text: 


> ##s or new targets are finished. If there are still uncovered new targets after trying all UAVs, the algorithm proceeds the second phase in which a 2 - opt technique is applied to the modified UAV routes for increasing the slack range s. Then, the first phase of the algorithm is re - run for the remaining uncovered targets. Algorithm will terminate either all the new targets are covered or 2 - opt technique does not produce any better slack ranges. This local optimization technique is implemented using Mason simulation library and tested with various experiments for different parameter settings and TSP data

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3404909250098955
Document 5:
	Document Text: 


> generate more target coverage compared to the NN heuristic.

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3665156970266095
Document 6:
	Document Text: 


> ] and the Vehicle Routing Problem ( VRP ) [ 6 ]. In these well - defined problems, it is mostly assume d that travelling salesmen or vehicles should visit all the targets and the target function is defined as to find a minimum - distant route. Even, in the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depot s, etc. ) are included ; it is still assumed that there exists enough number of travelling salesmen or vehicles to cover all the given locations. However, in reality the number and flight range of UAVs might be ins

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3945648200810192
Document 7:
	Document Text: 


> Thus, if a routing plan can lead to visit all the targets, its cos t will be zero. The initial solution is constructed using Nearest Neighbors heuristic. The minimum pheromone value is defined as max10 min * ) 1 ( iteration p - = ( 6 ) As a result of Eq. ( 6 ), any edge would have pheromone at least ten times evaporat ed value of the maximum pheromone value. Thus, we do not allow unvisited edges get very low pheromone values which otherwise would decrease their probability. 4. 3. Updating Ph

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4261031915633489
Document 8:
	Document Text: 


> flight range, and the number of total targets visited by the all UAVs is maximized. Thus the target function is to maximize the number of targets to be visited by the al l UAVs. The constraints are the flight range and the number of UAVs. 3. MAX - MINANTSYSTEM Stützle and Hoos proposed the Max - Min Ant Colony System ( MMAS ) as a successful alternative to Ant System ( AS ) [ 8 ]. In the referenced work, they show the relative success. The basic difference between the MMAS and AS is the setting up limits

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4261147814174622
Document 9:
	Document Text: 


> the flight routes to monitor the targets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of UAVs to cover maximum number of target s considering their flight range. The proposed solution for this practical optimization problem is designed by modifying the Max - Min Ant System ( MMAS ) algorithm. To evaluate the success of the proposed method, an alternative approach, based on the Neares t Neighbour ( NN ) heuristic, has been developed as well. The results showed the success of the proposed MMAS method by increasing the number of covered targets compared to the

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4322809030976014
Document 10:
	Document Text: 


> means that either all the targets are visited or the flight range is not enough to visit any targets any more. Then, ant returns t o the base. Thus, a route for a UAV is completed. The ant begins a new route for the next UAV with a refreshed flight range. When all the routes are prepared for all the UAVs an iteration of the ants has been finished. Each ant builds its own route plannin g simultaneously by exploiting the experiences of other ants by sensing the pheromone values in the formula. 4. 2. Assigning Initial Phero

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (8).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.432852971205863
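
Nine of the ten chunks above come from the same paper, and the distances differ only slightly. A quick way to see this at a glance is to count hits per source file, optionally ignoring hits beyond a distance cutoff. The sketch below works on a plain dictionary shaped like the query results used by `show_results`. `summarize_sources`, the `max_distance` parameter, and the toy file names are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def summarize_sources(results, max_distance=None):
    """Count retrieved chunks per source document, optionally dropping distant hits."""
    metadatas = results["metadatas"][0]   # metadata of the hits for the first query
    distances = results["distances"][0]   # matching distances, in the same order
    counts = Counter()
    for meta, dist in zip(metadatas, distances):
        if max_distance is not None and dist > max_distance:
            continue  # skip hits that are farther away than the cutoff
        counts[meta["document"]] += 1
    return counts

# Toy input shaped like the result of a single-query collection.query(...) call
toy_results = {
    "metadatas": [[{"document": "paper_a.pdf"},
                   {"document": "paper_b.pdf"},
                   {"document": "paper_a.pdf"}]],
    "distances": [[1.30, 1.34, 1.43]],
}
print(summarize_sources(toy_results))
print(summarize_sources(toy_results, max_distance=1.35))
```

Any cutoff value would need to be tuned for the embedding model and distance metric in use; the numbers here are only placeholders.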


## YES! WE DID IT!

# SUMMARY

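In this part, we reloaded a persistent ChromaDB collection from Google Drive, re-created the client and the embedding function, and confirmed with a similarity query that the stored documents are still retrieved correctly.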

In [None]:
def generateAnswer(query, n_results=5):
    # retrieveDocs takes the collection as its first argument and returns the raw query results
    results = retrieveDocs(chroma_collection2, query, n_results)
    retrieved_documents = results['documents'][0]

    print("------- retrieved documents -------\n")
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print(f"\tDocument Text: {doc}")
    print("------- RAG answer -------\n")
    output = chat.send_message("QUESTION: " + query + "\n EXCERPTS: " + "\n".join(retrieved_documents))
    display(to_markdown(output.text))
    print('\n')
    return output

In [None]:
queries =["Who are the authors suggested a new attention mechanism?",
          "Who are the authors suggested a new controllable text generation mechanism?",
          "Who is Murat Karakaya?",
          "Why do we need to control how the text is produced? ",
          "How can we use the self attention mechanism to control the text generation?",
          "Summarize the paper named Controllable Text Generation",
          "How many blocks are suggested in the transformer?",
          "What about decoder?"

    ]


In [None]:
reply=generateAnswer(queries[0],10)

In [None]:
to_markdown(reply.text)

In [None]:
for message in chat.history:
  display(to_markdown(f'**{message.role}**: {message.parts[0].text}'))


In [None]:
model.count_tokens(chat.history)

In [None]:
response = chat.send_message(prompt)
to_markdown(response.text)


In [None]:
import os
import openai
from openai import OpenAI
from google.colab import userdata  # Colab secrets store that holds the API key

# Alternative outside Colab: read the key from a local .env file
# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv())
# openai.api_key = os.environ['OPENAI_API_KEY']
# openai_client = OpenAI()

openai_client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

In [None]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo-1106"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "As an attentive and supportive academic assistant, "
            "your task is to provide assistance based solely on the provided"
            " excerpts. Answer the following questions, ensuring your responses"
            " are derived exclusively from the provided partial texts. "
            "If the answer cannot be found within the provided excerpts, "
            "kindly respond with 'I don't know'. "
            "After answering each question, please provide a detailed "
            "explanation, breaking down the answer step by step and relating "
            "it to the provided excerpts. "
            "Return your response as a JSON object with two key fields: "
            "'Answer', which should contain the value of the answer, and "
            "'Reason', which should provide an explanation of why this answer "
            "was generated."

        },
        {"role": "user", "content": f"Question: {query}. \n Excerpts: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
def generateAnswer(query, n_results=5):
    # retrieveDocs takes the collection as its first argument and returns the raw query results
    results = retrieveDocs(chroma_collection2, query, n_results)
    retrieved_documents = results['documents'][0]

    print("------- retrieved documents -------\n")
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print(f"\tDocument Text: {doc}")
    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')
    return output

In [None]:
reply=generateAnswer(queries[5],10)


In [None]:
# Convert the JSON-formatted 'reply' string to a dict

import json
reply_dict = json.loads(reply)
print(f"Answer: {reply_dict['Answer']}")
print(f"Because: {reply_dict['Reason']}")
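
The system prompt asks for a JSON object, but a chat model may still wrap its reply in a Markdown code fence or return plain text, in which case a direct conversion raises an exception. The helper below is a minimal, defensive sketch. `parse_json_reply` is a hypothetical name, and returning `None` on failure is just one possible design choice.

```python
import json
import re

def parse_json_reply(reply):
    """Parse a model reply that should be a JSON object, tolerating code fences.

    Returns a dict on success and None when the reply is not a JSON object.
    """
    text = reply.strip()
    # Drop a leading ```json / ``` fence and a trailing ``` fence, if present
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

With this in place, a caller can check for `None` and fall back to printing the raw reply instead of failing on malformed output.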

In [None]:
for query in queries:
  generateAnswer(query)

In [None]:
%pip install umap-learn

In [None]:
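import numpy as np
from tqdm import tqdm

def project_embeddings(embeddings, umap_transform):
    # Project each embedding to 2-D, one at a time, with the fitted UMAP transform
    umap_embeddings = np.empty((len(embeddings), 2))
    for i, embedding in enumerate(tqdm(embeddings)):
        umap_embeddings[i] = umap_transform.transform([embedding])
    return umap_embeddings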

In [None]:
import umap.umap_ as umap

embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10)
plt.gca().set_aspect('equal', 'datalim')
plt.title('Projected Embeddings')
plt.axis('off')

In [None]:
query = queries[3]

results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in results['documents'][0]:
    print(document)
    print('')


In [None]:
query_embedding = embedding_function([query])[0]
retrieved_embeddings = results['embeddings'][0]

projected_query_embedding = project_embeddings([query_embedding], umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)


In [None]:
# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_query_embedding[:, 0], projected_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{query}')
plt.axis('off')

In [None]:
def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are an academic and an expert in artificial intelligence who reviews TÜBİTAK project applications. "
            "Generate an answer to the question given below that could fit the following project description: \n"
            "The overall aim of the project is to develop an artificial intelligence (AI) based platform to improve risk management operations in the banking sector and to address the challenges faced by financial institutions. "
            "The project aims to give banks the capacity to manage behavioral risks more effectively, such as the early withdrawal of time deposits, the early repayment of loans, and the identification of "
            "various deposit types. These risks can affect the balance sheet of financial "
            "institutions and reduce operational efficiency. "
            "The core problem the project aims to solve is the banks' need to manage, correctly and "
            "effectively, the complex situations they face when performing profitability and risk analyses. In particular, situations such as the early closure of time deposits and the early repayment of loans can complicate "
            "the process of determining banks' future cash flows and risk profiles."

        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
original_query = queries[0]
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(joint_query)

In [None]:
def extend_retrieved_documents(results, extension=4):
  # Add the chunks that follow each hit, assuming ids are consecutive integers in chunk order
  original_ids = results['ids'][0]
  print("original_ids: ", original_ids)

  extended_ids = set()
  for doc_id in original_ids:
    # range(extension) covers the hit itself plus the next (extension - 1) chunks
    for i in range(extension):
      extended_ids.add(int(doc_id) + i)

  # Sort numerically and drop ids beyond the end of the collection
  total = chroma_collection.count()
  extended_ids = [str(x) for x in sorted(extended_ids) if x < total]
  print("extended_ids: ", extended_ids)
  return chroma_collection.get(extended_ids)['documents']
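
The id arithmetic in `extend_retrieved_documents` can be tried in isolation, without a database. The sketch below assumes, as the function above does, that chunk ids are consecutive integers stored as strings. `neighbor_ids` and its `total` parameter are hypothetical names used only for illustration.

```python
def neighbor_ids(hit_ids, extension, total):
    """Return each hit id plus the next `extension - 1` ids, sorted, keeping ids below `total`."""
    extended = set()
    for hit in hit_ids:
        start = int(hit)
        # The hit itself and the chunks that immediately follow it
        extended.update(range(start, start + extension))
    return [str(i) for i in sorted(extended) if i < total]

print(neighbor_ids(["7", "3", "8"], extension=3, total=10))
# prints ['3', '4', '5', '7', '8', '9']
```

Overlapping neighborhoods (here around ids 7 and 8) are merged by the set, and id 10 is dropped because it lies past the end of a 10-chunk collection.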

In [None]:
def retrieveDocs_augmented_query(query, n_results=5, extension=4):
    hypothetical_answer = augment_query_generated(query)
    print("------ hypothetical_answer ---------\n")
    print(hypothetical_answer,"\n")
    print("------------------------------------\n")
    joint_query = f"{query} {hypothetical_answer}"
    results = chroma_collection.query(query_texts=joint_query, n_results=n_results, include=['documents', 'embeddings'])
    retrieved_documents = extend_retrieved_documents(results, extension)
    #retrieved_documents = results['documents'][0]

    return retrieved_documents



In [None]:
retrieved_documents=retrieveDocs_augmented_query(query, 5)

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
results = chroma_collection.query(query_texts=joint_query, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
retrieved_embeddings = results['embeddings'][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original_query_embedding = project_embeddings(original_query_embedding, umap_transform)
projected_augmented_query_embedding = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_augmented_query_embedding[:, 0], projected_augmented_query_embedding[:, 1], s=150, marker='X', color='orange')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

In [None]:
def generateAnswer_augmented_query(query, n_results=5, extension=4):
    print("------- query -------\n")
    print(query,"\n")
    retrieved_documents=retrieveDocs_augmented_query(query, n_results, extension)
    print("------- retrieved documents -------\n")
    for document in retrieved_documents:
        print(document)
        print('\n')

    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')

In [None]:
queries

In [None]:
generateAnswer_augmented_query(queries[0],10,5)

In [None]:
title= """Methods to Be Used in the R&D Process: Specify the analytical /
        experimental solution methods to be applied to achieve the defined project objectives. (NOTE: This section should explain
        which technical / scientific approaches, and their associated stages, will be followed for the specific project presented; work package names or standard
        routine working methods that could apply to any project should not be repeated.)"""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]
print(retrieved_documents)

In [None]:
title= """Methods to Be Used in the R&D Process: Specify the analytical /
        experimental solution methods to be applied to achieve the defined project objectives. (NOTE: This section should explain
        which technical / scientific approaches, and their associated stages, will be followed for the specific project presented; work package names or standard
        routine working methods that could apply to any project should not be repeated.)"""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = extend_retrieved_documents(results)
print(retrieved_documents)


In [None]:
chroma_collection.get(results['ids'][0])