<a href="https://colab.research.google.com/github/kmkarakaya/Deep-Learning-Tutorials/blob/master/Simple_Rag_with_chromaDB_Gemini_PartD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PART D: A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB

In this notebook we will develop a Retrieval Augmented Generation (RAG) application.

The parts of this series are:

* PART A: AN INTRO TO GEMINI API FOR TEXT GENERATION & CHAT
* PART B: CODE WITH CHROMADB FOR VECTOR STORAGE & SIMILARITY SEARCH
* PART C: CODE WITH CHROMADB FOR PERSISTENT VECTOR DB
* PART D: A SIMPLE RAG BASED ON GEMINI & CHROMADB
* PART E: ADVANCED TECHNIQUES FOR RAG BASED ON GEMINI & CHROMADB

# WHAT IS RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique that combines large language models (LLMs) with external knowledge sources to improve the accuracy and reliability of AI-generated text.

## How Does RAG Work? Unveiling the Power of External Knowledge

Before we start the core RAG process, we need to provide a foundation as follows:

* **Building the Knowledge Base:** The system starts by transforming documents and information within the external knowledge base (like Wikipedia or a company database) into a special format called **vector representations**. These condense the meaning of each document into a series of **numbers**, capturing the essence of the content.

* **Vector Database for Speedy Retrieval**: These vector representations are then stored in a specialized database called a vector database. This database is optimized for efficiently **searching and retrieving** information based on **semantic similarity**. Imagine it as a super-powered library catalog that **understands the meaning** of documents, **not just keywords**.
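
To make "semantic similarity" a little more concrete, here is a minimal sketch that ranks a few titles by cosine similarity. The three-dimensional vectors are hand-made illustrative values, not output of a real embedding model, and `cosine_similarity` is a helper defined here rather than part of any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumed values, for illustration only).
toy_vectors = {
    "UAV route planning": [0.9, 0.1, 0.0],
    "Drone path optimization": [0.7, 0.3, 0.2],
    "Student academic success": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of the query "planning drone flights".
query_vector = [0.85, 0.15, 0.05]

# Rank the titles from most to least similar to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda title: cosine_similarity(query_vector, toy_vectors[title]),
    reverse=True,
)
print(ranked)
```

The two route-related titles rank above the unrelated one even though the query shares no keyword with "UAV route planning"; that is the behavior a vector database is built to deliver at scale.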

Now, let's explore how RAG leverages this foundation:

* **User Input**: The RAG process begins with a question or **prompt** from the user. This could be anything from "What caused the extinction of the dinosaurs?" to a more open-ended request like "Write a creative story."

* **Intelligent Retrieval**: RAG doesn't rely solely on the **LLM's internal knowledge**. It employs an information retrieval component that acts like a super-powered search engine. This component scans the vast external knowledge base – like a company's internal database for specific domains – to find information **directly relevant** to the user's input. Unlike a traditional **search engine** that relies on **keywords**, RAG leverages the power of vector representations to understand the **semantic meaning** of the user's prompt and identify the most relevant documents.

* **Enriched Context Creation**: The retrieved information isn't just shown alongside the prompt. RAG cleverly **merges the user input with the relevant snippets** from the knowledge base. This creates a ***richer context*** for the LLM to understand the **user's intent** and formulate a well-informed response.

* **LLM Powered Response Generation**: Finally, the **enriched context** is fed to the Large Language Model (LLM). The LLM, along with its ability to process language patterns, now has a strong **foundation of factual** information to draw upon. This empowers it to generate a response that is both comprehensive and accurate, addressing the specific needs of the user's prompt.
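
The four steps above can be sketched in a few lines of plain Python. This is only a toy under stated assumptions: `retrieve` ranks snippets by shared words as a stand-in for vector search, and `fake_llm` is a placeholder that does not call Gemini or any other model. All function names here are hypothetical.

```python
def retrieve(query, knowledge_base, n_results=2):
    """Toy retriever: rank snippets by how many lowercase words they share
    with the query. A real pipeline would compare embedding vectors."""
    query_words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda snippet: len(query_words & set(snippet.lower().split())),
        reverse=True,
    )
    return scored[:n_results]

def build_prompt(query, snippets):
    """Merge the user input with the retrieved snippets into one prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

def fake_llm(prompt):
    """Placeholder for the LLM call; it only reports what it received."""
    return f"[LLM would answer here, given {len(prompt)} characters of prompt]"

knowledge_base = [
    "In static route planning, routes do not change during the mission.",
    "In dynamic route planning, routes are updated when targets change.",
    "Prerequisite courses can affect success in follow-up courses.",
]

query = "How does dynamic route planning differ from static route planning?"
snippets = retrieve(query, knowledge_base)      # step 2: retrieval
prompt = build_prompt(query, snippets)          # step 3: enriched context
print(prompt)
print(fake_llm(prompt))                         # step 4: generation (stubbed)
```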

In this part, we will build a persistent ChromaDB vector database as the knowledge base and use it for speedy retrieval in a simple RAG pipeline.

https://www.trychroma.com/
https://github.com/chroma-core/chroma

# CONTENT: A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB

In this comprehensive tutorial series, we delve into the exciting world of developing a Retrieval Augmented Generation (RAG) application. If you are eager to create a chatbot leveraging cutting-edge technologies like GEMINI and Chromadb, you are in the right place! This video is tailored for anyone interested in building a RAG system, whether you're a seasoned developer or just starting out.

In the first three parts of this series, we explored:

* Coding GEMINI API for Text Generation & Chat: Understanding how to implement and use the GEMINI API for creating dynamic text-based interactions.
* Creating a Persistent Chromadb for Vector Storage & Similarity Search: Learning how to store and retrieve vectors efficiently using Chromadb.

In this fourth installment, titled "A SIMPLE RAG PIPELINE BASED ON GEMINI & CHROMADB," we aim to construct a functional RAG pipeline using these powerful tools. Here's what you can expect:

Key Steps Covered in this Video:
* Creating a Knowledge Base from Scratch with a Persistent Chromadb: Learn how to build a robust knowledge base from multiple documents.
* Upload Multiple Documents and Create Knowledge Base: Step-by-step guide on uploading and organizing your documents.
* Test the Knowledge Base: Methods to ensure your knowledge base is functioning correctly.
* Load a Knowledge Base from a Persistent Chromadb: How to efficiently load and utilize your knowledge base.
* Connect to an LLM: Google GEMINI via the Chat API: Integrate the Google GEMINI model for enhanced interaction.
* Create the RAG Pipeline for the Existing Knowledge Base: Develop a seamless pipeline to utilize your knowledge base with GEMINI.
* A Simple Loop for User Interaction: Implement a user-friendly loop for interactions.
* A Gradio Interface to the RAG: Create an intuitive interface using Gradio for a better user experience.

All these steps will be implemented and coded in Python on Google Colab, ensuring you can follow along and replicate the process easily.

Follow Us:
Murat Karakaya  on LinkedIn
Murat Karakaya  on Twitter

Join our community of developers and tech enthusiasts! Don't forget to like, share, and subscribe to stay updated with our latest tutorials and tech insights.

Watch the video here:
* In English:
* In Turkish:



# WHY DO WE NEED A PERSISTENT CHROMADB?

In the context of a Retrieval-Augmented Generation (RAG) approach, saving and loading a persistent ChromaDB is particularly important for several reasons:

1. **Enhanced Data Durability**:
   - **Importance**: Ensures the retrieval database used for augmenting generative models is not lost between sessions or system restarts.
   - **RAG Relevance**: Maintains a consistent and reliable knowledge base that the generative model can reference, leading to more accurate and relevant responses.

2. **Operational Continuity**:
   - **Importance**: Allows seamless continuation of operations without needing to re-index or re-import data, saving time and computational resources.
   - **RAG Relevance**: Ensures that the generative model has continuous access to the same set of documents, which is essential for generating consistent and coherent responses over time.

3. **Facilitating Collaboration**:
   - **Importance**: Enables multiple users or systems to share and access the same dataset.
   - **RAG Relevance**: Supports collaborative development and usage of the RAG system, allowing different teams to work on improving the retrieval and generation processes simultaneously.

4. **Scalability**:
   - **Importance**: Provides a stable and persistent backend, enabling efficient handling of large datasets.
   - **RAG Relevance**: Essential for scaling the RAG system to handle more extensive and diverse knowledge bases, ensuring that the system can manage increased loads and deliver prompt, relevant information.


In a RAG system, the retriever (like ChromaDB) provides the generative model with relevant context from a knowledge base to generate informed and accurate responses. Persistent storage ensures that this knowledge base is durable, continuously available, and scalable, which is critical for the reliability, consistency, and performance of the RAG system.
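
The durability point can be illustrated with Python's built-in `sqlite3` module. This is a generic sketch of on-disk versus in-memory storage, not ChromaDB's actual schema or API; the table name `chunks` and the helper functions are made up for the example.

```python
import os
import sqlite3
import tempfile

CREATE_SQL = "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, text TEXT)"

def add_row(db_path, chunk_id, text):
    """Open a connection, insert one row, commit, and close."""
    conn = sqlite3.connect(db_path)
    conn.execute(CREATE_SQL)
    conn.execute("INSERT INTO chunks VALUES (?, ?)", (chunk_id, text))
    conn.commit()
    conn.close()

def count_rows(db_path):
    """Open a fresh connection and count the stored rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(CREATE_SQL)
    n = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    conn.close()
    return n

with tempfile.TemporaryDirectory() as folder:
    on_disk = os.path.join(folder, "toy_store.sqlite3")
    add_row(on_disk, "0", "first chunk")
    persistent_count = count_rows(on_disk)   # new connection, data survives

add_row(":memory:", "0", "first chunk")
in_memory_count = count_rows(":memory:")     # new connection, empty database

print(persistent_count, in_memory_count)     # prints: 1 0
```

The on-disk store still holds its row after the first connection is closed, while the in-memory store starts empty every time, which is the situation a non-persistent client is in after a Colab session resets.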



# CREATING A KNOWLEDGE BASE FROM SCRATCH WITH A PERSISTENT CHROMADB

To make ChromaDB durable (persistent) rather than temporary on Google Colab, you can use external storage services like Google Drive or set up a cloud-based database. Google Colab provides temporary storage that resets after each session, so to maintain persistence across sessions, you'll need to save your data and configurations externally.

## 1 Install required libraries

Install the required libraries and define the helper functions.

In [1]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet
%pip install pypdf --quiet
%pip install langchain --quiet
%pip install tqdm --quiet

import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter


from pypdf import PdfReader

from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions

import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

## 2 Initialize a Persistent ChromaDB client with a proper Google Drive connection

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd drive/MyDrive/'Colab Notebooks'

/content/drive/MyDrive/Colab Notebooks


In [4]:
# Initialize ChromaDB client with Google Drive connection
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'


In [5]:
# Check if the chromadb_path exists or not. If so, delete all the files and folders in chromadb_path. But before deleting get the permission from the user.

import os
import shutil

def delete_all_files_and_folders(chromaDB_path):
  if os.path.exists(chromaDB_path):
    print(f"The directory '{chromaDB_path}' already exists.")
    permission = input("Do you want to delete all the files and folders in this directory? (y/n): ")
    if permission == "y":
      shutil.rmtree(chromaDB_path)
      print(f"All files and folders in '{chromaDB_path}' have been deleted.")
    else:
      print("No action taken.")
  else:
    print(f"The directory '{chromaDB_path}' does not exist.")



In [6]:
delete_all_files_and_folders(chromaDB_path)

The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' already exists.
Do you want to delete all the files and folders in this directory? (y/n): y
All files and folders in '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' have been deleted.


## 3 Define PersistentClient

Let's re-define the **create_chroma_client** function from the previous part so that this time we initialize a **persistent** ChromaDB client:

In [None]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient


In [None]:
def create_chroma_client(collection_name, embedding_function, chromaDB_path ):
  if chromaDB_path is not None:
    chroma_client = PersistentClient(path=chromaDB_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)
  else:
    chroma_client = Client()

  chroma_collection = chroma_client.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

  return chroma_client, chroma_collection

## 4 Create a collection as usual

In [9]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)



In [22]:
chroma_client, chroma_collection = create_chroma_client(collection_name,
                                                        embedding_function,
                                                        chromaDB_path)

### Check the created collection:

In [23]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 0
All collections in ChromaDB client:
Papers


## 5 Define helper functions

In [24]:
from google.colab import files
def upload_multiple_files():
  uploaded = files.upload()
  file_names = list()
  for fn in uploaded.keys():
    #print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    file_names.append(fn)
  return file_names

In [25]:
def convert_PDF_Text(pdf_path):
  reader = PdfReader(pdf_path)
  pdf_texts = [p.extract_text().strip() for p in reader.pages]
  # Filter the empty strings
  pdf_texts = [text for text in pdf_texts if text]
  print("Document: ",pdf_path," chunk size: ", len(pdf_texts))
  return pdf_texts

In [26]:
def convert_Page_ChunkinChar(pdf_texts, chunk_size=1500, chunk_overlap=0):
  # Split the joined page texts into chunks of at most chunk_size characters.
  # The parameters are now passed through instead of being hard-coded.
  character_splitter = RecursiveCharacterTextSplitter(
      separators=["\n\n", "\n", ". ", " ", ""],
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )
  character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
  print(f"\nTotal number of chunks (document splited by max char = {chunk_size}): \
        {len(character_split_texts)}")
  return character_split_texts
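
To show the idea behind character-based chunking without any dependency, here is a much simpler greedy splitter. It is not how `RecursiveCharacterTextSplitter` is implemented; it only uses the paragraph separator and falls back to hard cuts, and the name `split_by_paragraph` is hypothetical.

```python
def split_by_paragraph(text, chunk_size=1500):
    """Greedy splitter: pack whole paragraphs into chunks of at most
    chunk_size characters; an over-long paragraph is cut at fixed offsets."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-cut paragraphs that could never fit into a single chunk.
        pieces = [paragraph[i:i + chunk_size]
                  for i in range(0, len(paragraph), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate      # still fits, keep packing
            else:
                chunks.append(current)   # close the current chunk
                current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 130
chunks = split_by_paragraph(sample, chunk_size=100)
print([len(c) for c in chunks])   # prints: [82, 100, 30]
```

The first two short paragraphs are merged into one chunk, while the long third paragraph is cut so that no chunk exceeds the limit.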

In [27]:
def convert_Chunk_Token(text_chunksinChar, sentence_transformer_model, chunk_overlap=0, tokens_per_chunk=128):
  # Re-split each character chunk by token count for the embedding model.
  # The parameters are now passed through instead of being hard-coded.
  token_splitter = SentenceTransformersTokenTextSplitter(
      chunk_overlap=chunk_overlap,
      model_name=sentence_transformer_model,
      tokens_per_chunk=tokens_per_chunk)

  text_chunksinTokens = []
  for text in text_chunksinChar:
      text_chunksinTokens += token_splitter.split_text(text)
  print(f"\nTotal number of chunks (document splited by {tokens_per_chunk} tokens per chunk):\
       {len(text_chunksinTokens)}")
  return text_chunksinTokens
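
The effect of `tokens_per_chunk` and `chunk_overlap` can be seen in a toy version that treats whitespace-separated words as tokens. This is an assumption made for illustration: the real splitter counts tokens with the sentence-transformer model's tokenizer, so its chunk boundaries will differ. `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(text, tokens_per_chunk=128, chunk_overlap=0):
    """Sliding-window splitter over whitespace tokens."""
    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once the window has reached the end of the text.
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(10))   # "w0 w1 ... w9"
no_overlap = split_by_tokens(text, tokens_per_chunk=4)
with_overlap = split_by_tokens(text, tokens_per_chunk=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With an overlap of one token, each chunk repeats the last token of the previous chunk, which can help when a sentence straddles a chunk boundary.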

In [16]:
def add_meta_data(text_chunksinTokens, title, category, initial_id):
  ids = [str(i+initial_id) for i in range(len(text_chunksinTokens))]
  metadata = {
      'document': title,
      'category': category
  }
  metadatas = [ metadata for i in range(len(text_chunksinTokens))]
  return ids, metadatas
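
A short usage sketch may help show why `initial_id` matters: every chunk in a collection needs its own ID, so the second document must continue where the first one stopped. The function is repeated here so that the snippet runs on its own; the file names and chunk texts are made up.

```python
def add_meta_data(text_chunksinTokens, title, category, initial_id):
    # Same logic as the helper above, copied so this snippet is self-contained.
    ids = [str(i + initial_id) for i in range(len(text_chunksinTokens))]
    metadata = {'document': title, 'category': category}
    metadatas = [metadata for _ in range(len(text_chunksinTokens))]
    return ids, metadatas

first_chunks = ["chunk a", "chunk b", "chunk c"]   # hypothetical chunks
second_chunks = ["chunk d", "chunk e"]

ids_1, metas_1 = add_meta_data(first_chunks, "paper_1.pdf", "Journal Paper", 0)
next_id = len(first_chunks)                        # continue the numbering
ids_2, metas_2 = add_meta_data(second_chunks, "paper_2.pdf", "Journal Paper", next_id)

print(ids_1, ids_2)   # prints: ['0', '1', '2'] ['3', '4']
```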

In [17]:
def add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection):
  print("Before inserting, the size of the collection: ", chroma_collection.count())
  chroma_collection.add(ids=ids, metadatas= metadatas, documents=text_chunksinTokens)
  print("After inserting, the size of the collection: ", chroma_collection.count())
  return chroma_collection

In [28]:
def retrieveDocs(chroma_collection, query, n_results=5, return_only_docs=False):
    results = chroma_collection.query(query_texts=[query],
                                      include= [ "documents","metadatas",'distances' ],
                                      n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
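
The `[0]` index used above follows from the shape of the result: one inner list per query text. The sketch below mimics that shape with a hypothetical `FakeCollection` that returns canned data, so it can run without ChromaDB; the documents, metadata, and distances are invented.

```python
class FakeCollection:
    """Hypothetical stand-in for a ChromaDB collection; returns canned data."""
    _docs = ["static routes do not change", "dynamic routes are updated"]
    _metas = [{"document": "toy_a.pdf", "category": "Journal Paper"},
              {"document": "toy_b.pdf", "category": "Journal Paper"}]
    _dists = [0.31, 0.47]

    def query(self, query_texts, include, n_results):
        # One inner list per query text, which is why callers index with [0].
        return {
            "documents": [self._docs[:n_results] for _ in query_texts],
            "metadatas": [self._metas[:n_results] for _ in query_texts],
            "distances": [self._dists[:n_results] for _ in query_texts],
        }

def retrieveDocs(chroma_collection, query, n_results=5, return_only_docs=False):
    # Same logic as the helper above, copied so this snippet is self-contained.
    results = chroma_collection.query(query_texts=[query],
                                      include=["documents", "metadatas", "distances"],
                                      n_results=n_results)
    if return_only_docs:
        return results['documents'][0]
    return results

collection = FakeCollection()
full = retrieveDocs(collection, "static vs dynamic routes", n_results=2)
only_docs = retrieveDocs(collection, "static vs dynamic routes", n_results=2,
                         return_only_docs=True)
print(sorted(full.keys()))
print(only_docs)
```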

In [29]:
def show_results(results, return_only_docs=False):

  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc));
  else:

      retrieved_documents = results['documents'][0]
      if len(retrieved_documents) == 0:
          print("No results found.")
          return
      retrieved_documents_metadata = results['metadatas'][0]
      retrieved_documents_distances = results['distances'][0]
      print("------- retreived documents -------\n")

      for i, doc in enumerate(retrieved_documents):
          print(f"Document {i+1}:")
          print("\tDocument Text: ")
          display(to_markdown(doc));
          print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
          print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
          print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


In [30]:
def load_multiple_pdfs_to_ChromaDB(collection_name, sentence_transformer_model,
                                   chromaDB_path):

  category = "Journal Paper"
  embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=sentence_transformer_model)
  chroma_client, chroma_collection = create_chroma_client(collection_name, embedding_function, chromaDB_path)
  # Continue the ID numbering from the chunks already stored in the collection
  current_id = chroma_collection.count()
  file_names = upload_multiple_files()
  for file_name in file_names:
    print(f"Document: {file_name} is being processed to be added to the {chroma_collection.name} {chroma_collection.count()}")
    print(f"current_id: {current_id} ")
    # PDF pages -> character chunks -> token chunks
    pdf_texts = convert_PDF_Text(file_name)
    text_chunksinChar = convert_Page_ChunkinChar(pdf_texts)
    text_chunksinTokens = convert_Chunk_Token(text_chunksinChar, sentence_transformer_model)
    ids, metadatas = add_meta_data(text_chunksinTokens, file_name, category, current_id)
    current_id = current_id + len(text_chunksinTokens)
    chroma_collection = add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection)
    print(f"Document: {file_name} added to the collection: {chroma_collection.count()}")
  return chroma_client, chroma_collection

## 6 Upload Multiple Documents and Create Knowledge Base

Run load_multiple_pdfs_to_ChromaDB() to fill the collection.

In [31]:
chroma_client, chroma_collection= load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model, chromaDB_path)

Saving 15 UAV Route Planning For Maximum Target Coverage ABSTRACT.pdf to 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
Saving 29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT.pdf to 29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT (3).pdf
Saving 30 Time-Sensitive Ant Colony Optimization ORIGINAL.pdf to 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
Saving 73 Analyzing Students Academic Success.pdf to 73 Analyzing Students Academic Success (2).pdf
Saving 77 Arac Park Yerlerinin Doluluk.pdf to 77 Arac Park Yerlerinin Doluluk (2).pdf
Document: 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf is being processed to be added to the Papers 0
current_id: 0 
Document:  15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf  chunk size:  1

Total number of chunks (document splited by max char = 1500):         2

Total number of chu

## 7 Test the Knowledge Base

Query the Knowledge Base using the persistent ChromaDB client and collection.

In [32]:
query = "What are the main difference in active and passive path scheduling?"

'''
In 16 A Local Optimization Technique for Assigning New Targets ABSTRACT:

Route planning can be static or dynamic. In static route planning, routes are
constructed according to given UAVs and targets and do not change during
the mission. However, in dynamic route planning, number of routes or UAVs
can alter which requires the update of existing routes to adopt these changes.

'''

'\nIn 16 A Local Optimization Technique for Assigning New Targets ABSTRACT:\n\nRoute planning can be static or dynamic. In static route planning, routes are\nconstructed according to given UAVs and targets and do not change during\nthe mission. However, in dynamic route planning, number of routes or UAVs\ncan alter which requires the update of existing routes to adopt these changes.\n\n'

In [34]:
retrieved_documents=retrieveDocs(chroma_collection, query, 10)
show_results(retrieved_documents)

------- retreived documents -------

Document 1:
	Document Text: 


> impact of cirriculumdeveloment method. As seen above references, there have been a considerable interest on the pre - requisite coursesin the literature. However, the scope, goals, and results of the above - mentioned studies are not completely in line with those of the present work. One of the studies with similar scope and goals wasconducted by Anderson et al. [ 25 ]. They found thatsuccessfully completing calculus and economicscourses positively [UNK] the achieved success insubsequent economics courses. In another work, McMillan and Adeyemi focused on the success relationships between the

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.306624412536621
Document 2:
	Document Text: 


> ##lected data for a given period of time, namely tour time. Provided below are the details of the MES / TSACO implementation 3. 2 The Time - Sensitive ACO A combinatorial optimization problem can be static or dynamic with respect to the given characteristics of the problem. In static problems, the underly - ing system properties stay the same throughout the problem - solution pro - cess. A typical example of these kinds of problems is the Traveling Sales - man Problem ( TSP ). In the TSP, there are a number of towns ( vertices ) con - nected to each

	Document Source: 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3876214027404785
Document 3:
	Document Text: 


> ? A Mixed - Methods Approach to Identifying the Impact of a Prerequisite Course, CBE - Life Sciences Education, 16, 2017, p. ar16. 23. C. E. [UNK] and D. A. Gilman, Are Prerequisite Courses Necessary for Success in Advanced Courses?, ERIC, ERIC Number : ED475157, 2002. 24. J. M. Krieg and S. E. Henson, The Educational Impact ofOnline Learning : How Do University Students Perform in Subsequent Courses? Education Finance and Policy, 2016

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3958535194396973
Document 4:
	Document Text: 


> 70 M URAT KARAKAYA FIGURE 1 A sample schedule showing the dynamic nature of the MES. ( distance ). In essence, the solution to a TSP is to construct a minimum dis - tance circuit passing through each vertex once and only once. Therefore, thecost of a solution depends on the distances among the towns. These prob - lem characteristics - town locations and distances - do not change during the development of a solution. On the other hand, in dynamic problems, problemcharacteristics can be changed over time as a solution is being generated. The MES

	Document Source: 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4082379341125488
Document 5:
	Document Text: 


> and tour times. As a future work, the author intends to extend this work by scheduling a minimum number of multiple MSs, such that there will be no incidents [UNK] at all. Furthermore, it is planed that the scheduling algorithm beadapted so as to control the MS speed in order to minimize the [UNK] with optimum MS energy consumption. REFERENCES [ 1 ] Khaled Almi [UNK] ani, Anastasios Viglas, and Lavy Libman. ( 2010 ). Mobile element path plan - ning for time - constrained data gathering

	Document Source: 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.415479063987732
Document 6:
	Document Text: 


> - bin using a Traveling Sales - man Problem heuristic, all these schedules are concatenated into an overall schedule. There are some other works to solve different versions of the MES prob - lem, e. g. [ 1, 13 ] and [ 14 ]. In [ 1 ], Almi [UNK] ani et al. worked on a version of the MES problem in which the sensory data needs to be delivered to a static sink

	Document Source: 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.434402585029602
Document 7:
	Document Text: 


> second one is whether the selected curriculum devel - opment approach for deciding the chains has [UNK] impact on the academic success relation - ships between a pre - requisite and its follow - upcourse. The second question is shortly called [UNK] [UNK] Impact of Curriculum Development Model [UNK] [UNK]. Toinvestigate the answers, two chains among threemathematics courses ( Calculus I, Calculus II, and [UNK] Equations ) are selected. Then, using statistical test, the validity of the hypothesis isdiscussed. Before, providing the conclusions, a critical lit - erature survey,

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.46108877658844
Document 8:
	Document Text: 


> graduate pre - requisite management courses and the graduate Organiza - tional Behavior courses, and found a positive rela - tion between the grades received in these courses as

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4612693786621094
Document 9:
	Document Text: 


> In the proposed solution, each ant constructs routes for the given number of UAVs using pheromone and heuristic information. These routes are locally optimized using a modified 2 - opt technique. After each iteration, t he solution which covers more targets with less route distance is selected as the iteration - best solution and the pheromone values of the edges on that route are increased. According to the termination condition, the algorithm stops and outputs the best route found so far as the result. To evaluate the success of the proposed method, another approach, based on the Nearest Neigh

	Document Source: 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4616098403930664
Document 10:
	Document Text: 


> Second, the considered chainedcourses are limited to three mathematical courses ( Calculus I, Calculus II, and [UNK] Equa - tions ). Third, among the several curriculum devel - opment methods, only two ( Spiral and Linear ) areselected for this research. Lastly, since there are [UNK] records stating which courses were designed according to which curriculum development meth - odology, the [UNK] of the chains are doneaccording to the [UNK] of the cirriculum devel - opment methods. Murat Karakaya et al. 368

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4657018184661865


## 8 Observe the ChromaDB saved to the provided path

List the folders and files in the chromaDB_path

In [35]:
!ls "{chromaDB_path}"

5c995b1b-29a5-4783-bab1-3c8955b9651f  chroma.sqlite3


## YES WE DID IT!

# LOAD A KNOWLEDGE BASE FROM A PERSISTENT CHROMADB



Let's kill the kernel to make sure that nothing from the ChromaDB instance above remains in memory.

In [None]:
from google.colab import runtime
# Disconnect from the runtime
#!kill -9 -1

## 1 Connect to source directory

First, connect to the ChromaDB directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [36]:
# change directory to chromaDB folder
chromaDB_path = '/content/drive/MyDrive/Colab Notebooks/ChromaDBData'
%cd {chromaDB_path}


/content/drive/MyDrive/Colab Notebooks/ChromaDBData


### Check whether chromaDB_path exists and, if so, whether it contains ChromaDB files and folders

In [37]:
import os
if os.path.exists(chromaDB_path):
    print(f"The directory '{chromaDB_path}' exists.")

    # Check if the directory contains ChromaDB files and folders
    chromadb_files_and_folders = os.listdir(chromaDB_path)
    if any(file_or_folder.startswith('chroma') for file_or_folder in chromadb_files_and_folders):
        print("The directory contains ChromaDB files and folders.")
    else:
        print("The directory does not contain ChromaDB files and folders.")
else:
    print(f"The directory '{chromaDB_path}' does not exist.")


The directory '/content/drive/MyDrive/Colab Notebooks/ChromaDBData' exists.
The directory contains ChromaDB files and folders.
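
The check above accepts any entry whose name starts with `chroma`. A slightly stricter variant, sketched below with `pathlib`, looks for the `chroma.sqlite3` file that appeared in the directory listing earlier. The helper name `looks_like_chroma_dir` is hypothetical, and the snippet uses a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def looks_like_chroma_dir(path):
    """Return True if `path` is a directory that holds a chroma.sqlite3 file."""
    folder = Path(path)
    return folder.is_dir() and (folder / "chroma.sqlite3").is_file()

with tempfile.TemporaryDirectory() as tmp:
    empty_result = looks_like_chroma_dir(tmp)          # folder exists, no DB file
    (Path(tmp) / "chroma.sqlite3").touch()             # simulate a saved database
    filled_result = looks_like_chroma_dir(tmp)

missing_result = looks_like_chroma_dir("/path/that/does/not/exist")
print(empty_result, filled_result, missing_result)     # prints: False True False
```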


## 2 Install required libraries

Next, install the required libraries and define the helper functions.

In [None]:
%pip install chromadb --quiet
%pip install sentence_transformers --quiet

In [None]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings
from chromadb import Client, PersistentClient
from chromadb.utils import embedding_functions


In [None]:
import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
def retrieveDocs(chroma_collection, query, n_results=5,
                 return_only_docs=False, filterType=None, filterValue=None):
    if filterType is not None and filterValue is not None:
        results = chroma_collection.query(
            query_texts=[query],
            include=["documents", "metadatas", "distances"],
            where={filterType: filterValue},
            n_results=n_results)

    else:
        results = chroma_collection.query(
            query_texts=[query],
            include= [ "documents","metadatas",'distances' ],
            n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
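
The `where={filterType: filterValue}` argument restricts the search to chunks whose metadata field equals the given value. The toy below reproduces that equality filter over an in-memory list so the effect is visible without a database; it is a sketch of the idea, not ChromaDB's implementation, and the "Lecture Notes" category and file names are invented.

```python
def matches(metadata, where):
    """True if every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())

def filter_chunks(chunks, filterType=None, filterValue=None):
    """Keep only chunks whose metadata passes the filter; no filter keeps all."""
    if filterType is not None and filterValue is not None:
        where = {filterType: filterValue}
        return [c for c in chunks if matches(c["metadata"], where)]
    return list(chunks)

chunks = [
    {"text": "ant colony optimization",
     "metadata": {"document": "paper_a.pdf", "category": "Journal Paper"}},
    {"text": "lecture notes on graphs",
     "metadata": {"document": "notes_b.pdf", "category": "Lecture Notes"}},
    {"text": "mobile sink scheduling",
     "metadata": {"document": "paper_c.pdf", "category": "Journal Paper"}},
]

journal_only = filter_chunks(chunks, "category", "Journal Paper")
unfiltered = filter_chunks(chunks)
print(len(journal_only), len(unfiltered))   # prints: 2 3
```

With the notebook's metadata keys, the same idea lets a query be limited to one source file by filtering on `document` instead of `category`.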

In [None]:
def show_results(results, return_only_docs=False):
  # With return_only_docs=True, `results` is already a plain list of document texts
  if return_only_docs:
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc))
  else:
    # Otherwise `results` is the full query result; [0] selects the first (only) query
    retrieved_documents = results['documents'][0]
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    retrieved_documents_metadata = results['metadatas'][0]
    retrieved_documents_distances = results['distances'][0]
    print("------- retrieved documents -------\n")

    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc))
      print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
      print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
      print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


## 3 Initializing

Now we can load the persistent ChromaDB from that location by initializing
*  the ChromaDB client
*  the ChromaDB collection

In [38]:
chroma_client = PersistentClient(path=chromaDB_path,
                                     settings=Settings(),
                                     tenant=DEFAULT_TENANT,
                                     database=DEFAULT_DATABASE,)

In [39]:
chroma_client.list_collections()

[<chromadb.api.models.Collection.Collection at 0x7f4e345e5de0>]

In [None]:
collection_name = "Papers"
sentence_transformer_model="distiluse-base-multilingual-cased-v1"
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=sentence_transformer_model)


In [40]:
chroma_collection = chroma_client.get_or_create_collection(
      collection_name,
      embedding_function=embedding_function)

In [41]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")  # Access the name attribute directly
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 238
All collections in ChromaDB client:
Papers


## 4 Test

Test the loaded ChromaDB client and the collection

In [42]:
chroma_collection.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf'}],
 'documents': ['UAV Route Planning For Maximum Target Coverage Murat Karakaya Department of Computer Engineering, Atilim University Incek / Ankara TURKEY kmkarakaya @ atilim. edu. tr Abstract The importance and the impact of using Unmanned Aerial Vehicles ( UAVs ) in military and civil operations are increasing. One of the challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the targets. This problem is related with the Multiple Travelling Salesman Problem ( mTSP ) and the Vehicle Routing Problem ( VRP ). In these well - defined problem'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

```python
chroma_collection.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'}],
 'documents': ['Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1,
 February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1
 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial
 Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively
  tasking these expensive vehicles is planning'],
 'uris': None,
 'data': None}
```

In [43]:
query = "What is Target Coverage?"

In [44]:
retrieved_documents=retrieveDocs(chroma_collection, query, 10)
show_results(retrieved_documents)

------- retrieved documents -------

Document 1:
	Document Text: 


> s, it is mostly assumed that travelling salesm en or vehicles should visit all the targets and the target function is defined as to find a minimum - distan t route. Even, i n the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depots, etc. ) are included ; it is still assumed that there exists enough number of travelling salesm en or vehicle s to cover all the give n locations. However, in reality the number and flight range of UAVs might be insufficient to cover all the targets. As a result, t he maxim

	Document Source: 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.2466243505477905
Document 2:
	Document Text: 


> impact of cirriculumdeveloment method. As seen above references, there have been a considerable interest on the pre - requisite coursesin the literature. However, the scope, goals, and results of the above - mentioned studies are not completely in line with those of the present work. One of the studies with similar scope and goals wasconducted by Anderson et al. [ 25 ]. They found thatsuccessfully completing calculus and economicscourses positively [UNK] the achieved success insubsequent economics courses. In another work, McMillan and Adeyemi focused on the success relationships between the

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3942387104034424
Document 3:
	Document Text: 


> ? A Mixed - Methods Approach to Identifying the Impact of a Prerequisite Course, CBE - Life Sciences Education, 16, 2017, p. ar16. 23. C. E. [UNK] and D. A. Gilman, Are Prerequisite Courses Necessary for Success in Advanced Courses?, ERIC, ERIC Number : ED475157, 2002. 24. J. M. Krieg and S. E. Henson, The Educational Impact ofOnline Learning : How Do University Students Perform in Subsequent Courses? Education Finance and Policy, 2016

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4003818035125732
Document 4:
	Document Text: 


> UAV Route Planning For Maximum Target Coverage Murat Karakaya Department of Computer Engineering, Atilim University Incek / Ankara TURKEY kmkarakaya @ atilim. edu. tr Abstract The importance and the impact of using Unmanned Aerial Vehicles ( UAVs ) in military and civil operations are increasing. One of the challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the targets. This problem is related with the Multiple Travelling Salesman Problem ( mTSP ) and the Vehicle Routing Problem ( VRP ). In these well - defined problem

	Document Source: 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4190356731414795
Document 5:
	Document Text: 


> ACO M OBILE SINKSCHEDULING 75 The objective function fis based on the cost cas below : f = 1 c + 1 ( 6 ) Thus, we would like to maximize the objective function fby minimizing the costc. Using the cost of the generated tour, the amount of the additional pheromone is calculated and added to the edges which the ants have vis - ited during the tour. The amount of pheromone is set according to the belowformula : τ i, j, t←1 c + 1 + τi, j, t, ∀ (

	Document Source: 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.426795482635498
Document 6:
	Document Text: 


> durumlarının belirlenmesi amacıyla Derin Öğrenme ( Deep Learning ) yöntemi kullanılmıştır. Bilgisayarlar insanların gerçekleştiremediği karmaşık hesaplamaları yüksek hızda ve doğrulukta gerçekleştirebilmektedir. Buna karşın insanlar için çok kolay olan bazı işlemler ( işitilen sözler ile cümle oluşturup anlama, görülen b ir kişinin tanınması, bir fotoğraftaki cisimlerin belirlenmesi gibi ) bilgisayarlar için zorlayıcı olmaktadır. Derin Öğrenme

	Document Source: 77 Arac Park Yerlerinin Doluluk (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4507018327713013
Document 7:
	Document Text: 


> courses. Furthermore, if there are such successrelationships between pre - requisite and non - pre - requisite courses will be examined. For this purpose, some non - pre - requisite courses can be chosen as acontrol group and then, the academic performancesof the students enrolled in [UNK] combinations ofpre - requisite and non pre - requisite courses can be studied. References 1. J. Cobb, 10 Ways to Be a Better Learner, CreateSpace Independent Publishing Platform, 2012. 2. [ E. L. Thorndike, The Fundamentals of Learning,

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5071862936019897
Document 8:
	Document Text: 


> Second, the considered chainedcourses are limited to three mathematical courses ( Calculus I, Calculus II, and [UNK] Equa - tions ). Third, among the several curriculum devel - opment methods, only two ( Spiral and Linear ) areselected for this research. Lastly, since there are [UNK] records stating which courses were designed according to which curriculum development meth - odology, the [UNK] of the chains are doneaccording to the [UNK] of the cirriculum devel - opment methods. Murat Karakaya et al. 368

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5079680681228638
Document 9:
	Document Text: 


> requisite chains periodi - cally with empirical data so that pre - requisites and their follow - up courses should be well designed to support student success throughoutthe curriculum. 14. Conclusions The goal of this study is two - fold. The [UNK] goal is to examine the impact of the pre - requisite courses onthe success in a follow - up course for two mathe - matics course chains. The second one is to investi - gate the impact of the selected cirriculumdevelopment method on the success relationshipsbetween a pre - requisite course and follow

	Document Source: 73 Analyzing Students Academic Success (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5144762992858887
Document 10:
	Document Text: 


> lliklerini ezberleyeceği ( overfitting ) için başarı oranı gerçeği yansıtmayacak kadar yüksek çıkabilecektir. B irbirinden farklı görüntüler den oluşan ü ç veri setinin kullanımı ile D ESA [UNK] nın sınıflandırma başarı m ı hakkında daha gerçekçi sonuçlar elde edilmek i stenmiştir. Şekil 2 : Veri s etlerinde b ulunan p ark y eri g örüntülerinden ö rnekler B. Derin Evrişimsel Sinir Ağları Bu çalışma

	Document Source: 77 Arac Park Yerlerinin Doluluk (2).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.514948129310384
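
Several of the ten hits above come from unrelated papers. One possible refinement, sketched here under assumptions, is to drop hits whose distance exceeds a cutoff before they are passed to the LLM. The function name `filter_by_distance` and the cutoff value `1.41` are hypothetical; a sensible cutoff depends on the embedding model and on the distance metric of the collection.

```python
def filter_by_distance(documents, distances, max_distance):
    """Keep only the documents whose distance is at most max_distance."""
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


# Distances rounded from the first five hits in the output above;
# the document texts are replaced by short placeholders.
docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]
dists = [1.2466, 1.3942, 1.4004, 1.4190, 1.4268]

kept = filter_by_distance(docs, dists, max_distance=1.41)
print(kept)
```

A distance cutoff is a blunt tool: in the output above, Document 4 is the abstract of the UAV paper, yet it ranks below two chunks from the student success paper, so this particular cutoff would discard it.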


## YES! WE DID IT!

# CONNECT TO AN LLM: GOOGLE GEMINI

## 1 Install & Import Libraries

In [45]:
!pip install -q -U google-generativeai

In [46]:
import os
import textwrap
import google.generativeai as genai
from IPython.display import display
from IPython.display import Markdown

## 2 Define Helper Functions

In [47]:
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [49]:
def build_chatBot(system_instruction):
  model = genai.GenerativeModel('gemini-1.5-flash-latest', system_instruction=system_instruction)
  chat = model.start_chat(history=[])
  return chat

In [50]:
def generate_LLM_answer(prompt, context, chat):
  response = chat.send_message( prompt + context)
  return response.text
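
Because `generate_LLM_answer` concatenates `prompt + context` directly, the two strings run together unless one of them carries its own separator; the chat history printed in the test section of this notebook shows `What is FC?FC lets developers ...`. A minimal sketch of a variant that inserts an explicit blank line is given below. `compose_message` is a hypothetical helper, not part of the notebook or of the Gemini SDK.

```python
def compose_message(prompt, context, separator="\n\n"):
    """Join the question and its context with an explicit separator."""
    return prompt.strip() + separator + context.strip()


message = compose_message("What is FC?",
                          "FC lets developers create a description of a F.")
print(message)
```

The string returned by such a helper could be passed to `chat.send_message` in place of `prompt + context`.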

## 3 Connect to the LLM via the Chat API

In [51]:
# Used to securely store your API key
from google.colab import userdata
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [52]:
system_prompt= """ You are an attentive and supportive academic assistant.
Your role is to provide assistance based solely on the provided context.

Here’s how we’ll proceed:
1. I will provide you with a question and related text excerpts.
2. Your task is to answer the question using only the provided excerpts.
3. If the answer isn’t explicitly found within the given context,
respond with 'I don't know'.
4. After each response, please provide a detailed explanation.
Break down your answer step by step and relate it directly to the provided context.
5. Sometimes, I will ask questions about the chat session itself, such as
"summarize the chat" or "list the questions". For these questions, do not
use the provided excerpts.
6. Generate the answer in the same language as the given question.

If you're ready, I'll provide you with the question and the context.
"""

In [53]:
RAG_LLM = build_chatBot(system_prompt)

## 4 Test the LLM connection

In [54]:
prompt="What is FC?"
context= """FC lets developers create a description
of a F in their code, then pass that description to a language
model in a request.

The response from the model includes the name of
a F that matches the description and the arguments to call it with.
FC lets you use F as tools in generative AI applications,
and you can define more than one F within a single request.
"""

In [55]:
response=generate_LLM_answer(prompt, context,RAG_LLM)
to_markdown(response)

> I don't know.
> 
> The provided text explains that FC is a system that allows developers to describe functions (Fs) in their code and then use a language model to find a matching function and its arguments. However, it does not explicitly define what FC stands for. 


In [56]:
RAG_LLM.history

[parts {
   text: "What is FC?FC lets developers create a description\nof a F in their code, then pass that description to a language\nmodel in a request.\n\nThe response from the model includes the name of\na F that matches the description and the arguments to call it with.\nFC lets you use F as tools in generative AI applications,\nand you can define more than one F within a single request.\n"
 }
 role: "user",
 parts {
   text: "I don\'t know.\n\nThe provided text explains that FC is a system that allows developers to describe functions (Fs) in their code and then use a language model to find a matching function and its arguments. However, it does not explicitly define what FC stands for. \n"
 }
 role: "model"]

In [57]:
RAG_LLM.history.clear()
RAG_LLM.history

[]

# CREATE THE RAG PIPELINE FOR THE EXISTING KNOWLEDGE BASE

## 1 A simple RAG Pipeline

* prepare a summary of the Knowledge Base
* get the query from the user
* query the Knowledge Base
* get the related chunks from the Knowledge Base
* combine the query + context from Knowledge Base
* submit the prompt (query + context) to the LLM
* get the response from the LLM
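
The steps above can be sketched end to end without any external service. In this sketch both the retriever and the LLM are stand-ins: `retrieve_stub` ranks three made-up chunks by word overlap, which is an assumption for illustration and not how ChromaDB ranks by embedding similarity, and `llm_stub` only reports what it received. The `QUESTION:` and `EXCERPTS:` labels mirror the `generateAnswer` function defined in this section.

```python
def retrieve_stub(query, n_results=2):
    """Stand-in retriever: rank chunks by how many query words they share."""
    knowledge_base = [
        "UAV route planning aims to maximize target coverage.",
        "Prerequisite courses affect success in follow-up courses.",
        "Deep learning can detect parking lot occupancy.",
    ]
    words = set(query.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda chunk: len(words & set(chunk.lower().split())),
                    reverse=True)
    return ranked[:n_results]


def llm_stub(message):
    """Stand-in LLM: report how many excerpt lines it received."""
    excerpts = message.split("EXCERPTS: ", 1)[1].splitlines()
    return f"Received {len(excerpts)} excerpt(s)."


def rag_answer(query, n_results=2):
    chunks = retrieve_stub(query, n_results)       # query the knowledge base
    prompt = "QUESTION: " + query                  # the user's query
    context = "\n EXCERPTS: " + "\n".join(chunks)  # the retrieved context
    return llm_stub(prompt + context)              # submit and get the response


answer = rag_answer("What is target coverage?")
print(answer)
```

Swapping `retrieve_stub` for `retrieveDocs` and `llm_stub` for `generate_LLM_answer` gives the structure of the real pipeline.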

In [58]:
# Verify collection properties
print(f"Collection name: {chroma_collection.name}")  # Access the name attribute directly
print(f"Number of documents in collection: {chroma_collection.count()}")

# List all collections in the client
print("All collections in ChromaDB client:")
for collection in chroma_client.list_collections():
    print(collection.name)

Collection name: Papers
Number of documents in collection: 238
All collections in ChromaDB client:
Papers


In [59]:
def summarize_collection(chroma_collection):
  summary = [] # Initialize summary as a list
  print("Summarizing the collection...")
  # Verify collection properties
  print(f"\t Collection name: {chroma_collection.name}")  # Access the name attribute directly
  print(f"\t Number of document chunks in collection: {chroma_collection.count()}")
  summary.append(f"Collection name: {chroma_collection.name}") # Append to the list
  summary.append(f"Number of document chunks in collection: {chroma_collection.count()}")
  # Print distinct metadata "document" for each chunk in the collection
  print("\t Distinct 'document' metadata in the collection:")
  distinct_documents = set()  # Use a set to store unique document names

  # Fetch the metadata of every chunk in one call, without assuming that the ids are "0", "1", ...
  all_metadatas = chroma_collection.get(include=["metadatas"])['metadatas']
  for metadata in all_metadatas:
      document_name = (metadata or {}).get("document", "Unknown")  # Default to "Unknown" if not present
      distinct_documents.add(document_name)  # Add document name to set for uniqueness

  # Print all distinct document names
  summary.append("Documents:")
  for document_name in distinct_documents:
      print("\t ",document_name)
      summary.append(document_name) # Append to the list

  print("Collection summarization completed.")

  # Join the list elements into a single string
  summary_string = "\n ".join(summary)
  return summary_string

In [60]:
s=summarize_collection(chroma_collection)

Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 238
	 Distinct 'document' metadata in the collection:
	  15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	  30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	  29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT (3).pdf
	  73 Analyzing Students Academic Success (2).pdf
	  77 Arac Park Yerlerinin Doluluk (2).pdf
Collection summarization completed.


In [61]:
print(s)

Collection name: Papers
 Number of document chunks in collection: 238
 Documents:
 15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
 30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
 29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT (3).pdf
 73 Analyzing Students Academic Success (2).pdf
 77 Arac Park Yerlerinin Doluluk (2).pdf


In [63]:
def generateAnswer(RAG_LLM, chroma_collection, query, n_results=5, only_response=True):
    # Retrieve the n_results chunks closest to the query
    retrieved_documents = retrieveDocs(chroma_collection, query, n_results, return_only_docs=True)
    prompt = "QUESTION: " + query
    context = "\n EXCERPTS: " + "\n".join(retrieved_documents)
    if not only_response:
      print("------- retrieved documents -------\n")
      for i, doc in enumerate(retrieved_documents):
        print(f"Document {i+1}:")
        print(f"\tDocument Text: {doc}")
      print("------- RAG answer -------\n")
    # Send the question plus the retrieved excerpts to the chat session
    output = generate_LLM_answer(prompt, context, RAG_LLM)

    display(to_markdown(output))
    print('\n')
    return output

## 2 Test the RAG pipeline

In [64]:
queries =["Who are the authors suggested a new attention mechanism?",
          "Who are the authors suggested a new controllable text generation mechanism?",
          "Who is Murat Karakaya?",
          "Why do we need to control how the text is produced? ",
          "How can we use the self attention mechanism to control the text generation?",
          "Summarize the paper named Controllable Text Generation",
          "How many blocks are suggested in the transformer?",
          "What about decoder?"
    ]

In [65]:
reply=generateAnswer(RAG_LLM, chroma_collection, queries[2],10, only_response=False)

------- retrieved documents -------

Document 1:
	Document Text: 78 M URAT KARAKAYA TSACO MWSF N μσ μ σ Impro vement ON 900 93. 23 14. 64 109. 33 12. 4 15 % 625 32. 13 8. 50 47. 83 10. 41 33 % 400 6. 76 2. 48 11. 23 3. 86 40 % 225 2. 43 1. 25 3. 03 1. 69 20 % 100 0. 96 0. 88 1. 06 0. 98 9 % TABLE 3Average Number ( μ ) of [UNK] caused by TSACO and MWSF heuristics for different num - be
Document 2:
	Document Text: ediyor. İlgi alanları arasında yapay zeka ve derin öğrenme konuları bulunuyor. Yazılım mühendisi olarak özel bir şirkette çalışmaya devam etmektedir. B. Murat Karakaya KHO Elektrik - Elektronik Mühendisliği nden lisans, Bilkent Üniversitesinden Bilgisayar Mühendisliğinden Yüksek Lisans ve Doktora derecelerini sırasıyla 1991, 2000 ve 2008 yılında almıştır. Şu anda A tılım Üniversitesi Bilgisayar Mühendisliği bölümünde Doç. Dr
Document 3:
	Document Text: Second, the considered chainedcourses are limited to three mathematical courses ( Calculus I, Calculus II, and [UNK] Equa - tio

> Murat Karakaya is an Assistant Professor in the Computer Engineering department at Atilim University in Ankara, Turkey. 
> 
> Here's how we know:
> 
> 1. The text states "Murat Karakaya et al. 368". This tells us that he's involved in research and likely a professor.
> 2. Later, the text states "He joined the faculty of Atilim University in 2012 and is currently an Asst. Professor in the department of Computer Engineering, Ankara, Turkey." This confirms his position and location. 






## 3 A simple loop for the User Interaction

In [67]:
summarize_collection(chroma_collection)
RAG_LLM.history.clear()
while True:
  question = input("Please enter your question, or type 'bye' to exit: ")
  if question == "bye":
    print("Thank you for using the service. Goodbye!")
    break
  else:
    generateAnswer(RAG_LLM, chroma_collection, question)




Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 238
	 Distinct 'document' metadata in the collection:
	  15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	  30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	  29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT (3).pdf
	  73 Analyzing Students Academic Success (2).pdf
	  77 Arac Park Yerlerinin Doluluk (2).pdf
Collection summarization completed.
Please enter your question, or type 'bye' to exit: bye
Thank you for using the service. Goodbye!
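
The loop above exits only on the exact string `bye` and sends empty input to the LLM. A sketch of a slightly more forgiving variant follows. The names `chat_loop`, `ask`, `read`, and `exit_words` are hypothetical; `ask` stands in for a call such as `generateAnswer(RAG_LLM, chroma_collection, question)`, and `read` can be replaced by scripted input so the loop can be tried without a keyboard.

```python
def chat_loop(ask, read=input, exit_words=("bye", "exit", "quit")):
    """Read questions until an exit word is typed; return the answers given."""
    answers = []
    while True:
        question = read("Please enter your question, or type 'bye' to exit: ").strip()
        if question.lower() in exit_words:
            print("Thank you for using the service. Goodbye!")
            break
        if not question:  # skip empty input instead of querying the LLM
            continue
        answers.append(ask(question))
    return answers


# Demo with scripted input instead of the keyboard.
scripted = iter(["Who is Murat Karakaya?", "", "BYE"])
answers = chat_loop(ask=lambda q: f"(answer to: {q})",
                    read=lambda _prompt: next(scripted))
print(answers)
```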


## 4 A Gradio Interface

In [68]:
%pip install gradio
import gradio as gr



In [69]:
RAG_LLM.history.clear()

# Thin wrapper so the Gradio callback only needs the question text
def generateAnswerInterFace(question):
    return generateAnswer(RAG_LLM, chroma_collection, question)

# Build the info text from the collection summary
def get_info_text():
    # summarize_collection returns the summary as a single string
    return "INFO: " + summarize_collection(chroma_collection)

# Lay out the interface with gr.Blocks
with gr.Blocks() as demo:
    # Define interface components
    query_txt = gr.Textbox(label="Enter your question here:", placeholder="Type your question")
    answer_txt = gr.Textbox(label="Answer:", placeholder="Answer will be displayed here")

    # Create a button to trigger the prediction
    btn = gr.Button("Generate Answer")

    # Callback run on each button click
    def predict(question):
        answer = generateAnswerInterFace(question)
        return answer

    # Summary of the knowledge base, shown below the button
    info_txt = gr.Textbox(get_info_text(), label="Info")

    # Connect button click to prediction function
    btn.click(predict, inputs=query_txt, outputs=answer_txt)

# Launch the interface
demo.launch(debug=True)


Summarizing the collection...
	 Collection name: Papers
	 Number of document chunks in collection: 238
	 Distinct 'document' metadata in the collection:
	  15 UAV Route Planning For Maximum Target Coverage ABSTRACT (2).pdf
	  30 Time-Sensitive Ant Colony Optimization ORIGINAL (2).pdf
	  29 Mobile Sink Scheduling Method for Wireless Sensor Networks under Travel Time Uncertainty ABSTRACT (3).pdf
	  73 Analyzing Students Academic Success (2).pdf
	  77 Arac Park Yerlerinin Doluluk (2).pdf
Collection summarization completed.
Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://441f494ec374292a3b.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Termin

> Murat Karakaya is an Assistant Professor in the Department of Computer Engineering at Atilim University in Ankara, Turkey. He received his Bachelor's degree in Electrical and Electronics Engineering from KHO, his Master's degree in Computer Engineering from Bilkent University, and his PhD in Computer Engineering from Bilkent University in 1991, 2000, and 2008, respectively. His research interests include natural computing, sensor networks, peer-to-peer networks, natural computing, optimization, and communication protocol design. He is a member of the IACSIT.
> 
> **Explanation:**
> 
> * The text explicitly states that Murat Karakaya is an Assistant Professor in the Department of Computer Engineering at Atilim University in Ankara, Turkey.
> * The text also mentions his academic degrees: Bachelor's degree in Electrical and Electronics Engineering from KHO, Master's degree in Computer Engineering from Bilkent University, and PhD in Computer Engineering from Bilkent University.
> * The text lists the years he received his degrees: 1991, 2000, and 2008.
> * The text states his research interests: natural computing, sensor networks, peer-to-peer networks, natural computing, optimization, and communication protocol design.
> * The text mentions his membership in the IACSIT.
> 
> Therefore, based on the provided text, we know Murat Karakaya is a computer engineering professor at Atilim University with a strong background in the field.






> Araç park yerlerinin boş olup olmadığını derin öğrenme yöntemleri kullanarak tespit edebiliriz. Bu yöntemde, park yerlerini gösteren görüntüler bir algoritma tarafından analiz edilir ve park yeri boş mu yoksa dolu mu olduğuna karar verilir. 
> 
> **Açıklama:**
> 
> Metinde, park yerlerinin boş veya dolu olup olmadığını belirlemek için "derin öğrenme yöntemlerinin" kullanılabileceği belirtiliyor. Ayrıca, park yerleri görüntülerinin bir algoritma tarafından analiz edilerek boş veya dolu olduğuna karar verilebileceği belirtiliyor. Bu nedenle, derin öğrenme yöntemlerinin bu görevi yerine getirmek için kullanılabileceği sonucuna varabiliyoruz. 






> Konuşmamızda Murat Karakaya'nın kim olduğu, kiminle çalıştığı ve ne yaptığını öğrendik. Ayrıca derin öğrenmenin park yerlerinin doluluk durumunu tespit etmek için nasıl kullanılabileceğini de ele aldık. 






> Sorular şunlardı:
> 
> * Bir dersin başarı oranını etkileyen faktörler nelerdir?
> * Öğrencilerin bir dersten önceki derslerde başarılı olup olmaması, sonraki derslerde başarılı olma olasılıklarını nasıl etkiler? 
> 
> Bu soruları cevaplamak için 441 öğrenci üzerinde araştırma yapıldı.






> Murat Karakaya, şu anda Atılım Üniversitesi Bilgisayar Mühendisliği bölümünde Doçent Doktor olarak görev yapmaktadır. Yapay zeka ve derin öğrenme alanlarında uzmanlaşmıştır.  KHO Elektrik-Elektronik Mühendisliği bölümünden 1991 yılında lisans, Bilkent Üniversitesi Bilgisayar Mühendisliği bölümünden ise 2000 yılında Yüksek Lisans ve 2008 yılında Doktora derecelerini almıştır. Ayrıca özel bir şirkette yazılım mühendisi olarak da çalışmaktadır. 
> 
> **Açıklama:**
> 
> Metinde Murat Karakaya'nın şu anda Atılım Üniversitesi Bilgisayar Mühendisliği bölümünde Doçent Doktor olarak görev yaptığı, yapay zeka ve derin öğrenme alanlarında uzmanlaştığı, KHO Elektrik-Elektronik Mühendisliği bölümünden 1991 yılında lisans, Bilkent Üniversitesi Bilgisayar Mühendisliği bölümünden ise 2000 yılında Yüksek Lisans ve 2008 yılında Doktora derecelerini aldığı ve özel bir şirkette yazılım mühendisi olarak çalıştığı belirtilmektedir. 






> Sohbetin başından beri şu sorular soruldu:
> 
> * **Murat Karakaya kimdir?** 
> * **Araç park yerlerinin boş olup olmadığını nasıl tespit edebiliriz?**
> * **Bir dersin başarı oranını etkileyen faktörler nelerdir?**
> * **Öğrencilerin bir dersten önceki derslerde başarılı olup olmaması, sonraki derslerde başarılı olma olasılıklarını nasıl etkiler?**
> 
> Bu sorular, Murat Karakaya'nın kimliği ve derin öğrenme kullanarak park yerlerinin doluluk durumunu nasıl tespit edebileceğimiz hakkında bilgi edinmek için soruldu. Ayrıca,  bir dersin başarı oranını etkileyen faktörleri ve bir dersten önceki derslerde başarılı olmanın sonraki derslerde başarılı olmayı nasıl etkilediğini anlamak için soruldu. 






> Eğitim sistemlerinde, öğrencilerin bir sonraki dersi başarıyla tamamlamaları için ön koşul dersleri tamamlamaları gerekir. Ancak, ön koşul dersleri ile takip eden dersler arasındaki ilişkinin öğrencilerin başarı oranına etkisini araştıran çalışmalar sınırlıdır. Bu makale, iki farklı müfredat geliştirme yaklaşımıyla oluşturulan iki ön koşul zinciri üzerinden, bir sonraki dersi başarıyla tamamlamanın ön koşul dersi başarısı ile ilişkisini ve müfredat geliştirme yaklaşımının bu ilişkiye etkisini incelemektedir. 






> Verilen metin parçalarında,  iki temel proje hakkında bilgi bulunmaktadır:
> 
> * **Ön Koşul Dersleri ve Akademik Başarı İlişkisi:** Bu proje, ön koşul dersleri ile takip eden dersler arasındaki ilişkinin öğrencilerin başarı oranına etkisini incelemektedir. İki farklı müfredat geliştirme yaklaşımıyla oluşturulan ön koşul zincirleri analiz edilerek, bir sonraki dersi başarıyla tamamlamanın ön koşul dersi başarısı ile ilişkisine ve müfredat geliştirme yaklaşımının bu ilişkiye etkisine bakılmaktadır. 
> * **Derin Öğrenme ile Park Yeri Doluluk Durumunun Tespit Edilmesi:** Bu proje,  derin öğrenme yöntemlerini kullanarak park yerlerinin doluluk durumunu tespit etmeyi amaçlamaktadır.  Transfer öğrenmesi yöntemiyle son katmanın eğitilmesi ve önceden eğitilmiş diğer katmanların bilgilerinin kullanılması amaçlanmaktadır.  
> 
> Bu parçalar, ilgili projelerin temel amaçlarını ve uygulanan yöntemleri kısaca özetlemektedir. Ancak, projelere dair daha detaylı bilgi ve sonuçlar, metin parçalarında yer almamaktadır. 






> "Analyzing Students Academic Success" makalesine göre öğrenci başarısını etkileyen temel unsurlar şunlardır:
> 
> * **Ön koşul derslerindeki başarı:** Makale, bir öğrencinin bir ön koşul dersinde başarısının, takip eden dersindeki başarısını olumlu yönde etkilediğini bulmuştur. Başka bir deyişle, ön koşul dersinde başarılı olan öğrenciler, takip eden derslerde de daha başarılı olma eğilimindedir.
> * **Müfredat geliştirme yaklaşımı:** Makale, ön koşul zincirlerinin tasarımında kullanılan müfredat geliştirme yaklaşımının da öğrenci başarısını etkilediğini ortaya koymuştur. Farklı müfredat yaklaşımları, ön koşul dersleri ile takip eden dersler arasındaki ilişkiye farklı şekillerde etki edebilir. 
> * **Ön koşul dersleri arasındaki ilişki:**  Makalede vurgulanan bir diğer önemli nokta ise, ön koşul derslerinin kendi aralarındaki ilişkinin de öğrenci başarısını etkilediği gerçeğidir.  Bazı ön koşul dersleri, takip eden derslerde başarılı olmada diğerlerinden daha önemli olabilir.
> 
> Makale, bu unsurların öğrenci başarısına etkisini daha detaylı incelemek için iki farklı ön koşul zinciri üzerinde bir vaka çalışması gerçekleştirmiştir. Çalışmanın sonuçları, yukarıda bahsedilen unsurların öğrenci başarısını etkilediğini göstermektedir.






> The text excerpts contain citations to studies by various researchers that focus on the relationship between prerequisite courses and student success. Some of these citations are:
> 
> * **Anderson et al. (2017):** Found that completing calculus and economics courses positively affects success in later economics courses.
> * **McMillan and Adeyemi (2017):** Examined the relationships between success in prerequisite courses and success in follow-up courses.
> * **Wright et al. (2017):** Investigated whether completing the organic chemistry course has an effect on success in the follow-up biochemistry course.
> * **Somasundara et al. (2017):** Addressed the Mobile Element Scheduling (MES) problem, proved that it is NP-complete, and proposed two algorithms such as EDF and MWSF.
> 
> These studies investigate the effect of prerequisite courses on student success and show that there is ongoing interest in this area.
> 
> However, the text excerpts do not include additional information such as the titles of the cited papers, the journals in which they were published, or their publication dates. This information is needed to fully identify the papers.






> According to the paper "Analyzing Students Academic Success", the main factors affecting student success are:
> 
> 1. **Success in prerequisite courses:** The paper found that the success a student achieves in a prerequisite course positively affects their success in the follow-up course. That is, students who succeed in a prerequisite course tend to be more successful in the follow-up courses as well.
> 2. **Curriculum development approach:** The paper showed that the curriculum development approach used in designing prerequisite chains also affects student success. Different curriculum approaches can influence the relationship between prerequisite and follow-up courses in different ways.
> 3. **Content and organization of prerequisite courses:** The paper also emphasizes that the content and organization of prerequisite courses affect student success. Well-organized prerequisite courses designed to match students' prior knowledge can lead to higher success in follow-up courses.
> 
> To examine the effect of these factors on student success in more detail, the paper conducted a case study on two different prerequisite chains. The results of the study show that the factors mentioned above affect student success.






> In the paper, the most important factor affecting student performance is stated to be **success in prerequisite courses**. The paper emphasizes that success in a prerequisite course positively affects success in the follow-up courses, and that students who complete their prerequisite courses successfully go on to be more successful in the follow-up courses.
> 
> Therefore, according to the paper, students' success in prerequisite courses is the most important determinant of their success in subsequent courses.




Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://441f494ec374292a3b.gradio.live




In [None]:
RAG_LLM.history

# SUMMARY

## Why Do We Need a Persistent ChromaDB?

In the context of a Retrieval-Augmented Generation (RAG) approach, saving and loading a persistent ChromaDB is particularly important for several reasons:

* **Enhanced Data Durability:**
  * **Importance:** Ensures the retrieval database used for augmenting generative models is not lost between sessions or system restarts.
  * **RAG Relevance:** Maintains a consistent and reliable knowledge base that the generative model can reference, leading to more accurate and relevant responses.

* **Operational Continuity:**
  * **Importance:** Allows operations to continue without re-indexing or re-importing data, saving time and computational resources.
  * **RAG Relevance:** Ensures that the generative model has continuous access to the same set of documents, which is essential for generating consistent and coherent responses over time.

* **Facilitating Collaboration:**
  * **Importance:** Enables multiple users or systems to share and access the same dataset.
  * **RAG Relevance:** Supports collaborative development and usage of the RAG system, allowing different teams to work on improving the retrieval and generation processes simultaneously.

* **Scalability:**
  * **Importance:** Provides a stable and persistent backend, enabling efficient handling of large datasets.
  * **RAG Relevance:** Essential for scaling the RAG system to more extensive and diverse knowledge bases, so that it can manage increased loads and still deliver prompt, relevant information.

In a RAG system, the retriever (backed here by ChromaDB) provides the generative model with relevant context from a knowledge base so that it can generate informed and accurate responses. Persistent storage ensures that this knowledge base is durable, continuously available, and scalable, which is critical for the reliability, consistency, and performance of the RAG system.
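
The durability and continuity points above can be illustrated without ChromaDB itself. The sketch below is a toy stand-in, not the ChromaDB API: a hypothetical `TinyVectorStore` class keeps documents and vectors in memory and writes them to a JSON file, so that a second "session" can reload the index instead of re-embedding the documents. The class name, file name, and sample vectors are all assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyVectorStore:
    """Toy stand-in for a vector DB (hypothetical, NOT the ChromaDB API)."""

    def __init__(self, path):
        self.path = path
        self.items = {}  # id -> {"document": str, "embedding": list of floats}
        if os.path.exists(path):
            # A previous session already indexed the data: just reload it.
            with open(path, "r", encoding="utf-8") as f:
                self.items = json.load(f)

    def add(self, doc_id, document, embedding):
        self.items[doc_id] = {"document": document, "embedding": embedding}

    def persist(self):
        # Write the whole index to disk so it survives a restart.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f)

    def count(self):
        return len(self.items)


# Session 1: index two chunks and persist them.
db_path = os.path.join(tempfile.mkdtemp(), "toy_store.json")
store = TinyVectorStore(db_path)
store.add("0", "first chunk", [0.1, 0.2])
store.add("1", "second chunk", [0.3, 0.4])
store.persist()

# Session 2: a fresh object finds the file and skips re-indexing.
reloaded = TinyVectorStore(db_path)
print(reloaded.count())
```

With ChromaDB the same role is played by a persistent client pointed at a directory on disk, which is the subject of Part C.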


In [None]:
def generateAnswer(RAG_LLM, chroma_collection, query, n_results=5):
    # Retrieve the n_results most similar chunks for the query
    retrieved_documents = retrieveDocs(chroma_collection, query, n_results, return_only_docs=True)

    print("------- retrieved documents -------\n")
    for i, doc in enumerate(retrieved_documents):
        print(f"Document {i+1}:")
        print(f"\tDocument Text: {doc}")
    # Build the prompt from the question and the retrieved excerpts
    prompt = "QUESTION: " + query
    context = "\n EXCERPTS: " + "\n".join(retrieved_documents)
    print("------- RAG answer -------\n")
    output = generate_LLM_answer(prompt, context, RAG_LLM)

    display(to_markdown(output))
    print('\n')
    return output

In [None]:
to_markdown(reply.text)

In [None]:
for message in chat.history:
  display(to_markdown(f'**{message.role}**: {message.parts[0].text}'))


In [None]:
model.count_tokens(chat.history)

In [None]:
response = chat.send_message(prompt)
to_markdown(response.text)


In [None]:
import os
import openai
from openai import OpenAI
from google.colab import userdata  # Colab secrets store that holds OPENAI_API_KEY

# Alternative outside Colab: read the key from a local .env file
# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv())
# openai.api_key = os.environ['OPENAI_API_KEY']
# openai_client = OpenAI()

openai_client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

In [None]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo-1106"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "As an attentive and supportive academic assistant, "
            "your task is to provide assistance based solely on the provided"
            " excerpts. Answer the following questions, ensuring your responses"
            " are derived exclusively from the provided partial texts. "
            "If the answer cannot be found within the provided excerpts, "
            "kindly respond with 'I don't know'. "
            "After answering each question, please provide a detailed "
            "explanation, breaking down the answer step by step and relating "
            "it to the provided excerpts. "
            "Return your response as a JSON object with two key fields: "
            " 'Answer', which should contain the value of the answer, and "
            " 'Reason', which should provide an explanation of why this answer "
            "was generated."

        },
        {"role": "user", "content": f"Question: {query}. \n Excerpts: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
def generateAnswer(query, n_results=5):
    # Retrieve the chunks, print them, then ask the OpenAI-based rag() for an answer
    retrieved_documents = retrieveDocs(query, n_results)

    print("------- retrieved documents -------\n")
    for i, doc in enumerate(retrieved_documents):
        print(f"Document {i+1}:")
        print(f"\tDocument Text: {doc}")
    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')
    return output

In [None]:
reply=generateAnswer(queries[5],10)


In [None]:
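# Convert the JSON-formatted 'reply' string to a dict

import ast
import json

try:
    reply_dict = json.loads(reply)
except json.JSONDecodeError:
    # Fall back to Python-literal parsing if the model used single quotes
    reply_dict = ast.literal_eval(reply)
print(f"Answer: {reply_dict['Answer']}")
print(f"Because: {reply_dict['Reason']}")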
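
Chat models sometimes wrap a JSON reply in a Markdown code fence or put a sentence in front of it, in which case a direct parse fails. The helper below is a minimal sketch that assumes the reply contains exactly one JSON object; `extract_json_reply` and the sample reply are hypothetical and not part of the notebook.

```python
import json


def extract_json_reply(text):
    """Parse the outermost {...} span found in an LLM reply (hypothetical helper)."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])


# Sample reply, invented for illustration only.
fence = "`" * 3  # Markdown code fence, built here to avoid typing it literally
sample_reply = (
    "Here is the result:\n"
    f"{fence}json\n"
    '{"Answer": "unknown", "Reason": "Not stated in the excerpts."}\n'
    f"{fence}"
)

parsed = extract_json_reply(sample_reply)
print(parsed["Answer"])
print(parsed["Reason"])
```

Taking the span between the first `{` and the last `}` is a deliberate simplification: it tolerates surrounding text but would fail on a reply that contains several separate JSON objects.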

In [None]:
for query in queries:
  generateAnswer(query)

In [None]:
%pip install umap-learn

In [None]:
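import numpy as np
from tqdm import tqdm

def project_embeddings(embeddings, umap_transform):
    # Project each embedding to 2-D with an already fitted UMAP transform
    umap_embeddings = np.empty((len(embeddings), 2))
    for i, embedding in enumerate(tqdm(embeddings)):
        # transform() returns an array of shape (1, 2); keep its single row
        umap_embeddings[i] = umap_transform.transform([embedding])[0]
    return umap_embeddings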

In [None]:
import umap.umap_ as umap

embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)
projected_dataset_embeddings = project_embeddings(embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10)
plt.gca().set_aspect('equal', 'datalim')
plt.title('Projected Embeddings')
plt.axis('off')

In [None]:
query = queries[3]

results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in results['documents'][0]:
    print(document)
    print('')


In [None]:
query_embedding = embedding_function([query])[0]
retrieved_embeddings = results['embeddings'][0]

projected_query_embedding = project_embeddings([query_embedding], umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)


In [None]:
# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_query_embedding[:, 0], projected_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{query}')
plt.axis('off')

In [None]:
def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            # The system prompt is in Turkish. It asks the model to act as an academic AI expert
            # reviewing TÜBİTAK project applications and to draft an answer that fits the described
            # project: an AI-based platform for managing behavioral risks in banking, such as early
            # withdrawal of time deposits and early repayment of loans.
            "content": "Sen TÜBİTAK proje başvurularını inceleyen yapay zeka konusunda uzman bir akademisyensin. "
            "Aşağıda verilen soruya, aşağıdaki proje tanımına uygun olabilecek bir cevap üret: \n"
            "Projenin genel amacı, bankacılık sektöründeki risk yönetimi operasyonlarını geliştirmek ve finansal kurumların karşılaştığı zorlukları ele almak "
            "için yapay zeka (AI) tabanlı bir platform geliştirmektir. Proje, bankalara vadeli mevduatın erken bozulması, kredilerin erken ödenmesi ve çeşitli "
            "mevduat türlerinin belirlenmesi gibi davranışsal riskleri daha etkili bir şekilde yönetme kapasitesi sunmayı hedeflemektedir. Bu riskler, finansal "
            "kurumların bilanço dengesini etkileyebilir ve operasyonel verimliliği azaltabilir. "
            "Projenin çözmeyi amaçladığı temel problem, bankaların karlılık ve risk analizlerini gerçekleştirirken karşılaştığı karmaşık durumları doğru ve "
            "etkili bir şekilde yönetme ihtiyacıdır. Özellikle vadeli mevduatların erken kapanması ve kredilerin erken ödenmesi gibi durumlar, bankaların "
            "gelecekteki nakit akışlarını ve risk profillerini belirleme sürecini karmaşıklaştırabilir."

        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [None]:
original_query = queries[0]
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(joint_query)
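
Why appending a hypothetical answer can help retrieval is easier to see with a toy similarity measure. The sketch below uses a bag-of-words cosine similarity as a stand-in for the embedding model, which is a simplification and not what the notebook does, and the query, answer, and chunk strings are invented for illustration. The hypothetical answer would normally come from `augment_query_generated`.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: token -> count (toy stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example strings (assumptions, not data from the notebook).
chunk = "early withdrawal of time deposits changes future cash flows"
original_query = "what risks does the project address"
hypothetical_answer = "the project models early withdrawal of time deposits"

joint_query = f"{original_query} {hypothetical_answer}"

score_original = cosine(bow(original_query), bow(chunk))
score_joint = cosine(bow(joint_query), bow(chunk))
print(score_original < score_joint)
```

The original query shares no vocabulary with the chunk, while the joint query does, so the joint query scores higher. Real embeddings capture meaning rather than exact tokens, but the same intuition applies: the generated answer pulls the query toward the region where relevant chunks live.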

In [None]:
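def extend_retrieved_documents(results, extension=4):
    # Expand each retrieved chunk with the chunks that follow it,
    # assuming chunk IDs are consecutive integers stored as strings
    original_ids = results['ids'][0]
    print("original_ids: ", original_ids)

    extended_ids = set()
    for doc_id in original_ids:
        # Keep the hit itself plus the next (extension - 1) chunk IDs
        extended_ids.add(int(doc_id))
        for i in range(1, extension):
            extended_ids.add(int(doc_id) + i)

    # Sort, drop IDs past the end of the collection, and convert back to strings
    collection_size = chroma_collection.count()
    extended_ids = [str(x) for x in sorted(extended_ids) if x < collection_size]
    print("extended_ids: ", extended_ids)
    return chroma_collection.get(ids=extended_ids)['documents']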
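
The neighbor-expansion step can be checked in isolation. The sketch below repeats the same ID arithmetic as a pure function, under the same assumption that chunk IDs are consecutive integers stored as strings; `extend_ids`, `collection_size`, and the sample values are hypothetical.

```python
def extend_ids(original_ids, extension, collection_size):
    """Return each hit's ID plus the next (extension - 1) IDs, sorted and de-duplicated."""
    extended = set()
    for doc_id in original_ids:
        base = int(doc_id)
        extended.add(base)
        for offset in range(1, extension):
            extended.add(base + offset)
    # Drop IDs that would run past the end of the collection.
    return [str(i) for i in sorted(extended) if i < collection_size]


print(extend_ids(["3", "10", "4"], extension=3, collection_size=12))
# ['3', '4', '5', '6', '10', '11']
```

Overlapping neighborhoods (IDs 3 and 4 here) collapse into one run because of the set, and ID 12 is dropped because the collection only holds IDs 0 through 11.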

In [None]:
def retrieveDocs_augmented_query(query, n_results=5, extension=4):
    hypothetical_answer = augment_query_generated(query)
    print("------ hypothetical_answer ---------\n")
    print(hypothetical_answer,"\n")
    print("------------------------------------\n")
    joint_query = f"{query} {hypothetical_answer}"
    results = chroma_collection.query(query_texts=joint_query, n_results=n_results, include=['documents', 'embeddings'])
    retrieved_documents = extend_retrieved_documents(results, extension)
    #retrieved_documents = results['documents'][0]

    return retrieved_documents



In [None]:
retrieved_documents=retrieveDocs_augmented_query(query, 5)

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
results = chroma_collection.query(query_texts=joint_query, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(doc)
    print('')

In [None]:
retrieved_embeddings = results['embeddings'][0]
original_query_embedding = embedding_function([original_query])
augmented_query_embedding = embedding_function([joint_query])

projected_original_query_embedding = project_embeddings(original_query_embedding, umap_transform)
projected_augmented_query_embedding = project_embeddings(augmented_query_embedding, umap_transform)
projected_retrieved_embeddings = project_embeddings(retrieved_embeddings, umap_transform)

In [None]:
import matplotlib.pyplot as plt

# Plot the projected query and retrieved documents in the embedding space
plt.figure()
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1], s=10, color='gray')
plt.scatter(projected_retrieved_embeddings[:, 0], projected_retrieved_embeddings[:, 1], s=100, facecolors='none', edgecolors='g')
plt.scatter(projected_original_query_embedding[:, 0], projected_original_query_embedding[:, 1], s=150, marker='X', color='r')
plt.scatter(projected_augmented_query_embedding[:, 0], projected_augmented_query_embedding[:, 1], s=150, marker='X', color='orange')

plt.gca().set_aspect('equal', 'datalim')
plt.title(f'{original_query}')
plt.axis('off')

In [None]:
def generateAnswer_augmented_query(query, n_results=5, extension=4):
    print("------- query -------\n")
    print(query, "\n")
    # Retrieve with the augmented query, then extend each hit with its following chunks
    retrieved_documents = retrieveDocs_augmented_query(query, n_results, extension)
    print("------- retrieved documents -------\n")
    for document in retrieved_documents:
        print(document)
        print('\n')

    print("------- RAG answer -------\n")
    output = rag(query=query, retrieved_documents=retrieved_documents)
    print(output)
    print('\n')

In [None]:
queries

In [None]:
generateAnswer_augmented_query(queries[0],10,5)

In [None]:
title= """Ar-Ge Sürecinde Kullanılacak Yöntemler Tanımlanan proje hedeflerine ulaşmak için uygulanacak analitik
        deneysel çözüm yöntemlerini belirtiniz. (NOT: Bu bölümde sunulan proje özelinde
        hangi teknik / bilimsel yaklaşımların ve bunlara ait aşamaların takip edileceği açıklanmalı, iş paketleri isimleri ya da her projede olabilecek standart
        rutin çalışma yöntemleri tekrarlanmamalıdır."""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]
print(retrieved_documents)

In [None]:
title= """Ar-Ge Sürecinde Kullanılacak Yöntemler Tanımlanan proje hedeflerine ulaşmak için uygulanacak analitik
        deneysel çözüm yöntemlerini belirtiniz. (NOT: Bu bölümde sunulan proje özelinde
        hangi teknik / bilimsel yaklaşımların ve bunlara ait aşamaların takip edileceği açıklanmalı, iş paketleri isimleri ya da her projede olabilecek standart
        rutin çalışma yöntemleri tekrarlanmamalıdır."""
results = chroma_collection.query(query_texts=title, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = extend_retrieved_documents(results)
print(retrieved_documents)


In [None]:
chroma_collection.get(results['ids'][0])