#### Google Gemini API Key Definition Along with Paths

The Google Gemini API key is stored in an environment variable, and the paths for the ChromaDB store and the PDF folder are defined.

In [None]:
import os
# Placeholder value: a real key should never be committed to a shared notebook.
os.environ["GEMINI_API_KEY"] = "YOUR_GEMINI_API_KEY"
chromadb_path = "C:\\Users\\ASUS\\Desktop\\Python Deep Learning\\chroma_db\\"
chroma_collection_name = "llm_rag_fellow"
pdfs_path = "C:\\Users\\ASUS\\Desktop\\Python Deep Learning\\data\\articles\\"
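
One way to avoid writing the key into the notebook at all is to read it from an environment variable that was set before the notebook was started. The following is a minimal sketch; `require_env` is a hypothetical helper name, not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes GEMINI_API_KEY was exported in the shell beforehand):
# gemini_api_key = require_env("GEMINI_API_KEY")
```

With this approach a missing key fails early with a readable message instead of surfacing later as an authentication error.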

# Loading and Splitting The Data

The text was extracted from 55 PDFs, one file at a time, and concatenated into a single string. It was then split into chunks at blank-line boundaries, so chunk lengths vary, and prepared for loading into a local ChromaDB database.

In [None]:
from PyPDF2 import PdfReader 

In [None]:
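def get_pdf_text(pdf_folder_path):
    """Concatenate the extracted text of every PDF in a folder."""
    text = ""
    # Sort the file names so the order of the text is reproducible.
    for file_name in sorted(os.listdir(pdf_folder_path)):
        if file_name.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file_name)
            pdf_reader = PdfReader(pdf_path)
            for page in pdf_reader.pages:
                # Guard against pages for which no text is returned.
                text += page.extract_text() or ""
    return text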

In [None]:
pdf_folder_path = pdfs_path
pdf_text = get_pdf_text(pdf_folder_path)

In [None]:
# Length of the extracted text, in characters
print(len(pdf_text))

In [None]:
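# Split the text into chunks
import re

def split_text(text: str):
    # A newline, a space, and another newline marks a paragraph break in the extracted text.
    chunks = re.split('\n \n', text)
    # Drop empty strings left by consecutive separators.
    return [chunk for chunk in chunks if chunk != ""]

chunked_text = split_text(text=pdf_text)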
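
The regex splitter above yields chunks of varying length. If fixed-size chunks of about 1000 characters are wanted, a sliding window is one option. The following is a minimal sketch; `split_fixed`, `chunk_size`, and `overlap` are illustrative names, and the 100-character overlap is an assumed value rather than something taken from the notebook.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Example on synthetic text: 2500 characters produce three chunks.
sample_chunks = split_fixed("a" * 2500, chunk_size=1000, overlap=100)
print([len(c) for c in sample_chunks])
```

Fixed-size windows ignore paragraph boundaries, so they trade cleaner semantic units for predictable chunk lengths.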

In [None]:
# Number of chunks
print(len(chunked_text))

# Embedding Data [Vectorising]

After splitting the data into chunks, we need to vectorize it before it can be stored in ChromaDB and retrieved later. Vectorization converts each chunk of text into a numerical vector (an embedding) in a high-dimensional space, so that chunks with similar meaning lie close together. Once the vectors are stored in ChromaDB, the chunks most similar to a query can be retrieved quickly.
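
To make the idea of "close together" concrete, the sketch below compares toy vectors with cosine similarity, one common similarity measure. The three-dimensional vectors are made-up values for illustration only; real embeddings have hundreds of dimensions, and the distance ChromaDB actually uses depends on how the collection is configured.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented numbers, not produced by any model).
poverty = [0.9, 0.1, 0.0]
inequality = [0.8, 0.3, 0.1]
weather = [0.0, 0.2, 0.9]

# The two related topics score higher than the unrelated pair.
print(cosine_similarity(poverty, inequality))
print(cosine_similarity(poverty, weather))
```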

In [None]:
import google.generativeai as genai
from chromadb import Documents, EmbeddingFunction, Embeddings
import os


Because Gemini is used as the LLM, the class below wraps the Gemini embedding API in ChromaDB's EmbeddingFunction interface. ChromaDB calls this class whenever it needs to turn text into vectors, both when chunks are stored and when a query is run.

In [None]:
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Read the key from the environment so it is not passed around in code.
        gemini_api_key = os.getenv("GEMINI_API_KEY")
        if not gemini_api_key:
            raise ValueError("API KEY NOT PROVIDED")
        genai.configure(api_key=gemini_api_key)
        model = "models/embedding-001"
        title = "Custom query"
        # embed_content returns a dict; the vectors are under the "embedding" key.
        return genai.embed_content(model=model,
                                   content=input,
                                   task_type="retrieval_document",
                                   title=title)["embedding"]

# Storing Data

The vectorized data, covering the full content of the 55 articles, is saved under the specified path and collection name in the format ChromaDB requires, so that it can be reused later with the LLM.

In [None]:
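import chromadb
from typing import List

def create_chroma_db(documents: List[str], path: str, name: str):
    # PersistentClient stores the collection on disk under the given path.
    chroma_client = chromadb.PersistentClient(path=path)
    # create_collection raises an error if a collection with this name already exists.
    db = chroma_client.create_collection(name=name, embedding_function=GeminiEmbeddingFunction())
    # Add each chunk, using its position in the list as a string ID.
    for i, d in enumerate(documents):
        db.add(documents=d, ids=str(i))
    return db, name

db, name = create_chroma_db(documents=chunked_text,
                            path=chromadb_path,
                            name=chroma_collection_name)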
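
Adding chunks one at a time triggers one embedding request per chunk. Grouping chunks into batches is one way to reduce the number of requests. The helper below is a minimal sketch in plain Python; `batched_with_ids` is a hypothetical name, and the batch size of 50 is an assumption that should be checked against the limits of the embedding endpoint.

```python
from typing import Iterator, List, Sequence, Tuple


def batched_with_ids(
    documents: Sequence[str], batch_size: int = 50
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (documents, ids) pairs, where ids are list positions as strings."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        batch = list(documents[start:start + batch_size])
        ids = [str(i) for i in range(start, start + len(batch))]
        yield batch, ids


# Hypothetical usage with the collection created above:
# for docs, ids in batched_with_ids(chunked_text):
#     db.add(documents=docs, ids=ids)
```

The IDs match those produced by the loop in create_chroma_db, so either approach yields the same collection contents.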

In [None]:
def load_chroma_collection(path, name):

    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.get_collection(name=name, embedding_function=GeminiEmbeddingFunction())

    return db

db=load_chroma_collection(path=chromadb_path,
                          name=chroma_collection_name)

# Retrieval

The function below runs a semantic search against the collection and returns the chunks most similar to the query.

In [None]:
def get_relevant_passage(query, db, n_results):
    # query() returns one list of documents per query text; take the list for our single query.
    passage = db.query(query_texts=[query], n_results=n_results)['documents'][0]
    return passage

# Example query
relevant_text = get_relevant_passage(
    query="Can you analyze/enquire the pdf corpus to synthesize evidence on interlinkages between SDG 16 and the other two goals, SDG 1 + SDG 10?",
    db=db,
    n_results=3)

In [None]:
print(relevant_text)
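
The query result is a dictionary whose values hold one list per query text, which is why the function above indexes with [0]. The sketch below pairs each returned chunk with its distance so the ranking can be inspected. `top_hits` is a hypothetical helper, and `mock_result` is a hand-written stand-in with placeholder text and distances, not real output from the collection; it assumes the result includes a "distances" entry, which is the default for a ChromaDB query.

```python
from typing import Any, Dict, List, Tuple


def top_hits(result: Dict[str, Any], query_index: int = 0) -> List[Tuple[str, float]]:
    """Pair each retrieved document with its distance for one query."""
    documents = result["documents"][query_index]
    distances = result["distances"][query_index]
    return list(zip(documents, distances))


# Stand-in for a query result (placeholder values, for illustration only).
mock_result = {
    "ids": [["12", "40", "7"]],
    "documents": [["chunk A", "chunk B", "chunk C"]],
    "distances": [[0.21, 0.35, 0.52]],
}

for text, distance in top_hits(mock_result):
    # A smaller distance means the chunk is closer to the query.
    print(f"{distance:.2f}  {text}")
```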

The prompt template below tells the LLM how to respond, so that answers come back in the desired format.

In [None]:
def make_rag_prompt(query, relevant_passage):
    # Strip quotes and newlines so the passage stays inside the quoted PASSAGE field.
    escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
    prompt = ("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  You are talking to a technical audience, so make sure you give relevant text and passages. \
  Strike a formal and conversational tone. \
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

  ANSWER:
  """).format(query=query, relevant_passage=escaped)

    return prompt


The function below generates a response to a given prompt using the Gemini Pro API.

In [None]:
import google.generativeai as genai

def generate_answer(prompt):
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    if not gemini_api_key:
        raise ValueError("Gemini API Key not provided. Please provide GEMINI_API_KEY as an environment variable")
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel('gemini-pro')
    answer = model.generate_content(prompt)
    return answer.text

Testing the answers

In [None]:
generate_answer("How does increased accountability and increased transparency affect reducing poverty including relative and extreme poverty.")

Retrieving the 3 most relevant text chunks and generating an answer from them

In [None]:
# Retrieve the top 3 relevant text chunks and join them into a single passage
def generate_answer_rel(db, query):
    relevant_text = get_relevant_passage(query, db, n_results=3)
    # Join with a space so the last word of one chunk does not run into the first word of the next.
    prompt = make_rag_prompt(query,
                             relevant_passage=" ".join(relevant_text))
    answer = generate_answer(prompt)

    return answer
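
When the chunks are joined into one passage, the boundaries between them are lost. Numbering each chunk is one way to keep them distinguishable in the prompt. This is a minimal sketch; `format_passages` is a hypothetical helper that is not part of the notebook, and since make_rag_prompt replaces newlines with spaces, only the bracketed numbers would survive as separators there.

```python
from typing import List


def format_passages(chunks: List[str]) -> str:
    """Join chunks into one passage, prefixing each with its rank in brackets."""
    return "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )


# Example with placeholder chunks.
print(format_passages([" first chunk ", "second chunk"]))
```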

In [None]:
db=load_chroma_collection(path=chromadb_path,
                          name=chroma_collection_name)

query="How does increased accountability and increased transparency affect reducing poverty including relative and extreme poverty. Hint: For accountability and increased transparency check SDG 16.6 and 16.7. Hint: For reducing poverty including relative and extreme poverty check SDG 1.1 and 1.2."

answer = generate_answer_rel(db, query)
print(answer)

Second query

In [None]:
query="Can you analyze/enquire the pdf corpus to synthesize evidence on interlinkages between SDG 16 and the other two goals, SDG 1 + SDG 10? Hint: You may use full text of the relevant SDG i.e., goal+targets+indicators for better matches."

answer = generate_answer_rel(db, query)
print(answer)

In [None]:
import session_info
session_info.show()


In [None]:
import session_info

# session_info.show() displays the report and returns None, so its result cannot be iterated.
# Have it write the imported packages and their versions to a pip-compatible requirements file instead.
session_info.show(write_req_file=True, req_file_name="requirements2.txt")