#### Google Gemini API Key and Path Definitions

The Google Gemini API key is stored in an environment variable, and the paths for the ChromaDB store and the PDF folder are defined.

In [1]:
import os
os.environ["GEMINI_API_KEY"] = "YOUR_GEMINI_API_KEY"  # placeholder: never commit a real key to a shared notebook
chromadb_path = "C:\\Users\\ASUS\\Desktop\\Python Deep Learning\\chroma_db\\"
chroma_collection_name ="llm_rag_fellow"
pdfs_path = "C:\\Users\\ASUS\\Desktop\\Python Deep Learning\\data\\articles\\"
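
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has already been exported in the shell before the notebook starts (the helper name `require_env` is hypothetical and not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value


# Usage (assumes GEMINI_API_KEY was exported beforehand):
# gemini_api_key = require_env("GEMINI_API_KEY")
```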

# Loading and Splitting The Data

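The text of 55 PDFs was extracted file by file and concatenated into a single string. It was then split into chunks in preparation for building a local database with ChromaDB. Note that the split used here is separator-based (on `\n \n`), so chunk lengths vary rather than being a fixed 1000 characters each.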

In [2]:
from PyPDF2 import PdfReader 

In [3]:
def get_pdf_text(pdf_folder_path):
    text = ""
    for file_name in os.listdir(pdf_folder_path):
        if file_name.endswith('.pdf'):
            pdf_path = os.path.join(pdf_folder_path, file_name)
            pdf_reader = PdfReader(pdf_path)
            for page in pdf_reader.pages:
                text += page.extract_text() or ""  # guard against pages with no extractable text
    return text

In [4]:
pdf_folder_path = pdfs_path
pdf_text = get_pdf_text(pdf_folder_path)

In [5]:
# Length of the extracted text, in characters
print(len(pdf_text))

4624309


In [6]:
# Split the text into chunks on the "\n \n" separator
import re
def split_text(text: str):
    chunks = re.split('\n \n', text)
    # Drop empty strings produced by consecutive separators
    return [i for i in chunks if i != ""]

chunked_text = split_text(text=pdf_text)

In [7]:
# Number of chunks
print(len(chunked_text))

471
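
The separator-based split yields 471 chunks from about 4.6 million characters, so chunks average close to 10,000 characters and vary widely in length. If fixed-size chunks are wanted instead, a minimal sketch could look like the following. The function name `split_fixed` and the default `chunk_size` and `overlap` values are assumptions for illustration, not part of the original notebook.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Calling `split_fixed(pdf_text)` would then replace `split_text(text=pdf_text)`. The overlap trades some duplicated storage for fewer passages that are cut mid-sentence.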


# Embedding Data [Vectorising]

After splitting the data into chunks, each chunk must be vectorized before it is stored in ChromaDB. Vectorization converts a piece of text into a numerical vector (an embedding) in a high-dimensional space, where semantically similar texts lie close together. ChromaDB stores these vectors alongside the original chunks, so the chunks most similar to a query can be retrieved efficiently later.
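
To make the idea of "close together" concrete, here is a small sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions, and the distance metric that ChromaDB actually uses is configurable, so cosine similarity appears here only as an example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings": the first document points in nearly the same direction
# as the query, the second is orthogonal to it.
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, doc_close) > cosine_similarity(query_vec, doc_far))
```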

In [8]:
import google.generativeai as genai
from chromadb import Documents, EmbeddingFunction, Embeddings
import os


Because Gemini is used as the LLM, the embeddings are also computed with Gemini. The class below wraps Gemini's embedding endpoint as a ChromaDB `EmbeddingFunction`, so ChromaDB calls it whenever documents are added or queried.

In [9]:
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input:Documents)-> Embeddings:
        gemini_api_key=os.getenv("GEMINI_API_KEY")
        if not gemini_api_key:
            raise ValueError("API KEY NOT PROVIDED")
        genai.configure(api_key=gemini_api_key)
        model = "models/embedding-001"
        title = "Custom query"
        return genai.embed_content(model = model,
                                  content = input,
                                  task_type="retrieval_document",
                                  title=title)["embedding"]
    
    

# Storing Data

The chunks are embedded and saved to a persistent ChromaDB collection at the specified path and under the defined name, so the vectorized content of all 55 articles can be reused later with the LLM.

In [10]:
import chromadb
from typing import List
def create_chroma_db(documents:List, path:str, name:str):

    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.create_collection(name=name,embedding_function=GeminiEmbeddingFunction())
    for i, d in enumerate(documents):
        db.add(documents=d, ids=str(i))
    return db,name

db, name=create_chroma_db(documents=chunked_text,
                          path =chromadb_path,
                          name=chroma_collection_name)
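
The loop above adds one document per call, which means one embedding request per chunk. A hedged sketch of adding documents in batches follows. The helper names `iter_batches` and `add_in_batches` and the batch size of 50 are assumptions. Whether a batch of that size is accepted depends on the limits of the embedding API, so treat the number as a starting point.

```python
from typing import Any, Iterator, List, Sequence, Tuple


def iter_batches(documents: Sequence[str],
                 batch_size: int = 50) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, documents) pairs, using the document position as the id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        docs = list(documents[start:start + batch_size])
        ids = [str(i) for i in range(start, start + len(docs))]
        yield ids, docs


def add_in_batches(collection: Any, documents: Sequence[str],
                   batch_size: int = 50) -> int:
    """Add documents to a collection in batches; return how many were added.

    `collection` is expected to expose add(documents=..., ids=...), as a
    ChromaDB collection does.
    """
    added = 0
    for ids, docs in iter_batches(documents, batch_size):
        collection.add(documents=docs, ids=ids)
        added += len(docs)
    return added
```

Inside `create_chroma_db`, the `for` loop could then become `add_in_batches(db, documents)`. Separately, `create_collection` raises an error if the collection already exists, so re-running the cell fails; `chroma_client.get_or_create_collection(...)` is one way around that.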

In [11]:
def load_chroma_collection(path, name):

    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.get_collection(name=name, embedding_function=GeminiEmbeddingFunction())

    return db

db=load_chroma_collection(path=chromadb_path,
                          name=chroma_collection_name)

# Retrieval

The function below performs a semantic search and returns the chunks most similar to the query.

In [12]:
def get_relevant_passage(query, db, n_results):
  passage = db.query(query_texts=[query], n_results=n_results)['documents'][0]
  return passage

#Example prompt
relevant_text = get_relevant_passage(query="Can you analyze/enquire the pdf corpus to synthesize evidence on interlinkages between SDG 16 and the other two goals, SDG 1 + SDG 10?",db=db,n_results=3)

In [13]:
print(relevant_text)

[' 2 \n 2020   African Governance and Development Institute                                            WP/20/08 6 ', 'en/ and information on governance \nindicators are available at http://', 'for global players such as funders, and']
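
The three chunks returned above are short fragments (a page header and two partial sentences), which suggests the retrieval step deserves a closer look. One way to inspect it is to pair each returned document with its distance and drop very short chunks. The sketch below assumes the query result is a dictionary with `documents` and `distances` keys, each holding one list per query. The helper name `rank_passages` and the `min_chars` threshold are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def rank_passages(result: Dict[str, Any], min_chars: int = 0) -> List[Tuple[float, str]]:
    """Return (distance, document) pairs for the first query, closest first.

    Documents shorter than min_chars (after stripping whitespace) are dropped.
    """
    documents = result["documents"][0]
    distances = result["distances"][0]
    pairs = [
        (dist, doc)
        for dist, doc in zip(distances, documents)
        if len(doc.strip()) >= min_chars
    ]
    return sorted(pairs, key=lambda pair: pair[0])


# Hand-written result in the assumed shape, for illustration only.
example_result = {
    "documents": [["short", "a longer passage about governance"]],
    "distances": [[0.2, 0.5]],
}
for distance, passage in rank_passages(example_result, min_chars=10):
    print(f"{distance:.2f}  {passage}")
```

With a real collection, the input would come from something like `db.query(query_texts=[query], n_results=n_results, include=["documents", "distances"])`.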


The prompt template below tells the LLM how it should respond, so that the answers come back in the desired format.

In [33]:
def make_rag_prompt(query, relevant_passage):
  escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  prompt = ("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  You are talking to a technical audience, so make sure you give relevant text and passages, \
  and strike a formal and conversational tone. \
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

  ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt


The function below generates a response to a given prompt using the Gemini Pro model.

In [69]:
import google.generativeai as genai

def generate_answer(prompt):
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    if not gemini_api_key:
        raise ValueError("Gemini API Key not provided. Please provide GEMINI_API_KEY as an environment variable")
    genai.configure(api_key=gemini_api_key)
    model = genai.GenerativeModel('gemini-pro')
    answer = model.generate_content(prompt)
    return answer.text

Testing answer generation directly, without any retrieved context

In [70]:
generate_answer("How does increased accountability and increased transparency affect reducing poverty including relative and extreme poverty.")

'**Effects of Increased Accountability and Transparency on Reducing Poverty**\n\n**1. Improved Targeting and Efficiency:**\n\n* Transparent systems allow for better identification of the poor and vulnerable.\n* Accountability ensures that resources are allocated more effectively to those who need them most.\n* This reduces leakages and wasteful spending, targeting interventions precisely.\n\n**2. Reduced Corruption and Misappropriation:**\n\n* Transparency makes it easier to monitor the use of poverty alleviation funds.\n* It deterrents corruption and ensures that resources reach their intended beneficiaries.\n* This reduces the amount of money lost to fraud, mismanagement, or diversion.\n\n**3. Enhanced Participation and Empowerment:**\n\n* Transparent reporting and accountability mechanisms empower the poor to hold decision-makers accountable.\n* They can participate in decision-making processes and ensure that their voices are heard.\n* This leads to more inclusive and effective pov

Retrieving the relevant text chunks and generating an answer from them

In [65]:
# Retrieve the most relevant text chunks (n_results=6 here) and join them into a single passage
def generate_answer_rel(db,query):
    relevant_text = get_relevant_passage(query,db,n_results=6)
    prompt = make_rag_prompt(query, 
                             relevant_passage=" ".join(relevant_text))  # space keeps adjacent chunks from running together
    answer = generate_answer(prompt)

    return answer
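
When several chunks are concatenated into one passage, it can help to keep them distinguishable and to cap the total length sent to the model. The following is a sketch under stated assumptions: the helper name `join_passages`, the `[n]` numbering scheme, and the 6000-character budget are all illustrative choices, not part of the original notebook.

```python
from typing import List


def join_passages(passages: List[str], max_chars: int = 6000) -> str:
    """Join passages as numbered blocks, stopping before max_chars is exceeded."""
    parts: List[str] = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        cleaned = " ".join(passage.split())  # collapse newlines and repeated spaces
        if not cleaned:
            continue  # skip empty or whitespace-only chunks
        block = f"[{number}] {cleaned}"
        if used + len(block) > max_chars:
            break  # adding this block would exceed the budget
        parts.append(block)
        used += len(block) + 1  # +1 for the newline separator
    return "\n".join(parts)
```

The result could be passed as `relevant_passage=join_passages(relevant_text)`. Since `make_rag_prompt` replaces newlines with spaces, the blocks end up on one line, but the `[n]` markers still show where each chunk begins.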

Q 1 Prompt:

Interlinkages between SDG 16 and SDG 1 + SDG 10

Prompt:
Analyze the PDF corpus to synthesize evidence on the interlinkages between Sustainable Development Goal 16 (SDG 16) and the other two goals, SDG 1 and SDG 10. Please utilize the full text of the relevant SDGs, including the goal itself, targets, and indicators, for better matches.




Q 2 Prompt:

Impact of Increased Accountability and Transparency on Poverty Reduction

Prompt:
Examine the effects of increased accountability and transparency on reducing poverty, including both relative and extreme poverty. Please consider the targets related to accountability and increased transparency in Sustainable Development Goal 16 (SDG 16.6 and 16.7), as well as the targets related to reducing poverty, including both relative and extreme poverty, in Sustainable Development Goal 1 (SDG 1.1 and 1.2). Analyze how the implementation of measures to enhance accountability and transparency contributes to achieving the targets of poverty reduction. 



In [71]:
queries= [ "Examine the effects of increased accountability and transparency on reducing poverty, including both relative and extreme poverty. Please consider the targets related to accountability and increased transparency in Sustainable Development Goal 16 (SDG 16.6 and 16.7), as well as the targets related to reducing poverty, including both relative and extreme poverty, in Sustainable Development Goal 1 (SDG 1.1 and 1.2). Analyze how the implementation of measures to enhance accountability and transparency contributes to achieving the targets of poverty reduction."
, "Could you analyze the PDF corpus to identify evidence of the interconnections between SDG 16 and the other two goals, SDG 1 and SDG 10? Tip: For better matches, you may utilize the full text of the relevant SDG, including its goals, targets, and indicators."]

In [72]:
db=load_chroma_collection(path=chromadb_path,
                          name=chroma_collection_name)

In [73]:
for query in queries:
    answer = generate_answer_rel(db, query)
    print(answer)
    print("------------------\n")
    

**Effects of increased accountability and transparency on reducing poverty, including both relative and extreme poverty**

SDG 16.6 and 16.7 aim to promote accountability and transparency, while SDG 1.1 and 1.2 target reducing poverty, both relative and extreme. Studies show that good governance enhances inclusive education, while inequality dampens its positive effects.

The research examines thresholds of inequality that weaken the positive impact of governance on inclusive education in sub-Saharan Africa (SSA). Results indicate that governance unconditionally promotes inclusive education, but inequality mitigates this effect.

The study also finds that:

* Income inequality thresholds above which governance can no longer promote inclusive education range from 0.562 to 0.700.
* The control of corruption and the rule of law are most effective in promoting inclusive education but are hindered by high levels of inequality.
* The findings support the need to reduce income inequality and 