Let's load the PDF file first. We can read it using the `PdfReader` class from `PyPDF2`.

In [1]:
from PyPDF2 import PdfReader

pdf_file_path = "docs/world-health-statistics.pdf"
loader = PdfReader(pdf_file_path)

Next we can collect all text from that PDF. 

In [2]:
raw_text = ""

for page in loader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

After that, we can split the text using `CharacterTextSplitter` from `LangChain`. We split it because we will store the data in a vector database as multiple documents instead of just one, and each document will hold different information. When we need information about something, we only retrieve the documents that contain information about it, rather than searching through all of the text.

In [3]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator = "\n", 
                                      chunk_size = 1000, 
                                      chunk_overlap = 10, 
                                      length_function = len)
text = text_splitter.split_text(raw_text)
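
To make the splitting step more concrete, here is a minimal pure-Python sketch of the general idea: cut the text on a separator, then greedily pack the pieces into chunks that stay under a size limit. The function name `split_by_separator` is hypothetical, the sketch ignores `chunk_overlap`, and it is not LangChain's actual implementation.

```python
def split_by_separator(text, separator="\n", chunk_size=1000):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only; a single piece longer than
    chunk_size is kept as its own oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        # Try to extend the current chunk with the next piece
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The chunk is full: store it and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "line one\nline two\nline three\nline four"
chunks = split_by_separator(sample, chunk_size=20)
print(chunks)
```

With a small `chunk_size` the sample ends up in more than one chunk, which is the same effect the real splitter has on the PDF text at a much larger scale.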

Now we can transform the chunks using an embedding function and store them in a Chroma vector database.

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_function = SentenceTransformerEmbeddings(model_name = "all-MiniLM-L6-v2")

# Run the two lines below once to build and persist the database, then leave them commented out
# vectordb = Chroma.from_texts(text, embedding_function, persist_directory = "docs/chroma_db")
# vectordb.persist()

Let's try to load the persisted vector database now.

In [3]:
vectordb = Chroma(persist_directory = "docs/chroma_db", embedding_function = embedding_function)
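
As a rough illustration of what a vector database does when we query it, the sketch below ranks a few stored texts by how similar their vectors are to a query vector. The three-dimensional vectors, the `store` dictionary, and the `top_k` helper are all made up for this example, and cosine similarity is an assumption; a real store may use a different distance metric and real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" keyed by the text they represent
store = {
    "HIV treatment coverage": [0.9, 0.1, 0.0],
    "Road traffic injuries": [0.0, 0.2, 0.9],
    "Malaria incidence": [0.6, 0.7, 0.1],
}


def top_k(query_vector, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


query = [1.0, 0.2, 0.0]  # pretend this is the embedded question
results = top_k(query)
print(results)
```

Only the closest texts are returned, which is why we do not need to read the whole PDF to answer a question.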

Since we already have the vector database, we can now set up a chain that helps us answer questions.

"What do you mean by _chain_?"

A chain connects the vector database with your prompt. When you write a prompt, the retriever fetches the most relevant chunks from the vector database and the LLM uses them to answer. We can set up this chain using `RetrievalQA.from_chain_type` from `langchain`.

In [4]:
from dotenv import load_dotenv
from langchain import OpenAI
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain

load_dotenv()

retriever = vectordb.as_retriever()
llm = OpenAI(temperature = 0.9)

qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                       chain_type = "stuff",
                                       retriever = retriever,
                                       return_source_documents = True,
                                       verbose = False)

# load_qa_chain(llm, chain_type = "stuff")
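
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The helper `build_stuff_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what the retriever would return
retrieved = [
    "Chunk about HIV treatment coverage.",
    "Chunk about new HIV infections.",
]
prompt = build_stuff_prompt("How is HIV in 2023?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach is simple but only works while the retrieved chunks fit within the model's context limit.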

Now, let's try our chain with a question.

In [5]:
chain_result = qa_chain("How is HIV in 2023? Is it better compared to 2022 or not?")
answer = chain_result["result"]
print(answer)

 I do not know the exact state of HIV in 2022 or 2023 as those dates have not yet occurred. However, based on the context provided, it can be inferred that efforts have been made to reduce new HIV infections and HIV-related deaths globally, with progress being strongest in the African Region. Treatment coverage has also expanded rapidly, potentially leading to a decrease in global HIV-related deaths. More progress is needed to meet the core targets set for 2025 and to ultimately end the AIDS epidemic by 2030.


We will use this chain to do two things:
1. Summarise the document
2. Act as a chatbot

We will summarise the document first, following these steps:

1. Make empty json `summary`
2. Collect all topics
3. Turn topics to keys
4. Make question prompt
5. Make question function
6. Build the loop

In [4]:
with open("topics.txt", "r", encoding = "utf-8") as r: 
    topics = r.read()
    
titles = topics.split("\n\n")
topics = [x.split("\n") for x in titles]
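
To see what this parsing produces, here is the same logic applied to a small in-memory string. The sample contents are hypothetical and only assume the layout implied by the code: groups of topics separated by a blank line, with one topic per line.

```python
# Hypothetical file contents: two groups separated by a blank line
sample = (
    "Health trends\nLife expectancy\nMaternal mortality\n\n"
    "Infectious diseases\nHIV\nMalaria"
)

# Split into groups on blank lines, then split each group into its lines
sample_titles = sample.split("\n\n")
sample_topics = [block.split("\n") for block in sample_titles]
print(sample_topics)
```

If the file ends with a trailing newline, the last group will contain an empty string, so calling `.strip()` on the text before splitting can help.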

Now we have the full list of topics. We can then run a loop that builds a prompt for each topic, changing only the topic name. Here's what our prompt looks like:

> _"Can you give me the summary of {topic} section given in the document"_

In [6]:
# Flatten the first three topic groups into a single list
all_topics = []
for i in range(3):
    all_topics.extend(topics[i])

In [8]:
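from tqdm import tqdm

all_summaries = {}
# Track progress across topics with a tqdm bar
progress_bar = tqdm(total = len(all_topics))

for t in all_topics:
    topic = f"Can you give me the summary of {t} section given in the document"
    chain_result = qa_chain(topic)
    answer = chain_result["result"]
    all_summaries[t] = answer
    progress_bar.update(1)

progress_bar.close()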

We can now save the summaries to `summaries.json`. We will show these summaries on our dashboard.

In [12]:
import json 

with open("docs/summaries.json", "w") as f:
    json.dump(all_summaries, f)
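
For completeness, here is a small sketch of how the saved file could be read back later, for example by the dashboard. The sample dictionary and the temporary directory are placeholders so the snippet runs on its own, and the explicit `encoding`, `ensure_ascii`, and `indent` arguments are optional choices rather than something the steps above require.

```python
import json
import os
import tempfile

# Hypothetical data standing in for all_summaries
summaries = {"HIV": "Example summary text."}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "summaries.json")

    # Write the summaries as human-readable UTF-8 JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summaries, f, ensure_ascii=False, indent=2)

    # Read them back as a regular dictionary
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded)
```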