# Step 1:
---
Install the necessary packages,
either by running the following in a Jupyter notebook cell:

import sys
!{sys.executable} -m pip install numpy langchain langchain-huggingface langchain-community pypdf transformers safetensors sentence-transformers datasets faiss-cpu

or by pasting the following into your terminal:

pip install numpy langchain langchain-huggingface langchain-community pypdf transformers safetensors sentence-transformers datasets faiss-cpu

# Step 2:
Import the necessary libraries.

In [1]:
import os
import numpy as np
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import HuggingFaceHub
from langchain_community.llms import HuggingFacePipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from huggingface_hub import hf_hub_download
from urllib.request import urlretrieve

# Step 3:

Choose a relevant topic and collect PDFs from the internet.

In [4]:
# Download relevant documents
os.makedirs("monkeypox", exist_ok=True)
files = [
    "https://www.binasss.sa.cr/mono/8.pdf",
    "https://www.cfsph.iastate.edu/FastFacts/pdfs/monkeypox_F.pdf",
    "https://www.europeanreview.org/wp/wp-content/uploads/5983-5990.pdf",
    "https://centerforhealthsecurity.org/sites/default/files/2023-02/monkeypox.pdf",
    "https://www.ecdc.europa.eu/sites/default/files/documents/mpox-risk-assessment-monkeypox-virus-africa-august-2024.pdf",
    "https://www.ecdc.europa.eu/sites/default/files/documents/Monkeypox-multi-country-outbreak.pdf",
    "https://www.nicd.ac.za/wp-content/uploads/2022/06/IPC-SOP-For-monkeypox-management.pdf",
    "https://covid19.cdc.gov.sa/wp-content/uploads/2023/06/Monkeypox-Guidelines-01-June-2023-V1.4.pdf",
    "https://www.cfsph.iastate.edu/Factsheets/pdfs/monkeypox.pdf",
    "https://www.health.ny.gov/diseases/communicable/zoonoses/mpox/docs/mpv_college_q_and_a.pdf"]
for url in files:
    file_path = os.path.join("monkeypox", url.rpartition("/")[2])
    urlretrieve(url, file_path)

# Step 4:
Split the documents into manageable chunks.

chunk_size = 700: each chunk contains at most 700 characters.

chunk_overlap = 50: consecutive chunks may share up to 50 characters, so that context is not lost at chunk boundaries.
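
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: RecursiveCharacterTextSplitter also prefers to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=700, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration only; separators are not considered.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1500 characters of dummy text
sample = "".join(str(i % 10) for i in range(1500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [700, 700, 200]
```

The last 50 characters of one chunk are the first 50 characters of the next, which is the overlap described above.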

In [2]:
# Load pdf files in the local directory
loader = PyPDFDirectoryLoader("./monkeypox/")

docs_before_split = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=50,
)
docs_after_split = text_splitter.split_documents(docs_before_split)

docs_after_split

Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 12 0 (offset 0)

[Document(metadata={'source': 'monkeypox\\5983-5990.pdf', 'page': 0}, page_content='5983Abstract.  – OBJECTIVE: Recently monkeypox \ncases have been reported from many non-en -\ndemic countries. The objective of this article is \nto bring out the epidemiology, mode of trans -\nmission, clinical features, genetic clades, and \nmolecular properties of monkeypox virus. \nMATERIALS AND METHODS: A detailed liter -\nature review was conducted on monkeypox, us -\ning databases PubMed/Medline, EMBASE, PMC \nand Cochrane Library, for the period between \n1985 to 2022.    \nRESULTS: Genetically monkeypox virus can \nbe classified into Central African clade and \nWestern African clades. The sequence simi -\nlarity between the two strains was found to \nbe 99.5%. However, some significant differenc -'),
 Document(metadata={'source': 'monkeypox\\5983-5990.pdf', 'page': 0}, page_content='be 99.5%. However, some significant differenc -\nes were found in the virulent and nonvirulent \ngenes of the str

# What's happening?

Messages such as "Ignoring wrong pointing object 8 0 (offset 0)" are warnings, not errors. The PDF parser (pypdf, which PyPDFDirectoryLoader uses) prints them when a file's internal structure is damaged or malformed. It skips the bad reference and keeps going, which is why the loader still returns documents.

What it means:
- Pointing object: a reference in the PDF structure that should point to a valid object (such as text, images, or other resources).
- Wrong pointing object: the parser followed such a reference and did not find the expected object, because it is corrupt, missing, or incorrectly defined.
- Offset 0: the offset is the location in the PDF file where the object is expected. An offset of 0 is not a valid object location, which signals a problem in the PDF structure.
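
If the warnings clutter the output, they can usually be silenced through Python's standard logging module. This is a sketch under the assumption that the installed pypdf version emits these messages through loggers in the `pypdf` namespace (for example `pypdf._reader`); other versions may use a different logger name. Silencing the messages does not repair the PDFs.

```python
import logging

# Assumption: pypdf logs under the "pypdf" logger namespace.
# Raising the level to ERROR hides WARNING-level messages such as
# "Ignoring wrong pointing object ..." while still showing real errors.
logging.getLogger("pypdf").setLevel(logging.ERROR)

# Child loggers without their own level inherit the level set above.
print(logging.getLogger("pypdf._reader").getEffectiveLevel() == logging.ERROR)
```

Run this before calling loader.load() so that it takes effect while the PDFs are parsed.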

In [3]:
def avg_doc_length(docs):
    """Average number of characters per document (integer division)."""
    return sum(len(doc.page_content) for doc in docs) // len(docs)

avg_char_before_split = avg_doc_length(docs_before_split)
avg_char_after_split = avg_doc_length(docs_after_split)

print(f'Before split, there were {len(docs_before_split)} documents loaded, with average characters equal to {avg_char_before_split}.')
print(f'After split, there were {len(docs_after_split)} documents (chunks), with average characters equal to {avg_char_after_split} (average chunk length).')

Before split, there were 123 documents loaded, with average characters equal to 3098.
After split, there were 666 documents (chunks), with average characters equal to 580 (average chunk length).


# Step 5:

Text Embeddings with Hugging Face models

In [4]:
huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",  # alternatively, use "sentence-transformers/all-MiniLM-l6-v2" for a lighter and faster model.
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
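
The `normalize_embeddings` option rescales every embedding to unit length. The dependency-free sketch below uses toy 3-dimensional vectors (not real model output) to show why that matters for retrieval: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine, so both measures rank neighbors identically. The helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)  # for unit vectors, the dot product is the cosine
print(math.isclose(dot(a, a), 1.0))                     # unit length
print(math.isclose(squared_l2(a, b), 2 - 2 * cosine))   # L2 vs. cosine
```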

In [5]:
sample_embedding = np.array(huggingface_embeddings.embed_query(docs_after_split[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [ 1.42315803e-02  3.02480124e-02 -1.94063038e-03 -2.37165466e-02
  8.72085914e-02  2.57493537e-02  2.55865939e-02  5.37119582e-02
 -5.97317778e-02  2.79164035e-02  2.28543319e-02 -7.57816732e-02
  5.22226617e-02 -1.95315983e-02 -8.61759856e-03 -2.61693522e-02
 -4.99382950e-02 -4.07498740e-02  2.17050873e-02  2.44464409e-02
  8.82618502e-03 -2.49396618e-02  3.78644913e-02  1.27089745e-03
  4.96838614e-02  2.95210909e-02 -1.56822111e-02  6.39245734e-02
  3.11577711e-02 -2.05213994e-01 -1.40936263e-02 -2.44428348e-02
 -3.63206565e-02 -7.12227961e-03  8.26122519e-03 -6.49903668e-03
 -2.70382725e-02 -2.29580533e-02 -6.96670124e-03 -1.51532385e-02
  5.99912219e-02  1.34736598e-02  3.46205942e-02 -3.77883688e-02
  1.17250672e-02 -6.19833805e-02 -4.93516140e-02  1.16232866e-02
  2.27204431e-02 -9.13899299e-03  1.77481268e-02 -6.35470971e-02
  1.81622040e-02  1.09002486e-01 -2.58362796e-02 -2.35719346e-02
  3.58781144e-02 -5.86627461e-02 -3.13700251e-02 -4

# Step 6:

Retrieval System for vector embeddings

In [6]:
vectorstore = FAISS.from_documents(docs_after_split, huggingface_embeddings)

In [7]:
# Sample question; change it to any other question you are interested in.
query = """What is monkeypox?"""
relevant_documents = vectorstore.similarity_search(query)
print(f'There are {len(relevant_documents)} documents retrieved which are relevant to the query. Display the first one:\n')
print(relevant_documents[0].page_content)

There are 4 documents retrieved which are relevant to the query. Display the first one:

© 2013Monkeypox
What is monkeypox and 
what causes it?
Monkeypox is a viral disease 
discovered in laboratory monkeys in 
1958. The disease most commonly 
occurs in central and west Africa. Many 
animal species and humans can be 
infected. In 2003, monkeypox infected 
several people in the United States 
after they had contact with infected 
prairie dogs. The monkeypox virus is 
closely related to the viruses that cause 
smallpox and cowpox in humans.
What animals get 
monkeypox?
Old and New World monkeys and 
apes, a variety of rodents (including 
rats, mice, squirrels, and prairie 
dogs) and rabbits are susceptible 
to infection. The complete range of 
animal species that can be infected


# Step 7:

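Create a retriever interface from the vector store.
- LangChain's FAISS wrapper ranks results by Euclidean (L2) distance by default. Because the embeddings were normalized in Step 5, this gives the same ranking as cosine similarity:
\begin{align}
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\|\|B\|}
\end{align}
- MMR (maximal marginal relevance) can also be used. It selects documents that are relevant to the query Q but not redundant with the ones already chosen. S is the set of candidate documents, R the set already selected, and lambda (between 0 and 1) trades relevance against diversity:
\begin{align}
\text{MMR} = \arg \max_{D_{i} \in S \setminus R} \left [\lambda \cdot \text{sim} \left( D_{i}, Q \right)  - (1 - \lambda) \cdot \max_{D_{j} \in R} \text{sim}(D_{i}, D_{j})\right]
\end{align}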
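
A minimal, dependency-free sketch of the MMR selection rule above. The function names and toy 2-dimensional vectors are hypothetical; in LangChain, MMR is selected with `search_type="mmr"` in `as_retriever`, and its internal implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k=2, lam=0.5):
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(candidates[i], query)
            # Highest similarity to anything already selected (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]  # docs 0 and 1 are near-duplicates

print(mmr(query, docs, k=2, lam=0.3))  # [0, 2]
print(mmr(query, docs, k=2, lam=1.0))  # [0, 1]
```

With lambda = 1 the rule reduces to plain similarity ranking and returns the two near-duplicates; a lower lambda swaps the second one for a less similar but more diverse document.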

In [8]:
# Use similarity search and return the 3 most relevant documents.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

In [9]:
# Read the access token from an environment variable instead of hard-coding it.
hf = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-v0.1",
    huggingfacehub_api_token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
    model_kwargs={"temperature":0.1, "max_length":1000})

query = """What were the trends in monkeypox for  the last quarter of 2024."""  # Sample question, change to other questions you are interested in.
hf.invoke(query)

'What were the trends in monkeypox for  the last quarter of 2024.\n\n## Monkeypox\n\nMonkeypox is a viral disease that is caused by the monkeypox virus. The virus is related to the virus that causes smallpox. Monkeypox is found in parts of Africa.\n\nMonkeypox can cause a rash that looks like pimples or blisters. The rash can be on the face, inside the mouth, and on other parts of the body. The rash can'

# Step 8:

Create a local Hugging Face pipeline

In [10]:
# Log in to use gated models such as mistralai/Mistral-7B-v0.1.
# The token is read from an environment variable rather than hard-coded.
from huggingface_hub import login
login(token=os.environ["HUGGINGFACEHUB_API_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\isabe\.cache\huggingface\token
Login successful


In [11]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="distilgpt2",
    task="text-generation",
    pipeline_kwargs={"temperature": 1, "max_new_tokens": 300}
)

llm = hf 
llm.invoke(query)

"What were the trends in monkeypox for  the last quarter of 2024. When compared with the other five pandemic pandemics in 2009, all in 2009 were at this stage of their pandemics, with only one case occurring in 2000. As with HIV, they have a history of using their virus to treat a disease. Now, for example, it's clear from this study of influenza by researchers at Harvard Medical School, that the viral burden of the disease rose significantly between 2006-2014. This increase was large enough, and the fact that there were no studies focusing on it, suggests some of the research on this pandemic may have also played an important role in influencing the future of pandemics.\n\n\nThe present research has been criticized for some of the ways the new findings indicate pandemics have changed from the original one to the present. In other words, new research is increasingly making up some of the best estimates that have been available of these types of epidemics in the past five years based on

# Step 9: 

Develop a Q & A chain

In [12]:
prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, don't try to make up an answer. Just say "I can't find the final answer but you may want to check the following links".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [13]:
retrievalQA = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
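
With chain_type="stuff", the retrieved chunks are concatenated and substituted for {context} in the prompt, and the user's query fills {question}. The plain-Python sketch below illustrates that idea only; the template is shortened, the chunks are made-up placeholders, the blank-line separator is an assumption about the default, and `stuff_prompt` is a hypothetical helper rather than a LangChain function.

```python
template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:
"""


def stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Placeholder chunks standing in for the retriever's output.
chunks = ["First retrieved chunk.", "Second retrieved chunk."]
filled = stuff_prompt(chunks, "What is monkeypox?")
print(filled)
```

Because all chunks go into a single prompt, the number of retrieved chunks and the chunk size together have to fit within the model's context window.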

# Step 10:

Use the RetrievalQA invoke method to execute the chain.

In [14]:
# Call the QA chain with our query.
result = retrievalQA.invoke({"query": query})
print(result['result'])

Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, don't try to make up an answer. Just say "I can't find the final answer but you may want to check the following links".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

signs, disease course and complications observed 
in the year 1970 and later decades is similar to the 
recent monkeypox outbreak in 202214. This may 
be attributed to its stable nature of viral DNA ge -
nome that do not easily undergo significant muta -
tions over a period of time, unlike RNA viruses27. 
Discussion
The difference observed between the old mon -
keypox cases and the recent monkeypox mainly 
lie in the mode of acquisition, all other aspects 
remain the same. Recent monkeypox cases out -
break in Europe included sexual transmission too 
apart from other modes of transmission. Monkey -
pox cases were observed predominantly in hom
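
The printed result above starts with the prompt and the retrieved context rather than with an answer. One hedged way to keep only the model's continuation is to cut everything up to the last "Helpful Answer:" marker. `extract_answer` is a hypothetical helper, and it assumes the marker from the prompt template appears verbatim in the generated text.

```python
def extract_answer(generated, marker="Helpful Answer:"):
    """Return the text after the last occurrence of marker.

    If the marker is absent, the stripped input is returned unchanged.
    """
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()


# Made-up example of a generation that echoes the prompt.
raw = "Some context...\n\nQuestion: What is monkeypox?\n\nHelpful Answer:\nA viral disease."
print(extract_answer(raw))  # A viral disease.
```

In this notebook it could be applied as extract_answer(result['result']).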

In [15]:
relevant_docs = result['source_documents']
print(f'There are {len(relevant_docs)} documents retrieved which are relevant to the query.')
print("*" * 100)
for i, doc in enumerate(relevant_docs):
    print(f"Relevant Document #{i+1}:\nSource file: {doc.metadata['source']}, Page: {doc.metadata['page']}\nContent: {doc.page_content}")
    print("-"*100)

There are 3 documents retrieved which are relevant to the query.
****************************************************************************************************
Relevant Document #1:
Source file: monkeypox\5983-5990.pdf, Page: 3
Content: signs, disease course and complications observed 
in the year 1970 and later decades is similar to the 
recent monkeypox outbreak in 202214. This may 
be attributed to its stable nature of viral DNA ge -
nome that do not easily undergo significant muta -
tions over a period of time, unlike RNA viruses27. 
Discussion
The difference observed between the old mon -
keypox cases and the recent monkeypox mainly 
lie in the mode of acquisition, all other aspects 
remain the same. Recent monkeypox cases out -
break in Europe included sexual transmission too 
apart from other modes of transmission. Monkey -
pox cases were observed predominantly in homo -
----------------------------------------------------------------------------------------------------
Th