Drug Repurposing RAG Pipeline

Converting the PDF data to vectors and saving them in a vector database

Step 1: Extract the text from the PDF documents

In [1]:
import pdfplumber

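def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined by newlines."""
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or "" for pages without a text layer
        pages = [page.extract_text() or "" for page in pdf.pages]
    # Join with a newline so the last word of one page does not fuse with the next page
    return "\n".join(pages)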

# PDF files to index
pdf_files = [
    "data/tygacil-epar-product-information_en.pdf",
    "data/tigecycline-accord-epar-public-assessment-report_en.pdf",
]
documents = [extract_text(path) for path in pdf_files]

Step 2: Split the text into chunks for easier processing and querying

In [2]:
import nltk

# Sentence tokenizer data used by nltk.sent_tokenize
# (older NLTK releases use 'punkt' instead of 'punkt_tab')
nltk.download('punkt_tab')

# If the NLTK data lives in a custom location, register it with
# nltk.data.path.append(<path to nltk_data>) before tokenizing.

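def chunk_text(text, max_length=500):
    """Greedily pack whole sentences into chunks of roughly max_length characters."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) < max_length:
            current_chunk += " " + sentence
        else:
            # Avoid emitting an empty chunk when the first sentence alone exceeds max_length
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # A sentence longer than max_length still becomes its own (oversized) chunk
            current_chunk = sentence
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks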

chunked_docs = [chunk_text(doc) for doc in documents]

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
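
Before embedding, it can help to check how large the chunks actually came out, since a single long sentence produces a chunk above max_length and embedding models truncate input beyond their maximum sequence length. The helper below is a minimal sketch; chunk_stats and the sample chunks are hypothetical and not part of the original notebook.

```python
from statistics import mean

def chunk_stats(chunks):
    """Summarize chunk lengths (in characters) for one document."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }

# Illustrative input; in the notebook this would be one entry of chunked_docs
sample_chunks = ["Tigecycline is an antibiotic.", "It is given by infusion.", ""]
print(chunk_stats(sample_chunks))
```

A minimum of 0 points to empty chunks, and a maximum far above max_length points to sentences the tokenizer could not split.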


Step 3: Use a Hugging Face embedding model to convert the text into vectors

all-MiniLM-L6-v2 is a compact, efficient sentence-transformer model. It maps sentences and short paragraphs to a 384-dimensional dense vector space, which is why the Pinecone index in Step 4 is created with dimension 384.

In [5]:
from sentence_transformers import SentenceTransformer


model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = [model.encode(chunks) for chunks in chunked_docs]



Step 4: Create an index in the Pinecone vector database

In [None]:
from pinecone import Pinecone, ServerlessSpec
import yaml

# Load the config.yaml file
with open("config.yaml", "r") as file:
    config = yaml.safe_load(file)

api_key = config["dev"]["api_key"]

pc = Pinecone(api_key=api_key)
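
The cell above assumes config.yaml has a top-level dev mapping containing an api_key entry. A minimal sketch of a lookup that fails with a clear message when the key is missing is shown below; get_api_key and the PINECONE_API_KEY environment-variable fallback are illustrative assumptions, not part of the original notebook.

```python
import os

def get_api_key(config, env="dev", env_var="PINECONE_API_KEY"):
    """Return the API key for `env` from a parsed config dict.

    Falls back to the environment variable `env_var` (an assumed name)
    when the config has no key for that environment.
    """
    section = (config or {}).get(env) or {}
    api_key = section.get("api_key") or os.environ.get(env_var)
    if not api_key:
        raise KeyError(f"No api_key under '{env}' in config and {env_var} is not set")
    return api_key

# Parsed form of a config.yaml with a dev section holding an api_key entry
example_config = {"dev": {"api_key": "example-key"}}
print(get_api_key(example_config))
```

Keeping the key out of the notebook itself, whether in an untracked config file or an environment variable, avoids committing it by accident.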

In [None]:

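index_name = "drug-repurposing"

# Create the index only if it does not exist yet, so this cell can be re-run safely
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,    # Must match the embedding size of all-MiniLM-L6-v2
        metric="cosine",  # Similarity metric used at query time
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

print(pc.list_indexes())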

[{
    "name": "drug-repurposing",
    "metric": "cosine",
    "host": "drug-repurposing-60xht04.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 384,
    "deletion_protection": "disabled",
    "tags": null
}]
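
The index is configured with the cosine metric, which ranks stored vectors by the angle between them and the query vector rather than by their magnitude. The sketch below works through the formula on tiny 3-dimensional vectors using only the standard library; the vectors are made up for illustration, while the real index compares 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # Longer, but points the same way
orthogonal = [0.0, 3.0, 0.0]      # Shares no component with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

The first pair scores close to 1 despite the difference in length, and the second scores 0, which is the behavior that makes cosine a common choice for sentence embeddings.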


In [None]:
index_name = "drug-repurposing"
index = pc.Index(index_name)

Step 5: Upload the vectors to Pinecone

In [None]:
def upload_to_pinecone(index, embeddings, chunked_docs, pdf_files):
    vectors = []
    for doc_idx, (doc_embeds, chunks) in enumerate(zip(embeddings, chunked_docs)):
        for chunk_idx, (embed, chunk) in enumerate(zip(doc_embeds, chunks)):
            vector_id = f"{pdf_files[doc_idx]}_chunk_{chunk_idx}"
            vectors.append((vector_id, embed.tolist(), {"text": chunk, "source": pdf_files[doc_idx]}))
    index.upsert(vectors=vectors)

upload_to_pinecone(index, embeddings, chunked_docs, pdf_files)
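
The function above sends every vector in a single upsert call. That works for two short documents, but larger collections can exceed Pinecone's per-request limits, so uploads are usually split into batches. The sketch below shows one way to do the splitting; the batched helper, the batch size of 100, and the dummy vectors are assumptions for illustration, not part of the original notebook.

```python
def batched(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Dummy (id, values, metadata) tuples in the same shape the notebook builds
vectors = [(f"doc_chunk_{i}", [0.0] * 384, {"text": f"chunk {i}"}) for i in range(250)]

for batch in batched(vectors, batch_size=100):
    # In the notebook this line would be: index.upsert(vectors=batch)
    print(len(batch))
```

Inside upload_to_pinecone, the final index.upsert(vectors=vectors) call could be replaced by a loop over batched(vectors).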

Step 6: Test the vector DB by querying with a sample question

In [12]:
user_question = "What are the potential new uses for Tigecycline based on recent research?"
query_embed = model.encode([user_question])[0]

# Retrieve the 5 nearest chunks together with their stored metadata
results = index.query(
    vector=query_embed.tolist(),
    top_k=5,
    include_metadata=True
)

# Iterate through the results
for match in results['matches']:
    print(match['metadata']['text'])

retrieved_texts = [res["metadata"]["text"] for res in results["matches"]]

Tigecycline is a third-
generation tetracycline, which has a modified structure for dealing with the known resistance
mechanisms of bacteria. This application concerns a generic application according to article 10(1) for Tigecycline Accord, powder
for solution for infusion, 50 mg/vial. The reference product is the centrally authorised medicinal
product Tygacil, which has been authorised in the EU since April 2006.
Tigecycline Accord should be used only in situations where other alternative antibiotics are not suitable
(see sections 4.4, 4.8 and 5.1). The reference product is Tygacil by Pfizer Limited, UK, authorised in EU since 2006. The Guideline on
the Investigation of Bioequivalence (CPMP/EWP/QWP/1401/98 Rev. 1) is relevant for the assessment. No CHMP scientific advice pertinent to the clinical development was given for this medicinal product.
This medicinal product must not be mixed with other medicinal products except those mentioned in
section 6.6. 6.3 Shelf life
2 years. Once re
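
The retrieved_texts list collected above is the input a generation step would consume in a RAG pipeline. The notebook stops at retrieval, so the following is only a hedged sketch of how the chunks might be assembled into a prompt; build_prompt, its wording, and the max_context_chars budget are hypothetical choices, and no particular LLM API is assumed.

```python
def build_prompt(question, retrieved_texts, max_context_chars=3000):
    """Assemble a grounded prompt from retrieved chunks, numbered for citation."""
    context_parts = []
    used = 0
    for i, text in enumerate(retrieved_texts, start=1):
        snippet = f"[{i}] {text.strip()}"
        # Stop adding chunks once the character budget would be exceeded
        if used + len(snippet) > max_context_chars:
            break
        context_parts.append(snippet)
        used += len(snippet)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

example = build_prompt(
    "What is Tigecycline?",
    ["Tigecycline is a third-generation tetracycline."],
)
print(example)
```

Asking the model to admit when the context lacks an answer matters here, because the sample results above are regulatory text and do not directly address new uses for Tigecycline.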

In [10]:
import psutil
print(psutil.virtual_memory())

svmem(total=8491270144, available=2110472192, percent=75.1, used=6380797952, free=2110472192)
