# Advanced RAG

- Additional metadata vector store filtering [info](https://akash-mathur.medium.com/vector-database-vs-indexing-path-to-efficient-data-handling-382cc1207491#:~:text=Metadata%20storage%20and%20filtering%3A%20Vector,filters%20for%20finer%2Dgrained%20queries.)

In [2]:
import pandas as pd
import numpy as np
import json, os, pprint
import matplotlib.pyplot as plt
import plotly.express as px
import random
from langchain_openai import OpenAIEmbeddings
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.utils.function_calling import convert_to_openai_tool
from langchain_core.tools import tool
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain.output_parsers import JsonOutputToolsParser, JsonOutputKeyToolsParser
from langchain.agents import AgentExecutor, create_openai_tools_agent, create_react_agent, Tool
from langchain.agents.format_scratchpad.openai_tools import format_to_openai_tool_messages
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain_experimental.utilities import PythonREPL
from langchain_experimental.tools import PythonREPLTool
from langchain import hub
from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.callbacks import Callbacks
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.runnables import RunnablePassthrough

In [3]:
os.environ["OPENAI_API_KEY"] = ""

In [3]:
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.1, streaming=True)

In [4]:
# Download files from https://athena.ohdsi.org/
ndc_dir = "/Users/jzamalloa/Documents/PROJECTS/LLM/DBs/033024_ndc"
concept_ndc = pd.read_csv(ndc_dir + "/CONCEPT.csv", sep="\t")

print(concept_ndc.shape)
concept_ndc.head()

(1403710, 10)




Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
0,36189414,hemorrhoidal cream 10mg/g / 144mg/g / 150mg/g ...,Drug,NDC,9-digit NDC,,3630641,20180325,20991231,
1,1220863,fulvestrant 250mg/5mL INTRAMUSCULAR INJECTION,Drug,NDC,9-digit NDC,,167290436,20210121,20991231,
2,35110579,"kali muriaticum, carbo vegetabilis, lung (suis...",Drug,NDC,11-digit NDC,,43742164901,20200626,20280919,D
3,36321712,"pulsatilla (pratensis), euphorbium officinarum...",Drug,NDC,11-digit NDC,,43742206101,20221201,20280908,D
4,36321592,"influenzinum (2022-2023), herpes simplex 1 nos...",Drug,NDC,11-digit NDC,,43742206201,20221205,20281117,D


In [30]:
(concept_ndc
 .query("standard_concept==standard_concept")
#  .concept_class_id.unique()
 .query("vocabulary_id=='NDC'")
 )

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_date,valid_end_date,invalid_reason
2135,45201605,SM BANDAGES FLEXIBLE,Device,NDC,Device,S,10939005233,20130805,20991231,
2136,45304229,SM FABRIC BANDAGES,Device,NDC,Device,S,10939005933,20130805,20991231,
2137,45355381,SM FABRIC BANDAGES,Device,NDC,Device,S,10939008511,20130805,20991231,
2138,44979982,FLEXIBLE EX-LARGE BANDAGE,Device,NDC,Device,S,10939008611,20130805,20991231,
2139,45235976,SUNBLOCK SPF15 LOTION,Device,NDC,Device,S,10939036711,20130805,20991231,
...,...,...,...,...,...,...,...,...,...,...
1403226,37140214,sunscreen spf 30 3g/100g / 5g/100g / 10g/100g ...,Device,NDC,Device,S,80489023201,20240101,20991231,
1403227,37140215,sunscreen spf 30 3g/100g / 5g/100g / 10g/100g ...,Device,NDC,Device,S,80489023202,20240101,20991231,
1403228,37140216,sunscreen spf 50 3g/100g / 5g/100g / 10g/100g ...,Device,NDC,Device,S,80489023501,20240101,20991231,
1403229,37140217,sunscreen spf 50 3g/100g / 5g/100g / 10g/100g ...,Device,NDC,Device,S,80489023502,20240101,20991231,


In [6]:
concept_ndc_filtered = (concept_ndc
 .query("standard_concept!=standard_concept")
 .query("domain_id=='Drug'")
#  .query("vocabulary_id!='NDC'")
 .loc[:,["concept_id", "concept_name", "concept_class_id", "concept_code"]]
 )

concept_ndc_filtered

Unnamed: 0,concept_id,concept_name,concept_class_id,concept_code
0,36189414,hemorrhoidal cream 10mg/g / 144mg/g / 150mg/g ...,9-digit NDC,003630641
1,1220863,fulvestrant 250mg/5mL INTRAMUSCULAR INJECTION,9-digit NDC,167290436
2,35110579,"kali muriaticum, carbo vegetabilis, lung (suis...",11-digit NDC,43742164901
3,36321712,"pulsatilla (pratensis), euphorbium officinarum...",11-digit NDC,43742206101
4,36321592,"influenzinum (2022-2023), herpes simplex 1 nos...",11-digit NDC,43742206201
...,...,...,...,...
1403704,37143425,zolmitriptan 2.5 MG Oral Tablet,11-digit NDC,62332046206
1403705,37143427,zolmitriptan 5 MG Oral Tablet,11-digit NDC,62332046303
1403706,37143429,zolpidem tartrate 10 MG Oral Tablet,11-digit NDC,72789032314
1403707,37143426,zolmitriptan 2.5 MG Disintegrating Oral Tablet,11-digit NDC,62332018116


In [7]:
# Test vectorizing on a one-row sample first

(concept_ndc_filtered.sample(1)).to_csv(ndc_dir + "/CONCEPT_FILTERED.csv", index=False, sep="\t")


In [8]:
loader = CSVLoader(file_path=ndc_dir + "/CONCEPT_FILTERED.csv", 
                   source_column="concept_class_id",
                   metadata_columns=["concept_id", "concept_code", "concept_class_id"],
                   csv_args={'delimiter':'\t'}
                             )
ndc_loaded = loader.load()

In [9]:
print(len(ndc_loaded))
ndc_loaded[:3]

1


[Document(page_content='concept_name: Chlordiazepoxide Hydrochloride 5 MG Oral Capsule', metadata={'source': '11-digit NDC', 'row': 0, 'concept_id': '44977739', 'concept_code': '00771061004', 'concept_class_id': '11-digit NDC'})]

In [10]:
ndc_loaded[0].page_content

'concept_name: Chlordiazepoxide Hydrochloride 5 MG Oral Capsule'
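
The metadata carried by each `Document` (`concept_id`, `concept_code`, `concept_class_id`) is what a metadata filter acts on: candidates are first restricted to rows whose metadata matches, and only those are ranked by vector similarity. The block below is a minimal, library-free sketch of that idea. The 3-dimensional vectors and the `filtered_search` helper are hypothetical and made up for illustration; the concept names and codes are copied from outputs later in this notebook. LangChain's Chroma wrapper exposes a comparable `filter` argument on its search methods, but this sketch does not call it.

```python
import numpy as np

# Toy records shaped like the loaded Documents. The vectors are made-up
# 3-d stand-ins for real 1536-d embeddings (assumption for illustration).
records = [
    {"text": "apalutamide 240mg/1 ORAL TABLET, FILM COATED",
     "metadata": {"concept_class_id": "9-digit NDC", "concept_code": "596760604"},
     "vector": np.array([0.9, 0.1, 0.0])},
    {"text": "apalutamide 240 MG Oral Tablet [Erleada]",
     "metadata": {"concept_class_id": "11-digit NDC", "concept_code": "59676060430"},
     "vector": np.array([0.8, 0.2, 0.0])},
    {"text": "tapentadol 100 MG Oral Tablet [Nucynta]",
     "metadata": {"concept_class_id": "11-digit NDC", "concept_code": "35356061160"},
     "vector": np.array([0.1, 0.9, 0.0])},
]


def cosine(a, b):
    """Cosine similarity between two 1-d arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def filtered_search(query_vector, records, where, k=2):
    """Keep records whose metadata matches every key in `where`, then rank by cosine similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:k]


query_vector = np.array([1.0, 0.0, 0.0])  # hypothetical embedding of "apalutamide"
hits = filtered_search(query_vector, records, {"concept_class_id": "11-digit NDC"})
for hit in hits:
    print(hit["metadata"]["concept_code"], hit["text"])
```

Filtering before ranking means the 9-digit entry never competes with the 11-digit ones, which is the finer-grained querying the linked article describes.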

### Embed NDC Sample Doc into Vector Store

In [11]:
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [140]:
# Embedding speed:
# 100 samples - Instantaneous
# 1000 samples - 5.8s
# 10000 samples - 50.5s
# 133224 samples - 11min 30s

ndc_db = Chroma.from_documents(ndc_loaded, embedding=embeddings_model, 
                              #  persist_directory="/Users/jzamalloa/Documents/PROJECTS/LLM/DBs/033024_ndc",
                               collection_metadata={"hnsw:space": "cosine"}
                               )

### Evaluate similarity

In [128]:
# Embedding from vector store
print(len(ndc_db.get(include=['embeddings'])["embeddings"][0]))
ndc_db.get(include=['embeddings'])["embeddings"][0][:3]

1536


[0.005644810386002064, 0.019334683194756508, -0.04998311027884483]

In [112]:
# Embedding from original document vector "page_content" only
doc_embed = embeddings_model.embed_documents(
    [ndc_loaded[0].page_content]
)

doc_embed[0][:3] 

[0.005644810395135876, 0.01933468322604177, -0.04998311035972201]

In [146]:
# Similarity between the embedding stored in the vector store and a fresh embedding of page_content alone (no metadata)
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances
import numpy as np
from numpy.linalg import norm

def cosine_similarity_manual(A, B):
    cosine = np.dot(A,B)/(norm(A)*norm(B))
    return cosine

print("Cosine Similarity: ", cosine_similarity(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([doc_embed[0]])
))  # Range -1 to 1: -1 = opposite vectors, 0 = orthogonal (no correlation), 1 = same direction

print("Cosine Similarity Manual: ", cosine_similarity_manual(
    ndc_db.get(include=['embeddings'])["embeddings"][0],
    doc_embed[0]
))

print("Cosine Distance: ", pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([doc_embed[0]]),
    metric="cosine"  # Cosine distance, range 0 to 2: 0 = identical direction, 1 = no correlation, 2 = opposite
))

print("L2 Distance: ", pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([doc_embed[0]]),
    metric="l2"  # L2 distance, range 0 to infinity: the larger the value, the farther apart the vectors
))


Cosine Similarity:  [[1.]]
Cosine Similarity Manual:  1.0000000000000004
Cosine Distance:  [[0.]]
L2 Distance:  [[0.]]


In [147]:
# Similarity of the sample embedded vector to a target query
sample_query = embeddings_model.embed_query("what are drugs associated with prostate cancer?")

print("Cosine Similarity: ",cosine_similarity(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([sample_query])
))

print("Cosine Similarity Manual: ", cosine_similarity_manual(
    ndc_db.get(include=['embeddings'])["embeddings"][0],
    sample_query
))

print("Cosine Distance: ",pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([sample_query]),
    metric="cosine"
))

print("L2 Distance: ",pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([sample_query]),
    metric="l2"
))


Cosine Similarity:  [[0.1386754]]
Cosine Similarity Manual:  0.1386753966698643
Cosine Distance:  [[0.8613246]]
L2 Distance:  [[1.31249732]]


### Test Vector DB Similarity search on target sample

In [142]:
ndc_db.similarity_search_with_score("what are drugs associated with prostate cancer?", 1)

[(Document(page_content='concept_name: Chlordiazepoxide Hydrochloride 5 MG Oral Capsule', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '00771061004', 'concept_id': '44977739', 'row': 0, 'source': '11-digit NDC'}),
  1.7226483821868896)]

In [141]:
ndc_db.similarity_search_with_relevance_scores("what are drugs associated with prostate cancer?", 1)



[(Document(page_content='concept_name: Chlordiazepoxide Hydrochloride 5 MG Oral Capsule', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '00771061004', 'concept_id': '44977739', 'row': 0, 'source': '11-digit NDC'}),
  -0.7226483821868896)]

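The score from `similarity_search_with_score` is not the plain cosine DISTANCE computed earlier: it is exactly **twice** that value (2 × 0.8613 ≈ 1.7226). It is also the square of the L2 distance computed earlier (1.3125² ≈ 1.7226). For unit-length vectors, squared L2 distance equals 2 × cosine distance, so this suggests the collection is scoring by squared L2 distance (Chroma's default space) rather than by cosine distance, even though `hnsw:space` was set to `cosine`.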

In [153]:
pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([sample_query]),
    metric="cosine" #Distance
)*2

array([[1.72264921]])

And the value from `similarity_search_with_relevance_scores` is just `1 - score`, where the score is the doubled cosine distance above (not `1 - cosine distance`):

In [149]:
1 - (
   pairwise_distances(
    np.array([ndc_db.get(include=['embeddings'])["embeddings"][0]]),
    np.array([sample_query]),
    metric="cosine"
    )*2 
)

array([[-0.72264921]])
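
The factor of 2 follows from an identity for unit-length vectors: ||a − b||² = 2 − 2(a · b) = 2 × (1 − cosine similarity). The sketch below checks this numerically on random unit vectors, which stand in for real embeddings (the manual cosine similarity of ~1.0 computed earlier is consistent with the embeddings having unit length). It also checks the two numbers printed earlier in this notebook against each other. No vector store is involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors scaled to unit length, standing in for embeddings
# (assumption: the real embeddings are unit-normalized).
a = rng.normal(size=1536)
b = rng.normal(size=1536)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_distance = 1.0 - float(np.dot(a, b))
squared_l2 = float(np.sum((a - b) ** 2))

# Identity for unit vectors: squared L2 distance == 2 * cosine distance
print("identity holds:", np.isclose(squared_l2, 2 * cosine_distance))

# Values copied from the outputs above: the L2 distance and twice the cosine distance.
l2_from_notebook = 1.31249732
doubled_cosine_from_notebook = 1.72264921
print("notebook values agree:", np.isclose(l2_from_notebook ** 2, doubled_cosine_from_notebook, atol=1e-6))
```

If the collection really were scoring by cosine distance, the score for this query would be expected to sit near 0.86 rather than 1.72.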

### Now Test Vector Store Embedding on the Full Corpus of Interest

In [169]:
(pd.read_csv( ndc_dir + "/CONCEPT_FILTERED.csv", sep="\t")
 .query("concept_name.str.upper().str.contains('AMIVAN')", engine="python")
 )



Unnamed: 0,concept_id,concept_name,concept_class_id,concept_code
90743,36116548,7 ML amivantamab-vmjw 50 MG/ML Injection [Rybr...,11-digit NDC,57894050101
133181,36115799,amivantamab 350mg/1 INTRAVENOUS INJECTION,9-digit NDC,578940501
133182,36116547,7 ML amivantamab-vmjw 50 MG/ML Injection [Rybr...,11-digit NDC,57894050100


In [152]:
# Write rows of interest to CSV
(
    pd.concat(
        [
            (concept_ndc_filtered
             .sample(100000)
             ),
            (concept_ndc_filtered
             .query("concept_name.str.upper().str.contains('TINIB')", engine="python")
             ),
             (concept_ndc_filtered
             .query("concept_name.str.upper().str.contains('MIDE')", engine="python")
             ),
             (concept_ndc_filtered
             .query("concept_name.str.upper().str.contains('AMAB')", engine="python")
             )
        ]
    )
    .drop_duplicates()
).to_csv(ndc_dir + "/CONCEPT_FILTERED.csv", index=False, sep="\t")

# Load CSVLoader object from file
loader = CSVLoader(file_path=ndc_dir + "/CONCEPT_FILTERED.csv", 
                   source_column="concept_class_id",
                   metadata_columns=["concept_id", "concept_code", "concept_class_id"],
                   csv_args={'delimiter':'\t'}
                             )
ndc_loaded = loader.load()

# Write to Vector DB
ndc_db = Chroma.from_documents(ndc_loaded, embedding=embeddings_model, 
                              #  persist_directory="/Users/jzamalloa/Documents/PROJECTS/LLM/DBs/033024_ndc",
                               collection_metadata={"hnsw:space": "cosine"}
                               )

### Test Vector Querying for Similar Target Terms
(A lower distance score means a closer match)

In [156]:
ndc_db.similarity_search_with_score("what are drugs associated with prostate cancer?", 5)

[(Document(page_content='concept_name: PROSTATE - petroselinum sativum, populus tremulodies, sabal serrulata, chimaphila umbellata, adenosinum triphosphoricum dinatrum, equol, kreosotum, nadidum, testosterone, succinicum acidum, hepar sulphuris calcareum, prostate nosode, conium maculatum, pro', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '43742046901', 'concept_id': '46335386', 'row': 7204, 'source': '11-digit NDC'}),
  1.092272162437439),
 (Document(page_content='concept_name: PROSTATE - thyroidinum, baryta carb., berber. vulg., bryonia, calc. carb., cinchona, conium, digitalis, ferrum picricum, hydrastis, iodium, lycopodium, nux vom., pareira, pulsatilla, sabal, selenium, staphysag., thuja occ.,trifolium pratense, echinacea, l', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '55714224200', 'concept_id': '46350021', 'row': 49621, 'source': '11-digit NDC'}),
  1.1388591527938843),
 (Document(page_content='concept_name: PROSTAZEN - enlarged prostate, pr

In [157]:
ndc_db.similarity_search_with_relevance_scores("what are drugs associated with prostate cancer?", 5)



[(Document(page_content='concept_name: PROSTATE - petroselinum sativum, populus tremulodies, sabal serrulata, chimaphila umbellata, adenosinum triphosphoricum dinatrum, equol, kreosotum, nadidum, testosterone, succinicum acidum, hepar sulphuris calcareum, prostate nosode, conium maculatum, pro', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '43742046901', 'concept_id': '46335386', 'row': 7204, 'source': '11-digit NDC'}),
  -0.09220409393310547),
 (Document(page_content='concept_name: PROSTATE - thyroidinum, baryta carb., berber. vulg., bryonia, calc. carb., cinchona, conium, digitalis, ferrum picricum, hydrastis, iodium, lycopodium, nux vom., pareira, pulsatilla, sabal, selenium, staphysag., thuja occ.,trifolium pratense, echinacea, l', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '55714224200', 'concept_id': '46350021', 'row': 49621, 'source': '11-digit NDC'}),
  -0.13880932331085205),
 (Document(page_content='concept_name: PROSTAZEN - enlarged prostat

**Twice the cosine distance reproduces the top score above:**

In [175]:
# Test cosine distance
pairwise_distances(
    np.array([embeddings_model.embed_query("concept_name: PROSTATE - petroselinum sativum, populus tremulodies, sabal serrulata, chimaphila umbellata, adenosinum triphosphoricum dinatrum, equol, kreosotum, nadidum, testosterone, succinicum acidum, hepar sulphuris calcareum, prostate nosode, conium maculatum, pro")]),
    np.array([embeddings_model.embed_query("what are drugs associated with prostate cancer?")]),
    metric="cosine" #Distance
)*2

array([[1.09238066]])

**When using similarity search directly, the query has to be very close to the exact term; wrapping the term in a full-sentence question can return unrelated results**

In [176]:
ndc_db.similarity_search_with_score("what are drugs associated with amivantamab?")

[(Document(page_content='concept_name: acetaminophen, doxylamine succinate, and dextromethorphan hydrobromide 15mg/1 / 325mg/1 / 6.25mg/1 ORAL CAPSULE, LIQUID FILLED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '682101500', 'concept_id': '44939863', 'row': 126559, 'source': '9-digit NDC'}),
  1.2495418787002563),
 (Document(page_content='concept_name: acetaminophen, dextromethorphan hydrobromide, and doxylamine succinate 15mg/1 / 325mg/1 / 6.5mg/1 ORAL CAPSULE, LIQUID FILLED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '301420787', 'concept_id': '36398336', 'row': 110747, 'source': '9-digit NDC'}),
  1.2518198490142822),
 (Document(page_content='concept_name: acetaminophen, dextromethorphan hydrobromide, and doxylamine succinate 15mg/1 / 325mg/1 / 6.5mg/1 ORAL CAPSULE, LIQUID FILLED', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '30142078748', 'concept_id': '36397955', 'row': 110748, 'source': '11-digit NDC'}),
  1.2518447637557983),

In [219]:
ndc_db.similarity_search_with_score("amivantamab")

[(Document(page_content='concept_name: amivantamab 350mg/1 INTRAVENOUS INJECTION', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '578940501', 'concept_id': '36115799', 'row': 133181, 'source': '9-digit NDC'}),
  0.6727580428123474),
 (Document(page_content='concept_name: 7 ML amivantamab-vmjw 50 MG/ML Injection [Rybrevant]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '57894050101', 'concept_id': '36116548', 'row': 90743, 'source': '11-digit NDC'}),
  0.8070242404937744),
 (Document(page_content='concept_name: 7 ML amivantamab-vmjw 50 MG/ML Injection [Rybrevant]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '57894050100', 'concept_id': '36116547', 'row': 133182, 'source': '11-digit NDC'}),
  0.8070434927940369),
 (Document(page_content='concept_name: 0.05 ML faricimab-svoa 120 MG/ML Injection [Vabysmo]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '50242009677', 'concept_id': '1985284', 'row': 20585, 'source': '11-dig

In [221]:
ndc_db.similarity_search_with_score("apalutamide",6)

[(Document(page_content='concept_name: apalutamide 240mg/1 ORAL TABLET, FILM COATED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '596760604', 'concept_id': '36353102', 'row': 121741, 'source': '9-digit NDC'}),
  0.7026128768920898),
 (Document(page_content='concept_name: apalutamide 240 MG Oral Tablet [Erleada]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '59676060430', 'concept_id': '36353428', 'row': 121743, 'source': '11-digit NDC'}),
  0.7085045576095581),
 (Document(page_content='concept_name: apalutamide 240 MG Oral Tablet [Erleada]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '59676060414', 'concept_id': '36353427', 'row': 121742, 'source': '11-digit NDC'}),
  0.7085045576095581),
 (Document(page_content='concept_name: apalutamide 60mg/1 ORAL TABLET, FILM COATED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '596760600', 'concept_id': '42836263', 'row': 121737, 'source': '9-digit NDC'}),
  0.7122552990913391

In [203]:
ndc_db.similarity_search_with_score("I want to find all the NDC codes associated with Apalutamide. List all of them.")

[(Document(page_content='concept_name: tapentadol 100 MG Oral Tablet [Nucynta]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '35356061160', 'concept_id': '45339905', 'row': 8818, 'source': '11-digit NDC'}),
  1.2002676725387573),
 (Document(page_content='concept_name: tapentadol 100 MG Oral Tablet [Nucynta]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '50458084004', 'concept_id': '45376111', 'row': 36122, 'source': '11-digit NDC'}),
  1.2002676725387573),
 (Document(page_content='concept_name: tapentadol 100 MG Oral Tablet [Nucynta]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '16590028960', 'concept_id': '44861657', 'row': 32324, 'source': '11-digit NDC'}),
  1.2002676725387573),
 (Document(page_content='concept_name: 24 HR Nifedipine 60 MG Extended Release Oral Tablet [Afeditab CR]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '43353031160', 'concept_id': '45238612', 'row': 25849, 'source': '11-digit NDC'}),
  

#### We need [hybrid search](https://weaviate.io/blog/hybrid-search-explained), which combines sparse (keyword) and dense (vector) search
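
To make the idea concrete, here is a minimal sketch of hybrid scoring: a sparse keyword score and a dense similarity score are each min-max normalized and then blended with a weight `alpha`. This is an illustration only, not Weaviate's implementation. The `keyword_score` helper is a crude stand-in for BM25, the dense scores are hypothetical values invented for this example, and the document names are copied from the outputs above.

```python
def keyword_score(query, text):
    """Fraction of query tokens found in the text (crude stand-in for BM25)."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


def min_max(values):
    """Rescale values to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query, docs, dense_scores, alpha=0.5):
    """Blend dense and sparse scores: alpha=1 is pure dense, alpha=0 is pure keyword."""
    sparse = min_max([keyword_score(query, d) for d in docs])
    dense = min_max(dense_scores)
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(zip(docs, fused), key=lambda pair: pair[1], reverse=True)


docs = [
    "apalutamide 240 MG Oral Tablet [Erleada]",
    "tapentadol 100 MG Oral Tablet [Nucynta]",
    "24 HR Nifedipine 60 MG Extended Release Oral Tablet [Afeditab CR]",
]
# Hypothetical dense similarities for a long, sentence-style query
# (higher is better); the unrelated drug narrowly wins on dense alone.
dense_scores = [0.39, 0.40, 0.38]
query = "list all NDC codes associated with apalutamide"

for doc, score in hybrid_rank(query, docs, dense_scores, alpha=0.5):
    print(round(score, 3), doc)
```

With `alpha=1.0` the ranking falls back to the dense scores alone and tapentadol comes first; adding the keyword component pulls the exact-term match to the top, which is the behavior the sentence-style queries above were missing.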

### Test Retriever (Specific Instructions)

In [225]:
# ndc_retriever = ndc_db.as_retriever(search_kwargs={"k": 5})
ndc_retriever = ndc_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.01})

In [226]:
ndc_retriever.get_relevant_documents("what are drugs associated with prostate cancer?")



[]

In [215]:
ndc_retriever.get_relevant_documents("retrieve the code associated with apalutamide, be very specific about them")



[]

In [216]:
ndc_retriever.get_relevant_documents("apalutamide")

[Document(page_content='concept_name: apalutamide 240mg/1 ORAL TABLET, FILM COATED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '596760604', 'concept_id': '36353102', 'row': 121741, 'source': '9-digit NDC'}),
 Document(page_content='concept_name: apalutamide 240 MG Oral Tablet [Erleada]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '59676060430', 'concept_id': '36353428', 'row': 121743, 'source': '11-digit NDC'}),
 Document(page_content='concept_name: apalutamide 240 MG Oral Tablet [Erleada]', metadata={'concept_class_id': '11-digit NDC', 'concept_code': '59676060414', 'concept_id': '36353427', 'row': 121742, 'source': '11-digit NDC'}),
 Document(page_content='concept_name: apalutamide 60mg/1 ORAL TABLET, FILM COATED', metadata={'concept_class_id': '9-digit NDC', 'concept_code': '596760600', 'concept_id': '42836263', 'row': 121737, 'source': '9-digit NDC'})]
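
The empty results for the sentence-style queries are consistent with the relevance arithmetic seen earlier. The sketch below assumes the retriever keeps a document only when `1 - score` is at least `score_threshold`, which matches the relationship observed in this notebook; the scores are copied from the `similarity_search_with_score` outputs above, and the `relevance` helper is hypothetical.

```python
# Top scores copied from the similarity_search_with_score outputs above.
top_scores = {
    "what are drugs associated with prostate cancer?": 1.092272162437439,
    "apalutamide": 0.7026128768920898,
}
score_threshold = 0.01  # same value passed to as_retriever above


def relevance(score):
    """Assumed relationship observed in this notebook: relevance = 1 - score."""
    return 1.0 - score


for query, score in top_scores.items():
    rel = relevance(score)
    kept = rel >= score_threshold
    print(f"{query!r}: relevance={rel:.4f}, kept={kept}")
```

Because the scores here exceed 1 for loosely matching queries, their relevance goes negative and even a threshold as low as 0.01 filters everything out.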

In [194]:
template = """Answer the question based only on the following context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join([f"The {d.page_content} corresponds to the NDC code: {d.metadata['concept_code']}" for d in docs])


chain = (
    {"context": ndc_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [195]:
chain.invoke("what are drugs associated with prostate cancer?")

'Based on the provided context, none of the listed concept_names explicitly mention being associated with the treatment or management of prostate cancer. They seem to be more focused on general prostate health, symptoms of prostate enlargement (BPH), and other prostate-related conditions. Therefore, based on the given context, there are no drugs specifically associated with prostate cancer mentioned.'

In [196]:
chain.invoke("what are drugs associated with lung cancer?")

'Based on the provided context, none of the mentioned products (LUNG DROPS 9604, sticta pulmonaria 1[hp_X]/1 ORAL TABLET, lung stim liquescence, lobaria pulmonaria 30[hp_C]/mL ORAL LIQUID) are directly indicated as being associated with the treatment, management, or prevention of lung cancer. These products seem to be homeopathic remedies, which are typically used for various health conditions but not specifically or scientifically proven for treating lung cancer.'

In [198]:
chain.invoke("what are the NDC codes associated with amivantamab?")

'Based on the provided context, there is no information given about amivantamab or its associated NDC codes. The context only provides information about aminopentamide sulfate and its corresponding NDC codes.'

In [199]:
chain.invoke("what are the NDC codes associated with apalutamide?")

'Based on the context provided, there is no information about apalutamide or its associated NDC codes. The context only provides information about naproxen sodium 220 MG in various forms and their corresponding NDC codes.'

### TRY HYBRID SEARCH

Chroma does not support hybrid search, but `Weaviate` does. We'll use Weaviate moving forward.

Write the previously loaded documents to a Weaviate vector DB

In [1]:
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

In [None]:
# Assumption: a Weaviate instance is running locally on the default ports
weaviate_client = weaviate.connect_to_local()

# Write the loaded documents to Weaviate instead of Chroma
ndc_db = WeaviateVectorStore.from_documents(ndc_loaded, embeddings_model, client=weaviate_client)