# Baseline - Using Keyword Search

Download the dataset from the Hugging Face Hub.

In [1]:
from datasets import load_dataset

dataset = load_dataset("Jaymax/FDA_Pharmaceuticals_FAQ", split="train")


In [2]:
dataset

Dataset({
    features: ['Question', 'Answer'],
    num_rows: 1433
})

## Elasticsearch Database

First, we connect to the Elasticsearch instance running locally, then create an index for the FAQ data.

In [3]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

In [4]:
settings = {"number_of_shards": 1, "number_of_replicas": 0}

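# Field names match the dataset columns ("Question", "Answer"). Elasticsearch
# field names are case-sensitive, so lowercase names would never be populated.
mappings = {
    "dynamic": True,
    "numeric_detection": True,
    "_source": {"enabled": True},
    "properties": {
        "Question": {"type": "text"},
        "Answer": {"type": "text"},
    },
}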

index_name = "pharma"
if es_client.indices.exists(index=index_name):
    es_client.indices.delete(index=index_name)
es_client.indices.create(index=index_name, settings=settings, mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'pharma'})

Then we ingest the dataset into the Elasticsearch index, one document per request.

In [5]:
from tqdm.auto import tqdm

for doc in tqdm(dataset):
    es_client.index(index=index_name, document=doc)

100%|██████████| 1433/1433 [00:32<00:00, 43.60it/s]
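
Indexing one document per request took about half a minute for 1,433 rows, as the progress bar above shows. A possible alternative is the `bulk` helper from `elasticsearch.helpers`, which sends many documents per request. The sketch below is not part of the original notebook; `build_actions` and `bulk_ingest` are hypothetical names introduced here for illustration.

```python
def build_actions(docs, index_name):
    """Yield one bulk action per document (hypothetical helper, not in the notebook)."""
    for doc in docs:
        # "_index" names the target index; "_source" carries the document body.
        yield {"_index": index_name, "_source": doc}


def bulk_ingest(es_client, docs, index_name):
    """Send all documents with the bulk helper instead of one request each."""
    # Imported lazily so build_actions can be used without the client installed.
    from elasticsearch.helpers import bulk

    # bulk returns a (success_count, errors) tuple by default.
    return bulk(es_client, build_actions(docs, index_name))
```

With the objects defined earlier, the call would be `bulk_ingest(es_client, dataset, index_name)`.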


## Perform Search

In our `utils.py` file, we have created helper functions that:
- run a keyword search over the Elasticsearch index
- build a prompt for the LLM from the relevant documents in the search results
- invoke the LLM to generate the answer
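
The contents of `utils.py` are not shown here, so the following is only a rough sketch of how the three steps could fit together, not the actual implementation. The `multi_match` query, the prompt wording, the `size` default, the `llm` callable, and the name `rag_sketch` are all assumptions made for illustration. The real `rag` takes only `(query, es_client, index_name)`.

```python
import textwrap


def search(query, es_client, index_name, size=5):
    """Keyword search over both text fields; returns the stored documents."""
    response = es_client.search(
        index=index_name,
        # Assumed query shape: match the query text against either field.
        query={"multi_match": {"query": query, "fields": ["Question", "Answer"]}},
        size=size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]


def build_prompt(query, docs):
    """Put the retrieved question/answer pairs into the prompt as context."""
    context = "\n\n".join(
        f"Q: {doc['Question']}\nA: {doc['Answer']}" for doc in docs
    )
    return (
        "Answer the QUESTION using only the CONTEXT below.\n\n"
        f"QUESTION: {query}\n\nCONTEXT:\n{context}"
    )


def rag_sketch(query, es_client, index_name, llm):
    """Retrieve, build the prompt, then call `llm` (any callable: prompt -> text)."""
    docs = search(query, es_client, index_name)
    return llm(build_prompt(query, docs))


def wrap(text, width=70):
    """Wrap each line separately so paragraph and bullet breaks survive."""
    return "\n".join(textwrap.fill(line, width=width) for line in text.splitlines())
```

Passing the LLM in as a callable keeps the sketch independent of any particular model provider.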

In [6]:
from utils import wrap, rag

query = "What is the goal for IVD studies?"
answer = rag(query, es_client, index_name)
print(wrap(answer))

The goals for IVD studies are the same as the goals for other device
studies. The goals are to:

* Produce valid scientific evidence demonstrating reasonable assurance
of the safety and effectiveness of the product.
* Protect the rights and welfare of study subjects.

Next, we load the test split and compare a generated answer with its ground-truth answer.


In [7]:
dataset_test = load_dataset("Jaymax/FDA_Pharmaceuticals_FAQ", split="test")

In [8]:
dataset_test[0]

{'Question': 'As described in Assessing User Fees Under the Generic Drug User Fee Amendments of 2022 , Do DMF holders need to wait for a new ANDA applicant to request a letter of authorization before the DMF is assessed to be available for reference?',
 'Answer': 'No. DMF holders can pay the fee before a letter of authorization is requested. The DMF will then undergo an initial completeness assessment, using factors articulated in the final guidance _Completeness Assessments for Type II Active Pharmaceutical Ingredient Drug Master Files Under the Generic Drug User Fee Amendments_. If the DMF passes the initial completeness assessment, FDA will identify the DMF on the Type II Drug Master Files - Available for Reference List.'}

In [9]:
query = dataset_test[0]['Question']
answer = rag(query, es_client, index_name)
print('Generated Answer')
print('-'*8)
print(wrap(answer))
print('*'*8)

print('Ground Truth')
print('-'*8)
print(wrap(dataset_test[0]['Answer']))


Generated Answer
--------
No, the context does not explicitly state that DMF holders need to
wait for a new ANDA applicant to request a letter of authorization
before the DMF is assessed to be available for reference. However, it
does state that the DMF fee is incurred the first time a generic drug
submission references that DMF by an initial letter of authorization
on or after October 1, 2012.
********
Ground Truth
--------
No. DMF holders can pay the fee before a letter of authorization is
requested. The DMF will then undergo an initial completeness
assessment, using factors articulated in the final guidance
_Completeness Assessments for Type II Active Pharmaceutical Ingredient
Drug Master Files Under the Generic Drug User Fee Amendments_. If the
DMF passes the initial completeness assessment, FDA will identify the
DMF on the Type II Drug Master Files - Available for Reference List.
