# Revised Model - Semantic Search

Download the dataset from Hugging Face

In [1]:
from datasets import load_dataset

dataset = load_dataset("Jaymax/FDA_Pharmaceuticals_FAQ", split="train")


In [2]:
dataset

Dataset({
    features: ['Question', 'Answer'],
    num_rows: 1433
})

## Elasticsearch Database

First, we connect to the Elasticsearch instance and create an index with dense-vector fields for the question and answer embeddings

In [3]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

In [4]:
# Single shard and no replicas: a minimal single-node setup
settings = {"number_of_shards": 1, "number_of_replicas": 0}

mappings = {
    "dynamic": True,
    "numeric_detection": True,
    "_source": {"enabled": True},
    "properties": {
        "Answer": {"type": "text"},
        "Question": {
            "type": "text",
        },
        "QuestionVector": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",
        },
        "AnswerVector": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",
        },
    },
}


index_name = "pharma_embed"
# Recreate the index from scratch; deleting it also removes every document indexed earlier
if es_client.indices.exists(index=index_name):
    es_client.indices.delete(index=index_name)
es_client.indices.create(index=index_name, settings=settings, mappings=mappings)


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'pharma_embed'})

In [5]:
dataset[0]

{'Question': 'Taking into account the content of Q7 Good Manufacturing Practice Guidance for Active Pharmaceutical Ingredients Guidance for Industry , Would additional process validation studies be needed to support a change in the source of an API starting material?',
 'Answer': 'Any change in the API starting material should be assessed for impact on the API manufacturing process and the resulting API quality (ICH Q7, paragraph 7.14). Additional validation studies of the API process may be warranted if the change in the API starting material is deemed significant. In most cases, validation would be expected for a different source of the starting material unless otherwise justified (ICH Q7, paragraphs 12.1, 13.13).'}

## Embeddings

We instantiate our embedding model. Its output vectors have 768 dimensions, which matches the `dims` value in the index mapping

In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")

We add the question and answer embeddings to each record and ingest the records into the index

In [None]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(dataset):
    # Encode question and answer separately; .tolist() makes the vectors JSON-serializable
    doc["QuestionVector"] = model.encode(doc["Question"]).tolist()
    doc["AnswerVector"] = model.encode(doc["Answer"]).tolist()
    documents.append(doc)
    # One indexing request per document
    es_client.index(index=index_name, document=doc)
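
The loop above sends one request per record. As a minimal sketch of an alternative, the records could be sent in batches with `elasticsearch.helpers.bulk`. The function names `build_actions` and `ingest_bulk` and the `chunk_size` default are assumptions made for illustration and are not part of the original notebook.

```python
from typing import Any, Dict, Iterable, Iterator


def build_actions(
    docs: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per document (hypothetical helper)."""
    for doc in docs:
        # "_index" names the target index, "_source" carries the document body.
        yield {"_index": index_name, "_source": doc}


def ingest_bulk(es_client, docs, index_name: str, chunk_size: int = 200):
    """Index all documents in batches instead of one request per document."""
    # Imported lazily so build_actions can be used without the client installed.
    from elasticsearch.helpers import bulk

    # bulk returns a tuple: (number of successful actions, list of errors).
    return bulk(es_client, build_actions(docs, index_name), chunk_size=chunk_size)
```

With the embedded records collected in `documents`, a call such as `ingest_bulk(es_client, documents, index_name)` would replace the per-document `es_client.index` calls.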

In [7]:
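import pickle

# Uncomment to save the embedded documents to a file
# with open("embeddings.pkl", "wb") as f:
#     pickle.dump(documents, f)

# Load previously computed embeddings.
# This restores the Python list only; it does not index anything into Elasticsearch.
with open('embeddings.pkl', 'rb') as f:
    documents = pickle.load(f)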

In [8]:
len(documents[0]['QuestionVector'])

768
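
The index mapping sets `"similarity": "cosine"` for both vector fields, so these 768-dimensional vectors are compared by the angle between them rather than by their magnitude. The following is a small pure-Python sketch of that measure for illustration. The function name `cosine_similarity` is an assumption, and this is not the code Elasticsearch runs internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The measure is undefined when either vector has zero magnitude.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(documents[0]["QuestionVector"], documents[1]["QuestionVector"])` would compare the first two question embeddings. Note that for `cosine` fields the Elasticsearch documentation describes the kNN `_score` as `(1 + cosine) / 2`, so scores returned by a search are not the raw cosine value.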

In [9]:
query = "What is the goal for IVD studies?"
query_vector = model.encode(query)
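
The encoded query can be matched against the stored `QuestionVector` field with Elasticsearch's approximate kNN search. The retrieval helper used later in this notebook is not shown, so the following is only a sketch of how such a lookup might be assembled. The function name `build_knn_query` and the default values for `k` and `num_candidates` are assumptions.

```python
from typing import Any, Dict, Sequence


def build_knn_query(
    query_vector: Sequence[float],
    field: str = "QuestionVector",
    k: int = 3,
    num_candidates: int = 50,
) -> Dict[str, Any]:
    """Build the body of a kNN search clause (hypothetical helper)."""
    if k > num_candidates:
        raise ValueError("k must not exceed num_candidates")
    return {
        "field": field,
        # Convert NumPy scalars to plain floats so the body is JSON-serializable.
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }
```

Assuming an 8.x Python client, the clause could be passed as `es_client.search(index=index_name, knn=build_knn_query(query_vector), source=["Question", "Answer"])`, and the matching records read from `response["hits"]["hits"]`.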

In [11]:
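# wrap and rag_vector are helpers from the project's local utils module (not shown in this notebook)
from utils import wrap, rag_vector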

answer = rag_vector(query, query_vector, es_client, index_name)
print(wrap(answer))

Unfortunately, I don't have any context to draw from. Please provide
the context from the FAQ database, and I'll be happy to answer the
question based on the facts provided.
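
The reply above suggests that no context reached the model. One possible cause, offered here as an assumption rather than a diagnosis, is an empty index: the index is deleted and recreated in cell `In [4]`, and the ingestion cell is marked `In [None]`, which usually means it was not executed in this run. A quick check of the document count can confirm or rule this out. The function name `count_indexed_docs` is hypothetical.

```python
def count_indexed_docs(es_client, index_name: str) -> int:
    """Return how many documents the index currently holds (hypothetical helper)."""
    # Refresh first so recently indexed documents are visible to count and search.
    es_client.indices.refresh(index=index_name)
    return es_client.count(index=index_name)["count"]
```

If `count_indexed_docs(es_client, index_name)` returns 0, the ingestion loop would need to be run again before querying. A value equal to the size of the training split would point to a problem elsewhere, for example in the retrieval helper.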


In [12]:
dataset_test = load_dataset("Jaymax/FDA_Pharmaceuticals_FAQ", split="test")

In [13]:
dataset_test[0]

{'Question': 'As described in Assessing User Fees Under the Generic Drug User Fee Amendments of 2022 , Do DMF holders need to wait for a new ANDA applicant to request a letter of authorization before the DMF is assessed to be available for reference?',
 'Answer': 'No. DMF holders can pay the fee before a letter of authorization is requested. The DMF will then undergo an initial completeness assessment, using factors articulated in the final guidance _Completeness Assessments for Type II Active Pharmaceutical Ingredient Drug Master Files Under the Generic Drug User Fee Amendments_. If the DMF passes the initial completeness assessment, FDA will identify the DMF on the Type II Drug Master Files - Available for Reference List.'}

In [14]:
query = dataset_test[0]['Question']
query_vector = model.encode(query)
answer = rag_vector(query, query_vector, es_client, index_name)
print('Generated Answer')
print('-'*8)
print(wrap(answer))
print('*'*8)

print('Ground Truth')
print('-'*8)
print(wrap(dataset_test[0]['Answer']))


Generated Answer
--------
According to the Generic Drug User Fee Amendments of 2022, DMF holders
do not need to wait for a new ANDA applicant to request a letter of
authorization before the DMF is assessed to be available for
reference.
********
Ground Truth
--------
No. DMF holders can pay the fee before a letter of authorization is
requested. The DMF will then undergo an initial completeness
assessment, using factors articulated in the final guidance
_Completeness Assessments for Type II Active Pharmaceutical Ingredient
Drug Master Files Under the Generic Drug User Fee Amendments_. If the
DMF passes the initial completeness assessment, FDA will identify the
DMF on the Type II Drug Master Files - Available for Reference List.
