In [1]:
import uuid

import numpy as np
import pandas as pd
import qdrant_client
import torch

from qdrant_client.models import Distance, VectorParams, PointStruct, ScoredPoint
from transformers import AutoTokenizer, AutoModel

# Missing values
I will check for missing values before the train-test split. If there are only a few, I will simply remove them from the dataset, which is fine for this type of task because we need to create a reliable knowledge base.

In [2]:
df = pd.read_csv('data/mle_screening_dataset.csv')
print(df.shape)

(16406, 2)


In [3]:
df.isna().sum()

question    0
answer      5
dtype: int64

In [4]:
# 5 rows have a question but no answer (missing data); the dataset is relatively large and will serve as a knowledge base, so we can drop them
df[df['answer'].isna()]

Unnamed: 0,question,answer
3587,What is (are) HELLP syndrome ?,
3836,What is (are) X-linked lymphoproliferative syn...,
4196,What is (are) Familial HDL deficiency ?,
4429,What is (are) Emery-Dreifuss muscular dystroph...,
6689,What is (are) Emery-Dreifuss muscular dystroph...,


In [5]:
df = df[~df['answer'].isna()].copy()

# Train-test split? Assumptions about the structure of the dataset and my solution

Since I have limited time and resources, and the aim of this task (as I assume) is to show general knowledge of how to solve such problems, my solution uses a vector database to retrieve the most relevant documents and a Hugging Face model for extractive question answering (finding the most relevant span of the retrieved text). The solution therefore involves no training or fine-tuning of any model, so a train-test split is not needed.

In [6]:
df.head()

Unnamed: 0,question,answer
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...
1,What is (are) Glaucoma ?,The optic nerve is a bundle of more than 1 mil...
2,What is (are) Glaucoma ?,Open-angle glaucoma is the most common form of...
3,Who is at risk for Glaucoma? ?,Anyone can develop glaucoma. Some people are a...
4,How to prevent Glaucoma ?,"At this time, we do not know how to prevent gl..."


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16401 entries, 0 to 16405
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  16401 non-null  object
 1   answer    16401 non-null  object
dtypes: object(2)
memory usage: 384.4+ KB


In [8]:
# some questions appear in several rows, each with a different answer
# only part of an "answer" is needed as the actual answer, since some documents are very long
# some "answer" values are irrelevant to the question
for elem in df[df['question'] == 'What is (are) Glaucoma ?']['answer'].values:
    print(elem)
    print('====================================================')

Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.)  See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more.  See a glossary of glaucoma terms.
The optic nerve is a bundle of more than 1 million nerve fibers. It connects the retina to the brain.
Open-angle glaucoma is the most common form of glaucoma. In the normal eye, the clear fluid leaves the anterior chamber at the open angle where the cornea and iris meet. When the fluid reaches the angle, it flows through a spongy meshwork, like a drain, and leaves 

In [9]:
# some values in the answer column are very long; only part of each is needed to answer the question
for i in np.random.choice(df.index, 30):
    # df.index holds labels (rows were dropped above), so look up by label with .loc
    print(df.loc[i, 'question'])
    print()
    print(df.loc[i, 'answer'])
    print()
    print('=============================================================================================')
    print()

What are the symptoms of Autosomal recessive axonal neuropathy with neuromyotonia ?

What are the signs and symptoms of Autosomal recessive axonal neuropathy with neuromyotonia? The Human Phenotype Ontology provides the following list of signs and symptoms for Autosomal recessive axonal neuropathy with neuromyotonia. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Abnormality of the foot - Autosomal recessive inheritance - Elevated serum creatine phosphokinase - Fasciculations - Foot dorsiflexor weakness - Hyperhidrosis - Muscle cramps - Muscle stiffness - Myokymia - Myotonia - Progressive - Sensory axonal neuropathy - Skeletal muscle atrophy - The Human Phenotype Ontology (HPO) has collected information on how often a sign or symptom occurs in a condit

In [10]:
df.shape

(16401, 2)

In [11]:
df['question'].unique().shape

(14976,)

In [12]:
df['answer'].unique().shape

(15811,)

In [13]:
# some answers are very long, so we will need to split them into chunks
df['answer'].apply(len).max()

np.int64(29046)
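
With the longest answer at about 29,000 characters, one option is to cut each answer into overlapping character windows before embedding. A minimal sketch, assuming a window of 500 characters and an overlap of 100 (illustrative values); `split_into_chunks` is a standalone helper name chosen for this example:

```python
def split_into_chunks(text, chunk_size=500, overlap=100):
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# toy document of 1200 characters
doc = "".join(str(i % 10) for i in range(1200))
chunks = split_into_chunks(doc)
print(len(chunks))               # number of windows
print([len(c) for c in chunks])  # length of each window
```

The overlap means a sentence cut at a window boundary still appears whole in the neighbouring window, at the cost of storing some text twice.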

# Vector database
We use a vector database to retrieve the answer chunks most similar to a query.
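
As a rough illustration of what the database computes under cosine distance, the sketch below ranks a few hand-made vectors against a query vector. The vectors and document names are made up for the example; a real vector database works on model embeddings and typically uses an index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (name, score) pairs with the highest similarity to query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# made-up 3-dimensional "embeddings"
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranking = top_k([1.0, 0.1, 0.0], toy_vectors)
print(ranking)
```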

In [14]:
class VectorDatabase:
    def __init__(self, collection_name="answers_collection", embed_model="sentence-transformers/all-distilroberta-v1"):
        """Vector database and embedding model initialization."""
        self.collection_name = collection_name
        self.client = qdrant_client.QdrantClient(":memory:")
    
        # embedding model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(embed_model)
        self.model = AutoModel.from_pretrained(embed_model)

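        # create collection (the in-memory client starts empty, so a plain create is enough;
        # recreate_collection is deprecated in recent qdrant-client versions)
        self.client.create_collection(
            collection_name=self.collection_name,
            vectors_config=VectorParams(size=768, distance=Distance.COSINE)  # 768-dim embeddings, cosine similarity
        )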

    def get_embedding(self, text):
        """Create embeddings for provided text."""
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

    def split_text(self, text, chunk_size=500, overlap=100):
        """Divides text into smaller chunks with overlapping."""
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

    def insert_answers(self, answers):
        """Insert answers into the Qdrant vector database."""
        points = []
        for answer in answers:
            chunks = self.split_text(answer)
            for chunk in chunks:
                embedding = self.get_embedding(chunk)
                points.append(
                    PointStruct(
                        id=str(uuid.uuid4()),  # unique id
                        vector=embedding.tolist(),  # embedding of the chunk
                        payload={"text": chunk, "full_text": answer}  # store original text (both chunk and full)
                    )
                )
        self.client.upsert(self.collection_name, points)

    def search_answer(self, query, top_k=3):
        """Search for the most relevant answers in the collection."""
        query_embedding = self.get_embedding(query)
        # query_points replaces the deprecated client.search; .points holds the scored hits
        search_results = self.client.query_points(
            collection_name=self.collection_name,
            query=query_embedding.tolist(),
            limit=top_k
        ).points
        return search_results
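
`get_embedding` above takes the first-token vector as the sentence embedding. The model card for `sentence-transformers/all-distilroberta-v1` instead describes mean pooling over the token embeddings, weighted by the attention mask, followed by L2 normalization, so the two approaches may give different retrieval quality. A minimal sketch of that pooling step, demonstrated on a dummy tensor so it runs without downloading the model (`mean_pool` is a name chosen here, not part of the notebook):

```python
import torch
import torch.nn.functional as F


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions, then L2-normalize."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1)
    return F.normalize(summed / counts, p=2, dim=1)


# dummy batch: 1 sequence, 3 token positions (the last one is padding), hidden size 2
hidden = torch.tensor([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

Inside `get_embedding` this would be called as `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`. Switching the pooling changes the embeddings, so the scores reported in this notebook would no longer apply.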

In [15]:
EMB_MODEL_NAME = "sentence-transformers/all-distilroberta-v1"
COLLECTION_NAME = "answers_collection"

vector_database = VectorDatabase(collection_name=COLLECTION_NAME, embed_model=EMB_MODEL_NAME)



In [16]:
# smaller sample for development
answers = df['answer'].unique().tolist()[:100]
answers[0]

"Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.)  See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more.  See a glossary of glaucoma terms."

In [17]:
# create vector database with answers sample
vector_database.insert_answers(answers)

In [18]:
# example usage:
search_query = "What are clinical trials?"
search_results = vector_database.search_answer(search_query)



In [19]:
# score, chunk of the text and full text for top 3 results
for result in search_results:
    print(f"🔍 Score: {result.score:.3f}")
    print('===========================================')
    print(f"Text Chunk:\n{result.payload['text']}")
    print('===========================================')
    print(f"Full Text:\n{result.payload['full_text']}")
    print()
    print()

🔍 Score: 0.709
Text Chunk:
Clinical trials are part of clinical research and at the heart of all treatment advances. Clinical trials look at new ways to prevent, detect, or treat disease. The National Institute of Mental Health at NIH supports research studies on mental health and disorders. To learn how clinical trials work, see  Participating in Clinical Trials. To see NIH-funded studies currently recruiting participants in anxiety disorders, visit www.ClinicalTrials.gov  and type in "anxiety disorders." Clinical Trials
Full Text:
Clinical trials are part of clinical research and at the heart of all treatment advances. Clinical trials look at new ways to prevent, detect, or treat disease. The National Institute of Mental Health at NIH supports research studies on mental health and disorders. To learn how clinical trials work, see  Participating in Clinical Trials. To see NIH-funded studies currently recruiting participants in anxiety disorders, visit www.ClinicalTrials.gov  and type 

# Question answering model

In [20]:
from transformers import pipeline

class QuestionAnsweringModel:
    def __init__(self, model_name="distilbert-base-uncased-distilled-squad"):
        """Initialize the Hugging Face QA model."""
        self.qa_pipeline = pipeline("question-answering", model=model_name)

    def get_best_result(self, query, search_results, min_score=0.3):
        """
        Select the best among search results using hugging face question-answering model.
        """
        # fallback result, returned when no candidate scores above min_score
        # the default threshold is 0.3 because answers scoring below it are usually unreliable
        # in that case we report that no answer could be found
        # (lower min_score if the method should always return some answer)
        best_result = {
            'score': min_score,
            'start': 0,
            'end': 33,
            'answer': "Sorry, I couldn't find an answer.",
            'chunk': "Sorry, I couldn't find an answer.",
            'full_text': "Sorry, I couldn't find an answer."
        }

        # for each search result we look for good answer
        for hit in search_results:
            chunk = hit.payload['text']
            full_text = hit.payload['full_text']
            
            result = self.qa_pipeline(question=query, context=chunk)
            
            result['chunk'] = chunk
            result['full_text'] = full_text

            # if this is the best answer so far, keep it together with its chunk and full text
            if result['score'] > best_result['score']:
                best_result = result.copy()

        return best_result

In [21]:
QA_MODEL_NAME = "distilbert-base-uncased-distilled-squad"
question_answering_model = QuestionAnsweringModel(QA_MODEL_NAME)

Device set to use cpu


In [22]:
# output of our solution
result = question_answering_model.get_best_result(search_query, search_results)
result

{'score': 0.4432848393917084,
 'start': 20,
 'end': 88,
 'answer': 'part of clinical research and at the heart of all treatment advances',
 'chunk': 'Clinical trials are part of clinical research and at the heart of all treatment advances. Clinical trials look at new ways to prevent, detect, or treat disease. The National Institute of Mental Health at NIH supports research studies on mental health and disorders. To learn how clinical trials work, see  Participating in Clinical Trials. To see NIH-funded studies currently recruiting participants in anxiety disorders, visit www.ClinicalTrials.gov  and type in "anxiety disorders." Clinical Trials',
 'full_text': 'Clinical trials are part of clinical research and at the heart of all treatment advances. Clinical trials look at new ways to prevent, detect, or treat disease. The National Institute of Mental Health at NIH supports research studies on mental health and disorders. To learn how clinical trials work, see  Participating in Clinical 
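
The `start` and `end` fields are character offsets into the chunk passed as context, so the answer can be recovered by slicing. A quick check using the values from the output above:

```python
# first sentence of the chunk returned above
chunk = ("Clinical trials are part of clinical research and at the heart "
         "of all treatment advances.")
start, end = 20, 88  # offsets reported by the QA pipeline

answer = chunk[start:end]
print(answer)
```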

# Evaluation

In [23]:
EMB_MODEL_NAME = "sentence-transformers/all-distilroberta-v1"
COLLECTION_NAME = "answers_collection"

vector_database = VectorDatabase(collection_name=COLLECTION_NAME, embed_model=EMB_MODEL_NAME)



In [24]:
QA_MODEL_NAME = "distilbert-base-uncased-distilled-squad"
question_answering_model = QuestionAnsweringModel(QA_MODEL_NAME)

Device set to use cpu


In [25]:
# insert whole database
vector_database.insert_answers(df['answer'].unique().tolist())



In [26]:
# we test the quality of question answering on a smaller sample
sample = df.sample(1000, random_state=101)
sample

Unnamed: 0,question,answer
15086,What is (are) methylmalonic acidemia with homo...,Methylmalonic acidemia with homocystinuria is ...
7089,How to diagnose Cone dystrophy ?,How is cone dystrophy diagnosed? The diagnosis...
11338,What are the genetic changes related to Horner...,Although congenital Horner syndrome can be pas...
10386,What is (are) Lujan syndrome ?,Lujan syndrome is a condition characterized by...
14517,How many people are affected by Guillain-Barr ...,The prevalence of Guillain-Barr syndrome is es...
...,...,...
14163,What are the genetic changes related to fibrod...,Mutations in the ACVR1 gene cause fibrodysplas...
15679,What is (are) Hypoglycemia ?,"Hypoglycemia, also called low blood glucose or..."
2206,What is (are) Plague ?,Plague is an infection caused by the bacterium...
15839,What causes Fecal Incontinence ?,"Fecal incontinence has many causes, including\..."


In [27]:
# return the full document from which our solution would extract the answer
def find_answer(search_query):
    search_results = vector_database.search_answer(search_query)
    result = question_answering_model.get_best_result(search_query, search_results)
    return result['full_text']

In [28]:
# the predicted document will be compared with the document in the provided dataset
sample['pred_answer'] = sample['question'].apply(find_answer)



In [29]:
# exact match accuracy for sample
(sample['answer'] == sample['pred_answer']).mean()

np.float64(0.24)
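
Exact match against a single row can understate retrieval quality here: the dataset holds 16401 rows but only 14976 unique questions, so some questions have several valid answers. One possible variant, sketched below on made-up rows, counts a prediction as correct when it equals any answer recorded for that question. The helper name `any_match_accuracy` and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def any_match_accuracy(rows, predictions):
    """Share of questions whose predicted text equals any known answer.

    rows: iterable of (question, answer) pairs; a question may repeat.
    predictions: dict mapping question -> predicted answer text.
    """
    valid = defaultdict(set)
    for question, answer in rows:
        valid[question].add(answer)
    hits = sum(predictions[q] in answers for q, answers in valid.items())
    return hits / len(valid)


# made-up data: the first question has two acceptable answers
rows = [
    ("What is X?", "X is a condition."),
    ("What is X?", "X affects the eye."),
    ("What causes Y?", "Y is caused by Z."),
]
predictions = {
    "What is X?": "X affects the eye.",
    "What causes Y?": "Unrelated text.",
}
print(any_match_accuracy(rows, predictions))  # 0.5
```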

In [30]:
# share of "I don't know" answers
("Sorry, I couldn't find an answer." == sample['pred_answer']).mean()

np.float64(0.217)

In [31]:
# exact match accuracy for cases where model did not answer "I don't know"
certain_sample = sample[sample['pred_answer'] != "Sorry, I couldn't find an answer."]

(certain_sample['answer'] == certain_sample['pred_answer']).mean()

np.float64(0.3065134099616858)
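
The three figures above are consistent with each other. The fallback message is not a dataset answer, so every exact match falls within the answered subset. With 1000 sampled rows, an accuracy of 0.24 means 240 matches, and a fallback share of 0.217 leaves 783 answered rows:

```python
n_rows = 1000
n_matches = round(0.24 * n_rows)             # 240 exact matches
n_answered = n_rows - round(0.217 * n_rows)  # 783 rows with a real prediction

print(n_matches / n_answered)  # about 0.3065
```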

In [32]:
# manual check for wrong answers - are they really that far away?
i = 0
wrong = certain_sample[certain_sample['answer'] != certain_sample['pred_answer']]
print(wrong['question'].values[i])
print('=================================================')
print(wrong['answer'].values[i])
print('=================================================')
print(wrong['pred_answer'].values[i])

What is (are) methylmalonic acidemia with homocystinuria ?
Methylmalonic acidemia with homocystinuria is an inherited disorder in which the body is unable to properly process protein building blocks (amino acids), certain fats (lipids), and a waxy fat-like substance called cholesterol. Individuals with this disorder have a combination of features from two separate conditions, methylmalonic acidemia and homocystinuria. The signs and symptoms of the combined condition, methylmalonic acidemia with homocystinuria, usually develop in infancy, although they can begin at any age.  When the condition begins early in life, affected individuals typically have an inability to grow and gain weight at the expected rate (failure to thrive), which is sometimes recognized before birth (intrauterine growth retardation). These infants can also have difficulty feeding and an abnormally pale appearance (pallor). Neurological problems are also common in methylmalonic acidemia with homocystinuria, including

In [33]:
i = 1
wrong = certain_sample[certain_sample['answer'] != certain_sample['pred_answer']]
print(wrong['question'].values[i])
print('=================================================')
print(wrong['answer'].values[i])
print('=================================================')
print(wrong['pred_answer'].values[i])

What are the genetic changes related to Horner syndrome ?
Although congenital Horner syndrome can be passed down in families, no associated genes have been identified. Horner syndrome that appears after the newborn period (acquired Horner syndrome) and most cases of congenital Horner syndrome result from damage to nerves called the cervical sympathetics. These nerves belong to the part of the nervous system that controls involuntary functions (the autonomic nervous system). Within the autonomic nervous system, the nerves are part of a subdivision called the sympathetic nervous system. The cervical sympathetic nerves control several functions in the eye and face such as dilation of the pupil and sweating. Problems with the function of these nerves cause the signs and symptoms of Horner syndrome. Horner syndrome that occurs very early in life can lead to iris heterochromia because the development of the pigmentation (coloring) of the iris is under the control of the cervical sympathetic 

In [34]:
i = 2
wrong = certain_sample[certain_sample['answer'] != certain_sample['pred_answer']]
print(wrong['question'].values[i])
print('=================================================')
print(wrong['answer'].values[i])
print('=================================================')
print(wrong['pred_answer'].values[i])

What are the symptoms of Ainhum ?
What are the signs and symptoms of Ainhum? The Human Phenotype Ontology provides the following list of signs and symptoms for Ainhum. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Amniotic constriction ring - Autosomal dominant inheritance - The Human Phenotype Ontology (HPO) has collected information on how often a sign or symptom occurs in a condition. Much of this information comes from Orphanet, a European rare disease database. The frequency of a sign or symptom is usually listed as a rough estimate of the percentage of patients who have that feature. The frequency may also be listed as a fraction. The first number of the fraction is how many people had the symptom, and the second number is the total number of pe

In [35]:
i = 3
wrong = certain_sample[certain_sample['answer'] != certain_sample['pred_answer']]
print(wrong['question'].values[i])
print('=================================================')
print(wrong['answer'].values[i])
print('=================================================')
print(wrong['pred_answer'].values[i])

What are the treatments for distal hereditary motor neuropathy, type V ?
These resources address the diagnosis or management of distal hereditary motor neuropathy, type V:  - Gene Review: Gene Review: BSCL2-Related Neurologic Disorders/Seipinopathy  - Gene Review: Gene Review: GARS-Associated Axonal Neuropathy  - Genetic Testing Registry: Distal hereditary motor neuronopathy type 5  - Genetic Testing Registry: Distal hereditary motor neuronopathy type 5B  - MedlinePlus Encyclopedia: High-Arched Foot   These resources from MedlinePlus offer information about the diagnosis and management of various health conditions:  - Diagnostic Tests  - Drug Therapy  - Surgery and Rehabilitation  - Genetic Counseling   - Palliative Care
There are no standard treatments for hereditary neuropathies. Treatment is mainly symptomatic and supportive. Medical treatment includes physical therapy and if needed, pain medication. Orthopedic surgery may be needed to correct severe foot or other skeletal deformiti

In [36]:
i = 4
wrong = certain_sample[certain_sample['answer'] != certain_sample['pred_answer']]
print(wrong['question'].values[i])
print('=================================================')
print(wrong['answer'].values[i])
print('=================================================')
print(wrong['pred_answer'].values[i])

What are the symptoms of Dystonia 18 ?
What are the signs and symptoms of Dystonia 18? The Human Phenotype Ontology provides the following list of signs and symptoms for Dystonia 18. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Irritability 5% Migraine 5% Ataxia - Autosomal dominant inheritance - Cerebral atrophy - Choreoathetosis - Cognitive impairment - Dyskinesia - Dystonia - EEG abnormality - Hypoglycorrhachia - Incomplete penetrance - Reticulocytosis - The Human Phenotype Ontology (HPO) has collected information on how often a sign or symptom occurs in a condition. Much of this information comes from Orphanet, a European rare disease database. The frequency of a sign or symptom is usually listed as a rough estimate of the percentage of patients 