# Introduction

This notebook uses embeddings to build a search engine. It shows how to prepare a search that understands natural language and returns relevant results. In the next notebook, we will use this search to enhance the response from the large language model.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import basic computation libraries
import pandas as pd 

## vector database search
from qdrant_client import models, QdrantClient

## vector computing framework
from sentence_transformers import SentenceTransformer

# tensor computation library
from torch import mps

## Data Processing

Load the data and remove null values.

In [3]:
## Load 'Covid Clinical Drug Trial Data' 
df = pd.read_csv('./data/coronavirus_clinical_trials.csv')

In [4]:
## Check whether any cells are empty. Missing values cause errors later in the pipeline, so we remove them before processing further
# Count empty cells in each column
print(df.isnull().sum())

Unnamed: 0     0
status         0
phase         71
sex            0
age            0
nct number     0
inclusion      0
exclusion      0
enrollment     3
dtype: int64


In [5]:
## remove rows with a missing 'phase' value, as missing values cause errors in serialisation
df = df[df['phase'].notna()]

In [6]:
## dataset stats like total count and data field distributions (std/mean)
df.describe()

       Unnamed: 0   enrollment
count       131.0        131.0
mean    91.427481   753.618321
std     59.121652  3592.547045
min           0.0          0.0
25%          39.0         48.5
50%          87.0        150.0
75%         137.5        440.0
max         201.0      40000.0


In [7]:
## Convert each row to a dictionary, the format needed for vectorisation
data = df.to_dict('records')
data[1]

{'Unnamed: 0': 1,
 'status': 'Not yet recruiting',
 'phase': 'Phase 2\nPhase 3',
 'sex': 'All',
 'age': '18 Years to 75 Years   (Adult, Older Adult)',
 'nct number': 'NCT04291053',
 'inclusion': 'Mortality rate [\xa0Time\xa0Frame:\xa0up to 28 days\xa0]All cause mortality',
 'exclusion': 'Clinical status assessed according to the official guideline [\xa0Time\xa0Frame:\xa0up to 28 days\xa0]1.mild type：no No symptoms, Imaging examination showed no signs of pneumonia; 2,moderate type: with fever or respiratory symptoms,Imaging examination showed signs of pneumonia, SpO2＞93% without oxygen inhalation ; severe type:Match any of the following：a. R≥30bpm；b.Pulse Oxygen Saturation(SpO2)≤93% without oxygen inhalation，c. PaO2/FiO2(fraction of inspired oxygen )≤300mmHg ；4. Critically type：match any of the follow: a. need mechanical ventilation; b. shock; c. (multiple organ dysfunction syndrome) MODS\nThe differences in oxygen intake methods [\xa0Time\xa0Frame:\xa0up to 28 days\xa0]Pulse Oxygen Sat

## Process Embeddings 
Embeddings represent text data (in our case, the clinical trial CSV file) as vectors in a high-dimensional space. Representing text in mathematical terms lets us compare the similarity between sentences. In this notebook, I use cosine similarity, which measures the cosine of the angle between two vectors and so quantifies how similar two sentences are regardless of the magnitude of their vectors.
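
As a minimal sketch of the idea (not the code Qdrant runs internally), cosine similarity can be computed by hand. The three-dimensional vectors below are made-up stand-ins for real sentence embeddings, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (assumed values).
short_vec = [1.0, 2.0, 3.0]
long_vec = [10.0, 20.0, 30.0]   # same direction, ten times the magnitude
other_vec = [3.0, -1.0, 0.5]    # a different direction

# Same direction gives a score close to 1, whatever the magnitude.
print(cosine_similarity(short_vec, long_vec))
# A different direction gives a noticeably lower score.
print(cosine_similarity(short_vec, other_vec))
```

Because the score depends only on direction, scaling a vector does not change how similar it looks to another one.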

In [8]:
## encode using the 'all-MiniLM-L6-v2' model. 
encoder = SentenceTransformer('all-MiniLM-L6-v2') # model: download ML model locally

## database to store the vectors. Since the dataset is small, we can keep it in memory.
qdrant = QdrantClient(":memory:")

In [9]:
# create a collection in the database. The collection stores the vector parameters:
# size: the embedding dimension reported by the encoder
# distance function: cosine

qdrant.recreate_collection(
    collection_name = "covid_ct",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(),
        distance = models.Distance.COSINE
    )
)

True

In [10]:
# encodes the 'exclusion' text of each record and uploads all points into the in-memory database
# payload holds the full record as metadata
qdrant.upload_points(
    collection_name = "covid_ct",
    points = [
        models.PointStruct(
            id = idx,
            vector = encoder.encode(doc['exclusion']).tolist(),
            payload = doc
        ) 
        for idx, doc in enumerate(data)
    ]
)

## Search with given input text

Let's search! The answer is hidden in the inclusion/exclusion criteria free-text fields (only the `exclusion` field was embedded above).

In [11]:
user_prompt = "What are the characteristics of suitable patients for covid trials"
hits = qdrant.search(
    collection_name = "covid_ct",
    query_vector = encoder.encode(user_prompt).tolist(),
    limit = 5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'Unnamed: 0': 200, 'status': 'Recruiting', 'phase': 'Phase 3', 'sex': 'All', 'age': '18 Years and older   (Adult, Older Adult)', 'nct number': 'NCT04308668', 'inclusion': 'To test if post-exposure prophylaxis with hydroxychloroquine can prevent progression development of symptomatic COVID19 disease after known exposure to the SARS-CoV2 virus. To test if preemptive therapy with hydroxychloroquine can prevent progression of persons with known symptomatic COVID19 disease, preventing hospitalization.', 'exclusion': 'Incidence of COVID19 Disease among those who are asymptomatic at trial entry [\xa0Time\xa0Frame:\xa014 days\xa0]Number of participants at 14 days post enrollment with active COVID19 disease.\nOrdinal Scale of COVID19 Disease Severity at 14 days among those who are symptomatic at trial entry [\xa0Time\xa0Frame:\xa014 days\xa0]Participants will self-report disease severity status as one of the following 3 options; no COVID19 illness (score of 1), COVID19 illness with no hospital

In [12]:
search_result = [hit.payload for hit in hits]
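
Conceptually, the search above scores the query vector against every stored vector and keeps the best matches. The following is a brute-force sketch of that ranking, assuming toy two-dimensional vectors and made-up payloads; `top_k` and `cosine_similarity` are hypothetical helpers, and the way Qdrant actually indexes and searches may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, points, k):
    """Return the k highest-scoring (score, payload) pairs, best first."""
    scored = [(cosine_similarity(query_vec, vec), payload) for vec, payload in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy (vector, payload) pairs; the identifiers are invented for illustration.
toy_points = [
    ([1.0, 0.0], {"nct number": "TOY-A"}),
    ([0.7, 0.7], {"nct number": "TOY-B"}),
    ([0.0, 1.0], {"nct number": "TOY-C"}),
]

results = top_k([0.9, 0.1], toy_points, k=2)
for score, payload in results:
    print(payload, "score:", round(score, 3))
```

The `limit` argument in the real query plays the same role as `k` here.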

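Please note that the next cell uses the OpenAI Python client, but points it at a local OpenAI-compatible server (`http://127.0.0.1:8080/v1`). That server must be running before you execute the cell; the API key in the code is only a placeholder.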

In [None]:
## Connect to an OpenAI-compatible LLM endpoint using the OpenAI Python client
from openai import OpenAI

client = OpenAI(
    base_url = "http://127.0.0.1:8080/v1",  # local server address
    api_key = "sk_no_key_required"          # placeholder value
)
completion = client.chat.completions.create(
    model = "LLaMA_CPP",
    messages = [
        {"role": "system", "content": "Covid 19 Clinical Trial Assistant"},
        # reuse the question that was used for the vector search
        {"role": "user", "content": user_prompt},
        # retrieved trial records, passed to the model as context
        {"role": "assistant", "content": str(search_result)}
    ]
)
# show the model's reply
print(completion.choices[0].message.content)
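
The raw payloads can be long, so one possible variation is to trim each record before sending it to the model. The sketch below is an alternative to the cell above, not the notebook's method: `build_messages`, its `fields` and `max_chars` parameters, and the choice to place the records in the user message are all assumptions made for illustration.

```python
import json

def build_messages(question, search_result,
                   fields=("nct number", "status", "phase", "inclusion", "exclusion"),
                   max_chars=500):
    """Build a chat message list with trimmed trial records as context."""
    context_lines = []
    for record in search_result:
        # Keep only the selected fields and cap the length of each value.
        trimmed = {k: str(record[k])[:max_chars] for k in fields if k in record}
        context_lines.append(json.dumps(trimmed, ensure_ascii=False))
    context = "\n".join(context_lines)
    return [
        {"role": "system",
         "content": "Covid 19 Clinical Trial Assistant. Answer using only the trial records provided."},
        {"role": "user",
         "content": f"Trial records:\n{context}\n\nQuestion: {question}"},
    ]

# Made-up record standing in for a real search hit payload.
toy_result = [{
    "nct number": "TOY-0001",
    "status": "Recruiting",
    "phase": "Phase 3",
    "sex": "All",
    "exclusion": "x" * 1000,
}]

messages = build_messages("Which trials are recruiting?", toy_result)
print(messages[1]["content"][:80])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`, with `search_result` from the earlier cell in place of `toy_result`.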