# Introduction

This notebook uses embeddings to build a search engine. It shows how to prepare a search that understands natural language and returns relevant results. In the next notebook, we will use this to enhance the response from the large language model.

In [22]:
import warnings
warnings.filterwarnings("ignore")

In [23]:
# Import basic computation libraries
import pandas as pd 

## vector database search
from qdrant_client import models, QdrantClient

## vector computing framework
from sentence_transformers import SentenceTransformer

# tensor computation library
from torch import mps

## Data Processing

Load the data and remove null values.

In [24]:
## Load 'Covid Clinical Drug Trial Data' 
df = pd.read_csv('./data/coronavirus_clinical_trials.csv')

In [25]:
## Check whether any cells are empty. Missing values cause errors further down the pipeline, so we remove them before processing further
# Count empty cells in each column
print(df.isnull().sum())

Unnamed: 0     0
status         0
phase         71
sex            0
age            0
nct number     0
inclusion      0
exclusion      0
enrollment     3
dtype: int64


In [28]:
## Remove rows with a missing 'phase' value, since missing values cause errors during serialisation
df = df[df['phase'].notna()]

In [29]:
## dataset stats like total count and data field distributions (std/mean)
df.describe()

Unnamed: 0.1,Unnamed: 0,enrollment
count,131.0,131.0
mean,91.427481,753.618321
std,59.121652,3592.547045
min,0.0,0.0
25%,39.0,48.5
50%,87.0,150.0
75%,137.5,440.0
max,201.0,40000.0


In [30]:
## Maps data fields to the format needed for vectorisation
data = df.to_dict('records')
data

[{'Unnamed: 0': 0,
  'status': 'Active, not recruiting',
  'phase': 'Not Applicable',
  'sex': 'All',
  'age': '18 Years and older   (Adult, Older Adult)',
  'nct number': 'NCT04321421',
  'inclusion': 'death [\xa0Time\xa0Frame:\xa0within 7 days\xa0]death from any cause',
  'exclusion': 'time to extubation [\xa0Time\xa0Frame:\xa0within 7 days\xa0]days since intubation\nlength of intensive care unit stay [\xa0Time\xa0Frame:\xa0within 7 days\xa0]days from entry to exit from ICU\ntime to CPAP weaning [\xa0Time\xa0Frame:\xa0within 7 days\xa0]days since CPAP initiation\nviral load [\xa0Time\xa0Frame:\xa0at days 1, 3 and 7\xa0]naso-pharyngeal swab, sputum and BAL\nimmune response [\xa0Time\xa0Frame:\xa0at days 1, 3 and 7\xa0]neutralizing title length of intensive care unit stay [\xa0Time\xa0Frame:\xa0within 7 days\xa0]days from entry to exit from ICU\ntime to CPAP weaning [\xa0Time\xa0Frame:\xa0within 7 days\xa0]days since CPAP initiation\nviral load [\xa0Time\xa0Frame:\xa0at days 1, 3 and 7
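
Since missing values are reported to break serialisation, a quick check over the records can confirm that none remain before they are used as payloads. The following is a minimal sketch, not part of the original notebook: `find_missing` and `sample_records` are hypothetical names, and the toy values are made up for illustration.

```python
import math

def find_missing(records):
    """Return (row_index, field) pairs whose value is None or NaN."""
    missing = []
    for i, rec in enumerate(records):
        for key, value in rec.items():
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_nan:
                missing.append((i, key))
    return missing

# Toy records standing in for `data`; the values are invented for illustration.
sample_records = [
    {"phase": "Phase 3", "enrollment": 1500.0},
    {"phase": "Phase 2", "enrollment": float("nan")},
    {"phase": None, "enrollment": 40.0},
]

missing_cells = find_missing(sample_records)
print(missing_cells)
```

Calling `find_missing(data)` on the real records should return an empty list if the filtering step removed every incomplete row.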

## Process Embeddings 
Embeddings are representations of the text data (in our case, the COVID clinical trials CSV file) as vectors in a high-dimensional space. We use embeddings to compare the similarity between sentences, because vectors let us represent text in mathematical terms. In this notebook, I use cosine similarity, which measures the cosine of the angle between two vectors and so quantifies how similar two sentences are regardless of their length.
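
To make the idea concrete, here is a minimal sketch of cosine similarity in plain Python. It is illustrative only: the function name `cosine_similarity` and the toy vectors are assumptions, and real sentence embeddings have hundreds of dimensions rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [3.0, 0.0, -1.0]  # perpendicular to v1 (dot product is 0)

print(cosine_similarity(v1, v2))  # close to 1: same direction despite different length
print(cosine_similarity(v1, v3))  # 0: unrelated directions
```

Because the score depends only on direction, `v1` and `v2` are treated as equivalent even though one is twice as long, which is the property that makes the measure insensitive to length.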

In [36]:
## Encode the text using the 'all-MiniLM-L6-v2' model.
encoder = SentenceTransformer('all-MiniLM-L6-v2') # the model is downloaded on first use and cached locally

## Database to store the vectors. Since the dataset is small, we can keep it in memory.
qdrant = QdrantClient(":memory:")

In [37]:
# Create a collection in the database. The collection stores the vector parameters:
# size: the embedding dimension reported by the encoder
# distance function: cosine

qdrant.recreate_collection(
    collection_name = "covid_ct",
    vectors_config = models.VectorParams(
        size = encoder.get_sentence_embedding_dimension(),
        distance = models.Distance.COSINE
    )
)

True

In [38]:
# creates an index and uploads all the data into the in-memory database
# payload holds the metadata 
qdrant.upload_points(
    collection_name = "covid_ct",
    points = [
        models.PointStruct(
            id = idx,
            vector = encoder.encode(doc['exclusion']).tolist(),
            payload = doc
        ) 
        for idx, doc in enumerate(data)
    ]
)

## Search with given input text

Let's search! 

In [39]:
user_prompt = "What are the characterictics of suitable patients for covid trials"
hits = qdrant.search(
    collection_name = "covid_ct",
    query_vector = encoder.encode(user_prompt).tolist(),
    limit = 5
)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'Unnamed: 0': 120, 'status': 'Not yet recruiting', 'phase': 'Phase 3', 'sex': 'All', 'age': '18 Years and older   (Adult, Older Adult)', 'nct number': 'NCT04324463', 'inclusion': 'Outpatients: Hospital Admission or Death [\xa0Time\xa0Frame:\xa0Up to 6 weeks post randomization\xa0]In outpatients with COVID-19, the occurrence of hospital admission or death\nInpatients: Invasive mechanical ventilation or mortality [\xa0Time\xa0Frame:\xa0Up to 6 weeks post randomization\xa0]Patients intubated or requiring imminent intubation at the time of randomization will only be followed for the primary outcome of death. Inpatients: Invasive mechanical ventilation or mortality [\xa0Time\xa0Frame:\xa0Up to 6 weeks post randomization\xa0]Patients intubated or requiring imminent intubation at the time of randomization will only be followed for the primary outcome of death.', 'exclusion': 'Age ≥ 18 years of age Informed consent COVID-19 confirmed by established testing', 'enrollment': 1500.0} score: 0.465
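
Conceptually, the search scores the query vector against every stored vector and returns the best matches. The sketch below shows that idea as a brute-force top-k search in plain Python. It is not a description of how Qdrant works internally; `cosine`, `top_k`, `toy_points`, and the `TOY-*` identifiers are hypothetical names used only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, points, k):
    """Score every (id, vector, payload) point and return the k best as (score, id, payload)."""
    scored = [(cosine(query_vec, vec), pid, payload) for pid, vec, payload in points]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Toy two-dimensional "embeddings"; real ones come from the encoder.
toy_points = [
    (0, [1.0, 0.0], {"nct number": "TOY-0"}),
    (1, [0.0, 1.0], {"nct number": "TOY-1"}),
    (2, [1.0, 1.0], {"nct number": "TOY-2"}),
]

results = top_k([1.0, 0.2], toy_points, k=2)
for score, pid, payload in results:
    print(pid, payload, "score:", round(score, 3))
```

A vector database adds indexing and storage on top of this, so the same ranking can be produced without comparing the query against every point.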

In [40]:
search_result = [hit.payload for hit in hits]

In [None]:
## Connect to a locally served LLM through the OpenAI client (OpenAI-compatible endpoint)
from openai import OpenAI

client = OpenAI(
    base_url = "http://127.0.0.1:8080/v1",
    api_key = "sk_no_key_required"
)
completion = client.chat.completions.create(
    model = "LLaMA_CPP",
    messages = [
        {"role": "system", "content": "Covid 19 Clinical Trial Assistant"},
        {"role": "user", "content": "What are the characteristics of suitable patients for covid trials?"},
        {"role": "assistant", "content": str(search_result)}
    ]
)
# Show the model's reply
print(completion.choices[0].message.content)
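
The cell above passes the search results to the model as an `assistant` message. A common alternative, sketched here as an assumption rather than as the notebook's method, is to place the retrieved records in the `user` message as labelled context. `build_messages` and `toy_results` are hypothetical names, and the toy payload is invented for illustration.

```python
def build_messages(question, retrieved, system_prompt="Covid 19 Clinical Trial Assistant"):
    """Format retrieved payloads as numbered context lines ahead of the question."""
    context_lines = []
    for i, payload in enumerate(retrieved, start=1):
        fields = "; ".join(f"{key}: {value}" for key, value in payload.items())
        context_lines.append(f"[{i}] {fields}")
    context = "\n".join(context_lines)
    user_content = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

# Toy payload standing in for `search_result`.
toy_results = [{"nct number": "TOY-0", "phase": "Phase 3"}]
messages = build_messages("Who is eligible?", toy_results)
print(messages[1]["content"])
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create`, with `search_result` in place of `toy_results`. Numbering the records also gives the model something to refer to when it explains which trial an answer came from.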