# MyScale Filtered Vector Search

## Data Preprocessing

We use the PubMed multi-label text classification dataset for this showcase. Download it from [Hugging Face](https://huggingface.co/datasets/owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH) with `wget`:


In [None]:
!wget -c https://huggingface.co/datasets/owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH/resolve/main/PubMed%20Multi%20Label%20Text%20Classification%20Dataset%20Processed.csv
!python3 -m pip install clickhouse-connect pandas tqdm transformers sentence-transformers

In [None]:
from os import environ
from tqdm import tqdm

Filtered vector search in MyScale is plain SQL: a `WHERE` clause combined with an `ORDER BY distance(...)` clause.

First, get your MyScale credentials and connect with clickhouse-connect:

In [None]:
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='your-myscale-backend',
    port=443,
    username='your-user-name',
    password='your-password'
)

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v2')

In [None]:
import torch
import numpy as np
def get_embedding(text):
    """Encode text and return an L2-normalized (unit-length) embedding as a list."""
    with torch.no_grad():
        emb = model.encode(text).squeeze()
        return (emb / np.linalg.norm(emb, ord=2)).tolist()
len(get_embedding('text'))
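
Because `get_embedding` returns unit-length vectors, cosine distance between two embeddings reduces to one minus their dot product. The stand-alone sketch below illustrates this with plain Python lists. `l2_normalize` and `cosine_distance` are illustrative helper names, not part of MyScale or sentence-transformers, and treating cosine distance as 1 minus cosine similarity is an assumption about the convention in use.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit L2 norm (illustrative helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_distance(a, b):
    """1 - cosine similarity, computed without assuming unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
dot_uv = sum(x * y for x, y in zip(u, v))

# For unit vectors the two quantities agree up to floating-point error
print(cosine_distance(u, v), 1.0 - dot_uv)
```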

The dataset contains strings, arrays of strings, and arrays of arrays. To demonstrate how you can use these seamlessly within MyScale, we need to convert their string representations into native Python lists:

In [None]:
import torch
import pandas as pd
from tqdm import tqdm
from ast import literal_eval

df = pd.read_csv('PubMed Multi Label Text Classification Dataset Processed.csv')
df_s = df.astype(str)
for k in ['meshroot', 'meshid', 'meshMajor']:
    df_s[k] = [literal_eval(r) for r in tqdm(df_s[k], desc=f'Converting {k} to Python Native...')]
with torch.no_grad():
    embeddings = [get_embedding(r) for r in tqdm(df_s['abstractText'])]
df_s['meshEmbedding'] = embeddings

for k in df.keys()[:5]:
    print(f'{k} : {type(df_s[k][0])}')
df_s[:3]
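
As a small illustration of the conversion step, `ast.literal_eval` parses a string holding a Python literal into the corresponding object without executing arbitrary code. The sample strings below are made up for illustration and only shaped like the `meshroot` and `meshid` columns.

```python
from ast import literal_eval

# Hypothetical cell values, shaped like the meshroot and meshid columns
meshroot_raw = "['Organisms [B]', 'Diseases [C]']"
meshid_raw = "[['C01'], ['E05.599.835', 'D20.215.894']]"

meshroot = literal_eval(meshroot_raw)  # list of str
meshid = literal_eval(meshid_raw)      # list of lists of str

print(type(meshroot_raw).__name__, '->', type(meshroot).__name__)
print(meshid[1][0])
```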

## Create table with schema

In [8]:
table_name = "pubmed_multilabel"

In [None]:
client.command(f'DROP TABLE IF EXISTS {table_name}')

client.command(f'''CREATE TABLE IF NOT EXISTS {table_name}(
    Title String,
    abstractText String,
    meshMajor Array(String),
    pmid Int64,
    meshid Array(Array(String)),
    meshroot Array(String),
    A Int8,
    B Int8,
    C Int8,
    D Int8,
    E Int8,
    F Int8,
    G Int8,
    H Int8,
    I Int8,
    J Int8,
    L Int8,
    M Int8,
    N Int8,
    Z Int8,
    meshEmbedding Array(Float32),
    CONSTRAINT vec_len CHECK length(meshEmbedding) = 512,
    VECTOR INDEX vindex meshEmbedding TYPE IVFFLAT('metric_type=cosine')
) ENGINE = MergeTree ORDER BY pmid''')

[clickhouse-connect](https://clickhouse.com/docs/en/integrations/python) provides [`insert_df`](https://clickhouse.com/docs/en/integrations/python#client-insert-method) to insert a pandas DataFrame into a table.

In [None]:
client.insert_df(table_name, df_s)
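
The `vec_len` constraint in the schema requires every `meshEmbedding` to have exactly 512 elements, so it can help to check lengths on the client before inserting. A minimal sketch, where `find_bad_lengths` is a hypothetical helper and the toy vectors stand in for real embeddings:

```python
EXPECTED_DIM = 512  # matches the vec_len constraint in the CREATE TABLE statement


def find_bad_lengths(embeddings, expected_dim=EXPECTED_DIM):
    """Return the positions of embeddings whose length differs from expected_dim."""
    return [i for i, emb in enumerate(embeddings) if len(emb) != expected_dim]


# Toy data standing in for df_s['meshEmbedding']
toy_embeddings = [[0.0] * 512, [0.0] * 511, [0.0] * 512]
print(find_bad_lengths(toy_embeddings))
```

An empty result means every vector matches the expected dimension.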

## Get number of uploaded samples

In [9]:
[r for r in client.query(f"SELECT COUNT(*) FROM {table_name}").named_results()]

[{'count()': 50000}]

## Get some samples from the database

In [20]:
for p in [r for r in client.query(f"SELECT pmid, Title FROM {table_name} LIMIT 3").named_results()]:
    print(p)

{'pmid': 506, 'Title': 'Phospholipases. III. Effects of ionic surfactants on the phospholipase-catalyzed hydrolysis of unsonicated egg lecithin liposomes.'}
{'pmid': 2524, 'Title': 'Reduction of blood platelet monoamine oxidase activity in schizophrenic patients on phenothiazines.'}
{'pmid': 6714, 'Title': 'Identification of monohydroxylated metabolites of cannabidiol formed by rat liver.'}


## Vector Search

In [22]:
emb_str = f"[{','.join(map(str, get_embedding('vaccine')))}]"
for p in [r for r in client.query(f"SELECT abstractText, d FROM {table_name} ORDER BY distance(meshEmbedding, {emb_str}) AS d LIMIT 3").named_results()]:
    print(p)

{'abstractText': 'Vaccine-induced protection may not be homogeneous across individuals. It is possible that a vaccine gives complete protection for a portion of individuals, while the rest acquire only incomplete (leaky) protection of varying magnitude. If vaccine efficacy is estimated under wrong assumptions about such individual level heterogeneity, the resulting estimates may be difficult to interpret. For instance, population-level predictions based on such estimates may be biased. We consider the problem of estimating heterogeneous vaccine efficacy against an infection that can be acquired multiple times (susceptible-infected-susceptible model). The estimation is based on a limited number of repeated measurements of the current status of each individual, a situation commonly encountered in practice. We investigate how the placement of consecutive samples affects the estimability and efficiency of vaccine efficacy parameters. The same sampling frequency may not be optimal for effic

## Vector Search with Abstract Pattern Filter

In [24]:
emb_str = f"[{','.join(map(str, get_embedding('vaccine')))}]"
for p in [r for r in client.query(f"SELECT abstractText, d FROM {table_name} \
                                  WHERE abstractText LIKE 'BACKGROUND%' \
                                  ORDER BY distance(meshEmbedding, {emb_str}) AS d LIMIT 3").named_results()]:
    print({k[:10]: v[:30] if type(v) is str else v for k, v in p.items()})

{'abstractTe': 'BACKGROUND: Influenza is the m', 'd': 0.5786518454551697}
{'abstractTe': 'BACKGROUND: Simple and effecti', 'd': 0.6172389984130859}
{'abstractTe': 'BACKGROUND: Approximately 500 ', 'd': 0.6178913116455078}


## Vector Search with Array Filters

In [31]:
emb_str = f"[{','.join(map(str, get_embedding('vaccine')))}]"
for p in [r for r in client.query(f"SELECT abstractText, d, meshroot FROM {table_name} \
                                  WHERE has(meshroot, 'Organisms [B]') \
                                  ORDER BY distance(meshEmbedding, {emb_str}) AS d LIMIT 3").named_results()]:
    print({k[:10]: v[:7] if type(v) is list else (v[:10] if type(v) is str else v) for k, v in p.items()})

{'abstractTe': 'Vaccine-in', 'd': 0.5617110729217529, 'meshroot': ['Health Care [N]', 'Organisms [B]', 'Diseases [C]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Chemicals and Drugs [D]']}
{'abstractTe': 'The manage', 'd': 0.5732998251914978, 'meshroot': ['Named Groups [M]', 'Organisms [B]', 'Health Care [N]', 'Geographicals [Z]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]']}
{'abstractTe': 'BACKGROUND', 'd': 0.5786518454551697, 'meshroot': ['Geographicals [Z]', 'Psychiatry and Psychology [F]', 'Health Care [N]', 'Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]', 'Information Science [L]', 'Organisms [B]', 'Chemicals and Drugs [D]']}


## Flatten Arrays of Arrays to Arrays then Filter

In [14]:
emb_str = f"[{','.join(map(str, get_embedding('vaccine')))}]"
for p in [r for r in client.query(f"SELECT abstractText, d, meshid FROM {table_name} \
                                  WHERE has(arrayFlatten(meshid), 'C23.550.291.937') \
                                  ORDER BY distance(meshEmbedding, {emb_str}) AS d LIMIT 3").named_results()]:
    print({k[:10]: v[:10] if type(v) is str else v for k, v in p.items()})

{'abstractTe': 'Vaccine-in', 'd': 0.5617110729217529, 'meshid': [['N05.715.350.150', 'N06.850.490.500'], ['B01.050.150.900.649.313.988.400.112.400.400'], ['C01'], ['E05.318.740.500', 'E05.599.835', 'N05.715.360.750.530', 'N06.850.520.830.500'], ['C23.550.291.937'], ['E01.789.800', 'N04.761.559.590.800', 'N05.715.360.575.575.800'], ['D20.215.894']]}
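
To make the semantics of this filter concrete, here is a pure-Python analogue of `has(arrayFlatten(meshid), ...)`, applied to a shortened version of the `meshid` value from the result above. The functions `array_flatten` and `has` mirror the SQL function names but are local illustrative helpers, and the sketch only covers the two-level nesting used in this table.

```python
from itertools import chain


def array_flatten(nested):
    """Flatten one level of nesting: list of lists -> flat list."""
    return list(chain.from_iterable(nested))


def has(array, element):
    """True if element occurs in array."""
    return element in array


# Shortened meshid value taken from the query result above
meshid = [
    ['N05.715.350.150', 'N06.850.490.500'],
    ['C01'],
    ['C23.550.291.937'],
    ['D20.215.894'],
]

print(has(meshid, 'C23.550.291.937'))                 # nested array: no direct match
print(has(array_flatten(meshid), 'C23.550.291.937'))  # flattened array: match
```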


## Multi-Column Filter

In [18]:
emb_str = f"[{','.join(map(str, get_embedding('vaccine')))}]"
for p in [r for r in client.query(f"SELECT abstractText, d, A, B, C FROM {table_name} \
                                  WHERE A+B+C>=2 \
                                  ORDER BY distance(meshEmbedding, {emb_str}) AS d LIMIT 3").named_results()]:
    print({k[:10]: v[:10] if type(v) is str else v for k, v in p.items()})

{'abstractTe': 'Vaccine-in', 'd': 0.5617110729217529, 'A': 0, 'B': 1, 'C': 1}
{'abstractTe': 'BACKGROUND', 'd': 0.5786518454551697, 'A': 0, 'B': 1, 'C': 1}
{'abstractTe': 'Vaccinatio', 'd': 0.6105412244796753, 'A': 1, 'B': 1, 'C': 1}
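
All of the queries in this notebook share one shape: an optional `WHERE` filter, then `ORDER BY distance(...)` and a `LIMIT`. A small helper can assemble that string. `build_search_query` below is a hypothetical convenience function, not part of clickhouse-connect or MyScale. It interpolates values directly into SQL just as the cells above do, so it should only be given trusted input.

```python
def build_search_query(table, vector_column, embedding, select_columns, where=None, limit=3):
    """Assemble an optionally filtered vector search query as a SQL string."""
    emb_str = f"[{','.join(map(str, embedding))}]"
    where_clause = f"WHERE {where} " if where else ""
    return (
        f"SELECT {', '.join(select_columns)}, d FROM {table} "
        f"{where_clause}"
        f"ORDER BY distance({vector_column}, {emb_str}) AS d LIMIT {limit}"
    )


# Toy 3-dimensional vector in place of get_embedding('vaccine')
query = build_search_query(
    table="pubmed_multilabel",
    vector_column="meshEmbedding",
    embedding=[0.1, 0.2, 0.3],
    select_columns=["abstractText", "meshroot"],
    where="has(meshroot, 'Organisms [B]')",
)
print(query)
```

The resulting string can be passed to `client.query(...)` in the same way as the hand-written queries above.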
