### Uncomment and run the following cells if you work on Google Colab :) Don't forget to change your runtime type to GPU!

In [1]:
# !git clone https://github.com/kstathou/vector_engine

In [2]:
# cd vector_engine

In [3]:
# pip install -r requirements.txt

### Let's begin!

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2
# Used to load data from a local file.
import pandas as pd

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

# Used to do vector searches and display the results.
from vector_engine.utils import vector_search, id2details



Stored and processed data in s3

In [3]:
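# Read the CSV into a DataFrame, skipping malformed rows.
# `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0; `on_bad_lines` replaces it.
df = pd.read_csv('/home/jj/Desktop/semantic_search_engine/data/CSV_Data/Ready_v3_FINAL/Merged_Dataset_Final.csv', on_bad_lines='skip')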



In [4]:
df.head(3)

Unnamed: 0,indexId,paperId,url,title,abstract,year,referenceCount,citationCount,influentialCitationCount,isOpenAccess,...,authors/138/authorId,authors/138/name,authors/139/authorId,authors/139/name,authors/140/authorId,authors/140/name,authors/141/authorId,authors/141/name,authors/142/authorId,authors/142/name
0,1,46200b99c40e8586c8a0f588488ab6414119fb28,https://www.semanticscholar.org/paper/46200b99...,TensorFlow: A system for large-scale machine l...,TensorFlow is a machine learning system that o...,2016,94,13969,1679,False,...,,,,,,,,,,
1,2,f9c602cc436a9ea2f9e7db48c77d924e09ce3c32,https://www.semanticscholar.org/paper/f9c602cc...,Fashion-MNIST: a Novel Image Dataset for Bench...,"We present Fashion-MNIST, a new dataset compri...",2017,6,4588,1340,False,...,,,,,,,,,,
2,3,9c9d7247f8c51ec5a02b0d911d1d7b9e8160495d,https://www.semanticscholar.org/paper/9c9d7247...,TensorFlow: Large-Scale Machine Learning on He...,TensorFlow is an interface for expressing mach...,2016,55,9429,1008,False,...,,,,,,,,,,


In [5]:
print(f"T'ikray Prototype: {df.indexId.unique().shape[0]}")

T'ikray Prototype: 4291


The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) offers pretrained transformers that produce state-of-the-art sentence embeddings. Check out this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/) with all the available models.

In this tutorial, we will use the `distilbert-base-nli-stsb-mean-tokens` model, which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Although it is slightly less accurate than BERT, it is considerably faster thanks to its smaller size.

I use the same model in [Orion's semantic search engine](https://www.orion-search.org/)!
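
The `mean-tokens` suffix in the model name refers to how token vectors are pooled into one sentence vector: the token embeddings are averaged, ignoring padding. Below is a minimal NumPy sketch of that pooling step. The `mean_pool` helper and the toy token vectors are made up for illustration; this is not the library's internal code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    token_embeddings: (batch, tokens, dim) array
    attention_mask:   (batch, tokens) array, 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for a row that contains only padding.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: one sentence with two real tokens and one padding token (dim = 2).
tokens = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # the padded token does not affect the average
```

Whatever the number of tokens, the result has one fixed-size vector per sentence, which is what allows the abstracts to be stored in a single matrix later on.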

In [6]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

cuda:0


In [7]:
# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)

Batches: 100%|████████████████████████████████| 135/135 [00:09<00:00, 14.69it/s]


In [8]:
print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)


## Vector similarity search with Faiss
[Faiss](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM. 
    
Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). You can even create [composite indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)). Faiss handles collections of vectors of a fixed dimensionality d, typically a few tens to a few hundreds.

**Note**: Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.

To learn more about Faiss, you can read their paper on [arXiv](https://arxiv.org/abs/1702.08734).

Here, we will use the `IndexFlatL2` index:
- It's a simple index that performs a brute-force L2 distance search.
- It scales linearly. It will work fine with our data, but you might want to try [faster indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search) if you work with millions of vectors.
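
Conceptually, a flat L2 index compares the query against every stored vector. The NumPy sketch below illustrates that brute-force search. `brute_force_l2` is a hypothetical helper written for this example, not a Faiss function. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
import numpy as np

def brute_force_l2(vectors, queries, k):
    """Return (squared L2 distances, row positions) of the k nearest vectors per query."""
    # Pairwise differences with shape (n_queries, n_vectors, dim).
    diff = queries[:, None, :] - vectors[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Every stored vector is visited, so the cost grows with the collection size.
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, order, axis=1), order

rng = np.random.default_rng(0)
vectors = rng.random((100, 8), dtype=np.float32)

# Querying with a stored vector returns that vector first, at distance 0.
D, I = brute_force_l2(vectors, vectors[[42]], k=3)
print(I[0][0], D[0][0])
```

This materialises the full difference tensor, which is fine for a toy example but wasteful in memory; Faiss computes the same quantity far more efficiently.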

To create an index with the abstract vectors, we will:
1. Change the data type of the abstract vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to the integer `indexId` column, because Faiss needs 64-bit integer IDs and the Semantic Scholar `paperId` values are hexadecimal strings.
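
Faiss expects a C-contiguous `float32` matrix, and `add_with_ids` expects one 64-bit integer ID per row. A small pre-flight check along the following lines can catch mismatches before the index is built. `prepare_for_faiss` is a hypothetical helper, not part of Faiss or `vector_engine`, and the uniqueness check is a design choice of this sketch rather than a Faiss requirement.

```python
import numpy as np

def prepare_for_faiss(embeddings, ids):
    """Coerce vectors to contiguous float32 and IDs to int64, checking that they line up."""
    vectors = np.ascontiguousarray(embeddings, dtype="float32")
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_vectors, dim)")
    id_array = np.asarray(ids)
    if not np.issubdtype(id_array.dtype, np.integer):
        raise TypeError("IDs must be integers; string IDs need a separate integer key")
    id_array = id_array.astype("int64")
    if id_array.shape != (vectors.shape[0],):
        raise ValueError("need exactly one ID per vector")
    if np.unique(id_array).size != id_array.size:
        raise ValueError("duplicate IDs would make results ambiguous to look up")
    return vectors, id_array

# Random stand-in data with the same width as the DistilBERT embeddings.
vectors, ids = prepare_for_faiss(np.random.rand(5, 768), range(1, 6))
print(vectors.dtype, ids.dtype, vectors.shape)
```

Passing string identifiers such as the `paperId` hashes raises a `TypeError` here, which is the situation the integer `indexId` column avoids.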

In [50]:
#df.astype({'paperId': 'int32'}).dtypes
#df['paperId'].astype(str).astype(int)
#df['paperId'] = df.paperId.astype(int)
#df['paperId'] = pd.to_numeric(df['paperId'])
#df["paperId"] = pd.to_numeric(df["paperId"], errors='coerce')

#s = pd.Series(df['paperId'])
#print(s)
#pd.to_numeric(s, errors='coerce')


In [9]:
df.indexId

0          1
1          2
2          3
3          4
4          5
        ... 
4286    4287
4287    4288
4288    4289
4289    4290
4290    4291
Name: indexId, Length: 4291, dtype: int64

In [10]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.indexId.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 4291


### Searching the index
The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with an abstract from our dataset and retrieve the 10 most relevant documents. **The first one must be our query!**


In [16]:
# Paper abstract
df.abstract.iloc[2984]

'We propose a novel, efficient approach for distributed sparse learning in high-dimensions, where observations are randomly partitioned across machines. Computationally, at each round our method only requires the master machine to solve a shifted ell_1 regularized M-estimation problem, and other workers to compute the gradient. In respect of communication, the proposed approach provably matches the estimation error bound of centralized methods within constant rounds of communications (ignoring logarithmic factors). We conduct extensive experiments on both simulated and real world datasets, and demonstrate encouraging performances on high-dimensional regression and classification tasks.'

In [21]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[2984]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nIndex IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 52.00844955444336, 57.677120208740234, 60.324066162109375, 61.46210861206055, 63.674522399902344, 64.19221496582031, 65.92388153076172, 67.6544189453125, 67.86244201660156]

Index IDs: [2985, 1845, 678, 3171, 1437, 2179, 3120, 2803, 1255, 4074]


In [22]:
# Fetch the paper titles based on their index
id2details(df, I, 'title')

[['Efficient Distributed Learning with Sparsity'],
 ['SparCML: high-performance sparse communication for machine learning'],
 ['Online learning with kernels'],
 ['Random Rotation Ensembles'],
 ['ARTMAP: supervised real-time learning and classification of nonstationary data by a self-organizing neural network'],
 ['Local-Learning-Based Feature Selection for High-Dimensional Data Analysis'],
 ['Machine learning vortices at the Kosterlitz-Thouless transition'],
 ['Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate'],
 ['Learning with Marginalized Corrupted Features'],
 ['Study and Observation of the Variation of Accuracies of KNN, SVM, LMNN, ENN Algorithms on Eleven Different Datasets from UCI Machine Learning Repository']]

In [23]:
# Fetch the paper abstracts based on their index
id2details(df, I, 'abstract')

[['We propose a novel, efficient approach for distributed sparse learning in high-dimensions, where observations are randomly partitioned across machines. Computationally, at each round our method only requires the master machine to solve a shifted ell_1 regularized M-estimation problem, and other workers to compute the gradient. In respect of communication, the proposed approach provably matches the estimation error bound of centralized methods within constant rounds of communications (ignoring logarithmic factors). We conduct extensive experiments on both simulated and real world datasets, and demonstrate encouraging performances on high-dimensional regression and classification tasks.'],
 ['Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel" distributed across many nodes. Each node\'s contribution to the overall gradient is summed using a global allred


## Putting it all together

So far, we've built a Faiss index using the abstract vectors we encoded with a sentence-DistilBERT model. That's helpful, but in a real-world scenario we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

1. Encode the query with the same sentence-DistilBERT model we used for the rest of the abstract vectors.
2. Change its data type to float32.
3. Search the index with the encoded query.
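
The three steps above can be sketched as a single function. This is an assumption about what such a wrapper could look like, not the actual `vector_search` from `vector_engine.utils`. The `ToyEncoder` and `ToyIndex` classes are hypothetical stand-ins so that the example runs without a model or Faiss; they mimic only the `encode` and `search` calls used in this notebook.

```python
import numpy as np

def search_unseen(queries, model, index, num_results=10):
    """Encode text queries, cast them to float32, and search the index."""
    vectors = model.encode(list(queries))            # Step 1: same encoder as the corpus
    vectors = np.asarray(vectors, dtype="float32")   # Step 2: Faiss-compatible dtype
    return index.search(vectors, num_results)        # Step 3: k-nearest-neighbour search

class ToyEncoder:
    """Hypothetical stand-in for SentenceTransformer: text -> [length, vowel count]."""
    def encode(self, texts):
        return np.array([[len(t), sum(c in "aeiou" for c in t)] for t in texts],
                        dtype="float64")

class ToyIndex:
    """Hypothetical stand-in for a Faiss index with custom IDs (brute-force squared L2)."""
    def __init__(self, vectors, ids):
        self.vectors = vectors
        self.ids = np.asarray(ids, dtype="int64")

    def search(self, x, k):
        dist = ((x[:, None, :] - self.vectors[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(dist, axis=1, kind="stable")[:, :k]
        return np.take_along_axis(dist, order, axis=1), self.ids[order]

corpus = ["faiss", "vector search", "sentence embeddings"]
encoder = ToyEncoder()
toy_index = ToyIndex(encoder.encode(corpus).astype("float32"), ids=[101, 102, 103])

D, I = search_unseen(["vector search"], encoder, toy_index, num_results=2)
print(I[0].tolist(), D[0].tolist())
```

The important design point is that the query goes through the same encoder as the indexed documents; vectors produced by a different model would not be comparable.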

Here, we will use a passage about the impact of machine learning on radiology as the query.


In [25]:
user_query = """
There have been tremendous advances in artificial intelligence (AI) and machine learning (ML) within the past decade, 
especially in the application of deep learning to various challenges. These include advanced competitive games (such as Chess and Go), 
self-driving cars, speech recognition, and intelligent personal assistants. Rapid advances in computer vision for recognition of 
objects in pictures have led some individuals, including computer science experts and health care system experts in machine learning, 
to make predictions that ML algorithms will soon lead to the replacement of the radiologist. However, there are complex technological, 
regulatory, and medicolegal obstacles facing the implementation of machine learning in radiology that will definitely preclude replacement 
of the radiologist by these algorithms within the next two decades and beyond. While not a comprehensive review of machine learning, 
this article is intended to highlight specific features of machine learning which face significant technological and health care systems challenges. 
Rather than replacing radiologists, machine learning will provide quantitative tools that will increase the value of diagnostic imaging as a biomarker, 
increase image quality with decreased acquisition times, and improve workflow, communication, and patient safety.
"""

In [26]:
# For convenience, I've wrapped all steps in the vector_search function.
# It takes four arguments: 
# A query, the sentence-level transformer, the Faiss index and the number of requested results
D, I = vector_search([user_query], model, index, num_results=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nIndex IDs: {I.flatten().tolist()}')

L2 distance: [7.857951095369131e-11, 7.857951095369131e-11, 64.52151489257812, 67.89201354980469, 72.01084899902344, 72.01084899902344, 73.52568817138672, 74.28388977050781, 74.80601501464844, 75.0186538696289]

Index IDs: [3163, 3069, 388, 150, 1781, 1743, 245, 140, 1745, 2501]


In [28]:
# Fetching the paper titles based on their index
id2details(df, I, 'title')

[['Will machine learning end the viability of radiology as a thriving medical specialty?'],
 ['Will machine learning end the viability of radiology as a thriving medical specialty?'],
 ['Applications of Deep Learning and Reinforcement Learning to Biological Data'],
 ['Unintended Consequences of Machine Learning in Medicine'],
 ['Deep Learning: The Good, the Bad, and the Ugly.'],
 ['Deep Learning: The Good, the Bad, and the Ugly.'],
 ['Entanglement-based machine learning on a quantum computer.'],
 ['Machine Learning in Medicine.'],
 ['Machine Learning: A Historical and Methodological Analysis'],
 ['A Review of Deep Machine Learning']]

In [30]:
# Define project base directory
# Change the index from 1 to 0 if you run this on Google Colab
project_dir = Path('notebooks').resolve().parents[1]
print(project_dir)

# Build the output path relative to project_dir. Appending the absolute
# '/home/jj/Desktop/...' path to project_dir duplicated the prefix and
# raised FileNotFoundError.
index_path = project_dir / "semantic_search_engine" / "models" / "faiss_index.pickle"
index_path.parent.mkdir(parents=True, exist_ok=True)

# Serialise index and store it as a pickle
with open(index_path, "wb") as h:
    pickle.dump(faiss.serialize_index(index), h)

/home/jj/Desktop