### Uncomment and run the following cells if you work on Google Colab :)

In [None]:
# !git clone https://github.com/kstathou/vector_engine

In [None]:
# cd vector_engine

In [None]:
# pip install -r requirements.txt

### Let's begin!

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2
import pandas as pd
import s3fs
import numpy as np
import torch
import faiss
from sentence_transformers import SentenceTransformer
from vector_engine.utils import vector_search, id2details

The processed data is stored in S3.

In [3]:
# Use pandas to read files from S3 buckets!
df = pd.read_csv('s3://vector-search-blog/misinformation_papers.csv')
df = df.sample(100)

In [4]:
df.head(1)

Unnamed: 0,original_title,abstract,year,citations,id,is_EN
7191,Migrant women and sexual and gender-based viol...,Abstract Background Sexual and Gender-Based Vi...,2020,0,3092205359,1


In [5]:
print(f"Misinformation, disinformation and fake news papers: {df.id.unique().shape[0]}")

Misinformation, disinformation and fake news papers: 100


The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) offers pretrained transformers that produce state-of-the-art sentence embeddings. Check out this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/) with all the available models.

In this tutorial, we will use the `distilbert-base-nli-stsb-mean-tokens` model, which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Although it performs slightly worse than BERT, it is considerably faster thanks to its smaller size.

I use the same model in [Orion's semantic search engine](https://www.orion-search.org/)!

In [4]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

100%|██████████| 245M/245M [00:40<00:00, 5.98MB/s] 


In [25]:
# Convert abstracts to vectors
embeddings = model.encode(df.abstract.to_list(), show_progress_bar=True)

In [26]:
print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)


## Vector similarity search with Faiss
[Faiss](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM. 

Faiss is built around the `Index` object, which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). You can even create [composite indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)). Faiss handles collections of vectors of a fixed dimensionality `d`, typically from a few tens to a few hundred.

**Note**: Faiss works only with 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.
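
As a minimal sketch of that conversion, the snippet below casts an array to float32 with NumPy. The random data is only a stand-in for real embeddings, and the variable names are illustrative. It also requests a C-contiguous (row-major) layout, on the assumption that the Faiss Python wrappers expect one.

```python
import numpy as np

# Stand-in for model output: NumPy creates float64 arrays by default
rng = np.random.default_rng(0)
vectors64 = rng.random((5, 768))

# Cast to float32 and make sure the memory layout is C-contiguous
vectors32 = np.ascontiguousarray(vectors64, dtype=np.float32)

print(vectors64.dtype, "->", vectors32.dtype)
print("C-contiguous:", vectors32.flags["C_CONTIGUOUS"])
```

The cast halves the memory footprint of the vectors, at the cost of some precision that rarely matters for similarity search.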

To learn more about Faiss, you can read their paper on [arXiv](https://arxiv.org/abs/1702.08734).

Here, we will use the `IndexFlatL2` index:
- It's a simple index that performs a brute-force L2 distance search.
- Search time scales linearly with the number of indexed vectors. It will work fine with our data, but you might want to try [faster indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search) if you work with millions of vectors.
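
To make "brute force" concrete, here is a small NumPy sketch of the underlying computation. It is not Faiss code: `squared_l2_to_all` is a hypothetical helper, and the three 2-D vectors are toy data. It compares the query against every stored vector, which is why the cost grows linearly with the size of the collection. Like `IndexFlatL2`, it reports squared L2 distances and skips the square root.

```python
import numpy as np

def squared_l2_to_all(stored, query):
    """Return the squared L2 distance from `query` to every row of `stored`."""
    diff = stored - query                      # broadcasting: (n, d) - (d,)
    return np.einsum("ij,ij->i", diff, diff)   # row-wise sum of squared differences

# Toy data: three stored vectors and one query at the origin
stored = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]], dtype=np.float32)
query = np.array([0.0, 0.0], dtype=np.float32)

distances = squared_l2_to_all(stored, query)
print(distances)  # squared distances 0, 25 and 2 (not 0, 5 and about 1.41)
```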

To create an index with the `misinformation` abstract vectors, we will:
1. Change the data type of the abstract vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to `IndexIDMap`, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the abstract vectors and their ID mapping to the index. In our case, we will map vectors to their paper IDs from Microsoft Academic Graph (MAG).

In [28]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, df.id.values)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 100


### Searching the index
The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with an abstract from our dataset and retrieve the 10 most relevant documents. Because the query vector is itself stored in the index, **the first result must be our query**, at a distance of 0.
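
The claim that the query comes back first can be checked on toy data. The sketch below is a NumPy stand-in for the search, not Faiss itself: `knn_with_ids`, the random vectors, and the IDs are all made up for illustration. It ranks every stored vector by squared L2 distance and returns the custom IDs of the `k` closest ones.

```python
import numpy as np

def knn_with_ids(stored, ids, query, k):
    """Brute-force k-NN: return (squared L2 distances, custom IDs), nearest first."""
    diff = stored - query
    dist = np.einsum("ij,ij->i", diff, diff)
    order = np.argsort(dist, kind="stable")[:k]   # positions of the k smallest distances
    return dist[order], ids[order]

rng = np.random.default_rng(42)
stored = rng.random((100, 8)).astype("float32")   # toy vectors, not real embeddings
ids = np.arange(1000, 1100, dtype="int64")        # made-up document IDs

# Query with a vector that is already stored: it should be its own nearest neighbour
dist, nearest_ids = knn_with_ids(stored, ids, stored[0], k=10)
print(nearest_ids[0], dist[0])
```

The first returned ID is the one assigned to `stored[0]`, and its distance is exactly zero, which is a quick sanity check worth running on a real index too.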


In [31]:
df

Unnamed: 0,original_title,abstract,year,citations,id
9300,Neural data-to-text generation: A comparison b...,"Traditionally, most data-to-text applications ...",2019,0,2969686025
4534,QANet: Combining Local Convolution with Global...,Current end-to-end machine reading and questio...,2018,221,2798858969
9427,Maximizing Stylistic Control and Semantic Accu...,Neural generation methods for task-oriented di...,2019,0,2964239055
12845,ConvBERT: Improving BERT with Span-based Dynam...,Pre-trained language models like BERT and its ...,2020,0,3047171714
841,Distributed Representations of Sentences and D...,Many machine learning algorithms require the i...,2014,1985,2949547296
...,...,...,...,...,...
283,ANN-based Innovative Segmentation Method for H...,Artificial Neural Network (ANN) s has widely b...,2009,13,2161423555
6305,Vietnamese Open Information Extraction,Open information extraction (OIE) is the proce...,2018,0,2775071667
5630,Convolutional neural network compression for n...,Convolutional neural networks are modern model...,2018,8,2803928583
1630,Recognizing Extended Spatiotemporal Expression...,Precise geocoding and time normalization for t...,2015,0,1946516469


In [33]:
# Paper title
df.iloc[0, 0]

'Neural data-to-text generation: A comparison between pipeline and end-to-end architectures'

In [34]:
# Paper abstract
df.iloc[0, 1]

'Traditionally, most data-to-text applications have been designed using a modular pipeline architecture, in which non-linguistic input data is converted into natural language through several intermediate transformations. In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with much less explicit intermediate representations in-between. This study introduces a systematic comparison between neural pipeline and end-to-end data-to-text approaches for the generation of text from RDF triples. Both architectures were implemented making use of state-of-the art deep learning methods as the encoder-decoder Gated-Recurrent Units (GRU) and Transformer. Automatic and human evaluations together with a qualitative analysis suggest that having explicit intermediate steps in the generation process results in better texts than the ones generated by end-to-end approaches. Moreover, the pi

In [36]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[0]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 88.87393188476562, 91.87527465820312, 93.9849853515625, 95.50212860107422, 95.56340026855469, 95.60047912597656, 96.01769256591797, 96.63569641113281, 97.08345794677734]

MAG paper IDs: [2969686025, 2803928583, 2625324377, 3089046173, 2964910501, 2810809989, 3012813596, 3009445007, 2898799786, 3089952842]


In [None]:
# Fetch the paper titles for the retrieved MAG paper IDs
id2details(df, I, 'original_title')
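
The implementation of `id2details` is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below defines a hypothetical equivalent, `lookup_details`, which looks up one column for each ID in the first row of the search result. The DataFrame and IDs are toy values, and the real function in `vector_engine.utils` may differ.

```python
import numpy as np
import pandas as pd

def lookup_details(df, I, column):
    """For each ID in the first row of I, return the matching values of `column`."""
    return [df.loc[df["id"] == paper_id, column].tolist() for paper_id in I[0]]

# Toy stand-ins for the papers DataFrame and the ID matrix returned by a search
papers = pd.DataFrame(
    {"id": [11, 22, 33], "original_title": ["Paper A", "Paper B", "Paper C"]}
)
I = np.array([[33, 11]])

print(lookup_details(papers, I, "original_title"))  # [['Paper C'], ['Paper A']]
```

Each ID yields a list rather than a single value, so duplicate IDs in the DataFrame would show up instead of being silently dropped.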