### Uncomment and run the following cells if you work on Google Colab :) Don't forget to change your runtime type to GPU!

In [77]:
!git clone https://github.com/kstathou/vector_engine

fatal: destination path 'vector_engine' already exists and is not an empty directory.


In [78]:
cd vector_engine

/content/vector_engine/vector_engine


In [79]:
pip install -r requirements.txt

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'


### Let's begin!

In [80]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [81]:
%autoreload 2
# Used to import data from local.
import pandas as pd

# Used to create the dense document vectors.
import torch
from sentence_transformers import SentenceTransformer

# Used to create and store the Faiss index.
import faiss
import numpy as np
import pickle
from pathlib import Path

# Used to do vector searches and display the results.
from vector_engine.utils import vector_search, id2details

Stored and processed data in s3

The [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers) offers pretrained transformers that produce state-of-the-art sentence embeddings. Check out this [spreadsheet](https://docs.google.com/spreadsheets/d/14QplCdTCDwEmTqrn1LH4yrbKvdogK4oQvYO1K1aPR5M/) with all the available models.

In this tutorial, we will use the `distilbert-base-nli-stsb-mean-tokens` model, which has the best performance on Semantic Textual Similarity tasks among the DistilBERT versions. Although it performs slightly worse than BERT, it is considerably faster thanks to its smaller size.

I use the same model in [Orion's semantic search engine](https://www.orion-search.org/)!

In [82]:
# Instantiate the sentence-level DistilBERT
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
# Check if GPU is available and use it
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))
print(model.device)

cuda:0


In [83]:
# Load the corpus. Format: one paragraph per line.

with open("/content/corpus_gpt2_final.txt") as fin:
  corpus = [line.rstrip('\n') for line in fin]
print('Loaded all files, num total lines:', len(corpus))


Loaded all files, num total lines: 364


In [84]:
# One integer ID per paragraph (its position in the corpus); Faiss expects int64 IDs.
ids = np.arange(len(corpus), dtype="int64")

In [86]:
len(ids)

364

In [99]:
# Convert the corpus paragraphs to vectors
embeddings = model.encode(corpus, show_progress_bar=True)

Batches:   0%|          | 0/12 [00:00<?, ?it/s]

In [88]:
print(f'Shape of the vectorised abstract: {embeddings[0].shape}')

Shape of the vectorised abstract: (768,)


## Vector similarity search with Faiss
[Faiss](https://github.com/facebookresearch/faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, even ones that do not fit in RAM. 
    
Faiss is built around the `Index` object which contains, and sometimes preprocesses, the searchable vectors. Faiss has a large collection of [indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). You can even create [composite indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)). Faiss handles collections of vectors of a fixed dimensionality d, typically a few 10s to 100s.

**Note**: Faiss uses only 32-bit floating point matrices. This means that you will have to change the data type of the input before building the index.
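As a minimal sketch of that conversion, assuming the vectors arrive as a NumPy array (or nested list) of some other dtype. The helper name `to_faiss_matrix` is hypothetical and not part of Faiss; it only casts to float32 and makes the array 2-D and C-contiguous, which is the layout the Faiss Python wrapper works with.

```python
import numpy as np

def to_faiss_matrix(vectors):
    """Return a C-contiguous 2-D float32 array (hypothetical helper, not a Faiss API)."""
    matrix = np.ascontiguousarray(vectors, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {matrix.ndim} dimension(s)")
    return matrix

# NumPy stores Python floats as float64 by default, so a cast is needed.
raw = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
matrix = to_faiss_matrix(raw)
print(raw.dtype, "->", matrix.dtype)
```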

To learn more about Faiss, you can read their paper on [arXiv](https://arxiv.org/abs/1702.08734).

Here, we will use the `IndexFlatL2` index:
- It's a simple index that performs a brute-force L2 distance search.
- It scales linearly. It will work fine with our data, but you might want to try [faster indexes](https://github.com/facebookresearch/faiss/wiki/Faster-search) if you work with millions of vectors.
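To make "brute force" concrete, here is a small NumPy sketch of an exhaustive search. It is an illustration under assumptions, not Faiss's implementation: the function name `brute_force_l2` and the toy data are made up. Like `IndexFlatL2`, it reports squared L2 distances, and its cost grows linearly with the number of stored vectors because every query is compared against every one of them.

```python
import numpy as np

def brute_force_l2(database, queries, k):
    """Exhaustive k-nearest-neighbour search using squared L2 distance."""
    # Pairwise differences, shape (n_queries, n_database, dim).
    diff = queries[:, None, :] - database[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Sort each row and keep the k closest database rows.
    nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dist, nearest, axis=1), nearest

# Toy data (hypothetical): four 2-D vectors and one query.
database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")
distances, neighbours = brute_force_l2(database, queries, k=2)
print(neighbours[0].tolist(), distances[0].tolist())
```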

To create an index with the corpus vectors, we will:
1. Change the data type of the vectors to float32.
2. Build an index and pass it the dimension of the vectors it will operate on.
3. Pass the index to IndexIDMap, an object that enables us to provide a custom list of IDs for the indexed vectors.
4. Add the vectors and their ID mapping to the index. In our case, each ID is simply the paragraph's position in `corpus`.

In [89]:
# Step 1: Change data type
embeddings = np.asarray(embeddings, dtype="float32")

# Step 2: Instantiate the index
index = faiss.IndexFlatL2(embeddings.shape[1])

# Step 3: Pass the index to IndexIDMap
index = faiss.IndexIDMap(index)

# Step 4: Add vectors and their IDs
index.add_with_ids(embeddings, ids)

print(f"Number of vectors in the Faiss index: {index.ntotal}")

Number of vectors in the Faiss index: 364


### Searching the index
The index we built will perform a k-nearest-neighbour search. We have to provide the number of neighbours to be returned. 

Let's query the index with a paragraph from our dataset and retrieve the 10 most relevant documents. **The first one must be our query!**


In [102]:
# Paper abstract
corpus[1]

'I need a possibility to print ( Amine offered the possibility to print at his office) junior needs to print as well but just 4A format.'

In [91]:
# Retrieve the 10 nearest neighbours
D, I = index.search(np.array([embeddings[1]]), k=10)
print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')

L2 distance: [0.0, 182.37698364257812, 189.46029663085938, 189.4813690185547, 190.33587646484375, 190.70831298828125, 195.72972106933594, 197.02232360839844, 198.30970764160156, 199.48648071289062]

MAG paper IDs: [1, 355, 82, 83, 346, 106, 292, 81, 48, 156]
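A short sketch of how to read these two arrays, using the first five values copied from the printed output above. The variable names are illustrative. It checks the two properties we rely on: the query comes back first at distance zero, and hits are ordered from nearest to farthest.

```python
# Values copied from the printed output above (first five hits only).
distances = [0.0, 182.37698364257812, 189.46029663085938,
             189.4813690185547, 190.33587646484375]
result_ids = [1, 355, 82, 83, 346]
query_id = 1  # we searched with embeddings[1]

# The query itself should come back first, at distance zero.
first_is_query = result_ids[0] == query_id and distances[0] == 0.0
# Hits are ordered from nearest to farthest.
is_sorted = distances == sorted(distances)
print(first_is_query, is_sorted)

# Pair each ID with its distance for easier inspection.
ranked = list(zip(result_ids, distances))
for rank, (doc_id, dist) in enumerate(ranked, start=1):
    print(f"{rank}. id={doc_id} distance={dist:.2f}")
```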


In [92]:
[corpus[idx] for idx in I[0]]

['I need a possibility to print ( Amine offered the possibility to print at his office) junior needs to print as well but just 4A format.',
 'In Not Lost in You, how did you choose the special pattern clothes for those moving bodies to wear?Since there is a performative act in the video series not lost in you and one person acts with the avatar, but I consider the basic feeling and question of the work to be something general that everyone knows, I also wanted to abstract and generalize the human body from the person acting in the video, to achieve this I chose the ornamental Nylon fabrics.',
 "I will basically focused on the way of production of Bernd an Hilla Becher,then how they teach, how the 'duesseldorf school' came up, what kind of production do we see and how do they work and how they find their inspiration. After that i will talk about how my educational way was, what Gursky taught in his class and what i learn from him and how we see photography nowadays, therefore i would na

In [93]:
corpus[80]

'I would be very happy to welcome you in April in my studio. please let me know when it is a good time for you the first part of the month i am completely free, so any time is good for me.'

In [94]:
def id2details(corpus, I):
    """Returns the corpus entries for the IDs in the first row of I."""
    return [corpus[idx] for idx in I[0]]

In [95]:
def vector_search(query, model, index, num_results=10):
    """Transforms a query to a vector using a pretrained, sentence-level
    DistilBERT model and finds similar vectors using FAISS.
    
    Args:
        query (str or list of str): User query that should be more than a sentence long.
        model (sentence_transformers.SentenceTransformer.SentenceTransformer)
        index (faiss.Index): FAISS index to search.
        num_results (int): Number of results to return.
    
    Returns:
        D (:obj:`numpy.array` of `float`): Distance between results and query.
        I (:obj:`numpy.array` of `int`): ID of the results.
    
    """
    # Wrap a single string so it is encoded whole, not character by character.
    queries = [query] if isinstance(query, str) else list(query)
    vector = model.encode(queries)
    D, I = index.search(np.array(vector).astype("float32"), k=num_results)
    return D, I
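A detail worth keeping in mind when passing queries to a sentence encoder: calling `list()` on a plain string splits it into characters, so each character would be encoded as its own "sentence". A small normalising helper avoids this. The name `as_sentence_list` is hypothetical and only serves the illustration.

```python
def as_sentence_list(query):
    """Wrap a single string in a list; copy any other sequence of strings as is."""
    if isinstance(query, str):
        return [query]
    return list(query)

split_chars = list("hi")                 # a string is split into characters
single = as_sentence_list("hi")          # the whole query stays one sentence
passthrough = as_sentence_list(["hi"])   # a list passes through unchanged
print(split_chars, single, passthrough)
```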

In [96]:
# Fetch the paper titles based on their index
id2details(corpus, I)

['I need a possibility to print ( Amine offered the possibility to print at his office) junior needs to print as well but just 4A format.',
 'In Not Lost in You, how did you choose the special pattern clothes for those moving bodies to wear?Since there is a performative act in the video series not lost in you and one person acts with the avatar, but I consider the basic feeling and question of the work to be something general that everyone knows, I also wanted to abstract and generalize the human body from the person acting in the video, to achieve this I chose the ornamental Nylon fabrics.',
 "I will basically focused on the way of production of Bernd an Hilla Becher,then how they teach, how the 'duesseldorf school' came up, what kind of production do we see and how do they work and how they find their inspiration. After that i will talk about how my educational way was, what Gursky taught in his class and what i learn from him and how we see photography nowadays, therefore i would na


## Putting it all together

So far, we've built a Faiss index using the corpus vectors we encoded with a sentence-DistilBERT model. That's helpful, but in a real-world scenario we would have to work with unseen data. To query the index with an unseen query and retrieve its most relevant documents, we would have to do the following:

1. Encode the query with the same sentence-DistilBERT model we used for the corpus vectors.
2. Change its data type to float32.
3. Search the index with the encoded query.
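The three steps can be sketched end to end without loading the model. Everything here is a stand-in: `toy_encode` is a hypothetical replacement for `model.encode` (it just counts letters), and a NumPy brute-force search replaces the Faiss index. Only the order of operations mirrors the real pipeline.

```python
import numpy as np

def toy_encode(sentences):
    """Stand-in for model.encode: one 26-dim letter-count vector per sentence."""
    vectors = np.zeros((len(sentences), 26))
    for row, sentence in enumerate(sentences):
        for char in sentence.lower():
            if "a" <= char <= "z":
                vectors[row, ord(char) - ord("a")] += 1
    return vectors

def search_unseen(query, encode, database, k=1):
    vector = encode([query])                           # 1. encode the query
    vector = vector.astype("float32")                  # 2. cast to float32
    sq_dist = ((database - vector) ** 2).sum(axis=1)   # 3. search (brute force here)
    order = np.argsort(sq_dist, kind="stable")[:k]
    return sq_dist[order], order

# Made-up documents, used only for this illustration.
docs = ["printing at the office", "architecture and art", "studio visit in April"]
database = toy_encode(docs).astype("float32")
distances, ids = search_unseen("art and architecture", toy_encode, database, k=1)
print(docs[ids[0]])
```

In the notebook itself, `vector_search` plays the role of `search_unseen`, with the real model and index.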

Here, we will use the introduction of an article published on [HKS Misinformation Review](https://misinforeview.hks.harvard.edu/article/can-whatsapp-benefit-from-debunked-fact-checked-stories-to-reduce-misinformation/).


In [97]:
# Function that retrieves the closest corpus paragraph for the input.
def generate(qa):
  '''Return the corpus paragraph nearest to a question-answer string.

  Uses the sentence-level transformer (model), the Faiss index (index)
  and the corpus defined above, and requests a single result.
  '''
  D, I = vector_search([qa], model, index, num_results=1)
  print(f'L2 distance: {D.flatten().tolist()}\n\nMAG paper IDs: {I.flatten().tolist()}')
  # Fetch the matching paragraph based on its index.
  output = id2details(corpus, I)
  return output
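`generate` returns a one-element list. When that result is later written to a file, taking the element by index is safer than cleaning up the list's string form with `strip` calls, because `rstrip` and `lstrip` remove any run of the listed characters, including ones that belong to the text. A small illustration with a made-up result string:

```python
result = ["see [1]"]  # a one-element list, like the one generate() returns

# Stripping characters from the list's string form also eats legitimate text.
stripped = str(result).rstrip("']").lstrip("['")
# Indexing returns the string untouched.
direct = result[0]

print(repr(stripped))
print(repr(direct))
```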

In [103]:
# Load the question/short-answer pairs, retrieve an extension for each pair, and save the results to a file.
with open("/content/qa.txt", "r") as fin, open("/content/transformer_faiss_output.txt", "w") as transformer_faiss_output:
  for qa in fin:
    qa = qa.strip()
    print(qa)
    if not qa.endswith("."):
      qa = qa + "."
    output = generate(qa)
    # generate() returns a one-element list; write the paragraph itself.
    print(output[0], file=transformer_faiss_output)


What inspires your art? A lot, experiences, texts, songs, films, conversations, observations, the daily life as news.
L2 distance: [147.82144165039062]

MAG paper IDs: [340]
What inspires your art? Many things which are coming together, to me it is the mixture of researches I am  doing, life and observations as experiences.
L2 distance: [132.22988891601562]

MAG paper IDs: [340]
What inspires your art? Many things which are coming together, to me it is the mixture of researches I am  doing, life and observations as experiences.
L2 distance: [132.22988891601562]

MAG paper IDs: [340]
What inspires your art? A lot, experiences, texts, songs, films, conversations, observations, the daily life as news.
L2 distance: [147.82144165039062]

MAG paper IDs: [340]
Are you interested in architecture? Yes I am very interested in architecture
L2 distance: [260.4167785644531]

MAG paper IDs: [138]
Are you interested in architecture? Yes architecture is also a big reference in my work
L2 distance: [24