# FAISS With Haystack

At the time of writing, FAISS is unfortunately **not** supported on Windows, so if you are on Windows you will need to stick with Elasticsearch. If you have access to Linux or macOS, continue.

We'll be using Haystack again, so setup is fortunately very straightforward. We first import and initialize a FAISS document store using logic very similar to what we used before, but this time we will be storing the FAISS index locally.

Storing the index locally means that we will need two files: a SQLite database and the FAISS index. We create the FAISS index later, but we create the SQLite database on initialization.

We will store both in the `models` directory, but adjust this to your own needs.

In [45]:
path = '../../models/faiss'

import os

if not os.path.exists(path):
    os.makedirs(path)
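The existence check above can also be folded into a single call. A minimal sketch, assuming Python 3.5 or later; `ensure_dir` is a hypothetical helper name used only for illustration:

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    # exist_ok=True makes the call a no-op when the directory is already there
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling `ensure_dir(path)` a second time is harmless, which is convenient when the notebook is re-run.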

And now we include this path within a SQLite database location string in the following document store initialization.

In [46]:
# from haystack import faiss
from haystack.document_stores.faiss import FAISSDocumentStore

# initialize FAISS
document_store = FAISSDocumentStore(
    faiss_index_factory_str='Flat',
    sql_url=f'sqlite:///{path}/squad_dev.db',
    return_embedding=True
)
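For intuition, `'Flat'` asks FAISS for an index that stores every vector uncompressed and compares the query against all of them, so results are exact. The pure-Python sketch below illustrates that idea only. It is not how FAISS is implemented, and `flat_search` and the toy vectors are hypothetical. It scores by dot product, which is assumed here to match the document store's default similarity setting.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(query, vectors, top_k=3):
    """Exhaustively score every stored vector and return the best matches.

    Returns a list of (index, score) pairs, highest score first.
    """
    scored = [(i, dot(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (made up for illustration)
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
results = flat_search([1.0, 0.2, 0.0], vectors, top_k=2)
```

Because every stored vector is scored, search time grows linearly with the number of documents. That is acceptable for a dataset of this size.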

Next, we load our validation data from file, which we will be adding to the FAISS index.

In [47]:
import json

with open('../../data/squad/dev.json', 'r') as f:
    squad = json.load(f)

## Adding Data

As we saw with Elasticsearch, our current FAISS index has been initialized but contains nothing. Now we need to populate the index with our *dev.json* data. 

This time, we'll be making use of the Haystack `Document` object, which we import with:

In [48]:
from haystack import Document

This object prepares our data into the correct object format for our document stores - which in this case is FAISS.

Before, we used a dictionary with two keys, `'text'` and `'meta'`. The *Document* object provides two corresponding arguments, `content` and `meta` (older Haystack releases named the first argument `text`). So rather than using the format we used before, which looked like:

```python
{
    'text': '<document text here>',
    'meta': {
        'other': '<other info here>'
    }
}
```

We will be using this *Document* object format instead:

```python
Document(
    content='<document text here>',
    meta={
        'other': '<other info here>'
    }
)
```

Just like before, we will be feeding these *Document* objects into a list, which we will then feed into our FAISS `write_documents` method. Remember, our dataset contains duplicate contexts, so we must remove them first using `list(set(...))`.

In [49]:
# Create list of contexts
contexts = [sample['context'] for sample in squad]

# Remove duplicates
contexts = list(set(contexts))

# Create list of Document objects
# squad_docs = [Document(text=sample) for sample in contexts]
squad_docs = [Document(content=sample) for sample in contexts]
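One side effect worth knowing: `set` does not preserve order, so `list(set(contexts))` can return the contexts in a different order on each run. If a stable order matters (for example, to compare runs), a minimal alternative is `dict.fromkeys`, which keeps the first occurrence of each item in insertion order on Python 3.7+. The sample data below is made up and only mirrors the `'context'` key used above:

```python
# Hypothetical stand-in for the loaded SQuAD samples
squad_sample = [
    {'context': 'Paris is the capital of France.'},
    {'context': 'The Nile flows through Egypt.'},
    {'context': 'Paris is the capital of France.'},
]

contexts = [sample['context'] for sample in squad_sample]

# dict keys are unique and keep insertion order, so duplicates are dropped
# while the first occurrence of each context stays in place
unique_contexts = list(dict.fromkeys(contexts))
```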

Now, because we're storing our FAISS index on file, we may find (if running this script more than once) that we first need to delete any documents that already exist in the index.

In [50]:
# document_store.delete_all_documents()
document_store.delete_documents()

Then we add the data to the index just like before:

In [51]:
document_store.write_documents(squad_docs)

Writing Documents:   0%|          | 0/1204 [00:00<?, ?it/s]

The way that our documents are indexed will depend on the embedding model being used by our retriever. So, we need to initialize our DPR model (the retriever), and then call `update_embeddings` with this retriever.

In [52]:
# from haystack.retriever.dense import DensePassageRetriever # Deprecated
from haystack.nodes.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)



The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizerFast'.


In [53]:
document_store.update_embeddings(retriever=retriever)

Updating Embedding:   0%|          | 0/1204 [00:00<?, ? docs/s]

Create embeddings:   0%|          | 0/1216 [00:00<?, ? Docs/s]

Now that we've fully prepared our document store, we can save it. We will save it to the same location as our SQLite database, this time using the *.faiss* file extension.

In [54]:
document_store.save(f'{path}/squad_dev.faiss')
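Before relying on the saved store, it can be worth checking that the expected files are on disk. The sketch below is a hypothetical helper, not part of Haystack. It assumes that `save` writes a JSON config file next to the index (same name with a `.json` suffix), which is how Haystack 1.x is expected to behave when no separate config path is given. Verify this against your installed version.

```python
from pathlib import Path


def missing_store_files(index_path, db_path):
    """Return the expected store files that are not present on disk."""
    index_file = Path(index_path)
    expected = [
        index_file,                        # the FAISS index itself
        index_file.with_suffix('.json'),   # assumed config file written by save()
        Path(db_path),                     # the SQLite database
    ]
    return [str(p) for p in expected if not p.exists()]
```

For the paths used here, `missing_store_files(f'{path}/squad_dev.faiss', f'{path}/squad_dev.db')` should return an empty list if everything was written.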

Our FAISS index is now saved to file! We'll go ahead and delete the `document_store` and `retriever`, and try reinitializing both using the data we've saved to file.

In [55]:
del document_store, retriever

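All we do now is apply the `load` method directly from `FAISSDocumentStore`, passing the FAISS index location. In the Haystack 1.x releases that this import path belongs to, `save` is expected to also write a JSON config file next to the index, and `load` reads the SQLite database location from it, so we do not pass the database location again. (Older releases used a different signature that took the SQLite URL as a second argument.)

In [60]:
# Older Haystack signature:
# document_store = FAISSDocumentStore.load(f'{path}/squad_dev.faiss', f'sqlite:///{path}/squad_dev.db')

# Passing both paths positionally to the constructor does not work, because the
# constructor does not take these two paths as its positional arguments. It fails with:
#   TypeError: Wrong number or type of arguments for overloaded function 'index_factory'.
# document_store = FAISSDocumentStore(f'{path}/squad_dev.faiss', f'sqlite:///{path}/squad_dev.db')

document_store = FAISSDocumentStore.load(index_path=f'{path}/squad_dev.faiss')
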

And now we can re-initialize our retriever, using the same arguments as before.

In [None]:
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
    passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
    use_gpu=True,
    embed_title=True
)

Finally, we can begin retrieving contexts relevant to our questions using `retriever.retrieve`, which requires a single argument, `query`.

In [None]:
retriever.retrieve('What subject is most abstract?')

And now we've retrieved a few contexts stored within FAISS that our DPR model believes answer our query.
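To skim the results, it can help to print a score and a short preview for each hit. The sketch below assumes each returned item exposes `content` and `score` attributes, as Haystack 1.x `Document` objects are expected to. `FakeDocument` and `preview_results` are hypothetical names, used so that the example runs without Haystack installed:

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document (hypothetical, for illustration)."""
    content: str
    score: float


def preview_results(docs, width=60):
    """Return one 'score | text preview' line per retrieved document."""
    lines = []
    for doc in docs:
        # Collapse newlines so each hit stays on one line
        text = doc.content.replace('\n', ' ')
        if len(text) > width:
            text = text[:width] + '...'
        lines.append(f'{doc.score:.3f} | {text}')
    return lines


docs = [
    FakeDocument(content='Short context.', score=0.91234),
    FakeDocument(content='x' * 100, score=0.5),
]
lines = preview_results(docs)
```

With the real retriever, the same helper would presumably be called as `preview_results(retriever.retrieve('What subject is most abstract?'))`.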