# Create HNSW index

This notebook builds the vector index for the data in the `data` folder. The embeddings are indexed with an HNSW (Hierarchical Navigable Small World) index built with Faiss.
The index is then exported to a file called `index.faiss` in the `data` folder.

The database uses DuckDB and is stored in the file `hn.duckdb`, located in the `data` folder.

Here is the schema for the table 'story':

```
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ id          │ INTEGER     │ NO      │ PRI     │         │       │
│ title       │ VARCHAR     │ YES     │         │         │       │
│ url         │ VARCHAR     │ YES     │         │         │       │
│ score       │ INTEGER     │ YES     │         │         │       │
│ time        │ INTEGER     │ YES     │         │         │       │
│ comments    │ INTEGER     │ YES     │         │         │       │
│ author      │ VARCHAR     │ YES     │         │         │       │
│ embeddings  │ FLOAT[]     │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
```

The embeddings are stored as a float array of size 1536.

## Code

### Imports

This notebook uses the following libraries:
- [duckdb](https://duckdb.org/) for the database
- [faiss](https://github.com/facebookresearch/faiss) for the index
- [numpy](https://numpy.org/) for the embeddings


In [22]:
from faiss import IndexHNSWFlat, IndexIDMap, write_index
import duckdb
import numpy as np

### Data loading

In [21]:
db = duckdb.connect(database='data/hn.duckdb', read_only=True)

"""
embeddings is a dictionary with 2 keys: id and embeddings

embeddings['id'] is a NumPy array of story ids
embeddings['embeddings'] is a NumPy array holding one embedding (a float array) per story
"""

# Load the embeddings
embeddings = db.execute("SELECT id, embeddings FROM story WHERE embeddings IS NOT NULL").fetchnumpy()
print("I have %d embeddings loaded from the database" % len(embeddings["id"]))

ids = embeddings['id']
embeddingsNP = embeddings['embeddings']
print("I'm now converting the embeddings to a numpy array")
embeddingsNP = np.stack(embeddingsNP)
print("All data is now in memory")


I have 108477 embeddings loaded from the database
I'm now converting the embeddings to a numpy array
All data is now in memory
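
Faiss expects the vectors as a 2-D, C-contiguous `float32` matrix and the ids as `int64`. Whether the arrays returned by `fetchnumpy()` already satisfy this depends on the DuckDB column types and on the Faiss version, so an explicit check can be a cheap safeguard. The following is a minimal sketch on synthetic stand-in data; the array contents and the names `raw_ids`, `raw_embeddings`, `matrix` and `ids64` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the fetchnumpy() result (assumed shapes and values):
# an int32 id column and an object array holding one float32 vector per row.
raw_ids = np.array([11, 12, 13], dtype=np.int32)
raw_embeddings = np.empty(3, dtype=object)
for i in range(3):
    raw_embeddings[i] = np.full(4, float(i), dtype=np.float32)

# Stack the per-row vectors into one (n, dim) matrix and normalize the dtypes.
matrix = np.ascontiguousarray(np.stack(raw_embeddings), dtype=np.float32)
ids64 = raw_ids.astype(np.int64)

# Sanity checks before handing the data to the index.
assert matrix.ndim == 2
assert matrix.shape[0] == ids64.shape[0]

print(matrix.shape, matrix.dtype, ids64.dtype)  # (3, 4) float32 int64
```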


### Create the index

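Using Faiss, we create an HNSW index over the embeddings with `M = 64`, which controls the number of neighbors per graph node. Note that `efConstruction` is declared in the cell but never passed to the index, so the remaining HNSW parameters keep their Faiss defaults.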

In [28]:
%%time
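DIMENSION = 1536  # size of each embedding vector
M = 64  # number of neighbors per node in the HNSW graph
efConstruction = 500  # declared but not applied to the index below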

print("I'm now creating the index")
index = IndexHNSWFlat(DIMENSION, M)

# We wrap it in an IndexIDMap so search results return the story ids
# instead of sequential positions
index = IndexIDMap(index)

print("I'm now adding the embeddings to the index")
index.add_with_ids(embeddingsNP, ids)

print("I'm now saving the index")
write_index(index, "data/index.faiss")

I'm now creating the index
I'm now adding the embeddings to the index
I'm now saving the index
CPU times: user 3min 42s, sys: 2.4 s, total: 3min 44s
Wall time: 32.7 s
