> DUPLICATE THIS COLAB TO START WORKING ON IT, using File > Save a copy in Drive.


# Week 3: FAISS Tutorial

### What we are looking at
The goal of this small tutorial is to give you a quick overview of what FAISS does and how you can use it for the Week 3 project. FAISS is a library for efficiently storing and searching embeddings of objects (e.g. sentences, images, ...). Storing the embeddings in an index lets us quickly compare a query object against the objects in the index and find the most similar ones. Many FAISS index types use approximate nearest neighbor search to stay fast on large collections; the flat index used in this tutorial performs an exact search, which is fine for a dataset this small.
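
To make the idea concrete, here is a minimal NumPy sketch of the comparison an index performs, assuming the inner product is used as the similarity measure. The toy vectors and the helper name `search` are made up for illustration; FAISS performs the same kind of lookup far more efficiently.

```python
import numpy as np

# Toy "index": four stored embeddings with 3 dimensions each (made-up values).
stored = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)

def search(query, vectors, k):
    """Return (scores, indices) of the k stored vectors with the largest inner product."""
    scores = vectors @ query        # one similarity score per stored vector
    top = np.argsort(-scores)[:k]   # indices ordered by descending score
    return scores[top], top

query = np.array([1.0, 0.2, 0.0], dtype=np.float32)
scores, indices = search(query, stored, k=2)
print(indices, scores)
```

Changing `k` changes how many neighbours come back, which is the same knob you will play with when querying the FAISS index later on.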

### Instructions

1. Go through all the steps and look at what kind of embeddings we create.
1. Feel free to add more sentences to be embedded.
1. Make sure to have a look at the interactive graph, and see how close some results are, and how some are not. Does it make sense?
1. Have a look at the results retrieved from the FAISS index we made. Are they appropriate? Try and play around with the number of results it retrieves.

### Code Overview

- Dependencies: Install and import Python dependencies
- Dataset creation
- Sentence embeddings (spaCy by default; the OpenAI or Cohere APIs are optional alternatives)
- Creating a FAISS index


## Dependencies

✨ Now let's get started! To kick things off, as always, we will import the dependencies we need (install any that are missing from your runtime first).

In [12]:
import faiss
import umap
import numpy as np
import pandas as pd
import altair as alt
import spacy

## Dataset creation

Below we create our own small dataset, and it's WONDERFUL 🤩. Please feel free to add your own examples to it, the more the better ✨✨! We use spaCy to quickly compute sentence embeddings that we can store in our FAISS index.

In [13]:
sentences = [
             # Movies
             "I am watching a movie.",
             "I'm going to the movies.",
             "Cinema's popcorn smell is amazing.",
             "These guys kept talking while I was watching the movie.",
             # Groceries
             "Groceries are expensive now?",
             "What happend to all my groceries, they are all rotten.",
             "I like avocado toast",
             "Cheese is over there!",
             "Spinach is the food of the gods.",
             "Healthy dose of protein powder is always good.",
             # Music
             "Coldplay is not my favorite band anymore.",
             "I really liked MTV, with all the video clips.",
             "What music would you like me to play?",
             "He's playing piano very well."
             ]

In [14]:
nlp = spacy.load("en_core_web_lg")
vectors = np.array([nlp(sentence).vector for sentence in sentences])
vectors.shape

# NOTE: We are using spaCy here because it is free. If you like, you can use one of the following
#       commercial offerings instead:
#       - OpenAI
#       - Cohere

# Here is an example of how to embed the same sentences using the OpenAI API
# (the resulting vectors will have a different dimension than spaCy's 300):
# from openai import OpenAI
# client = OpenAI()

# response = client.embeddings.create(
#     input=sentences,
#     model="text-embedding-3-small"
# )

# vectors = np.array([data.embedding for data in response.data])
# vectors.shape

(14, 300)
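
For spaCy pipelines that ship static word vectors, such as `en_core_web_lg`, `Doc.vector` defaults to the average of the token vectors, which is why every sentence ends up with the same 300 dimensions regardless of its length. A rough NumPy sketch of that averaging, using a made-up three-dimensional vocabulary (the `toy_vectors` table and the `embed` helper are hypothetical and not part of spaCy):

```python
import numpy as np

# Hypothetical word vectors; real spaCy vectors have 300 dimensions.
toy_vectors = {
    "i": np.array([0.2, 0.0, 0.1]),
    "like": np.array([0.4, 0.2, 0.0]),
    "cheese": np.array([0.0, 0.8, 0.2]),
}

def embed(sentence):
    """Average the vectors of all whitespace-separated tokens in the sentence."""
    tokens = sentence.lower().split()
    return np.mean([toy_vectors[token] for token in tokens], axis=0)

sentence_vector = embed("I like cheese")
print(sentence_vector.shape)
```

Because common words such as "I" and "like" contribute to the average just as much as topical words, sentences that share them can end up close together even when their topics differ.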

Below we make use of UMAP and Altair. We use UMAP to reduce the dimensionality of our embeddings so that they can be plotted, and Altair to make an interactive plot.

Please hover over some of these points and see if you can identify a pattern.

In [15]:
# UMAP reduces dimensions from 300 to 2, which we can plot
reducer = umap.UMAP()
umap_embeds = reducer.fit_transform(vectors)
# Make interactive plot
df_explore = pd.DataFrame(data={'text': sentences})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(width=700, height=400)
chart.interactive()



## Creating a FAISS index: the good stuff.
Creating a FAISS index is rather straightforward.
1. Pick the index type you want to use and create it with the dimension of your embeddings.
1. Add all the embeddings you want to search over.

Since we made embeddings of sentences, we can now query this index with an example like *"I like eating cabbage"*. We turn this into an embedding and search for related sentences in our small index.

In [16]:
# Create our nearest neighbour index. IndexFlatIP does an exact (brute-force)
# inner-product search; see the guidelines below for approximate (ANN) index types.
# https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
faiss_index = faiss.IndexFlatIP(vectors.shape[1])

# Convert to float32 (FAISS expects float32 input) to prevent this bug:
# https://github.com/facebookresearch/faiss/issues/461
faiss_index.add(vectors.astype(np.float32))

# Create an embedding for our query sentence, shaped (1, dim) and float32
vector = np.array(nlp('I like eating cabbage!').vector, dtype=np.float32).reshape(1, -1)

# Get the results
scores, indices = faiss_index.search(vector, 5)

# Print the results
for index, score in zip(indices[0], scores[0]):
    print(f'{sentences[index]} | score: {score}')

I like avocado toast | score: 1439.9027099609375
I am watching a movie. | score: 1260.8111572265625
I'm going to the movies. | score: 1248.7900390625
What music would you like me to play? | score: 1124.860595703125
What happend to all my groceries, they are all rotten. | score: 940.5963134765625
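
The scores above are raw inner products, so they grow with the length of the vectors and are not bounded the way a cosine similarity is. One common option is to L2-normalize both the stored vectors and the query before searching, so that the inner product equals the cosine similarity. A minimal NumPy sketch with made-up vectors (the `l2_normalize` helper is written here for illustration; FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place):

```python
import numpy as np

def l2_normalize(x):
    """Scale every row of x to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

# Made-up stored vectors and query; the second stored vector is simply very long.
stored = np.array([[3.0, 4.0], [10.0, 0.0], [0.0, 0.5]], dtype=np.float32)
query = np.array([[6.0, 8.0]], dtype=np.float32)

raw_scores = (query @ stored.T)[0]                                 # plain inner products
cosine_scores = (l2_normalize(query) @ l2_normalize(stored).T)[0]  # cosine similarities

print(raw_scores)
print(cosine_scores)
```

In this toy example the raw inner product ranks the long second vector first, while the cosine similarity ranks the first vector (which points in exactly the same direction as the query) first. Whether normalizing improves the results for your own sentences is something to try out rather than assume.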


✨ Tada ✨, hopefully the results match your expectations!

🙌 Good luck with the project! 🙌