# Importing LEGO® BrickHeadz™ data into a vector store
![Do you know Wally?](./images/40619_small.png)

This notebook uses data obtained from the LEGO® BrickHeadz™ website. We used the online service [browse.ai](https://browse.ai) to scrape the site into a CSV file, which you can find at [extract-data-brickheadz.csv](./data/extract-data-brickheadz.csv).

The file contains a number of columns that we want to retain to improve the search options. Therefore, we first read and extend the data using Pandas. After that, we use [LangChain](https://langchain.com) together with [OpenAI](https://www.openai.com) to create the embeddings.

In [35]:
from dotenv import load_dotenv
load_dotenv()

True

## Load the data into a Pandas DataFrame
The CSV file was created by running Browse.ai against the original LEGO website. We extract the filename of each image from its URL, so we can use that name to find the image in the images folder. We also need to provide an id; we have chosen to use the Position column for that.

In [1]:
import pandas as pd

df = pd.read_csv('./data/extract-data-brickheadz.csv')
df["image_name"] = df["image_link"].str.rsplit('/').str.get(-1)
df["id"] = df["Position"]

df.head()


Unnamed: 0,Position,product_link,age,number_of_pieces,title,price,image_link,product_description,image_name,id
0,1,https://www.lego.com/nl-nl/product/professors-...,10+,601,Leraren van Zweinstein™,"€39,99",https://www.lego.com/cdn/cs/set/assets/blt8c72...,Dit is een betoverende verrassing voor fans va...,40560.png,1
1,2,https://www.lego.com/nl-nl/product/harry-hermi...,10+,466,"Harry, Hermelien, Ron & Hagrid™","€24,99",https://www.lego.com/cdn/cs/set/assets/bltbc78...,LEGO® BrickHeadz™ versies van 4 van de bekends...,40495.jpg,2
2,3,https://www.lego.com/nl-nl/product/chip-dale-4...,10+,226,Knabbel & Babbel,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt0bac...,Keer terug naar je jeugd met deze leuke LEGO® ...,40550.png,3
3,4,https://www.lego.com/nl-nl/product/woody-and-b...,10+,296,Woody & Bo Peep,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt2bea...,Zorg dat je twee favoriete filmpersonages alti...,40553.png,4
4,5,https://www.lego.com/nl-nl/product/goofy-pluto...,10+,214,Goofy en Pluto,"€14,99",https://www.lego.com/cdn/cs/set/assets/blt4306...,Deze Goofy en Pluto set (40378) met 2 klassiek...,40378.jpg,5
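
The `rsplit('/')` approach above keeps everything after the last slash, which would include a query string if an image URL carried one. As a minimal sketch, assuming such URLs could appear in the data, the standard library can isolate the path first. The helper name `image_name_from_url` and the example URL are hypothetical and do not come from the CSV file.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical URL for illustration only, not taken from the CSV file
example = "https://example.com/cdn/assets/40560.png?format=png&width=200"
print(image_name_from_url(example))  # prints 40560.png
```

In the notebook this could be applied with `df["image_link"].apply(image_name_from_url)` instead of the `.str` chain.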


## Add embeddings
We add embeddings for the title field to the DataFrame using the OpenAI API.

In [6]:
import openai
import os

openai.api_key=os.getenv('OPEN_AI_API_KEY')


In [7]:
def create_embedding(input_str: str):
    # Request a single embedding from the OpenAI API and return it as a list of floats
    embedding_response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=input_str
    )
    return embedding_response["data"][0]["embedding"]

df["embedding"] = df["title"].apply(create_embedding)
df.head()

Unnamed: 0,Position,product_link,age,number_of_pieces,title,price,image_link,product_description,image_name,id,embedding
0,1,https://www.lego.com/nl-nl/product/professors-...,10+,601,Leraren van Zweinstein™,"€39,99",https://www.lego.com/cdn/cs/set/assets/blt8c72...,Dit is een betoverende verrassing voor fans va...,40560.png,1,"[0.0009088412043638527, -0.02413908578455448, ..."
1,2,https://www.lego.com/nl-nl/product/harry-hermi...,10+,466,"Harry, Hermelien, Ron & Hagrid™","€24,99",https://www.lego.com/cdn/cs/set/assets/bltbc78...,LEGO® BrickHeadz™ versies van 4 van de bekends...,40495.jpg,2,"[-0.01215097401291132, -0.019284188747406006, ..."
2,3,https://www.lego.com/nl-nl/product/chip-dale-4...,10+,226,Knabbel & Babbel,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt0bac...,Keer terug naar je jeugd met deze leuke LEGO® ...,40550.png,3,"[-0.015498985536396503, -0.02670714445412159, ..."
3,4,https://www.lego.com/nl-nl/product/woody-and-b...,10+,296,Woody & Bo Peep,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt2bea...,Zorg dat je twee favoriete filmpersonages alti...,40553.png,4,"[-0.03133436664938927, -0.02872316911816597, 0..."
4,5,https://www.lego.com/nl-nl/product/goofy-pluto...,10+,214,Goofy en Pluto,"€14,99",https://www.lego.com/cdn/cs/set/assets/blt4306...,Deze Goofy en Pluto set (40378) met 2 klassiek...,40378.jpg,5,"[-0.011258577927947044, -0.027819199487566948,..."


In [14]:
print(f"Number of dimensions in embedding '{len(df.loc[0, 'embedding'])}'")

Number of dimensions in embedding '1536'


## Use FAISS to play with the embeddings we have created
Searching for vectors is easy with a FAISS index. There are multiple types of indexes to choose from. Some of them need to be trained before use, others do not. You can check whether an index is ready with its _is_trained_ flag. The class name of an index combines the index type with the distance calculation method.

`faiss.IndexFlatL2(dimension)`: we create a _Flat_ index and use the _L2_ distance calculation.
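
For reference, a flat _L2_ index in FAISS reports the squared Euclidean distance, so no square root is taken. With $n$ the number of dimensions (1536 here), the distance between a query $q$ and a stored vector $x$ is:

$$
d(q, x) = \lVert q - x \rVert_2^2 = \sum_{i=1}^{n} (q_i - x_i)^2
$$

If the embeddings have unit length, which is an assumption worth checking with `np.linalg.norm` rather than taking for granted, this value maps directly to cosine similarity:

$$
\lVert q - x \rVert_2^2 = \lVert q \rVert_2^2 + \lVert x \rVert_2^2 - 2\, q \cdot x = 2 - 2\cos(q, x)
$$

Under that assumption a smaller distance always means a higher cosine similarity, so both measures rank the results in the same order.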

In [46]:
import faiss
import numpy as np

available_embeddings = df["embedding"].to_list()
dimension = len(available_embeddings[0])
index = faiss.IndexFlatL2(dimension)
print(f"The index Flat is trained: {index.is_trained}")

embeddings = np.array(available_embeddings).astype('float32')
index.add(embeddings)

print(f"The index now has {index.ntotal} documents or embeddings")

The index Flat is trained: True
The index now has 34 documents or embeddings


### Search using FAISS - Flat/L2
A flat index performs an exhaustive search: the query vector is compared with every vector stored in the index, and the closest ones are returned.
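
To make "exhaustive" concrete, here is a minimal pure-Python sketch of the same idea. It is not how FAISS is implemented, only an illustration of comparing the query with every stored vector. The function names and the toy 2-dimensional vectors are made up for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exhaustive_search(query, vectors, k=4):
    """Return the k (position, distance) pairs closest to the query."""
    # Score every stored vector: this is what makes the search exhaustive
    scored = [(pos, squared_l2(query, vec)) for pos, vec in enumerate(vectors)]
    # Smallest distance first; Python's sort is stable, so ties keep insertion order
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy vectors for illustration only, not real embeddings
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(exhaustive_search([1.0, 1.0], vectors, k=2))  # prints [(1, 1.0), (0, 2.0)]
```

The cost grows linearly with the number of stored vectors, which is fine for the 34 titles in this notebook. As with FAISS, the first element of each pair is the position at which the vector was added, not a value from the `id` column.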

In [51]:
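def search_for(_query: str, num_results: int = 4):
    # Embed the query and shape it as a (1, dimension) float32 matrix for FAISS
    query_embedding = create_embedding(_query)
    query_vector = np.array([query_embedding]).astype('float32')
    distances, indexes = index.search(query_vector, num_results)

    # FAISS returns row positions in insertion order. They match the default
    # DataFrame index, not the 'id' column (row 0 has id 1).
    output = []
    for position, distance in zip(indexes[0], distances[0]):
        item = {
            'id': position,
            'title': df.loc[position, 'title'],
            'distance': distance
        }
        output.append(item)

    return output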

def print_found_brickheadz(brickheads: list):
    for brickhead in brickheads:
        print(f"{brickhead['id']:2d} - {brickhead['distance']:1.5f} - {brickhead['title']}")
        

In [52]:
print_found_brickheadz(search_for("harry potter"))

17 - 0.25717 - Harry Potter™ en Cho Chang
 1 - 0.26233 - Harry, Hermelien, Ron & Hagrid™
15 - 0.35530 - Draco Malfidus™ en Carlo Kannewasser
 0 - 0.36626 - Leraren van Zweinstein™
