# Importing LEGO® BrickHeadz™ data into a vector store
This notebook uses data obtained from the LEGO® BrickHeadz™ website. We used the online service [browse.ai](https://browse.ai) to create a CSV file, which you can find in [extract-data-brickheadz.csv](./data/extract-data-brickheadz.csv).

The file contains a number of columns that we want to retain to improve the search options. Therefore, we first read and extend the data using Pandas. After that, we use [LangChain](https://langchain.com) together with [OpenAI](https://www.openai.com) to create the embeddings. We then upload the data with its embeddings to *Amazon OpenSearch Service*. To better control what happens in OpenSearch, we manage the index using templates.

First, we have a look at our data using the Pandas library.

In [3]:
import pandas as pd

df = pd.read_csv('./data/extract-data-brickheadz.csv')
df.head()

Unnamed: 0,Position,product_link,age,number_of_pieces,title,price,image_link,product_description
0,1,https://www.lego.com/nl-nl/product/professors-...,10+,601,Leraren van Zweinstein™,"€39,99",https://www.lego.com/cdn/cs/set/assets/blt8c72...,Dit is een betoverende verrassing voor fans va...
1,2,https://www.lego.com/nl-nl/product/harry-hermi...,10+,466,"Harry, Hermelien, Ron & Hagrid™","€24,99",https://www.lego.com/cdn/cs/set/assets/bltbc78...,LEGO® BrickHeadz™ versies van 4 van de bekends...
2,3,https://www.lego.com/nl-nl/product/chip-dale-4...,10+,226,Knabbel & Babbel,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt0bac...,Keer terug naar je jeugd met deze leuke LEGO® ...
3,4,https://www.lego.com/nl-nl/product/woody-and-b...,10+,296,Woody & Bo Peep,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt2bea...,Zorg dat je twee favoriete filmpersonages alti...
4,5,https://www.lego.com/nl-nl/product/goofy-pluto...,10+,214,Goofy en Pluto,"€14,99",https://www.lego.com/cdn/cs/set/assets/blt4306...,Deze Goofy en Pluto set (40378) met 2 klassiek...


We could show the image using one of the URLs, but I wanted to have the images available locally and not depend on the LEGO® website. Therefore, I wrote a small piece of code to download them.

In [4]:
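import os

import requests


def write_images_to_disk():
    # Make sure the target folder exists before writing any files
    os.makedirs("images", exist_ok=True)
    for _, row in df.iterrows():
        url = row.image_link
        # The image name is the last path segment of the URL
        image_name = url.rsplit('/', 1)[-1]
        response = requests.get(url, timeout=30)
        # Stop on HTTP errors instead of saving an error page as an image
        response.raise_for_status()

        with open(os.path.join("images", image_name), "wb") as f:
            f.write(response.content)

# Uncomment to download the images (only needed once)
# write_images_to_disk()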
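
The image name is taken from the last path segment of the URL. As a minimal sketch, assuming some image links could carry a query string (the example URL below is hypothetical and the helper name is ours), the same step can be done with the standard library so that anything after `?` is dropped:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def image_name_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return PurePosixPath(urlparse(url).path).name


# Hypothetical URL, only for illustration
print(image_name_from_url("https://example.com/assets/40560.png?width=800"))
```

For links without a query string this gives the same result as `url.rsplit('/', 1)[-1]`.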

To have the image name available later on as well, we create a new column in the DataFrame containing only the name of the image. We also need a column with the id later on, so we create that here as well.

In [6]:
df["image_name"] = df["image_link"].str.rsplit('/').str.get(-1)
df["id"] = df["Position"]

df.head()

Unnamed: 0,Position,product_link,age,number_of_pieces,title,price,image_link,product_description,image_name,id
0,1,https://www.lego.com/nl-nl/product/professors-...,10+,601,Leraren van Zweinstein™,"€39,99",https://www.lego.com/cdn/cs/set/assets/blt8c72...,Dit is een betoverende verrassing voor fans va...,40560.png,1
1,2,https://www.lego.com/nl-nl/product/harry-hermi...,10+,466,"Harry, Hermelien, Ron & Hagrid™","€24,99",https://www.lego.com/cdn/cs/set/assets/bltbc78...,LEGO® BrickHeadz™ versies van 4 van de bekends...,40495.jpg,2
2,3,https://www.lego.com/nl-nl/product/chip-dale-4...,10+,226,Knabbel & Babbel,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt0bac...,Keer terug naar je jeugd met deze leuke LEGO® ...,40550.png,3
3,4,https://www.lego.com/nl-nl/product/woody-and-b...,10+,296,Woody & Bo Peep,"€19,99",https://www.lego.com/cdn/cs/set/assets/blt2bea...,Zorg dat je twee favoriete filmpersonages alti...,40553.png,4
4,5,https://www.lego.com/nl-nl/product/goofy-pluto...,10+,214,Goofy en Pluto,"€14,99",https://www.lego.com/cdn/cs/set/assets/blt4306...,Deze Goofy en Pluto set (40378) met 2 klassiek...,40378.jpg,5


With our data available, we now prepare our Amazon OpenSearch Service cluster. We need to create the index that stores the product data. We like to use index templates to configure settings and mappings. Three components make our life easier when talking to OpenSearch:
- [OpenSearchTemplate](./retriever/opensearch_template.py): creates the template from the available components
- [OpenSearchClient](./retriever/opensearch.py): handles the communication with OpenSearch
- [find_auth_opensearch](./retriever/opensearch_auth_local.py): obtains the authorisation needed to connect to OpenSearch

In the next code block, we initialize the client and verify that the connection works.

In [7]:
from retriever import find_auth_opensearch, OpenSearchClient

config = find_auth_opensearch()
client = OpenSearchClient(config, alias_name="sg-products")

if client.ping():
    print("We have a connection to the Amazon OpenSearch Cluster")
else:
    print("ERROR: no connection to the Amazon OpenSearch Cluster")

We have a connection to the Amazon OpenSearch Cluster


With a working connection, we can insert or update the index template. You can find the required files here:
- [settings](./config_files/sg_product_component_settings.json)
- [mappings](./config_files/sg_product_component_mappings.json)

If you want to learn more about working with index templates in OpenSearch, you can read my blog post:
[jettro.dev](https://jettro.dev/using-index-templates-with-elasticsearch-and-opensearch-17f57f5410f)

In [8]:
from retriever import OpenSearchTemplate

template = OpenSearchTemplate(
    client=client,
    index_template_name="sg_product_index_template",
    component_name_settings="sg_product_component_settings",
    component_name_dyn_mappings="sg_product_component_dynamic_mappings",
    component_name_mappings="sg_product_component_mappings"
)

for result in template.create_update_template():
    print(result)

The version 2 of the component template sg_product_component_settings is up-to-date
The version 1 of the component template sg_product_component_dynamic_mappings is up-to-date
The version 1 of the component template sg_product_component_mappings is up-to-date
The version 3 of the index template is up-to-date
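
For readers who have not used composable index templates before, the sketch below shows roughly what an index template that combines the three component templates could look like as a request body. It is not the code inside `OpenSearchTemplate`; the helper name, the index pattern `sg-products-*` and the priority value are assumptions chosen for illustration.

```python
import json


def build_index_template_body(component_names, index_pattern, version, priority=100):
    """Build a request body for a composable index template.

    Uses the `index_patterns`, `composed_of`, `priority` and `version` keys
    of the OpenSearch `_index_template` API.
    """
    return {
        "index_patterns": [index_pattern],
        "composed_of": list(component_names),
        "priority": priority,
        "version": version,
    }


body = build_index_template_body(
    component_names=[
        "sg_product_component_settings",
        "sg_product_component_dynamic_mappings",
        "sg_product_component_mappings",
    ],
    index_pattern="sg-products-*",  # assumed pattern, not taken from the notebook
    version=3,
)
print(json.dumps(body, indent=2))
```

A body like this would be sent with `PUT _index_template/sg_product_index_template`; every new index whose name matches the pattern then picks up the settings and mappings from the listed components.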


Now we are ready to create a new index that makes use of the template we have just created.

In [9]:
index_name = client.create_index()
print(f"Index created with the name {index_name}")

client.switch_alias_to(index_name=index_name)

Index created with the name sg-products-20230824160843
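
The index name in the output looks like the alias followed by a timestamp. The sketch below shows one way such a name and the accompanying alias switch could be built. It is an assumption about what `create_index()` and `switch_alias_to()` do, not their actual implementation, and both helper names are hypothetical.

```python
from datetime import datetime


def build_index_name(alias_name: str, now: datetime) -> str:
    """Append a sortable timestamp (yyyymmddHHMMSS) to the alias name."""
    return f"{alias_name}-{now.strftime('%Y%m%d%H%M%S')}"


def build_alias_switch_body(alias_name: str, new_index: str, old_index=None) -> dict:
    """Build an `_aliases` request body that moves the alias in a single call."""
    actions = []
    if old_index is not None:
        # Detach the alias from the index that currently serves searches
        actions.append({"remove": {"index": old_index, "alias": alias_name}})
    # Point the alias at the newly created index
    actions.append({"add": {"index": new_index, "alias": alias_name}})
    return {"actions": actions}


new_index = build_index_name("sg-products", datetime(2023, 8, 24, 16, 8, 43))
print(new_index)
print(build_alias_switch_body("sg-products", new_index))
```

Because searches go through the alias, a new index can be filled and checked first, and the alias is only switched once the index is ready.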


Next up is initializing LangChain. We create LangChain's vector store wrapper, which needs an embedder to create the vector embeddings. We use OpenAI to create the embeddings, so we need an environment variable with the API key to connect to OpenAI. The environment variables are loaded using the _load_dotenv()_ function. The vector store wrapper also needs the URL of the OpenSearch cluster and the authentication object. Both come from the _find_auth_opensearch()_ function that we called a few code blocks earlier.

Sometimes we get a message that the session has expired. In that case, we need to recreate the OpenSearch client by rerunning the code block above that creates it.

In [10]:
import os

from langchain.vectorstores import OpenSearchVectorSearch
from langchain.embeddings import OpenAIEmbeddings
from opensearchpy import RequestsHttpConnection
from dotenv import load_dotenv

load_dotenv()

vector_store = OpenSearchVectorSearch(
    index_name=index_name,
    embedding_function=OpenAIEmbeddings(openai_api_key=os.getenv('OPEN_AI_API_KEY')),
    opensearch_url=f"https://{config['host']}:{config['port']}",
    use_ssl=True,
    verify_certs=True,
    http_auth=config["auth"],
    connection_class=RequestsHttpConnection
)
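
When the environment variable is missing, `os.getenv` returns `None`, which can lead to a less obvious error further on. A small guard, sketched here with a hypothetical helper name, makes the problem visible right after `load_dotenv()`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return value
```

In the cell above, this could be used as `openai_api_key=require_env('OPEN_AI_API_KEY')`.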

With the vector store in place, we can start indexing documents. You can use the kwargs argument to configure some of the engine-specific aspects. Two examples that we use are:
- text_field: Name of the field to store the text in
- vector_field: Name of the field to store the vector in


In [11]:
response = vector_store.add_texts(
    texts=df["title"].to_list(),
    metadatas=df.to_dict('records'),
    ids=df["id"].to_list(),
    text_field="title",
    vector_field="title_vector"
)

print(f"Inserted {len(response)} documents")

Inserted 34 documents
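
To make the role of `text_field` and `vector_field` concrete, the sketch below assembles documents in the shape that, as far as we can tell, the LangChain wrapper stores them: the text and the vector under the configured field names, and the remaining columns under a `metadata` object. Treat it as an illustration to verify against your LangChain version. The `fake_embed` function is a stand-in for the OpenAI call, `build_documents` is a hypothetical helper, and the sample rows are shortened.

```python
def fake_embed(text):
    """Stand-in for a real embedding model: NOT a meaningful vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def build_documents(texts, metadatas, ids, text_field, vector_field):
    """Pair each text with its metadata and id, one document per row."""
    if not (len(texts) == len(metadatas) == len(ids)):
        raise ValueError("texts, metadatas and ids must have the same length")
    documents = {}
    for text, metadata, doc_id in zip(texts, metadatas, ids):
        documents[doc_id] = {
            text_field: text,                # searched by the lexical query
            vector_field: fake_embed(text),  # searched by the vector query
            "metadata": metadata,            # all other columns of the row
        }
    return documents


rows = [
    {"id": 1, "title": "Leraren van Zweinstein™", "number_of_pieces": 601},
    {"id": 2, "title": "Harry, Hermelien, Ron & Hagrid™", "number_of_pieces": 466},
]
docs = build_documents(
    texts=[row["title"] for row in rows],
    metadatas=rows,
    ids=[row["id"] for row in rows],
    text_field="title",
    vector_field="title_vector",
)
print(docs[1])
```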


Now we can open the OpenSearch Dashboard and use the Developer Console to look at the indexed data. You can find the console using the following link: [https://search-cdk-os-sg-es-ccgfabzzmjayaqn7vmsb4kvnvu.eu-west-1.es.amazonaws.com/_dashboards/app/dev_tools#/console](https://search-cdk-os-sg-es-ccgfabzzmjayaqn7vmsb4kvnvu.eu-west-1.es.amazonaws.com/_dashboards/app/dev_tools#/console)

The next code block shows how to use semantic search and how its results differ from those of the lexical search. Both queries only use the title field to search for BrickHeadz.

In [22]:
query = "harry potter"

body = {"query": {"match": {"title": query}}}
search_results = client.search(body=body, explain=False)
print("\nResults from: OpenSearch using the lexical/index based search.")
for hit in search_results["hits"]["hits"]:
    print(f"{hit['_score']} - {hit['_source']['title']}")

found_docs = vector_store.similarity_search_with_score(query=query, text_field="title", vector_field="title_vector")
print("\nResults from: OpenSearch using the vector store functionality.")
for doc, _score in found_docs:
    print(f"{_score} - {doc.page_content}")


Results from: OpenSearch using the lexical/index based search.
4.6576157 - Harry Potter™ en Cho Chang
2.326733 - Harry, Hermelien, Ron & Hagrid™

Results from: OpenSearch using the vector store functionality.
0.7953482 - Harry Potter™ en Cho Chang
0.79218847 - Harry, Hermelien, Ron & Hagrid™
0.73784524 - Draco Malfidus™ en Carlo Kannewasser
0.7319271 - Leraren van Zweinstein™
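
The lexical query only returns titles that share a token with the query, while the vector search ranks every title by how close its embedding is to the query embedding. The toy example below mimics that difference with hand-made two-dimensional vectors. The vectors are invented for illustration and have nothing to do with the real OpenAI embeddings or with the way OpenSearch computes its scores.

```python
import math


def token_overlap(query, title):
    """Count query tokens that literally occur in the title (a crude lexical match)."""
    title_tokens = set(title.lower().replace(",", " ").split())
    return sum(1 for token in query.lower().split() if token in title_tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors: first axis ~ "wizarding world", second axis ~ "cartoon animals"
toy_vectors = {
    "Harry Potter™ en Cho Chang": [0.9, 0.1],
    "Draco Malfidus™ en Carlo Kannewasser": [0.8, 0.2],
    "Goofy en Pluto": [0.1, 0.9],
}
query = "harry potter"
query_vector = [1.0, 0.0]  # invented as well

for title, vector in toy_vectors.items():
    overlap = token_overlap(query, title)
    similarity = cosine_similarity(query_vector, vector)
    print(f"{overlap} shared tokens, cosine {similarity:.2f} - {title}")
```

A title such as the Draco Malfidus™ set shares no token with the query, so a lexical match cannot find it, yet its vector can still lie close to the query vector. That is the pattern visible in the real results above.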
