# Import content

Start by importing content. The content is used to create embeddings, which are then stored in a vector store.

At the moment, there are three files, each containing short blocks of text. The following files are available:
- help-account.txt
- help-search.txt
- help-sustainability.txt

After importing the texts, we split them into chunks with a Langchain text splitter. The splitter breaks strings on a prioritized list of separators (by default paragraph breaks, then line breaks, then spaces). It then merges the resulting pieces back together, as long as the result stays within the chunk size.
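To make the split-then-merge idea concrete, here is a minimal sketch in plain Python. It is not Langchain's implementation: the function name `split_text` is hypothetical, the separator list is an assumption, and chunk overlap is left out to keep the example short.

```python
def split_text(text: str, chunk_size: int = 300,
               separators=("\n\n", "\n", " ")) -> list[str]:
    """Simplified sketch of recursive splitting (no overlap, not Langchain's code)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(split_text(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = "para one is short.\n\npara two is also short.\n\n" + "word " * 30
demo_chunks = split_text(sample, chunk_size=50)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

The two short paragraphs fit into one chunk together, while the long run of words is broken up on spaces. The real splitter additionally repeats the tail of one chunk at the start of the next when `chunk_overlap` is set.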

In the end, all chunks are stored in a list called _available_texts_. This list is the input for the OpenSearch-based vector store.


In [54]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_up_file(file_name: str):
    """Returns a list of Langchain documents, each containing a chunk of the text in the provided file."""
    # Read the complete file into memory; the help files are small.
    with open(file_name, encoding="utf-8") as split_file:
        file_content = split_file.read()

    # Chunks of at most 300 characters, with 100 characters of overlap between neighbours.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        is_separator_regex=False,
    )

    return text_splitter.create_documents([file_content])


In [55]:
available_texts = (split_up_file('./help-account.txt')
                   + split_up_file('./help-search.txt')
                   + split_up_file('./help-sustainability.txt'))

print(f"Number of found chunks in the three files is {len(available_texts)}")

Number of found chunks in the three files is 34


## Initialise the client to manage the index templates

In [59]:
from retriever import find_auth_opensearch, OpenSearchClient

config = find_auth_opensearch()
client = OpenSearchClient(config, alias_name="sg-content")

if client.ping():
    print("We have a connection to the Amazon OpenSearch Cluster")
else:
    print("ERROR: no connection to the Amazon OpenSearch Cluster")

We have a connection to the Amazon OpenSearch Cluster


In [39]:

from retriever import OpenSearchTemplate

template = OpenSearchTemplate(
    client=client,
    index_template_name="sg_content_index_template",
    component_name_settings="sg_content_component_settings",
    component_name_dyn_mappings="sg_content_component_dynamic_mappings",
    component_name_mappings="sg_content_component_mappings"
)

for result in template.create_update_template():
    print(result)

Update the component template sg_content_component_settings to version 2.
Update the component template sg_content_component_dynamic_mappings to version 1.
Update the component template sg_content_component_mappings to version 1.
Update the template to version 3.


# Loading content
Loading content is a tricky beast. Most examples make it look easy: you take a document, do some chunking, create embeddings, store the embeddings in a vector store, and run a similarity search. Coming from an extensive search background, we know that many factors determine whether results are relevant, and that holds for semantic search as well. Often you want more structure in your content.
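The naive pipeline described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the in-memory `store` list stands in for the vector store. A toy embedding like this only matches shared words, which is exactly the gap a real embedding model is meant to close.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# 1. Chunk (already done here), 2. embed, 3. store.
chunks = [
    "You can narrow down results using filters",
    "Reset your password from the account page",
    "We ship products in recyclable packaging",
]
store = [(chunk, embed(chunk)) for chunk in chunks]


def search(query: str, k: int = 2):
    """4. Similarity search: rank the stored chunks against the query vector."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [(round(cosine(query_vector, vector), 3), chunk)
            for chunk, vector in ranked[:k]]


for score, chunk in search("Do you support filters"):
    print(score, "-", chunk)
```

Everything that makes search relevant in practice (field structure, metadata, filtering, analyzers) is missing from this picture, which is the reason for taking more control over indexing.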

To have more control, we are indexing the documents ourselves. We do use some Langchain components to make our life easier.


In [46]:
index_name = client.create_index()
print(f"Index created with the name {index_name}")

client.switch_alias_to(index_name=index_name)

Index created with the name sg-content-20230903012633


First, we use Langchain to index some of the help content.

In [60]:
import os

from langchain.vectorstores import OpenSearchVectorSearch
from langchain.embeddings import OpenAIEmbeddings
from opensearchpy import RequestsHttpConnection
from dotenv import load_dotenv

load_dotenv()

vector_store = OpenSearchVectorSearch(
    index_name=index_name,
    embedding_function=OpenAIEmbeddings(openai_api_key=os.getenv('OPEN_AI_API_KEY')),
    opensearch_url=f"https://{config['host']}:{config['port']}",
    use_ssl=True,
    verify_certs=True,
    http_auth=config["auth"],
    connection_class=RequestsHttpConnection
)


With the vector store in place, we can start indexing documents. You can pass keyword arguments to configure some of the engine-specific aspects:
- text_field: Name of the field to store the text in
- vector_field: Name of the field to store the vector in
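To see what these two options control, the sketch below builds the kind of document that ends up in the index. The helper `build_document` is hypothetical and only approximates what the vector store sends to OpenSearch. The default names `text` and `vector_field`, and the custom names `content` and `content_vector`, are assumptions for illustration, so check them against your Langchain version and index mappings.

```python
def build_document(text, vector, metadata=None,
                   text_field="text", vector_field="vector_field"):
    """Hypothetical helper: shape of one indexed document (an approximation)."""
    return {
        vector_field: vector,        # the embedding of the chunk
        text_field: text,            # the original chunk text
        "metadata": metadata or {},  # metadata of the Langchain document
    }


default_doc = build_document("How do I reset my password?", [0.1, 0.2, 0.3])
custom_doc = build_document(
    "How do I reset my password?",
    [0.1, 0.2, 0.3],
    text_field="content",            # assumed custom field name
    vector_field="content_vector",   # assumed custom field name
)
print(sorted(default_doc))
print(sorted(custom_doc))

# With the real vector store, the same names would be passed as kwargs, e.g.:
# vector_store.add_documents(documents=available_texts,
#                            text_field="content",
#                            vector_field="content_vector")
```

Whatever names you choose have to match the field mappings in the index template; otherwise the vector is not indexed as a vector field.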


In [48]:

vector_store.add_documents(documents=available_texts)


['69314e26-f5e3-498f-a337-a8121cdc9a51',
 '884b4061-3ddd-4b67-b026-a852c0d858fe',
 '320f109c-4c57-49a2-86f0-4cda952a988e',
 '8f16b3e7-a29e-4b89-b398-9092e6cb3c45',
 'a841ad7f-d046-4e40-851e-c52e5b667161',
 'b5e925bc-fc1f-46cc-bfe7-d7939a0266bd',
 'e4c2d1ec-e07b-4a01-bcc9-fa94437e083a',
 '206a4079-1676-4b65-937e-f6171e15dd6d',
 'ebbea6dd-c62d-4527-92fe-63ec77d2e6d1',
 '92f6cc1c-b16d-4484-a236-db08ce434df1',
 '5c593658-1417-4042-bea2-43e9cb08f944',
 '0f52dc6e-a296-47a8-ae0a-3794fca3e8dd',
 '3078bd3e-8a36-4202-81ec-c8aa9ffdd203',
 'e3a15d0e-cb60-4e66-a735-e087d1c45b15',
 '765fb3d0-6d89-4198-8d69-2354f466ae1b',
 '43dc8dff-8dc3-451a-8bd6-6238b784af73',
 '81106da5-25b2-445a-8b1d-c51a66217fe4',
 'd00a1ebc-14f5-4cd0-9c6a-386bd44031b5',
 '0fbebeac-1435-4ac8-860c-24b6edd404e4',
 'e767ecfe-916f-4135-882b-e386508f604c',
 '07e95d1d-9d50-4ae0-87fb-66f475225dd3',
 '99b589ea-425c-4160-8221-ad0db9725bb7',
 'd34a5c28-2dda-41ed-8bb5-c9837a0942a3',
 '4bc10745-0fb5-47c1-96ab-a15065dd62a1',
 'd299e5ad-5cad-

# Search for answers

Now we query the vector store to see whether it works better than lexical search.

In [61]:
found_docs = vector_store.similarity_search_with_score(query="Do you support filters on your website?")
print("\nResults from: OpenSearch")
# Each result is a (document, score) tuple.
for doc, score in found_docs:
    print("---")
    print(f"{score} - {doc.page_content}")
    print("---")


Results from: OpenSearch
---
0.70916265 - Upon conducting a search, a list of products will be displayed. You can further narrow down these results using filters located on the side of the page. Filters include options like category, price range, brand, size, and color.
---
---
0.69707394 - For organized viewing, sort the search results by criteria like relevance, price, or popularity using the sorting dropdown menu near the top of the page.
---
---
0.68750495 - Remember that some platforms allow you to save your search criteria for future reference, streamlining your shopping experience.

Utilize these steps to effortlessly navigate our ecommerce website and successfully discover the products you're searching for.
---
---
0.6831702 - As you type in the search bar, you may notice search suggestions that can expedite the process.

For more specific searches, consider using the advanced search option if available. This enables refining searches using specialized criteria.
---


In [62]:
from langchain.chains import RetrievalQA
from langchain import PromptTemplate, OpenAI

prompt_template = """Use the context to answer the question. If you don't know the 
    answer, just say that you don't know, don't make up an answer.

    {context}

    Question: {question}:"""

custom_prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": custom_prompt}
chain = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=os.getenv('OPEN_AI_API_KEY')),
    chain_type="stuff",
    retriever=vector_store.as_retriever(),
    chain_type_kwargs=chain_type_kwargs
)

In [64]:
print(chain.run("What are the opening times for the store in Amsterdam"))



Answer: I don't know.
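With `chain_type="stuff"`, all retrieved chunks are placed together in the `{context}` slot of the prompt before it is sent to the LLM. The sketch below mimics that step in plain Python. The helper `build_stuff_prompt` and the blank-line separator are assumptions for illustration, and the two chunks are examples taken from the search output above, not necessarily the chunks retrieved for this question.

```python
prompt_template = """Use the context to answer the question. If you don't know the
answer, just say that you don't know, don't make up an answer.

{context}

Question: {question}:"""


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Hypothetical helper: join the chunks and fill in the prompt template."""
    context = separator.join(chunks)
    return prompt_template.format(context=context, question=question)


# Illustrative chunks only; the real chain uses whatever the retriever returns.
retrieved = [
    "For organized viewing, sort the search results by criteria like relevance, "
    "price, or popularity using the sorting dropdown menu near the top of the page.",
    "As you type in the search bar, you may notice search suggestions that can "
    "expedite the process.",
]
prompt = build_stuff_prompt(
    retrieved, "What are the opening times for the store in Amsterdam"
)
print(prompt)
```

The retriever always returns its nearest chunks, even when none of them is relevant. The three help files cover accounts, search, and sustainability, so presumably none of the context mentions opening times, and the instruction in the prompt makes the model answer that it does not know rather than invent an answer.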
