# Module 2: Keyword Search with Amazon OpenSearch 

In this module, we are going to perform a simple search in OpenSearch by matching the individual words in our search query. We will:
1. Load data into OpenSearch from the Amazon Product Question and Answer (PQA) dataset. This dataset contains a list of common questions and answers related to products.
2. Query the data using a simple keyword query to find potentially matching questions. We will search the PQA dataset for questions similar to our sample question "does this work with xbox?". We expect to find matches in the dataset based on individual words such as "xbox" and "work".

In subsequent modules, we will demonstrate how to use semantic search to improve the relevance of the query results.

### 1. Install required libraries

Before we begin, we need to install some required libraries.

In [None]:
!pip install -q boto3
!pip install -q requests
!pip install -q requests-aws4auth
!pip install -q opensearch-py
!pip install -q tqdm


### 2. Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store them in "outputs" to be used later.

#### Note: The following code refers to the stack by name. If you didn't use the default stack name, please update the value of "cloudformation_stack_name" to the CloudFormation stack name you specified when you provisioned your environment.

In [None]:
import boto3

cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "static-cloudformation-semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

bucket = outputs['s3BucketTraining']
aos_host = outputs['DomainEndpoint']

outputs
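
For reference, `describe_stacks` returns each stack output as an `OutputKey`/`OutputValue` pair, which `get_cfn_outputs` flattens into a dictionary. The sketch below runs the same flattening on a hand-written response, so it works without AWS credentials. The bucket name and endpoint are placeholders rather than real values, and `flatten_outputs` is a hypothetical name used only for this illustration.

```python
# Hand-written stand-in for a describe_stacks response; the values are
# placeholders for illustration, not real resources.
sample_response = {
    "Stacks": [
        {
            "StackName": "static-cloudformation-semantic-search",
            "Outputs": [
                {"OutputKey": "s3BucketTraining",
                 "OutputValue": "example-training-bucket"},
                {"OutputKey": "DomainEndpoint",
                 "OutputValue": "search-example.us-east-1.es.amazonaws.com"},
            ],
        }
    ]
}


def flatten_outputs(response):
    """Turn the list of OutputKey/OutputValue pairs into a plain dict."""
    stack = response["Stacks"][0]
    # The "Outputs" key may be missing for a stack without outputs,
    # so fall back to an empty list instead of raising KeyError.
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


sample_outputs = flatten_outputs(sample_response)
print(sample_outputs["DomainEndpoint"])
```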

### 3. Copy the data set locally
Before we can run any queries, we need to download the Amazon Product Question and Answer data from: https://registry.opendata.aws/amazon-pqa/

Let's start by having a look at all the files in the dataset.

In [None]:
!aws s3 ls --no-sign-request s3://amazon-pqa/

There's a lot of data here, so for the purposes of this demo, we'll focus on just the headset data. Let's copy the amazon_pqa_headsets.json file locally.

In [None]:
!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json

### 4. Create an OpenSearch cluster connection
Next, we'll use the opensearch-py Python client to set up a connection to the OpenSearch cluster.

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Region of the OpenSearch domain; update this if your stack is deployed elsewhere.
region = 'us-east-1'

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region)

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
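
The region above is hard-coded to `us-east-1`. If your domain lives elsewhere, the signer needs that region instead. One option, sketched below under the assumption that the endpoint follows the usual `<domain>.<region>.es.amazonaws.com` pattern, is to read the region from the hostname. `region_from_endpoint` is a hypothetical helper, not part of opensearch-py, and the hostname in the example is made up.

```python
def region_from_endpoint(host):
    """Return the region label from an endpoint such as
    'search-demo-abc123.us-west-2.es.amazonaws.com'.

    Assumes the '<domain>.<region>.es.amazonaws.com' layout and raises
    ValueError for anything else rather than guessing.
    """
    parts = host.split(".")
    if len(parts) < 5 or parts[-3:] != ["es", "amazonaws", "com"]:
        raise ValueError(f"unrecognized endpoint format: {host}")
    return parts[-4]


print(region_from_endpoint("search-demo-abc123.us-west-2.es.amazonaws.com"))
```

You could then pass `region_from_endpoint(aos_host)` to `AWSV4SignerAuth` in place of the literal region string.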

### 5. Create an index in OpenSearch
We define an index with two fields: "question" holds the raw question text and "answer" holds the raw answer text.

To create the index, we first define the index in JSON, then use the aos_client connection we initiated earlier to create the index in OpenSearch.

In [None]:
keyword_index = {
    "settings": {
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "question": {
                "type": "text",
                "store": True
            },
            "answer": {
                "type": "text",
                "store": True
            }
        }
    }
}
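
To get a feel for what this analyzer does to text, here is a rough pure-Python approximation. It lowercases the input, splits it into word tokens, and drops English stop words. This is only a sketch: the real `standard` analyzer uses Unicode text segmentation rather than a regular expression, and the stop list below is an assumed copy of the default English list, so the `_analyze` API on your domain remains the authoritative source for the actual tokens. `approximate_analyze` is a hypothetical name.

```python
import re

# Assumed copy of the default English stop word list; treat as approximate.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def approximate_analyze(text):
    """Lowercase, split into word tokens, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]


print(approximate_analyze("does this work with xbox?"))
# ['does', 'work', 'xbox']
```

Under these assumptions only `does`, `work`, and `xbox` survive, so these are the terms a keyword query for our sample question can match on.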


If for any reason you need to recreate your dataset, you can uncomment and execute the following to delete any previously created indexes. If this is the first time you're running this, you can skip this step.

In [None]:
# aos_client.indices.delete(index="keyword_pqa")


In [None]:
aos_client.indices.create(index="keyword_pqa",body=keyword_index,ignore=400)


Let's verify that the index was created and inspect its settings and mappings.

In [None]:
aos_client.indices.get(index="keyword_pqa")

### 6. Load the raw data into the index
Next, let's load the headset PQA data we copied locally into the index we've just created.

In [None]:
import json
from tqdm.contrib.concurrent import process_map
from multiprocessing import cpu_count


def load_pqa_as_json(file_name,number_rows=1000):
    result=[]
    with open(file_name) as f:
        i=0
        for line in f:
            data = json.loads(line)
            result.append(data)
            i+=1
            if(i == number_rows):
                break
    return result


qa_list_json = load_pqa_as_json('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)


def es_import(question):
    aos_client.index(index='keyword_pqa',
             body={"question": question["question_text"],"answer":question["answers"][0]["answer_text"]}
            )
        
workers = 4 * cpu_count()
    
process_map(es_import, qa_list_json, max_workers=workers,chunksize=1000)
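
Indexing one document per request works for 1,000 rows but gets slow for larger loads. A common alternative is to batch documents through a bulk helper. The sketch below only builds the action dictionaries, so it runs without a cluster. `build_bulk_actions` is a hypothetical helper, the sample records are made up, and the final `helpers.bulk` call is left as a comment on the assumption that your installed opensearch-py version provides it.

```python
def build_bulk_actions(records, index_name="keyword_pqa"):
    """Yield one bulk action per record, skipping records with no answers."""
    for record in records:
        answers = record.get("answers") or []
        if not answers:
            # Indexing answers[0] directly would raise IndexError here.
            continue
        yield {
            "_index": index_name,
            "_source": {
                "question": record["question_text"],
                "answer": answers[0]["answer_text"],
            },
        }


# Made-up records that mirror the fields the notebook reads from the dataset.
sample_records = [
    {"question_text": "does this work with xbox?",
     "answers": [{"answer_text": "Yes, with an adapter."}]},
    {"question_text": "is there a mute button?", "answers": []},
]

actions = list(build_bulk_actions(sample_records))
print(len(actions))  # 1

# With a live connection, the actions could be sent in batches, e.g.:
# from opensearchpy import helpers
# helpers.bulk(aos_client, build_bulk_actions(qa_list_json))
```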

To validate the load, we'll query the number of documents in the index. We should get 1000 hits.

In [None]:
res = aos_client.search(index="keyword_pqa", body={"query": {"match_all": {}}})
print("Got %d Hits" % res['hits']['total']['value'])
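
Note that the total is reported in `hits.total.value`, while the `hits.hits` list holds only the first page of documents (10 by default). The sketch below reads both from a hand-written response whose shape mirrors a search result; the IDs, scores, and counts are placeholders.

```python
# Hand-written stand-in for a search response; values are placeholders.
sample_search_response = {
    "hits": {
        "total": {"value": 1000, "relation": "eq"},
        "hits": [{"_id": f"doc-{i}", "_score": 1.0} for i in range(10)],
    }
}

total = sample_search_response["hits"]["total"]["value"]
returned = len(sample_search_response["hits"]["hits"])
print("Got %d hits, %d returned in this page" % (total, returned))
```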

### 7. Run a "Keyword Search" in OpenSearch

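The following executes a simple keyword search in OpenSearch. A "match" query analyzes the query text into terms and returns questions that contain one or more of them: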

In [None]:
import pandas as pd

query={
    "size": 50,
    "query": {
        "match": {
            "question":"does this work with xbox?"
        }
    }
}

res = aos_client.search(index="keyword_pqa", 
                       body=query,
                       stored_fields=["question","answer"])
#print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
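
When `stored_fields` is requested, each hit carries a `fields` object rather than `_source` by default, and every field value comes back as a list. That is why the loop above indexes with `[0]`. A small sketch on a hand-written hit follows; the ID, score, and answer text are made up, and `hit_to_row` is a hypothetical helper.

```python
# Hand-written stand-in for res['hits']['hits']; all values are made up.
sample_hits = [
    {
        "_id": "abc123",
        "_score": 7.1,
        "fields": {
            "question": ["Does it work on PS3?"],
            "answer": ["Yes it does."],
        },
    },
]


def hit_to_row(hit):
    """Flatten one hit into [_id, _score, question, answer]."""
    fields = hit["fields"]
    # Stored field values are returned as lists, so take the first element.
    return [hit["_id"], hit["_score"], fields["question"][0], fields["answer"][0]]


rows = [hit_to_row(h) for h in sample_hits]
print(rows[0])
```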


Congratulations, you've now executed a simple keyword search on the data in OpenSearch.

If you take a look at the results above, you'll notice that they match one or more of the keywords from our question, most commonly the words "work" and "xbox". You'll also notice that a lot of these results aren't relevant to our original question, such as "Does it work on PS3?" and "Does it work for computers". In Module 3, we'll instead use semantic search to make the results more relevant.
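
To see why such results rank at all, consider a toy scorer that counts how many query terms each question shares. This is a deliberately simplified stand-in, not the relevance scoring OpenSearch actually applies. The tokenizer, the function names, and the third candidate question are assumptions for illustration.

```python
import re


def terms(text):
    """Lowercase word tokens as a set (no stop word removal, for simplicity)."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query, candidate):
    """Count the distinct terms shared by the query and the candidate."""
    return len(terms(query) & terms(candidate))


query_text = "does this work with xbox?"
candidates = [
    "Does it work on PS3?",
    "Does it work for computers",
    "Is this compatible with the Xbox One?",  # made-up example
]

for candidate in candidates:
    print(overlap_score(query_text, candidate), candidate)
```

The PS3 and computer questions each share two terms with the query (`does` and `work`) even though neither mentions an Xbox, which is the kind of mismatch semantic search aims to address.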