# About

This notebook uses Elasticsearch to show how far simple text retrieval can get us.

[Elasticsearch](https://www.elastic.co/elasticsearch) is a search engine that supports vector search as well as full-text search ranked with BM25.
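
For intuition about what BM25 ranking does, here is a minimal pure-Python sketch of the textbook formula. It is not Elasticsearch's implementation: the whitespace tokenization and the three toy titles are assumptions made for illustration, and `k1=1.2`, `b=0.75` are the commonly cited defaults.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this sketch).
titles = [
    "adidas running shoes men",
    "nike running shoes women",
    "digital alarm clock with usb port",
]
docs = [title.split() for title in titles]
scores = bm25_scores("adidas shoes men".split(), docs)
best_title = titles[scores.index(max(scores))]
```

Rare terms such as `adidas` contribute more than common ones such as `shoes`, and a document that shares no term with the query scores zero.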

## Import

In [1]:
import pandas as pd
import nltk.corpus
import nltk
import nltk.stem
import rich
import re
import code.helper
import importlib
import tqdm.auto
import code.helper_es
import json

In [2]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

In [3]:
import ipywidgets as widgets
from ipywidgets import interact

In [4]:
from IPython.display import HTML, Image, JSON

## Setup

Check that Elasticsearch is running.

In [5]:
%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "nxb1gv6dry",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "XRuFoetgQ-a5r2oh6zfbxQ",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
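
The same check can be done from Python. The sketch below uses only the standard library; the helper names `fetch_cluster_info` and `summarize_cluster_info` are hypothetical, and the URL assumes the same local instance that the `curl` call above targets.

```python
import json
import urllib.request


def fetch_cluster_info(url="http://localhost:9200/", timeout=5):
    """Return the root endpoint's JSON payload as a dict (needs a running server)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_cluster_info(info):
    """Keep only the fields that matter when checking client compatibility."""
    version = info["version"]
    return {
        "cluster_name": info["cluster_name"],
        "es_version": version["number"],
        "lucene_version": version["lucene_version"],
    }


# Payload abbreviated from the curl output above, so no server is needed here.
sample = {
    "cluster_name": "elasticsearch",
    "version": {"number": "7.9.2", "lucene_version": "8.6.2"},
}
summary = summarize_cluster_info(sample)
```

With a server running, `summarize_cluster_info(fetch_cluster_info())` should produce the same kind of summary from the live payload.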


## Product Data

The original data contains products, queries, and a relevance rating for each query-product pair.

Here we keep only the products.

In [6]:
df = pd.read_parquet("../data/cleaned_input.parquet")

In [7]:
df.iloc[0].to_dict()

{'query': ' revent 80 cfm',
 'product_id': 'B000MOO21W',
 'relevance_label': 'Irrelevant',
 'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
 'product_description': None,
 'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
 'product_brand': 'Panasonic',
 'product_color': 'White',
 'url_product': 'https://www.amazo

In [8]:
df_product = df[['product_id','product_title','url_product','url_image']].drop_duplicates(['product_id']).reset_index(drop=True)
df_product

Unnamed: 0,product_id,product_title,url_product,url_image
0,B000MOO21W,Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil...,https://www.amazon.com/dp/B000MOO21W,http://images.amazon.com/images/P/B000MOO21W.0...
1,B07X3Y6B1V,Homewerks 7141-80 Bathroom Fan Integrated LED ...,https://www.amazon.com/dp/B07X3Y6B1V,http://images.amazon.com/images/P/B07X3Y6B1V.0...
2,B07WDM7MQQ,Homewerks 7140-80 Bathroom Fan Ceiling Mount E...,https://www.amazon.com/dp/B07WDM7MQQ,http://images.amazon.com/images/P/B07WDM7MQQ.0...
3,B07RH6Z8KW,Delta Electronics RAD80L BreezRadiance 80 CFM ...,https://www.amazon.com/dp/B07RH6Z8KW,http://images.amazon.com/images/P/B07RH6Z8KW.0...
4,B07QJ7WYFQ,Panasonic FV-08VRE2 Ventilation Fan with Reces...,https://www.amazon.com/dp/B07QJ7WYFQ,http://images.amazon.com/images/P/B07QJ7WYFQ.0...
...,...,...,...,...
1207284,0323377092,Neurology Self-Assessment: A Companion to Brad...,https://www.amazon.com/dp/0323377092,http://images.amazon.com/images/P/0323377092.0...
1207285,0323287832,"Bradley's Neurology in Clinical Practice, 2-Vo...",https://www.amazon.com/dp/0323287832,http://images.amazon.com/images/P/0323287832.0...
1207286,0198566344,Autonomic Failure: A Textbook of Clinical Diso...,https://www.amazon.com/dp/0198566344,http://images.amazon.com/images/P/0198566344.0...
1207287,0071845836,The Hospital Neurology Book,https://www.amazon.com/dp/0071845836,http://images.amazon.com/images/P/0071845836.0...


## Indexing

A simple wrapper around Elasticsearch.

In [9]:
client_search = code.helper_es.SearchClient()

In [None]:
??client_search

In [None]:
# client_search.index_documents(df_product)
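
The source of `index_documents` is not shown here, so the following is only a guess at what such a wrapper might do with `streaming_bulk`, which this notebook imports. The index name `products` and both helper names are assumptions.

```python
def generate_actions(records, index_name="products", id_field="product_id"):
    """Yield one bulk-index action per product record (a plain dict)."""
    for record in records:
        yield {
            "_index": index_name,      # hypothetical index name
            "_id": record[id_field],   # reuse the product id as the document id
            "_source": record,
        }


def index_records(client, records, index_name="products"):
    """Send the actions in chunks and return how many documents were indexed."""
    # Imported lazily so the action generator can be used without the client installed.
    from elasticsearch.helpers import streaming_bulk

    n_ok = 0
    actions = generate_actions(records, index_name)
    for ok, _item in streaming_bulk(client, actions, chunk_size=500, raise_on_error=False):
        n_ok += int(ok)
    return n_ok


# Two rows taken from the product table above, reduced to two columns.
rows = [
    {"product_id": "B000MOO21W", "product_title": "Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan"},
    {"product_id": "0071845836", "product_title": "The Hospital Neurology Book"},
]
actions = list(generate_actions(rows))
```

Against a live cluster this could be called as `index_records(Elasticsearch("http://localhost:9200"), df_product.to_dict("records"))`.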

## Analyzers

In [9]:
query = "adidas shoes men 15"

The same query can be tokenized differently depending on the analyzer.

In [10]:
rich.print(client_search.analyze(query, analyzer="stop"))

In [11]:
rich.print(client_search.analyze(query, analyzer="english"))

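With Elasticsearch's built-in analyzers, `stop` splits on non-letter characters and so drops the `15` token, while `english` keeps `15` and stems the remaining words.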
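
To see how a numeric token can disappear during analysis, here is a rough pure-Python imitation of two tokenization strategies. It is an approximation, not Elasticsearch code: Elasticsearch documents its built-in `stop` analyzer as splitting on non-letter characters, which the first function mimics, while the second keeps digits the way a word-based tokenizer would. Stemming is left out, and the wrapper used in this notebook may be configured differently.

```python
import re

# Default English stop words of Lucene's built-in stop filter.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def letter_tokens(text):
    """Split on anything that is not a letter, then lowercase (digits vanish)."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def word_tokens(text):
    """Split on anything that is not a letter or digit, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def drop_stop_words(tokens):
    """Remove stop words, as a stop filter would."""
    return [token for token in tokens if token not in ENGLISH_STOP_WORDS]


query = "adidas shoes men 15"
letter_based = drop_stop_words(letter_tokens(query))
word_based = drop_stop_words(word_tokens(query))
negation = drop_stop_words(word_tokens("laptop not chromebook"))
```

Here `letter_based` loses `15` while `word_based` keeps it. Note also that `not` is in the stop list, so in this imitation a query such as `laptop not chromebook` is reduced to `laptop` and `chromebook`.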

## Retrieval

Raw response payload from Elasticsearch.

In [12]:
rich.print(client_search.fetch_results(query))
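
How `fetch_results` builds its request is not shown, so the sketch below is an assumption: a plain `match` query on `product_title`, plus a helper that flattens the standard `hits.hits` structure of a search response. The function names, the field choice, and the fake response are all made up for illustration.

```python
def build_match_query(query, field="product_title", size=10):
    """Build a request body for a full-text match query on one field."""
    return {"size": size, "query": {"match": {field: query}}}


def extract_hits(response):
    """Flatten a search response into one dict per hit."""
    return [
        {"id": hit["_id"], "score": hit["_score"], **hit["_source"]}
        for hit in response["hits"]["hits"]
    ]


# Hand-written response with the usual shape; the score is invented.
fake_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "products",
                "_id": "B000MOO21W",
                "_score": 7.3,
                "_source": {"product_title": "Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan"},
            }
        ],
    }
}
body = build_match_query("adidas shoes men 15")
rows = extract_hits(fake_response)
```

With the 7.x Python client, such a body could be sent with `client.search(index="products", body=body)`, where `products` is again a hypothetical index name.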

In [30]:
with open("../data/possible_queries.txt") as f:
    interesting_queries = json.load(f)

Results fetched by Elasticsearch using BM25.

In [31]:
@interact
def interact_find_results(query=interesting_queries):
    code.helper.find_results(query, df, show_ground_truth=False, num_hits=20)


In [15]:
interesting_queries

['ps4 for under 100',
 'seroma treatment',
 'streptocarpella',
 'trastes kawaii',
 'turbuhaler',
 '$1 dollar toys not fidgets',
 '$1 million that look real but that is not it',
 '$10 stuff not books',
 '$100 things that are not electronics',
 '$4 worthy items not books',
 '$80 golf cart without roof',
 "'freeze dryer' machine not dehydrator",
 '.3 cc syringe without needle',
 '/ machine wash low bleach iron do not dry clean. suggested',
 '0.9%sodium chloride without any preservatives',
 '1 cant believe its not butter and herb',
 '1/8” x 1” neoprene foam without adhesive',
 'black latex gloves not nitrile',
 'christianity without the cross',
 'exercise without exercise',
 'freeze dryer machine not dehydrator',
 "gnc women's prenatal formula without iron",
 'instax mini 11 accessories kit without camera',
 'laptop not chromebook',
 'microphone without headset',
 'mindfulness without the bells and beads',
 'nutrients to age without senility',
 'I want to get fancier one than the laptop I 

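Try your own queries. Only the last assignment in the cell below takes effect, so move the query you want to run to the bottom.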

In [28]:
query = "airpod case cute"  # X


query = "adidas shoes men 15" # X


query = "adrenaline bbq"

query = "laptop not chromebook"

query = "compression socks women"

query = "dresses for women party wedding"

query = "Maxi Dresses"

query = "snow boots for women"
 
query = "gifts for men"  # X relevancy

query = "digital alarm clock with usb port"

query = "norelco shavers for men" # X  relevancy


query = "salas para casa sofas baratos"


query = "jumpers for ladies"

query = "sweaters for women"

In [29]:
code.helper.find_results(query, df, show_ground_truth=False, num_hits=50)