## Querying Elasticsearch with the elasticsearch-dsl Python package

This week we will send queries to Elasticsearch from Python. We will use the [elasticsearch-dsl](https://elasticsearch-dsl.readthedocs.io/en/latest/index.html) package, a high-level library that helps with writing and running queries against Elasticsearch. It is built on top of the official low-level client (elasticsearch-py).

It provides a more convenient and idiomatic way to write and manipulate queries. It stays close to the Elasticsearch JSON DSL, mirroring its terminology and structure.

In [1]:
import pandas as pd
import requests

In [2]:
!pip install elasticsearch-dsl

Collecting elasticsearch-dsl
  Downloading elasticsearch_dsl-7.1.0-py2.py3-none-any.whl (53 kB)
Collecting elasticsearch<8.0.0,>=7.0.0
  Downloading elasticsearch-7.6.0-py2.py3-none-any.whl (88 kB)
Installing collected packages: elasticsearch, elasticsearch-dsl
Successfully installed elasticsearch-7.6.0 elasticsearch-dsl-7.1.0


In [2]:
import elasticsearch_dsl
from elasticsearch_dsl import connections

In [211]:
connections.create_connection(alias='elastic', hosts=['https://1b12ab3f80124603bff9a8d923548521.us-east-1.aws.found.io:9243'],
                             timeout=60, http_auth=('elastic','zMJcA6De12xdU8OiVmOtDCu4'))

<Elasticsearch([{'host': '1b12ab3f80124603bff9a8d923548521.us-east-1.aws.found.io', 'port': 9243, 'use_ssl': True}])>

In [6]:
from elasticsearch import Elasticsearch
client = Elasticsearch('https://elastic:zMJcA6De12xdU8OiVmOtDCu4@1b12ab3f80124603bff9a8d923548521.us-east-1.aws.found.io:9243/kibana_sample_data_ecommerce/')

In [7]:
from elasticsearch_dsl import Q, Search

q = Q("match", customer_id="41") 

s = Search().using(client).query(q)
response = s.execute()

In [8]:
response

<Response: [<Hit(kibana_sample_data_ecommerce/cM-6MHEB7buSyQM2s8sN): {'category': ["Men's Clothing"], 'currency': 'EUR', 'custome...}>, <Hit(kibana_sample_data_ecommerce/zc-6MHEB7buSyQM2s8sN): {'category': ["Men's Clothing"], 'currency': 'EUR', 'custome...}>, <Hit(kibana_sample_data_ecommerce/xs-6MHEB7buSyQM2s8wP): {'category': ["Men's Shoes", "Men's Accessories"], 'currency...}>, <Hit(kibana_sample_data_ecommerce/1c-6MHEB7buSyQM2s8wP): {'category': ["Men's Accessories", "Men's Shoes"], 'currency...}>, <Hit(kibana_sample_data_ecommerce/_M-6MHEB7buSyQM2s8wP): {'category': ["Men's Accessories", "Men's Clothing"], 'curre...}>, <Hit(kibana_sample_data_ecommerce/Hs-6MHEB7buSyQM2s80P): {'category': ["Men's Clothing"], 'currency': 'EUR', 'custome...}>, <Hit(kibana_sample_data_ecommerce/Us-6MHEB7buSyQM2ytOa): {'category': ["Men's Accessories", "Men's Clothing"], 'curre...}>, <Hit(kibana_sample_data_ecommerce/Y8-6MHEB7buSyQM2ytOa): {'category': ["Men's Clothing"], 'currency': 'EUR', 'custome...

In [9]:
print(s.to_dict())

{'query': {'match': {'customer_id': '41'}}}
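By default Elasticsearch returns only the first 10 hits of a search, so a request body like the one printed above yields at most ten rows. The sketch below shows, with plain dictionaries only, how the `from` and `size` keys of the JSON DSL extend such a body. The helper name `paginate` is hypothetical and is not part of elasticsearch-dsl.

```python
import copy
import json


def paginate(body, start=0, size=10):
    """Return a copy of a search body with explicit 'from' and 'size' keys.

    Hypothetical helper for illustration; it only manipulates dictionaries.
    """
    if start < 0 or size < 0:
        raise ValueError("start and size must be non-negative")
    paged = copy.deepcopy(body)  # leave the caller's dictionary untouched
    paged["from"] = start        # offset of the first hit to return
    paged["size"] = size         # maximum number of hits to return
    return paged


# Same body as printed by s.to_dict() above.
body = {"query": {"match": {"customer_id": "41"}}}
paged = paginate(body, start=0, size=50)
print(json.dumps(paged, indent=2))
```

In elasticsearch-dsl itself, slicing the `Search` object (for example `s[0:50]`) is the usual way to set these two values, so a helper like this should only be needed when working with raw request bodies.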


In [10]:
df = []
for h in response.hits.hits:
    df.append(h["_source"].to_dict())
    

In [11]:
pd.DataFrame(df)

,category,currency,customer_first_name,customer_full_name,customer_gender,customer_id,customer_last_name,customer_phone,day_of_week,day_of_week_i,...,order_date,order_id,products,sku,taxful_total_price,taxless_total_price,total_quantity,total_unique_products,type,user
0,[Men's Clothing],EUR,Jim,Jim Dixon,MALE,41,Dixon,,Saturday,5,...,2020-03-21T10:53:46+00:00,553689,"[{'base_price': 32.99, 'discount_percentage': ...","[ZO0416304163, ZO0530405304]",61.98,61.98,2,2,order,jim
1,[Men's Clothing],EUR,Jim,Jim Lewis,MALE,41,Lewis,,Thursday,3,...,2020-04-09T06:21:36+00:00,579005,"[{'base_price': 10.99, 'discount_percentage': ...","[ZO0629306293, ZO0578405784]",37.98,37.98,2,2,order,jim
2,"[Men's Shoes, Men's Accessories]",EUR,Jim,Jim Rowe,MALE,41,Rowe,,Friday,4,...,2020-03-27T12:57:36+00:00,561969,"[{'base_price': 41.99, 'discount_percentage': ...","[ZO0521205212, ZO0316003160]",74.98,74.98,2,2,order,jim
3,"[Men's Accessories, Men's Shoes]",EUR,Jim,Jim Jensen,MALE,41,Jensen,,Monday,0,...,2020-03-23T21:02:53+00:00,556939,"[{'base_price': 11.99, 'discount_percentage': ...","[ZO0463404634, ZO0404804048]",76.98,76.98,2,2,order,jim
4,"[Men's Accessories, Men's Clothing]",EUR,Jim,Jim Sanders,MALE,41,Sanders,,Wednesday,2,...,2020-03-25T14:47:02+00:00,559344,"[{'base_price': 11.99, 'discount_percentage': ...","[ZO0596105961, ZO0588805888]",32.98,32.98,2,2,order,jim
5,[Men's Clothing],EUR,Jim,Jim Love,MALE,41,Love,,Monday,0,...,2020-04-13T10:20:38+00:00,584718,"[{'base_price': 49.99, 'discount_percentage': ...","[ZO0428604286, ZO0579905799]",72.98,72.98,2,2,order,jim
6,"[Men's Accessories, Men's Clothing]",EUR,Jim,Jim Ramsey,MALE,41,Ramsey,,Sunday,6,...,2020-04-12T22:40:48+00:00,584070,"[{'base_price': 84.99, 'discount_percentage': ...","[ZO0467304673, ZO0530605306]",113.98,113.98,2,2,order,jim
7,[Men's Clothing],EUR,Jim,Jim Stewart,MALE,41,Stewart,,Tuesday,1,...,2020-03-31T16:53:46+00:00,567592,"[{'base_price': 28.99, 'discount_percentage': ...","[ZO0535405354, ZO0291302913]",228.98,228.98,2,2,order,jim
8,"[Men's Shoes, Men's Clothing]",EUR,Jim,Jim Foster,MALE,41,Foster,,Monday,0,...,2020-04-13T01:20:38+00:00,584234,"[{'base_price': 74.99, 'discount_percentage': ...","[ZO0400704007, ZO0565605656]",87.98,87.98,2,2,order,jim
9,"[Men's Shoes, Men's Clothing]",EUR,Jim,Jim Henderson,MALE,41,Henderson,,Wednesday,2,...,2020-04-08T05:38:24+00:00,577635,"[{'base_price': 64.99, 'discount_percentage': ...","[ZO0687406874, ZO0566105661]",79.98,79.98,2,2,order,jim
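The loop that collects `_source` from every hit is repeated for each query in this notebook. One possible way to factor it out is sketched below. It assumes the raw search response is available as a plain dictionary with the standard `hits.hits[]._source` layout (for instance via `response.to_dict()`, which is an assumption about how you obtain it). The function name `hits_to_records` and the sample data are made up for illustration.

```python
def hits_to_records(raw_response, keep_id=True):
    """Turn a raw Elasticsearch search response into a list of flat records.

    Hypothetical helper: expects the usual {"hits": {"hits": [...]}} layout
    and copies each hit's "_source", optionally adding the document "_id".
    """
    records = []
    for hit in raw_response.get("hits", {}).get("hits", []):
        record = dict(hit.get("_source", {}))  # shallow copy of the document
        if keep_id:
            record["_id"] = hit.get("_id")
        records.append(record)
    return records


# Made-up response with the same shape as a real one (ids are placeholders).
sample = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "id-1", "_source": {"customer_id": 41, "customer_last_name": "Dixon"}},
            {"_id": "id-2", "_source": {"customer_id": 41, "customer_last_name": "Lewis"}},
        ],
    }
}

records = hits_to_records(sample)
print(records)
```

The resulting list can be passed straight to `pd.DataFrame(records)`, in the same way as the `df` list built above.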


#### Combining queries

In [12]:
q = Q("match", customer_gender="FEMALE") | Q("match", category="shoes")

s = Search().using(client).query(q)
response = s.execute()

In [13]:
s.to_dict()

{'query': {'bool': {'should': [{'match': {'customer_gender': 'FEMALE'}},
    {'match': {'category': 'shoes'}}]}}}
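The output above shows that `|` is serialized as a `bool` query with a `should` clause. As a rough sketch, the other operators on `Q` objects follow the same pattern: `&` corresponds to `must` and `~` to `must_not`. The snippet builds these bodies from plain dictionaries so the structure is visible without a cluster. The helper `bool_query` is hypothetical and only covers the simple two-clause case.

```python
def bool_query(**clauses):
    """Build a bool query body from keyword arguments such as must=[...].

    Hypothetical helper for illustration; empty clause lists are dropped.
    """
    return {"bool": {name: list(items) for name, items in clauses.items() if items}}


female = {"match": {"customer_gender": "FEMALE"}}
shoes = {"match": {"category": "shoes"}}

# Roughly what Q(...) | Q(...) serializes to: either clause may match.
either = bool_query(should=[female, shoes])

# Roughly what Q(...) & Q(...) serializes to: both clauses must match.
both = bool_query(must=[female, shoes])

# Roughly what ~Q(...) serializes to: the clause must not match.
not_shoes = bool_query(must_not=[shoes])

for name, body in [("or", either), ("and", both), ("not", not_shoes)]:
    print(name, body)
```

With `should`, a document that satisfies both clauses generally scores higher than one that satisfies only one. That is a plausible reason why the first hits of an OR query can look like the result of an AND query.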

In [14]:
df = []
for h in response.hits.hits:
    df.append(h["_source"].to_dict())
    
pd.DataFrame(df)

,category,currency,customer_first_name,customer_full_name,customer_gender,customer_id,customer_last_name,customer_phone,day_of_week,day_of_week_i,...,order_date,order_id,products,sku,taxful_total_price,taxless_total_price,total_quantity,total_unique_products,type,user
0,[Women's Shoes],EUR,Gwen,Gwen Stokes,FEMALE,26,Stokes,,Tuesday,1,...,2020-03-24T13:26:24+00:00,557899,"[{'base_price': 64.99, 'discount_percentage': ...","[ZO0375403754, ZO0673306733]",154.98,154.98,2,2,order,gwen
1,[Women's Shoes],EUR,Mary,Mary Wood,FEMALE,20,Wood,,Saturday,5,...,2020-04-04T22:35:02+00:00,573302,"[{'base_price': 28.99, 'discount_percentage': ...","[ZO0008100081, ZO0017000170]",57.98,57.98,2,2,order,mary
2,[Women's Shoes],EUR,Rabbia Al,Rabbia Al Moss,FEMALE,5,Moss,,Tuesday,1,...,2020-03-24T05:45:36+00:00,557447,"[{'base_price': 84.99, 'discount_percentage': ...","[ZO0247302473, ZO0669706697]",144.98,144.98,2,2,order,rabbia
3,[Women's Shoes],EUR,Abigail,Abigail Potter,FEMALE,46,Potter,,Sunday,6,...,2020-04-12T23:19:41+00:00,584113,"[{'base_price': 64.99, 'discount_percentage': ...","[ZO0679806798, ZO0244902449]",139.98,139.98,2,2,order,abigail
4,[Women's Shoes],EUR,Brigitte,Brigitte Goodman,FEMALE,12,Goodman,,Friday,4,...,2020-04-03T07:52:19+00:00,571128,"[{'base_price': 28.99, 'discount_percentage': ...","[ZO0140701407, ZO0003800038]",53.98,53.98,2,2,order,brigitte
5,[Women's Shoes],EUR,Elyssa,Elyssa Reese,FEMALE,27,Reese,,Sunday,6,...,2020-04-05T13:07:41+00:00,574115,"[{'base_price': 11.99, 'discount_percentage': ...","[ZO0002000020, ZO0137701377]",36.98,36.98,2,2,order,elyssa
6,[Women's Shoes],EUR,Wilhemina St.,Wilhemina St. Ball,FEMALE,17,Ball,,Thursday,3,...,2020-03-26T13:45:07+00:00,560628,"[{'base_price': 28.99, 'discount_percentage': ...","[ZO0143801438, ZO0674006740]",103.98,103.98,2,2,order,wilhemina
7,[Women's Shoes],EUR,Sonya,Sonya Lloyd,FEMALE,28,Lloyd,,Saturday,5,...,2020-04-18T07:26:24+00:00,591195,"[{'base_price': 29.99, 'discount_percentage': ...","[ZO0140301403, ZO0383103831]",89.98,89.98,2,2,order,sonya
8,[Women's Shoes],EUR,Yasmine,Yasmine Mckinney,FEMALE,43,Mckinney,,Saturday,5,...,2020-03-21T00:21:36+00:00,553099,"[{'base_price': 24.99, 'discount_percentage': ...","[ZO0010600106, ZO0018900189]",66.98,66.98,2,2,order,yasmine
9,[Women's Shoes],EUR,Brigitte,Brigitte Barber,FEMALE,12,Barber,,Saturday,5,...,2020-04-11T09:15:50+00:00,581886,"[{'base_price': 69.99, 'discount_percentage': ...","[ZO0665706657, ZO0027400274]",102.98,102.98,2,2,order,brigitte


#### Filtering
A `match` clause in query context answers the question "how well does this document match the query clause?" and produces a relevance score. A filter instead answers "does this document match the query clause?", so the answer is a simple yes or no and no score is involved (https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-filter-context.html).
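Query context and filter context can be mixed in one request: clauses under `must` contribute to the relevance score, while clauses under `filter` only include or exclude documents. The sketch below builds such a body from plain dictionaries. The particular combination of fields is chosen only for illustration and is not a query run in this notebook.

```python
import json

# Scored part: how well does the category text match "shoes"?
scored_clause = {"match": {"category": "shoes"}}

# Unscored part: yes/no test on the day of the week.
filter_clause = {"terms": {"day_of_week": ["Tuesday", "Thursday"]}}

body = {
    "query": {
        "bool": {
            "must": [scored_clause],    # query context, affects _score
            "filter": [filter_clause],  # filter context, no scoring
        }
    }
}

print(json.dumps(body, indent=2))
```

With elasticsearch-dsl, a similar body would typically come from chaining `.query(...)` and `.filter(...)` on the same `Search` object.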

In [15]:
s = Search().using(client).filter('terms', day_of_week=['Tuesday', 'Thursday'])
response = s.execute()

In [16]:
s.to_dict()

{'query': {'bool': {'filter': [{'terms': {'day_of_week': ['Tuesday',
       'Thursday']}}]}}}

In [17]:
df = []
for h in response.hits.hits:
    df.append(h["_source"].to_dict())
    
pd.DataFrame(df)

,category,currency,customer_first_name,customer_full_name,customer_gender,customer_id,customer_last_name,customer_phone,day_of_week,day_of_week_i,...,order_date,order_id,products,sku,taxful_total_price,taxless_total_price,total_quantity,total_unique_products,type,user
0,[Men's Clothing],EUR,Ahmed Al,Ahmed Al Morris,MALE,4,Morris,,Tuesday,1,...,2020-04-14T19:53:46+00:00,586554,"[{'base_price': 11.99, 'discount_percentage': ...","[ZO0441604416, ZO0113501135]",44.98,44.98,2,2,order,ahmed
1,[Women's Shoes],EUR,rania,rania Gilbert,FEMALE,24,Gilbert,,Thursday,3,...,2020-04-02T12:28:48+00:00,570056,"[{'base_price': 24.99, 'discount_percentage': ...","[ZO0131801318, ZO0215802158]",49.98,49.98,2,2,order,rani
2,[Women's Shoes],EUR,Elyssa,Elyssa Mccormick,FEMALE,27,Mccormick,,Tuesday,1,...,2020-04-07T03:31:41+00:00,576208,"[{'base_price': 64.99, 'discount_percentage': ...","[ZO0666606666, ZO0139201392]",97.98,97.98,2,2,order,elyssa
3,"[Women's Shoes, Women's Accessories]",EUR,Mary,Mary Adams,FEMALE,20,Adams,,Tuesday,1,...,2020-04-07T07:56:38+00:00,576454,"[{'base_price': 64.99, 'discount_percentage': ...","[ZO0245302453, ZO0357203572]",129.98,129.98,2,2,order,mary
4,"[Women's Shoes, Women's Accessories]",EUR,Brigitte,Brigitte Schultz,FEMALE,12,Schultz,,Tuesday,1,...,2020-04-07T08:38:24+00:00,576501,"[{'base_price': 59.99, 'discount_percentage': ...","[ZO0323203232, ZO0194201942]",73.98,73.98,2,2,order,brigitte
5,[Women's Clothing],EUR,Selena,Selena Goodman,FEMALE,42,Goodman,,Tuesday,1,...,2020-04-07T09:43:12+00:00,576557,"[{'base_price': 10.99, 'discount_percentage': ...","[ZO0641506415, ZO0171301713]",22.98,22.98,2,2,order,selena
6,[Men's Clothing],EUR,Fitzgerald,Fitzgerald Schultz,MALE,11,Schultz,,Tuesday,1,...,2020-04-14T12:10:05+00:00,586119,"[{'base_price': 24.99, 'discount_percentage': ...","[ZO0630306303, ZO0532605326]",41.98,41.98,2,2,order,fuzzy
7,"[Men's Clothing, Men's Shoes]",EUR,Recip,Recip Perkins,MALE,10,Perkins,,Tuesday,1,...,2020-04-14T12:46:05+00:00,586157,"[{'base_price': 10.99, 'discount_percentage': ...","[ZO0130401304, ZO0509705097]",27.98,27.98,2,2,order,recip
8,"[Women's Clothing, Women's Shoes]",EUR,Wilhemina St.,Wilhemina St. Tran,FEMALE,17,Tran,,Tuesday,1,...,2020-04-14T13:40:48+00:00,586213,"[{'base_price': 32.99, 'discount_percentage': ...","[ZO0225502255, ZO0031400314]",74.98,74.98,2,2,order,wilhemina
9,"[Women's Shoes, Women's Clothing]",EUR,rania,rania Clayton,FEMALE,24,Clayton,,Tuesday,1,...,2020-04-14T15:07:12+00:00,586303,"[{'base_price': 28.99, 'discount_percentage': ...","[ZO0003300033, ZO0712307123]",47.98,47.98,2,2,order,rani


#### Aggregations

In [18]:
from elasticsearch_dsl import A

a = A('terms', field='customer_gender')

In [19]:

s = Search().using(client)
s.aggs.bucket('gender', 'terms', field='customer_gender')\
    .metric('num_customers', 'value_count', field='customer_id')

Terms(aggs={'num_customers': ValueCount(field='customer_id')}, field='customer_gender')

In [20]:
s.to_dict()

{'aggs': {'gender': {'terms': {'field': 'customer_gender'},
   'aggs': {'num_customers': {'value_count': {'field': 'customer_id'}}}}}}

In [21]:
response = s.execute()

In [22]:
response.aggregations.to_dict()

{'gender': {'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 0,
  'buckets': [{'key': 'FEMALE',
    'doc_count': 2433,
    'num_customers': {'value': 2433}},
   {'key': 'MALE', 'doc_count': 2242, 'num_customers': {'value': 2242}}]}}

In [23]:
df = []
for r in response.aggregations.gender.buckets:
    df.append(r.to_dict())
pd.DataFrame(df)

,doc_count,key,num_customers
0,2433,FEMALE,{'value': 2433}
1,2242,MALE,{'value': 2242}
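In the table above the `num_customers` column still holds nested dictionaries such as `{'value': 2433}`. A small flattening step, sketched below on the aggregation output shown earlier, turns each single-value metric into a plain number before building the DataFrame. The helper name `flatten_buckets` is hypothetical. Note also that `value_count` counts field values rather than distinct customers, which is why it equals `doc_count` here; a `cardinality` aggregation would be the usual choice for an approximate distinct count.

```python
def flatten_buckets(buckets):
    """Replace single-value metric dicts like {'value': 3} with the bare value.

    Hypothetical helper for illustration; other fields are copied unchanged.
    """
    rows = []
    for bucket in buckets:
        row = {}
        for name, value in bucket.items():
            if isinstance(value, dict) and set(value) == {"value"}:
                row[name] = value["value"]  # unwrap the metric result
            else:
                row[name] = value
        rows.append(row)
    return rows


# Aggregation result copied from the output of response.aggregations.to_dict().
aggregations = {
    "gender": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {"key": "FEMALE", "doc_count": 2433, "num_customers": {"value": 2433}},
            {"key": "MALE", "doc_count": 2242, "num_customers": {"value": 2242}},
        ],
    }
}

rows = flatten_buckets(aggregations["gender"]["buckets"])
print(rows)
```

Passing `rows` to `pd.DataFrame` then gives numeric `doc_count` and `num_customers` columns that can be sorted or plotted directly.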
