# Elasticsearch 6.3.2 with Python
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

- Open-source search engine based on Apache Lucene
- Written in Java
- Cross-platform
- Strong focus on scalability
- Designed to take data from any source, analyze it, and search through it
- Communication is done through an HTTP REST API
- Schema-less JSON documents (like NoSQL databases)
- Near-real-time search (a small latency, around 1 s, between the moment a document is indexed and the moment it becomes searchable)
- Used by Quora, Netflix, GitHub, SoundCloud

## Cluster
- Collection of nodes (servers) containing data
- Cluster provides indexing and search capability
- Identified by a unique name (defaults to "elasticsearch")

## Node
- A single server that is part of a cluster
- Stores searchable data
- Participates in a cluster's indexing and search capability
- Identified by a name
- Joins the cluster named "elasticsearch" by default

## Index
- A collection of documents (product, account, movie)
- Corresponds to a database within a relational database system
- Identified by a name, which must be lowercase (used when indexing, searching, updating, and deleting documents in the index)
- You can define as many indices as you want within a cluster

## Type
- Represents a class/category of similar documents, e.g. "user"
- Consists of a name and a mapping
- Roughly equivalent to a table in a relational database
- Each type has its own mapping; note that indices created in Elasticsearch 6.x can contain only a single type (earlier versions allowed several per index)
- Stored in a metadata field named _type


## Mapping
- Similar to a database schema for a table in a relational database
- Describes the fields that a document of a given type may have
- Includes the data type of each field (string, integer, date) and information on how fields should be indexed
- Dynamic mapping means that defining a mapping explicitly is optional

## Document
- A basic unit of information that can be indexed
- Consists of fields, which are key/value pairs
- Corresponds to an object in OOP
- Documents are expressed in JSON
- You can store as many documents in an index as you want

## Shard
- An index can be divided into multiple pieces called shards, which is useful if an index contains more data than the hardware of a single node can store (e.g. 1 TB of data on a 500 GB disk)
- A shard is a fully functional and independent index that can be stored on any node in a cluster
- The number of shards can be specified when creating an index
- Allows you to scale horizontally by content volume (index space)
- Allows you to distribute and parallelize operations across shards, which increases performance
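
To illustrate how documents end up spread across shards: Elasticsearch picks a shard by hashing a routing value (the document id by default) modulo the number of primary shards. The following is only a sketch of that idea. `pick_shard` is a hypothetical helper, and `zlib.crc32` is an assumed stand-in for the Murmur3 hash that Elasticsearch itself uses.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # the 6.x default number of primary shards

def pick_shard(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Hypothetical helper: map a routing value to a shard number."""
    # crc32 is only a stand-in for the hash Elasticsearch really uses
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards

# The same id always lands on the same shard
placement = {doc_id: pick_shard(doc_id) for doc_id in ["0", "1", "2"]}
print(placement)
```

This also suggests why the number of primary shards is fixed once an index is created: changing it would change where existing documents are expected to live.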

## Replica
- A copy of a shard
- Provides high availability in case a shard or node fails; a replica never resides on the same node as the original shard
- Allows scaling search volume, because search queries can be executed on all replicas in parallel
- By default, Elasticsearch adds 5 primary shards and 1 replica for each index
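
The shard and replica counts can also be set explicitly when an index is created. Below is a sketch of such a request body. The index name `my_index` is hypothetical, and the `es.indices.create` call is left commented out because it needs a running cluster.

```python
# Index settings body: 3 primary shards, each with 2 replica copies
index_body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 2,
    }
}

# Total number of shard copies the cluster has to place on its nodes
primaries = index_body["settings"]["number_of_shards"]
replicas = index_body["settings"]["number_of_replicas"]
total_shard_copies = primaries * (1 + replicas)
print(total_shard_copies)  # 9

# With a client connected to a cluster (es = Elasticsearch()):
# es.indices.create(index="my_index", body=index_body)
```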

## The Missing Web UI for Elasticsearch
https://github.com/appbaseio/dejavu

<img src="https://camo.githubusercontent.com/5a15b79cebc8c9fe20ac6ac6ea0f54472b5fc41e/68747470733a2f2f692e696d6775722e636f6d2f484745455966752e676966">

In [34]:
from datetime import datetime  # always use this datetime!
from elasticsearch import Elasticsearch

# only used to pretty-print dict output
import json

## 1- Connecting the client to the server

In [2]:
# by default we connect to localhost:9200
es = Elasticsearch()

## 2- Creating an index

In [3]:
# drop it if it already exists
es.indices.delete(index='eventbrite', ignore=[400, 404])

mappings = {
    'mappings' : {
        'event' : {
            'properties' : {
                'id' : {'type' : 'keyword', 'index' : 'true'},
                 # the english analyzer lowercases,
                 # splits on whitespace and punctuation,
                 # and stems: e.g. farming, farmed => farm
                'name' : {'type' : 'text', 'analyzer' : 'english'}, 

                'description' : {'type' : 'text', 'analyzer' : 'english'}, 

                'city' : {'type' : 'keyword', 'index' : 'true'}, 

                'start_date' : {'type' : 'date'}, 

                'price' : {'type' : 'float'}
            }
        }
    }
}

In [4]:
# ignore status code 400 (index already exists)
print(es.indices.create(index='eventbrite', body=mappings, ignore=400))

{'shards_acknowledged': True, 'acknowledged': True, 'index': 'eventbrite'}


## 3- Indexing (Inserting) events

In [29]:
docs = [
    {
        'id' : '0',
        'name' : 'Filbert Sorting for Fun and Profit', 
        'description' : 'test', 
        'city' : 'Nashville', 
        'start_date' : datetime(2016, 2, 15), 
        'price' : 15.99
    },
    {
        'id' : '1',
        'name' : 'test test2', 
        'description' : 'test test2', 
        'city' : 'Nashville', 
        'start_date' : datetime(2017, 2, 15), 
        'price' : 15.99
    },
    {
        'id' : '2',
        'name' : 'test test2', 
        'description' : 'test test2', 
        'city' : 'California', 
        'start_date' : datetime(2017, 2, 15), 
        'price' : 25
    }
]

In [30]:
delete_all_query = {
    'query' : {
        'match_all' : {}
    }
}
es.delete_by_query(index="eventbrite", doc_type="event", body=delete_all_query)

{'batches': 1,
 'deleted': 2,
 'failures': [],
 'noops': 0,
 'requests_per_second': -1.0,
 'retries': {'bulk': 0, 'search': 0},
 'throttled_millis': 0,
 'throttled_until_millis': 0,
 'timed_out': False,
 'took': 13,
 'total': 2,
 'version_conflicts': 0}

In [31]:
# there are options for bulk and streaming indexing
for i, doc in enumerate(docs):
    # Adds a typed JSON document in a specific index, making it searchable.
    # Behind the scenes this method calls index(…, op_type='create')
    res = es.create(index="eventbrite", doc_type="event", body=doc, id=int(doc['id']))
    print(res)

{'_type': 'event', '_primary_term': 1, 'result': 'created', '_index': 'eventbrite', '_id': '0', '_shards': {'successful': 1, 'failed': 0, 'total': 2}, '_seq_no': 2, '_version': 3}
{'_type': 'event', '_primary_term': 1, 'result': 'created', '_index': 'eventbrite', '_id': '1', '_shards': {'successful': 1, 'failed': 0, 'total': 2}, '_seq_no': 2, '_version': 3}
{'_type': 'event', '_primary_term': 1, 'result': 'created', '_index': 'eventbrite', '_id': '2', '_shards': {'successful': 1, 'failed': 0, 'total': 2}, '_seq_no': 0, '_version': 1}
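
For larger batches, the `elasticsearch.helpers.bulk` helper can send many documents in one request. Here is a sketch of how the actions could be built, assuming the same index and type names as above. `build_actions` and `sample_docs` are illustrative names, and the helper call itself is commented out because it needs a running cluster.

```python
def build_actions(documents, index="eventbrite", doc_type="event"):
    """Yield one bulk action per document (illustrative helper)."""
    for doc in documents:
        yield {
            "_op_type": "create",  # fail if the id already exists
            "_index": index,
            "_type": doc_type,
            "_id": doc["id"],
            "_source": doc,
        }

sample_docs = [
    {"id": "0", "name": "Filbert Sorting for Fun and Profit", "city": "Nashville"},
    {"id": "1", "name": "test test2", "city": "Nashville"},
]
actions = list(build_actions(sample_docs))
print(actions[0]["_id"], actions[0]["_index"])

# from elasticsearch import helpers
# success_count, errors = helpers.bulk(es, build_actions(sample_docs))
```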


## 4- Querying events : match_all with Query DSL

In [21]:
query = {
    'query' : {
        'match_all' : {} # select * from events.
    }
}
es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 1.0,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'},
   {'_id': '1',
    '_index': 'eventbrite',
    '_score': 1.0,
    '_source': {'city': 'Nashville',
     'description': 'test test2',
     'id': '1',
     'name': 'test test2',
     'price': 15.99,
     'start_date': '2017-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 1.0,
  'total': 2},
 'timed_out': False,
 'took': 1}
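
The response is a plain dict, so the matching documents can be pulled out of `hits.hits` directly. A sketch using a trimmed copy of the response above (`sample_response` and `rows` are illustrative names):

```python
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {"_id": "0", "_score": 1.0,
             "_source": {"name": "Filbert Sorting for Fun and Profit",
                         "city": "Nashville"}},
            {"_id": "1", "_score": 1.0,
             "_source": {"name": "test test2", "city": "Nashville"}},
        ],
    }
}

# Keep only the id, score and name of every hit
rows = [
    (hit["_id"], hit["_score"], hit["_source"]["name"])
    for hit in sample_response["hits"]["hits"]
]
for row in rows:
    print(row)
```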

## 4- Querying events : term

In [20]:
query = {
    'query' : {
        'term' : {
          'city' : 'Nashville'   
        }
    }
}

es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 0.2876821,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'},
   {'_id': '1',
    '_index': 'eventbrite',
    '_score': 0.2876821,
    '_source': {'city': 'Nashville',
     'description': 'test test2',
     'id': '1',
     'name': 'test test2',
     'price': 15.99,
     'start_date': '2017-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 0.2876821,
  'total': 2},
 'timed_out': False,
 'took': 1}

## 4- Querying events : match

In [19]:
query = {
    'query' : {
        'match' : {
          'name' : 'SORTED' # the english analyzer lowercases and stems the query, so it matches 'Sorting'
        }
    }
}

es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 0.2876821,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 0.2876821,
  'total': 1},
 'timed_out': False,
 'took': 1}

## 5- Querying events : match_phrase

In [18]:
query = {
    'query' : {
        'match_phrase' : {
          'name' : 'sorting filbert' # terms are in the wrong order, so there is no match
        }
    }
}
es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 1}

In [17]:
query = {
    'query' : {
        'match_phrase' : {
          'name' : 'filberts sort' # still matches: stemming maps different word forms to the same term
        }
    }
}

es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 0.5753642,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 0.5753642,
  'total': 1},
 'timed_out': False,
 'took': 1}

In [16]:
query = {
    'query' : {
        'match_phrase' : {
          'name' : 'filbert fun' # no match: the words are not adjacent in the document
        }
    }
}

es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 2}

In [15]:
query = {
    'query' : {
        'match_phrase' : {
          'name' : {
            'query' : 'filbert fun',
            'slop' : 2 # allow the terms to be up to 2 positions apart
            }
        }
    }
}

es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 0.27517417,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 0.27517417,
  'total': 1},
 'timed_out': False,
 'took': 3}
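
The four `match_phrase` results above can be reproduced with a small position check. This is a deliberately simplified sketch, not how Elasticsearch implements phrase queries. The token positions are assumed (stop words dropped but leaving gaps), `phrase_matches` is a hypothetical helper, and it only handles terms that appear in order. Real Elasticsearch also accepts reordered terms when the slop is large enough, which this sketch does not model.

```python
# Assumed token positions for "Filbert Sorting for Fun and Profit" after
# analysis; the stop words "for" and "and" are dropped but leave gaps
positions = {"filbert": 0, "sort": 1, "fun": 3, "profit": 5}

def phrase_matches(terms, positions, slop=0):
    """Simplified in-order phrase check for a list of analyzed terms."""
    if any(term not in positions for term in terms):
        return False
    extra_moves = 0
    for left, right in zip(terms, terms[1:]):
        gap = positions[right] - positions[left] - 1
        if gap < 0:  # terms out of order: not handled by this sketch
            return False
        extra_moves += gap
    return extra_moves <= slop

print(phrase_matches(["filbert", "sort"], positions))         # True
print(phrase_matches(["sort", "filbert"], positions))         # False
print(phrase_matches(["filbert", "fun"], positions))          # False
print(phrase_matches(["filbert", "fun"], positions, slop=2))  # True
```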

## 6- Querying events : Complex query with bool

In [26]:
query = {
    'query' : {
        'bool' : {
          'must' : [{
            'term' : {
            'city' : 'Nashville'
                }
            }],
            'should' : [{
            'match_phrase' : {
              'name' : {
                'query' : 'filbert fun',
                    'slop' : 2, # allow the terms to be up to 2 positions apart
                    'boost' : 2 # controls how important this clause is (adjusts its weight)
                    }
                }
            }]
        }
    }
}
es.search(index='eventbrite', body=query)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'eventbrite',
    '_score': 0.83803046,
    '_source': {'city': 'Nashville',
     'description': 'test',
     'id': '0',
     'name': 'Filbert Sorting for Fun and Profit',
     'price': 15.99,
     'start_date': '2016-02-15T00:00:00'},
    '_type': 'event'},
   {'_id': '1',
    '_index': 'eventbrite',
    '_score': 0.2876821,
    '_source': {'city': 'Nashville',
     'description': 'test test2',
     'id': '1',
     'name': 'test test2',
     'price': 15.99,
     'start_date': '2017-02-15T00:00:00'},
    '_type': 'event'}],
  'max_score': 0.83803046,
  'total': 2},
 'timed_out': False,
 'took': 2}

## 7- Scoring based on TF*IDF
**Term frequency** \* **Inverse document frequency**

In [27]:
""" Example:
Phrase = "hey diddle diddle the cat and the fiddle"
Query = "the diddle"

TF(the) = 2
TF(diddle) = 2
IDF(the) = 1/infinity = 0  (the word "the" is very common across documents)
IDF(diddle) = 1/7
score = TF(the) * IDF(the)  + TF(diddle) * IDF(diddle) = 2/7
 
"""
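
The same idea can be computed on a toy corpus. The sketch below uses one common textbook form, raw term frequency times `log(N / document frequency)`, so the numbers differ from the hand-worked example above. The corpus and the function names are made up for illustration, and Elasticsearch 6.x scores with BM25 by default, which is a refinement of TF*IDF rather than this exact formula.

```python
import math

# Toy corpus: each document is a list of already-analyzed terms
corpus = [
    "hey diddle diddle the cat and the fiddle".split(),
    "the cow jumped over the moon".split(),
    "the little dog laughed".split(),
]

def tf(term, doc):
    """Raw term frequency: how often the term occurs in the document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def score(query_terms, doc, docs):
    """Sum of TF * IDF over all query terms."""
    return sum(tf(t, doc) * idf(t, docs) for t in query_terms)

tfidf_query = ["the", "diddle"]
for doc in corpus:
    print(round(score(tfidf_query, doc, corpus), 3), " ".join(doc))
```

Because "the" appears in every document, its IDF is 0 and it contributes nothing; only "diddle" separates the first document from the others.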

# Search Engine Internals

## Getting Data In : Analysis
1- Tokenization

2- Lower casing

3- Stop wording

4- Stemming 

5- Indexing (secret sauce, analyzer + tokenizer)
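
The five steps above can be sketched in plain Python. This is a toy pipeline under stated assumptions: the stop-word list is tiny, and `naive_stem` is a crude suffix stripper standing in for a real stemmer. It is not the `english` analyzer itself.

```python
import re

STOP_WORDS = {"a", "and", "for", "of", "the", "to"}  # tiny illustrative list

def naive_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"\w+", text)                    # 1- tokenization
    tokens = [t.lower() for t in tokens]                 # 2- lower casing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3- stop wording
    return [naive_stem(t) for t in tokens]               # 4- stemming

def build_index(docs):
    """5- indexing: map each term to the sorted ids of docs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(analyze(text)):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

print(analyze("Filbert Sorting for Fun and Profit"))
print(build_index({1: "Farming farmed", 2: "The farm"}))
```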

## Getting Data Out : boolean query
inverted_index = {
    'hi' : [34, 92, 119],
    'zebra' : [34, 92, 118]
}

hi & zebra ==> [34, 92]

hi or zebra ==> [34, 92, 119, 118]
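
A boolean query over an inverted index boils down to set operations on posting lists. A minimal sketch (`search_and` and `search_or` are illustrative names, and results are sorted for readability):

```python
inverted_index = {
    "hi": [34, 92, 119],
    "zebra": [34, 92, 118],
}

def search_and(index, *terms):
    """Doc ids that contain every term (intersection of posting lists)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_or(index, *terms):
    """Doc ids that contain at least one term (union of posting lists)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.union(*postings)) if postings else []

print(search_and(inverted_index, "hi", "zebra"))  # [34, 92]
print(search_or(inverted_index, "hi", "zebra"))   # [34, 92, 118, 119]
```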

## Getting Data Out : Sorting by relevance
inverted_index = {
    'hi' : [34, 92, 119],
    'zebra' : [34, 92, 118]
}

hi or zebra ==> [34, 92, 119, 118]

- Initialize a priority queue of size k
- For each matching doc:
  - calculate TF * IDF
  - add to the priority queue
  - if the queue now holds more than k docs, pop off the lowest-scoring doc
- Return the contents of the priority queue
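
The loop above can be sketched with `heapq`, keeping only the best-scoring documents seen so far. The scores here are made up, and `top_k` is an illustrative helper rather than anything from Elasticsearch.

```python
import heapq

def top_k(scored_docs, k):
    """Return the k best (score, doc_id) pairs, best first."""
    heap = []  # min-heap: heap[0] is always the lowest-scoring kept doc
    for doc_id, score in scored_docs:
        heapq.heappush(heap, (score, doc_id))
        if len(heap) > k:
            heapq.heappop(heap)  # drop the current lowest score
    return sorted(heap, reverse=True)

# Made-up relevance scores for the matching doc ids
scores = [(34, 0.9), (92, 0.4), (119, 0.7), (118, 0.1)]
print(top_k(scores, k=2))  # [(0.9, 34), (0.7, 119)]
```

Popping the lowest score whenever the queue grows past its limit keeps memory bounded no matter how many documents match.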

## Getting Data Out : Aggregation
inverted_index = {
    'hi' : [34, 92, 119],
    'zebra' : [34, 92, 118]
}

hi or zebra ==> [34, 92, 119, 118]

- Initialize an aggregator (sum, average, count, etc.)
- For each matching doc:
  - retrieve the "interesting" info (e.g. price)
  - add it to the aggregator
- Return the results from the aggregator
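
As a rough illustration of that loop, the sketch below counts documents per city and per price bucket in plain Python, using the city and price values of the three events indexed earlier. The variable names are illustrative, and `Counter` plays the role of the aggregator.

```python
from collections import Counter

# Field values of the matching documents (taken from the events indexed earlier)
matching_docs = [
    {"city": "Nashville", "price": 15.99},
    {"city": "Nashville", "price": 15.99},
    {"city": "California", "price": 25},
]

# "terms"-style aggregation: count documents per city
by_city = Counter(doc["city"] for doc in matching_docs)

# "histogram"-style aggregation: bucket key = price rounded down to the interval
interval = 10
by_price = Counter((doc["price"] // interval) * interval for doc in matching_docs)

print(dict(by_city))
print(dict(by_price))
```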

In [40]:
query = {
    'query' : {
      'match_all' : {}
    },
    'aggs' : {
        'by_city' : {
            'terms' : { #count by city
                'field' : 'city' 
            }
        },
         'by_price' : {
            'histogram' : {
                'field' : 'price',
                'interval' : 10,
            }
        }
    }
}
res = es.search(index='eventbrite', body=query)
print(json.dumps(res, indent=2))

{
  "aggregations": {
    "by_price": {
      "buckets": [
        {
          "key": 10.0,
          "doc_count": 2
        },
        {
          "key": 20.0,
          "doc_count": 1
        }
      ]
    },
    "by_city": {
      "doc_count_error_upper_bound": 0,
      "buckets": [
        {
          "key": "Nashville",
          "doc_count": 2
        },
        {
          "key": "California",
          "doc_count": 1
        }
      ],
      "sum_other_doc_count": 0
    }
  },
  "timed_out": false,
  "_shards": {
    "skipped": 0,
    "successful": 5,
    "failed": 0,
    "total": 5
  },
  "took": 2,
  "hits": {
    "hits": [
      {
        "_index": "eventbrite",
        "_type": "event",
        "_score": 1.0,
        "_source": {
          "name": "Filbert Sorting for Fun and Profit",
          "id": "0",
          "city": "Nashville",
          "description": "test",
          "price": 15.99,
          "start_date": "2016-02-15T00:00:00"
        },
        "_id": "0"
     

In [41]:
query = {
    'query' : {
      'match_all' : {}
    },
    'aggs' : {
        'by_city' : {
            'terms' : { #count by city
                'field' : 'city' 
            },
         'aggs' : {
            'by_price' : {
                'histogram' : {
                    'field' : 'price',
                    'interval' : 10,
                    }
                }
            }
        }
    }
}
res = es.search(index='eventbrite', body=query)
print(json.dumps(res, indent=2))

{
  "aggregations": {
    "by_city": {
      "doc_count_error_upper_bound": 0,
      "buckets": [
        {
          "by_price": {
            "buckets": [
              {
                "key": 10.0,
                "doc_count": 2
              }
            ]
          },
          "key": "Nashville",
          "doc_count": 2
        },
        {
          "by_price": {
            "buckets": [
              {
                "key": 20.0,
                "doc_count": 1
              }
            ]
          },
          "key": "California",
          "doc_count": 1
        }
      ],
      "sum_other_doc_count": 0
    }
  },
  "timed_out": false,
  "_shards": {
    "skipped": 0,
    "successful": 5,
    "failed": 0,
    "total": 5
  },
  "took": 5,
  "hits": {
    "hits": [
      {
        "_index": "eventbrite",
        "_type": "event",
        "_score": 1.0,
        "_source": {
          "name": "Filbert Sorting for Fun and Profit",
          "id": "0",
          "city": "Nashv