# Elasticsearch

Elasticsearch is a distributed search engine based on Lucene. It provides full-text search capabilities via an HTTP web interface and schema-free JSON documents.

In [1]:
elasticsearch_url=localhost:9200
kibana_url=localhost:5601
index_1=my_index

In [2]:
curl -XGET http://$elasticsearch_url/

{
  "name" : "IGp-s0M",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "3sKj2maoR1G14YjtSFRhLw",
  "version" : {
    "number" : "5.5.1",
    "build_hash" : "19c13d0",
    "build_date" : "2017-07-18T20:44:24.823Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}


## Table of Contents:
* [Components of Elasticsearch](#Components-of-Elasticsearch)
    * [Documents](#Documents)
    * [Index](#Index)
* [Creating an Index with Data](#Creating-an-Index-with-Data)
* [Search!](#Search!)
    * [Searching Terms](#Searching-Terms)
    * [Searching Phrases](#Searching-Phrases)
    * [Search with Ranges](#Search-with-Ranges)
    * [The bool query](#The-bool-query)
* [Text Analysis](#Text-Analysis)
    * [What is an Inverted Index?](#What-is-an-Inverted-Index?)
    * [Analyzers](#Analyzers)
    * [Character Filters](#Character-Filters)
    * [Tokenizer](#Tokenizer)
    * [Token Filters](#Token-Filters)
* [ES for CS8903](#ES-for-CS8903)

## Components of Elasticsearch

### Documents

A document is a piece of data you want to search. Documents are similar to records in a traditional RDBMS, but not quite the same: documents in Elasticsearch are stored in a schema-free JSON format.

A document can be any text or numeric data you want to search and/or analyze.

Examples:
  * an entry in a log file,
  * a comment made by a customer on your website,
  * the weather details from a weather station at a specific moment

Specifically, a document is a top-level object that is serialized into JSON and stored in Elasticsearch. Every document has a unique ID, which either:

  * you provide, or
  * Elasticsearch generates for you.

In [3]:
curl -XGET "http://$elasticsearch_url/products/product/5964/_source?pretty=true" -H "Content-Type: application/json"

{
  "brandName" : "Swiss Miss",
  "customerRating" : 1,
  "price" : 17.58,
  "grp_id" : "5964",
  "quantitySold" : 168556,
  "upc12" : "070920474021",
  "productName" : "Swiss Miss Marshmallow Madness Hot Cocoa Mix - 14 Ct"
}


### Index

An index is a collection of documents that have similar characteristics. The documents within an index do not all need to have the same fields. An index can be thought of as a table in a traditional RDBMS, but again, not quite the same.

   * An index contains mappings that define the field names and data types of its documents.
   * It is a logical namespace that maps to where its contents are stored in the cluster.
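
To make the idea of a mapping concrete, here is a minimal Python sketch that checks a document against a subset of the `products` mapping printed in this section. The `PYTHON_TYPES` table and the `conforms` helper are illustrative assumptions, not part of Elasticsearch, which applies its own and more lenient type coercion.

```python
# Subset of the "properties" block of the products mapping.
MAPPING = {
    "brandName": "text",
    "customerRating": "long",
    "grp_id": "keyword",
    "price": "float",
    "quantitySold": "long",
}

# Hypothetical correspondence between mapping types and Python types.
PYTHON_TYPES = {
    "text": str,
    "keyword": str,
    "long": int,
    "float": (int, float),
}


def conforms(doc, mapping):
    """Return the fields of doc whose value does not fit the mapped type.

    Fields absent from doc are fine: documents in an index need not
    share all fields. Fields absent from the mapping are ignored here.
    """
    bad = []
    for field, value in doc.items():
        es_type = mapping.get(field)
        if es_type is not None and not isinstance(value, PYTHON_TYPES[es_type]):
            bad.append(field)
    return bad


doc = {"brandName": "Swiss Miss", "customerRating": 1, "price": 17.58}
print(conforms(doc, MAPPING))                 # []
print(conforms({"price": "cheap"}, MAPPING))  # ['price']
```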

In [4]:
curl -XGET http://$elasticsearch_url/products?pretty=true

{
  "products" : {
    "aliases" : { },
    "mappings" : {
      "product" : {
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "brandName" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "customerRating" : {
            "type" : "long"
          },
          "grp_id" : {
            "type" : "keyword"
          },
          "price" : {
            "type" : "float"
          },
          "productName" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "quantitySold" : {
            "type" : "long"
          },
          "upc12" : {
            "type" : "keyword"
          }
        }
      }
    },
    "settings" : {
      

## Creating an Index with Data

Before we index the documents, we need to create an index to hold them.

In [5]:
curl -XPUT http://$elasticsearch_url/$index_1 -H "Content-Type: application/json"

{"acknowledged":true,"shards_acknowledged":true}

### Indexing Data

I have 3 simple JSON documents. Let's add them to the index called "my_index".

{
"username": "harrison",
"comment": "My Favourite movie is Star Wars!"
}

{
"username": "john",
"comment": "The North Star is right above my house."
}

{
"username": "lily",
"comment": "My Favourite movie star in Hollywood is Harrison Ford."
}

Here, "username" and "comment" are fields.
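
As a rough Python counterpart to the curl commands used in this section, the sketch below builds (but does not send) the HTTP request that indexes one document. The helper name `build_index_request` and its parameters are assumptions made for illustration. It also shows the two ways a document gets its ID: a POST without an ID lets Elasticsearch generate one, while a PUT stores the document under the ID you provide.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc, doc_type="doc", doc_id=None):
    """Build an HTTP request that indexes one document, without sending it."""
    url = f"http://{base_url}/{index}/{doc_type}/"
    method = "POST"  # Elasticsearch generates the ID
    if doc_id is not None:
        url += str(doc_id)
        method = "PUT"  # the caller provides the ID
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


req = build_index_request(
    "localhost:9200",
    "my_index",
    {"username": "harrison", "comment": "My Favourite movie is Star Wars!"},
)
print(req.get_method(), req.full_url)

# To send it against a running cluster:
# urllib.request.urlopen(req)
```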

In [6]:
curl -XPOST http://$elasticsearch_url/$index_1/doc/ -i -H "Content-Type: application/json" -d '{ "username": "harrison", "comment": "My Favourite movie is Star Wars!" }'

HTTP/1.1 201 Created
Location: /my_index/doc/AV5dWK1WI_fCNZVDFGDw
content-type: application/json; charset=UTF-8
content-length: 159

{"_index":"my_index","_type":"doc","_id":"AV5dWK1WI_fCNZVDFGDw","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

In [7]:
curl -XPOST http://$elasticsearch_url/$index_1/doc/ -i -H "Content-Type: application/json" -d '{ "username": "john", "comment": "The North Star is right above my house." }'

HTTP/1.1 201 Created
Location: /my_index/doc/AV5dWLqTI_fCNZVDFGDx
content-type: application/json; charset=UTF-8
content-length: 159

{"_index":"my_index","_type":"doc","_id":"AV5dWLqTI_fCNZVDFGDx","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

In [8]:
curl -XPOST http://$elasticsearch_url/$index_1/doc/ -i -H "Content-Type: application/json" -d '{ "username": "lily", "comment": "My Favourite movie star in Hollywood is Harrison Ford." }'

HTTP/1.1 201 Created
Location: /my_index/doc/AV5dWL7rI_fCNZVDFGDy
content-type: application/json; charset=UTF-8
content-length: 159

{"_index":"my_index","_type":"doc","_id":"AV5dWL7rI_fCNZVDFGDy","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}

In [9]:
curl -XGET "http://$elasticsearch_url/_cat/indices?v"

health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   crimes    yctKv_ytTfuEuygBti-fzQ   5   1     192147            0     50.1mb         50.1mb
green  open   products  0DDoLSiHTOmravxDCOdB0A   2   0     110435            0       28mb           28mb
yellow open   stocks    qRbfSek5TOG9tk3IYCtKMg   5   1     122574            0     21.1mb         21.1mb
yellow open   test_html 9yu1vs3JTm2r8pC7_Ov_jA   5   1          1            0    276.6kb        276.6kb
yellow open   nutrition KI0MI9qZSN659K27swIg3A   5   1        499            0    549.5kb        549.5kb
yellow open   my_index  LERRQikHRKy2jRd2NCeV-w   5   1          3            0     12.9kb         12.9kb
yellow open   .kibana   9SkA5iqbSgWfcnreADu_zw   1   1          7            0     31.9kb         31.9kb


## Search!

The simplest form of search we can perform is the "match_all" query, which matches every document in the indices being searched.

A search can be performed on:

   * the whole Elasticsearch cluster: "http://$elasticsearch_url/_search"
   * a specific index: "http://$elasticsearch_url/my_index/_search"
   * an index pattern: "http://$elasticsearch_url/my*/_search"
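
The three targets above can be sketched in Python as follows. The index names are the ones listed by `_cat/indices` earlier. `resolve` is a hypothetical helper, and `fnmatchcase` only approximates how Elasticsearch expands `*` wildcards.

```python
from fnmatch import fnmatchcase

# Index names as reported by _cat/indices in this notebook.
INDICES = ["crimes", "products", "stocks", "test_html",
           "nutrition", "my_index", ".kibana"]


def resolve(target=None):
    """Return the indices a search would cover for the given target.

    None stands for the cluster-wide /_search endpoint; otherwise the
    target is an index name or a wildcard pattern such as "my*".
    """
    if target is None:
        return list(INDICES)
    return [name for name in INDICES if fnmatchcase(name, target)]


print(resolve())            # every index in the cluster
print(resolve("my_index"))  # ['my_index']
print(resolve("my*"))       # ['my_index']
```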

In [10]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match_all": {} }}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWLqTI_fCNZVDFGDx",
        "_score" : 1.0,
        "_source" : {
          "username" : "john",
          "comment" : "The North Star is right above my house."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 1.0,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 1.0,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
        }
      }
    ]
  }
}


### Searching Terms

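Using the "match" query, one can easily search for terms in the document corpus. It can be used on text, numeric, and date fields. By default, a document matches if it contains any of the query terms; the "operator" and "minimum_should_match" parameters make the match stricter.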

In [11]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match": { "comment": "Favourite star" } }}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.80781925,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 0.80781925,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 0.53484553,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWLqTI_fCNZVDFGDx",
        "_score" : 0.16823316,
        "_source" : {
          "username" : "john",
          "comment" : "The North Star is right above my house."
   

In [12]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match": { "comment": {"query": "Favourite star", "operator": "and" }  } }}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.80781925,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 0.80781925,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 0.53484553,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
        }
      }
    ]
  }
}


In [13]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match": { "comment": {"query": "My Favourite movie", "minimum_should_match": 2 }  } }}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.4474053,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 1.4474053,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 0.80226827,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
        }
      }
    ]
  }
}
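
The difference between the three match queries above can be reproduced locally with a small Python model. This is only a sketch: `terms` is a crude stand-in for the analyzer (lowercased word tokens), scoring is ignored, and the function names are made up for this example.

```python
import re

DOCS = {
    "harrison": "My Favourite movie is Star Wars!",
    "john": "The North Star is right above my house.",
    "lily": "My Favourite movie star in Hollywood is Harrison Ford.",
}


def terms(text):
    """Rough stand-in for text analysis: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def match(query, operator="or", minimum_should_match=1):
    """Return the users whose comment satisfies a simplified match query."""
    wanted = set(terms(query))
    # "and" requires every query term; otherwise a minimum count suffices.
    needed = len(wanted) if operator == "and" else minimum_should_match
    return sorted(
        user
        for user, comment in DOCS.items()
        if len(wanted & set(terms(comment))) >= needed
    )


print(match("Favourite star"))                              # ['harrison', 'john', 'lily']
print(match("Favourite star", operator="and"))              # ['harrison', 'lily']
print(match("My Favourite movie", minimum_should_match=2))  # ['harrison', 'lily']
```

The hit counts (3, 2 and 2) agree with the responses returned by the cluster above.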


### Searching Phrases

The "match_phrase" query is for searching text when you want to find terms that are near each other.

More specifically, a document matches only if:

   * all the terms in the phrase are in the document, and
   * the terms appear in the same relative order as in the phrase.

The "slop" parameter relaxes the second condition by allowing the terms to be a few positions apart.

In [14]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match_phrase": { "comment": "movie star"  } }}'

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.80781925,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 0.80781925,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      }
    ]
  }
}


In [15]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "match_phrase": { "comment": {"query": "favourite star", "slop" : 2}  } }}'

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.51109093,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 0.51109093,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 0.2481963,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
        }
      }
    ]
  }
}
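
One way to picture what "slop" does is the Python sketch below. It is a simplified model, not Lucene's actual implementation: it assumes the query has no repeated terms, tokenizes with a plain regular expression, and ignores scoring. Each term position is shifted by the term's offset in the query; an exact phrase makes all shifted values equal, and the slop is the largest spread that is still accepted.

```python
import re
from itertools import product

DOCS = {
    "harrison": "My Favourite movie is Star Wars!",
    "john": "The North Star is right above my house.",
    "lily": "My Favourite movie star in Hollywood is Harrison Ford.",
}


def terms(text):
    """Rough stand-in for text analysis: lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def phrase_match(query, text, slop=0):
    """Simplified sloppy-phrase check (assumes no repeated query terms)."""
    doc_terms = terms(text)
    positions = []
    for term in terms(query):
        found = [i for i, t in enumerate(doc_terms) if t == term]
        if not found:
            return False  # every term of the phrase must be present
        positions.append(found)
    for combo in product(*positions):
        # Shift each position by the term's offset within the query.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


def search(query, slop=0):
    return sorted(u for u, c in DOCS.items() if phrase_match(query, c, slop))


print(search("movie star"))              # ['lily']
print(search("favourite star", slop=2))  # ['harrison', 'lily']
```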


### Search with Ranges

The range query finds documents whose field values fall within a given range. It is typically used for numeric and date fields.
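
The bounds of a range query can be sketched in Python like this: "gt" and "lt" are exclusive, while "gte" and "lte" are inclusive. The helper `in_range` is a made-up name for this illustration and only handles plain numbers; date parsing and rounding are left out.

```python
def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Check a number against range bounds (gt/lt exclusive, gte/lte inclusive)."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


# 53.76 and 14.73 are "open" prices from the stocks index;
# 50.0 is a hypothetical boundary value.
print(in_range(53.76, gt=50.00, lt=60.00))  # True
print(in_range(14.73, gt=50.00, lt=60.00))  # False
print(in_range(50.0, gt=50.00))             # False: gt excludes the bound
print(in_range(50.0, gte=50.00))            # True: gte includes it
```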

In [16]:
curl -XGET "http://$elasticsearch_url/stocks/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "range": { "open": {"gt": 50.00, "lt": 60.00}  } }}'

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12777,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stocks",
        "_type" : "stock_type",
        "_id" : "AV3nptS9UTojrU_UL6vN",
        "_score" : 1.0,
        "_source" : {
          "volume" : 63264,
          "high" : 54.55,
          "trade_date_text" : "20100216",
          "stock_symbol" : "DE",
          "low" : 53.54,
          "close" : 53.78,
          "trade_date" : "2010-02-16T07:00:00.000Z",
          "open" : 53.76
        }
      },
      {
        "_index" : "stocks",
        "_type" : "stock_type",
        "_id" : "AV3nptS9UTojrU_UL6vO",
        "_score" : 1.0,
        "_source" : {
          "volume" : 173792,
          "high" : 58.05,
          "trade_date_text" : "20100217",
          "stock_symbol" : "DE",
          "low" : 56.05,
          "close" : 56.48,
          "trade_date" : "2010-02-17T07:00

In [17]:
curl -XGET "http://$elasticsearch_url/stocks/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{"query": { "range": { "trade_date": {"gt": "2009-08", "lt": "2009-09"}  } }}'

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3493,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stocks",
        "_type" : "stock_type",
        "_id" : "AV3nptSmUTojrU_UL6t5",
        "_score" : 1.0,
        "_source" : {
          "volume" : 671719,
          "high" : 15.76,
          "trade_date_text" : "20090827",
          "stock_symbol" : "DELL",
          "low" : 14.43,
          "close" : 15.654,
          "trade_date" : "2009-08-27T06:00:00.000Z",
          "open" : 14.73
        }
      },
      {
        "_index" : "stocks",
        "_type" : "stock_type",
        "_id" : "AV3nptSAUTojrU_UL6r2",
        "_score" : 1.0,
        "_source" : {
          "volume" : 46971,
          "high" : 46.06,
          "trade_date_text" : "20090826",
          "stock_symbol" : "DE",
          "low" : 44.9,
          "close" : 45.13,
          "trade_date" : "2009-08-26T06:0

### The bool query

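The bool query combines one or more of the following boolean clauses:

   * must: the clause must match, and it contributes to the score
   * must_not: the clause must not match
   * should: the clause does not have to match (as long as a must or filter clause is present), but matching it raises the score
   * filter: the clause must match, but it does not affect the score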
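
How the clauses combine can be modelled with a short Python sketch over the three comments in "my_index". This is an assumption-laden simplification: clauses are plain Python predicates, the "score" merely counts matching must and should clauses, and the names `term`, `negate` and `bool_search` are invented for this example.

```python
import re

DOCS = {
    "harrison": "My Favourite movie is Star Wars!",
    "john": "The North Star is right above my house.",
    "lily": "My Favourite movie star in Hollywood is Harrison Ford.",
}


def term(word):
    """Clause that is true when the comment contains the given word."""
    return lambda comment: word in re.findall(r"\w+", comment.lower())


def negate(clause):
    """Clause that is true when the wrapped clause is false (a nested must_not)."""
    return lambda comment: not clause(comment)


def bool_search(must=(), must_not=(), should=(), filter_=()):
    """Return (user, score) pairs, best first; the score is a simple count."""
    hits = []
    for user, comment in DOCS.items():
        if not all(clause(comment) for clause in (*must, *filter_)):
            continue
        if any(clause(comment) for clause in must_not):
            continue
        score = sum(clause(comment) for clause in (*must, *should))
        hits.append((user, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Must contain "star"; preferably should not contain "wars".
hits = bool_search(must=[term("star")], should=[negate(term("wars"))])
print(hits)  # [('john', 2), ('lily', 2), ('harrison', 1)]
```

Moving the "wars" clause from should to must_not would drop the third document instead of merely ranking it lower.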

In [18]:
curl -XGET "http://$elasticsearch_url/$index_1/_search?pretty=true" -H "Content-Type: application/json"  \
-d '{ "query": { "bool": {"must": [ { "match": { "comment": "star" }}], "should": [{ "bool": {"must_not": [{"match": {"comment": "wars"}}]}}]}}}'


{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.1682332,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWLqTI_fCNZVDFGDx",
        "_score" : 1.1682332,
        "_source" : {
          "username" : "john",
          "comment" : "The North Star is right above my house."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWL7rI_fCNZVDFGDy",
        "_score" : 1.1682332,
        "_source" : {
          "username" : "lily",
          "comment" : "My Favourite movie star in Hollywood is Harrison Ford."
        }
      },
      {
        "_index" : "my_index",
        "_type" : "doc",
        "_id" : "AV5dWK1WI_fCNZVDFGDw",
        "_score" : 0.26742277,
        "_source" : {
          "username" : "harrison",
          "comment" : "My Favourite movie is Star Wars!"
      

## Text Analysis

In general, data in Elasticsearch can be categorized into two groups:

   * Exact value: numeric values, dates, and certain strings
   * Full text: unstructured text data that we want to search
 
 <img src="es1.png">
 
Analysis is the process of converting full text into terms for the inverted index. Full-text fields are analyzed when the documents are ingested.
 
  <img src="es2.png">

### What is an Inverted Index?

Lucene creates an index similar to the index in the back of a book, where useful terms are sorted and page numbers tell you where to find out more about them. In an inverted index, each term points to the documents that contain it.

  <img src="es3.png">
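
A toy version of an inverted index over the three comments indexed earlier can be built in a few lines of Python. This is a sketch of the idea only: Lucene's on-disk structures are far more elaborate, and the regular-expression tokenization here is an assumption standing in for real analysis.

```python
import re
from collections import defaultdict

DOCS = [
    "My Favourite movie is Star Wars!",                        # document 0
    "The North Star is right above my house.",                 # document 1
    "My Favourite movie star in Hollywood is Harrison Ford.",  # document 2
]


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


index = build_inverted_index(DOCS)
print(sorted(index["star"]))       # [0, 1, 2]
print(sorted(index["favourite"]))  # [0, 2]
print(sorted(index["hollywood"]))  # [2]
```

Looking up a term is then a dictionary access rather than a scan over every document.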

### Analyzers

Elasticsearch ships with pre-built analyzers and also lets you define custom ones. More information about the different kinds of analyzers can be found in the Elasticsearch documentation.

An analyzer is made up of 3 parts, applied in this order:
   
   * Character Filter
   * Tokenizer
   * Token Filter
   
  <img src="es4.png">
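
The way the three parts chain together can be sketched in Python as a small pipeline. The function names are made up for this illustration, and each stage is a deliberately naive stand-in for the corresponding Elasticsearch component.

```python
import re


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through the three analyzer stages in order."""
    for char_filter in char_filters:    # 1. text -> text
        text = char_filter(text)
    tokens = tokenizer(text)            # 2. text -> tokens
    for token_filter in token_filters:  # 3. tokens -> tokens
        tokens = token_filter(tokens)
    return tokens


def strip_html(text):
    """Naive character filter: replace anything that looks like a tag."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


tokens = analyze(
    "<b>The North Star</b> is right above",
    char_filters=[strip_html],
    tokenizer=whitespace,
    token_filters=[lowercase],
)
print(tokens)  # ['the', 'north', 'star', 'is', 'right', 'above']
```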

### Character Filters

Character filters preprocess the text before it is passed on to the tokenizer. Common uses of character filters include:

   * removing HTML tags
   * replacing IDs with their respective mappings
   * replacing text using patterns
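
The three kinds of character filter listed above might look like this in Python. The functions are named after Elasticsearch's `html_strip`, `mapping` and `pattern_replace` character filters, but they are simplified re-implementations written for this example, and the sample text and replacement table are hypothetical.

```python
import re


def html_strip(text):
    """Remove anything that looks like an HTML tag (entities are not decoded)."""
    return re.sub(r"<[^>]+>", "", text)


def mapping(text, table):
    """Replace each key of table with its mapped value."""
    for old, new in table.items():
        text = text.replace(old, new)
    return text


def pattern_replace(text, pattern, replacement):
    """Replace every match of a regular expression."""
    return re.sub(pattern, replacement, text)


text = "<p>I :) my phone 555-1234</p>"
text = html_strip(text)
text = mapping(text, {":)": "_happy_"})
text = pattern_replace(text, r"\d", "#")
print(text)  # I _happy_ my phone ###-####
```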
   

### Tokenizer

The tokenizer is the second phase of the analyzer; it divides the stream of text into tokens. Commonly used tokenizers include, but are not limited to:

   * whitespace
   * keyword
   * pattern
   
Custom tokenizers can also be built in Java.
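
To compare the three tokenizers named above, here is a hedged Python approximation of each. These are not the Lucene implementations. The default pattern `\W+` (split on non-word characters) follows the documented default of the pattern tokenizer, and the sample text is made up.

```python
import re


def whitespace_tokenizer(text):
    """Split on whitespace only; punctuation stays attached to tokens."""
    return text.split()


def keyword_tokenizer(text):
    """Emit the entire input as a single token."""
    return [text]


def pattern_tokenizer(text, pattern=r"\W+"):
    """Split wherever the regular expression matches."""
    return [token for token in re.split(pattern, text) if token]


text = "The North-Star, above my house"
print(whitespace_tokenizer(text))  # ['The', 'North-Star,', 'above', 'my', 'house']
print(keyword_tokenizer(text))     # ['The North-Star, above my house']
print(pattern_tokenizer(text))     # ['The', 'North', 'Star', 'above', 'my', 'house']
```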

### Token Filters

The token filters are the final part of the analyzer. They are mostly used to:

   * add tokens (synonym filter)
   * remove tokens (stop words)
   * change tokens (stemming to the root word)
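
Each of the three token filter roles can be illustrated with a few lines of Python. Everything here is an assumption made for the example: the synonym table and stop word list are hypothetical, and the "stemmer" only strips a trailing "s", whereas real stemmers such as Porter's are far more careful.

```python
SYNONYMS = {"film": ["movie"]}    # hypothetical synonym table
STOP_WORDS = {"is", "my", "the"}  # hypothetical stop word list


def add_synonyms(tokens):
    """Add tokens: emit each token followed by its synonyms."""
    out = []
    for token in tokens:
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out


def remove_stop_words(tokens):
    """Remove tokens: drop very common words."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(tokens):
    """Change tokens: naive plural stripping."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


tokens = ["my", "favourite", "film", "is", "star", "wars"]
tokens = add_synonyms(tokens)
tokens = remove_stop_words(tokens)
tokens = stem(tokens)
print(tokens)  # ['favourite', 'film', 'movie', 'star', 'war']
```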

## ES for CS8903

Instead of using a scraper to do focused scraping of websites, we can build broad web crawlers that save web pages in Elasticsearch. This means we can write search queries to try to match assignments against the indexed web pages.

<img src="es54.png">

Here is an example analyzer for our project:

<img src="es55.jpg">