Elasticsearch Cheat Sheet

Elasticsearch

Elasticsearch is a highly scalable open source search engine with a REST API.

https://github.com/elastic/elasticsearch

Distributed, scalable, and highly available
Real-time search and analytics capabilities
Sophisticated RESTful API
Built on top of Lucene

Default port - localhost:9200

Basic Concepts

https://www.elastic.co/guide/en/elasticsearch/reference/1.3/_basic_concepts.html

NRT

Near Real Time: there is a slight latency (normally one second) between the time you index a document and the time it becomes searchable.

Cluster

A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch".

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name, which in older releases (such as the 1.x series referenced above) defaults to a random Marvel character name assigned to the node at startup. In a single cluster, you can have as many nodes as you want.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data.

Type (Deprecated in 6.0)

Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.

Document

A document is a basic unit of information that can be indexed.

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone. To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards.

Sharding is important for two primary reasons:

It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput
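
A rough sketch of how documents could be spread across shards, assuming the commonly documented routing rule shard = hash(routing key) % number of primary shards. The helper names and the use of crc32 are assumptions for illustration only; Elasticsearch uses its own hash function internally.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document ID) to a primary shard number."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # crc32 is only a stand-in hash here, not the hash Elasticsearch really uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards
```

Because the shard number depends on the number of primary shards, changing that number would route existing IDs differently, which is one reason the primary shard count is normally fixed when an index is created.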

Start

$ elasticsearch       # or use the Windows service

Exploring

base url - http://localhost:9200

cat API

/_cat --- list all available _cat endpoints
/_cat/health?v --- health of the cluster
/_cat/nodes?v --- list of nodes in the cluster
/_cat/indices?v --- list of indices
/_cat/plugins?v --- list of installed plugins

Health Status

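Green - Everything is OK (all primary and replica shards are allocated)
Yellow - All primary shards are allocated, but some replicas are not
Red - At least one primary shard is not allocated, so some data is unavailable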
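
The three colours can be read as a simple decision rule. A minimal sketch, assuming we already know how many primary and replica shards are unassigned (the function and parameter names are hypothetical, not part of any Elasticsearch client):

```python
def cluster_health(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Derive the health colour from counts of unassigned shards."""
    if unassigned_primaries > 0:
        return "red"      # some data cannot be served at all
    if unassigned_replicas > 0:
        return "yellow"   # all data is available, but redundancy is reduced
    return "green"        # every primary and replica is allocated
```

A single-node cluster with the default one replica per shard typically reports yellow, because a replica is never placed on the same node as its primary.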

Create index and index documents

  $ curl -XPUT 'localhost:9200/blog?pretty'
  
  # Index a user document (the Content-Type header is required from 6.0 onwards)
  $ curl -XPUT 'http://localhost:9200/blog/user/dilbert' -H 'Content-Type: application/json' -d '{ "name" : "Dilbert Brown" }'

  # Index two post documents
  $ curl -XPUT 'http://localhost:9200/blog/post/1' -H 'Content-Type: application/json' -d '
  { 
      "user": "dilbert", 
      "postDate": "2011-12-15", 
      "body": "Search is hard. Search should be easy." ,
      "title": "On search"
  }'
  
  $ curl -XPUT 'http://localhost:9200/blog/post/2' -H 'Content-Type: application/json' -d '
  { 
      "user": "dilbert", 
      "postDate": "2011-12-12", 
      "body": "Distribution is hard. Distribution should be easy." ,
      "title": "On distributed search"
  }'
  
  # Get 
  $ curl -XGET 'http://localhost:9200/blog/user/dilbert?pretty=true'
  $ curl -XGET 'http://localhost:9200/blog/post/1?pretty=true'
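
The same PUT request can be prepared from Python with only the standard library. This is a sketch, not an official client: `build_index_request` is a hypothetical helper, and the URL layout simply mirrors the curl commands above. The request is only built here; sending it needs a running node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Prepare a PUT request that indexes `document` under index/type/id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request("blog", "user", "dilbert", {"name": "Dilbert Brown"})

# To send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```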

Update

$ curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -H 'Content-Type: application/json' -d'
{
  "doc": { "name": "Jane Doe" }
}
'
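
A partial update with "doc" merges the given fields into the stored document instead of replacing it. A minimal sketch of that merge behaviour, assuming nested objects are merged recursively and all other values are overwritten; the function name and the "age" field are made up for illustration:

```python
def apply_partial_update(existing: dict, partial: dict) -> dict:
    """Return a copy of `existing` with the fields of `partial` merged in."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Scalars and arrays replace the old value.
            merged[key] = value
    return merged


stored = {"name": "John Doe", "age": 30}
updated = apply_partial_update(stored, {"name": "Jane Doe"})
```

Here `updated` keeps the untouched "age" field while "name" changes.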

Batch Processing

Elasticsearch can perform index, update, and delete operations in batches using the bulk API.

    POST /customer/external/_bulk
    { "index": { "_id": "1" } }
    { "name" : "john" }
    { "index": { "_id": "2" } }
    { "name" : "doe" }

The example above indexes two documents (ID 1 - john and ID 2 - doe) in one bulk operation.

    POST /customer/external/_bulk?pretty
    {"update":{"_id":"1"}}
    {"doc": { "name": "John Doe becomes Jane Doe" } }
    {"delete":{"_id":"2"}}

The example above updates the first document and then deletes the second document in one bulk operation.
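
The bulk body is newline-delimited JSON: one action line, optionally followed by a source line, and the body must end with a newline. A small sketch of assembling such a body in Python (`build_bulk_body` is a hypothetical helper, not part of any client library):

```python
import json


def build_bulk_body(operations):
    """Build an NDJSON bulk body from (action, metadata, payload) tuples.

    `payload` is None for actions without a source line, such as delete.
    """
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:
            lines.append(json.dumps(payload))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("update", {"_id": "1"}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
    ("delete", {"_id": "2"}, None),
])
```

When sending this with curl, `--data-binary` is the safer choice because `-d` strips newlines from file input.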

Searching

$ curl 'localhost:9200/blog/_search?q=user:ganesh&pretty=true'

This returns a result like:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{"user":"ganesh", "title":"on search", "body":"search is hard"}
    } ]
  }
}
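
A short sketch of reading such a response in Python. The sample is a trimmed copy of the response above and the helper names are hypothetical. One assumption worth flagging: newer releases (7.x) report "total" as an object with a "value" field rather than a plain number, so the helper accepts both shapes.

```python
import json

SAMPLE_RESPONSE = """
{
  "took": 7,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {"_index": "blog", "_type": "post", "_id": "1", "_score": 0.30685282,
       "_source": {"user": "ganesh", "title": "on search", "body": "search is hard"}}
    ]
  }
}
"""


def total_hits(response: dict) -> int:
    """Return the hit count for both the numeric and the object form of "total"."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def extract_sources(response: dict) -> list:
    """Return the original documents (_source) of all returned hits."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


parsed = json.loads(SAMPLE_RESPONSE)
```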

We can also use the JSON query language

$ curl 'localhost:9200/blog/_search?pretty' -H 'Content-Type: application/json' -d '{
    "query" : {
        "match" : { "user" : "ganesh"}
    }
}'

Search API through URI

GET /bank/_search?q=*&sort=account_number:asc&pretty

Search API through Body

    GET /bank/_search
    {
        "query": { "match_all": {} },
        "sort" : [
            { "account_number" : "asc" }
        ]
    }


    // match term `mill` or `lane`
    GET /bank/_search
    {
      "query": { "match": { "address": "mill lane" } }
    }

    // match the exact phrase `mill lane`
    GET /bank/_search
    {
      "query": { "match_phrase": { "address": "mill lane" } }
    }
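
The difference between match and match_phrase can be sketched locally. This is a deliberate simplification: it lowercases and splits on whitespace, whereas Elasticsearch runs a configurable analyzer and scores results instead of returning a plain yes/no. The function names and sample addresses are made up.

```python
def tokenize(text: str) -> list:
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches(field_value: str, query: str) -> bool:
    """match (default OR operator): at least one query term occurs in the field."""
    field_terms = set(tokenize(field_value))
    return any(term in field_terms for term in tokenize(query))


def matches_phrase(field_value: str, query: str) -> bool:
    """match_phrase: all query terms occur next to each other, in order."""
    field_terms = tokenize(field_value)
    query_terms = tokenize(query)
    width = len(query_terms)
    return any(
        field_terms[start:start + width] == query_terms
        for start in range(len(field_terms) - width + 1)
    )
```

With these helpers, an address such as "990 Mill Road" satisfies the match query for "mill lane" but not the match_phrase query.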

Pagination - 10 hits starting at offset 10 (hits 10 to 19, counting from 0)

    GET /bank/_search
    {
      "query": { "match_all": {} },
      "from": 10,
      "size": 10
    }
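
Since "from" is a zero-based offset, a page number has to be converted before it goes into the request body. A minimal sketch, assuming pages are numbered from 1 (the helper names are hypothetical):

```python
def page_window(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into the from/size pair of a search body."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


def paged_match_all(page: int, page_size: int = 10) -> dict:
    """Build a match_all search body for the requested page."""
    body = {"query": {"match_all": {}}}
    body.update(page_window(page, page_size))
    return body
```

Calling `paged_match_all(2)` produces the same body as the request above.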

Match record

    GET /bank/_search
    {
      "query": { "match": { "account_number": 20 } }
    }

Return specific fields from search

    GET /bank/_search
    {
      "query": { "match_all": {} },
      "_source": ["account_number", "balance"]
    }

Boolean query

// return all accounts containing "mill" and "lane" in address
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}


// return all accounts of people who are 40 years old but do not live in ID (Idaho)
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
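
Bool queries get verbose quickly, so it can help to assemble them in code. A small sketch of a builder that produces the same body as the query above; `bool_query` and `match_clauses` are hypothetical helpers, and only the must and must_not clauses are covered:

```python
def match_clauses(pairs):
    """Turn (field, value) pairs into a list of match clauses."""
    return [{"match": {field: value}} for field, value in pairs]


def bool_query(must=(), must_not=()):
    """Build a search body with a bool query from (field, value) pairs."""
    bool_body = {}
    if must:
        bool_body["must"] = match_clauses(must)
    if must_not:
        bool_body["must_not"] = match_clauses(must_not)
    return {"query": {"bool": bool_body}}


query = bool_query(must=[("age", "40")], must_not=[("state", "ID")])
```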

Aggregation

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

The aggregation above is similar in concept to: SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC

Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
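
What the terms aggregation computes can be sketched in plain Python: count documents per state and return the largest groups first. The sample accounts are made up and `group_by_state` is a hypothetical helper; the default of 10 buckets mirrors the default size of a terms aggregation.

```python
from collections import Counter


def group_by_state(accounts, size=10):
    """Count accounts per state and return the `size` largest buckets."""
    counts = Counter(account["state"] for account in accounts)
    return [
        {"key": state, "doc_count": doc_count}
        for state, doc_count in counts.most_common(size)
    ]


# Hypothetical sample data, standing in for documents of the bank index.
accounts = [
    {"account_number": 1, "state": "TX"},
    {"account_number": 2, "state": "ID"},
    {"account_number": 3, "state": "TX"},
    {"account_number": 4, "state": "TX"},
    {"account_number": 5, "state": "ID"},
    {"account_number": 6, "state": "AL"},
]
buckets = group_by_state(accounts)
```

Each bucket has the same "key" / "doc_count" shape that appears under "aggregations" in a real response.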

Plugin

Plugin installation

$ sudo bin/elasticsearch-plugin install <plugin-name>

List installed plugins

$ bin/elasticsearch-plugin list       # bin/plugin list on releases before 5.0

Plugin Examples

x-pack
head
Marvel - https://www.elastic.co/downloads/marvel

Default Credentials (X-Pack security, 5.x)

elastic / changeme

Kibana

Installation

https://www.elastic.co/downloads/kibana

Run

$ bin/kibana

Logstash

Open source data processing pipeline, commonly used for centralized log collection

Client

Elasticsearch.Net
NEST

Elasticsearch.Net is a very low-level, dependency-free client. NEST is a high-level client built on top of it.
