<img src="https://static-www.elastic.co/assets/blt9a26f88bfbd20eb5/icon-elasticsearch-bb.svg" width="200" />


# Getting Started With Elasticsearch

## Overview

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search-engine library. Lucene is arguably the most advanced, high-performance, and fully featured search engine library in existence today—both open source and proprietary.

But Lucene is just a library. To leverage its power, you need to work in Java and to integrate Lucene directly with your application. Worse, you will likely require a degree in information retrieval to understand how it works. Lucene is very complex.

Elasticsearch is also written in Java and uses Lucene internally for all of its indexing and searching, but it aims to make full-text search easy by hiding the complexities of Lucene behind a simple, coherent, RESTful API.

[More details here](https://www.elastic.co/guide/en/elasticsearch/guide/current/intro.html)

## Clients

We provide clients for different programming languages that wrap the REST API:
* Java
* Java REST
* JavaScript
* Groovy
* .NET
* PHP
* Perl
* Python
* Ruby

## Hands-on

### Accessing Using Python Client

This client was designed as a very thin wrapper around Elasticsearch’s REST API to allow for maximum flexibility. More details [here](https://elasticsearch-py.readthedocs.io/en/master/).

When creating the `Elasticsearch` client we can pass arguments such as hosts, ports, and credentials.

In [1]:
from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])
print(es.ping()) # can you hear me now?

True


### Getting Cluster Information

Elasticsearch provides different APIs to get information about cluster health, nodes, indices, and shards, among others.

The health command tells us the status of our cluster. It can be green, yellow, or red. The argument `v=True` requests verbose output, which includes the column headers:

In [2]:
print(es.cat.health(v=True))

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1482299730 13:55:30  elasticsearch yellow          1         1     10  10    0    0       10             0                  -                 50.0%
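
The cat output is plain text aligned in columns, so it is easy to read but awkward to use from code. As a minimal sketch, the hypothetical helper `parse_cat_table` below (not part of the client) splits the header and each row on whitespace and zips them into dictionaries. The sample text is copied from the output above. Note the assumption that no column is empty: a row with a blank column would shift the remaining values to the left.

```python
# Sample text copied from the output above; in a live session it would
# come from es.cat.health(v=True).
raw = """epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1482299730 13:55:30  elasticsearch yellow          1         1     10  10    0    0       10             0                  -                 50.0%
"""


def parse_cat_table(text):
    """Turn whitespace-aligned cat output into a list of dicts keyed by column name.

    Assumes every column has a value in every row.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


health = parse_cat_table(raw)[0]
print('status=%s unassigned=%s active=%s' % (
    health['status'], health['unassign'], health['active_shards_percent']))
```

Here the status is yellow because half of the shards (the 10 replicas) are unassigned, which matches the `active_shards_percent` of 50.0%.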



The nodes command shows the cluster topology:

In [3]:
print(es.cat.nodes(v=True))

ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           19         100  61    4.83                  mdi       *      RQf38mS



The indices command provides a cross-section of each index. This information spans nodes:

In [4]:
print(es.cat.indices(v=True))

health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   users  Fb7EAeguQKag7JqO0zpMiQ   5   1         16            0       73kb           73kb
yellow open   movies dQyDdE7aSNmvoeq7eLJMEw   5   1      40344            0     64.6mb         64.6mb



The shards command is the detailed view of which nodes contain which shards. For each shard it shows whether it is a primary or a replica, the number of docs, the bytes it takes on disk, and the node where it is located:

In [5]:
print(es.cat.shards(v=True))

index  shard prirep state      docs  store ip        node
users  1     p      STARTED       1  4.6kb 127.0.0.1 RQf38mS
users  1     r      UNASSIGNED                       
users  2     p      STARTED       3 13.6kb 127.0.0.1 RQf38mS
users  2     r      UNASSIGNED                       
users  3     p      STARTED       6 27.2kb 127.0.0.1 RQf38mS
users  3     r      UNASSIGNED                       
users  4     p      STARTED       4 18.2kb 127.0.0.1 RQf38mS
users  4     r      UNASSIGNED                       
users  0     p      STARTED       2  9.2kb 127.0.0.1 RQf38mS
users  0     r      UNASSIGNED                       
movies 1     p      STARTED    7982   12mb 127.0.0.1 RQf38mS
movies 1     r      UNASSIGNED                       
movies 2     p      STARTED    8101 13.5mb 127.0.0.1 RQf38mS
movies 2     r      UNASSIGNED                       
movies 3     p      STARTED    8033 13.4mb 127.0.0.1 RQf38mS
movies 3     r      UNASSIGNED                       
movies 4     p      ST
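
Every replica in this listing is `UNASSIGNED` because Elasticsearch does not allocate a replica on the same node as its primary, and this cluster has a single node. That is also why the cluster health is yellow rather than green. The sketch below tallies shards by role and state from a few rows copied from the listing; `count_shard_states` is a hypothetical helper, and rows with fewer than four columns are skipped.

```python
from collections import Counter

# A few rows copied from the listing above; the full text would come
# from es.cat.shards(v=True).
raw = """index  shard prirep state      docs  store ip        node
users  1     p      STARTED       1  4.6kb 127.0.0.1 RQf38mS
users  1     r      UNASSIGNED
users  2     p      STARTED       3 13.6kb 127.0.0.1 RQf38mS
users  2     r      UNASSIGNED
"""


def count_shard_states(text):
    """Count shards per (prirep, state) pair, ignoring the header row."""
    counts = Counter()
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or incomplete row
        prirep, state = fields[2], fields[3]
        counts[(prirep, state)] += 1
    return counts


counts = count_shard_states(raw)
for (prirep, state), n in sorted(counts.items()):
    print('%s %s: %d' % (prirep, state, n))
```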

### Indexing Data

The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts a JSON document into the `users` index, under a type called `user`:

In [6]:
doc = {
    'name  '  : 'Mati22222as Cascallares',
    'twitter' : 'mcascallares',
    'location': 'Singapore',
}
response = es.index(index='users', doc_type='user', body=doc)

import json # used to pretty-print the response
print(json.dumps(response, indent=2))

{
  "_type": "user", 
  "_shards": {
    "successful": 1, 
    "failed": 0, 
    "total": 2
  }, 
  "_index": "users", 
  "_version": 1, 
  "created": true, 
  "result": "created", 
  "_id": "AVkf8uo8SIVCcDa64dMr"
}
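
The response is a plain dictionary, so it can be inspected directly. In `_shards`, `total` is the number of shard copies (primary plus replicas) the write should reach, and `successful` is the number that accepted it. Here only the primary did, since the replica is unassigned. The following sketch uses the response shown above; `summarize_index_response` is a hypothetical helper, not part of the client.

```python
# Response copied from the output above.
response = {
    '_type': 'user',
    '_shards': {'successful': 1, 'failed': 0, 'total': 2},
    '_index': 'users',
    '_version': 1,
    'created': True,
    'result': 'created',
    '_id': 'AVkf8uo8SIVCcDa64dMr',
}


def summarize_index_response(resp):
    """Pull out the fields most useful for checking an index call."""
    shards = resp['_shards']
    return {
        'id': resp['_id'],
        'created': resp['result'] == 'created',
        'version': resp['_version'],
        'copies_written': shards['successful'],
        'copies_expected': shards['total'],
        'copies_failed': shards['failed'],
    }


summary = summarize_index_response(response)
print(summary)
```

Since no ID was passed to `es.index`, Elasticsearch generated the `_id` itself.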


### Searching Data

The search API allows you to execute a search query and get back the hits that match it. Searches can be applied across multiple types within an index, and across multiple indices using the multi-index syntax. For example, we can search all documents across all types within the `users` index:

In [7]:
response = es.search(q='matias', index='users')
print(json.dumps(response, indent=2))

{
  "hits": {
    "hits": [
      {
        "_score": 1.5404451, 
        "_type": "user", 
        "_id": "AVkbYT621nb8p2CYof6Q", 
        "_source": {
          "twitter": "mcascallares", 
          "location": "Singapore", 
          "name  ": "Matias Cascallares"
        }, 
        "_index": "users"
      }
    ], 
    "total": 1, 
    "max_score": 1.5404451
  }, 
  "_shards": {
    "successful": 5, 
    "failed": 0, 
    "total": 5
  }, 
  "took": 13, 
  "timed_out": false
}
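
In practice we rarely print the whole response. The matching documents live under `hits.hits`, each with its `_id`, `_score`, and original `_source`. As a small sketch, the hypothetical generator `iter_hits` walks that list; the response below is a trimmed copy of the output above.

```python
# Trimmed copy of the search response shown above.
response = {
    'hits': {
        'total': 1,
        'max_score': 1.5404451,
        'hits': [
            {
                '_index': 'users',
                '_type': 'user',
                '_id': 'AVkbYT621nb8p2CYof6Q',
                '_score': 1.5404451,
                '_source': {
                    'twitter': 'mcascallares',
                    'location': 'Singapore',
                },
            },
        ],
    },
    'took': 13,
    'timed_out': False,
}


def iter_hits(resp):
    """Yield (id, score, source) for every hit in a search response."""
    for hit in resp['hits']['hits']:
        yield hit['_id'], hit['_score'], hit['_source']


results = list(iter_hits(response))
for doc_id, score, source in results:
    print('%s %.4f %s' % (doc_id, score, source['twitter']))
```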


### Searching Using DSL

Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch. It is built on top of the official low-level client (elasticsearch-py).

It provides a more convenient and idiomatic way to write and manipulate queries. It stays close to the Elasticsearch JSON DSL, mirroring its terminology and structure. It exposes the whole range of the DSL from Python, either directly using defined classes or through queryset-like expressions.

In [8]:
from elasticsearch_dsl import Search, Q

# let's build a search object
search = Search(using=es, index='users').query('match', location='Singapore')
response = search.execute()

# it's an iterable containing Result objects
for i in response:
    print(i)

<Result(users/user/AVkf5bXzSIVCcDa64Vzp): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkbZQZ41nb8p2CYoiX8): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkf5w9ASIVCcDa64Vzq): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkbY2qi1nb8p2CYoiX3): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkf8NKsSIVCcDa64ZgP): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkf8SgNSIVCcDa64avD): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkbY_o31nb8p2CYoiX5): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkbZBh91nb8p2CYoiX6): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkf4xl7SIVCcDa64Vze): {u'twitter': u'mcascallares', u'location': u'Singapore', u'n...}>
<Result(users/user/AVkf6LllS
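
Because the library mirrors the JSON DSL, the `match` query above corresponds to a small JSON search body. The sketch below builds that body by hand to show the structure; `match_query` is a hypothetical helper written for this illustration. With `elasticsearch_dsl` itself, calling `search.to_dict()` should return an equivalent dictionary.

```python
import json


def match_query(field, text):
    """Build the JSON search body for a single-field `match` query."""
    return {'query': {'match': {field: text}}}


body = match_query('location', 'Singapore')
print(json.dumps(body, indent=2))
```

A body like this can also be sent with the low-level client, for example `es.search(index='users', body=body)`, assuming a client version that accepts the `body` argument as used earlier with `es.index`.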

## A Real Example: IMDB Dataset

Let's import a [dataset containing 5000 movies from IMDB](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset) into Elasticsearch. First, let's load those 5K movies from the CSV file:

In [1]:
import csv
movies = []
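# 'rb' is the Python 2 convention for csv files; on Python 3 use open(path, newline='')
with open('datasets/imdb-5000-movie-dataset.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    movies = list(reader)  # one dict per row, keyed by column name
print(len(movies))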

5043


Now let's index those movies into Elasticsearch using the bulk API:

In [10]:
from elasticsearch import helpers

# required fields
action = {
    '_index' : 'movies',
    '_type' : 'movie'
}
# merge the shared metadata into every row to form one bulk action per movie
helpers.bulk(es, [dict(action, **i) for i in movies])

(5043, [])
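
The returned tuple holds the number of successfully indexed documents and a list of errors, which is empty here. Each bulk action is just a dictionary that combines the metadata fields (`_index`, `_type`) with the document's own fields. As a sketch of the same idea, the hypothetical generator `generate_actions` builds the actions lazily instead of materializing a list; the two sample rows and their column names are made up for illustration and stand in for the rows read from the CSV.

```python
# Two made-up rows standing in for the dicts produced by csv.DictReader.
sample_rows = [
    {'movie_title': 'Example Movie A', 'title_year': '2001'},
    {'movie_title': 'Example Movie B', 'title_year': '2010'},
]


def generate_actions(rows, index, doc_type):
    """Yield one bulk action per row: shared metadata plus the row's fields."""
    for row in rows:
        action = {'_index': index, '_type': doc_type}
        action.update(row)
        yield action


actions = list(generate_actions(sample_rows, 'movies', 'movie'))
for action in actions:
    print(action)
```

Since `helpers.bulk` accepts any iterable of actions, the generator could be passed directly, as in `helpers.bulk(es, generate_actions(movies, 'movies', 'movie'))`. Note that every value read from a CSV file is a string, so numeric columns would need converting before indexing if they should be treated as numbers.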