# Elasticsearch

In this notebook we will set up an Elasticsearch server, read in Shakespeare's works, and analyze them to understand term vectors.

You may mix direct API calls, the Python API, or URL calls from Python: use whatever gives you access to the data.



### Install the necessary Elasticsearch Python packages

In [1]:
!pip install 'elasticsearch<7.14.0'

# docs are here https://elasticsearch-py.readthedocs.io/en/v7.13.4/#

Collecting elasticsearch<7.14.0
  Downloading elasticsearch-7.13.4-py2.py3-none-any.whl (356 kB)

### Import packages

In [2]:
import os
import time
from elasticsearch import Elasticsearch
import numpy as np
import pandas as pd

## Set up the Elasticsearch instance


In [3]:
%%bash

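# Download the OSS build of Elasticsearch 7.9.2 and its SHA-512 checksum file.
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
# Verify the archive before unpacking it.
shasum -a 512 -c elasticsearch-oss-7.9.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-oss-7.9.2-linux-x86_64.tar.gz
# Hand the files to the unprivileged daemon user that will run the server.
sudo chown -R daemon:daemon elasticsearch-7.9.2/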

elasticsearch-oss-7.9.2-linux-x86_64.tar.gz: OK


Run the instance in the background as the `daemon` user (Elasticsearch refuses to start as root).

In [4]:
%%bash --bg

sudo -H -u daemon elasticsearch-7.9.2/bin/elasticsearch

Starting job # 0 in a separate thread.


In [5]:
# Sleep for a few seconds to let the instance start; needed when running the notebook end-to-end.
time.sleep(20)
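
A fixed 20-second sleep can be too short on a slow machine and wasteful on a fast one. One alternative is to poll the base endpoint until it answers. The sketch below assumes the server listens on `http://localhost:9200/`; `http_probe` and `wait_until_ready` are hypothetical helper names, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:9200/", timeout=2.0):
    """Return True if `url` answers with HTTP 200 (address is an assumption)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe=http_probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `probe` until it returns True; give up after `attempts` tries.

    Returns the number of probes made, or raises TimeoutError.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        sleep(delay)
    raise TimeoutError(f"no response after {attempts} attempts")
```

Calling `wait_until_ready()` in place of `time.sleep(20)` would block only as long as the server actually needs to come up.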

Query the base endpoint to retrieve information about the cluster.

In [8]:
%%bash

curl -sX GET "localhost:9200/"

{
  "name" : "e3328963ddfd",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "pp84yMKrTQi83sOtUUa8TQ",
  "version" : {
    "number" : "7.9.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "d34da0ea4a966c4e49417f2da2f244e3e97b4e6e",
    "build_date" : "2020-09-23T00:45:33.626720Z",
    "build_snapshot" : false,
    "lucene_version" : "8.6.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


### Data

Download the Shakespeare data.

In [9]:
%%bash 

wget 'https://download.elastic.co/demos/kibana/gettingstarted/shakespeare_6.0.json' -q

In [10]:
%%bash

head -5 shakespeare_6.0.json

{"index":{"_index":"shakespeare","_id":0}}
{"type":"act","line_id":1,"play_name":"Henry IV", "speech_number":"","line_number":"","speaker":"","text_entry":"ACT I"}
{"index":{"_index":"shakespeare","_id":1}}
{"type":"scene","line_id":2,"play_name":"Henry IV","speech_number":"","line_number":"","speaker":"","text_entry":"SCENE I. London. The palace."}
{"index":{"_index":"shakespeare","_id":2}}
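
The file is in bulk (NDJSON) format: each action line (`{"index": ...}`) is followed by one line holding the document itself. As a rough sketch of how that pairing could be read from Python, the hypothetical helper `bulk_actions` below joins each action line with its source line. It only handles `index` actions, which is all the sample above shows.

```python
import json


def bulk_actions(lines, default_index=None):
    """Pair each bulk action line with the source line that follows it.

    Yields dicts with `_index`, `_id`, and `_source` keys.
    A trailing action line without a source line is ignored.
    """
    pending = None  # metadata from the most recent action line
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        obj = json.loads(raw)
        if pending is None:
            pending = obj["index"]  # e.g. {"_index": "shakespeare", "_id": 0}
        else:
            yield {
                "_index": pending.get("_index", default_index),
                "_id": pending["_id"],
                "_source": obj,
            }
            pending = None
```

Dicts of this shape are what `elasticsearch.helpers.bulk` accepts as actions, so once a client `es` exists, `helpers.bulk(es, bulk_actions(open("shakespeare_6.0.json")))` would be one Python-client alternative to posting the file with curl.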


In [11]:
from elasticsearch import helpers, Elasticsearch

ES_NODES = "http://localhost:9200"

# Connect to the local single-node cluster started above.
es = Elasticsearch(hosts=[ES_NODES])
index_name = 'shakespeare'
# Drop any previous copy of the index so this cell can be re-run safely.
es.indices.delete(index=index_name, ignore=[400, 404])
es.indices.create(
    index=index_name,
    ignore=400,
    body={
        "mappings": {
            "properties": {
                "speaker": {"type": "keyword"},
                "play_name": {"type": "keyword"},
                "line_id": {"type": "integer"},
                "speech_number": {"type": "integer"},
                # Store term vectors with positions and offsets for the line text.
                "text_entry": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "fielddata": True,
                },
            }
        }
    },
)
  

{'acknowledged': True, 'index': 'shakespeare', 'shards_acknowledged': True}

Bulk upload the data. The response lists one result per document, so the cell output is very long.

In [12]:
! curl -s -q -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json 

Streaming output truncated to the last 5000 lines.
    {
      "index" : {
        "_index" : "shakespeare",
        "_type" : "_doc",
        "_id" : "5864",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 5864,
        "_primary_term" : 1,
        "status" : 201
      }
    },

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

In [13]:
! curl http://localhost:9200/_cat/indices

yellow open shakespeare CXrRgdcbSd2cUmPsbZQ7fQ 1 1 111396 0 26.2mb 26.2mb
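
The `_cat/indices` line has no header unless `?v` is appended to the URL. Assuming the default column order of Elasticsearch 7.x (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size), a small hypothetical helper can turn the line into a labelled record:

```python
# Assumed default column order for `_cat/indices` in Elasticsearch 7.x.
CAT_INDICES_COLUMNS = [
    "health", "status", "index", "uuid", "pri", "rep",
    "docs.count", "docs.deleted", "store.size", "pri.store.size",
]


def parse_cat_indices(line):
    """Split one `_cat/indices` row into a dict keyed by column name."""
    row = dict(zip(CAT_INDICES_COLUMNS, line.split()))
    # Convert the purely numeric columns; sizes such as "26.2mb" stay strings.
    for key in ("pri", "rep", "docs.count", "docs.deleted"):
        row[key] = int(row[key])
    return row


row = parse_cat_indices(
    "yellow open shakespeare CXrRgdcbSd2cUmPsbZQ7fQ 1 1 111396 0 26.2mb 26.2mb"
)
print(row["health"], row["docs.count"])
```

Read this way, the index holds 111396 documents in one primary shard with one replica configured. The `yellow` health is expected on a single-node cluster, because the replica has no second node to be allocated to.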


### Extract term vectors
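
One possible starting point, not necessarily the intended solution, is the client's `termvectors` call, which returns the analyzed terms of a single document. The sketch assumes a 7.x `elasticsearch` client object is passed in as `es`; `fetch_term_vector` and `term_table` are hypothetical helper names.

```python
def fetch_term_vector(es, index, doc_id, field="text_entry"):
    """Return {term: stats} for one field of one document.

    `es` is assumed to be an elasticsearch-py 7.x client. With
    term_statistics=True each term also carries `doc_freq` and `ttf`.
    """
    resp = es.termvectors(
        index=index,
        id=doc_id,
        fields=field,
        term_statistics=True,
    )
    # A missing document or field yields an empty dict rather than a KeyError.
    return resp.get("term_vectors", {}).get(field, {}).get("terms", {})


def term_table(terms):
    """Flatten the stats into (term, term_freq, doc_freq, ttf) rows, sorted by term."""
    rows = [
        (term, stats["term_freq"], stats.get("doc_freq"), stats.get("ttf"))
        for term, stats in terms.items()
    ]
    return sorted(rows, key=lambda row: row[0])
```

With the client created earlier, `term_table(fetch_term_vector(es, "shakespeare", 1))` would list the terms of the document with `_id` 1, and `pd.DataFrame(...)` could wrap the rows for display.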

### Find a rare term
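
One way to approach this, offered as a sketch rather than the expected answer, is to rank a document's terms by document frequency (`doc_freq`), which a term vectors response includes when it is requested with `term_statistics=True`. The hypothetical helper `rarest_terms` works on that `{term: stats}` mapping; the numbers in the example are made up for illustration.

```python
def rarest_terms(terms, n=5):
    """Return the `n` (term, doc_freq) pairs with the lowest document frequency.

    `terms` maps each term to a stats dict; entries without `doc_freq`
    (term statistics not requested) are skipped. Ties are broken alphabetically.
    """
    with_stats = [
        (term, stats["doc_freq"])
        for term, stats in terms.items()
        if "doc_freq" in stats
    ]
    return sorted(with_stats, key=lambda pair: (pair[1], pair[0]))[:n]


# Illustrative values only, not real statistics from the index.
example = {
    "the": {"doc_freq": 900},
    "scene": {"doc_freq": 40},
    "palace": {"doc_freq": 3},
    "london": {"doc_freq": 3},
}
print(rarest_terms(example, n=2))
```

Term statistics are computed per shard. Since the index above has a single primary shard, the frequencies should cover the whole collection.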

### Search for the term
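
A `term` query matches the indexed token exactly, without analyzing the search string. That suits a term taken from a term vector, since it is already in its analyzed (for example lowercased) form. The sketch below assumes a 7.x client passed in as `es`; `search_term` is a hypothetical helper name.

```python
def search_term(es, index, term, field="text_entry", size=5):
    """Run an exact `term` query and return (total hit count, list of sources).

    `es` is assumed to be an elasticsearch-py 7.x client; in Elasticsearch 7
    the hit count is reported under hits.total.value.
    """
    body = {"query": {"term": {field: term}}, "size": size}
    resp = es.search(index=index, body=body)
    hits = resp["hits"]
    return hits["total"]["value"], [hit["_source"] for hit in hits["hits"]]
```

For instance, `search_term(es, "shakespeare", "palace")` would return how many lines contain the token `palace` together with the first five matching documents.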
