# don't be plastic, elastipy!

Hi there! This tutorial is actually a [Jupyter notebook](https://jupyter.org/) and can be found in [examples](https://github.com/defgsus/elastipy/blob/development/examples/)/[tutorial.ipynb](https://github.com/defgsus/elastipy/blob/development/examples/tutorial.ipynb).

In [None]:
# hidden

# here we just add the parent path of this notebook to 
# the path list where python modules are searched.
# So we find elastipy even if it's not installed via setup.py
import sys
sys.path.insert(0, "..")

## exporting some objects

Without too much thinking we can just use the built-in export helper
and generate some data.

In [None]:
from elastipy import Exporter

class ShapeExporter(Exporter):
    INDEX_NAME = "elastipy-example-shapes"
    MAPPINGS = {
        "properties": {
            "shape": {"type": "keyword"},
            "color": {"type": "keyword"},
            "area": {"type": "float"},
        }
    }

The `INDEX_NAME` is, obviously, the name of the elasticsearch index.
The `MAPPINGS` parameter describes the
[elasticsearch mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html).
Here we say that documents will at least have these common fields:
one of type `float` and two of type `keyword`. A `keyword` field holds
a string that is not full-text searchable. Instead, it is
efficiently indexed and aggregatable.
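
As a rough illustration of the difference, the sketch below mimics exact `keyword` matching versus tokenized full-text matching in plain Python. Both functions are hypothetical simplifications written for this tutorial; the real analysis done by elasticsearch is considerably more involved.

```python
def keyword_matches(stored_value, search_value):
    # A keyword field is indexed as a single, unmodified term,
    # so only the exact same string matches.
    return stored_value == search_value


def text_matches(stored_value, search_word):
    # Very simplified full-text behaviour (an assumption for illustration):
    # lowercase the value, split it into words and match any of them.
    return search_word.lower() in stored_value.lower().split()


print(keyword_matches("dark red", "red"))  # False
print(text_matches("dark red", "red"))     # True
print(keyword_matches("red", "red"))       # True
```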

We create the data out of thin air:

In [None]:
import random

def shape_generator(count=1000, seed=42):
    rnd = random.Random(seed)
    for i in range(count):
        yield {
            "shape": rnd.choice(("triangle", "square")),
            "color": rnd.choice(("red", "green", "blue")),
            "area": rnd.gauss(5, 1.3),
        }
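
Because the generator uses its own seeded `random.Random` instance, it should produce the same documents on every run. The following self-contained check repeats the generator from above and compares two runs; the `Counter` summary is only added here for illustration.

```python
import random
from collections import Counter


def shape_generator(count=1000, seed=42):
    # Same generator as above, repeated so this cell runs on its own.
    rnd = random.Random(seed)
    for i in range(count):
        yield {
            "shape": rnd.choice(("triangle", "square")),
            "color": rnd.choice(("red", "green", "blue")),
            "area": rnd.gauss(5, 1.3),
        }


first = list(shape_generator(count=5))
second = list(shape_generator(count=5))
print(first == second)  # True: same seed, same documents

# Count how often each shape occurs in the default 1000 documents.
shape_counts = Counter(doc["shape"] for doc in shape_generator())
print(sum(shape_counts.values()))  # 1000
```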

Now create our exporter and export a couple of documents. It uses the [bulk helper tools](https://elasticsearch-py.readthedocs.io/en/7.10.0/helpers.html#bulk-helpers) internally.

In [None]:
exporter = ShapeExporter()

count, errors = exporter.export_list(shape_generator(), refresh=True)

print(count, "exported")

The `refresh=True` parameter will refresh the index as soon as everything is exported, so we do not have to wait for objects to appear in the elasticsearch index. 
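
For orientation, the bulk helpers mentioned above consume an iterable of "action" dictionaries. The sketch below builds such actions by hand. The `to_bulk_actions` function is hypothetical, and how the `Exporter` builds its actions internally is an assumption that is not taken from its source.

```python
def to_bulk_actions(documents, index_name):
    """Wrap plain documents into action dicts for the bulk helpers."""
    for doc in documents:
        yield {
            "_index": index_name,  # target index of this document
            "_source": doc,        # the document body itself
        }


documents = [
    {"shape": "square", "color": "red", "area": 4.2},
    {"shape": "triangle", "color": "blue", "area": 6.1},
]
actions = list(to_bulk_actions(documents, "elastipy-example-shapes"))
print(actions[0])
```

With a client at hand, such an iterable could be passed to `elasticsearch.helpers.bulk(client, actions)`.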

## query oh elastipyia 

In most cases this import is enough to access all the good stuff:

In [None]:
from elastipy import Search, query

Now get some documents:

In [None]:
s = Search(index="elastipy-example-shapes")

**s** is now a search request that can be configured. Setting any search-related option always returns a new instance. Here we set the maximum number of documents to return:

In [None]:
s = s.size(3)
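
The "returns a new instance" behaviour is the reason for the reassignment `s = s.size(3)`. A minimal sketch of the pattern, using a hypothetical `MiniSearch` class that is not elastipy's implementation:

```python
import copy


class MiniSearch:
    """Hypothetical stand-in for a search object; not elastipy's implementation."""

    def __init__(self, index):
        self.index = index
        self.body = {}

    def size(self, size):
        # Work on a deep copy so the original instance stays untouched.
        new_search = copy.deepcopy(self)
        new_search.body["size"] = size
        return new_search


base = MiniSearch(index="elastipy-example-shapes")
limited = base.size(3)

print(base.body)     # {}
print(limited.body)  # {'size': 3}
```

Calling `base.size(3)` without keeping the return value would simply discard the configured copy.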

Next we add a [query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html), more specifically a [term query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html).

In [None]:
s = s.term(field="color", value="green")

Our request to elasticsearch would look like this right now:

In [None]:
s.dump.body()

More queries can be added; by default they are combined with **AND**:

In [None]:
s = s.range(field="area", gt=5.)
s.dump.body()
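
In the elasticsearch query DSL, an AND combination is typically expressed as a `bool` query with `must` clauses. The body below is written by hand to show the general shape. It is not captured elastipy output, and the library may arrange the clauses differently.

```python
import json

# Hand-written request body (an assumption about the general shape,
# not captured elastipy output).
body = {
    "size": 3,
    "query": {
        "bool": {
            "must": [
                {"term": {"color": "green"}},      # color == "green"
                {"range": {"area": {"gt": 5.0}}},  # area > 5
            ]
        }
    },
}
print(json.dumps(body, indent=2))
```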

**OR** combinations can be achieved with the [bool](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html) query itself or by applying the `|` operator to the query classes in `elastipy.query`:

In [None]:
s = s | (query.Term(field="color", value="red") & query.Range(field="area", gt=8.))
s.dump.body()
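
How operators like `|` and `&` can map onto `bool` queries may be easier to see in a stripped-down sketch. The `Q` class below is hypothetical and only illustrates the idea with Python's `__or__` and `__and__` methods; elastipy's real query classes may flatten or arrange the clauses differently.

```python
import json


class Q:
    """Tiny stand-in for a query object; NOT elastipy's implementation."""

    def __init__(self, body):
        self.body = body

    def __and__(self, other):
        # AND: every sub-query must match.
        return Q({"bool": {"must": [self.body, other.body]}})

    def __or__(self, other):
        # OR: at least one sub-query should match.
        return Q({"bool": {"should": [self.body, other.body]}})


green = Q({"term": {"color": "green"}})
large = Q({"range": {"area": {"gt": 5.0}}})
red = Q({"term": {"color": "red"}})
huge = Q({"range": {"area": {"gt": 8.0}}})

combined = (green & large) | (red & huge)
print(json.dumps(combined.body, indent=2))
```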

Better execute the search now before the body gets too complicated:

In [None]:
response = s.execute()
response.dump()

The response object is a small wrapper around `dict` that has some convenience properties.

In [None]:
response.documents

How many documents are there in total?

In [None]:
Search(index="elastipy-example-shapes").execute().total_hits

---

The functions and properties are designed to be chainable, which allows for short and powerful one-liners:

In [None]:
Search(index="elastipy-example-shapes") \
    .size(20).sort("-area").execute().documents

So this was the document response. Let's dive inside and start aggregating on fields.

## agitated aggregation

Aggregations can be created using the `agg_`, `metric_` and `pipeline_` prefixes. An aggregation is **attached** to the `Search` instance, so there is no copying like with the queries above.

In [None]:
s = Search(index="elastipy-example-shapes").size(0)

agg = s.agg_terms(field="shape")

s.dump.body()

As we can see, a [terms aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html)
has been added to the search body. The names of aggregations are auto-generated, but can be explicitly stated:

In [None]:
s = Search(index="elastipy-example-shapes").size(0)

agg = s.agg_terms("shapes", field="shape")

s.dump.body()

Let's look at the result from elasticsearch:

In [None]:
s.execute()
s.dump.response()

### valuable access

Because we kept the `agg` variable, we can use its interface to access the values more conveniently:

In [None]:
agg.to_dict()

It supports the `items()`, `keys()` and `values()` generators as known from the `dict` type:

In [None]:
for key, value in agg.items():
    print(f"{key:12} {value}")
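
Conceptually, such a key/value view can be derived from the `buckets` list of a raw terms aggregation response. The response below is invented for illustration (the counts are made up), and the dict comprehension is only a sketch of the idea, not elastipy's code.

```python
# Invented example response; the counts are made up for illustration.
response = {
    "aggregations": {
        "shapes": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "square", "doc_count": 520},
                {"key": "triangle", "doc_count": 480},
            ],
        }
    }
}

# Map each bucket key to its document count.
buckets = response["aggregations"]["shapes"]["buckets"]
counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}

for key, value in counts.items():
    print(f"{key:12} {value}")
```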

The aggregation also has a `dict_rows()` generator which preserves the **names** and **keys** of the aggregation:

In [None]:
for row in agg.dict_rows():
    print(row)

The `rows()` generator flattens the `dict_rows()` into a CSV-style list:

In [None]:
for row in agg.rows():
    print(row)
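
To illustrate what "CSV-style" means here, the sketch below flattens a few invented dictionary rows into a header row followed by value rows and writes them with the standard `csv` module. The column names, the counts and the header-first layout are assumptions made for this example.

```python
import csv
import io

# Invented rows in the spirit of dict_rows(); names and counts are assumptions.
dict_rows = [
    {"shapes": "square", "shapes.doc_count": 520},
    {"shapes": "triangle", "shapes.doc_count": 480},
]

# First row holds the column names, all following rows hold the values.
header = list(dict_rows[0].keys())
rows = [header] + [[row[name] for name in header] for row in dict_rows]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```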

And we can print a nice table to the command-line:

In [None]:
agg.dump.table(colors=False)

(The `colors=False` parameter disables console colors because they do not work in this documentation.)

---
Obviously, at this point some users will wonder whether a conversion to a
[pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is built in. It is:

In [None]:
agg.to_pandas()  # or simply agg.df()

The **index** and **columns** are assigned automatically. Also columns containing ISO-formatted date strings will be converted to `pandas.Timestamp`.

With `matplotlib` installed we can access the [pandas plotting interface](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html):  

In [None]:
agg.df().plot.bar()

Satisfied with a little graphic we feel more confident and look into the details of **metrics** and nested **bucket** aggregations.

### deeper aggregation agitation 

In [None]:
agg = Search(index="elastipy-example-shapes") \
    .agg_terms("shapes", field="shape") \
    .agg_terms("colors", field="color") \
    .metric_sum("area", field="area") \
    .metric_avg("area-avg", field="area") \
    .execute()

A few notes:

  - `agg_` methods always return the newly created aggregation, so the `colors` aggregation is nested inside the `shapes` aggregation.
  - `metric_` methods return their parent aggregation (because metrics do not allow a nested aggregation), so we can just continue to call `metric_*` and each time we add a metric to the `colors` aggregation. If you need to get access to the metric object itself add the `return_self=True` parameter.
  - The `execute` method on an aggregation does not return the response but the aggregation itself.
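
The chaining rules from the first two notes can be sketched with a small stand-in class. `MiniAgg` is hypothetical and only mirrors the return-value behaviour described above; it does not talk to elasticsearch.

```python
class MiniAgg:
    """Hypothetical stand-in for an aggregation node; not elastipy's implementation."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.metrics = []

    def agg_terms(self, name):
        # Bucket aggregations return the NEW child, so chaining nests deeper.
        child = MiniAgg(name, parent=self)
        self.children.append(child)
        return child

    def metric_sum(self, name):
        # Metrics return the aggregation they were added to (their parent).
        self.metrics.append(name)
        return self

    metric_avg = metric_sum  # same chaining behaviour for this sketch


root = MiniAgg("search")
agg = (
    root.agg_terms("shapes")
    .agg_terms("colors")
    .metric_sum("area")
    .metric_avg("area-avg")
)

print(agg.name)         # colors
print(agg.parent.name)  # shapes
print(agg.metrics)      # ['area', 'area-avg']
```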
  
Now, what does the `to_dict` output look like?

In [None]:
agg.to_dict()

It has put the **keys** that lead to each value into tuples. Without a lot of thinking we can say:

In [None]:
data = agg.to_dict()
print(f"There are {data[('triangle', 'red')]} red triangles in the database!")

But where have the metrics gone?

Generally, `keys()`, `values()`, `items()`, `to_dict()` and `to_matrix()` only access the values of the **current aggregation** (which is `colors` in the example), although the keys of all parent **bucket** aggregations that lead to those values are included.
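
A small sketch may help to picture where the tuple keys come from. The nested structure below is an invented excerpt in the shape of an elasticsearch response (all numbers are made up), and the comprehension is just one way to collect the innermost counts together with their parent keys.

```python
# Invented excerpt of a nested response; all numbers are made up.
shapes = {
    "buckets": [
        {
            "key": "square",
            "doc_count": 5,
            "colors": {
                "buckets": [
                    {"key": "red", "doc_count": 3, "area": {"value": 14.5}},
                    {"key": "blue", "doc_count": 2, "area": {"value": 9.0}},
                ]
            },
        },
        {
            "key": "triangle",
            "doc_count": 4,
            "colors": {
                "buckets": [
                    {"key": "red", "doc_count": 4, "area": {"value": 21.0}},
                ]
            },
        },
    ]
}

# Values of the innermost bucket aggregation, keyed by the full key path.
flat = {
    (shape["key"], color["key"]): color["doc_count"]
    for shape in shapes["buckets"]
    for color in shape["colors"]["buckets"]
}
print(flat)
```

The `area` metric values are present in the nested data but are deliberately left out of `flat`, just like in the `to_dict()` output.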

The methods `dict_rows()`, `rows()`, `to_pandas()` and `.dump.table()` will access **all values** from the whole aggregation branch. In this example the branch looks like this:

- shapes
  - colors
    - area
    - area-avg

In [None]:
agg.dump.table(digits=3, colors=False)

Now all information is in the table. Note that the `shapes.doc_count` column contains the same value multiple times. This is because each `colors` aggregation bucket splits the `shapes` bucket into multiple results, without changing the overall count of the shapes, of course. 

Now what is this method with the awesome name `to_matrix`?

In [None]:
names, keys, matrix = agg.to_matrix()
print("names ", names)
print("keys  ", keys)
print("matrix", matrix)
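
To picture the three return values, the following sketch builds comparable structures from tuple-keyed counts in plain Python. The data is invented, and details such as the key ordering and filling missing combinations with `0` are assumptions of this example rather than documented behaviour of `to_matrix()`.

```python
# Invented tuple-keyed counts in the spirit of to_dict(); numbers are made up.
data = {
    ("square", "red"): 3,
    ("square", "blue"): 2,
    ("triangle", "red"): 4,
}

names = ["shapes", "colors"]
# One sorted key list per dimension.
keys = [
    sorted({key[0] for key in data}),
    sorted({key[1] for key in data}),
]
# One matrix row per shape, one column per color;
# missing combinations are filled with 0.
matrix = [
    [data.get((shape, color), 0) for color in keys[1]]
    for shape in keys[0]
]

print("names ", names)
print("keys  ", keys)
print("matrix", matrix)
```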

`to_matrix()` produces the data for a heatmap! At least in two dimensions. In this example we have two dimensions from the **bucket** aggregations `shapes` and `colors`. `to_matrix()` will produce a matrix with any number of dimensions, but if it's one or two, we can also convert it to a `DataFrame`:

In [None]:
agg.df_matrix()

And having something like [seaborn](https://seaborn.pydata.org/) installed we can easily plot it:

In [None]:
import seaborn as sns

sns.heatmap(agg.df_matrix())