# Elasticsearch

Elasticsearch is a popular schema-less document store. It's scalable, highly available, and very fast. It achieves this by splitting each index into *shards* and distributing them across *nodes* (we will explain both soon). Shards are also replicated on other nodes to protect against data loss.

Elasticsearch can be combined with Logstash, Kibana, and Beats to produce the Elastic (or *ELK*) stack.

We won't dive too deeply into Elasticsearch here, and a full understanding isn't required for building Q&A models, but knowing the basics will help.

## How it Works

The top level of an Elasticsearch deployment is called a cluster. This is the container within which everything else lives:

![The Elasticsearch cluster acts as our instance container](../../assets/images/elasticsearch_cluster.jpeg)

Inside our cluster we have nodes. We can think of these as individual processing units. Where possible, we place each node on separate hardware to provide fault tolerance: if one component fails and a node goes down, we have other nodes to take its place.

The next layer down gives us our indices. A single index is like a dataset, and within these indices we store data as schema-less ‘documents’:

![The Elasticsearch index that contains schema-less documents](../../assets/images/elasticsearch_index.jpeg)

Schema-less refers to the lack of a common data structure. This is the opposite of schema-based SQL databases — where every entry is assigned a set of values that correspond to a strict set of fields defined by the table schema.

Rather than following a strict schema, our indices are built of ‘documents’ where each entry can contain an entirely different set of fields. These documents are stored as JSON objects, which look like this:

```json
{
    "id": "abc",
    "project": "homegrown cucumbers",
    "codename": "X",
    "cool-factor": "10"
}
```

And another document in the same index might look like this:

```json
{
    "id": "def",
    "project": "making faster lawnmowers",
    "codename": "go-fast",
    "notes": "adding stripes does not work",
    "ideas": "rockets"
}
```
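To make the schema-less idea concrete, here is a minimal Python sketch that holds the two example documents as plain dictionaries and compares their field sets. The variable names are purely illustrative; nothing here talks to Elasticsearch.

```python
import json

# The two example documents, written as Python dicts.
cucumbers = {
    "id": "abc",
    "project": "homegrown cucumbers",
    "codename": "X",
    "cool-factor": "10",
}
lawnmowers = {
    "id": "def",
    "project": "making faster lawnmowers",
    "codename": "go-fast",
    "notes": "adding stripes does not work",
    "ideas": "rockets",
}

# Both serialize to valid JSON even though their fields differ.
for doc in (cucumbers, lawnmowers):
    print(json.dumps(doc))

# Compare which fields the documents share and which are unique to each.
shared_fields = sorted(set(cucumbers) & set(lawnmowers))
only_cucumbers = sorted(set(cucumbers) - set(lawnmowers))
only_lawnmowers = sorted(set(lawnmowers) - set(cucumbers))

print("shared:", shared_fields)
print("only in cucumbers:", only_cucumbers)
print("only in lawnmowers:", only_lawnmowers)
```

A table in a SQL database would force both rows to carry the same columns. Here each document simply carries whichever fields it needs.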

Indices can be split into shards, which are then distributed across our nodes. This improves processing speed by enabling *parallel* queries, and improves fault tolerance (if one node goes down, we still have another running).
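The idea of spreading documents over shards, and shards over nodes, can be sketched in a few lines of Python. This is a simplified illustration, not Elasticsearch's actual routing or allocation logic: `zlib.crc32` stands in for the real hash function, the round-robin placement is an assumption made for clarity, and `pick_shard`, `place_shards`, and the node names are hypothetical.

```python
import zlib
from typing import Dict, List


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number using a stable hash."""
    # crc32 is a stand-in hash; it gives the same result on every run.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_shards(num_shards: int, nodes: List[str]) -> Dict[int, str]:
    """Assign shards to nodes round-robin (illustrative only)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


nodes = ["node-a", "node-b"]  # hypothetical node names
placement = place_shards(4, nodes)

for doc_id in ["abc", "def"]:
    shard = pick_shard(doc_id, 4)
    print(f"{doc_id} -> shard {shard} on {placement[shard]}")
```

Because each shard lives on a specific node, a query can be run against several shards at once, with each node working on its own share of the data.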

## Elasticsearch and Haystack

We will not be interfacing with Elasticsearch directly; Haystack will handle this for us. However, we will be building our document store using the typical Elasticsearch API calls, which we'll get to very soon.
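For orientation, here is a rough sketch of what two raw Elasticsearch REST calls look like when built by hand in Python. It only constructs the requests and never sends them, so it runs without a cluster. The address `http://localhost:9200` assumes a local node on the default port, and the index name `projects` and both helper functions are hypothetical. The settings `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: local node, default port
INDEX = "projects"                # hypothetical index name


def create_index_request(index: str, shards: int, replicas: int) -> Request:
    """Build (but do not send) a request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return Request(
        f"{ES_URL}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_document_request(index: str, doc: dict) -> Request:
    """Build (but do not send) a request that stores one document."""
    return Request(
        f"{ES_URL}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


create_req = create_index_request(INDEX, shards=3, replicas=1)
doc_req = index_document_request(INDEX, {"id": "abc", "codename": "X"})

for req in (create_req, doc_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing either request to `urllib.request.urlopen` would send it to a running cluster. In practice Haystack issues the equivalent calls for us.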

[Read more about Elasticsearch here](https://towardsdatascience.com/next-gen-data-storage-and-analytics-with-elasticsearch-a833ca7ca54a?sk=ac69a52fa04714900cc4f2dd74e5cd68)