# Getting Started with Elasticsearch for Spatial Analysis

<span style="background:yellow">Introduction & background on Elasticsearch and document databases...</span>

## 1) Launch Elasticsearch locally using Docker Compose

The Elasticsearch stack is made up of multiple services that need to communicate with each other.  This is hard to accomplish with isolated Docker containers, which are generally not set up to reach one another.  It is easier to use Docker Compose, a container orchestration utility that runs multiple linked services in networked containers.  This repository contains an `elasticsearch-docker-compose.yml` file that defines the parameters for launching Elasticsearch using Docker Compose.  The basic structure of the file looks like this:

```
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.4.0
    container_name: elasticsearch
    ...

  kibana:
    image: docker.elastic.co/kibana/kibana:6.4.0
    container_name: kibana
    ...

volumes:
  ...

networks:
  ...
```

The `services` section lists the different services that we want to launch and coordinate; in this case, we'll be launching Elasticsearch (the database itself) and Kibana (Elasticsearch's web-based access interface).  The `volumes` section creates a persistent data volume to hold the database files.  The `networks` section creates a virtual network that the Elasticsearch and Kibana containers can use to communicate with each other.

To launch the Elasticsearch stack, open up your command line and `cd` into the folder containing the `elasticsearch-docker-compose.yml` file.  Then, run the following command:

```bash
docker-compose -f elasticsearch-docker-compose.yml up -d
```

What does this command do?  Breaking it down, here's what each argument means:

- **-f** : specifies the filename of the .yml file that describes the cluster of services we want to run
- **up** : initializes the containers and launches the services specified in the Docker Compose file
- **-d** : tells Docker Compose to run the cluster in detached mode, so it runs in the background even if you quit your console

As the stack launches, you should see the following confirmation messages in your command window:

![Elasticsearch Docker Compose up message](img_elasticsearch/elasticsearch_up.png)

To check that the stack is running, execute the following command in the command line:

```bash
docker container ls
```

You should see something like this, indicating that the services are successfully running in two separate, but "orchestrated" containers:

```
CONTAINER ID    IMAGE               COMMAND                CREATED         STATUS          PORTS
blahblahblah    kibana:6.4.0        "/usr/local/bin/kiba…" X seconds ago   Up X seconds    0.0.0.0:5601->5601/tcp

blahblahblah    elasticsearch:6.4.0 "/usr/local/bin/dock…" X seconds ago   Up X seconds    0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp
```
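
As an additional check from Python, the sketch below requests the root endpoint of the Elasticsearch REST API, which the container listing above shows published on port 9200. It assumes the cluster is reachable at `localhost` without authentication; the helper names `cluster_url` and `ping_cluster` are illustrative and not part of this repository.

```python
import json
import urllib.error
import urllib.request


def cluster_url(host="localhost", port=9200):
    """Build the base URL of the Elasticsearch REST API."""
    return "http://{}:{}".format(host, port)


def ping_cluster(url, timeout=5):
    """Return the root-endpoint JSON as a dict, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None


if __name__ == "__main__":
    info = ping_cluster(cluster_url())
    if info is None:
        print("Elasticsearch is not reachable yet; it can take a minute to start.")
    else:
        print("Connected to Elasticsearch version", info["version"]["number"])
```

If the check reports that the cluster is unreachable right after `docker-compose up`, waiting a little and trying again is usually enough.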

<hr>

## 2) Connect to the database

Elasticsearch has a web-based interface called Kibana that you can access using a web browser. Launch your web browser of choice and navigate to:

```http://localhost:5601/app/kibana```

If the Elasticsearch and Kibana containers are successfully initialized and communicating with each other, you should be able to see the following interface:

![Kibana browser launch page](img_elasticsearch/kibana_browser_launch.png)

You'll notice that this approach does not require you to authenticate or set a username or password.  This is because the Docker Compose file used to launch the cluster is set up to disable authentication features to make it easier to access the database for testing.  You will need to enable an authentication method and take other security precautions to secure the database installation if you want to use Elasticsearch in a production environment.

<hr>

## 3) Load data into the database

### 3A) Create an "index" and define its mappings

Elasticsearch uses the term ["index"](https://www.elastic.co/blog/what-is-an-elasticsearch-index) to describe a set of records that are stored together.  This is similar to a "table" in a relational database or a "collection" in databases like MongoDB.  The term "index" is a natural fit, because Elasticsearch was originally invented to power search engine-style projects: its goal was to "index" records, web/server logs, or other large datasets so they could be easily searched later.  

By default, Elasticsearch "indexes" are, well...indexed.  This means that, when reading data in, Elasticsearch will automatically try to detect the field types in your data and store/index the data in a way that can be easily searched later.  This process of reading in data and assigning field types is called [mapping](https://www.elastic.co/blog/found-elasticsearch-mapping-introduction).  Sometimes, Elasticsearch can guess wrong when detecting field types, so it can be helpful to define an explicit mapping--at least for some of the trickier fields--when reading data in.

The mapping definition below is the one we'll use for the Twitter data.  It does two things.  First, it explicitly defines some of the less common data types (the timestamp, centroid, and bounding box) to help Elasticsearch recognize them properly.  Second, the `"_source": { "enabled": true }` setting specifies that Elasticsearch should store the original JSON of each tweet alongside the indexed fields.  This is the default behavior, but it is spelled out here to make the choice explicit.  If the [`_source` field](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/mapping-source-field.html) is disabled, the mapped fields are still indexed and searchable, but Elasticsearch cannot return the original tweet in search results, and features that depend on the stored source (such as updating or reindexing documents) stop working.  Disabling `_source` can be useful when you are indexing large files (such as raster images or other multimedia), only want to use Elasticsearch as a metadata organizer, and store the actual source data elsewhere.  That is not the case with our tweets, which are simple text data with little storage overhead.  So in this scenario, keep `_source` enabled so you can retrieve the body of the tweets after they have been stored in Elasticsearch.

```
PUT twitter_sample
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "tweet": {
            "_source": { "enabled": true },
            "properties": {
                "text": { "type": "text" },
                "timestamp_ms": { "type": "date", "format": "epoch_millis" },
                "user": {
                    "properties": {
                        "location": { "type": "text" },
                        "description": { "type": "text" }
                    }
                },
                "place": {
                    "properties": {
                        "name": { "type": "keyword" },
                        "full_name": { "type": "keyword" },
                        "centroid": { "type": "geo_shape" },
                        "better_bounding_box": { "type": "geo_shape" },
                        "centroid_geohash": { "type": "geo_point" }
                    }
                }
            }
        }
    }
}
```

Before loading any data, copy the mapping definition shown above.  Then navigate to the "Dev Tools" section in Kibana, paste the definition into the console, and click the green "play" button to create the index with this mapping:

![Elasticsearch mapping](img_elasticsearch/elasticsearch_mapping.png)
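
If you would rather script this step than paste it into Kibana, the same request can be sent to the REST API from Python. This is a minimal sketch using only the standard library. It assumes Elasticsearch is listening on `localhost:9200` (the port published by the Docker Compose stack), and the helper name `build_create_index_request` is hypothetical.

```python
import json
import urllib.request


def build_create_index_request(index_name, body, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates an index."""
    return urllib.request.Request(
        url="{}/{}".format(base_url, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The same settings and mappings as the console request above.
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "tweet": {
            "_source": {"enabled": True},
            "properties": {
                "text": {"type": "text"},
                "timestamp_ms": {"type": "date", "format": "epoch_millis"},
                "user": {
                    "properties": {
                        "location": {"type": "text"},
                        "description": {"type": "text"},
                    }
                },
                "place": {
                    "properties": {
                        "name": {"type": "keyword"},
                        "full_name": {"type": "keyword"},
                        "centroid": {"type": "geo_shape"},
                        "better_bounding_box": {"type": "geo_shape"},
                        "centroid_geohash": {"type": "geo_point"},
                    }
                },
            },
        }
    },
}

request = build_create_index_request("twitter_sample", index_body)
print(request.get_method(), request.full_url)

# To actually send it (requires the stack from step 1 to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything touches the cluster.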

### 3B) Execute load scripts

Now it's time to load in the data. We'll use tweets that come in the form of .txt files that have been split into chunks of ~5000 tweets per file. Each line of each file is formatted as a JSON object representing a single tweet, and each tweet is separated by a newline within the .txt files. To make loading easier, this repository contains a pre-baked script called Clean_Load_Scripts.py that you can use to load data into Elasticsearch.  Run the following code to import the script:

```python
import Clean_Load_Scripts as cleanNLoad
```
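
To make the input format described above concrete, here is a minimal sketch of reading one of these files: each non-empty line is parsed as a separate JSON object. This is not the code inside `Clean_Load_Scripts.py`; the function name `parse_tweet_lines` and the sample records are made up for illustration.

```python
import json


def parse_tweet_lines(lines):
    """Parse an iterable of newline-delimited JSON strings into dicts.

    Blank lines are skipped; lines that are not valid JSON are counted
    rather than raising, so one bad record does not abort a whole file.
    """
    tweets, error_count = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweets.append(json.loads(line))
        except ValueError:
            error_count += 1
    return tweets, error_count


# Made-up sample lines standing in for the contents of one .txt file.
sample_lines = [
    '{"text": "Gone fishing", "timestamp_ms": "1538352000000"}',
    "",
    "this line is not JSON",
    '{"text": "Back from the lake", "timestamp_ms": "1538355600000"}',
]

tweets, error_count = parse_tweet_lines(sample_lines)
print(len(tweets), "tweets parsed,", error_count, "bad line(s)")
```

With a real file, an open file handle can be passed directly, since iterating over a text file yields one line at a time.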

Next, define a variable called data_folder that points to the location of the demo data on your computer. Also define a variable called logs_folder that points to the path where you want to store log files about the load. This folder can be located directly inside of your data folder, or it can be an entirely separate path. The load script will keep track of files that need to be loaded, load time for each file, error counts, and some other diagnostic data for each file as it loads.

```python
data_folder = '/path/to/data/twitter_data/data_small_5000_split/'
logs_folder = '/path/to/data/twitter_data/data_small_5000_split/logs/'
```

Now, initialize the extractor to kick off the load logs and prep the files for load:

```python
extractor = cleanNLoad.Extractor(data_folder, logs_folder, initialize=True)
```

Finally, run the `while` loop below that actually does the work of cleaning and loading each file into the database.

If the `while` loop is interrupted for some reason during the load, don't panic! The logs folder contains a `files_to_load.txt` file that keeps track of all the files that still need to be loaded into the database. To restart the load, simply re-create the extractor above with the argument `initialize=False`. Then, re-run the `while` loop below and the load will pick up where it left off.

Ready to load? Okay, go!

```python
while extractor.next_file_available():
    # read in the next file
    next_file_data, next_file_name = extractor.get_next_file()
    # clean the data (fix bounding boxes, add centroids, etc.)
    cleaner = cleanNLoad.Cleaner(next_file_data, next_file_name, logs_folder)
    cleaned_data = cleaner.clean_data()
    # initialize the loader
    loader = cleanNLoad.Loader(cleaned_data, next_file_name, logs_folder)
    # create a database connection
    loader.get_connection("elasticsearch", "localhost", "9200", db_name="twitter_sample")
    loader.load_data()
```

Note that it will take ~1 minute to load each file, so be ready to grab a coffee and be patient.  Loading takes longer in Elasticsearch than it does in a database like MongoDB because the fields are being indexed and optimized for search as the load occurs.  Because it was designed for indexing large-scale datasets, Elasticsearch offers a robust bulk load API, which is what the load script uses to perform the load.
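
For readers curious about what a bulk request looks like, the sketch below builds the newline-delimited body that the `_bulk` endpoint accepts: one action line followed by one document line per record. It illustrates the request format only and is not taken from `Clean_Load_Scripts.py`, whose internals may differ; the function name `build_bulk_body` and the sample records are assumptions.

```python
import json


def build_bulk_body(docs, index_name, doc_type):
    """Serialize documents into the newline-delimited body for _bulk.

    Each document becomes two lines: an action line naming the target
    index and type, followed by the document source itself. Every line,
    including the last one, ends with a newline.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_type": doc_type}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)


# Made-up records for illustration.
docs = [{"text": "Gone fishing"}, {"text": "Back from the lake"}]
body = build_bulk_body(docs, "twitter_sample", "tweet")
print(body)
```

A body like this would be sent with `POST /_bulk` and a `Content-Type` of `application/x-ndjson`. Sending many documents per request is what makes bulk loading faster than indexing tweets one at a time.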

<hr>

## 4) Query the data

### 4A) Basic queries

Now that the data is loaded, let's look at how to perform a few basic queries. Elasticsearch uses a query syntax that looks a little bit like a combination of a REST call used in making web page requests and a JSON object used to define data structures for the web.  Each query starts with the verb "GET", and then contains a series of nested brackets `[]` or `{}` that define the query parameters.
```
GET twitter_sample/_search
{ "_source": ["text"],
  "query": {
    "match": {
      "text": "search text goes here"
    }
  }
}
```

As you can see above, the query starts with a `GET` request, followed by the name of the "index" to search and the `_search` endpoint.  The next line, the `"_source"` specification, is where you define which fields from the original tweet data you want returned in your result set.  The whole request body, including the `"query"` clause, is enclosed in curly braces `{}`.

You can copy and paste queries into the Kibana Dev Tools console and click the green "run" button that appears in the upper right corner of each query to run that particular query:

![Elasticsearch search interface](img_elasticsearch/elasticsearch_search_interface.png)
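
The same queries can also be sent from Python rather than the Kibana console. The sketch below wraps a query body in an HTTP request to the `_search` endpoint using only the standard library. It assumes the cluster is at `localhost:9200`; the helper name `build_search_request` is hypothetical. `POST` is used here because Elasticsearch accepts search bodies over both `GET` and `POST`, and not every HTTP client will send a body with `GET`.

```python
import json
import urllib.request


def build_search_request(index_name, query_body, base_url="http://localhost:9200"):
    """Build (but do not send) a search request for the given index."""
    return urllib.request.Request(
        url="{}/{}/_search".format(base_url, index_name),
        data=json.dumps(query_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The example query from the start of this section, as a Python dict.
query_body = {
    "_source": ["text"],
    "query": {"match": {"text": "search text goes here"}},
}

request = build_search_request("twitter_sample", query_body)
print(request.get_method(), request.full_url)

# To run it against the local stack and print the matching tweets:
# with urllib.request.urlopen(request) as response:
#     results = json.loads(response.read().decode("utf-8"))
#     for hit in results["hits"]["hits"]:
#         print(hit["_score"], hit["_source"]["text"])
```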

#### Basic Query 1: Which tweets contain a particular search term/phrase?

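One of the basic query operators available in Elasticsearch is the ["match"](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-match-query.html) operator. This performs a basic text, numeric, or date search on your field of choice. When the search text contains several words, `match` combines them with OR by default, so the query below returns tweets containing either "blessed" or "cursed". The word "OR" itself should not be typed into the search text, because it would be treated as just another search term:

```
GET twitter_sample/_search
{ "_source": ["user.screen_name", "text"],
  "query": {
    "match": {
      "text": "blessed cursed"
    }
  }
}
```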

When you run this query, you'll notice the results are ordered by score, but what exactly is this "score"?  Elasticsearch has a very robust scoring mechanism with some [interesting theory behind it](https://www.elastic.co/guide/en/elasticsearch/guide/master/scoring-theory.html).  It basically boils down to a mix of three factors: 1) How frequently do the search terms occur in a particular tweet text?; 2) How rare/unique are the search terms, relative to the entire set of words across all tweets stored in the database index?; 3) How prominent are the search terms, relative to the length of that particular tweet's text field?
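
To give a feel for how these three factors interact, here is a small sketch based on the classic TF/IDF formulas described in the linked guide. It is an illustration only: recent Elasticsearch versions (including 6.x) use BM25 as the default similarity, so real scores will not match these numbers. The function names and the counts in the example are made up.

```python
import math


def term_frequency(occurrences):
    """Factor 1: more occurrences of the term in the field raise the score."""
    return math.sqrt(occurrences)


def inverse_document_frequency(total_docs, docs_with_term):
    """Factor 2: terms that appear in fewer documents are weighted higher."""
    return 1.0 + math.log(total_docs / (docs_with_term + 1))


def field_length_norm(terms_in_field):
    """Factor 3: a match in a short field counts for more than in a long one."""
    return 1.0 / math.sqrt(terms_in_field)


def simple_score(occurrences, total_docs, docs_with_term, terms_in_field):
    """Multiply the three factors into a single illustrative weight."""
    return (
        term_frequency(occurrences)
        * inverse_document_frequency(total_docs, docs_with_term)
        * field_length_norm(terms_in_field)
    )


# Made-up counts: an index of 5000 tweets, each tweet 16 words long,
# with the search term appearing once in the tweet being scored.
rare_term = simple_score(1, total_docs=5000, docs_with_term=10, terms_in_field=16)
common_term = simple_score(1, total_docs=5000, docs_with_term=2500, terms_in_field=16)
print("rare term weight:", round(rare_term, 3))
print("common term weight:", round(common_term, 3))
```

In this sketch, a term found in only a handful of tweets ends up with a larger weight than one found in half of them, which is why a rare word in the search text tends to dominate the ranking.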

#### Basic Query 2: How many tweets were tweeted at a place that is classified as a "point of interest"?

Another basic query operator is the ["term"](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-term-query.html) operator, which performs an exact-match query.  The `"from"` and `"size"` settings in the query below specify that we want to return the first 100 results, and the `"_source"` setting specifies that we want to return the place type, place id, and place full name in the result set:

```
GET twitter_sample/_search
{ "_source": ["place.place_type", "place.id", "place.full_name"],
  "from" : 0, "size" : 100,
  "query": {
    "term": { "place.place_type": "poi" }
  }
}
```
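
The total number of matching tweets is reported in the response under `hits.total`, regardless of how many results are returned. To step through the matches themselves, `"from"` and `"size"` can be varied page by page. The sketch below builds the body for an arbitrary page; the function name `poi_page_query` is hypothetical.

```python
def poi_page_query(page, page_size=100):
    """Build the term query above for a given zero-based page of results."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "_source": ["place.place_type", "place.id", "place.full_name"],
        "from": page * page_size,  # number of results to skip
        "size": page_size,         # number of results to return
        "query": {"term": {"place.place_type": "poi"}},
    }


first_page = poi_page_query(0)   # results 0-99, same as the query above
third_page = poi_page_query(2)   # results 200-299
print(first_page["from"], third_page["from"])
```

By default Elasticsearch limits `from + size` to 10,000 results, so this style of paging is best suited to browsing the first few pages rather than exporting an entire index.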

#### Basic Query 3: How many tweets were tweeted within the last year?

A final basic query building block is the ["range"](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html) operator.  As the name suggests, this allows you to perform a range query, with "greater than" and "less than" operators.  Elasticsearch also offers some very intuitive [date math](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#date-math) to help define the boundaries of a range query involving dates:

```
GET twitter_sample/_search
{ "_source": ["timestamp_ms", "text"],
  "query": {
    "range" : {
      "timestamp_ms": {
        "gte" : "now-1y",
        "lte" : "now"
      }
    }
  }
}
```
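
Date math such as `now-1y` is evaluated by Elasticsearch at query time. If you need a fixed, reproducible window instead, the boundaries can be computed in Python and passed as epoch milliseconds, which matches the `epoch_millis` format declared for `timestamp_ms` in the mapping. This is a sketch with hypothetical helper names, and it approximates one year as 365 days rather than a calendar year.

```python
from datetime import datetime, timedelta, timezone


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(round(moment.timestamp() * 1000))


def last_days_range_query(now, days=365):
    """Build a range query covering the `days` days leading up to `now`."""
    start = now - timedelta(days=days)
    return {
        "_source": ["timestamp_ms", "text"],
        "query": {
            "range": {
                "timestamp_ms": {
                    "gte": to_epoch_millis(start),
                    "lte": to_epoch_millis(now),
                }
            }
        },
    }


# A fixed reference time keeps the example reproducible.
reference_time = datetime(2018, 10, 1, tzinfo=timezone.utc)
range_query = last_days_range_query(reference_time)
print(range_query["query"]["range"]["timestamp_ms"])
```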

### 4B) Spatial queries

Elasticsearch supports two different types of geodata--`geo_point` and `geo_shape`.  It also offers a range of different [geo queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html).  On the whole, Elasticsearch is surprisingly flexible in how it allows you to format spatial queries--supporting standard lat/long values, well-known text, and even geohash values as inputs to queries.

#### Spatial Query 1: How many tweets have centroids located within the Twin Cities metro region?

For this query, you'll use the `geo_bounding_box` search operator wrapped within a `filter` search operator.  At the beginning of the query, you also need some boilerplate query formatting to specify that this is a boolean search (`bool`), which will return any tweets for which the `filter` criteria evaluate to `true`:

```
GET twitter_sample/_search
{
    "query": {
        "bool" : {
            "must" : {
                "match_all" : {}
            },
            "filter" : {
                "geo_bounding_box" : {
                    "place.centroid_geohash" : {
                        "top": 45.427104,
                        "left": -93.801912, 
                        "bottom": 44.678982,
                        "right": -92.741222
                        }
                }
            }
        }
    }
}
```

The bounding box above can be formatted in several different ways.  For example, this is the same query, but using well-known text instead of "top", "left", "bottom", and "right" values:

```
GET twitter_sample/_search
{
    "query": {
        "bool" : {
            "must" : {
                "match_all" : {}
            },
            "filter" : {
                "geo_bounding_box" : {
                    "place.centroid_geohash" : {
                        "wkt" : "BBOX (-93.80191, -92.74122, 45.427104, 44.67898)"
                        }
                }
            }
        }
    }
}
```

Look at the documentation on the [geo bounding box query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-bounding-box-query.html) to learn more about the range of formatting options you have when authoring these kinds of queries.
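
Because the coordinate order differs between the two notations, it is easy to mix them up. The small helper below converts the "top"/"left"/"bottom"/"right" form into the WKT `BBOX` form, following the order used in the example above (min longitude, max longitude, max latitude, min latitude). The function name `bbox_to_wkt` is made up for this sketch.

```python
def bbox_to_wkt(top, left, bottom, right):
    """Format a bounding box as a WKT BBOX string.

    The order follows the example query: min longitude, max longitude,
    max latitude, min latitude.
    """
    if bottom > top:
        raise ValueError("bottom latitude must not exceed top latitude")
    return "BBOX ({}, {}, {}, {})".format(left, right, top, bottom)


twin_cities = bbox_to_wkt(
    top=45.427104, left=-93.801912, bottom=44.678982, right=-92.741222
)
print(twin_cities)
```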

#### Spatial Query 2: Which tweets are within 100 miles of Paul Bunyan Land in Brainerd, MN?

Elasticsearch offers a [geodistance query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html) that can address distance range queries.  Elasticsearch also features a convenient set of [distance units](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#distance-units) that can be used within a distance query when specifying a search radius:

```
GET twitter_sample/_search
{
    "query": {
        "bool" : {
            "must" : {
                "match_all" : {}
            },
            "filter" : {
                "geo_distance" : {
                    "distance" : "100mi",
                    "place.centroid_geohash": {
                        "lat" : 46.3512921,
                        "lon" : -94.0292184
                    }
                }
            }
        }
    }
}
```
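
Conceptually, a distance filter keeps the documents whose point lies within a given great-circle distance of the query point. The sketch below shows that idea with the haversine formula. It is an approximation for illustration (a spherical Earth with a mean radius of 3958.8 miles), not Elasticsearch's internal implementation, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(point, center, radius_miles):
    """Return True if `point` lies within `radius_miles` of `center`.

    Both arguments are (lat, lon) tuples in degrees.
    """
    return haversine_miles(point[0], point[1], center[0], center[1]) <= radius_miles


paul_bunyan_land = (46.3512921, -94.0292184)  # coordinates from the query above
one_degree_north = (47.3512921, -94.0292184)  # hypothetical test point

print(round(haversine_miles(*paul_bunyan_land, *one_degree_north), 1), "miles")
print(within_radius(one_degree_north, paul_bunyan_land, 100))
```

One degree of latitude is roughly 69 miles, so the test point falls inside the 100-mile radius used in the query.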

#### Spatial Query 3: Which tweets mention "fishing" and are closest to Brainerd, MN?

One particularly cool feature of Elasticsearch that gives it an edge over databases like MongoDB is its ability to combine regular text-based searches with [geo-based distance sorting](https://www.elastic.co/blog/geo-location-and-search).  This is just one example of how easy it is to add convenient geo-enhancements to regular queries, and of how well-integrated geographic search syntax is into Elasticsearch's overall query structures:

```
GET twitter_sample/_search
{   "_source": ["place.full_name", "text"],
    "query" : {
        "match" : { "text": "fishing" }
    },
    "sort" : [
        {
            "_geo_distance" : {
                "place.centroid_geohash": [-94.0292184, 46.3512921],
                "order" : "asc",
                "unit" : "mi"
            }
        }
    ]
}
```

### 4C) Advanced queries

Elasticsearch also offers some interesting options for compound/complex queries...


#### Advanced Query: Visualizations

## Resources

* [Install Elasticsearch with Docker](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html). Elasticsearch documentation.

* Learning Elasticstack. Packt Publishing.

* [Building an Elasticsearch Index with Python](https://qbox.io/blog/building-an-elasticsearch-index-with-python).