## From Epicurious JSON to Elasticsearch

### Import JSON into Elasticsearch

#### Create a Bitnami Cluster on Google Compute Engine

Configure a [Bitnami Elasticsearch cluster on Google Compute Engine](https://console.cloud.google.com/marketplace/details/bitnami-launchpad/elasticsearch-cluster) with as many nodes as your project needs.

SSH into the cluster and transfer your JSON file from your local machine to the primary instance, for example with *gcloud* or by hosting the file and fetching it with *wget*.

#### Convert JSON to new-line-delimited JSON

Elasticsearch expects a newline-delimited JSON (NDJSON) file for bulk import.

    jq -c '.[]' recipe_urls_final_v3.json > recipe_urls_final_v4.ndjson
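
Where *jq* is not available on the instance, the same conversion can be sketched in Python. This is only a minimal sketch, not the method used here: the function name `array_to_ndjson` is hypothetical, and it assumes the input is a single top-level JSON array that fits in memory.

```python
import json


def array_to_ndjson(json_text):
    """Turn a JSON array into newline-delimited JSON, one compact document per line."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators mirror the output of `jq -c`;
    # ensure_ascii=False keeps non-ASCII characters readable.
    lines = [
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    ]
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

To apply it to a file, read `recipe_urls_final_v3.json` as UTF-8 text, pass the content to the function, and write the result to `recipe_urls_final_v4.ndjson`.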

The file must also follow the Bulk API format: every document line is preceded by an action line, here `{"index":{}}`.

    sed -e 's/^/{"index":{}\n/' recipe_urls_final_v4.ndjson > recipe_urls_final_v5.ndjson
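
The *sed* command relies on `\n` being expanded in the replacement text, which GNU sed does. As a portable alternative, the interleaving can be sketched in Python. The function name `add_bulk_actions` is hypothetical, and the sketch assumes one JSON document per input line.

```python
def add_bulk_actions(ndjson_text, action='{"index":{}}'):
    """Put a Bulk API action line in front of every document line."""
    out = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # blank lines carry no document, so they are skipped
        out.append(action)  # action line: index the document that follows
        out.append(line)    # source line: the document itself
    # The bulk body has to end with a newline, so each line gets one.
    return "".join(entry + "\n" for entry in out)
```

With an empty `{"index":{}}` action, the target index is taken from the request URL and Elasticsearch generates the document IDs.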

#### Push to Elasticsearch

Use the Bulk API. First create the index, then set `index.blocks.read_only_allow_delete` to `false` so that the index accepts writes.

    curl -XPUT 'http://localhost:9200/recipes'

    curl -XPUT -H "Content-Type: application/json" http://localhost:9200/recipes/_settings -d '{"index.blocks.read_only_allow_delete": false}'

Now, we're ready to push the data:

    curl -s -XPOST localhost:9200/recipes/recipe/_bulk -H "Content-Type: application/x-ndjson" --data-binary @recipe_urls_final_v5.ndjson > elastic_logs.txt

Finally, let's list all indices and check if we have pushed the data successfully.

    curl -X GET "localhost:9200/_cat/indices?v"

The output should look similar to:
    
        health status index   [...]  pri rep docs.count docs.deleted store.size pri.store.size
        green  open   recipes [...]    5   1      35768            0    183.9mb         92.2mb

A lot of data cleaning was done in `03_elasticsearch_data_preperation` to convert the base JSON into a format that Elasticsearch can map to a schema. Still, somewhere during the migration, two recipes were lost. :(
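
One way to look for lost documents is to inspect the bulk response saved in `elastic_logs.txt`. The sketch below assumes that the missing recipes were rejected during the bulk request, which the logs may or may not confirm. The function name `failed_bulk_items` is hypothetical. It relies on the Bulk API response format, in which `items` holds one result per action in request order and a rejected action carries an `error` entry.

```python
import json


def failed_bulk_items(response_text):
    """Return (position, status, error type, reason) for each rejected bulk action."""
    response = json.loads(response_text)
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item has a single key naming the action, e.g. "index".
        for result in item.values():
            if "error" not in result:
                continue
            error = result["error"]
            if isinstance(error, dict):
                error_type, reason = error.get("type"), error.get("reason")
            else:
                # Fall back to the raw value if the error is not an object.
                error_type, reason = None, str(error)
            failures.append((position, result.get("status"), error_type, reason))
    return failures
```

Calling `failed_bulk_items(open("elastic_logs.txt").read())` returns an empty list if every action succeeded. Otherwise, each position is the zero-based index of the rejected document in the order it was sent.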

### Annex: Configuration of Elasticsearch Cluster on Google Cloud (Bitnami)

From Google App Engine, you can connect directly to the cluster using a node's private IP, **given that**:
- you are using the Google App Engine flexible environment
- the application is actually **deployed** (the connection will fail in web preview)

Also, in Google App Engine environments, it turned out to be necessary to send Elasticsearch GET requests that carry a body as POST requests instead; see the [Python Elasticsearch Documentation](https://elasticsearch-py.readthedocs.io/en/master/) for the `send_get_body_as` option.

We therefore create the connection in our Python code as follows:

    from elasticsearch import Elasticsearch
    es = Elasticsearch('http://10.128.0.10:9200', send_get_body_as='POST')


After SSHing into a node, you can check the cluster health at any time. If everything worked, the `status` field should be `"green"`.

    curl -XGET 'localhost:9200/_cluster/health?pretty'

Output should be similar to: 
    
    {
      "cluster_name" : "es-cluster",
      "status" : "green",
      "timed_out" : false,
      "number_of_nodes" : 3,
      "number_of_data_nodes" : 3,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 0,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 100.0
    }
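
To automate this check, the health response can be evaluated in a few lines of Python. This is a minimal sketch: the function name `cluster_is_healthy` and its `expected_nodes` parameter are hypothetical, and treating any unassigned shard as unhealthy is an assumption that may be too strict for some setups.

```python
import json


def cluster_is_healthy(health_text, expected_nodes=None):
    """Return True if a _cluster/health response reports a fully green cluster."""
    health = json.loads(health_text)
    if health.get("status") != "green":
        return False  # yellow or red: some shards are not allocated
    if health.get("unassigned_shards", 0) != 0:
        return False
    if expected_nodes is not None and health.get("number_of_nodes") != expected_nodes:
        return False  # a node may have dropped out of the cluster
    return True
```

For the three-node output above, `cluster_is_healthy(text, expected_nodes=3)` would return `True`, where `text` is the JSON printed by the *curl* command.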