# Query DSL

## Queries

### Scoring
This is the main concept to remember: query results are sorted by relevance (**score**).

Elasticsearch (like OpenSearch, which this notebook uses) returns a score with every document that matches a query. The score measures how well the document matches the query.

The score is shown under the `_score` field in the results.
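
As a minimal sketch of how a client might read these scores, the snippet below walks a response-shaped dictionary and lists each hit with its `_score`. The `response` literal is a trimmed, hand-copied stand-in for a real search response (it reuses three hits from the `match` example further down in this notebook), and `hits_by_score` is a hypothetical helper name, not part of any client library.

```python
# Trimmed stand-in for a search response; a real one carries more fields.
response = {
    "hits": {
        "max_score": 7.638878,
        "hits": [
            {"_id": "1701", "_score": 7.638878, "_source": {"title": "Deconstructing Harry"}},
            {"_id": "74948", "_score": 7.638878, "_source": {"title": "Harry Brown"}},
            {"_id": "4855", "_score": 7.237429, "_source": {"title": "Dirty Harry"}},
        ],
    }
}


def hits_by_score(resp):
    """Return (score, title) pairs, highest score first."""
    pairs = [(hit["_score"], hit["_source"]["title"]) for hit in resp["hits"]["hits"]]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


for score, title in hits_by_score(response):
    print(f"{score:.3f}  {title}")
```

The server already returns hits sorted by score, so the explicit sort only matters if hits from several responses are merged on the client side.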

### Loading the datasets

NOTE: Before launching the notebook you have to load the required environment variables:

```
source bash/opensearch
```

In [7]:
%%bash

cd ${DATASET_LOCATION}/movielens-latest-small

# movies
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X DELETE \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies?pretty"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-index.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-bulk.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/_bulk" > /dev/null

{
  "acknowledged" : true
}
{"acknowledged":true,"shards_acknowledged":true,"index":"movies"}

In [12]:
%%bash

cd ${DATASET_LOCATION}/movielens-latest-small

# movies-tuned
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X DELETE \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies-tuned?pretty"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-tuned-index.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies-tuned"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @movies-tuned-bulk.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/_bulk" > /dev/null

{
  "acknowledged" : true
}
{"acknowledged":true,"shards_acknowledged":true,"index":"movies-tuned"}

In [13]:
%%bash

cd ${DATASET_LOCATION}/movielens-latest-small

# ratings
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X DELETE \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/ratings?pretty"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @ratings-index.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/ratings"
    
curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X PUT -H "Content-Type: application/json" \
    --data-binary @ratings-bulk.json \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/_bulk" > /dev/null

{
  "acknowledged" : true
}
{"acknowledged":true,"shards_acknowledged":true,"index":"ratings"}

## Let's explore the query language

We will do a walkthrough using the `Dev Tools` view in OpenSearch Dashboards (i.e. the OpenSearch counterpart of Kibana).

All the examples of the walkthrough are in:

- [Query DSL walkthrough in Dev Tools](devtools_query_dsl_walkthrough.txt)

## Query DSL Reference

### match_all
This is the default query: if you do not specify a query, this is the one that runs.

It returns all the documents.

In [5]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match_all" : {
        }
    }
}'

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 9742,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "movieId" : 5,
          "title" : "Father of the Bride Part II",
          "year" : "1995",
          "genres" : [
            "Comedy"
          ]
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 1.0,
        "_source" : {
          "movieId" : 7,
          "title" : "Sabrina",
          "year" : "1995",
          "genres" : [
            "Comedy",
            "Romance"
          ]
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.0,
        "_source

### match
```
GET /:index/_search
{
    "query": {
        "match": {
            "title": "harry"
        }
    }
}
```

Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching.

The match query is the standard query for performing a full-text search, including options for fuzzy matching.

Search for the term "harry" in the "title" field
```
"query": {
    "match": {
        "title": "harry"
    }
}
```

Since "title" is an **analyzed field** the search is not case sensitive. So the following query will return the same results:
```
"query": {
    "match": {
        "title": "HARRY"
    }
}
```


Search for the terms **"harry" or "potter"** in the "title" field:
```
"query": {
    "match": {
        "title": "harry potter"
    }
}
```
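
To make the "harry or potter" behaviour concrete, here is a toy model of the matching step, written under loud assumptions: `analyze` is only a rough stand-in for a real analyzer (lowercase, split on non-alphanumeric characters), and `match` is a hypothetical helper that ignores scoring entirely. The `operator` argument mirrors the `operator` option of the real `match` query, which defaults to `or`.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(query, field_value, operator="or"):
    """True if the field contains any ("or") or all ("and") of the query terms."""
    query_terms = set(analyze(query))
    field_terms = set(analyze(field_value))
    if operator == "and":
        return query_terms <= field_terms
    return bool(query_terms & field_terms)


titles = ["Dirty Harry", "Harry Potter and the Half-Blood Prince", "Sabrina"]

# Default operator: either term is enough, and case does not matter.
print([t for t in titles if match("HARRY potter", t)])
# operator="and": every term has to be present.
print([t for t in titles if match("harry potter", t, operator="and")])
```

Because the query text goes through the same analysis as the indexed field, "HARRY" and "harry" end up as the same term, which is why the search is not case sensitive.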

In [16]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match" : {
            "title" : "harry"
        }
    },
    "_source": "title"
}'

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 18,
      "relation" : "eq"
    },
    "max_score" : 7.638878,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "1701",
        "_score" : 7.638878,
        "_source" : {
          "title" : "Deconstructing Harry"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "74948",
        "_score" : 7.638878,
        "_source" : {
          "title" : "Harry Brown"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "4855",
        "_score" : 7.237429,
        "_source" : {
          "title" : "Dirty Harry"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "3389",
        "_score" : 6.4426374,
        "_source" : {
          "title" :

In [15]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match" : {
            "title" : "harry potter"
        }
    },
    "_source": "title"
}'

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 21,
      "relation" : "eq"
    },
    "max_score" : 9.364731,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "4896",
        "_score" : 9.364731,
        "_source" : {
          "title" : "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "69844",
        "_score" : 9.340499,
        "_source" : {
          "title" : "Harry Potter and the Half-Blood Prince"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "8368",
        "_score" : 8.766846,
        "_source" : {
          "title" : "Harry Potter and the Prisoner of Azkaban"
        }
      },
      {
        "_index" : "movies"

#### _source
To return only certain fields in the response use the "_source" option (this is known as source filtering):
```
{
  "query": {
    "match": { "title": "harry" }
  },
  "_source": ["title", "year", "genres"]
}
```
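
The effect of `_source` on each hit can be pictured with a few lines of Python. This is only a sketch of the idea: `filter_source` is a hypothetical helper, it handles plain field names only, and the real option also accepts wildcard patterns and `includes`/`excludes` objects that are not modelled here.

```python
def filter_source(source, fields):
    """Keep only the requested top-level fields of a document."""
    return {name: source[name] for name in fields if name in source}


# Document copied from the match_all output above.
doc = {
    "movieId": 5,
    "title": "Father of the Bride Part II",
    "year": "1995",
    "genres": ["Comedy"],
}

print(filter_source(doc, ["title", "year"]))
```

Fields that are requested but missing from the document are simply left out of the result.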

In [18]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match" : {
            "title" : "harry"
        }
    },
    "_source": ["title", "year"]
}'

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 18,
      "relation" : "eq"
    },
    "max_score" : 7.638878,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "1701",
        "_score" : 7.638878,
        "_source" : {
          "year" : "1997",
          "title" : "Deconstructing Harry"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "74948",
        "_score" : 7.638878,
        "_source" : {
          "year" : "2009",
          "title" : "Harry Brown"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "4855",
        "_score" : 7.237429,
        "_source" : {
          "year" : "1971",
          "title" : "Dirty Harry"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" 

### match_phrase (Phrase matching)
```
GET /:index/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "harry potter",
                "slop": 1
            }
        }
    }
}
```

Matches documents that contain all the terms of the phrase, in the same order.

Options:

- slop: how far apart the terms of the phrase are allowed to be (in either direction!). Transposed terms need a slop of 2. For example, the query "quick fox" matches "quick brown fox" with a slop of 1.
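
The slop rule can be explored with a simplified model. The sketch below assumes that the slop a phrase needs equals the spread of the position shifts between the query terms and the matching document tokens; this is an approximation for illustration, not the engine's actual implementation. `tokens` and `required_slop` are hypothetical helpers, and the tokenizer is a crude stand-in for a real analyzer.

```python
import re
from itertools import product


def tokens(text):
    """Crude tokenizer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def required_slop(phrase, text):
    """Smallest slop at which `phrase` matches `text` in this model, or None."""
    query = tokens(phrase)
    doc = tokens(text)
    if not query:
        return None
    positions = []
    for term in query:
        found = [i for i, tok in enumerate(doc) if tok == term]
        if not found:
            return None  # a phrase term is missing from the text
        positions.append(found)
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # one document token cannot serve two query terms
        # Shift of each term relative to its place in the phrase.
        shifts = [pos - i for i, pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best


print(required_slop("quick fox", "quick brown fox"))
print(required_slop("fox quick", "quick fox"))
print(required_slop("harry and", "Harry Potter and the Half-Blood Prince"))
```

In this model an exact phrase needs a slop of 0, one intervening word needs 1, and two swapped words need 2, which agrees with the rule stated above.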



In [4]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match_phrase" : {
            "title" : "harry potter"
        }
    },
    "_source": "title"
}'

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : 9.364731,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "4896",
        "_score" : 9.364731,
        "_source" : {
          "movieId" : 4896,
          "title" : "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)",
          "year" : "2001",
          "genres" : [
            "Adventure",
            "Children",
            "Fantasy"
          ]
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "69844",
        "_score" : 9.340498,
        "_source" : {
          "movieId" : 69844,
          "title" : "Harry Potter and the Half-Blood Prince",
          "year" : "2009",
          "genres" : [
            "Adventure",
        

The `_source` option allows us to limit the fields returned in the results to only the ones indicated (we can give a list).

In [8]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?pretty" -d '
{
    "query" : {
        "match_phrase" : {
            "title" : {
                "query": "harry and",
                "slop": 1
            }                       
        }
    },
    "_source": "title"
}'

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 9,
      "relation" : "eq"
    },
    "max_score" : 8.535532,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "3388",
        "_score" : 8.535532,
        "_source" : {
          "title" : "Harry and the Hendersons"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "4896",
        "_score" : 4.2084618,
        "_source" : {
          "title" : "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "69844",
        "_score" : 3.952879,
        "_source" : {
          "title" : "Harry Potter and the Half-Blood Prince"
        }
      },
      {
        "_index" : "movies",
        "_type

#### Pagination
We can paginate results using the `from` and `size` fields.

Get documents from 100 to 119:

```
{
  "query": {
    "match": { "title": "harry" }
  },
  "from": 100,
  "size": 20
}
```

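NOTE: Deep pagination can kill performance (every result must be retrieved, collected and sorted). By default `from` + `size` cannot exceed 10,000 (the `index.max_result_window` setting). For deeper result sets use `search_after` or the scroll API instead of `from` and `size`.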
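
A small sketch of how a client could translate a page number into `from` and `size`. `page_params` is a hypothetical helper, pages are assumed to be 1-based, and the limit of 10,000 is the default value of the `index.max_result_window` setting (a cluster may be configured differently).

```python
MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window (assumed unchanged)


def page_params(page, size=20):
    """Translate a 1-based page number into `from`/`size` values."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    if start + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window")
    return {"from": start, "size": size}


# Page 6 with 20 hits per page corresponds to the request shown above.
body = {"query": {"match": {"title": "harry"}}, **page_params(6)}
print(body)
```

Rejecting oversized windows on the client side avoids a round trip that the server would refuse anyway.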

#### Sorting
We can sort using the `sort` field in the query:
```
GET movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match_phrase": {"title": "star trek"}}
      ],
      "should": [
      ], 
      "filter": [
        {"term": {"genres": "action"}}
      ]
    } 
  },
  "sort": [
    {
      "year": {
        "order": "desc"
      }
    }
  ],
  "_source": ["title", "year"]
}
```

There is also a shorthand notation that we can use in the query string: `sort=field_name`, e.g.
```
GET /movies/_search?sort=year
```

A string field that is **analyzed** for full-text search can't be used to sort documents. This is because it exists in the inverted index as individual terms, not as the entire string.

If we need to sort on an analyzed text field we would have to map a `keyword` copy of it that we can call `raw` and that we will access as `field_name.raw`, eg. `title.raw`:
```
PUT /movies
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "raw": {"type": "keyword"}
                }
            }
        }
    }
}
```

Now we will be able to sort using the new `title.raw` field:
```
curl --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} -XGET "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_search?sort=title.raw"
```
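
For reference, the same sort can be expressed in the request body instead of the query string. The sketch below only builds the JSON body and does not send it; `sort_clause` is a hypothetical helper, and `title.raw` is assumed to be mapped as in the example above.

```python
import json


def sort_clause(field, order="asc"):
    """Build one entry of the `sort` array."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {field: {"order": order}}


body = {
    "query": {"match_all": {}},
    "sort": [sort_clause("title.raw")],
    "_source": ["title"],
}

print(json.dumps(body, indent=2))
```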

To do this we have to recreate the index.

Close the index (a closed index is blocked for read/write operations):

## Compound queries
```
GET :index/_search
{
  "query": {
    "bool": {
      "must": [
      ],
      "must_not": [
      ],
      "should": [
      ],
      "filter": [
      ]
    } 
  }
}
```
We can build more complex queries by combining other queries:
- must: documents must match all the queries inside this block (AND); the matches contribute to the score
- must_not: documents must not match any query inside this block (NOT)
- should: documents that match one or more of the queries in this block score higher (OR); when there is no `must` or `filter` clause, at least one `should` query has to match
- filter: documents must match these queries, but the matches do not produce scores
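
When such queries are assembled programmatically, a tiny builder keeps the body tidy. This is a sketch, not a client API: `bool_query` is a hypothetical helper that drops empty clauses, and the example clauses (Sci-Fi movies without "trek" in the title, released from 2010 up to but not including 2015) assume the field names of the `movies` index used in this notebook.

```python
import json


def bool_query(must=(), must_not=(), should=(), filters=()):
    """Build a bool query, leaving out the clauses that are empty."""
    clauses = {
        "must": list(must),
        "must_not": list(must_not),
        "should": list(should),
        "filter": list(filters),
    }
    return {"bool": {name: queries for name, queries in clauses.items() if queries}}


body = {
    "query": bool_query(
        must=[{"match": {"genres": "Sci-Fi"}}],
        must_not=[{"match": {"title": "trek"}}],
        filters=[{"range": {"year": {"gte": 2010, "lt": 2015}}}],
    )
}

print(json.dumps(body, indent=2))
```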



## Queries vs Filters
The difference between the two is that filters do not return a score: a document either matches or it does not. The results can therefore be cached, and filters are generally faster than queries because **they check only if a document matches and not how well it matches**. In other words, filters give a boolean answer whereas queries return a calculated score of how well a document matches a query.

## Filters

```
{
  "query": {
    "bool": {
        "must": {"match": {"genres": "Sci-Fi"}},
        "must_not": {"match": {"title": "trek"}},
        "filter": {"range": {"year": {"gte": 2010, "lt": 2015}}}
    }
  }
}
```

#### Fuzzy matches
Fuzzy queries account for typos and misspellings.

The [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is used. For example, the following changes all have a Levenshtein distance of 1:
- Substitution of one character: e.g. potter -> pottir
- Insertion of one character: e.g. potter -> pottter
- Deletion of one character: e.g. potter -> poter
- Transposition of two adjacent characters: e.g. potter -> pottre (counted as a single edit by default)
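
The edit distance itself is easy to compute with dynamic programming. The sketch below is a plain Python illustration (the optimal string alignment variant), not the automaton-based implementation the search engine uses. `edit_distance` is a hypothetical helper; its `transpositions` flag mimics the fuzzy query option of the same name, which treats a swap of two adjacent characters as a single edit.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between two strings (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            swapped = (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            )
            if transpositions and swapped:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for typo in ("pottir", "pottter", "poter"):
    print(typo, edit_distance("potter", typo))
print("bottel", edit_distance("bottel", "bottle"))
print("bottel", edit_distance("bottel", "bottle", transpositions=False))
```

With transpositions counted as one edit, "bottel" is one edit away from "bottle"; in the strict Levenshtein sense the same typo costs two substitutions.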


```
{
 "query": {
    "fuzzy": { 
        "title": {
            "value": "pottre",
            "fuzziness": 2
        }
    }
 }
}
```

```
# fuzzy: we made a typo and wrote "bottel" instead of "bottle"
GET /movies/_search
{
 "query": {
    "fuzzy": { "title": "bottel"}
 }
}

# More fuzziness: fuzziness=2
GET /movies/_search
{
 "query": {
    "fuzzy": { 
        "title": {
            "value": "bottel",
            "fuzziness": 2
        }
    }
 }
}
```

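Possible values for `fuzziness`: `0`, `1`, `2` or `AUTO`. With `AUTO` the allowed edit distance depends on the length of the term:
- 0 for strings of up to 2 characters (exact match)
- 1 for strings of 3-5 characters
- 2 for anything longer

When the fuzziness value is not given, `AUTO` is used.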
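
The length thresholds can be written down as a small function. `auto_fuzziness` is a hypothetical name used only for this sketch; it reproduces the default `AUTO` thresholds (terms up to 2 characters, 3 to 5 characters, and longer) and does not model the configurable `AUTO:low,high` form.

```python
def auto_fuzziness(term):
    """Edit distance allowed by fuzziness=AUTO for a term of this length."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1
    return 2


for term in ("to", "harry", "potter"):
    print(term, auto_fuzziness(term))
```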

#### Partial matching
Prefix queries on strings:
(if year is a string)
```
{
    "query": {
        "prefix": {
            "year": "201"
        }
    }
}
```

Wildcard queries
```
{
    "query": {
        "wildcard": {
            "text": "makethe*"
        }
    }
}
```

Regexp queries
```
{
    "query": {
        "regexp": {
            "text": "makethe......arealit[a-z]"
        }
    }
}
```
Note: the regexp should be written in lower case. The text field is analyzed and its terms are converted to lower case, whereas the pattern itself is not analyzed.

Lucene’s regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line). To match a term, the regular expression must match the entire string.
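
The "must match the entire string" rule behaves like Python's `re.fullmatch`, as opposed to `re.search`. The sketch below uses Python's own `re` module purely as an analogy: Lucene's regular expression syntax differs in several operators, and the term `makethedreamsareality` is a made-up example, not a value from the dataset.

```python
import re


def regexp_matches(pattern, term):
    """True only if the pattern covers the whole term, as regexp queries require."""
    return re.fullmatch(pattern, term) is not None


term = "makethedreamsareality"  # hypothetical indexed term

print(regexp_matches("makethe......arealit[a-z]", term))  # whole term is covered
print(regexp_matches("makethe", term))                    # only a prefix: no match
print(regexp_matches("makethe.*", term))                  # prefix plus .* covers the rest
```

A pattern that should behave like a "contains" search therefore has to be wrapped explicitly, for example as `.*dreams.*`.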

Regular expression syntax:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html

#### Boosting queries
Returns documents matching a positive query while reducing the relevance score of documents that also match a negative query.

You can use the boosting query to demote certain documents without excluding them from the search results.

negative_boost: Floating point number between 0 and 1.0 used to decrease the relevance scores of documents matching the negative query.
```
GET /_search
{
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "text": "apple"
        }
      },
      "negative": {
        "term": {
          "text": "pie tart fruit crumble tree"
        }
      },
      "negative_boost": 0.5
    }
  }
}
```
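
The effect on the score can be sketched in a couple of lines: a document that matches only the positive query keeps its score, and one that also matches the negative query has its score multiplied by `negative_boost`. `boosting_score` is a hypothetical helper, and the input scores below are made-up numbers.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Final score of a hit under a boosting query."""
    if not 0 <= negative_boost <= 1:
        raise ValueError("negative_boost must be between 0 and 1.0")
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


print(boosting_score(4.0, matches_negative=False, negative_boost=0.5))
print(boosting_score(4.0, matches_negative=True, negative_boost=0.5))
```

The demoted document stays in the result set, it just moves down the ranking.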

#### Constant Score Queries
A `constant_score` query wraps a filter query and gives every matching document the same relevance score, equal to the `boost` parameter. This is a valuable tool for isolating certain search terms and pairing them with a separate boost value.

```
GET /_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "type": "nginx" }
      },
      "boost": 1.5
    }
  }
}
```
So in this instance, you are giving any NGINX logs a greater value than others (presumably than other server logs like apache2 logs or IIS logs).

## Aggregations
We can also do aggregations (this is a very powerful feature).

The results are aggregated in `buckets` and we can then compute statistics per bucket, like the average of the ratings included in the bucket.

Example 1: Films per year since 2014
```
GET movies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"range": {"year": {"gte": "2014"}}}
      ]
    }
  },
  "aggs": {
    "per_year": {
      "terms": {
        "field": "year"
      }
    }
  },
  "size": 0
}
```
NOTE: we use size=0 because we do not want the results, just the aggregations

Example 2: Films per year since 2014: show just 2 buckets (the other buckets are not returned)
```
GET movies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"range": {"year": {"gte": "2014"}}}
      ]
    }
  },
  "aggs": {
    "per_year": {
      "terms": {
        "field": "year",
        "size": 2
      }
    }
  },
  "size": 0
}
```
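
What a `terms` aggregation computes can be mimicked locally with a counter. This is a sketch of the idea only: `terms_agg` is a hypothetical helper, the sample documents are invented, array-valued fields such as `genres` are not handled, and the ordering (highest `doc_count` first, ties broken by key) follows the default behaviour of the real aggregation.

```python
from collections import Counter


def terms_agg(docs, field, size=10):
    """Bucket documents by the value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


# Invented sample documents, already filtered to year >= "2014".
docs = [{"year": y} for y in ["2014", "2015", "2015", "2016", "2016", "2016"]]

print(terms_agg(docs, "year"))
print(terms_agg(docs, "year", size=2))  # only the two largest buckets
```

As in Example 2, a smaller `size` does not change the counts, it only limits how many buckets come back.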

#### Count
We can use the Count API to get the number of matches for a search query.

```
GET /:index/_count?q=title:potter
```

In [3]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_count"

{"count":9742,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}

In [4]:
%%bash

curl --silent --insecure -u ${OPENSEARCH_USER}:${OPENSEARCH_PASSWD} \
    -X GET -H "Content-Type: application/json" \
    "https://${OPENSEARCH_HOST}:${OPENSEARCH_PORT}/movies/_count?q=title:potter"

{"count":11,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0}}

## And there is much more

For a complete reference of all the types of queries available look at:
- [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl.html)

In the left menu you have an overview of the different types of queries and the options available in each of them (there are really a lot of them):

- Compound queries
  - Boolean
  - Boosting
  - Constant score
  - Disjunction max
  - Function score
- Full text queries
  - Intervals
  - Match
  - Match boolean prefix
  - Match phrase prefix
  - Combined fields
  - Multi-match
  - Query string
  - Simple query string
- Geo queries
  - Geo-bounding box
  - Geo-distance
  - Geo-polygon
  - Geo-shape
- Shape queries
- Joining queries
  - Nested
  - Has child
  - Has parent
  - Parent ID
- Match all
- Span queries
- Specialized queries
  - Distance feature
  - More like this
  - Percolate
  - Rank feature
  - Script
  - Script score
  - Wrapper
  - Pinned query
- Term-level queries
  - Exists
  - Fuzzy
  - IDs
  - Prefix
  - Range
  - Regexp
  - Term
  - Terms
  - Terms set
  - Wildcard
  
You can also read more about the different `Aggregations` at:
- [Aggregations](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-aggregations.html)


## References
- [Elasticsearch Cheat Sheet for developers (Elasticsearch 7.X)](https://elasticsearch-cheatsheet.jolicode.com/#es7)
- [Query DSL ES 7.10](https://www.elastic.co/guide/en/elasticsearch/reference/7.10/query-dsl.html)
- [Official documentation Elasticsearch 7.X](https://www.elastic.co/guide/en/elasticsearch/reference/7.x/index.html)
- [Queries](https://logz.io/blog/elasticsearch-queries/)