<u><center><h1>Elasticsearch</h1></center></u>
---
---

<img src="images/elastic_logo.png" width=60%>

In this part of the Elasticsearch lesson you will learn how to work with graphs and analyzers, find synonyms in a text, compute string metrics, handle plural and singular forms, use stemming and lemmatization, and exclude stop words from queries. We will start with graphs.

# Graph

Graph helps you discover how items in an Elasticsearch index are related. With Graph you can find the most meaningful connections between terms. First, though, you need to install the plugin that provides it (Graph is part of the X-Pack plugin).

Stop the Elasticsearch and Kibana instances if they are running. To download and install the plugin, run this command in the Elasticsearch directory:

    bin/elasticsearch-plugin install x-pack

Then install X-Pack for Kibana. In the Kibana folder, run:

    bin/kibana-plugin install x-pack

Now you need to disable authentication for Elasticsearch and Kibana. Open **`/elasticsearch/config/elasticsearch.yml`** and **`/kibana/config/kibana.yml`** and add the line below to both files:

    xpack.security.enabled: false

Then run Elasticsearch and Kibana again. 

In this example we will use the [IMDB 5000 Movie Dataset](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset).

Before importing the data, we need to create the mapping. For this we will use [dynamic templates](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/dynamic-templates.html).

In [None]:
# Import ES instance
from elasticsearch import Elasticsearch
es = Elasticsearch()

In [None]:
query = {
  "mappings": {
    "logs": {
      "dynamic_templates": [
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "fielddata": True
            }
          }
        }
      ]
    }
  }
}
es.indices.create(index='movies', body=query)

[**`fielddata`**](https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html) enables sorting, aggregations, and access to the string values of a field.

Now you are ready to import the data into ES with **`logstash`**; you already know how to do this.
<code>
input {  
  file {
    path => "/path/to/movie_metadata.csv"
    start_position => "beginning"    
  }
}
filter {  
  csv {
      separator => ","
      columns => ['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']
  }
}
output {  
    elasticsearch {
        action => "index"
        hosts => ["127.0.0.1:9200"]
        index => "movies"
        workers => 1
    }
    stdout {}
}
</code>

**Note: Don't forget to change `"/path/to/movie_metadata.csv"` to the correct path where you saved the `"movie_metadata.csv"` file.**

In [None]:
# Check imported data
es.search(index="movies", size=1)

As in the previous lesson, open Kibana at http://localhost:5601. Add the new index **`movies`**, disable **`Index contains time-based events`**, then press `Create`.

<img src="images/g01.png">

After X-Pack is installed, the Kibana menu shows a new **`Graph`** button; click on it.

<img src="images/g02.png">

Choose the **`movies`** index, press **`plus`** and add country as a `vertex`. In Graph, the terms we want to include are called `vertices`. The relationship between any two vertices is a `connection`. The `connection` summarizes the documents that contain both vertices' terms.
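
The idea of vertices and connections can be sketched in plain Python. This is only an illustration of raw co-occurrence counting, not how Graph is implemented (Graph can also weigh terms by significance); the sample documents and the `connections` helper are hypothetical.

```python
from collections import Counter

# Hypothetical sample documents; the real index has many more fields.
docs = [
    {"country": "USA", "genres": "Drama"},
    {"country": "USA", "genres": "Comedy"},
    {"country": "UK", "genres": "Drama"},
    {"country": "USA", "genres": "Drama"},
]

def connections(docs, field_a, field_b):
    """Count the documents that contain a term of field_a together with a term of field_b."""
    counts = Counter()
    for doc in docs:
        if field_a in doc and field_b in doc:
            counts[(doc[field_a], doc[field_b])] += 1
    return counts

edges = connections(docs, "country", "genres")
for (country, genre), n in edges.most_common():
    print(country, "--", genre, ":", n, "document(s)")
```

Each key of `edges` is a pair of vertices and each value is the number of documents behind that connection.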

<img src="images/g03.png">

Once you have added the vertex, you can configure it: set the color, icon and max terms.

<img src="images/g04.png">

Let's add one more vertex, **`genres`**.

<img src="images/g05.png">

And configure it.

<img src="images/g06.png">

Now find the relation between *`USA`* and other countries through *`genres`*.

<img src="images/g07.png">

You can select all vertices and add more terms.

<img src="images/g08.png">

You will see something like this.

<img src="images/g09.png">

You can also configure the graph in **`Settings`**. For example, let's disable **`Significant links`**, set **`Certainty`** to one, and see what we get.

<img src="images/g10.png">

<img src="images/g11.png">

As you can see, the graph is now hard to read. To see the connections of *`Italy`*, select it and press *`linked`*; you will get the list of terms.

<img src="images/g12.png">

---

# Analyzers

This section is devoted to analyzers, an important part of Elasticsearch. Creating and configuring them well is key to efficient search.

The main goal of an analyzer is to take a stream of characters overloaded with unnecessary detail, squeeze out its essence, and produce a list of tokens that reflects it. Analyzers can be specified per query, per field or per index. An analyzer consists of:

* [Character filters](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-charfilters.html). They run in the first stage, before tokenization, to remove or replace unnecessary characters. You do not have to specify any: an analyzer may have zero or more character filters.
* [Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-tokenizers.html). It splits text into words, phrases, or character sets by a special character or a set. An analyzer must have exactly one tokenizer.
* [Token filters](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-tokenfilters.html). These filters can modify tokens (e.g. convert them to lowercase), delete some of them (a "stopwords" list) or add new ones (a list of synonyms). An analyzer may have zero or more token filters.
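
The three stages can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Elasticsearch code: the functions `html_strip`, `letter_tokenizer`, `lowercase`, `stop` and `analyze` are hypothetical stand-ins for a character filter, a tokenizer and two token filters.

```python
import re

def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def letter_tokenizer(text):
    """Tokenizer: emit runs of letters, splitting on everything else."""
    return re.findall(r"[^\W\d_]+", text)

def lowercase(tokens):
    """Token filter: convert every token to lowercase."""
    return [t.lower() for t in tokens]

def stop(tokens, stopwords=frozenset({"the", "is", "a"})):
    """Token filter: remove tokens found in a (tiny, illustrative) stop list."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>The</b> Everest is a Mountain",
                 [html_strip], letter_tokenizer, [lowercase, stop])
print(tokens)  # ['everest', 'mountain']
```

Note that the order of token filters matters: here `stop` only works because `lowercase` runs before it.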

Elasticsearch already ships with the following built-in analyzers:

|Analyzer|Description|
|:---|------------|
|[`Standard`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-standard-analyzer.html)|It removes most punctuation, lowercases terms, and supports removing stop words. It is built using the [Standard Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-standard-tokenizer.html) with the [Standard Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-standard-tokenfilter.html), [Lower Case Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lowercase-tokenfilter.html) and [Stop Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-tokenfilter.html)(disabled by default).|
|[`Simple`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-simple-analyzer.html)|It splits the text into words at any non-alphabetic character and converts tokens to lowercase. It consists of the [Lower Case Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lowercase-tokenizer.html).|
|[`Whitespace`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-whitespace-analyzer.html)|It splits the text into words at whitespace characters and does not convert tokens to lowercase. It consists of the [Whitespace Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-whitespace-tokenizer.html).|
|[`Stop`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-analyzer.html)|It is like the **`Simple`** analyzer, but supports stop words. It consists of the [Lower Case Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lowercase-tokenizer.html) and the [Stop Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-tokenfilter.html).|
|[`Keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-keyword-analyzer.html)|It outputs the text without any transformation, as a single token. It consists of the [Keyword Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-keyword-tokenizer.html).|
|[`Pattern`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-pattern-analyzer.html)|It can split the text by regular expression. It consists of [Pattern Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-pattern-tokenizer.html), [Lower Case Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lowercase-tokenfilter.html) and [Stop Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-tokenfilter.html)(disabled by default).|
|[`Language`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lang-analyzer.html)|It splits the text into words, filters out "junk" and cuts off word endings considering the morphology of the current language.|
|[`Fingerprint`](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-fingerprint-analyzer.html)|This analyzer implements the [fingerprinting algorithm](https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth#fingerprint). It consists of the [Standard Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-standard-tokenizer.html), [Lower Case Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-lowercase-tokenfilter.html), [ASCII Folding Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-asciifolding-tokenfilter.html), [Stop Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-tokenfilter.html)(disabled by default) and [Fingerprint Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-fingerprint-tokenfilter.html).|

If none of these analyzers fits your needs, you can create a [custom analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-custom-analyzer.html) from [character filters](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-charfilters.html), a [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-tokenizers.html) and [token filters](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-tokenfilters.html).

You can try any analyzer without adding it to an index. For example, let's test the *`fingerprint analyzer`*:

In [None]:
data = {
  "analyzer": "fingerprint",
  "text": "Mount   Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres above sea level"
}
es.indices.analyze(body=data)

The following things are done:
* removed leading and trailing whitespace
* changed all characters to their lowercase representation
* removed all punctuation and control characters
* split the string into whitespace-separated tokens
* sorted the tokens and removed duplicates
* joined the tokens back together
* normalized extended western characters to their ASCII representation ("Sagarmāthā" → "Sagarmatha")
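
The steps listed above can be approximated in a few lines of plain Python. This is a rough sketch, not the analyzer itself: the hypothetical `fingerprint` function splits on every non-alphanumeric character, whereas the real Standard Tokenizer treats apostrophes and numbers such as "8,848" differently.

```python
import re
import unicodedata

def fingerprint(text):
    """Approximate the fingerprint steps: fold, lowercase, tokenize, dedupe, sort, join."""
    # Decompose accented characters and drop the combining marks ("ā" -> "a").
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lowercase and keep runs of ASCII letters/digits; this also removes
    # punctuation and surrounding whitespace.
    tokens = re.findall(r"[a-z0-9]+", folded.lower())
    # Remove duplicates, sort, and join back into a single string.
    return " ".join(sorted(set(tokens)))

print(fingerprint("Mount   Everest, also known as Sagarmāthā; mount everest"))
# also as everest known mount sagarmatha
```

Because duplicates and word order are discarded, two strings that differ only in casing, punctuation or word order end up with the same fingerprint.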

To see all parameters of the index, use the **`get`** method.

In [None]:
es.indices.get(index='movies')

As you can see, no analyzer is defined. Now let's create a new index with a different analyzer for each field.

In [None]:
query = {
    "settings": {
        "analysis": {
            "analyzer": {
                "posts_analyzer":{
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": "asciifolding"
                } 
            }
        }
    },
    "mappings": {
        "post" : {
            "properties" : {
                "title": {"type": "text", "analyzer": "posts_analyzer", "fielddata": True},
                "content": {"type": "text", "analyzer": "stop", "fielddata": True},
                "full_text": {"type": "text", "analyzer": "keyword", "fielddata": True}
            }           
        }
    }
}
es.indices.create(index='posts', body=query)

So the *`title`* field will use the custom *`posts_analyzer`* analyzer, *`content`* the *`stop`* analyzer and *`full_text`* the *`keyword`* analyzer. Let's add some data.

In [None]:
data = {
    "title": "About the mountains, part 1",
    "content": "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft).",
    'full_text': "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft)."
}
es.create(index='posts', doc_type='post', id=1, body=data)

To see how your data has been analyzed, you can use the **`docvalue_fields`** parameter. The next command works only if you set **`"fielddata": True`** when creating the index. Look for the ***fields*** key in the output; it shows how your data was analyzed.

In [None]:
es.search(index='posts', docvalue_fields=['title', 'content', 'full_text'])

The *`content`* and *`title`* fields are split into tokens, while *`full_text`* is not. If you search for the word **`the`** in the *`content`* field, for example, you will find nothing, because the *`stop`* analyzer removes it; if you search in *`title`*, you will get a match.

In [None]:
query = {
    "query": {
        "match" : {
            "content" : "the"
        }
    }
}
es.search(index='posts', body=query)

In [None]:
query = {
    "query": {
        "match" : {
            "title" : "the"
        }
    }
}
es.search(index='posts', body=query)

*Note: Try to find the word **`the`** in the **`full_text`** field and work out why it is not found.*

Now that you know what analyzers are and how to use them, you are ready to learn about ***stop words***.

---

## Stop words

Stop words are the most common words in a text, which search engines filter out when processing it. You can define your own stop words, give the path to a file with stop words, or use one of the predefined lists:

    _arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_,
    _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_,
    _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_
    
For example, the default _english_ stop words are: *a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with*. Let's add stop words to **`posts_analyzer`**. First you need to close the index; only then can you add the stop words.

In [None]:
es.indices.close(index='posts')

To change the settings you can use the [put_settings](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.put_settings) method and define the **`stopwords`** field in the analyzer. If the *stop* analyzer alone is enough, the code will look like this:
<code>
"settings": {
        "analysis": {
            "analyzer": {
                "posts_analyzer":{
                    "type": "stop",
                    "stopwords": ["\_english_", "about", "part"]
                } 
            }
        }
    }
</code>
But if you want to keep the previous analyzer, you need to define a stop filter and add it to the analyzer.

In [None]:
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "posts_analyzer":{
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["asciifolding", "my_stop_words"],
                } 
            },
            "filter": {
                "my_stop_words": {
                    "type": "stop",
                    "stopwords": ["_english_", "about", "part"]
                }
            }
        }
    }
}
es.indices.put_settings(index='posts', body=settings)

And open the closed index.

In [None]:
es.indices.open(index='posts')

Now add new data to see the difference.

In [None]:
data = {
    "title": "About the mountains, part 2",
    "content": "K2, also known as Mount Godwin-Austen or Chhogori is the second highest mountain in the world, after Mount Everest, at 8,611 metres (28,251 ft) above sea level. It is located on the China-Pakistan border between Baltistan, in the Gilgit–Baltistan region of northern Pakistan, and the Taxkorgan Tajik Autonomous County of Xinjiang, China. K2 is the highest point of the Karakoram range and the highest point in both Pakistan and Xinjiang.",
    'full_text': "K2, also known as Mount Godwin-Austen or Chhogori is the second highest mountain in the world, after Mount Everest, at 8,611 metres (28,251 ft) above sea level. It is located on the China-Pakistan border between Baltistan, in the Gilgit–Baltistan region of northern Pakistan, and the Taxkorgan Tajik Autonomous County of Xinjiang, China. K2 is the highest point of the Karakoram range and the highest point in both Pakistan and Xinjiang."
}
es.create(index='posts', doc_type='post', id=2, body=data)

In [None]:
es.search(index='posts', docvalue_fields=['title', 'content', 'full_text'])

If you compare the *title* of `id=1` and `id=2`, you can see that they are different. In `id=2` only *'mountains'* is left.

So, what do you think you will find if you search for the word **`mountain`** in the **`title`**? You are absolutely right! Nothing!

In [None]:
query = {
    "query": {
        "match" : {
            "title" : "mountain"
        }
    }
}
es.search(index='posts', body=query)
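
Why nothing is found can be reproduced without a cluster. The sketch below is a simplified stand-in for `posts_analyzer` (it ignores ASCII folding), and the helpers `analyze_title` and `matches` are hypothetical; the stop list is the _english_ list quoted earlier in this lesson.

```python
import re

# The _english_ stop words as listed in this lesson.
ENGLISH = set(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def analyze_title(text, extra_stopwords=()):
    """Lowercase letter tokens (digits are dropped), minus the stop words."""
    stopwords = ENGLISH | set(extra_stopwords)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

def matches(query, indexed_tokens, extra_stopwords=()):
    """A match needs at least one analyzed query term to equal an indexed term exactly."""
    return any(t in indexed_tokens for t in analyze_title(query, extra_stopwords))

custom = ["about", "part"]
indexed = analyze_title("About the mountains, part 2", custom)
print(indexed)                                 # ['mountains']
print(matches("mountain", indexed, custom))    # False
print(matches("mountains", indexed, custom))   # True
```

The indexed term is `mountains`, and term matching is exact, so the singular `mountain` has nothing to match against.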

---

## Plural/Singular

To solve the ***plural/singular*** problem, Elasticsearch provides two filters: [snowball](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-snowball-tokenfilter.html) and [porter_stem](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-porterstem-tokenfilter.html). You can read about the Snowball algorithm [here](http://snowball.tartarus.org/texts/introduction.html) and about the Porter algorithm [here](https://tartarus.org/martin/PorterStemmer/). 

Let's change our analyzer and apply the *snowball* filter.

In [None]:
# Close the index
es.indices.close(index='posts')

# Implement the snowball filter
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "posts_analyzer":{
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["asciifolding", "my_stop_words", "my_snowball_filter"],
                } 
            },
            "filter": {
                "my_stop_words": {
                    "type": "stop",
                    "stopwords": ["_english_", "about", "part"]
                },
                "my_snowball_filter": {
                    "type": "snowball",
                    "language": "English"
                }
            }
        }
    }
}
es.indices.put_settings(index='posts', body=settings)

# Open the index
es.indices.open(index='posts')

Insert the data.

In [None]:
data = {
    "title": "About the mountains, part 3",
    "content": "Kangchenjunga, also spelled Kanchenjunga, is the third highest mountain in the world, and lies partly in Nepal and partly in Sikkim, India. It rises with an elevation of 8,586 m (28,169 ft) in a section of the Himalayas called Kangchenjunga Himal that is limited in the west by the Tamur River, in the north by the Lhonak Chu and Jongsang La, and in the east by the Teesta River.",
    'full_text': "Kangchenjunga, also spelled Kanchenjunga, is the third highest mountain in the world, and lies partly in Nepal and partly in Sikkim, India. It rises with an elevation of 8,586 m (28,169 ft) in a section of the Himalayas called Kangchenjunga Himal that is limited in the west by the Tamur River, in the north by the Lhonak Chu and Jongsang La, and in the east by the Teesta River."
}
es.create(index='posts', doc_type='post', id=3, body=data)

Check the changes.

In [None]:
query = {
    "query": {
        "match" : {
            "title" : "mountain"
        }
    }
}
es.search(index='posts', body=query)

Everything works and we found the matches. The ***porter_stem*** filter cuts word endings and suffixes and is similar to the ***snowball*** filter. Remember that the ***porter_stem*** filter works only if a ***lowercase*** filter (or the lowercase tokenizer) comes before it. Practice with the ***porter_stem*** filter on your own. Now you know how to make search work across ***plural/singular*** forms!
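
To get a feeling for what such a filter does to plurals, here is a sketch of only the first step (step 1a) of the original Porter algorithm in plain Python. The function name `porter_step_1a` is hypothetical, the remaining steps of the algorithm are omitted, and the Snowball English stemmer differs in some details.

```python
def porter_step_1a(word):
    """Step 1a of the Porter algorithm: handle plural endings of a lowercase word."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # mountains -> mountain
    return word

for word in ["mountains", "mountain", "caresses", "ponies", "caress"]:
    print(word, "->", porter_step_1a(word))
```

Both `mountains` and `mountain` end up as `mountain`, which is why the query and the indexed title now meet on the same term. The suffix checks assume lowercase input, which is the reason for the lowercase requirement mentioned above.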

---
## Synonyms

Imagine that a user searches for ***trousers*** on a website while the database only contains ***pants***. The user finds nothing. The solution is to create a dictionary of synonyms and use the [Synonym Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-synonym-tokenfilter.html). 

First you need to create a new index with the filter. You can define the synonyms inside the filter like this:

In [None]:
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "synonyms_analyzer":{
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": "my_synonyms",
                } 
            },
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "trousers, jeans => pants",
                        "jacket, blue jacket, red jacket"
                    ]
                }
            }
        }
    },
    "mappings": {
        "type" : {
            "properties" : {
                "title": {"type": "text", "analyzer": "synonyms_analyzer", "fielddata": True},
            }           
        }
    }
}

es.indices.create(index='clothes', body=settings)

Alternatively, you can use a path to a synonyms file. In the filter, replace the ***synonyms*** list with ***"synonyms_path" : "path/to/synonyms.txt"***.

***HINT:*** The file `analysis/synonyms.txt` must be present on each node of the cluster. The file format should be as follows:

<pre>
# Blank lines and lines starting with pound are comments.

# Explicit mappings match any token sequence on the LHS of "=>"
# and replace with all alternatives on the RHS.  These types of mappings
# ignore the expand parameter in the schema.
# Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit

# Equivalent synonyms may be separated with commas and give
# no explicit mapping.  In this case the mapping behavior will
# be taken from the expand parameter in the schema.  This allows
# the same synonym file to be used in different synonym handling strategies.
# Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos

# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod

# Multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
# is equivalent to
foo => foo bar, baz
</pre>
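
The rules in this file format can be made concrete with a small parser. This is a minimal sketch in plain Python, not the parser Elasticsearch uses; the helper names `split_phrases` and `parse_synonyms` are hypothetical, and the result is simply a dictionary from each input phrase to its replacement phrases.

```python
def split_phrases(part):
    """Split a comma-separated side of a rule, ignoring empty entries."""
    return [p.strip() for p in part.split(",") if p.strip()]

def parse_synonyms(lines, expand=True):
    """Map every input phrase to the list of phrases it is replaced with."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments
        if "=>" in line:
            # Explicit mapping: LHS phrases are replaced by all RHS alternatives.
            lhs, rhs = line.split("=>", 1)
            sources, targets = split_phrases(lhs), split_phrases(rhs)
        else:
            # Equivalent synonyms: behaviour depends on the expand parameter.
            sources = split_phrases(line)
            targets = sources if expand else sources[:1]
        for source in sources:
            merged = mapping.setdefault(source, [])
            for target in targets:
                if target not in merged:
                    merged.append(target)  # multiple entries are merged
    return mapping

rules = [
    "# comment",
    "i-pod, i pod => ipod,",
    "universe , cosmos",
    "foo => foo bar",
    "foo => baz",
]
for phrase, replacements in parse_synonyms(rules).items():
    print(repr(phrase), "=>", replacements)
```

Calling `parse_synonyms(["ipod, i-pod, i pod"], expand=False)` maps all three phrases to `["ipod"]` only, which corresponds to the `expand==false` case described in the file comments.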

Now add the data.

In [None]:
data = {
    "title": "pants",
}

es.create(index='clothes', doc_type='type', id=1, body=data)

data = {
    "title": "jacket",
}

es.create(index='clothes', doc_type='type', id=2, body=data)

Let's try to find the matches with synonyms.

In [None]:
query = {
    "query": {
        "match" : {
            "title" : "blue jacket"
        }
    }
}

es.search(index='clothes', body=query)

Done, we found it! Synonyms help to find the best matches, so if you want better search results, don't forget about them.

---

## Stemmer

A stemmer is an NLP algorithm that reduces a word to its stem, for example: *going -> go, mountains -> mountain*. Mapping irregular forms such as *had -> have* or *mice -> mouse* is the job of lemmatization, which algorithmic stemmers generally do not handle. You already know one stemmer: *porter_stem*, which we discussed in the *plural/singular* section. Elasticsearch also has the more general [Stemmer Token Filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stemmer-tokenfilter.html). Sometimes there are words which you don't want to stem at all, or which you want to stem in a different way. Let's create a filter and customize it.

In [None]:
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "texts_analyzer":{
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["my_exceptions", "my_rulles", "my_stemmer"],
                } 
            },
            "filter": {
                "my_stemmer": {
                    "type": "stemmer",
                    "name" : "english",
                },
                "my_exceptions": {
                    "type": "keyword_marker",
                    "keywords": ["lhotse", "changtse", "autonomous"]
                },
                "my_rulles": {
                    "type": "stemmer_override",
                    "rules": [ 
                                "mountain => mount",
                             ]
                }
            }
        }
    },
    "mappings": {
        "text" : {
            "properties" : {
                "title": {"type": "text", "analyzer": "keyword", "fielddata": True},
                "content": {"type": "text", "analyzer": "texts_analyzer", "fielddata": True},
            }           
        }
    }
}

es.indices.create(index='texts', body=settings)

We created the stemmer filter with the name ***my_stemmer*** and set its ***name*** to ***english***. The full list of supported languages is [here](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stemmer-tokenfilter.html). 

Next we created the ***my_exceptions*** filter with ***"type": "keyword_marker"*** and added to its ***keywords*** list the words that the stemmer should skip. 

In the last step we created the ***my_rulles*** filter to define our own stemming rules. In the example above, every occurrence of ***mountain*** will be converted to ***mount***.
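
How the three filters cooperate can be sketched in plain Python. This is an illustration only: `toy_stem` is a deliberately crude, hypothetical stand-in for the real English stemmer, and `analyze` mimics the filter chain by letting protected and overridden tokens bypass stemming.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strip a trailing 'ous' or 's' from longer words."""
    for suffix in ("ous", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def analyze(tokens, keywords, rules):
    """Apply keyword protection, then override rules, then the stemmer."""
    result = []
    for token in tokens:
        if token in keywords:
            result.append(token)          # keyword_marker: left untouched
        elif token in rules:
            result.append(rules[token])   # stemmer_override: custom mapping, not stemmed further
        else:
            result.append(toy_stem(token))
    return result

keywords = {"lhotse", "changtse", "autonomous"}
rules = {"mountain": "mount"}
print(analyze(["autonomous", "mountain", "peaks", "regions"], keywords, rules))
# ['autonomous', 'mount', 'peak', 'region']
```

Without the keyword list, `toy_stem("autonomous")` would return `autonom`, which is the kind of unwanted change the exceptions are meant to prevent.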

Insert the data and check that everything works as intended.

In [None]:
data = {
    "title": "Everest",
    "content": "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft).",
}

es.create(index='texts', doc_type='text', id=1, body=data)

In [None]:
es.search(index='texts', docvalue_fields=['title', 'content'])

As you can see, everything works. The stemmer is a powerful and useful filter, but you need to be careful with it.

---

## Similarity or string metrics

Similarity defines how the score of a found result is computed, that is, how well the result matches the search query. Elasticsearch provides seven similarities (the full list is [here](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/index-modules-similarity.html#_available_similarities)). Only two of them can be used without additional configuration: ***BM25*** (the default) and ***classic***. The others must first be defined in the index settings and then applied to the index or to individual fields. To set the ***classic*** similarity for all fields of an index, you can do this:

<pre>
settings = {
    "settings": {
        "similarity" : {
          "default" : {
            "type" : "classic"
          }
        }
    },
    "mappings": {
        "text" : {
            "properties" : {
                "title": {"type": "text"},
                "content": {"type": "text"},
            }           
        }
    }
}

es.indices.create(index='texts', body=settings)
</pre>

Or you can set a different similarity for each field:

<pre>
settings = {
    "mappings": {
        "text" : {
            "properties" : {
                "title": {"type": "text", "similarity": "BM25"},
                "content": {"type": "text", "similarity": "classic"},
            }           
        }
    }
}

es.indices.create(index='texts', body=settings)
</pre>

Now let's create a configured similarity. In this example we will use the [IB similarity](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/index-modules-similarity.html#ib). But first, let's delete the previous index and create a new one.

In [None]:
# Deleting the index
es.indices.delete(index='texts')

# Creating a new one
settings = {
    "settings": {
        "similarity" : {
          "my_similarity" : {
            "type" : "IB",
            "distribution" : "spl",
            "lambda" : "df",
            "normalization" : "no"
          }
        }
    },
    "mappings": {
        "text" : {
            "properties" : {
                "title": {"type": "text"},
                "content": {"type": "text", "similarity": "my_similarity"},
            }           
        }
    }
}

es.indices.create(index='texts', body=settings)

Insert the data.

In [None]:
data = {
    "title": "Everest",
    "content": "Mount Everest, also known in Nepal as Sagarmāthā and in Tibet as Chomolungma, is Earth's highest mountain. Its peak is 8,848 metres (29,029 ft) above sea level. Mount Everest is located in the Mahalangur mountain range in Nepal. The international border between China (Tibet Autonomous Region) and Nepal runs across Everest's precise summit point. Its massif includes neighbouring peaks Lhotse, 8,516 m (27,940 ft); Nuptse, 7,855 m (25,771 ft) and Changtse, 7,580 m (24,870 ft).",
}

es.create(index='texts', doc_type='text', id=1, body=data)


data = {
    "title": "Kangchenjunga",
    "content": "Kangchenjunga, also spelled Kanchenjunga, is the third highest mountain in the world, and lies partly in Nepal and partly in Sikkim, India. It rises with an elevation of 8,586 m (28,169 ft) in a section of the Himalayas called Kangchenjunga Himal that is limited in the west by the Tamur River, in the north by the Lhonak Chu and Jongsang La, and in the east by the Teesta River.",
}

es.create(index='texts', doc_type='text', id=2, body=data)

And search for matches.

In [None]:
query = {
    "query": {
        "match" : {
            "content" : "Mount Everest is located in the Mahalangur mountain range in Nepal"
        }
    }
}

es.search(index='texts', body=query)

Look at the ***'_score'*** value of each match: with different similarities the values will be different.
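
To see why the scores differ, it helps to compare the two built-in formulas for a single term. The functions below are a simplified sketch in plain Python, assuming the usual BM25 defaults (`k1=1.2`, `b=0.75`) and a reduced form of the classic TF/IDF formula without query normalization, coordination factor or boosts; the function names are hypothetical and the numbers will not equal the `_score` values returned by Elasticsearch.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 for one term: idf times a term-frequency part that saturates."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

def classic_term_score(freq, doc_len, n_docs, doc_freq):
    """Simplified classic TF/IDF for one term: sqrt(tf) * idf^2 * length norm."""
    idf = 1 + math.log(n_docs / (doc_freq + 1))
    tf = math.sqrt(freq)
    length_norm = 1 / math.sqrt(doc_len)
    return tf * idf * idf * length_norm

# Hypothetical statistics: 2 documents, the term occurs in 1 of them.
for freq in (1, 2, 10, 100):
    print(freq,
          round(bm25_term_score(freq, 80, 80, 2, 1), 3),
          round(classic_term_score(freq, 80, 2, 1), 3))
```

With growing term frequency the BM25 value approaches an upper bound, while the classic value keeps growing with the square root of the frequency; this is one reason the two similarities rank the same documents differently.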

---

> # Exercise 1
> Find the relation between ***Australia*** and other countries through ***content_rating***. Note: disable the identification of "significant" terms, so that simply popular terms are used, and set the minimum number of documents required as evidence before introducing a related term to 2. You should get the following ***graph***.
<img src="images/l02_ex01.png">

---

> # Exercise 2
> Create the [synonyms filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-synonym-tokenfilter.html) using a file. 

In [None]:
# type your code here 

---

> # Exercise 3
> Create the [stopwords filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.x/analysis-stop-tokenfilter.html) using a file. Use the ***stopwords_path*** field instead of ***stopwords***. Each word in the file must be on its own line.
<img src="images/l02_ex03.png">

In [None]:
# type your code here 