![arangodb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoDB_logo.png?raw=1)

# Fuzzy Search 

<a href="https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/FuzzySearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.
Check this [ArangoSearch notebook](https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/ArangoSearch.ipynb) for an introduction to ArangoSearch.

When dealing with real-world text retrieval, we often care not only about exact matches to our search phrase but also need to account for typos or alternative spellings.
“Fuzzy search” is an umbrella term referring to a set of algorithms for such approximate matching. Usually such algorithms evaluate a similarity measure that shows how close a search term is to the items in a dictionary. A search engine can then decide which results to show first.

In this notebook we will look at two different implementations of fuzzy search in [ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/):
* [Levenshtein distance](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#levenshtein_match)
* [NGram similarity](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#ngram_match)
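
To make the idea of a similarity measure concrete, here is a minimal pure-Python sketch of the Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The function name `levenshtein` and the small `dictionary` list are illustrative assumptions; this is not how ArangoSearch implements matching internally.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (illustrative helper)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


# Rank a small, made-up dictionary by closeness to a misspelled term.
dictionary = ["galaxy", "gallery", "salary"]
ranked = sorted(dictionary, key=lambda word: levenshtein("galxy", word))
print(ranked)
```

A smaller distance means a closer match, so sorting by distance puts the most plausible correction first.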

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on Oasis, ArangoDB's managed service.

In [1]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [2]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [3]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="ArangoSearchIMDBTutorial", credentialProvider="https://d383fa0b596a.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [4]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://d383fa0b596a.arangodb.cloud:8529
Username: TUT3wlz5tr4w4aetcuylucetk
Password: TUT1k1fer0o2ld0ytakem8m08e
Database: TUTza6c80ni8cjraqxvvr6web


Feel free to use the above URL to check out the web UI!

##  IMDB Example Dataset

![imdb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/IMDB_graph.png?raw=1)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, ... as a graph.
Note: the included arangorestore only works on Linux or Windows systems. If you want to run this notebook on a different OS, please use the appropriate arangorestore from the [Download area](https://www.arangodb.com/download-major/).

## Linux:

In [29]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/imdb"

2020-06-17T18:25:16Z [436] INFO [05c30] {restore} Connected to ArangoDB 'http+ssl://d383fa0b596a.arangodb.cloud:8529'
2020-06-17T18:25:16Z [436] INFO [3b6a4] {restore} no properties object
2020-06-17T18:25:16Z [436] INFO [9b414] {restore} # Re-creating document collection 'imdb_vertices'...
2020-06-17T18:25:16Z [436] INFO [9b414] {restore} # Re-creating edge collection 'imdb_edges'...
2020-06-17T18:25:17Z [436] INFO [6d69f] {restore} # Dispatched 2 job(s), using 2 worker(s)
2020-06-17T18:25:17Z [436] INFO [d88c6] {restore} # Creating indexes for collection 'imdb_vertices'...
2020-06-17T18:25:17Z [436] INFO [94913] {restore} # Loading data into edge collection 'imdb_edges', data size: 48957903 byte(s)
2020-06-17T18:25:17Z [436] INFO [94913] {restore} # Loading data into document collection 'imdb_vertices', data size: 22665786 byte(s)
2020-06-17T18:25:22Z [436] INFO [75e65] {restore} # Current restore progress: restored

# Create First View

An ArangoSearch view contains references to documents stored in different collections.
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [32]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb',
    properties={'cleanupIntervalStep': 0}
)

ViewCreateError: ignored

Let us check it is actually there:

In [33]:
print(database["v_imdb"])

<StandardCollection v_imdb>


As of now this view is empty, so we need to link it to a collection (i.e., imdb_vertices).

In [42]:
print(database.analyzers())

[{'name': 'text_zh', 'type': 'text', 'properties': {'locale': 'zh.utf-8', 'case': 'lower', 'stopwords': [], 'accent': False, 'stemming': True}, 'revision': 0, 'features': ['position', 'norm', 'frequency']}, {'name': 'text_sv', 'type': 'text', 'properties': {'locale': 'sv.utf-8', 'case': 'lower', 'stopwords': [], 'accent': False, 'stemming': True}, 'revision': 0, 'features': ['position', 'norm', 'frequency']}, {'name': 'text_en', 'type': 'text', 'properties': {'locale': 'en.utf-8', 'case': 'lower', 'stopwords': [], 'accent': False, 'stemming': True}, 'revision': 0, 'features': ['position', 'norm', 'frequency']}, {'name': 'text_ru', 'type': 'text', 'properties': {'locale': 'ru.utf-8', 'case': 'lower', 'stopwords': [], 'accent': False, 'stemming': True}, 'revision': 0, 'features': ['position', 'norm', 'frequency']}, {'name': 'text_fi', 'type': 'text', 'properties': {'locale': 'fi.utf-8', 'case': 'lower', 'stopwords': [], 'accent': False, 'stemming': True}, 'revision': 0, 'features': ['pos

In [66]:
# Drop any previous version of the analyzer so this cell can be re-run safely.
database.delete_analyzer('test_analyzer', ignore_missing=True)

# Create a text analyzer that also emits edge n-grams (3 to 8 characters) per token.
database.create_analyzer(
    name='test_analyzer',
    analyzer_type='text',
    properties={
        'locale': 'en.utf-8',
        'case': 'lower',
        'stopwords': [],
        'accent': False,
        'stemming': True,
        'edgeNgram': {'min': 3, 'max': 8, 'preserveOriginal': True}
    },
    features=['frequency', 'norm', 'position']
)


# Scratch notes: the arguments of a comparable analyzer definition in arangosh.
# ("text_edge_ngrams", "text", {
#   edgeNgram: { min: 3, max: 8, preserveOriginal: true },
#   locale: "en.utf-8",
#   case: "lower",
#   accent: false,
#   stemming: false,
#   stopwords: [ "the" ]
# }, ["frequency", "norm", "position"])

# Unused alternative analyzer, kept for reference.
# database.create_analyzer(
#     name='bigram1',
#     analyzer_type='text',
#     properties={},
#     features=["frequency", "norm", "position"]
# )

{'features': ['position', 'norm', 'frequency'],
 'name': 'TUTza6c80ni8cjraqxvvr6web::test_analyzer',
 'properties': {'accent': False,
  'case': 'lower',
  'edgeNgram': {'max': 8, 'min': 3, 'preserveOriginal': True},
  'locale': 'en.utf-8',
  'stemming': True,
  'stopwords': []},
 'revision': 13,
 'type': 'text'}

In [68]:
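# Index all fields with the default analyzer. A field only matches a query
# if it was indexed with the analyzer that the query names, so "title" and
# "description" are additionally indexed with the analyzers used below.
link = {
    "includeAllFields": True,
    "fields": {
        "title": {"analyzers": ["test_analyzer"]},
        "description": {"analyzers": ["test_analyzer", "text_en"]}
    }
}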


database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)

{'cleanup_interval_step': 0,
 'commit_interval_msec': 1000,
 'consolidation_interval_msec': 10000,
 'consolidation_policy': {'min_score': 0,
  'segments_bytes_floor': 2097152,
  'segments_bytes_max': 5368709120,
  'segments_max': 10,
  'segments_min': 1,
  'type': 'tier'},
 'global_id': 'c60023776/',
 'id': '60023776',
 'links': [{}],
 'name': 'v_imdb',
 'primary_sort': [],
 'type': 'arangosearch',
 'writebuffer_active': 0,
 'writebuffer_idle': 64,
 'writebuffer_max_size': 33554432}

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoSearch_Arch.jpg?raw=1)

By now our view should be ready, so let us issue the first query and look for short drama movies.

In [69]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH d.type == "Movie" 
    AND 
    d.genre == "Drama" 
    AND 
    d.runtime IN 10..50 
    RETURN d.title
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Wellcome
Rosemarie Nitribitt - Tod einer Edelhure
Wiatr
Primavera
Lücken im Gedankenstrom
Dr. Jekyll and Mr. Hyde
Breaking Glass
Pulsar
Frühlings Erwachen - Eine Kindertragödie
Glastage
Sunday in August
Land gewinnen
À San Remo
Carne
Dr. Jekyll and Mr. Hyde
Room 10
Zwischen Flieder wandern und singen
Alias
Antoine et Colette
Edison's Frankenstein
Silvester Home Run
Bis zur Unendlichkeit
Space Riders
True
Kurz:Ivan
Dreamcatcher
The Kolaborator
Rounds
Melissa
Hotel Chevalier
Another Lady Innocent
The Wiggles: Wiggle Bay
Good Night
Crin blanc: Le cheval sauvage
VeggieTales: An Easter Carol


## NGram Match

Let us start by using the NGram match to find a misspelled movie title.

In [73]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb SEARCH NGRAM_MATCH(d.title, 'Str War', 0.7, 'test_analyzer')
SORT BM25(d) DESC
RETURN d.title"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))



AQLQueryExecuteError: ignored

In this query we set up a typical FOR loop and iterate over the previously created view (stored values are not required for fuzzy search). Then we use the NGRAM_MATCH search function to compare the movie titles in our view against the search phrase and find similar ones. The 0.7 is the threshold: how much ‘fuzziness’ or wiggle room we want to give the search.

The threshold indicates how far from our supplied phrase the results are allowed to stray. It must be between 0 and 1, and the closer it is to 1, the more closely the results have to match the phrase. The last argument is the analyzer we are using, which deserves a little further explanation.

The explanation below assumes an ngram analyzer configured with a min and max of 2, which means it looks at words 2 letters at a time. Note that the test_analyzer created above is instead a text analyzer with edgeNgram settings of min 3 and max 8. Splitting words this way is useful for determining the longest common sequence and context. The idea behind n-gram matching is to search for similar words, not necessarily exact matches. One of the simplest ways of calculating the similarity between two words is to compute the longest common subsequence (LCS) of letters: the longer the LCS, the more similar the words. However, this approach has one big disadvantage: absence of context. For example, the words "connection" and "fonetica" have a long LCS (o-n-e-t-i) but very different meanings. To add some context, n-gram sequences are used.

Each word is split into a series of letter groups, and these groups are then matched. If we take the same words but calculate similarity based on 3-grams (an ngram analyzer with a min and max of 3), we get a better similarity measure: con-onn-nne-nec-ect-cti-tio-ion vs. fon-one-net-eti-tic-ica gives a much shorter LCS (zero matches). To get rid of length differences, we normalize the LCS length by word length. This yields a rating between 0 (no match at all) and 1 (fully matched). ArangoSearch exposes this n-gram based rating through the NGRAM_MATCH function.

This functionality is why we are still able to get relevant results even with misspelled words.
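
The n-gram rating described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not ArangoSearch's actual implementation: the helper names `ngrams`, `lcs_length`, and `ngram_similarity` are made up for this notebook, and the score is normalized by the number of n-grams in the search target.

```python
def ngrams(word, n):
    """Split a word into its overlapping n-grams, e.g. 'star' -> ['st', 'ta', 'ar']."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        cur = [0]
        for j, item_b in enumerate(b, 1):
            if item_a == item_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ngram_similarity(target, candidate, n=2):
    """Rating between 0 and 1: matching n-grams divided by the target's n-gram count."""
    target_grams = ngrams(target.lower(), n)
    candidate_grams = ngrams(candidate.lower(), n)
    if not target_grams:
        return 0.0
    return lcs_length(target_grams, candidate_grams) / len(target_grams)


# Letters alone suggest similarity, 3-grams do not.
print(lcs_length("connection", "fonetica"))
print(ngram_similarity("connection", "fonetica", n=3))
# A misspelled title still rates highly against the correct one.
print(ngram_similarity("Str War", "Star Wars", n=2))
```

With 2-grams, the misspelled 'Str War' shares five of its six n-grams with 'Star Wars', which is above the 0.7 threshold used in the query.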

## Levenshtein Match

Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
In the next example, we are looking for the word sequence "in <any word> galaxy" in the description of a movie.
Feel free to try other values!

In [None]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
  SEARCH PHRASE(
    d.description, 
    "in", 
    1, 
    "galaxy", 
    "text_en"
    ) 
    RETURN {
      title: d.title, 
      description: d.description
      }
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

In [None]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR doc IN v_imdb
    SEARCH NGRAM_MATCH(
      doc.description, 
      'galaxy', 
      'text_en'
      )
    RETURN doc
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## Ranking and Document Relevance

Great, now we can identify documents containing a specific phrase,
but especially with large document bases we need to be able to rank documents based on their relevance.
ArangoSearch supports the following two schemes:

* [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

* [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

You can learn more about ranking in the [documentation](https://www.arangodb.com/docs/3.6/aql/functions-arangosearch.html#scoring-functions).
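
To give a feeling for what such a relevance score measures, here is a toy sketch of the Okapi BM25 formula in plain Python. It is only an illustration under stated assumptions: the function name `bm25_scores` and the three sample documents are made up, the values `k1=1.2` and `b=0.75` are commonly used defaults, and ArangoSearch computes its scores inside the index rather than like this.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25.

    Assumes a non-empty list of documents.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "an alien world in a distant galaxy".split(),
    "a documentary about science".split(),
    "a romantic comedy".split(),
]
scores = bm25_scores(["alien", "galaxy", "science"], docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), " ".join(doc))
```

The first document matches two query terms and ranks highest, the second matches one, and the third matches none and therefore scores zero.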

So let us find movies with the following keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”

In [None]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental galaxy', 
      'text_en'), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
    """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Another crucial point of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via the BOOST function.
So let us tweak our previous query to prefer “galaxy” over the other keywords.
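
The effect of boosting can be pictured with a deliberately simplified scoring model in which every matching keyword contributes its boost factor to the total. The helper `boosted_score` and the sample texts are assumptions made for illustration; the real scores come from the relevance model used in the query.

```python
def boosted_score(tokens, term_boosts):
    """Sum the occurrences of each keyword, multiplied by that keyword's boost."""
    return sum(boost * tokens.count(term) for term, boost in term_boosts.items())


doc_a = "an alien studies science".split()
doc_b = "lost in a galaxy".split()

# Without boosting, doc_a wins because it matches two keywords.
plain = {"alien": 1, "science": 1, "galaxy": 1}
print(boosted_score(doc_a, plain), boosted_score(doc_b, plain))

# Boosting "galaxy" by 5 moves doc_b to the top.
boosted = {"alien": 1, "science": 1, "galaxy": 5}
print(boosted_score(doc_a, boosted), boosted_score(doc_b, boosted))
```

The same document set is ranked differently depending only on the boost, which is what the query below relies on.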

In [None]:
cursor = database.aql.execute(
"""
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental', 
      'text_en'
      ) 
    ||
    BOOST(
      d.description 
      IN TOKENS('galaxy', 
      'text_en'), 
      5), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## ArangoSearch Meets Graph

One of the coolest features of ArangoDB, being a multi-model database, is that we can combine different data models and query capabilities.
So, for example, we can easily combine ArangoSearch with a graph traversal. Recall that our IMDB dataset is a graph with edges connecting
the movies we have been looking at to their respective actors, genres, or directors. Let us explore this and look up the director for each of the sci-fi movies above.

In [None]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
   SEARCH ANALYZER(
     d.description 
     IN TOKENS('amazing action world alien sci-fi science documental', 
     'text_en'
     ) 
     ||
   BOOST(
     d.description 
     IN TOKENS('galaxy', 'text_en'), 
     5), 
     'text_en'
     ) 
     
   SORT BM25(d) DESC 

   LIMIT 10 

     FOR vertex, edge, path 
      IN 1..1 INBOUND  
      d imdb_edges
      FILTER path.edges[0].$label == "DIRECTED"
      RETURN DISTINCT {
        "director" : vertex.name, 
        "movie" : d.title
        } 
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/