![arangodb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoDB_logo.png?raw=1)

# Fuzzy Search 

<a href="https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/FuzzySearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.
Check this [ArangoSearch notebook](https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/ArangoSearch.ipynb) for an introduction to ArangoSearch.

When dealing with real-world text retrieval, we often care not only about exact matches to our search phrase but also need to consider, for example, typos or alternative spellings.
“Fuzzy search” is an umbrella term for a set of algorithms for such approximate matching. Usually such algorithms evaluate a similarity measure that shows how close a search term is to the items in a dictionary. A search engine can then decide which results to show first.
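
As a minimal sketch of that idea, the snippet below ranks a small, made-up dictionary by similarity to a misspelled term. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. This is an assumption for illustration only and not the algorithm ArangoSearch uses.

```python
from difflib import SequenceMatcher


def rank_by_similarity(term, dictionary):
    """Return (word, score) pairs sorted from most to least similar to term."""
    scored = [
        (word, SequenceMatcher(None, term, word).ratio())
        for word in dictionary
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical dictionary, for illustration only.
words = ["galaxy", "gallery", "galley", "star"]
for word, score in rank_by_similarity("galxy", words):
    print(f"{word}: {score:.2f}")
```

The closest dictionary entry comes first, which is the ordering a search engine would use to decide what to show at the top.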

In this notebook we will look at two different implementations of fuzzy search in [ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/):
* [Levenshtein distance](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#levenshtein_match)
* [NGram similarity](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#ngram_match)

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on ArangoDB's managed service Oasis.

In [None]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [None]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [None]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="FuzzyArangoSearch", credentialProvider="https://d383fa0b596a.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

In [None]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

Feel free to use the above URL to check out the WebUI!

##  IMDB Example Dataset

![imdb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/IMDB_graph.png?raw=1)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, ... as a graph.
*Note: the included arangorestore will only work on Linux or Windows systems. If you want to run this notebook on a different OS, please consider using the appropriate arangorestore from the [Download area](https://www.arangodb.com/download-major/).*

## Linux:

In [None]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/imdb"

# Create First View

An ArangoSearch view contains references to documents stored in different collections.
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [None]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb',
    properties={'cleanupIntervalStep': 0}
)

Let us check it is actually there:

In [None]:
print(database["v_imdb"])

Next, we will create a custom analyzer to fill this view.

Note that the analyzer must have the "position" and "frequency" features enabled; otherwise NGRAM_MATCH() will not find anything.

In [None]:
# Retrieve the list of existing analyzers.
database.analyzers()

# Delete the analyzer from a previous run, if present.
database.delete_analyzer('test_analyzer', ignore_missing=True)

# Create a text analyzer that also emits edge ngrams of 3 to 8 characters.
# The "frequency" and "position" features are required by NGRAM_MATCH.
database.create_analyzer(
    name='test_analyzer',
    analyzer_type='text',
    properties={
        'locale': 'en.utf-8',
        'case': 'lower',
        'stopwords': [],
        'accent': False,
        'stemming': True,
        'edgeNgram': {'min': 3, 'max': 8, 'preserveOriginal': True},
    },
    features=['frequency', 'norm', 'position']
)



In [None]:
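# Index all fields; analyze title and description with test_analyzer,
# since the queries below search d.title with that analyzer.
link = {
    "includeAllFields": True,
    "fields": {
        "title": {"analyzers": ["test_analyzer"]},
        "description": {"analyzers": ["test_analyzer"]}
    }
}

database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': {'imdb_vertices': link}}
)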

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoSearch_Arch.jpg?raw=1)

## NGram Match

Ngram similarity is a measure for the difference between two strings. It is computed by counting how long the longest sequence of matching ngrams is and dividing that count by the target’s total ngram count.

In [None]:
cursor = database.aql.execute(
"""
RETURN NGRAM_SIMILARITY(
"quick fox",
"quick foxx", 
2)"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)


With an ngram size of 2, the ngram similarity between both strings is 0.888. Feel free to experiment with other combinations such as `NGRAM_SIMILARITY( "same string","same string", 2)` or vary the ngramSize.
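
To make the number less opaque, here is a rough pure-Python sketch that follows the definition above literally: split both strings into ngrams, find the longest common sequence of ngrams, and divide by the target's ngram count. The function names are made up for this sketch, and ArangoDB's actual implementation may differ in details.

```python
def ngrams(text, size):
    """Split text into overlapping character ngrams of the given size."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def ngram_similarity_sketch(input_text, target, size):
    """Matching ngram sequence length divided by the target's ngram count."""
    target_ngrams = ngrams(target, size)
    if not target_ngrams:
        return 0.0
    matched = lcs_length(ngrams(input_text, size), target_ngrams)
    return matched / len(target_ngrams)


# "quick fox" has 8 bigrams, "quick foxx" has 9, and all 8 match in order.
print(round(ngram_similarity_sketch("quick fox", "quick foxx", 2), 3))  # 0.889
```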

Let us start by using NGRAM_MATCH to find a misspelled movie title.

In [None]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb SEARCH NGRAM_MATCH(d.title, 'Str War', 0.7, 'test_analyzer')
SORT BM25(d) DESC
RETURN d.title"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Here we use the NGRAM_MATCH search function to search the titles of the movies in our view for strings similar to the search term. The 0.7 is the similarity threshold: the minimum ngram similarity a title must reach to still count as a match, so lower values allow more ‘fuzziness’.
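
To try different thresholds without editing the query string each time, the search can be wrapped in a small helper that passes the term and threshold as AQL bind variables. This is only a sketch: the helper name `search_titles` is made up, and it assumes the `v_imdb` view and `test_analyzer` from above exist.

```python
NGRAM_TITLE_QUERY = """
FOR d IN v_imdb
  SEARCH NGRAM_MATCH(d.title, @term, @threshold, 'test_analyzer')
  SORT BM25(d) DESC
  RETURN d.title
"""


def search_titles(database, term, threshold=0.7):
    """Run the NGRAM_MATCH title search with the given term and threshold."""
    cursor = database.aql.execute(
        NGRAM_TITLE_QUERY,
        bind_vars={"term": term, "threshold": threshold},
    )
    return list(cursor)
```

For example, `search_titles(database, 'Str War', 0.5)` would run the same search with a looser threshold.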

NGRAM_MATCH is intended to be used with an ngram analyzer, ideally one whose min and max are equal. An analyzer configured with a min and max of 2 looks at words 2 letters at a time, which is useful for determining the longest common sequence together with some context. (The `test_analyzer` created above is instead a text analyzer with edge ngrams of 3 to 8 characters.) The idea behind ngram matching is to search for similar words, not necessarily exact matches. One of the simplest ways of calculating the similarity between two words is to calculate the longest common sequence (LCS) of letters: the longer the LCS, the more similar the words. However, this approach has one big disadvantage, the absence of context. For example, the words "connection" and "fonetica" have a long LCS (o-n-e-t-i) but very different meanings. To add some context, ngram sequences are used.
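
An analyzer with a min and max of 2, as described in this paragraph, could be defined roughly as follows with python-arango. This is a sketch, not part of the original notebook: the analyzer name `bigram_sketch` is hypothetical, and the properties are the standard options of ArangoDB's `ngram` analyzer type.

```python
# Hypothetical analyzer definition, chosen for this sketch.
BIGRAM_ANALYZER = {
    "name": "bigram_sketch",
    "analyzer_type": "ngram",
    "properties": {
        "min": 2,
        "max": 2,
        "preserveOriginal": False,
        "streamType": "utf8",
    },
    # NGRAM_MATCH needs "frequency" and "position".
    "features": ["frequency", "norm", "position"],
}


def create_bigram_analyzer(database):
    """Create the analyzer on a python-arango database handle."""
    database.delete_analyzer(BIGRAM_ANALYZER["name"], ignore_missing=True)
    return database.create_analyzer(**BIGRAM_ANALYZER)
```

Calling `create_bigram_analyzer(database)` would create it. To use it, the view link and the NGRAM_MATCH call would have to reference `bigram_sketch` instead of `test_analyzer`.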

Each word is split into a series of letter groups, and these groups are then matched. If we take the same words but calculate similarity based on 3-grams (ngrams with a min and max of 3), we get a better similarity measure: con-onn-nne-nec-ect-cti-tio-ion vs. fon-one-net-eti-tic-ica gives a much shorter LCS (zero matches). To remove the effect of length differences, we normalize the LCS length by the word length. The result is a rating between 0 (no match at all) and 1 (fully matched).
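
The difference between the letter-based and the 3-gram-based comparison can be checked with a few lines of plain Python. The helpers below are written for this sketch and are not ArangoDB functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def ngrams(word, size):
    """Split a word into overlapping character ngrams of the given size."""
    return [word[i:i + size] for i in range(len(word) - size + 1)]


# Letter by letter, the two words share the sequence o-n-e-t-i.
letter_lcs = lcs_length("connection", "fonetica")

# As 3-grams, they have nothing in common.
trigram_lcs = lcs_length(ngrams("connection", 3), ngrams("fonetica", 3))

print(letter_lcs, trigram_lcs)  # 5 0
```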


While [NGRAM_SIMILARITY()](https://www.arangodb.com/docs/devel/aql/functions-string.html#ngram_similarity) only counts fully matching ngrams, [NGRAM_POSITIONAL_SIMILARITY()](https://www.arangodb.com/docs/devel/aql/functions-string.html#ngram_positional_similarity) also considers partially matching ones. Let us look at how that affects the returned scores:

In [None]:
cursor = database.aql.execute(
"""
RETURN
{"NGRAM_SIMILARITY" : NGRAM_SIMILARITY(
"quick fox jumps",
"quick foxx jups", 
2),
"NGRAM_POSITIONAL_SIMILARITY" : NGRAM_POSITIONAL_SIMILARITY(
"quick fox jumps",
"quick foxx jups", 
2)}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## Levenshtein Match

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) is another measure for the difference between two strings, given by the minimum number of single-character transformations (insertions, deletions, or substitutions) required to turn one string into the other. Let us consider a concrete example:

In [None]:
cursor = database.aql.execute(
"""
RETURN LEVENSHTEIN_DISTANCE(
"The quick brown fox jumps over the lazy dog", 
"The quick black dog jumps over the brown fox")"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)




Here we need a minimum of 13 transformations to move from one string to the other. 
Feel free to find a minimum sequence for this transformation or experiment with other combinations such as `LEVENSHTEIN_DISTANCE("a", "b")`.
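
For readers who want to see how such a distance is computed, here is a small sketch of the classic dynamic-programming algorithm in plain Python. It is written for illustration and is not ArangoDB's implementation.

```python
def levenshtein_distance(source, target):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (source_char != target_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein_distance("a", "b"))  # 1
print(levenshtein_distance(
    "The quick brown fox jumps over the lazy dog",
    "The quick black dog jumps over the brown fox",
))  # 13
```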

In [None]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR doc IN v_imdb
    SEARCH LEVENSHTEIN_MATCH(
      doc.description, 
      'galxy', 
      2
      )
    RETURN doc
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

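Note that LEVENSHTEIN_MATCH() takes no analyzer argument: the search term is compared as given and does not need to be modified by an analyzer. The analyzer used for the indexed field is the one set by a surrounding ANALYZER() call, or `identity` if there is none.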

## Comparison 

When should we choose which?


# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/