![arangodb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoDB_logo.png?raw=1)

# Fuzzy Search 

<a href="https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/FuzzySearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.
Check this [ArangoSearch notebook](https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/ArangoSearch.ipynb) for an introduction to ArangoSearch.

When dealing with real-world text retrieval, we often care not only about exact matches to our search phrase but also need to account for typos or alternative spellings.
“Fuzzy search” is an umbrella term referring to a set of algorithms for such approximate matching. Usually such algorithms evaluate some similarity measure showing how close a search term is to the items in a dictionary. Then a search engine can make a decision on which results have to be shown first.
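To make the idea of a similarity measure concrete, here is a minimal pure-Python sketch of the Levenshtein (edit) distance, used to rank a small dictionary against a misspelled term. The helper names `levenshtein` and `rank_by_distance` and the sample dictionary are our own illustrations, and this is not how ArangoSearch implements the matching internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def rank_by_distance(term, dictionary, max_distance=2):
    """Return (word, distance) pairs within max_distance, closest first."""
    scored = [(word, levenshtein(term, word)) for word in dictionary]
    close = [(word, dist) for word, dist in scored if dist <= max_distance]
    return sorted(close, key=lambda pair: (pair[1], pair[0]))


# Hypothetical sample dictionary, for illustration only
dictionary = ["galaxy", "gallery", "galaxies", "gala", "planet"]
print(rank_by_distance("galxy", dictionary))
```

With this sample dictionary, only `galaxy` (distance 1) and `gala` (distance 2) fall within the threshold, and the closest match is listed first.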

In this notebook we will apply two different implementations of fuzzy search in [ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/):
* [Levenshtein distance](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#levenshtein_match)
* [NGram similarity](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#ngram_match)
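As a rough intuition for the second approach, the following sketch compares two strings by the character trigrams they share. It uses a simple Jaccard ratio over trigram sets, which is an assumption made for illustration; ArangoSearch's `NGRAM_MATCH` uses its own similarity computation, so the numbers will not be identical. The helper names `ngrams` and `ngram_similarity` are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        # Strings shorter than n contribute themselves as a single gram
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for candidate in ["galaxy", "galaxies", "planet"]:
    print(candidate, round(ngram_similarity("galaxy", candidate), 3))
```

Here `galaxy` and `galaxies` share three of their seven distinct trigrams, while `galaxy` and `planet` share none. A threshold on this ratio decides what counts as a match.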

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on ArangoDB's managed service Oasis.

In [0]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [0]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [0]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="ArangoSearchIMDBTutorial", credentialProvider="https://d383fa0b596a.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

In [0]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

Feel free to use the above URL to check out the WebUI!

##  IMDB Example Dataset

![imdb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/IMDB_graph.png?raw=1)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, ... as a graph.
Note: the included `arangorestore` only works on Linux or Windows systems. If you want to run this notebook on a different OS, please use the appropriate `arangorestore` from the [Download area](https://www.arangodb.com/download-major/).

## Linux:

In [0]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/imdb"

# Create First View

An ArangoSearch view contains references to documents stored in different collections.
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [0]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb',
    properties={'cleanupIntervalStep': 0}
)

Let us check it is actually there:

In [0]:
print(database["v_imdb"])

As of now this view is empty, so we need to link it to a collection (in this case `imdb_vertices`).

In [0]:
# Index all fields, and analyze "description" with the English text analyzer
link = {
  "includeAllFields": True,
  "fields": { "description": { "analyzers": [ "text_en" ] } }
}

database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoSearch_Arch.jpg?raw=1)

By now our view should be ready, so let us issue the first query and look for short drama movies.

In [0]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH d.type == "Movie" 
    AND 
    d.genre == "Drama" 
    AND 
    d.runtime IN 10..50 
    RETURN d.title
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## Levenshtein MATCH

Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
In the next example, we are looking for the word sequence `in <any word> galaxy` in the description of a movie.
Feel free to try other values!

In [0]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
  SEARCH PHRASE(
    d.description, 
    "in", 
    1, 
    "galaxy", 
    "text_en"
    ) 
    RETURN {
      title: d.title, 
      description: d.description
      }
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
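To illustrate what the `PHRASE` query above asks for, here is a small pure-Python sketch of the same check on a single string: the word "in", followed by exactly one arbitrary token, followed by "galaxy". The tokenizer is a deliberately naive assumption (lowercasing and splitting on non-alphanumerics); the real `text_en` analyzer also applies stemming and other normalization, so results can differ. The helper names are hypothetical.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase alphanumeric runs (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_with_gap(tokens, first, skip, second):
    """True if `second` occurs exactly `skip` tokens after `first`."""
    offset = skip + 1
    return any(
        token == first
        and i + offset < len(tokens)
        and tokens[i + offset] == second
        for i, token in enumerate(tokens)
    )


description = "A long time ago in a galaxy far, far away"
print(phrase_with_gap(tokenize(description), "in", 1, "galaxy"))
```

Changing the `skip` argument corresponds to changing the numeric argument between the two words in the `PHRASE` call.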

In [0]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR doc IN v_imdb
    SEARCH NGRAM_MATCH(
      doc.description, 
      'galaxy', 
      'text_en'
      )
    RETURN doc
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)
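For comparison, a Levenshtein-based query could look like the following sketch. It assumes an ArangoDB version that provides `LEVENSHTEIN_MATCH` (see the documentation linked in the introduction) and reuses the `v_imdb` view and `text_en` analyzer from above. The helper `build_levenshtein_query` is hypothetical, and the upper bound of 4 on the distance reflects how we understand the linked documentation for matching without transpositions. Note that `text_en` stems indexed terms, so the distance is measured against stemmed tokens rather than the original words.

```python
def build_levenshtein_query(term, max_distance=1, analyzer="text_en", limit=10):
    """Return an AQL query string and bind variables for a fuzzy term search."""
    if not 0 <= max_distance <= 4:
        # Assumed limit for matching without transpositions
        raise ValueError("max_distance must be between 0 and 4")
    query = """
    FOR d IN v_imdb
      SEARCH ANALYZER(
        LEVENSHTEIN_MATCH(d.description, @term, @distance, false),
        @analyzer
      )
      LIMIT @limit
      RETURN { title: d.title, description: d.description }
    """
    bind_vars = {
        "term": term,
        "distance": max_distance,
        "analyzer": analyzer,
        "limit": limit,
    }
    return query, bind_vars


query, bind_vars = build_levenshtein_query("galxy", max_distance=2)
print(bind_vars)
# With the connection from the setup section, the query could be run as:
# cursor = database.aql.execute(query, bind_vars=bind_vars)
```

Passing the search term as a bind variable keeps user input out of the query string itself.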

## Ranking and Document Relevance

Great, now we can identify documents containing a specific phrase,
but especially with large document bases we need to be able to rank documents based on their relevance.
ArangoSearch supports the following two schemes:

* [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

* [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

You can learn more about ranking in the [documentation](https://www.arangodb.com/docs/3.6/aql/functions-arangosearch.html#scoring-functions).
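To give a feel for how such a ranking works, here is a minimal sketch of the textbook Okapi BM25 formula over pre-tokenized documents. It is not ArangoSearch's implementation: the IDF variant, the defaults `k1=1.2` and `b=0.75`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs or 1.0
    # Number of documents containing each term at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy, already tokenized "descriptions"
docs = [
    ["alien", "galaxy", "war"],
    ["romantic", "comedy", "paris"],
    ["galaxy", "galaxy", "science", "alien", "world"],
]
for doc, score in zip(docs, bm25_scores(["galaxy", "alien"], docs)):
    print(round(score, 3), doc)
```

Documents that contain none of the query terms score zero, while repeated or rare terms raise the score, tempered by document length.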

So let us find movies with the following keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”

In [0]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental galaxy', 
      'text_en'), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
    """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Another crucial point of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via the BOOST function.
So let us tweak our previous query to prefer “galaxy” over the other keywords.

In [0]:
cursor = database.aql.execute(
"""
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental', 
      'text_en'
      ) 
    ||
    BOOST(
      d.description 
      IN TOKENS('galaxy', 
      'text_en'), 
      5), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

## ArangoSearch Meets Graph

One of the coolest features of ArangoDB, being a multi-model database, is that we can combine different data models and query capabilities.
So, for example, we can easily combine ArangoSearch with a graph traversal. Recall that our IMDB dataset is a graph with edges connecting
the movies we have been looking at to their respective actors, genres, or directors. Let us explore this and look up the director for each of the sci-fi movies above.

In [0]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
   SEARCH ANALYZER(
     d.description 
     IN TOKENS('amazing action world alien sci-fi science documental', 
     'text_en'
     ) 
     ||
   BOOST(
     d.description 
     IN TOKENS('galaxy', 'text_en'), 
     5), 
     'text_en'
     ) 
     
   SORT BM25(d) DESC 

   LIMIT 10 

     FOR vertex, edge, path 
      IN 1..1 INBOUND  
      d imdb_edges
      FILTER path.edges[0].$label == "DIRECTED"
      RETURN DISTINCT {
        "director" : vertex.name, 
        "movie" : d.title
        } 
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/