![arangodb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoDB_logo.png?raw=1)

# Fuzzy Search 

<a href="https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/FuzzySearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.
Check this [ArangoSearch notebook](https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/ArangoSearch.ipynb) for an introduction to ArangoSearch.

In real-world text retrieval, we often care not only about exact matches to our search phrase but also about near matches such as typos or alternative spellings.
“Fuzzy search” is an umbrella term for a set of algorithms for such approximate matching. These algorithms usually evaluate a similarity measure that shows how close a search term is to the items in a dictionary. A search engine can then use that measure to decide which results to show first.

In this notebook we will look at two different implementations of fuzzy search in [ArangoSearch](https://www.arangodb.com/why-arangodb/full-text-search-engine-arangosearch/):
* [Levenshtein distance](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#levenshtein_match)
* [NGram similarity](https://www.arangodb.com/docs/devel/aql/functions-arangosearch.html#ngram_match)
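
To make the two similarity measures concrete, here is a minimal plain-Python sketch. The helper names (`levenshtein`, `ngrams`, `ngram_similarity`) are made up for this illustration, and the n-gram measure shown is a simple Jaccard overlap of character trigrams, which is an assumption and not necessarily the exact formula ArangoSearch uses internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def ngrams(text: str, n: int = 3) -> set:
    """Set of all character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein("galaxy", "galxy"))
print(ngram_similarity("galaxy", "galaxies"))
```

A small edit distance or a high n-gram overlap both indicate that two terms are close. The edit distance counts individual character edits, while the n-gram overlap compares shared substrings.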

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on Oasis, ArangoDB's managed service.

In [0]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [0]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [38]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="ArangoSearchIMDBTutorial", credentialProvider="https://d383fa0b596a.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [39]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://d383fa0b596a.arangodb.cloud:8529
Username: TUT5n4jeb8adpj8hho4voj5kk
Password: TUT1q5ggtvxr2zxl78hwq74b
Database: TUTmpkhrsgnkn9vii1nvv1o4


Feel free to use the above URL to check out the web UI!

##  IMDB Example Dataset

![imdb](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/IMDB_graph.png?raw=1)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, and more as a graph.
Note: the included `arangorestore` only works on Linux or Windows systems. If you want to run this notebook on a different OS, please use the appropriate `arangorestore` from the [Download area](https://www.arangodb.com/download-major/).

## Linux:

In [40]:
! ./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/imdb"

2020-06-15T22:10:30Z [517] INFO [05c30] {restore} Connected to ArangoDB 'http+ssl://d383fa0b596a.arangodb.cloud:8529'
2020-06-15T22:10:31Z [517] INFO [3b6a4] {restore} no properties object
2020-06-15T22:10:31Z [517] INFO [9b414] {restore} # Re-creating document collection 'imdb_vertices'...
2020-06-15T22:10:31Z [517] INFO [9b414] {restore} # Re-creating edge collection 'imdb_edges'...
2020-06-15T22:10:32Z [517] INFO [6d69f] {restore} # Dispatched 2 job(s), using 2 worker(s)
2020-06-15T22:10:32Z [517] INFO [d88c6] {restore} # Creating indexes for collection 'imdb_vertices'...
2020-06-15T22:10:32Z [517] INFO [94913] {restore} # Loading data into edge collection 'imdb_edges', data size: 48957903 byte(s)
2020-06-15T22:10:32Z [517] INFO [94913] {restore} # Loading data into document collection 'imdb_vertices', data size: 22665786 byte(s)
2020-06-15T22:10:37Z [517] INFO [75e65] {restore} # Current restore progress: restored 

# Create First View

An ArangoSearch view contains references to documents stored in different collections.
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [41]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb',
    properties={'cleanupIntervalStep': 0}
)

{'cleanup_interval_step': 0,
 'commit_interval_msec': 1000,
 'consolidation_interval_msec': 10000,
 'consolidation_policy': {'min_score': 0,
  'segments_bytes_floor': 2097152,
  'segments_bytes_max': 5368709120,
  'segments_max': 10,
  'segments_min': 1,
  'type': 'tier'},
 'global_id': 'c16022453/',
 'id': '16022453',
 'links': [],
 'name': 'v_imdb',
 'primary_sort': [],
 'type': 'arangosearch',
 'writebuffer_active': 0,
 'writebuffer_idle': 64,
 'writebuffer_max_size': 33554432}

Let us check it is actually there:

In [42]:
print(database["v_imdb"])

<StandardCollection v_imdb>


As of now this view is empty, so we need to link it to a collection (i.e., imdb_vertices).

In [43]:
link = {
  "includeAllFields": True,
  "fields" : { "description" : { "analyzers" : [ "text_en" ] } }
}

database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)

{'cleanup_interval_step': 0,
 'commit_interval_msec': 1000,
 'consolidation_interval_msec': 10000,
 'consolidation_policy': {'min_score': 0,
  'segments_bytes_floor': 2097152,
  'segments_bytes_max': 5368709120,
  'segments_max': 10,
  'segments_min': 1,
  'type': 'tier'},
 'global_id': 'c16022453/',
 'id': '16022453',
 'links': [{}],
 'name': 'v_imdb',
 'primary_sort': [],
 'type': 'arangosearch',
 'writebuffer_active': 0,
 'writebuffer_idle': 64,
 'writebuffer_max_size': 33554432}

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoSearch_Arch.jpg?raw=1)

By now our view should be ready, so let us issue the first query and look for short Drama Movies.

In [44]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH d.type == "Movie" 
    AND 
    d.genre == "Drama" 
    AND 
    d.runtime IN 10..50 
    RETURN d.title
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Wellcome
Rosemarie Nitribitt - Tod einer Edelhure
Wiatr
Primavera
Lücken im Gedankenstrom
Dr. Jekyll and Mr. Hyde
Breaking Glass
Pulsar
Frühlings Erwachen - Eine Kindertragödie
Glastage
Sunday in August
Land gewinnen
À San Remo
Carne
Dr. Jekyll and Mr. Hyde
Room 10
Zwischen Flieder wandern und singen
Alias
Antoine et Colette
Edison's Frankenstein
Silvester Home Run
Bis zur Unendlichkeit
Space Riders
True
Kurz:Ivan
Dreamcatcher
The Kolaborator
Rounds
Melissa
Hotel Chevalier
Another Lady Innocent


## Levenshtein MATCH

Proximity searching is a way to search for two or more words that occur within a certain number of words of each other.
In the next example, we use `PHRASE` to look for the word sequence `in <any word> galaxy` in the description of a movie. The number between the two words is the count of tokens to skip.
Feel free to try other values!

In [45]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
  SEARCH PHRASE(
    d.description, 
    "in", 
    1, 
    "galaxy", 
    "text_en"
    ) 
    RETURN {
      title: d.title, 
      description: d.description
      }
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'Lost in Space', 'description': 'The prospects for continuing life on Earth in the year 2058 are grim. So the Robinsons are launched into space to colonize Alpha Prime, the only other inhabitable planet in the galaxy. But when a stowaway sabotages the mission, the Robinsons find themselves hurtling through uncharted space.'}
{'title': 'The Ice Pirates', 'description': 'The time is the distant future, where by far the most precious commodity in the galaxy is water. The last surviving water planet was somehow removed to the unreachable centre of the galaxy at the end of the galactic trade wars. The galaxy is ruled by an evil emperor (John Carradine) presiding over a trade oligarchy that controls all mining and sale of ice from asteroids and comets.'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a worm
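
The effect of the skip count in the `PHRASE` query above can be illustrated in plain Python. This is only a sketch: `tokenize` and `phrase_with_gap` are hypothetical helpers, and the tokenizer just lowercases and splits on non-alphanumeric characters, whereas the `text_en` analyzer also applies stemming.

```python
import re


def tokenize(text: str) -> list:
    """Very rough stand-in for an analyzer: lowercase word tokens only."""
    return re.findall(r"[a-z0-9']+", text.lower())


def phrase_with_gap(tokens: list, first: str, skip: int, second: str) -> bool:
    """True if `second` appears exactly `skip` tokens after `first`."""
    offset = skip + 1
    return any(
        tokens[i] == first and tokens[i + offset] == second
        for i in range(len(tokens) - offset)
    )


description = "In a galaxy far away, alien criminals organize a prison break."
print(phrase_with_gap(tokenize(description), "in", 1, "galaxy"))
```

With a skip count of 1, "in a galaxy" and "in the galaxy" both match, while "in galaxy" and "in the whole galaxy" do not.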

In [46]:
# Execute the query
cursor = database.aql.execute(
  """
  FOR doc IN v_imdb
    SEARCH NGRAM_MATCH(
      doc.description, 
      'galaxy', 
      'text_en'
      )
    RETURN doc
  """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'_key': '601', '_id': 'imdb_vertices/601', '_rev': '_apmGtwm-Am', 'description': '(22 Years Before Episode IV) Ten years after the events of the Battle of Naboo, not only has the galaxy undergone significant change, but so have Obi-Wan Kenobi, Padmé Amidala, and Anakin Skywalker as they are thrown together again for the first time since the Trade Federation invasion of Naboo. Anakin has grown into the accomplished Jedi apprentice of Obi-Wan, who himself has transitioned from student to teacher. The two Jedi are assigned to protect Padmé whose life is threatened by a faction o', 'genre': 'Action', 'homepage': '', 'id': '1894', 'imageUrl': 'http://cf1.imgobject.com/posters/206/4bccf5f0017a3c0f2c000206/star-wars-episode-ii-attack-of-the-clones-mid.jpg', 'imdbId': 'tt0121765', 'label': 'Star Wars: Episode II - Attack of the Clones', 'language': 'en', 'lastModified': '1299979002000', 'releaseDate': '1021500000000', 'released': '2000-2010', 'runtime': 142, 'studio': 'Lucasfilm', 'tagline': 
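
For comparison, a Levenshtein-based query could look like the sketch below. It assumes an ArangoDB version that provides `LEVENSHTEIN_MATCH(path, target, distance, transpositions)` as described in the documentation linked at the top, and it reuses the `v_imdb` view and the `text_en` analyzer from this notebook. The wrapper `fuzzy_titles` and the `LIMIT 5` are choices made for this illustration only.

```python
LEVENSHTEIN_QUERY = """
FOR doc IN v_imdb
  SEARCH ANALYZER(
    LEVENSHTEIN_MATCH(doc.description, @term, @distance, @transpositions),
    'text_en'
  )
  LIMIT 5
  RETURN doc.title
"""


def fuzzy_titles(db, term, distance=1, transpositions=True):
    """Return titles whose description contains a term within the given
    edit distance of `term`. `db` is a python-arango database handle."""
    cursor = db.aql.execute(
        LEVENSHTEIN_QUERY,
        bind_vars={
            "term": term,
            "distance": distance,
            "transpositions": transpositions,
        },
    )
    return list(cursor)
```

With the `database` handle created in the setup section, a call such as `fuzzy_titles(database, "galxy", distance=2)` would run the query. Because `text_en` stems the indexed terms, the stored form of a word can differ from its surface form, so a slightly larger distance than expected may be needed.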

## Ranking and Document Relevance

Great, now we can identify documents containing a specific phrase,
but especially with large document bases we need to be able to rank documents based on their relevance.
ArangoSearch supports the following two schemes:

* [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

* [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

You can learn more about ranking in the [documentation](https://www.arangodb.com/docs/3.6/aql/functions-arangosearch.html#scoring-functions).
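
To give an intuition for what a BM25 score expresses, here is a minimal sketch of the textbook Okapi BM25 formula over pre-tokenized documents. The function name `bm25_scores`, the toy documents, and the particular IDF variant are assumptions for this illustration. The exact formula and normalization used inside ArangoSearch may differ. The defaults `k=1.2` and `b=0.75` are commonly used values.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k=1.2, b=0.75):
    """Score each tokenized document against query_terms (textbook BM25)."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents each term occurs in at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalized relative to the average length
            norm = tf + k * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["alien", "galaxy", "war"],
    ["galaxy", "galaxy", "alien", "space", "war", "peace"],
    ["romance", "paris"],
]
print(bm25_scores(["galaxy"], docs))
```

Documents that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns.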

So let us find movies matching the following keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”

In [47]:
cursor = database.aql.execute(
  """
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental galaxy', 
      'text_en'), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
    """
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'AVPR: Aliens vs. Predator - Requiem', 'description': "Prepare for more mayhem as warring aliens and predators return for Round 2 of their no-holds-barred slugfest. This time, the intergalactic creatures do battle in a small American town, throwing local residents into harm's way. To save the planet, the humans must oust both types of unwelcome guests. This sci-fi sequel features tons of spectacular action sequences full of nifty new gadgets and gooey monster gore."}
{'title': 'Moon 44', 'description': 'In 2038, at a remote outpost on Moon 44, Galactic Mining Corp. agent Felix Stone (Michael Paré) embarks on a dangerous mission to stop the hijacking of precious natural resources needed on Earth. To do so, he must battle a familiar foe and an alien enemy. Malcolm McDowell, Lisa Eichhorn and Dean Devlin star in this sci-fi thriller from action director Roland Emmerich (Independence Day).'}
{'title': 'Dark Star', 'description': 'A low-budget, sci-fi satire that focuses on a grou

Another important feature of ArangoSearch is the ability to fine-tune, at query time, the document scores computed by the relevance models. This functionality is exposed in AQL via the `BOOST` function.
So let us tweak our previous query to prefer “galaxy” over the other keywords.

In [48]:
cursor = database.aql.execute(
"""
  FOR d IN v_imdb 
    SEARCH ANALYZER(
      d.description 
      IN TOKENS('amazing action world alien sci-fi science documental', 
      'text_en'
      ) 
    ||
    BOOST(
      d.description 
      IN TOKENS('galaxy', 
      'text_en'), 
      5), 
      'text_en'
      ) 
    SORT BM25(d) DESC 
    LIMIT 10 
    RETURN {
      "title": d.title, 
      "description" : d.description
      }
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'Star Trek Collection', 'description': 'Star Trek a futuristic science fiction franchise. With a space ship that can travel faster than the speed of light, galaxies to explore, and cool skin tight suits to beam up in, Star Trek has been, since the 1960’s, a series of Films, various TV series, video games, the host of many spin-offs, and the possibly the first to use the cell phone. Star Trek fans are some of the most devout.'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a wormhole that allows instantaneous travel to Earth. Cole is the Alien Tracker who's in hot pursuit of the escaped convicts."}
{'title': 'Stitch! The Movie', 'description': 'The continuing adventures of Lilo, a little Hawaiian girl, and Stitch, the galaxy\'s most wanted extraterrestrial. Stitch, Pleakley, and Dr. Jumba are all part

## ArangoSearch Meets Graph

One of the coolest features of ArangoDB, being a multi-model database, is that we can combine different data models and query capabilities.
So, for example, we can easily combine ArangoSearch with a graph traversal. Recall that our IMDB dataset is a graph with edges connecting 
the movies we have been looking at to their respective actors, genres, or directors. Let us explore this and look up the director for each of the sci-fi movies above.

In [49]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
   SEARCH ANALYZER(
     d.description 
     IN TOKENS('amazing action world alien sci-fi science documental', 
     'text_en'
     ) 
     ||
   BOOST(
     d.description 
     IN TOKENS('galaxy', 'text_en'), 
     5), 
     'text_en'
     ) 
     
   SORT BM25(d) DESC 

   LIMIT 10 

     FOR vertex, edge, path 
      IN 1..1 INBOUND  
      d imdb_edges
      FILTER path.edges[0].$label == "DIRECTED"
      RETURN DISTINCT {
        "director" : vertex.name, 
        "movie" : d.title
        } 
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'director': 'Garth Jennings', 'movie': "The Hitchhiker's Guide to the Galaxy"}
{'director': 'Robert C. Cooper', 'movie': 'Stargate: The Ark of Truth'}
{'director': 'Stewart Raffill', 'movie': 'The Ice Pirates'}
{'director': 'George Lucas', 'movie': 'Star Wars: Episode III: Revenge of the Sith'}
{'director': 'George Lucas', 'movie': 'Star Wars: Episode II - Attack of the Clones'}
{'director': 'J.J. Abrams', 'movie': 'Star Trek'}


# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/