![arangodb](https://github.com/joerg84/ArangoDBUniversity/raw/master/img/ArangoDB_logo.png)

# ArangoSearch

<a href="https://colab.research.google.com/github/joerg84/ArangoDBUniversity/blob/master/ArangoSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ArangoSearch provides information retrieval features, natively integrated into ArangoDB’s query language and with support for all data models. It is primarily a full-text search engine, a much more powerful alternative to the full-text index type.

ArangoSearch introduces the concept of Views which can be seen as virtual collections. Each View represents an inverted index to provide fast full-text searching over one or multiple linked collections and holds the configuration for the search capabilities, such as the attributes to index. It can cover multiple or even all attributes of the documents in the linked collections. Search results can be sorted by their similarity ranking to return the best matches first using popular scoring algorithms.
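
To make the idea of an inverted index more concrete, here is a minimal pure-Python sketch. It is only an illustration of the concept, not how ArangoSearch is implemented internally; the document keys and texts are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document keys containing it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return index


# Hypothetical mini-collection; keys and texts are invented for illustration.
docs = {
    "m1": "a galaxy far away",
    "m2": "pirates of the galaxy",
    "m3": "a quiet drama",
}
index = build_inverted_index(docs)

# Looking up a word is a dictionary access instead of a scan over all documents.
print(sorted(index["galaxy"]))  # ['m1', 'm2']
```

The lookup cost depends on the number of matching documents rather than on the size of the whole collection, which is what makes this structure attractive for full-text search.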

Configurable Analyzers are available for text processing, such as for tokenization, language-specific word stemming, case conversion, removal of diacritical marks (accents) from characters and more. Analyzers can be used standalone or in combination with Views for sophisticated searching.

# Setup 

Before getting started with ArangoSearch, we need to prepare our environment and create a temporary database on Oasis, ArangoDB's managed service.

In [0]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [0]:
import json
import requests
import sys
import oasis
import time

from pyArango.connection import *
from arango import ArangoClient

Create the temporary database:

In [3]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials("ArangoSearchIMDBTutorial")

# Connect to the temp database
# Please note that we use the python-arango driver as it has better support for ArangoSearch 
database = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [0]:
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

Feel free to use the above URL to check out the web UI!

##  IMDB Example Dataset

![imdb](https://github.com/joerg84/ArangoDBUniversity/raw/master/img/IMDB_graph.png)

Last but not least, we will import the [IMDB Example Dataset](https://github.com/arangodb/example-datasets/tree/master/Graphs/IMDB), which includes information about various movies, actors, directors, and more as a graph.
*Note: the included arangorestore only works on Linux or Windows systems. If you want to run this notebook on a different OS, please consider using the appropriate arangorestore from the [Download area](https://www.arangodb.com/download-major/).*

## Linux:

In [0]:
!./tools/arangorestore -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "data/imdb" 

# Create First View

As discussed above, an ArangoSearch view contains references to documents stored in different collections. 
This makes it possible to perform complex federated searches, even over a complete graph including vertex and edge collections.

In [6]:
# Create an ArangoSearch view.
database.create_arangosearch_view(
    name='v_imdb',
    properties={'cleanupIntervalStep': 0}
)

{'cleanup_interval_step': 0,
 'commit_interval_msec': 1000,
 'consolidation_interval_msec': 10000,
 'consolidation_policy': {'min_score': 0,
  'segments_bytes_floor': 2097152,
  'segments_bytes_max': 5368709120,
  'segments_max': 10,
  'segments_min': 1,
  'type': 'tier'},
 'global_id': 'c1086005617/',
 'id': '1086005617',
 'links': [],
 'name': 'v_imdb',
 'primary_sort': [],
 'type': 'arangosearch',
 'writebuffer_active': 0,
 'writebuffer_idle': 64,
 'writebuffer_max_size': 33554432}

Let us check it is actually there:

In [7]:
print(database["v_imdb"])

<StandardCollection v_imdb>
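
Note that the output above shows a collection wrapper, so it does not by itself confirm that the View exists on the server. A more explicit check could list the Views and look for the name. The sketch below assumes the database object offers a `views()` method returning dictionaries with a `name` key, as python-arango's database API is expected to; the `FakeDatabase` class is a hypothetical stand-in so the example runs without a server.

```python
def view_exists(db, name):
    """Return True if a View called `name` is listed by the database.

    Assumes `db.views()` returns a list of dicts with a 'name' key.
    """
    return any(view.get("name") == name for view in db.views())


class FakeDatabase:
    """Hypothetical stand-in for a real connection (illustration only)."""

    def views(self):
        return [{"id": "1086005617", "name": "v_imdb", "type": "arangosearch"}]


fake = FakeDatabase()
print(view_exists(fake, "v_imdb"))   # True
print(view_exists(fake, "missing"))  # False
```

With the live connection from above, the call would be `view_exists(database, "v_imdb")`.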


As of now this view is empty, so we need to link it to a collection (i.e., imdb_vertices).

In [8]:
link = {
  "includeAllFields": True,
  "fields" : { "description" : { "analyzers" : [ "text_en" ] } }
}

database.update_arangosearch_view(
    name='v_imdb',
    properties={'links': { 'imdb_vertices': link }}
)

{'cleanup_interval_step': 0,
 'commit_interval_msec': 1000,
 'consolidation_interval_msec': 10000,
 'consolidation_policy': {'min_score': 0,
  'segments_bytes_floor': 2097152,
  'segments_bytes_max': 5368709120,
  'segments_max': 10,
  'segments_min': 1,
  'type': 'tier'},
 'global_id': 'c1086005617/',
 'id': '1086005617',
 'links': [{}],
 'name': 'v_imdb',
 'primary_sort': [],
 'type': 'arangosearch',
 'writebuffer_active': 0,
 'writebuffer_idle': 64,
 'writebuffer_max_size': 33554432}
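
If you want to index more than one attribute, the link definition can be assembled programmatically. The following is a small sketch using the same keys as the link above (`includeAllFields`, `fields`, `analyzers`); the helper name `build_link` and the choice to also analyze `title` with `text_en` are assumptions made for illustration.

```python
def build_link(analyzers_by_field, include_all_fields=True):
    """Build the link definition for one collection of an ArangoSearch View."""
    return {
        "includeAllFields": include_all_fields,
        "fields": {
            field: {"analyzers": list(analyzers)}
            for field, analyzers in analyzers_by_field.items()
        },
    }


# Hypothetical configuration: analyze both description and title with text_en.
link = build_link({"description": ["text_en"], "title": ["text_en"]})
properties = {"links": {"imdb_vertices": link}}
print(properties)
```

The resulting `properties` dictionary has the same shape as the one passed to `update_arangosearch_view` above.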

As the indexing might take a few seconds, let us have a brief look at what is actually going on.

![ArangoSearch](https://github.com/joerg84/ArangoDBUniversity/blob/master/img/ArangoSearch_Arch.jpg?raw=1)

To fill the View, the specified Analyzer (`"analyzers" : [ "text_en" ]` in our case) parses input values and transforms them into sets of sub-values, for example by breaking up text into words with language-specific tokenization and stemming.
Let us check how the `text_en` Analyzer transforms an input into tokens:

In [9]:
cursor = database.aql.execute(
  'RETURN TOKENS("I like ArangoDB because it rocks!", "text_en")'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

['i', 'like', 'arangodb', 'becaus', 'it', 'rock']
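
To get a feeling for what an Analyzer does, here is a rough pure-Python approximation of some of the steps (case conversion, accent removal, tokenization). It is a sketch only and deliberately omits stemming, so its output differs from `text_en`, which reduces "because" to "becaus" and "rocks" to "rock".

```python
import re
import unicodedata


def simple_analyzer(text):
    """Lowercase, strip accents, and split into alphanumeric tokens (no stemming)."""
    # Decompose accented characters into base character + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.findall(r"[a-z0-9]+", without_accents.lower())


print(simple_analyzer("I like ArangoDB because it rocks!"))
# ['i', 'like', 'arangodb', 'because', 'it', 'rocks']
print(simple_analyzer("Frühlings Erwachen"))
# ['fruhlings', 'erwachen']
```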


By now our view should be ready, so let us issue the first query and look for short Drama Movies.

In [10]:
cursor = database.aql.execute(
  'FOR d IN v_imdb SEARCH d.type == "Movie" AND d.genre == "Drama" AND d.runtime IN 10..50 RETURN d.title'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

Wellcome
Rosemarie Nitribitt - Tod einer Edelhure
Wiatr
Primavera
Lücken im Gedankenstrom
Dr. Jekyll and Mr. Hyde
Breaking Glass
Pulsar
Frühlings Erwachen - Eine Kindertragödie
Glastage
Sunday in August
Land gewinnen
À San Remo
Carne
Dr. Jekyll and Mr. Hyde
Room 10
Zwischen Flieder wandern und singen
Alias
Antoine et Colette
Edison's Frankenstein
Silvester Home Run
Bis zur Unendlichkeit
Space Riders
True
Kurz:Ivan
Dreamcatcher
The Kolaborator
Rounds
Melissa
Hotel Chevalier
Another Lady Innocent
The Wiggles: Wiggle Bay
Good Night
Crin blanc: Le cheval sauvage
VeggieTales: An Easter Carol


At this point you might wonder whether you could have achieved the same results with a simple AQL `FILTER`:

In [11]:
cursor = database.aql.execute(
  'FOR d IN imdb_vertices FILTER d.type == "Movie" AND d.genre == "Drama" AND d.runtime IN 10..50 RETURN d.title'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)


The difference between the two queries is that the `SEARCH` query uses the previously created View, whereas the `FILTER` query has to perform post-processing on the entire result set.
Furthermore, `SEARCH` queries allow us to do other cool things, which we will explore next.

In the next example we retrieve all movies mentioning “Star wars” in the description.

In [12]:
cursor = database.aql.execute(
"""FOR d IN v_imdb 
SEARCH PHRASE(d.description, "Star wars", "text_en") 
RETURN {title:d.title, description: d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'Fanboys', 'description': '"Star Wars" fans travel to Skywalker Ranch to steal an early copy of "Episode I: The Phantom Menace".'}
{'title': 'Star Wars: The Clone Wars', 'description': "Set between Episode II and III the Clone Wars is the first computer animated Star Wars film. Anakin and Obi Wan must find out who kidnapped Jabba the Hutts son and return him safely. The Seperatists will try anything to stop them and ruin any chance of a diplomatic agreement between the Hutt's and the Republic."}
{'title': 'Gymkata', 'description': 'Johnathan Cabot is a champion gymnast. In the tiny, yet savage, country of Parmistan, there is a perfect spot for a "star wars" site. For the US to get this site, they must compete in the brutal "Game". The government calls on Cabot, the son of a former operative, to win the game. Cabot must combine his gymnastics skills of the west with fighting secrets of the east and form GYMKATA!'}
{'title': 'Family Guy: Blue Harvest', 'description': 'With the 

## Proximity Search

Proximity searching is a way to search for two or more words that occur within a certain number of words from each other.
In the next example, we are looking for the word sequence `in <any word> galaxy` in the description of a movie.
Feel free to try other values!

In [13]:
# Execute the query
cursor = database.aql.execute(
  'FOR d IN v_imdb SEARCH PHRASE(d.description, "in", 1, "galaxy", "text_en") RETURN {title:d.title, description: d.description}'
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'Lost in Space', 'description': 'The prospects for continuing life on Earth in the year 2058 are grim. So the Robinsons are launched into space to colonize Alpha Prime, the only other inhabitable planet in the galaxy. But when a stowaway sabotages the mission, the Robinsons find themselves hurtling through uncharted space.'}
{'title': 'The Ice Pirates', 'description': 'The time is the distant future, where by far the most precious commodity in the galaxy is water. The last surviving water planet was somehow removed to the unreachable centre of the galaxy at the end of the galactic trade wars. The galaxy is ruled by an evil emperor (John Carradine) presiding over a trade oligarchy that controls all mining and sale of ice from asteroids and comets.'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a worm
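
The matching rule behind this query can be sketched in plain Python: the first word must be followed by exactly the given number of arbitrary words and then by the second word. This is a simplified illustration that works on raw lowercase words, whereas `PHRASE` operates on the tokens produced by the Analyzer; the function name `phrase_with_gap` is made up.

```python
import re


def phrase_with_gap(text, first, gap, second):
    """Return True if `first` is followed by exactly `gap` words, then `second`."""
    words = re.findall(r"\w+", text.lower())
    offset = gap + 1  # distance between the positions of the two words
    return any(
        words[i] == first and words[i + offset] == second
        for i in range(len(words) - offset)
    )


print(phrase_with_gap("In a galaxy far away", "in", 1, "galaxy"))   # True
print(phrase_with_gap("in a far away galaxy", "in", 1, "galaxy"))   # False
print(phrase_with_gap("in a far away galaxy", "in", 3, "galaxy"))   # True
```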

## Ranking and Document Relevance

Great, now we can identify documents containing a specific phrase.
But especially with large document bases, we need to be able to rank documents based on their relevance.
ArangoSearch supports the following two schemes:

* [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

* [TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

You can learn more about ranking in the [documentation](https://www.arangodb.com/docs/3.6/aql/functions-arangosearch.html#scoring-functions).
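
To illustrate how such a score behaves, here is a small sketch of the textbook Okapi BM25 formula on a toy corpus. The parameter values `k1=1.2` and `b=0.75` are commonly used defaults and are an assumption here; the exact formula variant used inside ArangoSearch may differ in detail, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, already tokenized descriptions.
docs = [
    "alien galaxy action".split(),
    "quiet family drama".split(),
    "galaxy galaxy quest".split(),
]
scores = bm25_scores(["galaxy", "alien"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

The first document wins because it contains the rarer term "alien" in addition to "galaxy"; repeating "galaxy" in the third document helps, but with diminishing returns.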

So let us find movies matching the following keywords: “amazing, action, world, alien, sci-fi, science, documental, galaxy”

In [14]:
cursor = database.aql.execute(
  """FOR d IN v_imdb 
  SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en') 
  SORT BM25(d) DESC 
  LIMIT 10 
  RETURN {"title": d.title, "description" : d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'AVPR: Aliens vs. Predator - Requiem', 'description': "Prepare for more mayhem as warring aliens and predators return for Round 2 of their no-holds-barred slugfest. This time, the intergalactic creatures do battle in a small American town, throwing local residents into harm's way. To save the planet, the humans must oust both types of unwelcome guests. This sci-fi sequel features tons of spectacular action sequences full of nifty new gadgets and gooey monster gore."}
{'title': 'Moon 44', 'description': 'In 2038, at a remote outpost on Moon 44, Galactic Mining Corp. agent Felix Stone (Michael Paré) embarks on a dangerous mission to stop the hijacking of precious natural resources needed on Earth. To do so, he must battle a familiar foe and an alien enemy. Malcolm McDowell, Lisa Eichhorn and Dean Devlin star in this sci-fi thriller from action director Roland Emmerich (Independence Day).'}
{'title': 'Dark Star', 'description': 'A low-budget, sci-fi satire that focuses on a grou

Another crucial feature of ArangoSearch is the ability to fine-tune document scores evaluated by relevance models at query time. That functionality is exposed in AQL via the `BOOST` function.
So let us tweak our previous query to prefer “galaxy” over the other keywords.

In [15]:
cursor = database.aql.execute(
"""FOR d IN v_imdb 
   SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
   BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') 
   SORT BM25(d) DESC 
   LIMIT 10 
   RETURN {"title": d.title, "description" : d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'title': 'Star Trek Collection', 'description': 'Star Trek a futuristic science fiction franchise. With a space ship that can travel faster than the speed of light, galaxies to explore, and cool skin tight suits to beam up in, Star Trek has been, since the 1960’s, a series of Films, various TV series, video games, the host of many spin-offs, and the possibly the first to use the cell phone. Star Trek fans are some of the most devout.'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a wormhole that allows instantaneous travel to Earth. Cole is the Alien Tracker who's in hot pursuit of the escaped convicts."}
{'title': 'Stitch! The Movie', 'description': 'The continuing adventures of Lilo, a little Hawaiian girl, and Stitch, the galaxy\'s most wanted extraterrestrial. Stitch, Pleakley, and Dr. Jumba are all part
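
If you want to experiment with different keywords and boost factors, it can help to pass them as bind variables instead of editing the query string. The sketch below only builds the query and its bind variables; the helper name `build_boosted_query` is hypothetical, and the parameterized query has not been run against a live server here.

```python
def build_boosted_query(keywords, boosted, factor, limit=10):
    """Return (aql, bind_vars) for a BM25-ranked search with boosted keywords."""
    aql = """
FOR d IN v_imdb
  SEARCH ANALYZER(
    d.description IN TOKENS(@keywords, 'text_en') ||
    BOOST(d.description IN TOKENS(@boosted, 'text_en'), @factor),
    'text_en')
  SORT BM25(d) DESC
  LIMIT @limit
  RETURN {"title": d.title, "description": d.description}
"""
    bind_vars = {
        "keywords": " ".join(keywords),
        "boosted": " ".join(boosted),
        "factor": factor,
        "limit": limit,
    }
    return aql, bind_vars


aql, bind_vars = build_boosted_query(
    ["amazing", "action", "world", "alien", "sci-fi", "science", "documental"],
    ["galaxy"],
    5,
)
print(bind_vars)
# With a live connection (not executed here):
# cursor = database.aql.execute(aql, bind_vars=bind_vars)
```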

## ArangoSearch Meets Graph

One of the coolest features of ArangoDB, being a multi-model database, is that we can combine different data models and query capabilities.
So, for example, we can easily combine ArangoSearch with a graph traversal. Recall that our IMDB dataset is a graph with edges connecting
the movies we have been looking at to their respective actors, genres, or directors. Let us explore this and look up the director for each of the sci-fi movies above.

In [16]:
cursor = database.aql.execute(
"""
FOR d IN v_imdb 
   SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
    BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en') 
   SORT BM25(d) DESC 
   LIMIT 10 
     FOR vertex, edge, path IN 1..1 INBOUND d imdb_edges
     FILTER path.edges[0].$label == "DIRECTED"
     RETURN DISTINCT {"director" : vertex.name, "movie" : d.title} 
"""
)
# Iterate through the result cursor
for doc in cursor:
  print(doc)

{'director': 'Garth Jennings', 'movie': "The Hitchhiker's Guide to the Galaxy"}
{'director': 'Robert C. Cooper', 'movie': 'Stargate: The Ark of Truth'}
{'director': 'Stewart Raffill', 'movie': 'The Ice Pirates'}
{'director': 'George Lucas', 'movie': 'Star Wars: Episode III: Revenge of the Sith'}
{'director': 'George Lucas', 'movie': 'Star Wars: Episode II - Attack of the Clones'}
{'director': 'J.J. Abrams', 'movie': 'Star Trek'}
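
The traversal part of the query, `1..1 INBOUND d imdb_edges` with a filter on the edge label, can be pictured as the following plain-Python sketch over a handful of edge documents. The edge data is invented for illustration and only mimics the `_from`, `_to`, and `$label` attributes used in the query.

```python
def inbound_neighbors(edges, target, label):
    """Return the source vertices of edges pointing at `target` with `label`."""
    return [
        edge["_from"]
        for edge in edges
        if edge["_to"] == target and edge["$label"] == label
    ]


# Hypothetical edge documents (keys and labels are made up for this sketch).
edges = [
    {"_from": "imdb_vertices/director1", "_to": "imdb_vertices/movie1", "$label": "DIRECTED"},
    {"_from": "imdb_vertices/actor1", "_to": "imdb_vertices/movie1", "$label": "ACTS_IN"},
    {"_from": "imdb_vertices/director2", "_to": "imdb_vertices/movie2", "$label": "DIRECTED"},
]
print(inbound_neighbors(edges, "imdb_vertices/movie1", "DIRECTED"))
# ['imdb_vertices/director1']
```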


# Further Links

* https://www.arangodb.com/docs/stable/arangosearch.html

* https://www.arangodb.com/arangodb-training-center/search/arangosearch/