# A review of MongoDB Atlas Search

* What is "MongoDB Atlas Search": It is a way to search text **quickly**. The search relies on a pre-existing "MongoDB Atlas Search Index" (a pre-built index that MongoDB uses to retrieve the required results more quickly).

##### After this tutorial you will be able to:
    1. Understand what a search index is.
    2. Know what "MongoDB Atlas Search" is and what its capabilities are.
    3. Analyze the effectiveness of a query.
    4. Easily connect to MongoDB and run any query you want (without the limitations of the driver).

* This notebook contains 3 parts:
    1. An example of the efficiency of indexing.
    2. The capabilities of Atlas Search when indexing a textual field.
    3. How to connect to MongoDB through HTTP.

\*In order to run this notebook interactively, please go to "part 3" first, to set up the connection.

In [26]:
import numpy as np
import pandas as pd
from helper import *
from queries import *
from time import time
from pprint import pprint
from config_file import *
from pymongo import MongoClient
from create_db import  db_creation
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn
from indices.index1 import create_index1
from indices.index2 import create_index2
from indices.index3 import create_index3

client = MongoClient(CONN_STRING)
mydb = client[DB_NAME]
collection = mydb[COLLECTION_NAME]

ps = PorterStemmer()

# Part 1 - Efficiency of indexing.

We will be using the sample dataset available with MongoDB Atlas clusters.

* Each database contains collections, and each collection contains documents (the actual records).
* Every MongoDB collection has a default index on "_id". It is created together with the collection, every document is created with a unique "_id", and this index cannot be deleted.

In [5]:
# connecting to the collection
client = MongoClient(CONN_STRING)
mydb = client["sample_restaurants"]
collection = mydb["restaurants"]

##### Existing index information

In [1224]:
# Example of document in the database
collection.find_one()

{'_id': ObjectId('5eb3d668b31de5d588f4292a'),
 'address': {'building': '2780',
  'coord': [-73.98241999999999, 40.579505],
  'street': 'Stillwell Avenue',
  'zipcode': '11224'},
 'borough': 'Brooklyn',
 'cuisine': 'American',
 'grades': [{'date': datetime.datetime(2014, 6, 10, 0, 0),
   'grade': 'A',
   'score': 5},
  {'date': datetime.datetime(2013, 6, 5, 0, 0), 'grade': 'A', 'score': 7},
  {'date': datetime.datetime(2012, 4, 13, 0, 0), 'grade': 'A', 'score': 12},
  {'date': datetime.datetime(2011, 10, 12, 0, 0), 'grade': 'A', 'score': 12}],
 'name': 'Riviera Caterer',
 'restaurant_id': '40356018'}

In [1226]:
# The default index
collection.index_information()

{'_id_': {'v': 2, 'key': [('_id', 1)]}}

##### Let's check how many records there are in total and how many match a specific query: cuisine=American

In [1204]:
total_records = collection.count_documents({})
specific_records = collection.count_documents({'cuisine':'American'})
print(f'There are a total of {total_records} documents, and for cuisine=American there are: {specific_records} documents.')

There are a total of 25359 documents, and for cuisine=American there are: 6183 documents.


##### Now that we know the number of documents in the collection, let's dive deeper to understand the power of indexing

When we run a query in MongoDB, we can actually determine a lot of information about the query using the "explain()" function.
It returns a document that contains information about the query plans and the execution statistics.

Here, focus on the executionStats key which contains the execution statistics. Below are the important keys to focus on:[1]

* *explain.executionStats.nReturned* - returns the number of documents that match the query
* *explain.executionStats.executionTimeMillis* - returns the total time in milliseconds required for query plan selection and query execution
* *explain.executionStats.totalKeysExamined* - returns the number of index entries scanned
* *explain.executionStats.totalDocsExamined* - returns the number of documents examined during query execution. This is not necessarily the number of documents returned
* *explain.executionStats.executionStages.stage* - the scan type: COLLSCAN (collection scan) or IXSCAN (index scan)


In [1216]:
pprint(collection.find({'cuisine':'American'}).explain()['executionStats'])

{'allPlansExecution': [],
 'executionStages': {'advanced': 6183,
                     'direction': 'forward',
                     'docsExamined': 25359,
                     'executionTimeMillisEstimate': 1,
                     'filter': {'cuisine': {'$eq': 'American'}},
                     'isEOF': 1,
                     'nReturned': 6183,
                     'needTime': 19177,
                     'needYield': 0,
                     'restoreState': 25,
                     'saveState': 25,
                     'stage': 'COLLSCAN',
                     'works': 25361},
 'executionSuccess': True,
 'executionTimeMillis': 12,
 'nReturned': 6183,
 'totalDocsExamined': 25359,
 'totalKeysExamined': 0}


When we look for all documents with "cuisine"="American", we can notice a couple of things:
1. We are going over all the documents in the collection! ("totalDocsExamined" = 25359).
2. The returned documents are as we expected ('nReturned': 6183).
3. 'stage': 'COLLSCAN' - It means that no index was used, so MongoDB scanned the whole collection ('totalKeysExamined': 0).
4. The execution time is 12 milliseconds.
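
The fields discussed above can also be pulled out of the explain output programmatically. The following is a minimal sketch, not part of the original notebook: `leaf_stage` and `summarize_execution_stats` are hypothetical helper names, and the sample dictionary is a trimmed copy of the output shown above. It assumes that indexed plans nest the scanning stage under `inputStage`, which is how the indexed plan later in this notebook is laid out.

```python
# Trimmed copy of the executionStats output shown above (COLLSCAN case).
collscan_stats = {
    "executionStages": {"stage": "COLLSCAN", "nReturned": 6183, "docsExamined": 25359},
    "executionTimeMillis": 12,
    "nReturned": 6183,
    "totalDocsExamined": 25359,
    "totalKeysExamined": 0,
}


def leaf_stage(stage):
    """Follow nested 'inputStage' entries down to the stage that reads the data."""
    while "inputStage" in stage:
        stage = stage["inputStage"]
    return stage["stage"]


def summarize_execution_stats(stats):
    """Collect the fields discussed above into one flat dictionary (hypothetical helper)."""
    return {
        "scan_type": leaf_stage(stats["executionStages"]),
        "returned": stats["nReturned"],
        "docs_examined": stats["totalDocsExamined"],
        "keys_examined": stats["totalKeysExamined"],
        "time_ms": stats["executionTimeMillis"],
    }


summary = summarize_execution_stats(collscan_stats)
print(summary)
```

With a live connection, the same helper could be fed `collection.find({...}).explain()['executionStats']` directly.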

##### Now let's create an index and examine the same results again

In [1222]:
# Create new index
collection.create_index('cuisine')

print(f"There are now 2 indices: {collection.index_information()}")

There are now 2 indices: {'_id_': {'v': 2, 'key': [('_id', 1)]}, 'cuisine_1': {'v': 2, 'key': [('cuisine', 1)]}}


In [1223]:
# Examine the results
pprint(collection.find({'cuisine':'American'}).explain()['executionStats'])

{'allPlansExecution': [],
 'executionStages': {'advanced': 6183,
                     'alreadyHasObj': 0,
                     'docsExamined': 6183,
                     'executionTimeMillisEstimate': 1,
                     'inputStage': {'advanced': 6183,
                                    'direction': 'forward',
                                    'dupsDropped': 0,
                                    'dupsTested': 0,
                                    'executionTimeMillisEstimate': 0,
                                    'indexBounds': {'cuisine': ['["American", '
                                                                '"American"]']},
                                    'indexName': 'cuisine_1',
                                    'indexVersion': 2,
                                    'isEOF': 1,
                                    'isMultiKey': False,
                                    'isPartial': False,
                                    'isSparse': False,
           

Again:

1. We are going over **only** the relevant documents in the collection! ("totalDocsExamined" = 6183).
2. The returned documents are as we expected ('nReturned': 6183).
3. 'stage': 'IXSCAN' - It means that MongoDB scanned the index keys we generated.
4. The execution time is 8 milliseconds.
5. If "explain.executionStats.totalKeysExamined" == "explain.executionStats.nReturned", then we achieved maximum efficiency.

We save 4 milliseconds! This sounds like a negligible amount, but it is a 33% saving (from 12 to 8 milliseconds), and in a bigger DB it can make a huge difference.
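
The arithmetic behind these numbers can be checked with a few lines. This is only an illustrative sketch: the function names are made up for this example, and the inputs are the values reported in the two explain outputs above.

```python
def percent_saved(before_ms, after_ms):
    """Relative time saving, in percent, when going from before_ms to after_ms."""
    return 100 * (before_ms - after_ms) / before_ms


def examined_per_returned(docs_examined, n_returned):
    """How many documents MongoDB had to look at for each document it returned."""
    return docs_examined / n_returned


time_saving = percent_saved(12, 8)
ratio_without_index = examined_per_returned(25359, 6183)
ratio_with_index = examined_per_returned(6183, 6183)

print(f"time saved: {time_saving:.1f}%")
print(f"docs examined per returned doc, no index: {ratio_without_index:.2f}")
print(f"docs examined per returned doc, with index: {ratio_with_index:.2f}")
```

A ratio of 1.0 is the "maximum efficiency" case: every examined document is also a returned one.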

In summary, creating an index saves us time, much like jumping to the relevant page in a book just by looking at its index.
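
The book analogy can be illustrated in plain Python. This is a toy model and an assumption of this example, not how MongoDB works internally (MongoDB indexes are B-tree structures, while this sketch uses a dictionary). The restaurant names other than "Riviera Caterer" are invented.

```python
from collections import defaultdict

documents = [
    {"name": "Riviera Caterer", "cuisine": "American"},
    {"name": "Hypothetical Trattoria", "cuisine": "Italian"},
    {"name": "Hypothetical Diner", "cuisine": "American"},
    {"name": "Hypothetical Noodle Bar", "cuisine": "Chinese"},
]


def collection_scan(docs, cuisine):
    """Look at every document, like reading a book page by page."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc["cuisine"] == cuisine:
            matches.append(doc)
    return matches, examined


def build_index(docs, field):
    """Map each field value to the positions of the documents holding it."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc[field]].append(position)
    return index


def index_scan(docs, index, value):
    """Jump straight to the matching documents, like using a book's index."""
    positions = index.get(value, [])
    return [docs[p] for p in positions], len(positions)


cuisine_index = build_index(documents, "cuisine")
scan_matches, scan_examined = collection_scan(documents, "American")
index_matches, index_examined = index_scan(documents, cuisine_index, "American")
print(f"collection scan examined {scan_examined} docs, index scan examined {index_examined}")
```

Both approaches return the same documents, but the indexed lookup only touches the ones that match.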


# Part 2 - Capabilities of MongoDB Atlas Search.

* MongoDB Atlas offers full-text search capabilities. Like traditional indexes (part 1), full-text search indexes help speed up queries on text fields. [2]
* In order to search text, MongoDB parses the sentences with an "Analyzer". The analyzers are based on "Lucene" (a free and open-source search engine software library supported by the Apache Software Foundation [3]), and the default one is "lucene.standard".


##### Analyze the results of the "Standard Analyzer" of MongoDB
* Most of the examples for textual search suggest using the default ("lucene.standard").
* Standard Analyzer - It divides text into terms based on word boundaries, which makes it language-neutral for most use cases. It converts all terms to lower case and removes punctuation. It provides grammar-based tokenization that recognizes email addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more [4].
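
To get a feeling for what such an analyzer produces, here is a rough pure-Python approximation. It is only a sketch under simplifying assumptions: it lowercases the text and splits on non-word characters, and it does not reproduce the real Lucene tokenizer (no special handling of email addresses, acronyms, or CJK text). The function name is made up for this example.

```python
import re


def approx_standard_analyzer(text):
    """Very rough stand-in for "lucene.standard": lowercase, split into words, drop punctuation."""
    return re.findall(r"\w+", text.lower())


messages = [
    "I am going to run, begin",
    "they GO to running, beginning",
]

for message in messages:
    tokens = approx_standard_analyzer(message)
    # Without stemming, "running" stays "running" and never equals the term "run".
    print(tokens, "| contains 'run':", "run" in tokens)
```

Because the tokens are compared as-is, a query for "run" can only match documents that contain exactly that token.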

Let's create a small collection and investigate it!

In [8]:
# connecting to the collection
client = MongoClient(CONN_STRING)
mydb = client[DB_NAME]
collection = mydb[COLLECTION_NAME]

db_creation(collection)

In [9]:
# see the collection
item_details = collection.find()
items_df = pd.DataFrame(item_details)
items_df

Unnamed: 0,_id,title,message
0,629f39712df15bd544f09840,fitness,"I am going to run, begin"
1,629f39712df15bd544f09841,exercise,"they GO to running, beginning"
2,629f39712df15bd544f09842,todo,"go go power rangers, begins"
3,629f39712df15bd544f09843,exercise,"they go to running, started"
4,629f39712df15bd544f09844,making,went for my beginning
5,629f39712df15bd544f09845,trying,I am in the 10th place
6,629f39712df15bd544f09846,place,Again the tenth place


##### Basic analyzer creation on the "message" field

In [1074]:
# If you get "BAD..." please wait and run it again!
delete_index(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
response = create_index1(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
pprint(response.json())

index not found
{'analyzer': 'lucene.standard',
 'collectionName': 'sample',
 'database': 'test',
 'indexID': '629e15b5e1655a375da2177d',
 'mappings': {'dynamic': True, 'fields': {'message': {'type': 'string'}}},
 'name': 'trying',
 'status': 'IN_PROGRESS',
 'synonyms': []}


##### Let's examine the results when we query the word "run"

In [1077]:
# "run" does not find the word "running", it only matches "run"
# The double square brackets mark the word that matched the query
WORD = "run"
query  = get_query1(WORD)
results = list(collection.aggregate(query))
display_highlights(results)

------------------- Instance:1------------------------
**score:** 0.7034
**Title:** fitness
 **Message:** I am going to run, begin
> I am going to [[run]], begin
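
The helpers `get_query1` and `display_highlights` live in separate modules and are not shown in this notebook. The following is a minimal sketch of what such helpers might look like, assuming a `$search` stage with the `text` operator and the `highlight` option on the "message" field. The function names, the projection, and the sample highlight document are assumptions made for illustration; the index name "trying" is taken from the index creation output above.

```python
def build_text_query(word, index_name="trying", path="message"):
    """Hypothetical stand-in for get_query1: a text search with highlighting."""
    return [
        {
            "$search": {
                "index": index_name,
                "text": {"query": word, "path": path},
                "highlight": {"path": path},
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "message": 1,
                "score": {"$meta": "searchScore"},
                "highlights": {"$meta": "searchHighlights"},
            }
        },
    ]


def render_highlight(highlight):
    """Wrap every matched piece ("hit") of a highlight in double square brackets."""
    return "".join(
        f"[[{part['value']}]]" if part["type"] == "hit" else part["value"]
        for part in highlight["texts"]
    )


# Hand-written sample in the shape of a search highlight (not real server output).
sample_highlight = {
    "path": "message",
    "texts": [
        {"value": "I am going to ", "type": "text"},
        {"value": "run", "type": "hit"},
        {"value": ", begin", "type": "text"},
    ],
}

pipeline = build_text_query("run")
rendered = render_highlight(sample_highlight)
print(rendered)
```

With a live connection, the pipeline would be passed to `collection.aggregate(pipeline)` in the same way as the query above.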


In [1242]:
# We see that only one instance is returned out of 3 wanted instances(below).
items_df.iloc[[0,1,3]]

Unnamed: 0,_id,title,message
0,629df582c7210cdb42fa9910,fitness,"I am going to run, begin"
1,629df582c7210cdb42fa9911,exercise,"they GO to running, beginning"
3,629df582c7210cdb42fa9913,exercise,"they go to running, started"



### Analyze the results of the "Custom version 1" of MongoDB

It is possible to assemble a custom analyzer by stacking different options one upon another [5]. Here we create a custom analyzer named "levitas_analyzer" that does the following:

* conversion to ASCII
* lower case
* stemming
* removal of stopwords (nltk)
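
The effect of stacking these steps can be imitated in plain Python. This is a toy sketch, not the Atlas analyzer: `toy_stem` is a crude suffix stripper that stands in for the snowball stemmer, and `STOPWORDS` is a tiny hypothetical subset of the nltk list. It is only meant to show why "run" and "running" end up as the same term.

```python
import re
import unicodedata

# Hypothetical, tiny subset of a stopword list (the notebook uses the nltk list).
STOPWORDS = {"i", "am", "to", "the", "they", "for", "my", "in"}


def ascii_fold(text):
    """Replace accented characters with their closest ASCII form (e.g. "café" -> "cafe")."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def toy_stem(token):
    """Crude stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 2:
            # "runn" -> "run": drop a doubled final consonant left by the suffix.
            if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def analyze(text):
    """ASCII folding -> lower case -> stemming -> stopword removal."""
    tokens = re.findall(r"\w+", ascii_fold(text).lower())
    stemmed = [toy_stem(token) for token in tokens]
    return [token for token in stemmed if token not in STOPWORDS]


for message in ["I am going to run, begin", "they GO to running, beginning"]:
    print(analyze(message))
```

Both messages now contain the term "run", so a query for "run" can match either of them.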

In [1252]:
# If you get "BAD..." please wait and run it again!
delete_index(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
response = create_index2(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
pprint(response.json())

index not found
{'analyzer': 'levitas_analyzer',
 'analyzers': [{'charFilters': [],
                'name': 'levitas_analyzer',
                'tokenFilters': [{'type': 'lowercase'},
                                 {'stemmerName': 'english',
                                  'type': 'snowballStemming'},
                                 {'tokens': ['i',
                                             'me',
                                             'my',
                                             'myself',
                                             'we',
                                             'our',
                                             'ours',
                                             'ourselves',
                                             'you',
                                             "you're",
                                             "you've",
                                             "you'll",
                                             "you'd",
 

In [3]:
# Now we match all the "run" instances! (It is possible to achieve a similar result with "lucene.english", but here we have more control.)
WORD = "run"
query  = get_query1(WORD)
results = list(collection.aggregate(query))
display_highlights(results)

------------------- Instance:1------------------------
**score:** 0.3682
**Title:** fitness
 **Message:** I am going to run, begin
> I am going to [[run]], begin
------------------- Instance:2------------------------
**score:** 0.3682
**Title:** exercise
 **Message:** they GO to running, beginning
> they GO to [[running]], beginning
------------------- Instance:3------------------------
**score:** 0.3682
**Title:** exercise
 **Message:** they go to running, started
> they go to [[running]], started


### Analyze the results of the "Custom version 2" of MongoDB
The same as before, with synonyms added. In total, it does:

* conversion to ASCII
* lower case
* stemming
* removal of stopwords (nltk)
* synonyms

In order to build the synonyms, we use "WordNet", a lexical database of semantic relations between words [6].
For each given word, we look for up to 3 of the most similar words (using "Wu & Palmer’s similarity" [7]) that have more than 90% similarity.
Then we insert the synonyms collection into the database [8].
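
The selection rule described above (similarity above 90%, at most 3 words) can be sketched without WordNet. The example below is an illustration under stated assumptions: `wu_palmer` uses the textbook form of the measure (twice the depth of the closest common ancestor, divided by the sum of the two depths), which may differ in detail from what nltk computes, and the candidate scores are invented.

```python
def wu_palmer(depth_common_ancestor, depth_a, depth_b):
    """Textbook Wu & Palmer similarity based on depths in the taxonomy."""
    return 2 * depth_common_ancestor / (depth_a + depth_b)


def pick_synonyms(candidates, threshold=0.9, limit=3):
    """Keep the most similar candidates that score above the threshold."""
    kept = [(word, score) for word, score in candidates if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept[:limit]]


# Hypothetical similarity scores for the word "10th".
candidates = [("tenth", 1.0), ("ten", 0.75), ("10th", 1.0), ("decade", 0.6)]
picked = pick_synonyms(candidates)
print(picked)
```

Two senses at the same depth that share themselves as the common ancestor get a similarity of 1.0, which is why a word and its exact synonym pass the 90% threshold.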

In [1084]:
def get_synonyms(words_for_syn=None):
    terms_dict = {}

    ii = 0
    if words_for_syn is None:
        words_for_syn = [i for i in wn.all_lemma_names()]
    print(f"The number of words for synonyms is: {len(words_for_syn)}")

    tic = time()
    while ii < NUMBER_OF_SYNONYMS and words_for_syn:  # also stop when the word list runs out
        word = words_for_syn.pop(0)
        if word not in nltk_stop_words:
            word_synsets = wn.synsets(word)
            single_synonyms = []

            for syn in word_synsets:
                for l in syn.lemmas():
                    examine_word_name = str.replace(l.name(), '-', '_')
                    if examine_word_name not in nltk_stop_words and ps.stem(examine_word_name) not in nltk_stop_words:
                        single_synonyms.append((examine_word_name, ps.stem(examine_word_name), syn.wup_similarity(word_synsets[0])))

            single_synonyms = sorted(single_synonyms, key=lambda x : x[-1], reverse=True)
            df = pd.DataFrame(single_synonyms, columns=["org", "stemmed", "similarity"])
            df = df.groupby('stemmed').first()
            df = df[df.similarity > SIMILARITY_THRESHOLD]
            df = df.sort_values("similarity" , ascending=False).head(NUMBER_OF_RELATIVE_SYNONYMS + 1)
            picked_terms = df.index.values.tolist()
            # if word in picked_terms:
            #     picked_terms.remove(word)
            if not((len(picked_terms) == 1) and (word in picked_terms)):
                terms_dict[word] = picked_terms
            ii +=1

    toc = time()
    print(f'{(toc - tic)/60} time in minutes')

    synonyms_collection_list = []
    for term, synonyms_list in terms_dict.items():

        synonyms_collection_list.append({
                                      "mappingType": "explicit",
                                      "input": [term],
                                      "synonyms": synonyms_list
                                    })
    return synonyms_collection_list

synonyms_collection_list = get_synonyms()

synonyms_collection = mydb[SYN_COLLECTION_NAME]
synonyms_collection.drop()
synonyms_collection.insert_many(synonyms_collection_list)

The number of words are: 147306
0.004533334573109945 time in minutes


<pymongo.results.InsertManyResult at 0x2569ac3ed68>

In [1085]:
# If you get "BAD..." please wait and run it again (until the deletion process has finished)!
delete_index(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
response = create_index3(CLUSTER_NAME, DB_NAME, COLLECTION_NAME, INDEX_NAME)
pprint(response.json())

index not found
{'analyzer': 'levitas_analyzer',
 'analyzers': [{'charFilters': [],
                'name': 'levitas_analyzer',
                'tokenFilters': [{'type': 'asciiFolding'},
                                 {'type': 'lowercase'},
                                 {'stemmerName': 'english',
                                  'type': 'snowballStemming'},
                                 {'tokens': ['i',
                                             'me',
                                             'my',
                                             'myself',
                                             'we',
                                             'our',
                                             'ours',
                                             'ourselves',
                                             'you',
                                             "you're",
                                             "you've",
                                             "you'l

Let's examine what we got.
In the "synonyms" collection there is the entry:

    mappingType: "explicit"
    input: "10th"
    synonyms: ["10th", "tenth"]

In the following example, without the synonym collection we get only one instance when we search for "10th" ("No synonymns query"). With the synonym collection we get two instances ("Synonymns query").
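
How an "explicit" entry widens a query term can be sketched in plain Python. This is a simplified model and an assumption of this example, not the Atlas implementation: a term listed under `input` is expanded to everything under `synonyms`, and the mapping is treated as one-way. The function name is made up.

```python
synonym_docs = [
    {"mappingType": "explicit", "input": ["10th"], "synonyms": ["10th", "tenth"]},
]


def expand_term(term, docs):
    """Return the set of terms to search for once synonyms are applied."""
    expanded = set()
    for doc in docs:
        if doc["mappingType"] == "explicit" and term in doc["input"]:
            expanded.update(doc["synonyms"])
    # A term without any mapping is searched as-is.
    return expanded or {term}


print(sorted(expand_term("10th", synonym_docs)))
print(sorted(expand_term("tenth", synonym_docs)))
```

In this model a search for "10th" also looks for "tenth", while a search for "tenth" is not widened, because "tenth" does not appear under `input`.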

In [10]:
# instances of the word and its synonym
items_df.iloc[[5, 6]]

Unnamed: 0,_id,title,message
5,629f39712df15bd544f09845,trying,I am in the 10th place
6,629f39712df15bd544f09846,place,Again the tenth place


In [1089]:
WORD = "10th"
no_syn_query  = get_query1(WORD)
no_syn_results = list(collection.aggregate(no_syn_query))

syn_query  = get_query2(WORD)
syn_results = list(collection.aggregate(syn_query))

print("---------------------  No synonymns query - try single word ---------------------------\n")
display_highlights(no_syn_results)

print("--------------------- Synonymns query - try single word---------------------------\n")
display_highlights(syn_results)

--------------------- No synonymns query ---------------------------

------------------- Instance:1------------------------
**score:** 0.8673
**Title:** trying
 **Message:** I am in the 10th place
> I am in the [[10th]] place
--------------------- Synonymns query ---------------------------

------------------- Instance:1------------------------
**score:** 0.8673
**Title:** trying
 **Message:** I am in the 10th place
> I am in the [[10th]] place
------------------- Instance:2------------------------
**score:** 0.8673
**Title:** place
 **Message:** Again the tenth place
> Again the [[tenth]] place


But if we search for a sentence, we get nothing!
This is because the synonym query breaks the results for multi-word phrases. It is a known issue [9].

In [1094]:
WORD = "The rangers have power"
no_syn_query  = get_query1(WORD)
no_syn_results = list(collection.aggregate(no_syn_query))

syn_query  = get_query2(WORD)
syn_results = list(collection.aggregate(syn_query))

print("--------------------- No synonymns query - try single phrase ---------------------------\n")
display_highlights(no_syn_results)

print("--------------------- Synonymns query - try single phrase ---------------------------\n")
display_highlights(syn_results)

--------------------- No synonymns query - try single phrase ---------------------------

------------------- Instance:1------------------------
**score:** 1.1645
**Title:** todo
 **Message:** go go power rangers, begins
> go go [[power]] [[rangers]], begins
--------------------- Synonymns query - try single phrase ---------------------------



We can fix it by adding, in the same new query, an option to also search without synonyms.
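
The helper `get_query3` is not shown in this notebook. One way such a combined query might be written is with the `compound` operator and two `should` clauses, one with synonyms and one without. The sketch below is an assumption about the approach, not the author's actual query; the function name and the synonym mapping name "my_synonyms" are hypothetical, and the index name "trying" comes from the index creation output earlier.

```python
def build_synonym_fallback_query(phrase, index_name="trying", path="message",
                                 synonym_mapping="my_synonyms"):
    """Search with synonyms and, in the same query, also without them."""
    with_synonyms = {"text": {"query": phrase, "path": path, "synonyms": synonym_mapping}}
    without_synonyms = {"text": {"query": phrase, "path": path}}
    return [
        {
            "$search": {
                "index": index_name,
                # A document can match through either clause.
                "compound": {"should": [with_synonyms, without_synonyms]},
                "highlight": {"path": path},
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "message": 1,
                "score": {"$meta": "searchScore"},
                "highlights": {"$meta": "searchHighlights"},
            }
        },
    ]


fallback_pipeline = build_synonym_fallback_query("The rangers have power")
clauses = fallback_pipeline[0]["$search"]["compound"]["should"]
print(clauses)
```

The plain clause lets the phrase match on its own words even when the synonym clause returns nothing.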

In [1097]:
WORD = "The rangers have power"
no_syn_query  = get_query2(WORD)
no_syn_results = list(collection.aggregate(no_syn_query))

syn_query  = get_query3(WORD)
syn_results = list(collection.aggregate(syn_query))

print("--------------------- synonymns query previous - try single phrase ---------------------------\n")
display_highlights(no_syn_results)

print("--------------------- Synonymns query new - try single phrase ---------------------------\n")
display_highlights(syn_results)

--------------------- synonymns query previous - try single phrase ---------------------------

--------------------- Synonymns query new - try single phrase ---------------------------

------------------- Instance:1------------------------
**score:** 1.1645
**Title:** todo
 **Message:** go go power rangers, begins
> go go [[power]] [[rangers]], begins


# Part 3 - Connect to MongoDB through HTTP
The connection is based on [10], with clarifications and the changes needed since it was published.
We are going to use an HTTP connection as described in [11]. In order to do that we need:
1. Private key - Please follow the instructions in [12].
2. Public key - We get it from [12] as well.
3. GROUP_ID - Groups and projects are synonymous terms. Please go to your "Projects" (left side menu) -> on the right side click on the three dots and select "visit project settings" -> copy the "Project ID".
4. Connection string - Please follow the instructions in [13].
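
As a rough illustration of how these pieces fit together, here is a minimal sketch of creating a search index over HTTP with digest authentication, using only the Python standard library. It is not the code behind `create_index1`, which is not shown here. The base URL and the `/fts/indexes` path reflect my reading of [11] and should be treated as assumptions that may have changed; the function names are hypothetical; the request body mirrors the response printed in part 2.

```python
import json
import urllib.request

# Assumed base URL of the Atlas administration API (see [11]).
BASE_URL = "https://cloud.mongodb.com/api/atlas/v1.0"


def index_endpoint(group_id, cluster_name):
    """Build the (assumed) URL for creating a search index on a cluster."""
    return f"{BASE_URL}/groups/{group_id}/clusters/{cluster_name}/fts/indexes"


def standard_index_body(db_name, collection_name, index_name):
    """Index definition mirroring the "lucene.standard" response shown in part 2."""
    return {
        "collectionName": collection_name,
        "database": db_name,
        "name": index_name,
        "analyzer": "lucene.standard",
        "mappings": {"dynamic": True, "fields": {"message": {"type": "string"}}},
    }


def create_search_index(public_key, private_key, group_id, cluster_name, body):
    """POST the index definition, authenticating with the public/private key pair."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, BASE_URL, public_key, private_key)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_manager)
    )
    request = urllib.request.Request(
        index_endpoint(group_id, cluster_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(request) as response:
        return json.load(response)


# Building the URL and body needs no network access or credentials.
example_url = index_endpoint("YOUR_GROUP_ID", "YOUR_CLUSTER_NAME")
example_body = standard_index_body("test", "sample", "trying")
print(example_url)
```

`create_search_index` is only defined here, not called, since it needs real keys and a reachable cluster.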

Resources:
[1] ```https://www.analyticsvidhya.com/blog/2020/09/mongodb-indexes-pymongo-tutorial/```
[2] ```https://www.mongodb.com/basics/search-index```
[3] ```https://en.wikipedia.org/wiki/Apache_Lucene```
[4] ```https://www.mongodb.com/docs/atlas/atlas-search/analyzers/standard/#:~:text=The%20standard%20analyzer%20is%20the,lower%20case%20and%20removes%20punctuation.```
[5] ```https://www.mongodb.com/docs/atlas/atlas-search/analyzers/custom/#std-label-custom-analyzers```
[6] ```https://wordnet.princeton.edu/```
[7] ```https://towardsdatascience.com/%EF%B8%8Fwordnet-a-lexical-taxonomy-of-english-words-4373b541cfff#:~:text=WordNet%20is%20a%20large%20lexical,such%20as%20hyponymy%20and%20antonymy.```
[8] ```https://www.mongodb.com/docs/atlas/atlas-search/synonyms/```
[9] ```https://www.mongodb.com/community/forums/t/synonym-search-not-working-when-searching-for-phrase/144068```
[10] ```https://www.youtube.com/watch?v=z-OPkB8fr0U&t=1039s&ab_channel=MongoDB```
[11] ```https://www.mongodb.com/docs/atlas/reference/api/fts-indexes-create-one/```
[12] ```https://www.mongodb.com/docs/atlas/configure-api-access/```
[13] ```https://www.mongodb.com/docs/manual/reference/connection-string/```
