# Text Indexes

---
### [Text Indexes](https://docs.mongodb.com/manual/core/index-text/#text-indexes)

- A text index stores one entry for each unique stemmed word in the indexed field.
> - For example, "texting" and "texted" both stem to "text", so only "text" is stored in the index.

- Words are converted to lowercase before they are indexed.
> - For example, "Hello", "HELLO", and "hello" are all stored as "hello".

- Text is split into words using the Unicode character categories Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space as delimiters.
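
The indexing rules above can be illustrated with a toy sketch in plain Python. It is not MongoDB's implementation: `toy_stem` and `toy_index_terms` are hypothetical helpers, the suffix stripping is a crude stand-in for real language-aware stemming, and stop-word handling is left out.

```python
import re


def toy_stem(word):
    """Strip a few common English suffixes (crude, for illustration only)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_index_terms(text):
    """Lowercase, split on whitespace and punctuation, then stem each token."""
    tokens = re.split(r"[\W_]+", text.lower())
    return {toy_stem(token) for token in tokens if token}


# "Texting" and "texted" collapse to one term, as do "HELLO" and "Hello"
print(toy_index_terms("Texting, texted; HELLO Hello"))
```

Each distinct term in the returned set corresponds to one index entry in this simplified picture.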

----

### Connect to local server

---

In [1]:
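# Import the required libraries
import pymongo

import pprint as pp
# Keep dictionary keys in insertion order: pprint looks up `sorted` in its own
# module namespace first, so this override turns its key sorting into a no-op
pp.sorted = lambda x, key=None: x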

In [2]:
# Connect to local host
client = pymongo.MongoClient("mongodb://localhost:27017/")

In [3]:
# Connect to database
db = client['nyc']

In [4]:
# Sample document
pp.pprint(db.airbnb.find_one())

{'_id': ObjectId('60c21bf5b653d40e79b4a7d0'),
 'accom_id': 2595,
 'description': 'Skylit Midtown Castle',
 'host': {'id': 2845,
          'name': 'Jennifer',
          'listings_count': 3,
          'neighbourhood_list': ['Midtown', "Hell's Kitchen"]},
 'neighbourhood': {'name': 'Midtown', 'group': 'Manhattan'},
 'location': {'type': 'Point', 'coordinates': [-73.98559, 40.75356]},
 'room_type': 'Entire home/apt',
 'price': 150,
 'minimum_nights': 30,
 'reviews': {'number_of_reviews': 48,
             'last_review': datetime.datetime(2019, 11, 4, 0, 0),
             'reviews_per_month': 0.35},
 'availability_365': 365}


---
**Drop previous indexes.**

---

In [5]:
# Drop indexes
db.airbnb.drop_indexes()

---
**Create text index.**

Create a text index on the `description` field.

---

In [6]:
# Create text index
db.airbnb.create_index(
                        [('description', 'text')],
                        name = 'description_index'
                )

'description_index'

---
### $text operator

- Perform a text search on a collection that has a text index using the [$text](https://docs.mongodb.com/manual/text-search/#-text-operator) operator.

- It tokenizes the search string using whitespace and most punctuation as delimiters.

- It performs a logical OR of all such tokens in the search string.


---
For example, retrieve documents that contain the term `wifi` in their description.

---

In [7]:
# Query using text index
cur = db.airbnb.find(
                    # query
                    {
                        '$text': {
                                    '$search': "wifi"
                                 }
                    },
                    # projection
                    {
                        'description': 1,
                        '_id': 0
                    })

for doc in cur:
    pp.pprint(doc)

{'description': 'Room in house with WIFI'}
{'description': 'Queens Modern Apartment with WiFi!'}
{'description': 'Comfortable Small Room with wifi'}
{'description': "You'll love this apartment. Free parking and Wifi"}
{'description': 'Great bedroom with tv, Sonos, WiFi,'}
{'description': 'Flathbush WIFI+WASHER+DRYER+AC'}
{'description': 'CONFORTABLE STUDIO ROOM INCLUDES WIFI'}
{'description': '♥♥♥ Entire House with Backyard & Superfast WiFi♥♥♥'}
{'description': 'Modern and Safe Place,Free Wifi'}
{'description': "FEMALE ONLY 'Heaven'PrivateBed/SharedSpace w/Wifi"}
{'description': 'Columbus Circle/Central Park - WIFI'}
{'description': 'Spacious 1bdrm furnished wifi tv'}
{'description': 'Best double Room all included wifi'}
{'description': 'Private single bed Room Wifi'}
{'description': 'Large Double Room  Queenbed Wifi'}
{'description': 'Private exit/entry*WIFI cozy room'}
{'description': 'My HOUSE is your house - 5 beds - fast WIFI'}
{'description': 'Stylish Riverview Condo with Free Pa

---
Matching is case-insensitive: the results above include `WIFI`, `WiFi`, `Wifi`, and `wifi`.

---

When the search string contains several words, MongoDB full-text search combines them with a logical **OR**.

----
For example, find an accommodation whose description contains either `wifi` or `parking`.

----

In [8]:
# Query: terms separated by whitespace
db.airbnb.find_one({
                        '$text': {
                                    '$search': "wifi parking"
                                }
                    },
                    {
                        'description': 1,
                        '_id': 0
                    })

{'description': '88 By The Park w/Parking Space'}

In [9]:
# Query: terms separated by a comma
db.airbnb.find_one({
                        '$text': {
                                    '$search': "wifi, parking"
                                }
                    },
                    {
                        'description': 1,
                        '_id': 0
                    })

{'description': '88 By The Park w/Parking Space'}
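
Both queries above return the same document because the comma is treated as a delimiter, so the two search strings yield the same tokens, which are then combined with OR. A rough plain-Python sketch of that behaviour, assuming a simplified tokenizer (the helpers `tokens` and `matches_any` are hypothetical, and stemming and stop words are ignored):

```python
import re


def tokens(text):
    """Lowercase and split on whitespace and punctuation."""
    return {t for t in re.split(r"[\W_]+", text.lower()) if t}


def matches_any(search, description):
    """True when the description shares at least one token with the search."""
    return bool(tokens(search) & tokens(description))


# Descriptions taken from the outputs shown in this notebook
descriptions = [
    "Room in house with WIFI",
    "88 By The Park w/Parking Space",
    "Skylit Midtown Castle",
]

# The separator does not matter: both strings produce the same token set
same_tokens = tokens("wifi parking") == tokens("wifi, parking")

# OR semantics: a description matches if it contains either term
hits = [d for d in descriptions if matches_any("wifi, parking", d)]
print(same_tokens, hits)
```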

---
**Query an exact phrase by enclosing it in double quotation marks.**

---

In [10]:
# Query an exact phrase (double quotes inside the search string)
cur = db.airbnb.find(
        {
            '$text': {
                        '$search': '"cozy room close to subway"'
                     }
        },
        {
            'description': 1,
            'accom_id':1,
            '_id': 0
        })

for doc in cur:
    pp.pprint(doc)

{'accom_id': 48410611,
 'description': 'Cozy room close to subway, 20 mins to Midtown'}
{'accom_id': 48326279,
 'description': 'Cozy room close to subway, 20 mins to Midtown'}
{'accom_id': 43180065,
 'description': 'Cozy room close to subway, 20 mins to Manhattan'}
{'accom_id': 41495736,
 'description': 'Cozy room close to subway, 20 mins to Time Square'}


---
Full-text search also supports [negation](https://docs.mongodb.com/manual/text-search/#term-exclusion): prefixing a word with a hyphen (`-`) excludes documents that contain it.

----

In [11]:
# Term exclusion
cur = db.airbnb.find({
                        '$text': {'$search': '"cozy room close to subway" -Midtown'}
                    },
                    {
                        'description': 1,
                        '_id': 0
                    })

for doc in cur:
    pp.pprint(doc)

{'description': 'Cozy room close to subway, 20 mins to Manhattan'}
{'description': 'Cozy room close to subway, 20 mins to Time Square'}
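
The search strings used so far (plain terms, a quoted phrase, and a `-` exclusion) can be assembled programmatically. The helper below is a hypothetical convenience, not part of pymongo; it only builds the filter dictionary and assumes the phrase itself contains no double quotes.

```python
def text_filter(phrase=None, any_of=(), exclude=()):
    """Build a `$text` filter from an exact phrase, OR terms, and excluded terms."""
    parts = []
    if phrase:
        # An exact phrase is wrapped in double quotes inside the search string
        parts.append('"{}"'.format(phrase))
    # Plain terms are combined with OR by the server
    parts.extend(any_of)
    # A leading hyphen excludes documents containing the term
    parts.extend("-" + term for term in exclude)
    return {"$text": {"$search": " ".join(parts)}}


query = text_filter(phrase="cozy room close to subway", exclude=["Midtown"])
print(query)
```

The resulting dictionary can be passed as the first argument to `db.airbnb.find(...)`, in the same way as the hand-written filter in the cell above.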


---
### $meta operator 

`$text` assigns each matching document a score based on its relevance to the text query. The score is exposed through the [$meta](https://docs.mongodb.com/manual/reference/operator/aggregation/meta/#-meta) operator, which returns metadata associated with a document when performing a text search.

---

In [12]:
# Keyword relevancy

cur = db.airbnb.find(
                    # query
                    {
                        '$text': {'$search': '"cozy room close to subway"'}
                    },
                    # projection
                    {
                        # relevancy score
                        'score': {
                                    '$meta': "textScore"
                                },
                        'description': 1,
                        '_id': 0
                    })

for doc in cur:
    pp.pprint(doc)

{'description': 'Cozy room close to subway, 20 mins to Midtown',
 'score': 2.2857142857142856}
{'description': 'Cozy room close to subway, 20 mins to Midtown',
 'score': 2.2857142857142856}
{'description': 'Cozy room close to subway, 20 mins to Time Square',
 'score': 2.25}
{'description': 'Cozy room close to subway, 20 mins to Manhattan',
 'score': 2.2857142857142856}
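
The results above are not ordered by score. A minimal sketch of ranking by relevance, assuming the same `description` field: pymongo's cursor `sort` accepts `{'$meta': 'textScore'}` as the sort direction, which orders matches from most to least relevant. The function name `ranked_descriptions` and its `limit` parameter are hypothetical.

```python
def ranked_descriptions(collection, search, limit=5):
    """Return the top matches for `search`, most relevant first."""
    score = {"$meta": "textScore"}
    cursor = collection.find(
        # query
        {"$text": {"$search": search}},
        # projection, including the relevance score
        {"description": 1, "score": score, "_id": 0},
    )
    # Sorting on the textScore metadata ranks results by relevance
    return list(cursor.sort([("score", score)]).limit(limit))


# Usage against the collection from this notebook (requires a running server):
# for doc in ranked_descriptions(db.airbnb, '"cozy room close to subway"'):
#     pp.pprint(doc)
```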


----

**[Text search using aggregation pipeline](https://docs.mongodb.com/manual/tutorial/text-search-in-aggregation/#text-search-in-the-aggregation-pipeline)**

We can also perform a text search in an aggregation pipeline by using the `$text` operator in a `$match` stage, which must be the first stage of the pipeline. The relevance score is again available through the `$meta` operator.

----

In [13]:
# Aggregation pipeline

cur = db.airbnb.aggregate([
            # Stage 1 - match
            {
                '$match': {
                            '$text':{'$search': '"cozy room close to subway" -Midtown'}
                            }
            },
            # Stage 2 - project
            {
                '$project':{
                             'description':1,
                             '_id':0,
                             # Relevancy score
                             'score': {'$meta': "textScore"}
                            }
            }
])

for doc in cur:
    pp.pprint(doc)

{'description': 'Cozy room close to subway, 20 mins to Manhattan',
 'score': 2.2857142857142856}
{'description': 'Cozy room close to subway, 20 mins to Time Square',
 'score': 2.25}
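
The pipeline can also rank its results by adding a `$sort` stage on the text score. The sketch below only builds the list of stages; `text_search_pipeline` and its `limit` parameter are hypothetical names chosen for illustration.

```python
def text_search_pipeline(search, limit=5):
    """Build a pipeline that matches on text, ranks by score, and trims fields."""
    score = {"$meta": "textScore"}
    return [
        # Stage 1 - match (a $text match has to come first in the pipeline)
        {"$match": {"$text": {"$search": search}}},
        # Stage 2 - sort by relevance
        {"$sort": {"score": score}},
        # Stage 3 - keep only the top results
        {"$limit": limit},
        # Stage 4 - project the description and the score
        {"$project": {"description": 1, "_id": 0, "score": score}},
    ]


pipeline = text_search_pipeline('"cozy room close to subway" -Midtown', limit=2)
print(pipeline)
```

The list can then be passed to `db.airbnb.aggregate(pipeline)`.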


---

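**Restriction:**
- A collection can have at most one text index. The attempt below fails with `IndexOptionsConflict` because `description_index` already exists.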

---

In [14]:
# Only one text index per collection
db.airbnb.create_index([('host.name', 'text')])

OperationFailure: Index: { v: 2, key: { _fts: "text", _ftsx: 1 }, name: "host.name_text", weights: { host.name: 1 }, default_language: "english", language_override: "language", textIndexVersion: 3 } already exists with different options: { v: 2, key: { _fts: "text", _ftsx: 1 }, name: "description_index", weights: { description: 1 }, default_language: "english", language_override: "language", textIndexVersion: 3 }, full error: {'ok': 0.0, 'errmsg': 'Index: { v: 2, key: { _fts: "text", _ftsx: 1 }, name: "host.name_text", weights: { host.name: 1 }, default_language: "english", language_override: "language", textIndexVersion: 3 } already exists with different options: { v: 2, key: { _fts: "text", _ftsx: 1 }, name: "description_index", weights: { description: 1 }, default_language: "english", language_override: "language", textIndexVersion: 3 }', 'code': 85, 'codeName': 'IndexOptionsConflict'}
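
To search more than one field, a single text index can cover several fields instead. The sketch below is one possible approach, not the only one: `replace_text_index` is a hypothetical helper that drops any existing text index and creates a new one over the given fields, and the index name `listing_text` is made up for this example.

```python
def replace_text_index(collection, fields, name):
    """Drop any existing text index, then create one covering all `fields`."""
    for index_name, info in collection.index_information().items():
        # A text index reports a key entry whose kind is the string "text"
        if any(kind == "text" for _field, kind in info["key"]):
            collection.drop_index(index_name)
    # One compound text index over every requested field
    return collection.create_index([(field, "text") for field in fields], name=name)


# Usage against the collection from this notebook (requires a running server):
# replace_text_index(db.airbnb, ["description", "host.name"], "listing_text")
```

After this, a `$text` query searches `description` and `host.name` together.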

---
### Exercise 1

Count the number of documents that contain the words `park`, `subway`, and `city` in `description`.

---

---
### Exercise 2

Count the number of documents that contain either of the words `park` or `subway`, but not `city`, in `description`.

---