# MongoDB/PyMongo Indexes and Text Search

Work in driver/navigator pairs with a single laptop. Talk through each idea before you code so both partners understand the plan.

## 1. Setup

Import the libraries we need: `MongoClient` to connect to the database and `pprint` to print nested documents readably.

In [1]:
from pymongo import MongoClient
from pprint import pprint


## 2. Retrieve posts from your local MongoDB database

In [2]:
# create database connection
client = MongoClient()
db = client.mastodon_test
db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'mastodon_test')

In [3]:
# show collections in mastodon_test
db.list_collection_names()

['posts']

In [4]:
# assign variable to collection in mastodon_test
coll = db.posts
coll

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'mastodon_test'), 'posts')

In [6]:
# confirm number of documents
coll.count_documents({})

366

In [7]:
# show a sample post from the collection
sample = coll.find_one()
pprint(sample)

{'_id': ObjectId('68f5436b9bd0346fe634cd0a'),
 'account': {'acct': 'richpuchalsky@mastodon.social',
             'avatar': 'https://storage.googleapis.com/hci-social-storage/cache/accounts/avatars/112/010/066/220/976/631/original/f03fbc36a0b683f3.jpg',
             'avatar_static': 'https://storage.googleapis.com/hci-social-storage/cache/accounts/avatars/112/010/066/220/976/631/original/f03fbc36a0b683f3.jpg',
             'bot': False,
             'created_at': '2024-02-24T00:00:00.000Z',
             'discoverable': True,
             'display_name': 'Rich Puchalsky  ⩜⃝',
             'emojis': [],
             'fields': [{'name': 'blog',
                         'value': '<a href="https://rpuchalsky.blogspot.com/" '
                                  'target="_blank" rel="nofollow noopener" '
                                  'translate="no"><span '
                                  'class="invisible">https://</span><span '
                                  'class="">rpuchalsky.blogs

## 3. .explain()

In [19]:
# Start by using explain to see what MongoDB does in a basic .find() query.
exp = coll.find({}).explain()
pprint(exp)

{'command': {'$db': 'mastodon_test', 'filter': {}, 'find': 'posts'},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 366,
                                        'direction': 'forward',
                                        'docsExamined': 366,
                                        'executionTimeMillisEstimate': 0,
                                        'isCached': False,
                                        'isEOF': 1,
                                        'nReturned': 366,
                                        'needTime': 0,
                                        'needYield': 0,
                                        'restoreState': 0,
                                        'saveState': 0,
                                        'stage': 'COLLSCAN',
                                        'works': 367},
                    'executionSuccess': True,
                    'executionTimeMillis': 2,
                    'nRetur

Copy this output and paste it into ChatGPT, Claude, or Gemini, and ask it to explain what it sees in plain English, noting any indexes it sees and is using.
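
If you'd rather check the plan yourself, here is a minimal sketch that walks the `winningPlan` of a `.find()` explain and lists its stage names. The helper name `winning_stages` and the two sample documents are made up for illustration (they are heavily abridged, not real server output), and the sketch assumes the classic layout where each stage wraps one child under `inputStage`.

```python
def winning_stages(explain_doc):
    """Return the stage names of the winning plan, outermost stage first."""
    stage = explain_doc.get("queryPlanner", {}).get("winningPlan", {})
    names = []
    while stage:
        if "stage" in stage:
            names.append(stage["stage"])
        # Assumption: each stage wraps at most one child under "inputStage".
        stage = stage.get("inputStage")
    return names


# Hypothetical, heavily abridged explain documents (not real server output).
no_index = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
with_index = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "example_index"},
        }
    }
}

print(winning_stages(no_index))    # ['COLLSCAN']
print(winning_stages(with_index))  # ['FETCH', 'IXSCAN']
```

With the real output you would call `winning_stages(exp)`. A lone `COLLSCAN` means every document was read; an `IXSCAN` somewhere in the list means an index was used.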

Things get trickier in PyMongo when we want to explain aggregation queries, though. _It is more complicated than in your reading on MongoDB!_ Let's unpack this with an aggregation pipeline from last week's notebook.

In [22]:
pipeline = [
    {'$unwind': '$tags'},
    {'$project': {
        '_id': 0,
        'acct': '$account.acct',
        'created_at': 1,
        'hashtag': '$tags.name'
    }},
    {'$group': {
        '_id': '$hashtag',
        'count': {'$sum': 1}
    }},
    {'$sort': {'count': -1}},
    {'$limit': 10}
]

result = coll.aggregate(pipeline)
for doc in result:
    pprint(doc)

{'_id': 'nature', 'count': 8}
{'_id': 'nokings', 'count': 6}
{'_id': 'space', 'count': 6}
{'_id': 'Astronomy', 'count': 6}
{'_id': 'photography', 'count': 6}
{'_id': 'astro', 'count': 5}
{'_id': 'mars', 'count': 5}
{'_id': 'planets', 'count': 5}
{'_id': 'spacex', 'count': 5}
{'_id': 'nasa', 'count': 5}


If we try what we did above with the `.find()` query, it won't work: `.aggregate()` returns a `CommandCursor`, which has no `.explain()` method.

In [24]:
exp = coll.aggregate(pipeline).explain()
pprint(exp)

AttributeError: 'CommandCursor' object has no attribute 'explain'

Instead, we send MongoDB's `explain` command directly through the database object. (Hang onto this code so you can reuse it in the future.)

In [26]:
plan = coll.database.command(
    "explain",
    {
        "aggregate": coll.name,
        "pipeline": pipeline,
        "cursor": {}
    },
    verbosity="executionStats"  # keyword argument, merged into the explain command
)

pprint(plan)

{'command': {'$db': 'mastodon_test',
             'aggregate': 'posts',
             'cursor': {},
             'pipeline': [{'$unwind': '$tags'},
                          {'$project': {'_id': 0,
                                        'acct': '$account.acct',
                                        'created_at': 1,
                                        'hashtag': '$tags.name'}},
                          {'$group': {'_id': '$hashtag', 'count': {'$sum': 1}}},
                          {'$sort': {'count': -1}},
                          {'$limit': 10}]},
 'explainVersion': '1',
 'ok': 1.0,
 'queryShapeHash': '547B17851AE3498F6B69795F053BCD77795BFCEF65EEE99D0F40A18E607A0546',
 'serverInfo': {'gitVersion': '7f52660c14217ed2c8d3240f823a2291a4fe6abd',
                'host': 'Kriss-MacBook-Pro.local',
                'port': 27017,
                'version': '8.0.8'},
 'serverParameters': {'internalDocumentSourceGroupMaxMemoryBytes': 104857600,
                      'internalDocumentSour

Now copy that output, paste it into an AI chat, and ask it to explain it in plain English, specifically referencing any indexes it finds.
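
Since we'll reuse that explain call several times, one option is to wrap it in a small function. This is only a sketch: the name `explain_aggregate` is made up, and it assumes `coll` is a PyMongo collection (it relies only on `coll.name` and `coll.database.command`).

```python
def explain_aggregate(coll, pipeline, verbosity="executionStats"):
    """Ask the server to explain an aggregation pipeline.

    Assumption: `coll` behaves like a PyMongo Collection, i.e. it has a
    `.name` attribute and a `.database.command(...)` method.
    """
    return coll.database.command(
        "explain",
        {"aggregate": coll.name, "pipeline": pipeline, "cursor": {}},
        verbosity=verbosity,
    )


# Usage with the collection from this notebook (needs a running server):
# plan = explain_aggregate(coll, pipeline)
# pprint(plan)
```

The `verbosity` value can also be `"queryPlanner"` (plan only, nothing executed) or `"allPlansExecution"`.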

In [None]:
# now take ANY .aggregate() query from last week's notebook, run the explain command on it (as above), and ask an AI chat to explain it in plain English.

...

## 4. Standard indexes

Let's practice indexing with the account.acct field for faster account searching.
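
As a rough mental model of why an index helps, compare scanning every document with looking a value up in a sorted list of keys. This is plain Python, not how MongoDB is implemented (it uses B-tree indexes), and the documents and account names are invented.

```python
from bisect import bisect_left

# Hypothetical stand-ins for documents; real posts have many more fields.
docs = [
    {"_id": 1, "acct": "carol@example.social"},
    {"_id": 2, "acct": "alice@example.social"},
    {"_id": 3, "acct": "bob@example.social"},
    {"_id": 4, "acct": "alice@example.social"},
]


def collection_scan(docs, acct):
    """No index: every document has to be examined."""
    examined = 0
    hits = []
    for doc in docs:
        examined += 1
        if doc["acct"] == acct:
            hits.append(doc["_id"])
    return hits, examined


# "Index": (key, _id) pairs kept in sorted order, built once up front.
index = sorted((doc["acct"], doc["_id"]) for doc in docs)


def index_scan(index, acct):
    """With an index: binary-search to the first match, then read only matches."""
    pos = bisect_left(index, (acct,))
    hits = []
    while pos < len(index) and index[pos][0] == acct:
        hits.append(index[pos][1])
        pos += 1
    return hits, len(hits)  # second value: number of matching index entries read


print(collection_scan(docs, "alice@example.social"))  # ([2, 4], 4)
print(index_scan(index, "alice@example.social"))      # ([2, 4], 2)
```

The second number in each result plays the role of `docsExamined` in the explain output: the scan touches everything, the index lookup touches only the matches.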

In [29]:
# Pick a username from your database

username = coll.find_one()['account']['acct']
print(username)

richpuchalsky@mastodon.social


In [30]:
# search for all posts from that user and run .explain()
# Use AI (or your own eyes -- look at 'winningPlan', do you see 'COLLSCAN'?) to ensure that it is not using any indexes

posts = coll.find({'account.acct': username})
pprint(posts.explain())

{'command': {'$db': 'mastodon_test',
             'filter': {'account.acct': 'richpuchalsky@mastodon.social'},
             'find': 'posts'},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 1,
                                        'direction': 'forward',
                                        'docsExamined': 366,
                                        'executionTimeMillisEstimate': 0,
                                        'filter': {'account.acct': {'$eq': 'richpuchalsky@mastodon.social'}},
                                        'isCached': False,
                                        'isEOF': 1,
                                        'nReturned': 1,
                                        'needTime': 365,
                                        'needYield': 0,
                                        'restoreState': 0,
                                        'saveState': 0,
                                        'stage': 'COLL

In [31]:
# now let's index that field

coll.create_index([("account.acct", 1)],
                  name="acct_index") # name optional, but can help debug

'acct_index'

In [32]:
# rerun your explain and explore the output to ensure the index is being used
# use AI to help analyze your output if necessary (or just look for 'winningPlan' -- do you see an index? named acct_index?)

posts = coll.find({'account.acct': username})
pprint(posts.explain())

{'command': {'$db': 'mastodon_test',
             'filter': {'account.acct': 'richpuchalsky@mastodon.social'},
             'find': 'posts'},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 1,
                                        'alreadyHasObj': 0,
                                        'docsExamined': 1,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 1,
                                                       'direction': 'forward',
                                                       'dupsDropped': 0,
                                                       'dupsTested': 0,
                                                       'executionTimeMillisEstimate': 0,
                                                       'indexBounds': {'account.acct': ['["richpuchalsky@mastodon.social", '
                                                         

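Now let's do one on followers_count. Since we'll typically be looking for accounts with the MOST followers, we'll create the index in DESCENDING order. (For a single-field index, MongoDB can walk the index in either direction, so an ascending index would serve this sort too; direction matters more for compound indexes.)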

In [35]:
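# first, run a find query that returns the 10 posts (not accounts!) whose authors have the most followers, and add .explain().
# which stage does the winning plan use?
# (if your posts keep this count nested as account.followers_count, as the Mastodon API returns it, use that dotted path here and in the index below)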

exp = coll.find({}).sort('followers_count', -1).limit(10).explain()

pprint(exp)

{'command': {'$db': 'mastodon_test',
             'filter': {},
             'find': 'posts',
             'limit': 10,
             'singleBatch': True,
             'sort': {'followers_count': -1}},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 10,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 366,
                                                       'direction': 'forward',
                                                       'docsExamined': 366,
                                                       'executionTimeMillisEstimate': 0,
                                                       'isEOF': 1,
                                                       'nReturned': 366,
                                                       'needTime': 0,
                                                       'needYield': 0,
                           

In [36]:
# now add a DESCENDING index on that field

coll.create_index([("followers_count", -1)],
                  name="followers_desc_index")

'followers_desc_index'
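
To picture what this index buys a sort-plus-limit query, here is a plain-Python sketch with invented follower counts. It is only an analogy for what the server does.

```python
import heapq

# Hypothetical follower counts, one per document.
followers = [120, 5, 9800, 42, 310, 77, 1500]

# Without an index: every value must be examined to find the top 3.
top3_scan = heapq.nlargest(3, followers)

# With a descending index: the values are already stored in order,
# so the top 3 are simply the first 3 index entries.
descending_index = sorted(followers, reverse=True)  # built once, then maintained on writes
top3_index = descending_index[:3]

print(top3_scan)   # [9800, 1500, 310]
print(top3_index)  # [9800, 1500, 310]
```

Both give the same answer. The difference is how many values have to be read on each query, which is what `docsExamined` reports.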

In [37]:
# rerun the explain to ensure the index is being used

exp = coll.find({}).sort('followers_count', -1).limit(10).explain()

pprint(exp)

{'command': {'$db': 'mastodon_test',
             'filter': {},
             'find': 'posts',
             'limit': 10,
             'singleBatch': True,
             'sort': {'followers_count': -1}},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 10,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 10,
                                                       'alreadyHasObj': 0,
                                                       'docsExamined': 10,
                                                       'executionTimeMillisEstimate': 0,
                                                       'inputStage': {'advanced': 10,
                                                                      'direction': 'forward',
                                                                      'dupsDropped': 0,
                                              

## 5. Array indexes

To index every element in an array (like tags or media_attachments), just create an index on that field. MongoDB automatically indexes each element of the array (this is called a multikey index). Note, though, that these indexes are more "expensive" (they take up more space and add more time when inserting new documents) than an index on a field that holds one value per document (like account.acct or content).
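
A simplified picture of why array indexes cost more: the index gets one entry per array element rather than one per document. The documents below are invented and only loosely shaped like our posts, and the dictionary stands in for the real index structure.

```python
from collections import defaultdict

# Hypothetical documents shaped loosely like the posts in this notebook.
docs = [
    {"_id": 1, "tags": [{"name": "nature"}, {"name": "photography"}]},
    {"_id": 2, "tags": [{"name": "space"}]},
    {"_id": 3, "tags": []},
    {"_id": 4, "tags": [{"name": "nature"}, {"name": "space"}, {"name": "mars"}]},
]

# One index entry per array element, not one per document.
tag_index = defaultdict(list)
entry_count = 0
for doc in docs:
    for tag in doc["tags"]:
        tag_index[tag["name"]].append(doc["_id"])
        entry_count += 1

print(tag_index["nature"])  # [1, 4]
print(entry_count)          # 6
```

Four documents produce six entries here, and every new post with several hashtags adds several more.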

In [38]:
# grab the grouping query from last week's notebook that returns the top 10 hashtags (there were two, either one is fine). Run just that query.

pipeline = [
    {'$unwind': '$tags'},
    {'$group': {
        '_id': '$tags.name',
        'count': {'$sum': 1}
    }},
    {'$sort': {'count': -1}},
    {'$limit': 10}
]

result = coll.aggregate(pipeline)
for doc in result:
    pprint(doc)

{'_id': 'nature', 'count': 8}
{'_id': 'space', 'count': 6}
{'_id': 'photography', 'count': 6}
{'_id': 'nokings', 'count': 6}
{'_id': 'Astronomy', 'count': 6}
{'_id': 'astrology', 'count': 5}
{'_id': 'spaceexploration', 'count': 5}
{'_id': 'solarsystem', 'count': 5}
{'_id': 'esa', 'count': 5}
{'_id': 'spacex', 'count': 5}


In [39]:
# now run explain() on this query (remember that aggregate requires a more involved method, but you can just copy and paste from above).
# This output will be more complicated, so you may want to use AI to help explain it to you and check for index usage.

plan = coll.database.command(
    "explain",
    {
        "aggregate": coll.name,
        "pipeline": pipeline,
        "cursor": {}
    },
    verbosity="executionStats"  # keyword argument, merged into the explain command
)

pprint(plan)

{'command': {'$db': 'mastodon_test',
             'aggregate': 'posts',
             'cursor': {},
             'pipeline': [{'$unwind': '$tags'},
                          {'$group': {'_id': '$tags.name',
                                      'count': {'$sum': 1}}},
                          {'$sort': {'count': -1}},
                          {'$limit': 10}]},
 'explainVersion': '1',
 'ok': 1.0,
 'queryShapeHash': 'BC0B12B3189A69E070017BC31B17B9BD7BAD26AFB03B4527FDDCE5E300DEF7B5',
 'serverInfo': {'gitVersion': '7f52660c14217ed2c8d3240f823a2291a4fe6abd',
                'host': 'Kriss-MacBook-Pro.local',
                'port': 27017,
                'version': '8.0.8'},
 'serverParameters': {'internalDocumentSourceGroupMaxMemoryBytes': 104857600,
                      'internalDocumentSourceSetWindowFieldsMaxMemoryBytes': 104857600,
                      'internalLookupStageIntermediateDocumentMaxSizeBytes': 104857600,
                      'internalQueryFacetBufferSizeBytes': 1048576

In [43]:
# now add an index to the hashtag name field, ascending is fine.

coll.create_index(
    [("tags.name", 1)],
    name="tags_name_index"
)


'tags_name_index'

In [44]:
# now rerun the above explain to ensure the index is being used.

plan = coll.database.command(
    "explain",
    {
        "aggregate": coll.name,
        "pipeline": pipeline,
        "cursor": {}
    },
    verbosity="executionStats"  # keyword argument, merged into the explain command
)

pprint(plan)

{'command': {'$db': 'mastodon_test',
             'aggregate': 'posts',
             'cursor': {},
             'pipeline': [{'$unwind': '$tags'},
                          {'$group': {'_id': '$tags.name',
                                      'count': {'$sum': 1}}},
                          {'$sort': {'count': -1}},
                          {'$limit': 10}]},
 'explainVersion': '1',
 'ok': 1.0,
 'queryShapeHash': 'BC0B12B3189A69E070017BC31B17B9BD7BAD26AFB03B4527FDDCE5E300DEF7B5',
 'serverInfo': {'gitVersion': '7f52660c14217ed2c8d3240f823a2291a4fe6abd',
                'host': 'Kriss-MacBook-Pro.local',
                'port': 27017,
                'version': '8.0.8'},
 'serverParameters': {'internalDocumentSourceGroupMaxMemoryBytes': 104857600,
                      'internalDocumentSourceSetWindowFieldsMaxMemoryBytes': 104857600,
                      'internalLookupStageIntermediateDocumentMaxSizeBytes': 104857600,
                      'internalQueryFacetBufferSizeBytes': 1048576

In [45]:
# explain a find() that filters on tags.name, so the new index has something to match on
coll.find({'tags.name': 'nature'}).explain()

{'explainVersion': '1',
 'queryPlanner': {'namespace': 'mastodon_test.posts',
  'parsedQuery': {'tags.name': {'$eq': 'nature'}},
  'indexFilterSet': False,
  'queryHash': '8818BAD0',
  'planCacheShapeHash': '8818BAD0',
  'planCacheKey': 'C77E1090',
  'optimizationTimeMillis': 3,
  'maxIndexedOrSolutionsReached': False,
  'maxIndexedAndSolutionsReached': False,
  'maxScansToExplodeReached': False,
  'prunedSimilarIndexes': False,
  'winningPlan': {'isCached': False,
   'stage': 'FETCH',
   'inputStage': {'stage': 'IXSCAN',
    'keyPattern': {'tags.name': 1},
    'indexName': 'tags_name_index',
    'isMultiKey': True,
    'multiKeyPaths': {'tags.name': ['tags']},
    'isUnique': False,
    'isSparse': False,
    'isPartial': False,
    'indexVersion': 2,
    'direction': 'forward',
    'indexBounds': {'tags.name': ['["nature", "nature"]']}}},
  'rejectedPlans': []},
 'executionStats': {'executionSuccess': True,
  'nReturned': 8,
  'executionTimeMillis': 6,
  'totalKeysExamined': 8,
  'to
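
The counters in `executionStats` are a quick way to judge how much work a query did. The sketch below pulls out the headline numbers. The helper name `scan_summary` is made up, and the two sample documents are abridged, hypothetical values rather than real output.

```python
def scan_summary(explain_doc):
    """Pull the headline counters out of an explain document's executionStats."""
    stats = explain_doc.get("executionStats", {})
    returned = stats.get("nReturned", 0)
    docs_examined = stats.get("totalDocsExamined", 0)
    return {
        "returned": returned,
        "keys_examined": stats.get("totalKeysExamined", 0),
        "docs_examined": docs_examined,
        # 1.0 means every document examined was also returned.
        "docs_per_result": docs_examined / returned if returned else None,
    }


# Hypothetical, abridged values for illustration only.
before = {"executionStats": {"nReturned": 8, "totalKeysExamined": 0, "totalDocsExamined": 366}}
after = {"executionStats": {"nReturned": 8, "totalKeysExamined": 8, "totalDocsExamined": 8}}

print(scan_summary(before)["docs_per_result"])  # 45.75
print(scan_summary(after)["docs_per_result"])   # 1.0
```

A ratio near 1 suggests the index is doing its job. A large ratio means many documents were read only to be thrown away.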

## 6. Text indexes and searches

To perform a keyword or partial text search (like SQL's `WHERE content LIKE '%keyword%'`), we can use `$regex`.
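
To get a feel for what an unanchored pattern matches, here is the same idea in Python's `re` module with invented post contents. MongoDB uses its own regex engine, so treat this as an approximation of `$regex` with and without the `i` option.

```python
import re

# Hypothetical post contents.
contents = [
    "<p>A walk in nature</p>",
    "<p>#NaturePhotography</p>",
    "<p>Denatured alcohol</p>",
    "<p>Mars rover update</p>",
]

# Unanchored pattern: matches anywhere in the string, even inside longer words.
case_sensitive = [c for c in contents if re.search("nature", c)]
case_insensitive = [c for c in contents if re.search("nature", c, re.IGNORECASE)]

print(len(case_sensitive))    # 2
print(len(case_insensitive))  # 3
```

The case-sensitive search finds "nature" and "Denatured" but misses "NaturePhotography". Ignoring case picks that one up as well.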

In [46]:
# perform a search for the word 'nature' somewhere in the post content

cursor = coll.find({'content': { '$regex': 'nature'}})

for post in cursor:
    pprint(post['content'])

('<p>Large fallen tree covered in moss and lichen in <a '
 'href="https://mstdn.ca/tags/Algonquin" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Algonquin</span></a> Park. <a '
 'href="https://mstdn.ca/tags/Ontario" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Ontario</span></a> <a '
 'href="https://mstdn.ca/tags/Canada" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Canada</span></a> </p><p><a '
 'href="https://mstdn.ca/tags/nature" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>nature</span></a> <a '
 'href="https://mstdn.ca/tags/NaturePhotography" class="mention hashtag" '
 'rel="nofollow noopener" '
 'target="_blank">#<span>NaturePhotography</span></a></p>')
('<p>Large fallen tree covered in moss and lichen in <a '
 'href="https://mstdn.ca/tags/Algonquin" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Algonquin</span></a> Park. <a '
 'href="

In [47]:
# add `$options: 'i'` to make it case-insensitive

cursor = coll.find({'content': { '$regex': 'nature',
                                 '$options': 'i'}})

for post in cursor:
    pprint(post['content'])

('<p>Large fallen tree covered in moss and lichen in <a '
 'href="https://mstdn.ca/tags/Algonquin" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Algonquin</span></a> Park. <a '
 'href="https://mstdn.ca/tags/Ontario" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Ontario</span></a> <a '
 'href="https://mstdn.ca/tags/Canada" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Canada</span></a> </p><p><a '
 'href="https://mstdn.ca/tags/nature" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>nature</span></a> <a '
 'href="https://mstdn.ca/tags/NaturePhotography" class="mention hashtag" '
 'rel="nofollow noopener" '
 'target="_blank">#<span>NaturePhotography</span></a></p>')
('<p>Large fallen tree covered in moss and lichen in <a '
 'href="https://mstdn.ca/tags/Algonquin" class="mention hashtag" rel="nofollow '
 'noopener" target="_blank">#<span>Algonquin</span></a> Park. <a '
 'href="

For the most robust text search in MongoDB, we'd have to use their Atlas platform. Most significantly, Atlas is required for partial-word matches with wildcards. As long as we only need full-word searches, though, we can run them locally, and we can create a text index to support them.
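
The difference between partial-word and full-word matching can be sketched in plain Python. This is only a rough stand-in: a real text index also stems words and drops common stop words, which the sketch does not attempt, and both function names are invented.

```python
import re


def substring_match(text, needle):
    """Roughly what an unanchored, case-insensitive regex search does."""
    return needle.lower() in text.lower()


def whole_word_match(text, word):
    """Rough stand-in for a full-word search: compare against whole tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return word.lower() in tokens


text = "Denatured alcohol is not found in nature"

print(substring_match(text, "nat"))      # True
print(whole_word_match(text, "nat"))     # False
print(whole_word_match(text, "nature"))  # True
```

A fragment like "nat" matches as a substring but not as a whole word, which is the kind of query a local text index cannot answer.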

In [48]:
# first, run the above search with .explain() to see it without an index

exp = coll.find({'content': { '$regex': 'nature',
                              '$options': 'i'}})

pprint(exp.explain())

{'command': {'$db': 'mastodon_test',
             'filter': {'content': {'$options': 'i', '$regex': 'nature'}},
             'find': 'posts'},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 12,
                                        'direction': 'forward',
                                        'docsExamined': 366,
                                        'executionTimeMillisEstimate': 0,
                                        'filter': {'content': {'$options': 'i',
                                                               '$regex': 'nature'}},
                                        'isCached': False,
                                        'isEOF': 1,
                                        'nReturned': 12,
                                        'needTime': 354,
                                        'needYield': 0,
                                        'restoreState': 0,
                                        'saveState':

In [52]:
# now create a TEXT index on content
# note the import statement -- you only need to do that once
# Also note the default_language and language_override options. By default, a text index looks at each
# document's `language` field to choose language-specific parsing rules. Mastodon posts include a language
# field, but it's not always populated, which can confuse MongoDB. Pointing language_override at a field
# that doesn't exist in our documents sidesteps the problem. Just use this code with this data.

from pymongo import TEXT
coll.create_index([("content", TEXT)],
                  name="content_text_index",
                  default_language="english",
                  language_override="__no_override__"
                  )

'content_text_index'

In [54]:
# now rerun the keyword search (this time without $options, so it is case-sensitive) to see whether the index is being used

exp = coll.find({'content': { '$regex': 'nature'}})

pprint(exp.explain())

{'command': {'$db': 'mastodon_test',
             'filter': {'content': {'$regex': 'nature'}},
             'find': 'posts'},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 11,
                                        'direction': 'forward',
                                        'docsExamined': 366,
                                        'executionTimeMillisEstimate': 0,
                                        'filter': {'content': {'$regex': 'nature'}},
                                        'isCached': False,
                                        'isEOF': 1,
                                        'nReturned': 11,
                                        'needTime': 355,
                                        'needYield': 0,
                                        'restoreState': 0,
                                        'saveState': 0,
                                        'stage': 'COLLSCAN',
                                

In [56]:
# $regex doesn't work with the TEXT index we created. We have to do the text search a little differently. (This method ONLY works with a TEXT index.)
# Again, just copy and paste this code, modifying as necessary, for future projects, including the midterm.

cursor = coll.find(
    {"$text": {"$search": "nature"}},
    {"content": 1}
)

pprint(cursor.explain())

{'command': {'$db': 'mastodon_test',
             'filter': {'$text': {'$search': 'nature'}},
             'find': 'posts',
             'projection': {'content': 1}},
 'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 12,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 12,
                                                       'docsRejected': 0,
                                                       'executionTimeMillisEstimate': 0,
                                                       'indexName': 'content_text_index',
                                                       'indexPrefix': {},
                                                       'inputStage': {'advanced': 12,
                                                                      'alreadyHasObj': 0,
                                                                      'docsExamined
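
The `$search` string has a little syntax of its own: space-separated words match if any of them appears, a phrase goes inside double quotes, and a leading `-` excludes a word. Here is a small sketch for building such filters; the helper name `text_filter` and the example words are made up.

```python
def text_filter(terms=(), phrase=None, exclude=()):
    """Build a {"$text": {"$search": ...}} filter document.

    terms   -> words to look for (any of them may match)
    phrase  -> exact phrase, wrapped in double quotes inside the search string
    exclude -> words that must not appear, each prefixed with "-"
    """
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')
    parts.extend(f"-{word}" for word in exclude)
    return {"$text": {"$search": " ".join(parts)}}


print(text_filter(terms=["nature", "space"]))
# {'$text': {'$search': 'nature space'}}
print(text_filter(phrase="fallen tree", exclude=["mars"]))
# {'$text': {'$search': '"fallen tree" -mars'}}

# Usage with the collection from this notebook (needs the text index):
# cursor = coll.find(text_filter(terms=["nature"]), {"content": 1})
```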

In [58]:
# Now let's use it in an aggregation pipeline.
# Perform the same search as a $match, then return the three matching posts with the highest followers_count.
# Return only the content and followers_count fields.

pipeline = [
    {"$match": {"$text": {"$search": "nature"}}},   # uses the text index
    {"$sort": {"followers_count": -1}},             # sort top followers
    {"$limit": 3},
    {"$project": {"_id": 0, "content": 1, "followers_count": 1}}
]

cursor = coll.aggregate(pipeline)

for doc in cursor:
    pprint(doc)

{'content': '<p>Large fallen tree covered in moss and lichen in <a '
            'href="https://mstdn.ca/tags/Algonquin" class="mention hashtag" '
            'rel="nofollow noopener" '
            'target="_blank">#<span>Algonquin</span></a> Park. <a '
            'href="https://mstdn.ca/tags/Ontario" class="mention hashtag" '
            'rel="nofollow noopener" target="_blank">#<span>Ontario</span></a> '
            '<a href="https://mstdn.ca/tags/Canada" class="mention hashtag" '
            'rel="nofollow noopener" target="_blank">#<span>Canada</span></a> '
            '</p><p><a href="https://mstdn.ca/tags/nature" class="mention '
            'hashtag" rel="nofollow noopener" '
            'target="_blank">#<span>nature</span></a> <a '
            'href="https://mstdn.ca/tags/NaturePhotography" class="mention '
            'hashtag" rel="nofollow noopener" '
            'target="_blank">#<span>NaturePhotography</span></a></p>'}
{'content': '<p>Reports on the homosexual behaviour 

In [60]:
# Now run that same pipeline with explain to see if it's using the text index.
# By the way, ... is it using the followers_count index? Why or why not?

plan = coll.database.command(
    "explain",
    {
        "aggregate": coll.name,
        "pipeline": pipeline,
        "cursor": {}
    },
    verbosity="executionStats"  # keyword argument, merged into the explain command
)

pprint(plan)

{'command': {'$db': 'mastodon_test',
             'aggregate': 'posts',
             'cursor': {},
             'pipeline': [{'$match': {'$text': {'$search': 'nature'}}},
                          {'$sort': {'followers_count': -1}},
                          {'$limit': 3},
                          {'$project': {'_id': 0,
                                        'content': 1,
                                        'followers_count': 1}}]},
 'executionStats': {'executionStages': {'advanced': 3,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 3,
                                                       'executionTimeMillisEstimate': 0,
                                                       'inputStage': {'advanced': 12,
                                                                      'docsRejected': 0,
                                                                      'executionTimeMillisEst
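
Aggregation explains can nest their plans in different places (for example under a `stages` list), so a recursive search is one way to find every stage and index name without knowing the exact layout. This is a sketch: the helper name `collect` is made up, and the sample plan is hypothetical and abridged.

```python
def collect(doc, key):
    """Recursively gather every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(doc, dict):
        for k, v in doc.items():
            if k == key:
                found.append(v)
            found.extend(collect(v, key))
    elif isinstance(doc, list):
        for item in doc:
            found.extend(collect(item, key))
    return found


# Hypothetical, abridged aggregation explain (not real server output).
plan = {
    "stages": [
        {"$cursor": {"queryPlanner": {"winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "example_index"},
        }}}},
        {"$group": {"_id": "$field"}},
    ]
}

print(collect(plan, "stage"))      # ['FETCH', 'IXSCAN']
print(collect(plan, "indexName"))  # ['example_index']
```

On real output this also picks up stages from rejected plans, so read the result as "mentioned somewhere in the explain" rather than "definitely used".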

# Save this file WITH ALL OUTPUT SHOWING and submit to Canvas