
# MongoDB Atlas with Python: Using the `sample_mflix` Dataset
This notebook introduces the basics of querying, filtering, and aggregating data in MongoDB using the **sample_mflix** dataset. The dataset holds movie-related collections, including movies, comments, theaters, and users. You'll learn how to:
- Connect to MongoDB Atlas
- Perform basic queries and filtering
- Execute advanced operations like aggregation
- Create indexes and update/delete documents

Let's get started!

In [3]:
!pip install --upgrade pymongo certifi

Collecting pymongo
  Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.10.1


## 1. Setup and Connection to MongoDB Atlas


In [4]:

# Import necessary libraries (pymongo is installed in the cell above)
from pymongo import MongoClient
import pprint

# Replace USERNAME and PASSWORD with your own MongoDB Atlas credentials
connection_string = "mongodb+srv://USERNAME:PASSWORD@ndcluster.nqvkw.mongodb.net/?retryWrites=true&w=majority&appName=NDCluster"

# Connect to MongoDB Atlas
client = MongoClient(connection_string)

# Access the sample_mflix database and the movies collection
db = client['sample_mflix']
collection = db['movies']
print(collection.find_one())


{'_id': ObjectId('573a1393f29313caabcdcb42'), 'plot': 'Kate and her actor brother live in N.Y. in the 21st Century. Her ex-boyfriend, Stuart, lives above her apartment. Stuart finds a space near the Brooklyn Bridge where there is a gap in time....', 'genres': ['Comedy', 'Fantasy', 'Romance'], 'runtime': 118, 'metacritic': 44, 'rated': 'PG-13', 'cast': ['Meg Ryan', 'Hugh Jackman', 'Liev Schreiber', 'Breckin Meyer'], 'poster': 'https://m.media-amazon.com/images/M/MV5BNmNlN2VlOTctYTdhMS00NzUxLTg0ZGMtYWE2ZTJmMThlMTk2XkEyXkFqcGdeQXVyMzI0NDc4ODY@._V1_SY1000_SX677_AL_.jpg', 'title': 'Kate & Leopold', 'fullplot': "Kate and her actor brother live in N.Y. in the 21st Century. Her ex-boyfriend, Stuart, lives above her apartment. Stuart finds a space near the Brooklyn Bridge where there is a gap in time. He goes back to the 19th Century and takes pictures of the place. Leopold -- a man living in the 1870s -- is puzzled by Stuart's tiny camera, follows him back through the gap, and they both ended 
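
Hard-coding credentials in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative, assuming the connection string is kept in an environment variable. The variable name `MONGODB_URI` and the helper `load_connection_string` are illustrative choices, not part of the original notebook:

```python
import os


def load_connection_string(env_var="MONGODB_URI"):
    """Return the connection string stored in an environment variable.

    The variable name MONGODB_URI is an assumption made for this sketch.
    """
    uri = os.environ.get(env_var)
    if not uri:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return uri


# Usage (assumes the variable was set before the notebook started):
# client = MongoClient(load_connection_string())
```

The helper fails early with a clear message when the variable is missing, rather than letting the driver report a less obvious connection error.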

## 2. Basic MongoDB Commands
### Searching for Documents (Basic Query)


In [4]:

# Find one document from the movies collection
document = collection.find_one()
pprint.pprint(document)


{'_id': ObjectId('573a1390f29313caabcd4eaf'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Jane Gail', 'Ethel Grandin', 'William H. Turner', 'Matt Moore'],
 'countries': ['USA'],
 'directors': ['George Loane Tucker'],
 'genres': ['Crime', 'Drama'],
 'imdb': {'id': 3471, 'rating': 6.0, 'votes': 371},
 'languages': ['English'],
 'lastupdated': '2015-09-15 02:07:14.247000000',
 'num_mflix_comments': 1,
 'plot': 'A woman, with the aid of her police officer sweetheart, endeavors to '
         'uncover the prostitution ring that has kidnapped her sister, and the '
         'philanthropist who secretly runs it.',
 'poster': 'https://m.media-amazon.com/images/M/MV5BYzk0YWQzMGYtYTM5MC00NjM2LWE5YzYtMjgyNDVhZDg1N2YzXkEyXkFqcGdeQXVyMzE0MjY5ODA@._V1_SY1000_SX677_AL_.jpg',
 'rated': 'TV-PG',
 'released': datetime.datetime(1913, 11, 24, 0, 0),
 'runtime': 88,
 'title': 'Traffic in Souls',
 'tomatoes': {'dvd': datetime.datetime(2008, 8, 26, 0, 0),
              'lastUpdated':

### Searching with a Filter (Filtering)


In [5]:

# Find movies whose genres array contains "Action" (first 5 results)
action_movies = collection.find({"genres": "Action"}).limit(5)

# Print the results
for movie in action_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a1391f29313caabcd8319'),
 'awards': {'nominations': 1, 'text': '1 nomination.', 'wins': 0},
 'cast': ['Harold Lloyd', 'Jobyna Ralston', 'Noah Young', 'Jim Mason'],
 'countries': ['USA'],
 'directors': ['Sam Taylor'],
 'fullplot': 'The Uptown Boy, J. Harold Manners (Lloyd) is a millionaire '
             'playboy who falls for the Downtown Girl, Hope (Ralston) who '
             "works in Brother Paul's (Weigel) mission. In order to build up "
             "attendance, and win Hope's attention, Harold runs through town "
             'causing trouble, and winds up with a crowd chasing him right '
             'into the mission. He eventually wins the girl and they marry, '
             'but not without some interference from his high-brow friends.',
 'genres': ['Action', 'Comedy', 'Romance'],
 'imdb': {'id': 16895, 'rating': 7.6, 'votes': 918},
 'languages': ['English'],
 'lastupdated': '2015-08-19 00:35:01.953000000',
 'num_mflix_comments': 0,
 'plot': 'An irrespon

### Sorting Results


In [None]:

# Find and sort movies by release year in descending order
sorted_movies = collection.find().sort("year", -1).limit(5)

# Print the sorted results
for movie in sorted_movies:
    pprint.pprint(movie)


{'_id': ObjectId('573a13eaf29313caabdcfbc1'),
 'awards': {'nominations': 4,
            'text': 'Nominated for 2 Primetime Emmys. Another 1 win & 4 '
                    'nominations.',
            'wins': 3},
 'cast': ['Meryl Streep',
          'Edward Herrmann',
          'Doris Kearns Goodwin',
          'Franklin D. Roosevelt'],
 'countries': ['USA'],
 'fullplot': 'A documentary that weaves together the stories of Theodore, '
             'Franklin and Eleanor Roosevelt, three members of one of the most '
             'prominent and influential families in American politics.',
 'genres': ['Documentary'],
 'imdb': {'id': 3400010, 'rating': 8.8, 'votes': 682},
 'languages': ['English'],
 'lastupdated': '2015-08-23 00:10:24.657000000',
 'num_mflix_comments': 1,
 'plot': 'A documentary that weaves together the stories of Theodore, Franklin '
         'and Eleanor Roosevelt, three members of one of the most prominent '
         'and influential families in American politics.',
 'poster'

### Searching with Multiple Conditions


In [None]:

# Find movies where the genre is "Action" and the rating is greater than 8
multi_condition_query = {"genres": "Action", "imdb.rating": {"$gt": 8}}

# Execute the query
results = collection.find(multi_condition_query).limit(5)

# Print the results
for result in results:
    pprint.pprint(result)


{'_id': ObjectId('573a1395f29313caabce2498'),
 'awards': {'nominations': 1, 'text': '1 win & 1 nomination.', 'wins': 1},
 'cast': ['Clint Eastwood',
          'Marianne Koch',
          'Gian Maria Volontè',
          'Wolfgang Lukschy'],
 'countries': ['Italy', 'Spain', 'West Germany'],
 'directors': ['Sergio Leone'],
 'fullplot': 'An anonymous, but deadly man rides into a town torn by war '
             "between two factions, the Baxters and the Rojo's. Instead of "
             'fleeing or dying, as most other would do, the man schemes to '
             'play the two sides off each other, getting rich in the bargain.',
 'genres': ['Action', 'Drama', 'Western'],
 'imdb': {'id': 58461, 'rating': 8.1, 'votes': 126585},
 'languages': ['Italian', 'Spanish', 'English'],
 'lastupdated': '2015-09-02 00:17:22.303000000',
 'num_mflix_comments': 0,
 'plot': 'A wandering gunfighter plays two rival families against each other '
         'in a town torn apart by greed, pride, and revenge.',
 'pos
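
To make the filter semantics concrete, here is a small pure-Python sketch that mimics how the query above is evaluated: a scalar condition on an array field matches when any element equals it, a dotted key such as `imdb.rating` walks into the nested document, and `$gt` compares values. The helpers `get_path` and `matches` and the sample documents are hypothetical and cover only these three cases. This is an illustration, not how MongoDB itself is implemented.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as "imdb.rating" through nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(doc, query):
    """Return True if doc satisfies every condition in query (implicit AND)."""
    for key, condition in query.items():
        value = get_path(doc, key)
        if isinstance(condition, dict) and "$gt" in condition:
            # Only numeric comparison is emulated in this sketch.
            if not isinstance(value, (int, float)) or not value > condition["$gt"]:
                return False
        elif isinstance(value, list):
            # A scalar condition matches an array field if any element equals it.
            if condition not in value:
                return False
        elif value != condition:
            return False
    return True


# Made-up documents, shaped like the movies collection.
sample_movies = [
    {"title": "A", "genres": ["Action", "Drama"], "imdb": {"rating": 8.1}},
    {"title": "B", "genres": ["Action"], "imdb": {"rating": 7.0}},
    {"title": "C", "genres": ["Comedy"], "imdb": {"rating": 9.0}},
]
query = {"genres": "Action", "imdb.rating": {"$gt": 8}}
print([m["title"] for m in sample_movies if matches(m, query)])  # ['A']
```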

## 3. Advanced MongoDB Operations
### Aggregation Example: Average IMDb Rating by Genre


In [None]:

# Aggregation pipeline to calculate average IMDb rating by genre
aggregation_pipeline = [
    {"$unwind": "$genres"},  # Separate each movie's genres into individual documents
    {"$group": {"_id": "$genres", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Execute the aggregation
aggregated_data = collection.aggregate(aggregation_pipeline)

# Print the results
for data in aggregated_data:
    pprint.pprint(data)
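
The cell above has no recorded output, so the following pure-Python sketch walks through what each pipeline stage does on three made-up documents. The function `average_rating_by_genre` is hypothetical and only approximates the server's behavior: `$avg` ignores non-numeric values, and here a genre with no numeric rating is simply omitted, whereas MongoDB would report it with a null average.

```python
from collections import defaultdict


def average_rating_by_genre(movies, limit=5):
    """Emulate $unwind -> $group/$avg -> $sort -> $limit in plain Python."""
    ratings = defaultdict(list)
    for movie in movies:
        rating = movie.get("imdb", {}).get("rating")
        for genre in movie.get("genres", []):  # $unwind: one row per genre
            if isinstance(rating, (int, float)):  # $avg skips non-numeric values
                ratings[genre].append(rating)
    averages = [
        {"_id": genre, "avg_rating": sum(values) / len(values)}  # $group + $avg
        for genre, values in ratings.items()
    ]
    averages.sort(key=lambda row: row["avg_rating"], reverse=True)  # $sort: -1
    return averages[:limit]  # $limit


# Made-up documents, shaped like the movies collection.
sample = [
    {"genres": ["Action", "Drama"], "imdb": {"rating": 8.0}},
    {"genres": ["Drama"], "imdb": {"rating": 6.0}},
    {"genres": ["Comedy"], "imdb": {"rating": ""}},
]
for row in average_rating_by_genre(sample):
    print(row)
```

The two-genre movie contributes its rating to both "Action" and "Drama", which is the effect `$unwind` has before `$group` runs.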


### Indexing: Creating an Index


In [None]:

# Create an index on the "year" field to improve query performance for year-related searches
collection.create_index([("year", 1)])

# Show existing indexes
indexes = collection.index_information()
pprint.pprint(indexes)


### Updating Documents


In [None]:

# Update a movie's IMDb rating (change the rating of a specific movie)
collection.update_one({"title": "The Godfather"}, {"$set": {"imdb.rating": 9.3}})


### Deleting Documents


In [None]:

# Delete every movie released before 1950.
# Left commented out because the deletion is permanent.
# collection.delete_many({"year": {"$lt": 1950}})


## 4. Exercises for Hands-On Practice
### Exercise 1: Searching and Filtering
**Task**: Find all movies where the genre is 'Comedy' and the IMDb rating is greater than 7.


In [None]:

# Your task: Write a query to find comedies with an IMDb rating greater than 7
comedies = collection.find({"genres": "Comedy", "imdb.rating": {"$gt": 7}}).limit(5)

# Print the first 5 results
for comedy in comedies:
    pprint.pprint(comedy)


{'_id': ObjectId('573a1390f29313caabcd4803'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Winsor McCay'],
 'countries': ['USA'],
 'directors': ['Winsor McCay', 'J. Stuart Blackton'],
 'fullplot': 'Cartoonist Winsor McCay agrees to create a large set of drawings '
             'that will be photographed and made into a motion picture. The '
             'job requires plenty of drawing supplies, and the cartoonist must '
             'also overcome some mishaps caused by an assistant. Finally, the '
             'work is done, and everyone can see the resulting animated '
             'picture.',
 'genres': ['Animation', 'Short', 'Comedy'],
 'imdb': {'id': 1737, 'rating': 7.3, 'votes': 1034},
 'languages': ['English'],
 'lastupdated': '2015-08-29 01:09:03.030000000',
 'num_mflix_comments': 0,
 'plot': 'Cartoon figures announce, via comic strip balloons, that they will '
         'move - and move they do, in a wildly exaggerated style.',
 'poster': 'https://m.me

### Exercise 2: Aggregation Pipeline
**Task**: Write an aggregation pipeline to find the top 5 directors by the average IMDb rating of their movies.


In [None]:

# Your task: Write an aggregation pipeline to calculate average IMDb rating by director
pipeline = [
    {"$unwind": "$directors"},  # One document per director instead of grouping on the whole array
    {"$group": {"_id": "$directors", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

# Execute the pipeline and print results
avg_rating_by_director = collection.aggregate(pipeline)
for data in avg_rating_by_director:
    pprint.pprint(data)


{'_id': 'Sara Hirsh Bordo', 'avg_rating': 9.4}
{'_id': 'Kevin Derek', 'avg_rating': 9.3}
{'_id': 'Michael Benson', 'avg_rating': 9.0}
{'_id': 'Slobodan Sijan', 'avg_rating': 8.95}
{'_id': 'Sundar C.', 'avg_rating': 8.9}


### Exercise 3: Create an Index and Measure Performance
**Task**: Create an index on the `imdb.rating` field. Measure performance before and after creating the index.


In [None]:

# Task: Create an index on imdb.rating and query before and after indexing

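from time import time

# Query without index.
# Note: find() only builds a lazy cursor; nothing is sent to the server until
# the cursor is iterated, so both timers below measure cursor creation only.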

start_time = time()
no_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time without index:", time() - start_time)

# Create an index
collection.create_index([("imdb.rating", 1)])

# Query with index
start_time = time()
with_index_result = collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
print("Time with index:", time() - start_time)

# Print the results
for result in with_index_result:
    pprint.pprint(result)


Time without index: 0.0001518726348876953
Time with index: 0.00015044212341308594
{'_id': ObjectId('573a1391f29313caabcd72f0'),
 'awards': {'nominations': 0, 'text': '2 wins.', 'wins': 2},
 'cast': ['Richard Barthelmess',
          'Gladys Hulette',
          'Walter P. Lewis',
          'Ernest Torrence'],
 'countries': ['USA'],
 'directors': ['Henry King'],
 'fullplot': 'When three thuggish men are responsible for the death of his '
             'father and the crippling of his brother, young David must choose '
             'between supporting his family or risking his life and exacting '
             'vengeance.',
 'genres': ['Drama'],
 'imdb': {'id': 12763, 'rating': 8.1, 'votes': 1455},
 'lastupdated': '2015-08-23 01:12:08.943000000',
 'num_mflix_comments': 0,
 'plot': 'When three thuggish men are responsible for the death of his father '
         'and the crippling of his brother, young David must choose between '
         'supporting his family or risking his life and exacting 
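
The two timings above are nearly identical because `find()` returns a lazy cursor: a timer wrapped only around the `find()` call does not include the server round trip. A minimal sketch of a helper that times the full fetch instead. `time_query` is a hypothetical name, and it accepts any zero-argument callable that returns an iterable so that it can be tried without a database:

```python
from time import perf_counter


def time_query(run_query):
    """Run the query, consume every result, and return (results, seconds)."""
    start = perf_counter()
    results = list(run_query())  # iterating the cursor is what triggers the fetch
    return results, perf_counter() - start


# Usage with the notebook's collection (not executed here):
# docs, seconds = time_query(
#     lambda: collection.find({"imdb.rating": {"$gt": 8}}).limit(5)
# )
# print("Elapsed:", seconds)
```

A single wall-clock timing is still noisy. To check whether a query uses the index rather than how long one call took, pymongo's `Cursor.explain()` returns the server's query plan.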

## **Homework**

In [12]:
#Exercise 1: Basic Searching and Filtering

#Write a query to find the first movie that has the genre "Action".
first_action_movie = collection.find_one({"genres": "Action"})
pprint.pprint(first_action_movie)

#Write a query to find all movies released after the year 2000 (Return the first 5 results).
movies_after_2000 = collection.find({"year": {"$gt": 2000}}).limit(5)
for movie in movies_after_2000:
    pprint.pprint(movie)

#Write a query to find all movies where the IMDb rating is greater than 8.5 (Return the
#first 5 results).
high_rated_movies = collection.find({"imdb.rating": {"$gt": 8.5}}).limit(5)
for movie in high_rated_movies:
    pprint.pprint(movie)

#Write a query to find all movies where the genre contains both "Action" and "Adventure".
action_adventure_movies = collection.find({ "genres": { "$all": ["Action", "Adventure"] } }).limit(5)
for movie in action_adventure_movies:
    pprint.pprint(movie)

{'_id': ObjectId('573a1394f29313caabcde741'),
 'awards': {'nominations': 1, 'text': '1 nomination.', 'wins': 0},
 'cast': ['Margaret Lockwood', 'Dane Clark', 'Marius Goring', 'Naunton Wayne'],
 'countries': ['UK'],
 'directors': ['Roy Ward Baker'],
 'fullplot': 'When British Intelligence discovers that an Iron Curtian country '
             'is developing insects as weapons they persuade eminent '
             'entomologist Frances Gray to get into the country to collect '
             'some specimens. On arrival her cover is almost immediately blown '
             'and her contact murdered. The future looks grim for her and also '
             'perhaps for the world.',
 'genres': ['Action', 'Thriller'],
 'imdb': {'id': 42553, 'rating': 5.8, 'votes': 321},
 'languages': ['English'],
 'lastupdated': '2015-09-17 04:50:07.010000000',
 'num_mflix_comments': 0,
 'plot': 'A British lady entomologist travels to a Balkan country to look into '
         'germ warfare trials using various bugs a

In [7]:
#Exercise 2: Sorting Results

#Write a query to find all movies where the genre is "Comedy" and sort them by IMDb
#rating in descending order (Return the first 5 results).
comedies_by_rating = collection.find({"genres": "Comedy"}).sort("imdb.rating", -1).limit(5)
for movie in comedies_by_rating:
    pprint.pprint(movie)

#Write a query to find all movies where the genre is "Drama" and sort them by release year
#in ascending order (Return the first 5 results).
drama_movies_by_year = collection.find({"genres": "Drama"}).sort("year", 1).limit(5)
for movie in drama_movies_by_year:
    pprint.pprint(movie)

{'_id': ObjectId('573a13dcf29313caabdb2dec'),
 'awards': {'nominations': 1, 'text': '1 nomination.', 'wins': 0},
 'cast': ['Jennifer Jason Leigh', 'David Thewlis', 'Tom Noonan'],
 'countries': ['USA'],
 'directors': ['Duke Johnson', 'Charlie Kaufman'],
 'fullplot': "Charlie Kaufman's first stop-motion film about a man crippled by "
             'the mundanity of his life.',
 'genres': ['Animation', 'Comedy', 'Fantasy'],
 'imdb': {'id': 2401878, 'rating': '', 'votes': ''},
 'languages': ['English'],
 'lastupdated': '2015-08-31 00:00:23.967000000',
 'num_mflix_comments': 0,
 'plot': "Charlie Kaufman's first stop-motion film about a man crippled by the "
         'mundanity of his life.',
 'rated': 'R',
 'released': datetime.datetime(2015, 9, 8, 0, 0),
 'runtime': 90,
 'title': 'Anomalisa',
 'tomatoes': {'lastUpdated': datetime.datetime(2015, 7, 26, 18, 15, 38),
              'production': 'Walt Disney Pictures International',
              'viewer': {'numReviews': 12, 'rating': 2.5}},
 '
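
The first result above has an empty string as its `imdb.rating`. MongoDB compares values of different BSON types by type before value, and strings sort above numbers, so in a descending sort such documents come first. A minimal sketch of one way to exclude them, using the `$type` operator with the `"number"` alias. The helper name `numeric_rating_filter` is hypothetical:

```python
def numeric_rating_filter(genre):
    """Build a filter for one genre that keeps only numeric IMDb ratings."""
    return {
        "genres": genre,
        # "number" matches the numeric BSON types (int, long, double, decimal).
        "imdb.rating": {"$type": "number"},
    }


# Usage with the notebook's collection (not executed here):
# collection.find(numeric_rating_filter("Comedy")).sort("imdb.rating", -1).limit(5)
print(numeric_rating_filter("Comedy"))
```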

In [8]:
#Exercise 3: Aggregation Pipeline

#Write an aggregation pipeline that calculates the average IMDb rating for each genre
#(Return the top 5 genres).
pipeline1 = [
    {"$unwind": "$genres"},
    {"$group": {"_id": "$genres", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

avg_rating_by_genre = collection.aggregate(pipeline1)
for genre in avg_rating_by_genre:
    pprint.pprint(genre)

#Write an aggregation pipeline to find the top 5 directors by the average IMDb rating of
#their movies.
pipeline2 = [
    {"$unwind": "$directors"},
    {"$group": {"_id": "$directors", "avg_rating": {"$avg": "$imdb.rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
]

avg_rating_by_director = collection.aggregate(pipeline2)
for data in avg_rating_by_director:
    pprint.pprint(data)

#Write an aggregation pipeline to calculate the total number of movies released in each
#year (Sort the results by the year).
pipeline3 = [
    {"$group": {"_id": "$year", "total_movies": {"$sum": 1}}},
    {"$sort": {"_id": 1}}
]

total_movies_per_year = collection.aggregate(pipeline3)
for data in total_movies_per_year:
    pprint.pprint(data)

{'_id': 'Film-Noir', 'avg_rating': 7.396774193548388}
{'_id': 'Short', 'avg_rating': 7.390625}
{'_id': 'Documentary', 'avg_rating': 7.365130483064964}
{'_id': 'News', 'avg_rating': 7.252272727272728}
{'_id': 'History', 'avg_rating': 7.171942446043165}
{'_id': 'Sara Hirsh Bordo', 'avg_rating': 9.4}
{'_id': 'Kevin Derek', 'avg_rating': 9.3}
{'_id': 'Michael Benson', 'avg_rating': 9.0}
{'_id': 'Slobodan Sijan', 'avg_rating': 8.95}
{'_id': 'Sundar C.', 'avg_rating': 8.9}
{'_id': 1950, 'total_movies': 55}
{'_id': 1951, 'total_movies': 54}
{'_id': 1952, 'total_movies': 45}
{'_id': 1953, 'total_movies': 65}
{'_id': 1954, 'total_movies': 47}
{'_id': 1955, 'total_movies': 67}
{'_id': 1956, 'total_movies': 67}
{'_id': 1957, 'total_movies': 71}
{'_id': 1958, 'total_movies': 75}
{'_id': 1959, 'total_movies': 71}
{'_id': 1960, 'total_movies': 73}
{'_id': 1961, 'total_movies': 68}
{'_id': 1962, 'total_movies': 70}
{'_id': 1963, 'total_movies': 69}
{'_id': 1964, 'total_movies': 86}
{'_id': 1965, 'tot

In [15]:
#Exercise 4: Updating and Deleting Documents

#Write a query to update the IMDb rating of a movie with the title "The Godfather" to 9.5.
collection.update_one({"title": "The Godfather"}, {"$set": {"imdb.rating": 9.5}})

#Write a query to update all movies where the genre is "Horror" and set their IMDb rating
#to 6.0 if it is currently null.
collection.update_many(
    {"genres": "Horror", "imdb.rating": None},
    {"$set": {"imdb.rating": 6.0}}
)

#Write a query to delete all movies that were released before the year 1950.
collection.delete_many({"year": {"$lt": 1950}})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff000000000000011e'), 'opTime': {'ts': Timestamp(1727989907, 7457), 't': 286}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1727989908, 3768), 'signature': {'hash': b'\x8c[<\xb1\x9a6yK\x16H\x8c\xe6FX\x1a\xb6\x10\xf3\xbf`', 'keyId': 7363326094432272385}}, 'operationTime': Timestamp(1727989907, 7457)}, acknowledged=True)
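
The recorded `DeleteResult` reports `'n': 0`, meaning that run deleted nothing. Because `delete_many` is irreversible, it can help to count the matching documents first and to read `deleted_count` afterwards instead of the raw result. A minimal sketch, assuming a pymongo-style collection. `delete_with_check` and its `max_expected` guard are hypothetical names introduced here:

```python
def delete_with_check(collection, query, max_expected):
    """Delete matching documents only if the match count is within a limit."""
    matched = collection.count_documents(query)
    if matched > max_expected:
        raise ValueError(
            f"{matched} documents match; expected at most {max_expected}."
        )
    result = collection.delete_many(query)
    return result.deleted_count


# Usage with the notebook's collection (not executed here):
# deleted = delete_with_check(collection, {"year": {"$lt": 1950}}, max_expected=5000)
# print("Deleted:", deleted)
```

The guard is only a sanity check: documents can still change between the count and the delete, so it does not replace a backup.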

In [11]:
#Exercise 5: Text Search
#Ensure the title field is indexed for text search in MongoDB and write a query to search
#for movies that contain the word "love" in their title.
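# A collection can have only one text index. drop_indexes() removes every
# index except _id, including the ones created earlier in this notebook.
collection.drop_indexes()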
collection.create_index([("title", "text")])
search_results = collection.find({"$text": {"$search": "love"}})
for result in search_results:
    pprint.pprint(result)

#Write a text search query to find movies where the word "war" appears in the title or plot,
#sorted by IMDb rating (Return only the top 5 results).
# Drop the title-only text index so a text index on title and plot can replace it
collection.drop_indexes()
collection.create_index([("title", "text"), ("plot", "text")])
search_results = collection.find({"$text": {"$search": "war"}}).sort("imdb.rating", -1).limit(5)
for result in search_results:
    pprint.pprint(result)

Streaming output truncated to the last 5000 lines.
             'accurate, he tells Claude of his plan to kill Flannagan. '
             "Claude's daughter Ariane overhears the threat and warns Frank of "
             'the coming trouble. She then plays the part of a worldly '
             "socialite with a list of conquests as long as Flannagan's. The "
             "bemused ladies' man returns to America the next day and Ariane, "
             'completely in love, follows his romantic escapades in the news. '
             'She sees him again in Paris the following year, and resumes her '
             'worldly guise, telling tales of former lovers when they meet at '
             'his hotel in the afternoon. Frank, amazed by the mystery girl '
             'and surprised to find himself jealous of her past, hires Claude '
             'to uncover more information about her. When the detective '
             'realizes what has happened, he asks Frank not to break his '
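
For text queries, MongoDB can also rank matches by relevance through the `textScore` metadata, as an alternative to sorting by `imdb.rating`. A minimal sketch that builds the arguments for such a query. `text_search_args` is a hypothetical helper, and the commented usage assumes a text index already exists on the searched fields:

```python
def text_search_args(term):
    """Return (filter, projection, sort) for a relevance-ranked text search."""
    score = {"$meta": "textScore"}
    query = {"$text": {"$search": term}}
    projection = {"title": 1, "score": score}  # keep output short: title and score
    sort = [("score", score)]
    return query, projection, sort


# Usage with the notebook's collection (not executed here):
# query, projection, sort = text_search_args("war")
# for doc in collection.find(query, projection).sort(sort).limit(5):
#     pprint.pprint(doc)
```

Projecting only `title` and the score, together with `limit(5)`, also keeps the printed output far shorter than dumping every full matching document.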
  