## Import Data 

Download the following JSON files:
- https://github.com/neelabalan/mongodb-sample-dataset/raw/main/sample_mflix/movies.json
- https://github.com/neelabalan/mongodb-sample-dataset/raw/main/sample_mflix/comments.json

Import the data from the movies.json file into MongoDB. The file contains movie information, including release year, directors, and reviews.

**mongoimport --db horizondb2 --collection movies --file movies.json**

Import the data from the comments.json file into MongoDB. The file contains comments, each associated with a specific movie.

**mongoimport --db horizondb2 --collection comments --file comments.json**

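If it is more convenient to drive both imports from Python, the two commands above can be assembled as argument lists. This is only a sketch: `mongoimport_command` is a hypothetical helper, not part of any library, and the snippet prints the commands rather than running them.

```python
import shlex

def mongoimport_command(db, collection, path):
    """Build the mongoimport argument list used in this section (hypothetical helper)."""
    return ["mongoimport", "--db", db, "--collection", collection, "--file", path]

commands = [
    mongoimport_command("horizondb2", "movies", "movies.json"),
    mongoimport_command("horizondb2", "comments", "comments.json"),
]

for args in commands:
    # Print each command; to run it, pass args to subprocess.run(args, check=True).
    print(shlex.join(args))
```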
In [None]:
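import pymongo
from pprint import pprint
from pymongo import MongoClient

# Connect to the local MongoDB server on the default port:
myclient = MongoClient('localhost', 27017)
# Select the database and the two collections created by mongoimport:
horizondb = myclient['horizondb2']
movie_collection = horizondb["movies"]
comment_collection = horizondb["comments"]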

## Look at the Data

In [None]:
# Find one movie:
movie = movie_collection.find_one()
pprint(movie)

## Match and Sort

In [None]:
# Match title = "A Star Is Born":
stage_match_title = {
   "$match": {
         "title": "A Star Is Born"
   }
}

# Sort by year, ascending:
stage_sort_year_ascending = {
   "$sort": { "year": pymongo.ASCENDING }
}

# Assemble the named stages into a pipeline:
pipeline = [
   stage_match_title,
   stage_sort_year_ascending,
]

## Execute the Pipeline

In [18]:
results = movie_collection.aggregate(pipeline)
for movie in results:
    print(" * {title}, {first_castmember}, {year}".format(
        title=movie["title"],
        first_castmember=movie["cast"][0],
        year=movie["year"],
    ))

 * A Star Is Born, Janet Gaynor, 1937
 * A Star Is Born, Judy Garland, 1954
 * A Star Is Born, Barbra Streisand, 1976

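The loop above indexes `movie["cast"][0]` directly, which works for these three results but would raise an error for a document with a missing or empty `cast` field. The sketch below shows one defensive way to handle that, assuming such documents may exist in the collection. `first_cast_member` is a hypothetical helper, and the sample documents are hand-written stand-ins rather than real query results.

```python
def first_cast_member(movie, default="(unknown)"):
    """Return the first cast member, or a default when 'cast' is missing or empty."""
    cast = movie.get("cast") or []
    return cast[0] if cast else default

# Hand-written stand-ins for documents returned by aggregate():
sample_movies = [
    {"title": "A Star Is Born", "cast": ["Janet Gaynor"], "year": 1937},
    {"title": "Hypothetical Movie Without Cast", "year": 2000},
]

for movie in sample_movies:
    print(" * {title}, {first_castmember}, {year}".format(
        title=movie["title"],
        first_castmember=first_cast_member(movie),
        year=movie["year"],
    ))
```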

## Limit the Number of Results

In [19]:
# Sort by year, descending:
stage_sort_year_descending = {
   "$sort": { "year": pymongo.DESCENDING }
}

# Limit to 1 document:
stage_limit_1 = { "$limit": 1 }

pipeline = [
   stage_match_title,
   stage_sort_year_descending,
   stage_limit_1,
]
results = movie_collection.aggregate(pipeline)
for movie in results:
    print(" * {title}, {first_castmember}, {year}".format(
        title=movie["title"],
        first_castmember=movie["cast"][0],
        year=movie["year"],
    ))

 * A Star Is Born, Barbra Streisand, 1976

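To make the effect of each stage concrete, the same match, sort, and limit steps can be imitated in plain Python on a small in-memory list. This is an illustration of the semantics only, not how MongoDB executes the pipeline. The three "A Star Is Born" entries are based on the output printed earlier, and the fourth entry is a made-up document that the match step should discard.

```python
# Stand-in documents; "Some Other Movie" is hypothetical.
movies = [
    {"title": "A Star Is Born", "cast": ["Judy Garland"], "year": 1954},
    {"title": "A Star Is Born", "cast": ["Barbra Streisand"], "year": 1976},
    {"title": "Some Other Movie", "cast": ["Hypothetical Actor"], "year": 1999},
    {"title": "A Star Is Born", "cast": ["Janet Gaynor"], "year": 1937},
]

# $match: keep only documents with the requested title.
matched = [m for m in movies if m["title"] == "A Star Is Born"]
# $sort with pymongo.DESCENDING: newest year first.
ordered = sorted(matched, key=lambda m: m["year"], reverse=True)
# $limit 1: keep only the first document.
limited = ordered[:1]

for movie in limited:
    print(" * {}, {}, {}".format(movie["title"], movie["cast"][0], movie["year"]))
```

The order of the steps matters: applying the limit before the sort would keep whichever document happened to come first rather than the most recent one.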

## Look Up Related Data in Other Collections

In [20]:
# Find one comment:
comment = comment_collection.find_one()
pprint(comment)

{'_id': ObjectId('5a9427648b0beebeb69579cc'),
 'date': datetime.datetime(2012, 3, 26, 23, 20, 16),
 'email': 'andrea_le@fakegmail.com',
 'movie_id': ObjectId('573a1390f29313caabcd418c'),
 'name': 'Andrea Le',
 'text': 'Rem officiis eaque repellendus amet eos doloribus. Porro dolor '
         'voluptatum voluptates neque culpa molestias. Voluptate unde nulla '
         'temporibus ullam.'}


The stage named **stage_lookup_comments** is a **\$lookup** stage. For each movie, it looks up the documents in the comments collection whose **movie_id** equals the movie's **_id** and stores them as an array in a new field named **related_comments**. A movie with no matching comments gets an empty array.

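The join described above can be imitated in plain Python to show what ends up in the new field. This is a rough sketch of the semantics, not of MongoDB's implementation. The documents are made up, and plain integers stand in for ObjectId values.

```python
from collections import defaultdict

# Stand-in documents; integers replace ObjectId values for brevity.
movies = [
    {"_id": 1, "title": "Movie With Comments"},
    {"_id": 2, "title": "Movie Without Comments"},
]
comments = [
    {"_id": 10, "movie_id": 1, "name": "Commenter A", "text": "First comment"},
    {"_id": 11, "movie_id": 1, "name": "Commenter B", "text": "Second comment"},
    {"_id": 12, "movie_id": 99, "name": "Commenter C", "text": "Orphaned comment"},
]

# Index the "from" collection by its foreignField ("movie_id").
comments_by_movie = defaultdict(list)
for comment in comments:
    comments_by_movie[comment["movie_id"]].append(comment)

# For each movie, attach the comments whose movie_id equals its localField ("_id").
joined = [
    {**movie, "related_comments": comments_by_movie.get(movie["_id"], [])}
    for movie in movies
]

for movie in joined:
    print(movie["title"], len(movie["related_comments"]))
```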
In [21]:
# Look up related documents in the 'comments' collection:
stage_lookup_comments = {
   "$lookup": {
         "from": "comments",
         "localField": "_id",
         "foreignField": "movie_id",
         "as": "related_comments",
   }
}

# Limit to the first 5 documents:
stage_limit_5 = { "$limit": 5 }

pipeline = [
   stage_lookup_comments,
   stage_limit_5,
]

results = movie_collection.aggregate(pipeline)
for movie in results:
    pprint(movie)

{'_id': ObjectId('573a1390f29313caabcd4135'),
 'awards': {'nominations': 0, 'text': '1 win.', 'wins': 1},
 'cast': ['Charles Kayser', 'John Ott'],
 'countries': ['USA'],
 'directors': ['William K.L. Dickson'],
 'fullplot': 'A stationary camera looks at a large anvil with a blacksmith '
             'behind it and one on either side. The smith in the middle draws '
             'a heated metal rod from the fire, places it on the anvil, and '
             'all three begin a rhythmic hammering. After several blows, the '
             'metal goes back in the fire. One smith pulls out a bottle of '
             'beer, and they each take a swig. Then, out comes the glowing '
             'metal and the hammering resumes.',
 'genres': ['Short'],
 'imdb': {'id': 5, 'rating': 6.2, 'votes': 1189},
 'lastupdated': '2015-08-26 00:03:50.133000000',
 'num_mflix_comments': 1,
 'plot': 'Three men hammer on an anvil and pass a bottle of beer around.',
 'rated': 'UNRATED',
 'related_comments': [],
 'rel

## Add Fields

In [22]:
# Calculate the number of comments for each movie:
stage_add_comment_count = {
   "$addFields": {
         "comment_count": {
            "$size": "$related_comments"
         }
   }
}

# Match movie documents with more than 2 comments:
stage_match_with_comments = {
   "$match": {
         "comment_count": {
            "$gt": 2
         }
   }
}

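As a rough plain-Python picture of the two stages defined above, the sketch below adds a count field and then filters on it. The documents are made-up stand-ins for what the lookup stage might produce. Note that the comparison is strict, so a movie with exactly two comments is excluded.

```python
# Stand-in documents as they might look after the $lookup stage:
movies = [
    {"title": "Movie A", "related_comments": [{"text": "c1"}, {"text": "c2"}, {"text": "c3"}]},
    {"title": "Movie B", "related_comments": [{"text": "c1"}, {"text": "c2"}]},
    {"title": "Movie C", "related_comments": []},
]

# $addFields with $size: add comment_count while keeping all existing fields.
with_counts = [{**m, "comment_count": len(m["related_comments"])} for m in movies]
# $match with $gt: keep documents whose comment_count is strictly greater than 2.
popular = [m for m in with_counts if m["comment_count"] > 2]

for movie in popular:
    print(movie["title"], movie["comment_count"])
```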
In [24]:
pipeline = [
   stage_lookup_comments,
   stage_add_comment_count,
   stage_match_with_comments,
   stage_limit_5,
]

In [None]:
results = movie_collection.aggregate(pipeline)
for movie in results:
    print(movie["title"])
    print("Comment count:", movie["comment_count"])
    
    # Loop through the first 5 comments and print the name and text:
    for comment in movie["related_comments"][:5]:
        print(" * {name}: {text}".format(
            name=comment["name"],
            text=comment["text"]))

## Grouping Documents

In [13]:
stage_group_year = {
   "$group": {
         "_id": "$year",
         # Count the number of movies in the group:
         "movie_count": { "$sum": 1 },
         # Collect the titles of the movies in the group:
         "movie_titles": { "$push": "$title" }
   }
}

pipeline = [
   stage_group_year,
]
results = movie_collection.aggregate(pipeline)

# Loop through the 'year-summary' documents:
for year_summary in results:
    pprint(year_summary)

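The grouping stage above can be pictured as building one summary document per distinct year. The following plain-Python sketch mirrors the two accumulators on a few made-up documents. It only illustrates the semantics. MongoDB itself does not guarantee any particular order for the groups it returns.

```python
# Made-up stand-in documents:
movies = [
    {"title": "Movie A", "year": 1937},
    {"title": "Movie B", "year": 1954},
    {"title": "Movie C", "year": 1954},
]

groups = {}
for movie in movies:
    # "_id": "$year" means the group key is the value of the year field.
    summary = groups.setdefault(
        movie["year"], {"_id": movie["year"], "movie_count": 0, "movie_titles": []}
    )
    summary["movie_count"] += 1                     # { "$sum": 1 }
    summary["movie_titles"].append(movie["title"])  # { "$push": "$title" }

year_summaries = list(groups.values())
for year_summary in year_summaries:
    print(year_summary)
```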
In [14]:
# Keep only documents whose 'year' field holds a numeric value:
stage_match_years = {
   "$match": {
         "year": {
            "$type": "number",
         }
   }
}

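# After $group, the year is stored in "_id", so sort on "_id".
# This redefines the stage_sort_year_ascending used earlier:
stage_sort_year_ascending = {
   "$sort": { "_id": pymongo.ASCENDING }
}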

pipeline = [
   stage_match_years,         # Match numeric years
   stage_group_year,
   stage_sort_year_ascending, # Sort by year
]
results = movie_collection.aggregate(pipeline)

# Loop through the 'year-summary' documents:
for year_summary in results:
    pprint(year_summary)
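
The numeric-year filter presumably exists because some documents store `year` as something other than a number. Under that assumption, the sketch below imitates the filter, group, and sort steps in plain Python. The string-valued year is a hypothetical example, and `is_number` is only a rough analogue of the type check, not MongoDB's exact rule.

```python
from numbers import Number

# Stand-in documents; the string-valued year is a hypothetical example.
movies = [
    {"title": "Movie B", "year": 1954},
    {"title": "Series X", "year": "2006-2012"},
    {"title": "Movie A", "year": 1937},
    {"title": "Movie C", "year": 1954},
]

def is_number(value):
    """Rough analogue of {"$type": "number"}: numeric, but not a boolean."""
    return isinstance(value, Number) and not isinstance(value, bool)

# $match: drop documents whose year is not numeric.
numeric = [m for m in movies if is_number(m.get("year"))]

# $group by year, counting the movies in each group.
counts = {}
for m in numeric:
    counts[m["year"]] = counts.get(m["year"], 0) + 1

# $sort on "_id" ascending.
year_summaries = sorted(
    ({"_id": year, "movie_count": n} for year, n in counts.items()),
    key=lambda s: s["_id"],
)

for year_summary in year_summaries:
    print(year_summary)
```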