Aggregation framework

```json
{
     "_id" : ObjectId("5b107bec1d2952d0da904dd7"),
     "title" : "Titan A.E.",
     "year" : 2000,
     "rated" : "PG",
     "runtime" : 94,
     "countries" : [
             "USA"
     ],
     "genres" : [
             "Animation",
             "Action",
             "Adventure"
     ],
     "director" : "Don Bluth, Gary Goldman, Art Vitello",
     "writers" : [
             "Hans Bauer",
             "Randall McCormick",
             "Ben Edlund",
             "John August",
             "Joss Whedon"
     ],
     "actors" : [
             "Matt Damon",
             "Bill Pullman",
             "John Leguizamo",
             "Nathan Lane"
     ],
     "plot" : "A young man learns that he has to find a hidden Earth ship before an enemy alien species does in order to secure the survival of humanity.",
     "poster" : "http://ia.media-imdb.com/images/M/MV5BMjE0NTU0ODg4NV5BMl5BanBnXkFtZTcwNzY3MTQyMQ@@._V1_SX300.jpg",
     "imdb" : {
             "id" : "tt0120913",
             "rating" : 6.6,
             "votes" : 50875
     },
     "tomato" : {
             "meter" : 52,
             "image" : "rotten",
             "rating" : 5.7,
             "reviews" : 99,
             "fresh" : 51,
             "consensus" : "Great visuals, but the story feels like a cut-and-paste job of other sci-fi movies.",
             "userMeter" : 60,
             "userRating" : 3.2,
             "userReviews" : 69055
     },
     "metacritic" : 48,
     "awards" : {
             "wins" : 1,
             "nominations" : 7,
             "text" : "1 win & 7 nominations."
     },
     "type" : "movie"
}
```

In [1]:
# Install the driver used to connect to the database
!pip install pymongo


In [2]:
from pymongo import MongoClient
from pprint import pprint as pp
client = MongoClient('mongodb://localhost:27017')

In [3]:
db = client.datascience

Aggregation operators
----------------------

* \$project -> reshaping


```json
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "imdb" : {
             "id" : "tt0120913",
             "rating" : 6.6,
             "votes" : 50875
     }
}
```

```json
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "rating" : 6.6
}
```
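
The two documents above read as the input and output of a `$project` stage that keeps `title` and `year` and lifts `imdb.rating` to a top-level `rating` field. Below is a minimal sketch of such a stage as a Python dict, next to a pure-Python helper that mimics its effect on one document without a running server. The helper name `project_rating` is hypothetical and used only for illustration.

```python
# The stage as it could be passed to db.movies.aggregate([...]).
# "_id": 0 suppresses the _id field, which $project keeps by default.
project_stage = {
    "$project": {"_id": 0, "title": 1, "year": 1, "rating": "$imdb.rating"}
}


def project_rating(doc):
    """Pure-Python illustration of the stage above (not MongoDB code)."""
    return {
        "title": doc["title"],
        "year": doc["year"],
        "rating": doc["imdb"]["rating"],  # nested field lifted to top level
    }


movie = {
    "title": "Titan A.E.",
    "year": 2000,
    "imdb": {"id": "tt0120913", "rating": 6.6, "votes": 50875},
}
reshaped = project_rating(movie)
print(reshaped)
```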

* \$match -> filtering
* \$group
* \$sort
* \$skip
* \$limit
* \$unwind

```json
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "actors" : [
     "Matt Damon",
     "Bill Pullman",
     "John Leguizamo"
 ],
}
```
\$unwind on `actors` results in:
```json
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "actors" : "Matt Damon"
},
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "actors" : "Bill Pullman"
},
{
 "title" : "Titan A.E.",
 "year" : 2000,
 "actors" : "John Leguizamo"
}
```
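
The same transformation can be sketched in plain Python to make the semantics concrete. This is an illustration only, assuming the field holds a list. The helper name `unwind` is made up for this sketch and is not part of pymongo.

```python
def unwind(doc, field):
    """Yield one copy of doc per element of doc[field] (illustration only)."""
    for value in doc.get(field, []):
        copy = dict(doc)     # shallow copy keeps the other fields
        copy[field] = value  # the array is replaced by a single element
        yield copy


movie = {
    "title": "Titan A.E.",
    "year": 2000,
    "actors": ["Matt Damon", "Bill Pullman", "John Leguizamo"],
}
unwound = list(unwind(movie, "actors"))
for doc in unwound:
    print(doc)
```

As with the default behaviour of `$unwind`, a document whose field is missing or is an empty array produces no output documents in this sketch.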


Which movie has the highest win-to-nomination ratio?
----------------------------------------------------

In [5]:
pp(db.movies.find().next())

{'_id': ObjectId('5692a13b24de1e0ce2dfcea8'),
 'actors': ['Rose McGowan',
            'Freddy Rodríguez',
            'Josh Brolin',
            'Marley Shelton'],
 'awards': {'nominations': 2, 'text': '2 nominations.', 'wins': 0},
 'countries': ['USA'],
 'director': 'Robert Rodriguez',
 'genres': ['Action', 'Comedy', 'Horror'],
 'imdb': {'id': 'tt1077258', 'rating': 7.2, 'votes': 160542},
 'plot': 'After an experimental bio-weapon is released, turning thousands into '
         "zombie-like creatures, it's up to a rag-tag group of survivors to "
         'stop the infected and those behind its release.',
 'poster': 'http://ia.media-imdb.com/images/M/MV5BMTI0NDQ5MTM2MV5BMl5BanBnXkFtZTcwOTIwMjk2MQ@@._V1_SX300.jpg',
 'rated': 'NOT RATED',
 'released': datetime.datetime(2007, 6, 21, 4, 0),
 'runtime': 105,
 'title': 'Planet Terror',
 'type': 'movie',
 'writers': ['Robert Rodriguez'],
 'year': 2007}


In [4]:
pp(list(db.movies.aggregate([
    {"$match": {"awards": {"$exists": True}}},
    {"$match": {"awards.wins": {"$gt": 0}, "awards.nominations": {"$gt": 0}}},
    {
        "$project": {
            "title": 1,
            "awards": 1,
            'rating': '$imdb.rating',
            "ratio": {
                "$divide": ['$awards.wins', '$awards.nominations']
            }
        }
    },
    {"$sort": {"ratio": -1}},
    {"$limit": 2}
])))

[{'_id': ObjectId('5692a53024de1e0ce2dfdca5'),
  'awards': {'nominations': 1, 'text': '22 wins & 1 nomination.', 'wins': 22},
  'rating': 7.6,
  'ratio': 22.0,
  'title': 'Au bout du monde'},
 {'_id': ObjectId('5692a47c24de1e0ce2dfdb63'),
  'awards': {'nominations': 1, 'text': '22 wins & 1 nomination.', 'wins': 22},
  'rating': 8.1,
  'ratio': 22.0,
  'title': 'Everything Will Be Ok'}]
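
The `$gt: 0` condition on `awards.nominations` also keeps `$divide` away from a zero denominator. The sketch below replays the match, project, sort, and limit steps in plain Python on three documents whose award counts appear elsewhere on this page. It is an illustration of the logic, not of how the server executes the pipeline.

```python
movies = [
    {"title": "Titan A.E.", "awards": {"wins": 1, "nominations": 7}},
    {"title": "Planet Terror", "awards": {"wins": 0, "nominations": 2}},
    {"title": "Au bout du monde", "awards": {"wins": 22, "nominations": 1}},
]

# $match: keep movies with at least one win and one nomination
matched = [
    m for m in movies
    if m["awards"]["wins"] > 0 and m["awards"]["nominations"] > 0
]

# $project: wins / nominations (safe, because nominations > 0 here)
projected = [
    {"title": m["title"],
     "ratio": m["awards"]["wins"] / m["awards"]["nominations"]}
    for m in matched
]

# $sort by ratio descending, then $limit 2
top = sorted(projected, key=lambda m: m["ratio"], reverse=True)[:2]
print(top)
```

A ratio above 1, such as the 22.0 in the output above, means the data records more wins than nominations for that movie.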


Which actor has appeared in the most movies?
--------------------------------------------


In [7]:
#single movie
pp(list(db.movies.aggregate([
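    {"$match": {"actors": {"$exists": True}}},
    {"$unwind": "$actors"},
    {"$project": {"actor": "$actors"}},
    # "rating" is not kept by the $project above, so this sort key is
    # missing on every document and does not order the results
    {"$sort": {"rating": -1}},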
    {"$limit": 5}
])))

[{'_id': ObjectId('5692a13b24de1e0ce2dfcea8'), 'actor': 'Marley Shelton'},
 {'_id': ObjectId('5692a13b24de1e0ce2dfcea9'), 'actor': 'Charlton Heston'},
 {'_id': ObjectId('5692a13b24de1e0ce2dfcea8'), 'actor': 'Freddy Rodríguez'},
 {'_id': ObjectId('5692a13b24de1e0ce2dfcea8'), 'actor': 'Rose McGowan'},
 {'_id': ObjectId('5692a13b24de1e0ce2dfcea8'), 'actor': 'Josh Brolin'}]


In [9]:
pp(list(db.movies.aggregate([
    {"$match": {"actors": {"$exists": True}}},
    {"$unwind": "$actors"},
    {"$project": {"actor": "$actors"}},
    {"$group": {"_id": "$actor", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5}
])))

[{'_id': 'Tom Hanks', 'count': 8},
 {'_id': 'Natalie Portman', 'count': 8},
 {'_id': 'Louis C.K.', 'count': 8},
 {'_id': 'Scarlett Johansson', 'count': 7},
 {'_id': 'B.B. King', 'count': 7}]


Woo hoo, Tom Hanks, Natalie Portman, and Louis C.K. share the lead with 8 movies each.
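
The unwind-then-group counting pattern is equivalent to tallying one per (movie, actor) pair. Here is a minimal pure-Python sketch of that idea using `collections.Counter`. The sample documents are hypothetical and not taken from the `movies` collection.

```python
from collections import Counter

# Hypothetical sample data, for illustration only
movies = [
    {"title": "A", "actors": ["Ann", "Bob"]},
    {"title": "B", "actors": ["Ann", "Cid"]},
    {"title": "C", "actors": ["Ann", "Bob"]},
    {"title": "D"},  # no actors field: contributes nothing
]

# $unwind + $group with {"$sum": 1}: count one per (movie, actor) pair
counts = Counter(actor for m in movies for actor in m.get("actors", []))

# $sort by count descending + $limit 2
top_two = counts.most_common(2)
print(top_two)
```

In the pipeline itself, the `$project` stage is a convenience: grouping on `"$actors"` directly after `$unwind` should give the same counts.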

Which actor has the highest average movie rating?
-------------------------------------------------

In [10]:
pp(list(db.movies.aggregate([
    {"$match": {"actors": {"$exists": True}}},
    {"$unwind": "$actors"},
    {"$project": {"actor": "$actors", "rating": "$imdb.rating"}},
    {"$group": {"_id": "$actor", "avg_rating": {"$avg": "$rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
])))

[{'_id': 'Nikita Devine', 'avg_rating': 9.6},
 {'_id': 'Tony DeSergio', 'avg_rating': 9.6},
 {'_id': 'Nichola Holt', 'avg_rating': 9.6},
 {'_id': 'Michelle Banks', 'avg_rating': 9.6},
 {'_id': 'Milan Baros', 'avg_rating': 9.5}]


Hmm, did each of them appear in just one movie?!

In [11]:
pp(list(db.movies.aggregate([
    {"$match": {"actors": {"$exists": True}}},
    {"$unwind": "$actors"},
    {"$project": {"actor": "$actors", "rating": "$imdb.rating"}},
    {"$group": {"_id": "$actor", "avg_rating": {"$avg": "$rating"}, "count": {"$sum": 1}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
])))

[{'_id': 'Nikita Devine', 'avg_rating': 9.6, 'count': 1},
 {'_id': 'Tony DeSergio', 'avg_rating': 9.6, 'count': 1},
 {'_id': 'Nichola Holt', 'avg_rating': 9.6, 'count': 1},
 {'_id': 'Michelle Banks', 'avg_rating': 9.6, 'count': 1},
 {'_id': 'Milan Baros', 'avg_rating': 9.5, 'count': 1}]


Indeed, the average trap. Let's keep only actors with more than 2 movies.

In [18]:
pp(list(db.movies.aggregate([
    {"$match": {"actors": {"$exists": True}}},
    {"$unwind": "$actors"},
    {"$project": {"actor": "$actors", "rating": "$imdb.rating"}},
    {"$group": {
        "_id": "$actor",
        "avg_rating": {
            "$avg": "$rating"
        },
        "count": {
            "$sum": 1
        },
        "max_score": {
            "$max": "$rating"
        }
    }},
    {"$match": {"count": {"$gt": 2}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5}
])))

[{'_id': 'Mark Hamill',
  'avg_rating': 8.633333333333333,
  'count': 3,
  'max_score': 8.8},
 {'_id': 'Carrie Fisher',
  'avg_rating': 8.633333333333333,
  'count': 3,
  'max_score': 8.8},
 {'_id': 'Harrison Ford', 'avg_rating': 8.6, 'count': 4, 'max_score': 8.8},
 {'_id': 'Brian Johnson', 'avg_rating': 8.5, 'count': 6, 'max_score': 8.6},
 {'_id': 'AC/DC', 'avg_rating': 8.5, 'count': 3, 'max_score': 8.6}]


Hmm, Mark Hamill and Carrie Fisher lead, with Harrison Ford right behind them!

![](../images/ford.jpg)
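
To see why the count filter matters, here is a small pure-Python sketch of the same group, match, and sort steps. The actor names and ratings are invented for illustration. It only mirrors the logic of the pipeline and does not talk to MongoDB.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (actor, rating) pairs, as they would look after $unwind
pairs = [
    ("One Hit", 9.6),
    ("Steady", 8.0), ("Steady", 8.5), ("Steady", 9.0),
    ("Busy", 7.0), ("Busy", 8.0), ("Busy", 7.5), ("Busy", 7.5),
]

ratings = defaultdict(list)
for actor, rating in pairs:
    ratings[actor].append(rating)  # $group by actor

stats = [
    {"_id": a, "avg_rating": mean(r), "count": len(r), "max_score": max(r)}
    for a, r in ratings.items()
]

# Without a count filter, the single-movie actor wins
naive_top = max(stats, key=lambda s: s["avg_rating"])["_id"]

# $match {"count": {"$gt": 2}}, then $sort by avg_rating descending
filtered = sorted(
    (s for s in stats if s["count"] > 2),
    key=lambda s: s["avg_rating"],
    reverse=True,
)
print(naive_top, [s["_id"] for s in filtered])
```

Note that `$gt: 2` keeps actors with three or more movies; `$gte: 2` would be the filter for "at least 2".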

Grouping operators
------------------
* \$first
* \$last
* \$max
* \$min
* \$avg
* \$sum
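
As a rough guide to what these accumulators compute inside a `$group` stage, the sketch below applies plain-Python equivalents to the ratings of a single group. The values are hypothetical and are assumed to arrive in a defined order, for example after a `$sort`. `$first` and `$last` are only meaningful when that order is defined.

```python
# Hypothetical ratings for one group, in the order the documents
# reach the $group stage
ratings = [7.2, 8.8, 6.6, 8.1]

accumulators = {
    "$first": ratings[0],   # first value seen in the group
    "$last": ratings[-1],   # last value seen in the group
    "$max": max(ratings),
    "$min": min(ratings),
    "$avg": sum(ratings) / len(ratings),
}
print(accumulators)
```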