# Quiz 1

In [5]:
"""
The tweets in our twitter collection have a field called "source". This field describes the application
that was used to create the tweet. Following the examples for using the $group operator, your task is 
to modify the 'make_pipeline' function to identify the most used applications for creating tweets.
As a check on your query, 'web' is listed as the most frequently used application.
'Ubertwitter' is the second most used. The tweet count for each application should be stored in a field named 'count'
(see the assertion at the end of the script).

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline
that can be passed to the MongoDB aggregate function. As in our examples in this lesson, the aggregation 
pipeline should be a list of one or more dictionary objects. 
Please review the lesson examples if you are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. 
If you want to run this code locally on your machine, you have to install MongoDB, 
download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the twitter dataset 
used in examples in this lesson. 
If you attempt some of the same queries that we looked at in the lesson examples,
your results will be different.
"""


def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

### Previous step: insert the data into the database

In [25]:
import json
import sys
from pymongo import MongoClient

In [36]:
with open('data/twitter.json', 'r') as f:
    num_lines = sum(1 for line in f)
print('Reading {} lines.'.format(num_lines))

with open('data/twitter.json', 'r') as f:
    data = list()
    # One JSON document per line; enumerate from 1 so count matches the line number.
    for count, line in enumerate(f, 1):
        data.append(json.loads(line))
        if (count % 100 == 0) or count == num_lines:
            sys.stdout.write('Read line {} of {}\r'.format(count, num_lines))

client = MongoClient("mongodb://localhost:27017")
db = client.twitter
db.tweets.insert_many(data)

Reading 51428 lines.
Read line 51428 of 51428

<pymongo.results.InsertManyResult at 0x13a10e128>

In [37]:
next(db.tweets.find())

{u'_id': ObjectId('5ca75a57873d8102a8126532'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Thu Sep 02 18:11:23 +0000 2010',
 u'entities': {u'hashtags': [], u'urls': [], u'user_mentions': []},
 u'favorited': False,
 u'geo': None,
 u'id': 22819396900L,
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_user_id': None,
 u'place': None,
 u'retweet_count': None,
 u'retweeted': False,
 u'source': u'web',
 u'text': u'eu preciso de terminar de fazer a minha tabela, est\xe1 muito foda **',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Fri Jul 03 21:44:05 +0000 2009',
  u'description': u's\xf3 os loucos sabem (:',
  u'favourites_count': 1,
  u'follow_request_sent': None,
  u'followers_count': 102,
  u'following': None,
  u'friends_count': 73,
  u'geo_enabled': False,
  u'id': 53507833,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'',
  u'name': u'Beatriz Helena Cunha',
  u'notifications': None,


In [38]:
client.close()

### Now the exercises

In [39]:
def make_pipeline():
    pipeline = [{"$group": {"_id": "$source", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}}
               ]
    return pipeline

In [42]:
def tweet_sources(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

db = get_db('twitter')
pipeline = make_pipeline()
result = tweet_sources(db, pipeline)
import pprint
pprint.pprint(result[0])
#assert result[0] == {u'count': 868, u'_id': u'web'}

{u'_id': u'web', u'count': 23136}


In [44]:
result[:10]

[{u'_id': u'web', u'count': 23136},
 {u'_id': u'<a href="http://www.ubertwitter.com/bb/download.php" rel="nofollow">\xdcberTwitter</a>',
  u'count': 3393},
 {u'_id': u'<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>',
  u'count': 3370},
 {u'_id': u'<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry\xae</a>',
  u'count': 2249},
 {u'_id': u'<a href="http://twitter.com/" rel="nofollow">Twitter for iPhone</a>',
  u'count': 2009},
 {u'_id': u'<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>',
  u'count': 1774},
 {u'_id': u'<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>',
  u'count': 1652},
 {u'_id': u'<a href="http://mobile.twitter.com" rel="nofollow">mobile web</a>',
  u'count': 1374},
 {u'_id': u'<a href="/devices" rel="nofollow">txt</a>', u'count': 1085},
 {u'_id': u'<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>',
  u'count': 706}]
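
The `$group` / `$sum` / `$sort` stages in `make_pipeline` amount to a frequency count followed by a descending sort. As a rough sketch of the same logic in plain Python (no MongoDB needed; the helper name `count_by_source` and the three-document sample are made up for illustration):

```python
from collections import Counter


def count_by_source(tweets):
    # Emulates grouping on "$source" with {"$sum": 1}, then sorting by count descending.
    counts = Counter(tweet.get("source") for tweet in tweets)
    # most_common() yields (value, count) pairs, highest count first.
    return [{"_id": source, "count": n} for source, n in counts.most_common()]


# Hypothetical sample, not taken from the dataset.
sample = [{"source": "web"}, {"source": "TweetDeck"}, {"source": "web"}]
print(count_by_source(sample))
```

Documents without a `source` field end up in a group whose `_id` is `None`, which mirrors how `$group` collects missing values under `null`.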

### Operators
[link to docs](https://docs.mongodb.com/manual/reference/operator/aggregation/)

 - $group
 
 - $sum
 
 - $project
 
 - $match
 
 - $sort
 
 - $skip
 
 - $limit
 
 - $unwind

$project can be used to:
 - Include fields from the original document
 - Insert computed fields
 - Rename fields
 - Create fields that hold subdocuments
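
A possible `$project` stage that touches each of the four uses listed here, together with a plain-Python equivalent for one document. This is a sketch only: the output field names (`screen_name`, `followers_per_friend`, `counts`), the helper `project_by_hand` and the sample tweet are assumptions made for illustration.

```python
# A $project stage exercising each use; output field names are illustrative.
project_stage = {"$project": {
    "text": 1,                                    # include an original field
    "screen_name": "$user.screen_name",           # rename / lift a nested field
    "followers_per_friend": {                     # computed field
        "$divide": ["$user.followers_count", "$user.friends_count"]},
    "counts": {                                   # field holding a subdocument
        "followers": "$user.followers_count",
        "friends": "$user.friends_count"},
}}


def project_by_hand(tweet):
    # Plain-Python equivalent of project_stage for a single tweet dict.
    user = tweet["user"]
    return {
        "text": tweet["text"],
        "screen_name": user["screen_name"],
        "followers_per_friend": float(user["followers_count"]) / user["friends_count"],
        "counts": {"followers": user["followers_count"],
                   "friends": user["friends_count"]},
    }


# Hypothetical tweet; only the fields used by the stage are present.
tweet = {"text": "volto jah",
         "user": {"screen_name": "example_user",
                  "followers_count": 102, "friends_count": 73}}
print(project_by_hand(tweet))
```

In MongoDB the projected document also keeps `_id` unless it is excluded explicitly, and `$divide` raises an error when the divisor is zero, so a real pipeline would need to filter out users with no friends first.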

# Quiz 2

In [88]:
"""
Write an aggregation query to answer this question:

Of the users in the "Brasilia" timezone who have tweeted 100 times or more,
who has the largest number of followers?

The following hints will help you solve this problem:
- Time zone is found in the "time_zone" field of the user object in each tweet.
- The number of tweets for each user is found in the "statuses_count" field.
  To access these fields you will need to use dot notation (from Lesson 4)
- Your aggregation query should return something like the following:
{u'ok': 1.0,
 u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
                  u'followers': 2597,
                  u'screen_name': u'marbles',
                  u'tweets': 12334}]}
Note that you will need to create the fields 'followers', 'screen_name' and 'tweets'.

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation 
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects. 
Please review the lesson examples if you are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the twitter dataset used 
in examples in this lesson. If you attempt some of the same queries that we looked at in the lesson 
examples, your results will be different.
"""

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

In [89]:
def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match": {"user.time_zone": {"$eq": u"Brasilia"},
                    "user.statuses_count": {"$gte": 100}
                   }},
        {"$project": {"followers": "$user.followers_count", 
                      "screen_name": "$user.screen_name",
                      "tweets": "$user.statuses_count"
                     }},
        {"$sort": {"followers": -1}},
        {"$limit": 1}
    ]
    return pipeline

In [93]:
# Playing around cell 
import pprint
def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

    
db = get_db('twitter')
result = aggregate(db, [
    {"$match": {"user.time_zone": {"$eq": u"Brasilia"},
                "user.statuses_count": {"$gte": 100}
               }},
    {"$project": {"followers": "$user.followers_count", 
                  "screen_name": "$user.screen_name",
                  "tweets": "$user.statuses_count"
                 }},
    {"$sort": {"followers": -1}},
    {"$limit": 1}
])

pprint.pprint(result)

[{u'_id': ObjectId('5ca75a57873d8102a812926c'),
  u'followers': 259760,
  u'screen_name': u'otaviomesquita',
  u'tweets': 10997}]


In [92]:
def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

    
db = get_db('twitter')
pipeline = make_pipeline()
result = aggregate(db, pipeline)
import pprint
pprint.pprint(result)
assert len(result) == 1
# assert result[0]["followers"] == 17209   # The classroom dataset is different

[{u'_id': ObjectId('5ca75a57873d8102a812926c'),
  u'followers': 259760,
  u'screen_name': u'otaviomesquita',
  u'tweets': 10997}]
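
To make the dot notation and the stage order concrete, here is a small plain-Python emulation of the `$match` / `$project` / `$sort` / `$limit` sequence. It is only a sketch: `get_path`, `top_brasilia_user` and the four sample users are hypothetical, and the emulation leaves out the `_id` field that MongoDB would carry through.

```python
def get_path(doc, dotted):
    # Follow a dotted path such as "user.time_zone"; None if any key is missing.
    for key in dotted.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def top_brasilia_user(tweets):
    # $match: both conditions must hold for a tweet to pass.
    matched = [t for t in tweets
               if get_path(t, "user.time_zone") == "Brasilia"
               and (get_path(t, "user.statuses_count") or 0) >= 100]
    # $project: build the three output fields from the nested user object.
    projected = [{"followers": get_path(t, "user.followers_count"),
                  "screen_name": get_path(t, "user.screen_name"),
                  "tweets": get_path(t, "user.statuses_count")} for t in matched]
    # $sort descending on followers, then $limit 1.
    projected.sort(key=lambda d: d["followers"] or 0, reverse=True)
    return projected[:1]


# Hypothetical sample users, not taken from the dataset.
tweets = [
    {"user": {"screen_name": "a", "time_zone": "Brasilia",
              "statuses_count": 150, "followers_count": 10}},
    {"user": {"screen_name": "b", "time_zone": "Brasilia",
              "statuses_count": 5000, "followers_count": 900}},
    {"user": {"screen_name": "c", "time_zone": "Brasilia",
              "statuses_count": 20, "followers_count": 99999}},
    {"user": {"screen_name": "d", "time_zone": "London",
              "statuses_count": 800, "followers_count": 5000}},
]
print(top_brasilia_user(tweets))
```

User `c` has the most followers but is filtered out by the tweet-count condition, and user `d` by the time zone, so the match must run before the sort.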


# Quiz 3

In [105]:
"""
For this exercise, let's return to our cities infobox dataset. The question we would like you to answer
is as follows:  Which region or district in India contains the most cities? (Make sure that the count of
cities is stored in a field named 'count'; see the assertions at the end of the script.)

As a starting point, use the solution for the example question we looked at -- "Who includes the most
user mentions in their tweets?"

One thing to note about the cities data is that the "isPartOf" field contains an array of regions or 
districts in which a given city is found. See the example document in Instructor Comments below.

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline 
that can be passed to the MongoDB aggregate function. As in our examples in this lesson, the aggregation 
pipeline should be a list of one or more dictionary objects. Please review the lesson examples if you 
are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. If you want to run this code 
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the cities collection used in 
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson 
examples, your results may be different.
"""

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

### Fill the database with the cities data

In [149]:
import csv
db = get_db('examples')

with open('data/cities/cities.csv', 'r') as f:
    reader = csv.DictReader(f)
    for i in range(3):
        _ = next(reader)
    data = list()
    for doc in reader:
        if doc['isPartOf'] != "NULL":
            # Build a real list (a lazy map object cannot be stored in MongoDB).
            doc['isPartOf'] = [part.strip() for part in doc['isPartOf_label'].strip('{}').split("|")]
        else:
            doc['isPartOf'] = None
        doc['country'] = doc['country_label']
        data.append(doc)

db.cities.drop()
db.cities.insert_many(data)

<pymongo.results.InsertManyResult at 0x127858dd0>

In [150]:
[doc for doc in db.cities.find().limit(3)]

[{u'22-rdf-syntax-ns#type': u'{http://dbpedia.org/ontology/City|http://dbpedia.org/ontology/Place|http://dbpedia.org/ontology/PopulatedPlace|http://dbpedia.org/ontology/Settlement|http://schema.org/City|http://schema.org/Place|http://www.opengis.net/gml/_Feature|http://www.w3.org/2002/07/owl#Thing}',
  u'22-rdf-syntax-ns#type_label': u'{city|place|populated place|municipality|City|Place|_Feature|owl#Thing}',
  u'URI': u'http://dbpedia.org/resource/Kud',
  u'_id': ObjectId('5ca78b8c873d8102a8150875'),
  u'administrativeDistrict': u'NULL',
  u'administrativeDistrict_label': u'NULL',
  u'anthem': u'NULL',
  u'anthem_label': u'NULL',
  u'area': u'NULL',
  u'areaCode': u'NULL',
  u'areaLand': u'NULL',
  u'areaMetro': u'NULL',
  u'areaRural': u'NULL',
  u'areaTotal': u'NULL',
  u'areaUrban': u'NULL',
  u'areaWater': u'NULL',
  u'city': u'NULL',
  u'city_label': u'NULL',
  u'code': u'NULL',
  u'country': u'India',
  u'country_label': u'India',
  u'daylightSavingTimeZone': u'NULL',
  u'dayligh

In [156]:
def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

db = get_db('examples')

aggregate(db, [
    {"$match": {"country": {"$eq": u"India"}}},
    {"$project": {"isPartOf": 1, "name": 1}},
    {"$unwind": "$isPartOf"},
    {"$group": {"_id": "$isPartOf", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10}
])

[{u'_id': u'Uttar Pradesh', u'count': 623},
 {u'_id': u'Tamil Nadu', u'count': 450},
 {u'_id': u'Madhya Pradesh', u'count': 359},
 {u'_id': u'Maharashtra', u'count': 337},
 {u'_id': u'Gujarat', u'count': 229},
 {u'_id': u'Rajasthan', u'count': 213},
 {u'_id': u'Karnataka', u'count': 169},
 {u'_id': u'Andhra Pradesh', u'count': 150},
 {u'_id': u'Punjab India', u'count': 141},
 {u'_id': u'Jharkhand', u'count': 129}]

In [161]:
def make_pipeline():
    pipeline = [
        {"$match": {"country": {"$eq": u"India"}}},
        {"$project": {"isPartOf": 1, "name": 1}},
        {"$unwind": "$isPartOf"},
        {"$group": {"_id": "$isPartOf", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10}
    ]
    return pipeline

In [162]:
def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('examples')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    print("Printing the first result:")
    import pprint
    pprint.pprint(result[0])
    assert result[0]["_id"] == "Uttar Pradesh"
    assert result[0]["count"] == 623

Printing the first result:
{u'_id': u'Uttar Pradesh', u'count': 623}


In [163]:
result[:5]

[{u'_id': u'Uttar Pradesh', u'count': 623},
 {u'_id': u'Tamil Nadu', u'count': 450},
 {u'_id': u'Madhya Pradesh', u'count': 359},
 {u'_id': u'Maharashtra', u'count': 337},
 {u'_id': u'Gujarat', u'count': 229}]
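
The key step in this pipeline is `$unwind`, which turns one city document with several `isPartOf` entries into one document per entry. A plain-Python sketch of that behaviour with default options (the helper `unwind` and the second sample city are made up for illustration):

```python
def unwind(docs, field):
    # Emulates {"$unwind": "$<field>"} with default options on a list of dicts.
    out = []
    for doc in docs:
        values = doc.get(field)
        if values is None:
            continue  # null or missing: the document is dropped
        if not isinstance(values, list):
            values = [values]  # MongoDB 3.2+ treats a non-array as a one-element array
        for value in values:   # an empty list yields no output documents
            copy = dict(doc)
            copy[field] = value
            out.append(copy)
    return out


cities = [
    {"name": "Kud", "isPartOf": ["Jammu and Kashmir", "Udhampur district"]},
    {"name": "Hypothetical city", "isPartOf": None},
]
for doc in unwind(cities, "isPartOf"):
    print(doc)
```

Because each city is emitted once per entry, a city listed under both a state and a district contributes to both groups in the following `$group` stage.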

### \$group operators

 - \$sum
 - \$first
 - \$last
 - \$max
 - \$min
 - \$avg
 
 
 _Arrays_
 
 - \$push
 - \$addToSet: adds each value to the array only if it is not already present (set semantics).
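
The difference between the two array accumulators can be sketched in plain Python: `$push` behaves like appending to a list, `$addToSet` like adding to a set. The helper `group_texts` and the sample tweets are hypothetical.

```python
from collections import defaultdict


def group_texts(tweets):
    pushed = defaultdict(list)   # like {"$push": "$text"}: keeps duplicates
    unique = defaultdict(set)    # like {"$addToSet": "$text"}: drops duplicates
    for tweet in tweets:
        name = tweet["user"]["screen_name"]
        pushed[name].append(tweet["text"])
        unique[name].add(tweet["text"])
    return dict(pushed), dict(unique)


# Hypothetical sample: the same retweet appears twice.
tweets = [
    {"user": {"screen_name": "example_user"}, "text": "volto jah"},
    {"user": {"screen_name": "example_user"}, "text": "RT something"},
    {"user": {"screen_name": "example_user"}, "text": "RT something"},
]
pushed, unique = group_texts(tweets)
print(len(pushed["example_user"]), len(unique["example_user"]))
```

MongoDB returns the `$addToSet` result as an array whose element order is unspecified, so code that consumes it should not rely on ordering.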

# Quiz 4

In [167]:
"""
$push is similar to $addToSet. The difference is that rather than accumulating only unique values 
it aggregates all values into an array.

Using an aggregation query, count the number of tweets for each user. In the same $group stage, 
use $push to accumulate all the tweet texts for each user. Limit your output to the 5 users
with the most tweets. 
Your result documents should include only the fields:
"_id" (screen name of user), 
"count" (number of tweets found for the user),
"tweet_texts" (a list of the tweet texts found for the user).  

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation 
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson, 
the aggregation pipeline should be a list of one or more dictionary objects. 
Please review the lesson examples if you are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. If you want to run this code 
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the twitter dataset used in 
examples in this lesson. If you attempt some of the same queries that we looked at in the lesson 
examples, your results will be different.
"""

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

In [174]:
# Let's look at one record
db = get_db('twitter')
record = next(db.tweets.find().limit(1))
import pprint
pprint.pprint(record)
print('-'*100)
print('\n')
pprint.pprint(record['user'])

{u'_id': ObjectId('5ca75a57873d8102a8126532'),
 u'contributors': None,
 u'coordinates': None,
 u'created_at': u'Thu Sep 02 18:11:23 +0000 2010',
 u'entities': {u'hashtags': [], u'urls': [], u'user_mentions': []},
 u'favorited': False,
 u'geo': None,
 u'id': 22819396900L,
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_user_id': None,
 u'place': None,
 u'retweet_count': None,
 u'retweeted': False,
 u'source': u'web',
 u'text': u'eu preciso de terminar de fazer a minha tabela, est\xe1 muito foda **',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
           u'created_at': u'Fri Jul 03 21:44:05 +0000 2009',
           u'description': u's\xf3 os loucos sabem (:',
           u'favourites_count': 1,
           u'follow_request_sent': None,
           u'followers_count': 102,
           u'following': None,
           u'friends_count': 73,
           u'geo_enabled': False,
           u'id': 53507833,
           u'lang': u'en',
           u'l

In [184]:
db = get_db('twitter')

def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

aggregate(db, [ 
        {"$group": {"_id": "$user.screen_name",
                    "count": {"$sum": 1},
                    "tweet_texts": {"$push": "$text"}
                   }},
        {"$sort": {"count": -1}},
        {"$limit": 5}
    ])

[{u'_id': u'behcolin',
  u'count': 8,
  u'tweet_texts': [u'RT @VouConfessarQue: #VouConfessarQue j\xe1 aprendi uma mat\xe9ria inteira poucos minutos antes de uma prova.',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
   u'volto jah']},
 {u'_id': u'mysterytrick',
  u

In [185]:
def make_pipeline():
    pipeline = [ 
        {"$group": {"_id": "$user.screen_name",
                    "count": {"$sum": 1},
                    "tweet_texts": {"$push": "$text"}
                   }},
        {"$sort": {"count": -1}},
        {"$limit": 5}
    ]
    return pipeline

In [191]:
def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

db = get_db('twitter')
pipeline = make_pipeline()
result = aggregate(db, pipeline)
import pprint
pprint.pprint(result)
assert len(result) == 5
assert result[0]["count"] > result[4]["count"]
sample_tweet_text = u'Take my money! #liesguystell http://movie.sras2.ayorganes.com'
# assert result[4]["tweet_texts"][0] == sample_tweet_text

[{u'_id': u'behcolin',
  u'count': 8,
  u'tweet_texts': [u'RT @VouConfessarQue: #VouConfessarQue j\xe1 aprendi uma mat\xe9ria inteira poucos minutos antes de uma prova.',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, mas pelo jeito a mar\xe9 trouxe ela de volta!',
                   u'RT @TweetGargalhada: Geisy Arruda nos TT? Achei que tinha finalmente sumido, m

In [194]:
result[4]["tweet_texts"][0]

u'Photo: (via lovedbythesun) http://tumblr.com/xqrhfp90o'

In [193]:
sample_tweet_text

u'Take my money! #liesguystell http://movie.sras2.ayorganes.com'

# Quiz 5

In [202]:
"""
In an earlier exercise we looked at the cities dataset and asked which region in India contains 
the most cities. In this exercise, we'd like you to answer a related question regarding regions in 
India. What is the average city population for a region in India? Calculate your answer by first 
finding the average population of cities in each region and then by calculating the average of the 
regional averages.

Hint: If you want to accumulate using values from all input documents to a group stage, you may use 
a constant as the value of the "_id" field. For example, 
    { "$group" : {"_id" : "India Regional City Population Average",
      ... }

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation 
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson, 
the aggregation pipeline should be a list of one or more dictionary objects. 
Please review the lesson examples if you are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. If you want to run this code 
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the cities collection used 
in examples in this lesson. If you attempt some of the same queries that we looked at in the lesson 
examples, your results will be different.
"""

def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

### Let's fill the db with all the fields

In [217]:
def most_significant(text, converter=float):
    # The field may hold several candidates, e.g. "{123123.4523|14|123445}".
    # Keep the one written with the most significant digits.
    options = text.strip("{}").split("|")
    num_significant = [len(x.replace('.', '').replace(',', '').rstrip('0'))
                       for x in options]
    return converter(options[num_significant.index(max(num_significant))])

In [220]:
most_significant("{123123.4523|14|123445}")

123123.4523

In [232]:
import csv
FIELDS = ["elevation", "name", "country", "lon", "lat", "isPartOf", 
          "timeZone", "population"]

db = get_db('examples')

with open('data/cities/cities.csv', 'r') as f:
    reader = csv.DictReader(f)
    for i in range(3):
        _ = next(reader)
    data = list()
    for doc in reader:
        if doc['isPartOf'] != "NULL":
            # Build a real list (a lazy map object cannot be stored in MongoDB).
            doc['isPartOf'] = [part.strip() for part in doc['isPartOf_label'].strip('{}').split("|")]
        else:
            doc['isPartOf'] = None
        doc['country'] = doc['country_label']
        if doc['point'] != "NULL":
            # Each point is "latitude longitude"; keep the first one listed.
            point = doc['point'].strip('{}').split('|')[0].split()
            doc['lat'] = float(point[0])
            doc['lon'] = float(point[1])
        else:
            doc['lon'] = None
            doc['lat'] = None
        if doc['populationTotal'] != "NULL":
            doc['population'] = int(doc['populationTotal'].strip('{}').split('|')[0])
        else:
            doc['population'] = None
        doc['timeZone'] = doc['timeZone_label']
        for field in FIELDS:
            if doc[field] == 'NULL':
                doc[field] = None
        if doc['elevation'] is not None:
            doc['elevation'] = most_significant(doc['elevation'])
        data.append({field: doc[field] for field in FIELDS})

db.cities.drop()
db.cities.insert_many(data)

<pymongo.results.InsertManyResult at 0x1233fc290>

In [234]:
# Let's look at one record
db = get_db('examples')
next(db.cities.find().limit(1))

{u'_id': ObjectId('5ca79c02873d8102a816e2ca'),
 u'country': u'India',
 u'elevation': 1855.0,
 u'isPartOf': [u'Jammu and Kashmir', u'Udhampur district'],
 u'lat': 33.08,
 u'lon': 75.28,
 u'name': u'Kud',
 u'population': 1140,
 u'timeZone': u'Indian Standard Time'}

### Now, let's solve it

In [248]:
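# aggregate() has to query the cities collection for this quiz.
def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]

aggregate(db, [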
    {"$match": {"country": "India"}},
    {"$group": {"_id": {"$arrayElemAt": ["$isPartOf", 0]}, 
                "avg_pop": {"$avg": "$population"}
               }},
    {"$limit": 10}
])

[{u'_id': u'Etaeh_district', u'avg_pop': 35662.0},
 {u'_id': u'Haveri', u'avg_pop': 15874.0},
 {u'_id': u'Pratapgarh district Uttar Pradesh', u'avg_pop': 24608.6},
 {u'_id': u'Dindigul district', u'avg_pop': 28940.391304347828},
 {u'_id': u'Are-Malenadu', u'avg_pop': 18517.0},
 {u'_id': u'Bhopal district', u'avg_pop': 24289.0},
 {u'_id': u'Balod_district', u'avg_pop': 35829.5},
 {u'_id': u'Hapur district', u'avg_pop': 262801.0},
 {u'_id': u'Panchsheel_nagar_India', u'avg_pop': 5938.0},
 {u'_id': u'Debagarh district', u'avg_pop': 20085.0}]

In [251]:
aggregate(db, [
    {"$match": {"country": "India"}},
    {"$group": {"_id": {"$arrayElemAt": ["$isPartOf", 0]}, 
                "avg_pop": {"$avg": "$population"}
               }},
    {"$group": {"_id": "India Regional City Population Average", 
                "avg_pop_tot": {"$avg": "$avg_pop"}}},
    {"$limit": 10}
])

[{u'_id': u'India Regional City Population Average',
  u'avg_pop_tot': 127518.0195506342}]

In [253]:
aggregate(db, [
    {"$match": {"country": "India"}},
    {"$unwind": "$isPartOf"},
    {"$group": {"_id": "$isPartOf", 
                "avg_pop": {"$avg": "$population"}
               }},
    {"$group": {"_id": "India Regional City Population Average", 
                "avg_pop_tot": {"$avg": "$avg_pop"}}},
    {"$limit": 10}
])

[{u'_id': u'India Regional City Population Average',
  u'avg_pop_tot': 201128.02415469187}]

In [256]:
def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match": {"country": "India"}},
        {"$unwind": "$isPartOf"},
        {"$group": {"_id": "$isPartOf",
                    "avg_pop": {"$avg": "$population"}
                   }},
        {"$group": {"_id": "India Regional City Population Average",
                    "avg": {"$avg": "$avg_pop"}}},
        {"$limit": 10}
    ]
    return pipeline

In [257]:
def aggregate(db, pipeline):
    return [doc for doc in db.cities.aggregate(pipeline)]


db = get_db('examples')
pipeline = make_pipeline()
result = aggregate(db, pipeline)
assert len(result) == 1
# Your result should be close to the value after the minus sign.
assert abs(result[0]["avg"] - 201128.0241546919) < 10 ** -8
import pprint
pprint.pprint(result)

[{u'_id': u'India Regional City Population Average',
  u'avg': 201128.02415469187}]
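
The quiz asks for an average of regional averages, which is generally not the same number as the average over all cities. A small arithmetic sketch with made-up populations (the region names, the numbers and the helper `mean` are hypothetical):

```python
def mean(values):
    # Like $avg, skip missing values; return None when nothing is left.
    values = [v for v in values if v is not None]
    if not values:
        return None
    return sum(values) / float(len(values))


# Hypothetical city populations per region.
regions = {"Region A": [10, 20], "Region B": [100]}

# First $group: one average per region.
regional_means = {name: mean(pops) for name, pops in regions.items()}
# Second $group with a constant _id: average of those averages.
mean_of_means = mean(regional_means.values())
# For comparison: a single average over every city.
overall_mean = mean([p for pops in regions.values() for p in pops])

print(regional_means, mean_of_means, overall_mean)
```

Here the regional means are 15.0 and 100.0, so the mean of means is 57.5, while the overall mean is 130 / 3, roughly 43.3. Each region counts equally in the first figure regardless of how many cities it has.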


# Indexes

An index can cover a single field or several fields in a fixed order. For example, a compound index on (hashtag, date, username).

```
# pymongo expects a list of (key, direction) pairs
db.nodes.create_index([("tg", pymongo.ASCENDING)])
```
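
A compound index over (hashtag, date, username) could be created along the following lines. This is a sketch under assumptions: the three field names are hypothetical, `create_tweet_indexes` is a made-up helper, and `RecordingCollection` is a stand-in for a pymongo `Collection` so the snippet runs without a MongoDB server.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def create_tweet_indexes(collection):
    # `collection` is expected to behave like a pymongo Collection.
    # Field names are hypothetical; use the ones your queries filter and sort on.
    return collection.create_index(
        [("hashtag", ASCENDING), ("date", ASCENDING), ("username", ASCENDING)])


class RecordingCollection(object):
    # Stand-in that records the call and builds a name in MongoDB's default style.
    def __init__(self):
        self.calls = []

    def create_index(self, keys):
        self.calls.append(keys)
        return "_".join("{}_{}".format(name, direction) for name, direction in keys)


fake = RecordingCollection()
print(create_tweet_indexes(fake))
```

Key order matters in a compound index: it can serve queries on `hashtag` alone or on `hashtag` plus `date`, but not queries that filter only on `username`.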

### Geospatial indexes

```
db.nodes.create_index([("loc", pymongo.GEO2D)])
db.nodes.find({"loc": {"$near": [41.94, -87.65]}})
```