In [1]:
# Import libraries
from pymongo import MongoClient
from pprint import pprint
from operator import itemgetter  # Used as a sort key
from collections import Counter

In [2]:
# Client connects to "localhost" by default
client = MongoClient()

# Connect to "nobel" database on the fly
db = client["nobel"]

# 03. Get Only What You Need, and Fast

You can now query collections with ease and collect documents to examine and analyze with Python. But this process is sometimes slow and onerous for large collections and documents. This chapter is about various ways to speed up and simplify that process.

## 03.01 Projection

1. Projection: Getting only what you need
>In this chapter, we're going to learn about getting only what you need, and fast. We'll start with what MongoDB calls "projection".

2. What is "projection"?
>The term "projection" is about reducing multidimensional data. In cartography, it's about getting what you need to make a reasonable image from a 3D earth. You can also think about asking a certain part of your data to project its voice, to "speak up"! With a table of data, it's about selecting columns. With MongoDB, it's about selecting substructure.

3. Projection in MongoDB
>In MongoDB, we fetch projections by specifying what document fields interest us. We can do this by passing a dictionary as a second argument to the "find" method of a collection. For each field that we want to include in the projection, we give a value of 1. Fields that we don't include in the dictionary are not included in the projection. The exception is a document's "_id" field. The "_id" field is always included in a projection by default. We must assign it the value 0 in the projection dictionary to leave it out. Here I try to collect the prize affiliation data for all laureates - my filter document is empty. What I get back is not the data itself, but a so-called cursor, an iterable that I can fetch documents from, one at a time.

4. Projection in MongoDB
>In Python, we can collect from an iterable into a list. We do this by passing it to the "list" function. I don't want to print out hundreds of laureate documents. Thus, I also use slicing syntax to get only the first three elements of the resulting list. We can see that our projections contain only document data about prize affiliations. We also retained the structure of that data. Remember how to project fields this way: it’s going to be very useful in the rest of the course.

5. Missing fields
>What happens when you try to project out fields that are not present in some documents? Rather than raise an error, MongoDB returns the documents without those fields. This expression projects the bornCountry and firstname fields. The bornCountry field isn't present for organization laureates, though. Only the firstname and _id fields get returned. Notice that I formatted the projection as a list of fields. When a projection doesn't involve excluding fields, the pymongo driver accepts this format.

6. Missing fields
>Also, it's okay if a projected field isn't in any of a collection's documents. Here, because there is no favoriteIceCreamFlavor field, the projection returns only object IDs.

7. Simple aggregation
>We're going to learn about MongoDB's aggregation framework in the next chapter. But already we have a new tool to fetch less data, only what we need. For example, let's count the total number of prize medals awarded. That is, the total number of elements in prizes arrays across all laureates. We can iterate over a cursor of all laureates with only the prizes field projected out. In this way, we avoid having to download the other data in each laureate document. This can definitely affect performance for very large collections. We can even, in this case, use a comprehension to reduce memory overhead in Python. We can leverage Python's built-in tools for iterables and dictionaries. And we can use projection to slim down these dictionaries to contain only what we need for our analysis.

8. Let's project!
>Let's practice some projection.

In [3]:
# Projection in MongoDB
# Include only prizes.affiliations, exclude _id
docs = db.laureates.find(filter={}, 
                         projection={"prizes.affiliations": 1, "_id": 0})
print(type(docs))

# convert the cursor to a list
docs = list(docs)

# size of docs
print('Size:', len(docs))

# slice out the first three documents
pprint(docs[:3])

<class 'pymongo.cursor.Cursor'>
Size: 934
[{'prizes': [{'affiliations': [{'city': 'Leiden',
                                'country': 'the Netherlands',
                                'name': 'Leiden University'}]}]},
 {'prizes': [{'affiliations': [{'city': 'Providence, RI',
                                'country': 'USA',
                                'name': 'Brown University'}]}]},
 {'prizes': [{'affiliations': [{'city': 'Philadelphia, PA',
                                'country': 'USA',
                                'name': 'University of Pennsylvania'}]}]}]


In [4]:
# Missing fields
# Use "gender": "org" to select organizations; organizations have no bornCountry
docs = db.laureates.find(filter={"gender": "org"},
                         projection=["bornCountry", "firstname"])
docs = list(docs)
print(len(docs))
pprint(docs[:2])

# only projected fields that exist are returned
docs = db.laureates.find({}, 
                         ["favoriteIceCreamFlavor"]) 
docs = list(docs)
print(len(docs))
pprint(docs[:2])

24
[{'_id': ObjectId('6035cd48354dd8e35462339c'),
  'firstname': 'Comité international de la Croix Rouge (International '
               'Committee of the Red Cross)'},
 {'_id': ObjectId('6035cd48354dd8e3546233f1'),
  'firstname': 'Friends Service Council (The Quakers)'}]
934
[{'_id': ObjectId('6035cd48354dd8e354623266')},
 {'_id': ObjectId('6035cd48354dd8e354623267')}]
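
To make the missing-field behaviour concrete without a running server, here is a minimal pure-Python sketch of an inclusion projection. It only handles top-level field names. The helper `project_top_level` and the two sample documents are hypothetical stand-ins (with integer `_id` values), not pymongo's or MongoDB's actual implementation.

```python
def project_top_level(doc, fields, include_id=True):
    """Hypothetical helper: mimic an inclusion projection on top-level fields."""
    projected = {}
    # "_id" comes along by default unless it is explicitly left out
    if include_id and "_id" in doc:
        projected["_id"] = doc["_id"]
    for field in fields:
        # A field that is absent from the document is skipped, not an error
        if field in doc:
            projected[field] = doc[field]
    return projected


# Hand-written stand-ins for laureate documents (not real query results)
person = {"_id": 1, "firstname": "Antoine Henri", "surname": "Becquerel",
          "bornCountry": "France"}
organization = {"_id": 2, "firstname": "Friends Service Council (The Quakers)",
                "gender": "org"}

person_view = project_top_level(person, ["bornCountry", "firstname"])
org_view = project_top_level(organization, ["bornCountry", "firstname"])
empty_view = project_top_level(person, ["favoriteIceCreamFlavor"])

print(person_view)
print(org_view)
print(empty_view)
```

The organization keeps only `_id` and `firstname`, and projecting a field that no document has leaves just `_id`, which mirrors the two outputs above.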


In [5]:
# Simple aggregation
docs = list(db.laureates.find({}, ["prizes"]))
n_prizes = 0
for doc in docs:
    # count the number of prizes in each doc
    n_prizes += len(doc["prizes"])
print(n_prizes)

# using a generator expression (no intermediate list of counts)
print(sum(len(doc["prizes"]) for doc in docs))

941
941
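
The same counting idea can be tried without a database. In the sketch below, `fake_cursor` is a hypothetical generator that stands in for a pymongo cursor, and the three documents are made up for illustration. The point is that a generator expression consumes documents one at a time instead of first collecting them into a list.

```python
from collections import Counter


def fake_cursor():
    """Hypothetical stand-in for a cursor: yields documents one at a time."""
    yield {"_id": 1, "prizes": [{"year": "1903"}]}
    yield {"_id": 2, "prizes": [{"year": "1903"}, {"year": "1911"}]}
    yield {"_id": 3, "prizes": [{"year": "1921"}]}


# Generator expression: no list of documents or of counts is ever built
n_prizes = sum(len(doc["prizes"]) for doc in fake_cursor())
print(n_prizes)

# Counter also accepts any iterable: how many laureates hold 1 prize, 2 prizes, ...
prizes_per_laureate = Counter(len(doc["prizes"]) for doc in fake_cursor())
print(prizes_per_laureate)
```

With a real collection you would pass the cursor returned by `find` in place of `fake_cursor()`.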


## 03.02 Shares of the 1903 Prize in Physics

You want to examine the laureates of the 1903 prize in physics and how they split the prize. Here is a query without projection:

<code>db.laureates.find_one({"prizes": {"$elemMatch": {"category": "physics", "year": "1903"}}})</code>

**Instructions**

Which projection(s) will fetch ONLY the laureates' full names and prize share info? I encourage you to experiment with the console and re-familiarize yourself with the structure of laureate collection documents.

**Possible Answers**
1. ["firstname", "surname", "prizes"]
2. ["firstname", "surname", "prizes.share"]
3. {"firstname": 1, "surname": 1, "prizes.share": 1, "_id": 0}
4. All of the above

**Results**

<font color=darkgreen>This represents the minimal projection to get the info we need. Great!</font>

In [6]:
criteria = {"prizes": {"$elemMatch": {"category": "physics", "year": "1903"}}}
fields_to_retrieve = {"firstname": 1, "surname": 1, "prizes.share": 1, "_id": 0}

pprint(db.laureates.find_one(criteria))
docs = list(db.laureates.find(filter = criteria, projection = fields_to_retrieve))

print(len(docs))
pprint(docs)

{'_id': ObjectId('6035cd48354dd8e3546232a8'),
 'born': '1852-12-15',
 'bornCity': 'Paris',
 'bornCountry': 'France',
 'bornCountryCode': 'FR',
 'died': '1908-08-25',
 'diedCountry': 'France',
 'diedCountryCode': 'FR',
 'firstname': 'Antoine Henri',
 'gender': 'male',
 'id': '4',
 'prizes': [{'affiliations': [{'city': 'Paris',
                               'country': 'France',
                               'name': 'École Polytechnique'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services he '
                           'has rendered by his discovery of spontaneous '
                           'radioactivity"',
             'share': '2',
             'year': '1903'}],
 'surname': 'Becquerel'}
3
[{'firstname': 'Antoine Henri',
  'prizes': [{'share': '2'}],
  'surname': 'Becquerel'},
 {'firstname': 'Pierre', 'prizes': [{'share': '4'}], 'surname': 'Curie'},
 {'firstname': 'Marie',
  'prizes': [{'share': '4'}, {'share': '1'}],
  '

## 03.03 Rounding up the G.S. crew

In chapter 2, you used a regular expression object __Regex__ to find values that follow a pattern. We can also use the regular expression operator __$regex__ for the same purpose. For example, the following query:

<code>{"name": {"$regex": "^Py"}}</code>

will fetch documents where the field __'name'__ starts with __"Py"__. Here the caret symbol __^__ means "starts with".

In this exercise, you will use regular expressions, projection, and list comprehension to collect the full names of laureates whose initials are "G.S.".
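
If you want to check what the caret does before querying, Python's own `re` module is a reasonable stand-in. This is only a sketch: the sample list is hand-written, and Python's regex engine is not the one MongoDB uses, although for a simple pattern like `^G` the two should agree. Like `re.search`, `$regex` matches anywhere in the string unless the pattern is anchored.

```python
import re

# Hand-written stand-ins for laureate documents
laureates = [
    {"firstname": "George D.", "surname": "Snell"},
    {"firstname": "George P.", "surname": "Smith"},
    {"firstname": "Gérard", "surname": "Mourou"},
    {"firstname": "Donna", "surname": "Strickland"},
    {"firstname": "Marie", "surname": "Curie, née Sklodowska"},
]

starts_with_g = re.compile(r"^G")
starts_with_s = re.compile(r"^S")

# re.search scans the whole string, so the caret is what pins the match to the start
gs_names = [
    f"{doc['firstname']} {doc['surname']}"
    for doc in laureates
    if starts_with_g.search(doc["firstname"]) and starts_with_s.search(doc["surname"])
]
print(gs_names)

# Without the caret, any surname that merely contains "S" would match
contains_s = [doc["surname"] for doc in laureates if re.search(r"S", doc["surname"])]
print(contains_s)
```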

**Instructions**

1. First, use regular expressions to fetch the documents for the laureates whose "firstname" starts with "G" and whose "surname" starts with "S".

In the previous step, we fetched all the data for all the laureates with initials G.S. This is unnecessary if we only want their full names!

2. Use projection and adjust the query to select only the "firstname" and "surname" fields.

Now the documents you fetched contain only the relevant information!

3. Iterate over the documents, and for each document, concatenate the first name and the surname fields together with a space in between to obtain full names.

**Results**

<font color=darkgreen>Great work!</font>

In [7]:
# Find laureates whose first name starts with "G" and last name starts with "S"
criteria = {"firstname": {"$regex": "^G"}, "surname": {"$regex": "^S"}}
docs = list(db.laureates.find(filter = criteria))
print(f'\nFilter: {criteria} \nFound docs: {len(docs)} \nFirst doc:')
pprint(docs[0])

# Use projection to select only firstname and surname
fields_to_retrieve = {"firstname": 1, "surname": 1, "_id": 0}
docs = list(db.laureates.find(filter = criteria, projection = fields_to_retrieve))
print(f'\nFilter: {criteria} \nFound docs: {len(docs)} \nFirst doc:')
pprint(docs[0])

# Iterate over docs and concatenate first name and surname
full_names = [doc["firstname"] + " " + doc["surname"]  for doc in docs]

# Print the full names
print('\nFinal result:')
pprint(full_names)


Filter: {'firstname': {'$regex': '^G'}, 'surname': {'$regex': '^S'}} 
Found docs: 9 
First doc:
{'_id': ObjectId('6035cd48354dd8e354623383'),
 'born': '1903-12-19',
 'bornCity': 'Bradford, MA',
 'bornCountry': 'USA',
 'bornCountryCode': 'US',
 'died': '1996-06-06',
 'diedCity': 'Bar Harbor, ME',
 'diedCountry': 'USA',
 'diedCountryCode': 'US',
 'firstname': 'George D.',
 'gender': 'male',
 'id': '421',
 'prizes': [{'affiliations': [{'city': 'Bar Harbor, ME',
                               'country': 'USA',
                               'name': 'Jackson Laboratory'}],
             'category': 'medicine',
             'motivation': '"for their discoveries concerning genetically '
                           'determined structures on the cell surface that '
                           'regulate immunological reactions"',
             'share': '3',
             'year': '1980'}],
 'surname': 'Snell'}

Filter: {'firstname': {'$regex': '^G'}, 'surname': {'$regex': '^S'}} 
Found docs: 9 
First

## 03.04 Doing our share of data validation

In our Nobel __prizes__ collection, each document has an array of laureate subdocuments "laureates", each containing information such as the prize share for a laureate:

<code>
{'_id': ObjectId('5bc56145f35b634065ba1997'),
 'category': 'chemistry',
 'laureates': [{'firstname': 'Frances H.',
   'id': '963',
   'motivation': '"for the directed evolution of enzymes"',
   'share': '2',
   'surname': 'Arnold'},
  {'firstname': 'George P.',
   'id': '964',
   'motivation': '"for the phage display of peptides and antibodies"',
   'share': '4',
   'surname': 'Smith'},
 {...
</code>

Each __"laureates.share"__ value appears to be the reciprocal of a laureate's fractional share of that prize, encoded as a string. For example, a laureate "share" of "4" means that this laureate received a 1/4 share of the prize. Let's check that for each prize, all the shares of all the laureates add up to 1!

Notice the quotes around the values in the __"share"__ field: these values are actually given as strings! You'll have to convert them to numbers before you find the reciprocals and add up the shares.
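
As a side note, the reciprocal arithmetic can be done exactly with the standard library's `fractions.Fraction`, so the comparison with 1 does not depend on floating-point rounding. This is an optional sketch, not the approach the exercise asks for; the sample prize is written by hand in the shape of a projected document.

```python
from fractions import Fraction

# Hand-written stand-in for one projected prize document
prize = {"laureates": [{"share": "2"}, {"share": "4"}, {"share": "4"}]}

# "share" is the reciprocal of the fraction received, stored as a string
total_share = sum(Fraction(1, int(laureate["share"])) for laureate in prize["laureates"])
print(total_share)

# Thirds are not exactly representable as floats, but are exact as fractions
three_way_split = sum(Fraction(1, 3) for _ in range(3))
print(three_way_split == 1)
```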

**Instructions**

1. Save a list of prizes (prizes), projecting out only the "laureates.share" values for each prize.
2. For each prize, compute the total share as follows:<br>
    2.1. Initialize the variable total_share to 0.<br>
    2.2. Iterate over the laureates for each prize, converting the "share" field of the "laureate" to float and adding the reciprocal of it (that is, 1 divided by it) to total_share.

**Results**

<font color=darkgreen>Phenomenal! It seems like all the shares add up to 1 for all the prizes!</font>

In [8]:
# Save documents, projecting out laureates share
criteria = {}
fields_to_retrieve = ['laureates.share']
prizes = list(db.prizes.find(criteria, fields_to_retrieve))

print(f'Prizes found: {len(prizes)} \nFirst prize found:')
pprint(prizes[0])

total_share_not_one = []
# Iterate over prizes
for prize in prizes:
    # Initialize total share
    total_share = 0
    
    # Iterate over laureates for the prize
    for laureate in prize["laureates"]:
        # add the share of the laureate to total_share
        total_share += 1 / float(laureate['share'])
        
    # Print the total share if not one    
    if total_share != 1: 
        print(total_share)    
        # record the prize's _id (projected laureate subdocuments have no _id)
        total_share_not_one.append(prize['_id'])

if len(total_share_not_one) == 0:
    print('All prize shares add up to 1!')

Prizes found: 590 
First prize found:
{'_id': ObjectId('6035cd48354dd8e354623018'),
 'laureates': [{'share': '2'}, {'share': '4'}, {'share': '4'}]}
All prize shares add up to 1!


## 03.05 Sorting

1. Sorting
>We've learned how to project out only the fields we need from a query. There are other conditions we can give to MongoDB about the data returned. In this lesson, we'll learn how to sort results on the server before they get returned to us.

2. Sorting post-query with Python
>First, it's useful to review how we can sort a list of retrieved documents with only Python. This may be plenty performant, especially for small datasets. After all, we can store a local cache of our data and try many different operations on it. Let's get a list of all physics prize documents, with the year field projected. I'm not interested in printing out the full projected documents with the document ids. So, I'll use a list comprehension to collect the year values, slice out the first five values, and print them. It looks like the documents are in reverse chronological order, but we get no guarantee of this. To sort the documents in ascending order of year, I use the built-in Python "sorted" function. I also import the "itemgetter" function from Python's standard library. Given a key, it fetches the value for that key in a dictionary. To sort in reverse - or descending - order, we can pass True as the "reverse" keyword argument to the "sorted" function.

3. Sorting in-query with MongoDB
>We can also ask Mongo to do simple sorting by field values on the server and yield results in sorted order. Here, we pass a "sort" argument to the "find" method, giving a list of field-direction pairs. In this case, we want to sort on the "year" field, and in the ascending direction. I don't need the documents beyond the print statement, so I pass the cursor to the list comprehension. To sort by year in descending order, we use negative one as the second element of the sort pair. Why is a list passed to the sort keyword argument? This is because you can sort first by one field and then by others. Let's see how.

4. Primary and secondary sorting
>Let's sort prize documents first by ascending year and then by descending category. To do this, we provide the corresponding pairs in order as a list. Here, we also query for prizes awarded between 1966 and 1970, exclusive. We project out only the data we need, the category and year values. Notice that we could sort by fields that we do not project - in this case, we happen to be also projecting the sort fields. For each projection yielded by the cursor, we format a string with the year and category value. To do this, we use Python's double-star dictionary unpacking syntax. The output shows years increasing and, for each year, categories decreasing. In both cases, the order is alphabetical because the fields are both string-valued. For the four-digit-year strings, sorting produces the same result as numerical sorting. We can see that there was no award for economics in 1967 or 1968, and there was no award for peace in 1967.

5. Sorting with pymongo versus MongoDB shell
>One last thing: the command-line shell for MongoDB uses JavaScript. You specify a sort using the form of a JavaScript object, which looks like a Python dictionary. This works because JavaScript objects in the console keep their key order as entered. In Python 3.6 and below, though, there is no similar guarantee with dictionaries. The order of keys may not be preserved as entered. This is why pymongo requires a list of tuples.

6. Let's get sorted!
>Let's get some practice with sorting.

In [9]:
# Just getting the data
docs = list(db.prizes.find({"category": "physics"}, ["year"]))
print([doc["year"] for doc in docs][:5])

# Sorting post-query with itemgetter, ascending
docs = sorted(docs, key=itemgetter("year"))
print([doc["year"] for doc in docs][:5])

# Sorting post-query with itemgetter, descending
docs = sorted(docs, key=itemgetter("year"), reverse=True)
print([doc["year"] for doc in docs][:5])

['2018', '2017', '2016', '2015', '2014']
['1901', '1902', '1903', '1904', '1905']
['2018', '2017', '2016', '2015', '2014']


In [10]:
# Sorting in-query with MongoDB - ascending
cursor = db.prizes.find({"category": "physics"}, ["year"], sort=[("year", 1)])
print([doc["year"] for doc in cursor][:5])

# Sorting in-query with MongoDB - descending
cursor = db.prizes.find({"category": "physics"}, ["year"],
                        sort=[("year", -1)])
print([doc["year"] for doc in cursor][:5])

# Primary and secondary sorting
for doc in db.prizes.find(filter     = {"year": {"$gt": "1966", "$lt": "1970"}}, 
                          projection = ["category", "year"], 
                          sort       = [("year", 1), ("category", -1)]):
    print("{year} {category}".format(**doc))

['1901', '1902', '1903', '1904', '1905']
['2018', '2017', '2016', '2015', '2014']
1967 physics
1967 medicine
1967 literature
1967 chemistry
1968 physics
1968 peace
1968 medicine
1968 literature
1968 chemistry
1969 physics
1969 peace
1969 medicine
1969 literature
1969 economics
1969 chemistry
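
For comparison, the same primary and secondary ordering can be reproduced post-query in plain Python. The sketch below relies on the fact that Python's sort is stable: sort by the secondary key first, then by the primary key. The five documents are hand-written stand-ins for projected prize documents.

```python
from operator import itemgetter

# Hand-written stand-ins for projected prize documents, deliberately unordered
docs = [
    {"year": "1968", "category": "peace"},
    {"year": "1967", "category": "chemistry"},
    {"year": "1968", "category": "physics"},
    {"year": "1967", "category": "physics"},
    {"year": "1968", "category": "chemistry"},
]

# Pass 1: secondary key, category descending
by_category_desc = sorted(docs, key=itemgetter("category"), reverse=True)
# Pass 2: primary key, year ascending; ties keep the order from pass 1
by_year_then_category = sorted(by_category_desc, key=itemgetter("year"))

lines = ["{year} {category}".format(**doc) for doc in by_year_then_category]
print("\n".join(lines))
```

A single `sorted` call with a tuple key would not work directly here, because the two keys sort in opposite directions and both are strings.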


## 03.06 What the sort?

**Instructions**
This block prints out the first five projections of a sorted query. What "sort" argument fills the blank?

<code>
docs = list(db.laureates.find(
    {"born": {"$gte": "1900"}, "prizes.year": {"$gte": "1954"}},
    {"born": 1, "prizes.year": 1, "_id": 0},
    sort=____))
for doc in docs[:5]:
    print(doc)
</code>
    
<code>
{'born': '1916-08-25', 'prizes': [{'year': '1954'}]}
{'born': '1915-06-15', 'prizes': [{'year': '1954'}]}
{'born': '1901-02-28', 'prizes': [{'year': '1954'}, {'year': '1962'}]}
{'born': '1913-07-12', 'prizes': [{'year': '1955'}]}
{'born': '1911-01-26', 'prizes': [{'year': '1955'}]}
</code>

**Possible Answers**

1. __[("prizes.year", 1), ("born", -1)]__
2. {"prizes.year": 1, "born": -1}
3. None
4. [("prizes.year", 1)]

**Results**

<font color=darkgreen>Yes! Does the 'prizes.year' field sort like you expect?</font>

In [11]:
docs = list(db.laureates.find(
    {"born": {"$gte": "1900"}, "prizes.year": {"$gte": "1954"}},
    {"born": 1, "prizes.year": 1, "_id": 0},
    sort=[('prizes.year', 1), ('born', -1)]))

for doc in docs[:5]:
    print(doc)

{'born': '1916-08-25', 'prizes': [{'year': '1954'}]}
{'born': '1915-06-15', 'prizes': [{'year': '1954'}]}
{'born': '1901-02-28', 'prizes': [{'year': '1954'}, {'year': '1962'}]}
{'born': '1913-07-12', 'prizes': [{'year': '1955'}]}
{'born': '1911-01-26', 'prizes': [{'year': '1955'}]}
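
Why does the laureate with prizes in 1954 and 1962 sort among the 1954 documents? To my understanding, when MongoDB sorts ascending on a field inside an array, it compares documents by the smallest value in that array. The sketch below imitates that rule in plain Python on the five documents printed above; `earliest_prize_year` is a hypothetical helper, not a pymongo function.

```python
# The five projected documents from the output above, in shuffled order
docs = [
    {"born": "1913-07-12", "prizes": [{"year": "1955"}]},
    {"born": "1901-02-28", "prizes": [{"year": "1954"}, {"year": "1962"}]},
    {"born": "1916-08-25", "prizes": [{"year": "1954"}]},
    {"born": "1911-01-26", "prizes": [{"year": "1955"}]},
    {"born": "1915-06-15", "prizes": [{"year": "1954"}]},
]


def earliest_prize_year(doc):
    """Hypothetical sort key: the smallest "year" in the prizes array."""
    return min(prize["year"] for prize in doc["prizes"])


# Stable two-pass sort: secondary key first (born, descending),
# then primary key (smallest prize year, ascending)
by_born_desc = sorted(docs, key=lambda doc: doc["born"], reverse=True)
ordered = sorted(by_born_desc, key=earliest_prize_year)

for doc in ordered:
    print(doc)
```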


## 03.07 Sorting together: MongoDB + Python

In this exercise you'll explore the prizes in the physics category. __You will use Python to sort laureates for one prize by last name, and then MongoDB to sort prizes by year__:

<code>
1901: Röntgen
1902: Lorentz and Zeeman
1903: Becquerel and Curie and Curie, née Sklodowska
</code>

You'll start by writing a function that takes a prize document as an argument, extracts all the laureates from that document, arranges them in alphabetical order, and returns a string containing the last names separated by __" and "__.

The Nobel database is again available to you as __db__. We also pre-loaded a sample document __sample_doc__ so you can test your laureate-extracting function.

(Remember that you can always type __help(function_name)__ in console to get a refresher on functions you might be less familiar with, e.g. __help(sorted)__!)

**Instructions**

1. Complete the definition of all_laureates(prize). Within the body of the function:<br>
    1.1. Sort the "laureates" list of the prize document according to the "surname" key.<br>
    1.2. For each of the laureates in the sorted list, extract the "surname" field.<br>
    1.3. The code for joining the last names into a single string is already written for you.<br>
    1.4. Take a look at the console to make sure the output looks like what you'd expect!<br>
2. Find the documents for the prizes in the physics category, sort them in chronological order (by "year", ascending), and only fetch the "year", "laureates.firstname", and "laureates.surname" fields.
3. Now that you have the prizes, and the function to extract laureates from a prize, print the year and the names of the laureates (use your all_laureates() function) for each prize document.

**Results**

<font color=darkgreen>Excellent! You worked through stages of filtering, projecting, sorting, adding a derived field ("names"), and producing formatted output for each document.</font>

In [12]:
# Definition of all_laureates function
def all_laureates(prize):
  """Return the laureates' surnames, sorted alphabetically and joined by ' and '"""
  sorted_laureates = sorted(prize['laureates'], key=itemgetter('surname'))
  
  # extract surnames
  surnames = [laureate['surname'] for laureate in sorted_laureates]
  
  # concatenate surnames separated with " and " 
  all_names = " and ".join(surnames)
  
  return all_names

In [13]:
# Find one document to recall its structure.
sample_prize = db.prizes.find_one({})
pprint(sample_prize)

# test the function on a sample doc
print(all_laureates(sample_prize))

{'_id': ObjectId('6035cd48354dd8e354623018'),
 'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
                'id': '960',
                'motivation': '"for the optical tweezers and their application '
                              'to biological systems"',
                'share': '2',
                'surname': 'Ashkin'},
               {'firstname': 'Gérard',
                'id': '961',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Mourou'},
               {'firstname': 'Donna',
                'id': '962',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Strickland'}],
 'overallMotivation': '“for groundbreaking inventions in the field of laser '
                   

In [14]:
# find physics prizes, project year and first and last name, and sort by year
docs = db.prizes.find(
           filter= {'category': 'physics'}, 
           projection= ["year", "laureates.firstname", "laureates.surname"], 
           sort= [('year', 1)])
docs = list(docs)
pprint(docs[0])

{'_id': ObjectId('6035cd48354dd8e354623263'),
 'laureates': [{'firstname': 'Wilhelm Conrad', 'surname': 'Röntgen'}],
 'year': '1901'}


In [15]:
# print the year and laureate names (from all_laureates)
for doc in docs:
  print("{year}: {names}".format(year=doc['year'], names=all_laureates(doc)))

1901: Röntgen
1902: Lorentz and Zeeman
1903: Becquerel and Curie and Curie, née Sklodowska
1904: (John William Strutt)
1905: von Lenard
1906: Thomson
1907: Michelson
1908: Lippmann
1909: Braun and Marconi
1910: van der Waals
1911: Wien
1912: Dalén
1913: Kamerlingh Onnes
1914: von Laue
1915: Bragg and Bragg
1917: Barkla
1918: Planck
1919: Stark
1920: Guillaume
1921: Einstein
1922: Bohr
1923: Millikan
1924: Siegbahn
1925: Franck and Hertz
1926: Perrin
1927: Compton and Wilson
1928: Richardson
1929: de Broglie
1930: Raman
1932: Heisenberg
1933: Dirac and Schrödinger
1935: Chadwick
1936: Anderson and Hess
1937: Davisson and Thomson
1938: Fermi
1939: Lawrence
1943: Stern
1944: Rabi
1945: Pauli
1946: Bridgman
1947: Appleton
1948: Blackett
1949: Yukawa
1950: Powell
1951: Cockcroft and Walton
1952: Bloch and Purcell
1953: Zernike
1954: Born and Bothe
1955: Kusch and Lamb
1956: Bardeen and Brattain and Shockley
1957: Lee and Yang
1958: Cherenkov and Frank and Tamm
1959: Chamberlain and Segrè
19

## 03.08 Gap years

The prize in economics was not added until 1969. There have also been many years for which prizes in one or more of the original categories were not awarded.

In this exercise, you will utilize sorting by multiple fields to see which categories are missing in which years.

For now, you will just print the list of all documents, but in the next chapter, you'll learn how to use MongoDB to group and aggregate data to present this information in a more convenient format.

**Instructions**

1. Find the original prize categories established in 1901 by looking at the distinct values of the "category" field for prizes from year 1901.
2. Fetch ONLY the year and category from all the documents (without the "_id" field).
3. Sort by "year" in descending order, then by "category" in ascending order.

**Results**

<font color=darkgreen>Great work! We can see that, for example, 'literature' is missing from the 2018 prizes. Also, few prizes were awarded between 1914 and 1920. Why do you think that is?</font>

In [16]:
# original categories from 1901
original_categories = db.prizes.distinct('category', {'year': '1901'})
print(original_categories)

# project year and category, and sort
docs = db.prizes.find(
        filter={},
        projection = {'year': 1, 'category': 1, '_id': 0},
        sort = [('year', -1), ('category', 1)]
)

#print the documents
for doc in docs:
        print(doc)

['chemistry', 'literature', 'medicine', 'peace', 'physics']
{'year': '2018', 'category': 'chemistry'}
{'year': '2018', 'category': 'economics'}
{'year': '2018', 'category': 'medicine'}
{'year': '2018', 'category': 'peace'}
{'year': '2018', 'category': 'physics'}
{'year': '2017', 'category': 'chemistry'}
{'year': '2017', 'category': 'economics'}
{'year': '2017', 'category': 'literature'}
{'year': '2017', 'category': 'medicine'}
{'year': '2017', 'category': 'peace'}
{'year': '2017', 'category': 'physics'}
{'year': '2016', 'category': 'chemistry'}
{'year': '2016', 'category': 'economics'}
{'year': '2016', 'category': 'literature'}
{'year': '2016', 'category': 'medicine'}
{'year': '2016', 'category': 'peace'}
{'year': '2016', 'category': 'physics'}
{'year': '2015', 'category': 'chemistry'}
{'year': '2015', 'category': 'economics'}
{'year': '2015', 'category': 'literature'}
{'year': '2015', 'category': 'medicine'}
{'year': '2015', 'category': 'peace'}
{'year': '2015', 'category': 'physics'}
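
Scanning that printout by eye gets tedious. Until the aggregation chapter, a few lines of Python can summarize the gaps. This is a sketch under the assumption that the documents arrive sorted by year, as in the query above; the sample list copies the 2018 and 2017 rows from the output, and `missing_by_year` is a name introduced here for illustration.

```python
from itertools import groupby
from operator import itemgetter

original_categories = {"chemistry", "literature", "medicine", "peace", "physics"}

# The 2018 and 2017 rows from the output above (already sorted by year)
docs = [
    {"year": "2018", "category": "chemistry"},
    {"year": "2018", "category": "economics"},
    {"year": "2018", "category": "medicine"},
    {"year": "2018", "category": "peace"},
    {"year": "2018", "category": "physics"},
    {"year": "2017", "category": "chemistry"},
    {"year": "2017", "category": "economics"},
    {"year": "2017", "category": "literature"},
    {"year": "2017", "category": "medicine"},
    {"year": "2017", "category": "peace"},
    {"year": "2017", "category": "physics"},
]

missing_by_year = {}
# groupby only merges consecutive items, which is why sorted input matters
for year, group in groupby(docs, key=itemgetter("year")):
    awarded = {doc["category"] for doc in group}
    missing_by_year[year] = sorted(original_categories - awarded)

print(missing_by_year)
```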

## 03.09 What are indexes?

1. What are indexes?
>It's time to speed up our queries. Enter indexes.

2. What are indexes?
>An index in MongoDB is like a book's index. Let's say I grab a textbook on materials science. I want information on eutectic temperatures. I could flip through the book. I could also try narrowing down to a chapter from skimming the table of contents.

3. What are indexes?
>Instead, I go to the index in back. I see an alphabetical ordering of terms, with page numbers for each. I find "eutectic temperature", which directs me to page 314. I go there, and voilà!

4. What are indexes?
>With MongoDB, imagine each collection as a book, each document as a page, and each field as a type of content. Imagine an ordered index for temperatures. This means you can find all pages that list temperatures in a range of interest. This is hard to do with an actual book. But, MongoDB structures documents with fields. Thus, you can index fields using the values of those fields.

5. When to use indexes?
>When are indexes useful? First, when you expect to get only one or a few documents back. If your typical queries fetch most if not all documents, you might as well scan the whole collection. Making Mongo maintain an index is a waste of time. Second, when you have very large documents or very large collections. Rather than load these into memory from disk, Mongo can use much-smaller indexes.

6. Gauging performance before indexing
>How long does it take to collect prize documents using certain queries? Let's measure. I use here the "timeit" module of Python, via the Jupyter notebook's "timeit" magic. Fetching prizes from 1901 takes half a millisecond on my laptop. Fetching all prizes sorted by year takes over five milliseconds.

7. Adding a single-field index
>Let's now create an ascending index on prize years using the create_index method. Mongo can read a single-field index in reverse. For a multi-field index, though, direction matters. After creating this index, query performance improves. The first query's runtime drops by 30%, the second by 20%. These gains surprise me - not because they are low, but because they are high! Our prizes collection is under a quarter megabyte uncompressed. It has fewer than a thousand documents. Imagine the performance gain on a much larger collection. Especially if the working set doesn't fit in RAM.

8. Adding a compound (multiple-field) index
>Here we create a compound index on ascending category and then ascending year. Thus, Mongo maintains an index by ascending year for each category. Here we list all years of economics prizes. We involve only the category and year fields. Thus, Mongo never has to examine the collection itself to execute the query - the query is "covered" by the index. We see only a minor speedup, but indexes can take up far less space than their collections. Defining indexes that cover common queries can be huge for performance. Here is another query, fetching the first award year for the prize in economics. Once again, our compound index covers the query.

9. Learn more: ask your collection and your queries
>Finally, I want to show you some tools to troubleshoot query performance. We won't cover these tools in the exercises. The first is the "index_information" method of a collection. This helps confirm which indexes exist for a collection. The second tool is the "explain" method of a cursor. MongoDB provides output from its query plan detailing how a given query will execute. Filtering, projecting, sorting, and fetching all happen in stages. Here we see a full collection scan, or COLLSCAN, preceding projection. After creating an appropriate index, we see that an index scan, or IXSCAN, happens instead.

10. Let's practice!
>Okay, let's cover some queries!
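
The book-index analogy can be sketched in a few lines of Python. This is only an illustration of the idea (sorted keys that point back to document positions), not how MongoDB stores its indexes. The collection and the helper `find_by_year_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Stand-in "collection": the list position plays the role of a page number
collection = [
    {"year": "1903", "category": "physics"},
    {"year": "1901", "category": "peace"},
    {"year": "1902", "category": "physics"},
    {"year": "1901", "category": "physics"},
    {"year": "1903", "category": "chemistry"},
]

# The "index": field values in sorted order, each paired with a document position
index = sorted((doc["year"], position) for position, doc in enumerate(collection))
keys = [year for year, _ in index]


def find_by_year_range(low, high):
    """Hypothetical helper: documents with low <= year <= high, via the index."""
    start = bisect_left(keys, low)    # first index entry with year >= low
    stop = bisect_right(keys, high)   # one past the last entry with year <= high
    return [collection[position] for _, position in index[start:stop]]


print(find_by_year_range("1901", "1901"))
print(find_by_year_range("1902", "1903"))
```

Because the keys are sorted, both an exact match and a range query reduce to two binary searches instead of a scan over every document.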

In [17]:
if 'year_1' in db.prizes.index_information():
    db.prizes.drop_index('year_1')
    
if 'category_year_1' in db.prizes.index_information():
    db.prizes.drop_index('category_year_1')

if 'firstname_bornCountry' in db.laureates.index_information():
    db.laureates.drop_index('firstname_bornCountry')

**Before single index**

In [18]:
%%timeit
docs = list(db.prizes.find({"year": "1901"}))

1.63 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [19]:
%%timeit
docs = list(db.prizes.find({}, sort=[("year", 1)]))

14.7 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**After single index**

In [20]:
# Adding a single-field index
_ = db.prizes.create_index([("year", 1)], name='year_1')

In [21]:
%%timeit
# Before the index (In [18]): 1.63 ms ± 129 µs
docs = list(db.prizes.find({"year": "1901"}))

1.06 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [22]:
%%timeit
# Before the index (In [19]): 14.7 ms ± 807 µs
docs = list(db.prizes.find({}, sort=[("year", 1)]))

10.7 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
db.prizes.drop_index('year_1')

**Before compound index**

In [24]:
%%timeit
list(db.prizes.find({"category": "economics"}, {"year": 1, "_id": 0}))

1.44 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [25]:
%%timeit
db.prizes.find_one({"category": "economics"}, {"year": 1, "_id": 0}, sort=[("year", 1)])

1.25 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


**After compound index**

In [26]:
# Adding a compound (multiple-field) index
_ = db.prizes.create_index([("category", 1), ("year", 1)], name='category_year_1')

In [27]:
%%timeit
list(db.prizes.find({"category": "economics"}, {"year": 1, "_id": 0}))

1.5 ms ± 63 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [28]:
%%timeit
db.prizes.find_one({"category": "economics"}, {"year": 1, "_id": 0}, sort=[("year", 1)])

1.09 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [29]:
db.prizes.drop_index('category_year_1')

**Information about index**

In [30]:
print('Existing indexes in the "Laureates" collection:')
pprint(db.laureates.index_information())

Existing indexes in the "Laureates" collection:
{'_id_': {'key': [('_id', 1)], 'v': 2},
 'bornCountry_1': {'key': [('bornCountry', 1)], 'v': 2}}


In [31]:
print('Process used by MongoDB before index:')
pprint(db.laureates.find({"firstname": "Marie"}, {"bornCountry": 1, "_id": 0}).explain())

Process used by MongoDB before index:
{'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 1,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 1,
                                                       'direction': 'forward',
                                                       'docsExamined': 934,
                                                       'executionTimeMillisEstimate': 0,
                                                       'filter': {'firstname': {'$eq': 'Marie'}},
                                                       'isEOF': 1,
                                                       'nReturned': 1,
                                                       'needTime': 934,
                                                       'needYield': 0,
                                                       'restoreState': 0,
                     

In [32]:
print('Process used by MongoDB after creation of index:')
_ = db.laureates.create_index([("firstname", 1), ("bornCountry", 1)], name='firstname_bornCountry')
pprint(db.laureates.find({"firstname": "Marie"}, {"bornCountry": 1, "_id": 0}).explain())
db.laureates.drop_index('firstname_bornCountry')

Process used by MongoDB after creation of index:
{'executionStats': {'allPlansExecution': [],
                    'executionStages': {'advanced': 1,
                                        'executionTimeMillisEstimate': 0,
                                        'inputStage': {'advanced': 1,
                                                       'direction': 'forward',
                                                       'dupsDropped': 0,
                                                       'dupsTested': 0,
                                                       'executionTimeMillisEstimate': 0,
                                                       'indexBounds': {'bornCountry': ['[MinKey, '
                                                                                       'MaxKey]'],
                                                                       'firstname': ['["Marie", '
                                                                                     '"Marie"]']},
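
The printed plans above are long and truncated, so it can help to pull out just the stage names. The sketch below assumes the usual nesting of `stage` and `inputStage` keys found under `explain()["queryPlanner"]["winningPlan"]`; the two sample dictionaries are hand-written, simplified stand-ins (real stage names vary by server version), and `plan_stages` is a hypothetical helper, not part of pymongo.

```python
def plan_stages(plan):
    """Return stage names from the outermost stage down to the innermost.

    Only follows single `inputStage` links; plans that branch are out of
    scope for this sketch.
    """
    stages = []
    while plan is not None:
        stages.append(plan.get("stage"))
        plan = plan.get("inputStage")
    return stages


# Hand-written stand-ins for explain()["queryPlanner"]["winningPlan"].
before_index = {"stage": "PROJECTION", "inputStage": {"stage": "COLLSCAN"}}
after_index = {"stage": "PROJECTION", "inputStage": {"stage": "IXSCAN"}}

print(plan_stages(before_index))  # ['PROJECTION', 'COLLSCAN']
print(plan_stages(after_index))   # ['PROJECTION', 'IXSCAN']
```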

## 03.10 High-share categories

In the year 3030, everybody wants to be a Nobel laureate. Over the last thousand years, many new categories have been added. You serve a MongoDB prizes collection with the same schema as we've seen. Many people theorize that they have a better chance in "high-share" categories. They are hitting your server with similar, long-running queries. It's time to cover those queries with an index.

**Instructions**

Which of the following indexes is best suited to speeding up the operation 
<code>db.prizes.distinct("category", {"laureates.share": {"$gt": "3"}})</code>
?

**Possible Answers**

1. [("category", 1)] <font color=red>Recall that for a distinct query the filter argument is passed as a second argument, whereas the projected field is passed first.</font>
2. [("category", 1), ("laureates.share", 1)] <font color=red>Recall that for a distinct query the filter argument is passed as a second argument, whereas the projected field is passed first.</font>
3. [("laureates.share", 1)] <font color=red>This index does indeed speed up the query, but we can do better by covering the projection of the category field as well.</font>
4. __[("laureates.share", 1), ("category", 1)]__

**Results**

<font color=darkgreen>Excellent! For a distinct query the filter argument is passed as a second argument, whereas the projected field is passed first.</font>

## 03.11 Recently single?

A prize might be awarded to a single laureate or to several. For each prize category, report the most recent year that a single laureate -- rather than several -- received a prize in that category. As part of this task, you will ensure an index that speeds up finding prizes by category and then sorting results by decreasing year.

**Instructions**

1. Specify an index model that indexes first on category (ascending) and second on year (descending).
2. Save a string report for printing the last single-laureate year for each distinct category, one category per line. To do this, for each distinct prize category, find the latest-year prize (requiring a descending sort by year) of that category (so, find matches for that category) with a laureate share of "1".

**Results**

<font color=darkgreen>Simply singular! It seems that physics is the most consistently shared prize category in modern times.</font>

In [33]:
# Specify an index model for compound sorting
index_model = [('category', 1), ('year', -1)]
db.prizes.create_index(index_model)

# Collect the last single-laureate year for each category
report = ""
for category in sorted(db.prizes.distinct("category")):
    doc = db.prizes.find_one(
        {'category': category, "laureates.share": "1"},
        sort=[('year', -1)]
    )
    report += "{category}: {year}\n".format(**doc)

print(report)

chemistry: 2011
economics: 2017
literature: 2017
medicine: 2016
peace: 2017
physics: 1992



## 03.12 Born and affiliated

Some countries are, for one or more laureates, both their country of birth ("bornCountry") and a country of affiliation for one or more of their prizes ("prizes.affiliations.country"). You will find the five countries of birth with the highest counts of such laureates.

**Instructions**

1. Create an index on country of birth ("bornCountry") for db.laureates to ensure efficient gathering of distinct values and counting of documents.
2. Complete the skeleton dictionary comprehension to construct n_born_and_affiliated, the count of laureates as described above for each distinct country of birth. For each call to count_documents, ensure that you use the value of country to filter documents properly.

**Results**

<font color=darkgreen>Good work! As you may guess, simple string matching of country names for this dataset is problematic, but this is a solid first pass.</font>

In [34]:
# Ensure an index on country of birth
db.laureates.create_index([('bornCountry', 1)])

# Collect a count of laureates for each country of birth
n_born_and_affiliated = {
    country: db.laureates.count_documents({
        'bornCountry': country,
        "prizes.affiliations.country": country
    })
    for country in db.laureates.distinct("bornCountry")
}

five_most_common = Counter(n_born_and_affiliated).most_common(5)
pprint(five_most_common)

[('USA', 241),
 ('United Kingdom', 56),
 ('France', 26),
 ('Germany', 19),
 ('Japan', 17)]


## 03.13 Limits

1. Limits and Skips with Sorts, Oh My!
>In this lesson we will learn about the limit and skip parameters of Mongo queries. They can help us inspect a few documents at a time and page through a collection. In concert with sorting, they can help us get documents with extreme values.

2. Limiting our exploration
>Let's say I want to get prize category and year information for a few prizes split three ways. First, I check that for all prizes, either all laureates have a one-third share, or none have a one-third share. I verify my assumption with this for-loop of assertions. Now, I can print information on prizes split three ways. I filter for laureate share equal to three, and I get a long iterator: tens of lines fill my screen. Can I fetch only a few documents to examine before I decide how to proceed next in my analysis? Yes. Mongo provides a convenient limit option as an extra parameter to the find method. There we go.

3. Skips and paging through results
>Besides limiting the number of results, we can also skip results server-side. When you use the "skip" parameter in conjunction with limits, you can get pagination, with the number of results per page set by the limit parameter.

4. Using cursor methods for {sort, skip, limit}
>You can also chain methods to a cursor. This is an alternative to passing extra parameters to the "find" method. Here's what this looks like in the case of setting limits. I don't pass "limit" as a keyword argument to the find method. Rather, I chain the limit method, with an argument of three, to the cursor. And here's how to amend a cursor by chaining both skip and limit methods to it. Finally, I can even alter the sorting on a cursor by chaining a call to the sort method. Here, I sort by ascending year.

5. Simpler sorts of sort
>One last thing. When sorting using the chained method, pymongo allows a couple of shortcuts. Here we specify sorting as before with a list of (field, direction) pairs. There is only one pair because we are sorting by only one field. In this case, we can destructure that single pair. Here I specify the sort with the field as the first argument and the direction as the second argument. Furthermore, pymongo will take the default direction to be ascending. Thus, we can sort by ascending year as a chained call with a single argument, "year". All these cursors yield the same sequence of documents. Finally, note that using the "find_one" method is different. It's like a call to "find" with the limit set to one and with automatic fetching from the cursor. Thus, in this case, you cannot use cursor methods - you need to pass skip and sort requirements as arguments.

6. Limit or Skip Practice? Exactly.
>Before you skip ahead, let's test some limits. Especially after getting some things sorted.
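
The paging arithmetic can be tried without a database. This minimal sketch emulates `skip` and `limit` on an ordinary Python list with `itertools.islice`; the helper `page` and the `rows` list are assumptions made for illustration (the rows mirror the year and category lines printed by the cells in this section), and real paging would happen server-side.

```python
from itertools import islice


def page(results, page_number, page_size):
    """Emulate find(...).skip(...).limit(...) on any iterable."""
    skip = page_size * (page_number - 1)  # documents on earlier pages
    return list(islice(results, skip, skip + page_size))


# Stand-in for documents matching {"laureates.share": "3"}.
rows = ["2016 chemistry", "2015 chemistry", "2014 physics",
        "2013 chemistry", "2013 medicine", "2013 economics"]

print(page(rows, 1, 2))  # ['2016 chemistry', '2015 chemistry']
print(page(rows, 2, 2))  # ['2014 physics', '2013 chemistry']
print(page(rows, 3, 2))  # ['2013 medicine', '2013 economics']
print(page(rows, 4, 2))  # [] once the results run out
```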

In [35]:
# Exploring the prizes collection...
pprint(db.prizes.find_one({}))

{'_id': ObjectId('6035cd48354dd8e354623018'),
 'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
                'id': '960',
                'motivation': '"for the optical tweezers and their application '
                              'to biological systems"',
                'share': '2',
                'surname': 'Ashkin'},
               {'firstname': 'Gérard',
                'id': '961',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Mourou'},
               {'firstname': 'Donna',
                'id': '962',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Strickland'}],
 'overallMotivation': '“for groundbreaking inventions in the field of laser '
                   

In [36]:
# Limiting our exploration
for doc in db.prizes.find({}, ["laureates.share"]):
    share_is_three = [laureate["share"] == "3" for laureate in doc["laureates"]]
    
    assert all(share_is_three) or not any(share_is_three)
    
for doc in db.prizes.find({"laureates.share": "3"}, limit=6):
    print("{year} {category}".format(**doc))

2016 chemistry
2015 chemistry
2014 physics
2013 chemistry
2013 medicine
2013 economics


In [37]:
# Skips and paging through results
for doc in db.prizes.find({"laureates.share": "3"}, skip=2, limit=2):
    print("{year} {category}".format(**doc))
for doc in db.prizes.find({"laureates.share": "3"}, skip=4, limit=2):
    print("{year} {category}".format(**doc))

2014 physics
2013 chemistry
2013 medicine
2013 economics


In [38]:
# Using cursor methods for {sort, skip, limit}
for doc in db.prizes.find({"laureates.share": "3"}).limit(3):
    print("{year} {category}".format(**doc))
    
for doc in (db.prizes.find({"laureates.share": "3"}).skip(3).limit(3)):
    print("{year} {category}".format(**doc))

for doc in (db.prizes.find({"laureates.share": "3"}).sort([("year", 1)]).skip(3).limit(3)):
    print("{year} {category}".format(**doc))

2016 chemistry
2015 chemistry
2014 physics
2013 chemistry
2013 medicine
2013 economics
1954 medicine
1956 medicine
1956 physics


In [39]:
# Simpler sorts of sort
cursor1 = (db.prizes.find({"laureates.share": "3"}).skip(3).limit(3).sort([("year", 1)]))
cursor2 = (db.prizes.find({"laureates.share": "3"}).skip(3).limit(3).sort("year", 1))
cursor3 = (db.prizes.find({"laureates.share": "3"}).skip(3).limit(3).sort("year"))

docs = list(cursor1)
assert docs == list(cursor2) == list(cursor3)

for doc in docs:
    print("{year} {category}".format(**doc))

doc = db.prizes.find_one({"laureates.share": "3"}, skip=3, sort=[("year", 1)])
print("{year} {category}".format(**doc))

1954 medicine
1956 medicine
1956 physics
1954 medicine
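
One way to see why the three spellings of `sort` are equivalent is to reduce each to the same list of `(field, direction)` pairs. The function `normalize_sort` below is a hypothetical helper written for this note; it only imitates the idea and is not pymongo's implementation.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def normalize_sort(key_or_list, direction=None):
    """Turn any of the three spellings into a list of (field, direction) pairs."""
    if isinstance(key_or_list, str):
        # A bare field name defaults to ascending order.
        return [(key_or_list, ASCENDING if direction is None else direction)]
    return [tuple(pair) for pair in key_or_list]


specs = [
    normalize_sort([("year", 1)]),  # full list of pairs
    normalize_sort("year", 1),      # field and direction as two arguments
    normalize_sort("year"),         # field only, ascending by default
]
print(specs[0])
assert specs[0] == specs[1] == specs[2]
```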


## 03.14 Setting a new limit?

**Instructions**

How many documents does the following expression return?

<code>
list(db.prizes.find({"category": "economics"},
                    {"year": 1, "_id": 0})
     .sort("year")
     .limit(3)
     .limit(5))
</code>

**Possible Answers**
1. 3: the first call to limit takes precedence
2. 5: the second call to limit overrides the first
3. none: instead, an error is raised

**Results**

<font color=darkgreen>Correct! You can think of the query parameters as being updated like a dictionary in Python: d = {'limit': 3}; d.update({'limit': 5}); print(d) will print "{'limit': 5}"</font>

In [40]:
list(db.prizes.find({"category": "economics"},
                    {"year": 1, "_id": 0})
     .sort("year")
     .limit(3)
     .limit(5))

[{'year': '1969'},
 {'year': '1970'},
 {'year': '1971'},
 {'year': '1972'},
 {'year': '1973'}]

## 03.15 The first five prizes with quarter shares

Find the first five prizes with one or more laureates sharing 1/4 of the prize. Project our prize category, year, and laureates' motivations.

**Instructions**

1. Save to filter_ the filter document to fetch only prizes with one or more quarter-share laureates, i.e. with a "laureates.share" of "4".
2. Save to projection the list of field names so that prize category, year and laureates' motivations ("laureates.motivation") may be fetched for inspection.
3. Save to cursor a cursor that will yield prizes, sorted by ascending year. Limit this to five prizes, and sort using the most concise specification.

**Results**

<font color=darkgreen>Great work! For all of these prizes, there were two laureates with quarter shares for their work together, and there was a third laureate with a half share for separate work (as evidenced by the motivation fields).</font>

In [41]:
# Fetch prizes with quarter-share laureate(s)
filter_ = {'laureates.share': '4'}

# Save the list of field names
projection = ['category', 'year', 'laureates.motivation']

# Save a cursor to yield the first five prizes
cursor = db.prizes.find(filter_, projection).sort('year').limit(5)
pprint(list(cursor))

[{'_id': ObjectId('6035cd48354dd8e35462320d'),
  'category': 'physics',
  'laureates': [{'motivation': '"in recognition of the extraordinary services '
                               'he has rendered by his discovery of '
                               'spontaneous radioactivity"'},
                {'motivation': '"in recognition of the extraordinary services '
                               'they have rendered by their joint researches '
                               'on the radiation phenomena discovered by '
                               'Professor Henri Becquerel"'},
                {'motivation': '"in recognition of the extraordinary services '
                               'they have rendered by their joint researches '
                               'on the radiation phenomena discovered by '
                               'Professor Henri Becquerel"'}],
  'year': '1903'},
 {'_id': ObjectId('6035cd48354dd8e3546231b4'),
  'category': 'chemistry',
  'laureates': [{'motivation':

## 03.16 Pages of particle-prized people

You and a friend want to set up a website that gives information on Nobel laureates with awards relating to particle phenomena. You want to present these laureates one page at a time, with three laureates per page. You decide to order the laureates chronologically by award year. When there is a "tie" in ordering (i.e. two laureates were awarded prizes in the same year), you want to order them alphabetically by surname.

**Instructions**

1. Complete the function get_particle_laureates that, given page_number and page_size, retrieves a given page of prize data on laureates who have the word "particle" (use \$regex) in their prize motivations ("prizes.motivation"). Sort laureates first by ascending "prizes.year" and next by ascending "surname".
2. Collect and save the first nine pages of laureate data to pages.

**Results**

<font color=darkgreen>Great! Particles may be small, but discoveries related to them have made quite an impact!</font>
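
The tie-breaking rule (award year first, then surname) can be checked in plain Python before involving the database. This is a sketch under simplifying assumptions: the `laureates` rows are a small hand-picked sample, and each row carries a flat `year` field instead of the nested `prizes.year` used in the real collection.

```python
# Sample rows, flattened for illustration (not fetched from the collection).
laureates = [
    {"surname": "Wilson", "year": "1927"},
    {"surname": "Compton", "year": "1927"},
    {"surname": "Thomson", "year": "1906"},
    {"surname": "Cockcroft", "year": "1951"},
]

# Sort by year, then by surname to break ties within a year.
ordered = sorted(laureates, key=lambda doc: (doc["year"], doc["surname"]))

page_size = 3
first_page = ordered[:page_size]
second_page = ordered[page_size:2 * page_size]

print([doc["surname"] for doc in first_page])   # ['Thomson', 'Compton', 'Wilson']
print([doc["surname"] for doc in second_page])  # ['Cockcroft']
```

Four-digit years stored as strings happen to sort correctly as text, which is why the string comparison here gives chronological order.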

In [42]:
# Explore the structure
pprint(db.laureates.find_one({}))

{'_id': ObjectId('6035cd48354dd8e354623266'),
 'born': '1853-07-18',
 'bornCity': 'Arnhem',
 'bornCountry': 'the Netherlands',
 'bornCountryCode': 'NL',
 'died': '1928-02-04',
 'diedCountry': 'the Netherlands',
 'diedCountryCode': 'NL',
 'firstname': 'Hendrik Antoon',
 'gender': 'male',
 'id': '2',
 'prizes': [{'affiliations': [{'city': 'Leiden',
                               'country': 'the Netherlands',
                               'name': 'Leiden University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary service they '
                           'rendered by their researches into the influence of '
                           'magnetism upon radiation phenomena"',
             'share': '2',
             'year': '1902'}],
 'surname': 'Lorentz'}


In [43]:
# Write a function to retrieve a page of data
def get_particle_laureates(page_number=1, page_size=3):
    # Check the type first so a non-integer never reaches the comparison
    if not isinstance(page_number, int) or page_number < 1:
        raise ValueError("Pages are natural numbers (starting from 1).")
    particle_laureates = list(
        db.laureates.find(
            {'prizes.motivation': {'$regex': "particle"}},
            ["firstname", "surname", "prizes"])
        .sort([('prizes.year', 1), ('surname', 1)])
        .skip(page_size * (page_number - 1))
        .limit(page_size))
    return particle_laureates

# Collect and save the first nine pages
pages = [get_particle_laureates(page_number=page) for page in range(1, 10)]
pprint(pages[0])

[{'_id': ObjectId('6035cd48354dd8e3546232c5'),
  'firstname': 'Charles Thomson Rees',
  'prizes': [{'affiliations': [{'city': 'Cambridge',
                                'country': 'United Kingdom',
                                'name': 'University of Cambridge'}],
              'category': 'physics',
              'motivation': '"for his method of making the paths of '
                            'electrically charged particles visible by '
                            'condensation of vapour"',
              'share': '2',
              'year': '1927'}],
  'surname': 'Wilson'},
 {'_id': ObjectId('6035cd48354dd8e3546232db'),
  'firstname': 'Sir John Douglas',
  'prizes': [{'affiliations': [{'city': 'Harwell, Berkshire',
                                'country': 'United Kingdom',
                                'name': 'Atomic Energy Research '
                                        'Establishment'}],
              'category': 'physics',
              'motivation': '"for their pione

# Additional material
- DataCamp course: https://learn.datacamp.com/courses/introduction-to-using-mongodb-for-data-science-with-python