# Projection and Sorting: Exercises

In [None]:
from pymongo import MongoClient

client = MongoClient()
db = client.nobel

## Rounding up the G.S. crew

We can use the regular expression operator `$regex` to find laureates whose initials are G.S. Let's use projection and a list comprehension to collect the full names of these laureates by concatenating their first ("firstname") and last ("surname") names.

- Fill in the blanks to save a list `names` of full names ("firstname" plus "surname") of laureates with initials G.S. (ignoring middle names/initials). You'll need to both filter on names and project out the fields required to collect the full names.


In [None]:
# Collect a list of full names
names = [" ".join([doc["firstname"], doc["surname"]])
         for doc in db.laureates.find(
             {"firstname": {"$regex": "^G"},
              "surname": {"$regex": "^S"}},
             {"firstname": 1, "surname": 1})]
print(names)
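
To see what the two anchored patterns match without a running MongoDB instance, the same filter can be approximated locally with Python's `re` module. This is only a sketch under stated assumptions: `sample_docs`, `matches_initials`, and `local_names` are hypothetical names, the documents are hand-written stand-ins rather than database records, and Python's `re` is not the same engine MongoDB uses, although simple `^`-anchored prefixes like these should behave the same way in both.

```python
import re

# Hypothetical stand-ins for laureate documents (not real database records).
sample_docs = [
    {"firstname": "Grace Ann", "surname": "Stone"},  # G... S... -> kept
    {"firstname": "Gil", "surname": "Marsh"},        # surname does not start with S
    {"firstname": "Olga", "surname": "Sand"},        # "g" is not at the start
]


def matches_initials(doc, first="^G", last="^S"):
    """Return True if both name fields match their anchored patterns."""
    return bool(re.search(first, doc.get("firstname", ""))
                and re.search(last, doc.get("surname", "")))


# Same shape as the exercise: filter, then join the two projected fields.
local_names = [" ".join([doc["firstname"], doc["surname"]])
               for doc in sample_docs if matches_initials(doc)]
print(local_names)
```

Because the patterns only anchor the first character, a middle name inside "firstname" does not affect the match, which is why the exercise can ignore middle names and initials.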

## Sorting together: MongoDB + Python

You will print out the names of all physics laureates, with one line printed for each award year, in chronological order. Each line will list laureates for that year in alphabetical order by surname ("last" name).

I encourage you to print intermediate results to understand the nested structure of prize documents.

- Construct a sort specification `sort_spec` to fetch physics prizes by ascending year.

In [None]:
from operator import itemgetter

# Sort by ascending year
sort_spec = [("year", 1)]

- Use `<collection>.find` to construct a `cursor` that fetches prizes with a "category" of "physics", sorts by ascending year, and projects the year and the laureate names (`laureates.firstname` and `laureates.surname`). The loop then prints the first laureate's full name for each year. *You should encounter an error at year 1916*.

In [None]:
# Construct a cursor over physics prizes
cursor = db.prizes.find({"category": "physics"}, 
                        {"year": 1, "laureates.firstname": 1, "laureates.surname": 1},
                        sort=sort_spec)

for doc in cursor:
    print("{year}: {first_laureate_firstname} {first_laureate_surname}".format(
        year=doc["year"],
        first_laureate_firstname=doc["laureates"][0]["firstname"],
        first_laureate_surname=doc["laureates"][0]["surname"]))
cursor.rewind() # Rewind cursor to reuse in the next step

- The error occurs because the Nobel Prize in Physics was not awarded in 1916 due to World War I, so that year's prize document has no "laureates" field to index into. Supplement the cursor's filter to avoid the error:

In [None]:
# Construct a fixed cursor over physics prizes
cursor = db.prizes.find({"category": "physics", "laureates": {"$exists": True}},
                        {"year": 1, "laureates.firstname": 1, "laureates.surname": 1},
                        sort=sort_spec)

for doc in cursor:
    print("{year}: {first_laureate_firstname} {first_laureate_surname}".format(
        year=doc["year"],
        first_laureate_firstname=doc["laureates"][0]["firstname"],
        first_laureate_surname=doc["laureates"][0]["surname"]))
cursor.rewind() # Rewind cursor to reuse in the next step
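
The failure and the fix can be reproduced on plain dictionaries. The sketch below assumes the unawarded prize document simply lacks a "laureates" key, which is what the `$exists` filter above implies; `sample_prizes`, `first_laureate`, `failed`, and `awarded` are hypothetical names and the laureate is a placeholder, not a real record. The membership test `"laureates" in doc` is used as a local analogue of `{"$exists": True}`.

```python
# Hypothetical prize documents: the second one has no "laureates" key.
sample_prizes = [
    {"year": "1915", "category": "physics",
     "laureates": [{"firstname": "Ada", "surname": "Example"}]},
    {"year": "1916", "category": "physics"},
]


def first_laureate(doc):
    """Format the first laureate's name; raises KeyError if none are listed."""
    laureate = doc["laureates"][0]
    return "{firstname} {surname}".format(**laureate)


# Indexing a missing key is what triggers the error in the unfiltered loop.
try:
    first_laureate(sample_prizes[1])
    failed = False
except KeyError as err:
    failed = True
    print("KeyError:", err)

# Local analogue of adding {"laureates": {"$exists": True}} to the filter.
awarded = [doc for doc in sample_prizes if "laureates" in doc]
for doc in awarded:
    print("{}: {}".format(doc["year"], first_laureate(doc)))
```

Filtering in the query rather than in Python keeps unawarded prizes from being fetched at all, so the printing loop needs no special case.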

- Complete the definition of the function `names` so that, given a prize document, it returns a list of formatted names, sorted by ascending "surname", for each of the "laureates" in that prize document.



In [None]:
# Define a function names() to return a list of formatted names
def names(doc):
    formatted_names = ["{firstname} {surname}".format(**laureate)
          for laureate in sorted(doc["laureates"], key=itemgetter("surname"))]
    return formatted_names

lines = ["{year}: {names}".format(year=doc["year"], names=" and ".join(names(doc)))
         for doc in cursor]
for line in lines:
    print(line)
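
As a quick check of the surname ordering, the same formatting logic can be run on a single hand-written document. This is a minimal sketch: `sample_prize` and its laureates are made-up placeholders, and the helper is called `format_names` here (a hypothetical name) so that it does not shadow the `names` list from the first exercise.

```python
from operator import itemgetter

# Hypothetical prize document with laureates deliberately out of order.
sample_prize = {
    "year": "2000",
    "laureates": [
        {"firstname": "Zed", "surname": "Young"},
        {"firstname": "Amy", "surname": "Baker"},
        {"firstname": "Max", "surname": "Noor"},
    ],
}


def format_names(doc):
    """Return "firstname surname" strings sorted by ascending surname."""
    return ["{firstname} {surname}".format(**laureate)
            for laureate in sorted(doc["laureates"], key=itemgetter("surname"))]


line = "{year}: {names}".format(year=sample_prize["year"],
                                names=" and ".join(format_names(sample_prize)))
print(line)
```

Here MongoDB orders the prize documents by year while Python orders the laureates inside each document, which is the division of labour this exercise is built around.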

## Gap years

As we saw above, there have been years for which prizes in one or more of the original categories were not awarded.

Sorting first by reverse chronological order and second by alphabetical order of category, collect and format prize documents to produce one formatted entry per year listing categories missing for that year.

- Construct a set `original_categories` of prize categories awarded in 1901.

In [None]:
import itertools
from operator import itemgetter

# Save the set of prize categories awarded in 1901
original_categories = set(db.prizes.distinct("category", {"year": "1901"}))
print(original_categories)

- Use `<collection>.find` to construct a cursor that yields prize documents only for the original categories, and only those that contain the `laureates` key (and thus were awarded), sorted first by decreasing year and second by increasing category.



In [None]:
# Construct a cursor over original-category prizes
cursor = db.prizes.find({"category": {"$in": list(original_categories)},
                         "laureates": {"$exists": True}},
                        {"category": 1, "year": 1},
                        sort=[("year", -1), ("category", 1)])
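
The ordering requested by a two-key sort specification can be emulated in plain Python, which may help when checking results by hand. The sketch below relies on `sorted` being stable: sorting by the secondary key first and the primary key second gives the same order as `[("year", -1), ("category", 1)]`. The documents in `sample` are hypothetical, and the years are four-digit strings, as in the query above, so string comparison orders them chronologically.

```python
from operator import itemgetter

# Hypothetical (year, category) documents in arbitrary order.
sample = [
    {"year": "1917", "category": "physics"},
    {"year": "1918", "category": "physics"},
    {"year": "1917", "category": "literature"},
    {"year": "1918", "category": "chemistry"},
]

# Secondary key first (ascending category) ...
by_category = sorted(sample, key=itemgetter("category"))
# ... then primary key (descending year); ties keep their category order.
ordered = sorted(by_category, key=itemgetter("year"), reverse=True)

pairs = [(doc["year"], doc["category"]) for doc in ordered]
print(pairs)
```

Letting the database sort is usually preferable for large collections; the local version is only meant to make the meaning of the sort specification concrete.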

- Collect a list `not_awarded` of entries to be printed, one per line, that displays a year and the categories missing for that year. You will collect "category" values for each year and set-subtract them from the original categories.



In [None]:
# Collect entries for missing prize categories
not_awarded = []
for key, group in itertools.groupby(cursor, key=itemgetter("year")):
    year_categories = set(prize["category"] for prize in group)
    missing = ", ".join(sorted(original_categories - year_categories))
    if missing:
        not_awarded.append("{}: {}".format(key, missing))

for line in not_awarded: print(line)
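
The grouping step depends on the cursor being sorted by year, because `itertools.groupby` only merges consecutive items with the same key. The sketch below illustrates both the set subtraction and that caveat on hand-written data; `original`, `rows`, `entries`, and `interleaved` are hypothetical names, and the three-category set is a shortened stand-in for the real set of original categories.

```python
import itertools
from operator import itemgetter

# Hypothetical, shortened set of original categories.
original = {"chemistry", "peace", "physics"}

# Hypothetical awarded prizes, already sorted by descending year.
rows = [
    {"year": "1918", "category": "chemistry"},
    {"year": "1918", "category": "physics"},
    {"year": "1917", "category": "chemistry"},
    {"year": "1917", "category": "peace"},
    {"year": "1917", "category": "physics"},
]

entries = []
for year, group in itertools.groupby(rows, key=itemgetter("year")):
    awarded = {prize["category"] for prize in group}
    missing = ", ".join(sorted(original - awarded))
    if missing:  # skip years in which every category was awarded
        entries.append("{}: {}".format(year, missing))
print(entries)

# Without sorting, the same year shows up as several separate groups.
interleaved = [rows[0], rows[2], rows[1]]
split_keys = [year for year, _ in
              itertools.groupby(interleaved, key=itemgetter("year"))]
print(split_keys)
```

One consequence of this approach is that a year in which none of the original categories was awarded produces no documents at all, so it would not appear in the output rather than being listed with every category missing.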