In [1]:
# Import libraries
from pymongo import MongoClient
from pprint import pprint
from bson.regex import Regex

import re

In [2]:
# Client connects to "localhost" by default
client = MongoClient()

# Connect to "nobel" database on the fly
db = client["nobel"]

# 02. Working with Distinct Values and Sets

Now you have a sense of the data's structure. This chapter is about dipping your toes into the pools of values for various fields. You'll collect distinct values, test for membership in sets, and match values to patterns.

## 02.01 Survey Distinct Values

1. The distinct() method
>In this lesson, we'll learn to use the "distinct" method on Mongo collections. Using this method, we can collect the set of values assigned to a field across all documents.

2. An exceptional laureate
>We found an exceptional laureate in the last chapter. This laureate has received three Nobel prizes, more than any other laureate. Here we can see that this is the International Committee of the Red Cross. You may not have known that organizations can win Nobel prizes, but they have. For example, 23 employees of Lawrence Berkeley Lab shared the Nobel Peace Prize in 2007. They were part of the Intergovernmental Panel on Climate Change. A future exercise will be about the proportion of Nobel prizes awarded to immigrants. Keep in mind that the idea of immigration doesn't apply to some laureates. In this document, I see that the "gender" field has a value of "org", presumably short for "organization". What are the values that this field stores across documents in this collection? MongoDB provides a built-in collection method for such aggregation.

3. Using .distinct()
>Here, we call the "distinct" method on the laureates collection. We pass a single argument, "gender". MongoDB collects the distinct values that this field takes across the collection. We see that there are three and only three values for the "gender" field across the collection. You may be wondering where this method comes from, or how you can define a similar operation yourself. The "distinct" method is a convenience for a common aggregation. The "count_documents" method we have been using is a similar convenience. An aggregation processes data across a collection and produces a computed result. In the last chapter of this course, we'll learn how to create custom aggregations. You may be wondering about the efficiency of aggregations in MongoDB. You can register so-called "indexes" on fields for MongoDB to maintain. These indexes can ensure efficient queries and aggregations. In some cases, a query might not even need to run on a collection. We will learn how to create indexes in the next chapter. But, if we're not working with a lot of data, indexes are generally not needed. The laureates collection we're using in this course fits in memory. It weighs in at under a megabyte and has on the order of a thousand documents or fewer. It doesn't matter much if you use an inefficient algorithm to sort a list of a few hundred items. Likewise, a full collection scan isn't a big deal for this aggregation.

4. .distinct() with dot notation
>You can use dot notation to specify fields embedded deeper than the root level of a document. This applies in query methods like "find" and "find_one", and it applies for aggregations as well. I notice here that each subdocument in the "prizes" array field has a "category" field. The dot-two in the filter denotes index two of an array field. Thus, this is a laureate where a third element exists in the prizes array. Let's fetch the distinct values of this field. We see, as expected, that there is a value for each category of Nobel prize.

5. Let's practice!
>Let's use the distinct method to answer some questions about our Nobel Prize data.

In [3]:
# An exceptional laureate
criteria = {"prizes.2": {"$exists": True}}
print(criteria, db.laureates.count_documents(criteria))
pprint(db.laureates.find_one(criteria))

# Using .distinct()
criteria = "gender"
print(f'\n{criteria}:', db.laureates.distinct(criteria))

# .distinct() with dot notation
criteria = "prizes.category"
print(f'\n{criteria}:', db.laureates.distinct(criteria))

{'prizes.2': {'$exists': True}} 1
{'_id': ObjectId('6035898c6109195e81d36b1b'),
 'born': '0000-00-00',
 'died': '0000-00-00',
 'firstname': 'Comité international de la Croix Rouge (International Committee '
              'of the Red Cross)',
 'gender': 'org',
 'id': '482',
 'prizes': [{'affiliations': [[]],
             'category': 'peace',
             'share': '1',
             'year': '1917'},
            {'affiliations': [[]],
             'category': 'peace',
             'share': '1',
             'year': '1944'},
            {'affiliations': [[]],
             'category': 'peace',
             'share': '2',
             'year': '1963'}]}

gender: ['female', 'male', 'org']

prizes.category: ['chemistry', 'economics', 'literature', 'medicine', 'peace', 'physics']
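
To make the dot-notation behaviour concrete, here is a minimal pure-Python sketch of what a distinct-style lookup computes. It is not how MongoDB implements `distinct`; the helper `distinct_values` and the three sample documents are made up for illustration and only loosely modelled on the laureates collection.

```python
def distinct_values(docs, path):
    """Collect the sorted distinct values found at a dotted path.

    Hypothetical helper for illustration only: arrays met along the
    path are flattened, roughly as dot notation reaches into them.
    """
    def walk(value, keys):
        if isinstance(value, list):
            # Descend into every element of an array.
            for item in value:
                yield from walk(item, keys)
        elif not keys:
            # End of the path: this is a value to collect.
            yield value
        elif isinstance(value, dict) and keys[0] in value:
            yield from walk(value[keys[0]], keys[1:])

    found = set()
    for doc in docs:
        found.update(walk(doc, path.split(".")))
    return sorted(found)


# Made-up sample documents (not the real data).
sample_laureates = [
    {"gender": "female", "prizes": [{"category": "physics"}, {"category": "chemistry"}]},
    {"gender": "male", "prizes": [{"category": "physics"}]},
    {"gender": "org", "prizes": [{"category": "peace"}, {"category": "peace"}]},
]

print(distinct_values(sample_laureates, "gender"))
print(distinct_values(sample_laureates, "prizes.category"))
```

The second call shows why a dotted path such as "prizes.category" yields one flat list of values even though each document stores its categories inside an array of subdocuments.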


## 02.02 Categorical data validation

Remember to explore example documents in the console via e.g. __db.prizes.find_one()__ and __db.laureates.find_one()__.

**Instructions**

What expression asserts that the distinct Nobel Prize categories catalogued by the "prizes" collection are the same as those catalogued by the "laureates"? 

**Possible Answers**

1. assert db.prizes.distinct("category") == db.laureates.distinct("prizes.category")
<font color=darkred>==>Although <collection>.distinct returns unique values, they are returned as a list and not guaranteed to be sorted in any way.</font>
2. assert db.prizes.distinct("laureates.category") == db.laureates.distinct("prizes.category")
3. __assert set(db.prizes.distinct("category")) == set(db.laureates.distinct("prizes.category"))__

**Results**

<font color=darkgreen>Correct! Converting the lists returned by <collection>.distinct to sets ensures that a check for equality is reliable.</font>

In [4]:
pprint(db.prizes.find_one())

{'_id': ObjectId('6035898c6109195e81d36797'),
 'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
                'id': '960',
                'motivation': '"for the optical tweezers and their application '
                              'to biological systems"',
                'share': '2',
                'surname': 'Ashkin'},
               {'firstname': 'Gérard',
                'id': '961',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Mourou'},
               {'firstname': 'Donna',
                'id': '962',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Strickland'}],
 'overallMotivation': '“for groundbreaking inventions in the field of laser '
                   

In [5]:
pprint(db.laureates.find_one())

{'_id': ObjectId('6035898c6109195e81d369e5'),
 'born': '1853-07-18',
 'bornCity': 'Arnhem',
 'bornCountry': 'the Netherlands',
 'bornCountryCode': 'NL',
 'died': '1928-02-04',
 'diedCountry': 'the Netherlands',
 'diedCountryCode': 'NL',
 'firstname': 'Hendrik Antoon',
 'gender': 'male',
 'id': '2',
 'prizes': [{'affiliations': [{'city': 'Leiden',
                               'country': 'the Netherlands',
                               'name': 'Leiden University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary service they '
                           'rendered by their researches into the influence of '
                           'magnetism upon radiation phenomena"',
             'share': '2',
             'year': '1902'}],
 'surname': 'Lorentz'}


In [6]:
print(db.prizes.distinct("category"))
print(db.laureates.distinct("prizes.category"))

['chemistry', 'economics', 'literature', 'medicine', 'peace', 'physics']
['chemistry', 'economics', 'literature', 'medicine', 'peace', 'physics']
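
The two lists printed above happen to come back in the same order, but the comparison should not rely on that. A small sketch with hand-written stand-in lists (not real query results) shows why the set conversion matters:

```python
# Stand-ins for two distinct() results holding the same values
# in a different order (made-up data, for illustration only).
from_prizes = ["physics", "chemistry", "peace"]
from_laureates = ["chemistry", "peace", "physics"]

# List equality is order-sensitive; set equality is not.
lists_equal = from_prizes == from_laureates
sets_equal = set(from_prizes) == set(from_laureates)

print(lists_equal, sets_equal)
```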


## 02.03 Never from there, but sometimes there at last

There are some recorded countries of death (__"diedCountry"__) that do not appear as a country of birth (__"bornCountry"__) for laureates. One such country is "East Germany".

**Instructions**

1. Return a set of all such countries as countries.

**Results**

<font color=darkgreen>Well done! Some of these countries are likely to remain in this set, as they no longer exist!</font>

In [7]:
# Countries recorded as countries of death but not as countries of birth
countries = set(db.laureates.distinct('diedCountry')) - set(db.laureates.distinct('bornCountry'))
print(countries)

{'Puerto Rico', 'Greece', 'Tunisia', 'Israel', 'Philippines', 'Gabon', 'Yugoslavia (now Serbia)', 'East Germany', 'USSR', 'Barbados', 'Northern Rhodesia (now Zambia)', 'Czechoslovakia', 'Jamaica'}
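
The `-` operator used above is Python's set difference, and it is not symmetric. A tiny sketch with made-up country sets (not the real query results) illustrates the direction of the subtraction:

```python
# Made-up stand-ins for the two distinct() results.
died_countries = {"France", "USA", "East Germany"}
born_countries = {"France", "USA", "Poland"}

# Countries recorded for deaths but never for births.
only_died = died_countries - born_countries

# Swapping the operands answers the opposite question.
only_born = born_countries - died_countries

print(only_died, only_born)
```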


## 02.04 Countries of affiliation

We saw in the last exercise that countries can be associated with a laureate as their country of birth and as their country of death. For each prize a laureate received, they may also have been affiliated with an institution at the time, located in a country.

**Instructions**

1. Determine the number of distinct countries recorded as part of an affiliation for laureates' prizes. Save this as count.

**Results**

<font color=darkgreen>Bravo! This number is less than the number of distinct countries of death, and far less than the number of distinct countries of birth.</font>

In [8]:
# The number of distinct countries of laureate affiliation for prizes
bornCountry = len(db.laureates.distinct('bornCountry'))
diedCountry = len(db.laureates.distinct('diedCountry'))
affiliations = len(db.laureates.distinct('prizes.affiliations.country'))
print('bornCountry:', bornCountry)
print('diedCountry:', diedCountry)
print('prizes.affiliations.country:', affiliations)

bornCountry: 120
diedCountry: 52
prizes.affiliations.country: 29


## 02.05 Distinct Values Given Filters

1. Pre-filtering distinct values
>In this lesson, we're going to dip our toes into the world of aggregation pipelines. We'll use the "filter" parameter of the "distinct" method to match certain documents. The method will fetch field values only from these matching documents.

2. Awards into prize shares
>Here I've found a laureate document with a value of "4" for the "share" field in one of its "prizes" subdocuments. Pierre Curie shared the 1903 Nobel Prize in physics with his wife Marie. They also shared it with physicist Henri Becquerel. Marie and Pierre each received a quarter share of the prize. Henri received the remaining half share. The Nobel Prize API encodes prize share - in particular, the denominator of the fractional share - as a string. Thus, the document records Pierre's quarter share as the string "4".

3. High-share prize categories
>Which Nobel prize categories other than physics have laureates with quarter shares? We know how to get the distinct values of prize categories from the laureates collection. We pass the dotted path "prizes.category" to the "distinct" method. We also know how to find - and list - all laureate documents satisfying some criteria using a filter document. Can we compose these two ideas? Sure! The "distinct" method takes an optional filter argument. You can think of this as a two-stage pipeline. First, filter the collection for documents that match a filter. Then, collect and return distinct values of a field for these documents. In the last chapter of this course, we'll learn how to custom-build such pipelines in an explicit way. The Nobel Prize API serves its data in a denormalized way. Thus, we can answer our question from a different perspective. Here, we ask the "prizes" collection to return distinct prize categories. Given a filter on laureate shares, we return distinct values of the "category" field. The result is the same as the call above using the laureates collection.

4. Prize categories with multi-winners
>Let's look at one more example of pre-filtering distinct values. Which prize categories have laureates who have won more than one prize? We'll start by counting laureates who won at least two prizes. In other words, those for which a second element exists in their document's "prizes" array. Next, we pass this filter document as a second argument to the "distinct" method. Our first argument is the dotted path to the prize category field. This returns all prize category values with laureates who have won more than one prize. Here are the corresponding prize categories for these six laureates. Notice that not all won prizes in the same category. Marie Curie, for instance, won prizes in both physics and chemistry. Linus Pauling won prizes in both chemistry and peace. We'll learn in the next chapter how to fetch only the document substructure we need. In this case, the MongoDB server returned only enough to communicate prize categories.

5. Practice time!
>Now it's your turn. Let's practice enumerating distinct field values of collections given filters.

In [9]:
# Awards into prize shares
# Found a laureate document with a value of "4" for the "share" field in one of its "prizes" subdocuments.
pprint(db.laureates.find_one({"prizes.share": "4"}))
pprint(db.prizes.find_one({"laureates.share": "4"}))

# High-share prize categories
print(db.laureates.distinct("prizes.category", {"prizes.share": '4'}))
print(db.prizes.distinct("category", {"laureates.share": "4"}))

{'_id': ObjectId('6035898c6109195e81d369f3'),
 'born': '1936-01-10',
 'bornCity': 'Houston, TX',
 'bornCountry': 'USA',
 'bornCountryCode': 'US',
 'died': '0000-00-00',
 'firstname': 'Robert Woodrow',
 'gender': 'male',
 'id': '112',
 'prizes': [{'affiliations': [{'city': 'Holmdel, NJ',
                               'country': 'USA',
                               'name': 'Bell Laboratories'}],
             'category': 'physics',
             'motivation': '"for their discovery of cosmic microwave '
                           'background radiation"',
             'share': '4',
             'year': '1978'}],
 'surname': 'Wilson'}
{'_id': ObjectId('6035898c6109195e81d36797'),
 'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
                'id': '960',
                'motivation': '"for the optical tweezers and their application '
                              'to biological systems"',
                'share': '2',
                'surname': 'Ashkin'},
               {'fi

In [10]:
# Prize categories with multi-winners
criteria = {"prizes.1": {"$exists": True}}
print(db.laureates.distinct("prizes.category", criteria))
for doc in db.laureates.find(criteria):
    for prize in doc['prizes']:
        print(prize['category'])

['chemistry', 'peace', 'physics']
chemistry
peace
physics
chemistry
physics
physics
chemistry
chemistry
peace
peace
peace
peace
peace
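
The "filter first, then collect distinct values" idea can be sketched in plain Python. This is only an analogy under stated assumptions, not MongoDB's implementation: the sample documents are simplified and partly invented (the "Hypothetical" laureate does not exist), and `has_quarter_share` is a hypothetical stand-in for the filter `{"prizes.share": "4"}`.

```python
# Simplified, partly invented sample documents.
sample_laureates = [
    {"surname": "Curie", "prizes": [{"category": "physics", "share": "4"},
                                    {"category": "chemistry", "share": "1"}]},
    {"surname": "Lorentz", "prizes": [{"category": "physics", "share": "2"}]},
    {"surname": "Hypothetical", "prizes": [{"category": "medicine", "share": "4"}]},
]


def has_quarter_share(doc):
    """Stage 1: keep a document if any of its prizes has share "4"."""
    return any(prize["share"] == "4" for prize in doc["prizes"])


matching = [doc for doc in sample_laureates if has_quarter_share(doc)]

# Stage 2: collect distinct categories from the documents that survived.
categories = sorted({prize["category"] for doc in matching for prize in doc["prizes"]})
print(categories)
```

Note that the filter selects whole documents, so in this sketch "chemistry" is collected even though that particular prize was unshared. The same effect is visible in the multi-winner output above, where every category held by a matching laureate is listed.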


## 02.06 Born here, went there

**Instructions**

In which countries have USA-born laureates had affiliations for their prizes?

**Possible Answers**

1. __Australia, Denmark, United Kingdom, USA__
2. Australia, France, Sweden, United Kingdom, USA
3. Australia, Canada, Israel, United Kingdom, USA

**Results**

<font color=darkgreen>Yes! This is the output of db.laureates.distinct('prizes.affiliations.country', {'bornCountry': 'USA'})</font>

In [11]:
print(db.laureates.distinct('prizes.affiliations.country', {'bornCountry': 'USA'}))

['Australia', 'Denmark', 'USA', 'United Kingdom']


## 02.07 Triple plays (mostly) all around

Prizes can be shared, even by more than two laureates. In fact, all prize categories but one – literature – have had prizes shared by three or more laureates.

**Instructions**

1. Save a filter document criteria that, when passed to db.prizes.distinct, returns all prize categories shared by three or more laureates. That is, "laureates.2" must exist for such documents.
2. Save these prize categories as a Python set called triple_play_categories.
3. Confirm via an assertion that "literature" is the only prize category with no prizes shared by three or more laureates.

**Results**

<font color=darkgreen>Around the horn! One day, literature, one day...</font>

In [12]:
# Save a filter for prize documents with three or more laureates
criteria = {"laureates.2": {'$exists': True}}

# Save the set of distinct prize categories in documents satisfying the criteria
triple_play_categories = set(db.prizes.distinct('category', criteria))
print(triple_play_categories)

all_categories = set(db.prizes.distinct('category'))
print(all_categories)
# Confirm literature as the only category not satisfying the criteria.
assert all_categories - triple_play_categories == {'literature'}

{'medicine', 'physics', 'economics', 'chemistry', 'peace'}
{'medicine', 'physics', 'economics', 'chemistry', 'peace', 'literature'}
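
A filter such as `{"laureates.2": {"$exists": True}}` asks whether index 2 of the array exists, which amounts to asking whether the array has at least three elements. The sketch below mimics that check in plain Python on made-up prize documents; `index_exists` is a hypothetical helper, and the ids other than 960 to 962 are invented.

```python
# Made-up prize documents, trimmed to the fields that matter here.
sample_prizes = [
    {"category": "physics", "laureates": [{"id": "960"}, {"id": "961"}, {"id": "962"}]},
    {"category": "literature", "laureates": [{"id": "h1"}]},
    {"category": "peace", "laureates": [{"id": "h2"}, {"id": "h3"}]},
]


def index_exists(doc, field, index):
    """True if doc[field] is long enough to have an element at `index`."""
    return len(doc.get(field, [])) > index


triple_play = {doc["category"] for doc in sample_prizes
               if index_exists(doc, "laureates", 2)}
all_seen = {doc["category"] for doc in sample_prizes}

print(triple_play)
print(all_seen - triple_play)
```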


## 02.08 Filter Arrays using Distinct Values

1. Matching array fields
>In this lesson, we'll learn more about how to query array fields and their structured values.

2. Array fields and equality
>Here we see part of the laureates collection document for John Bardeen. He won two prizes, both in physics. Each prizes array in a laureate document contains subdocuments. Each subdocument has a category field. We can use dot notation to filter for and count laureates with a prize category equal to physics.

3. Array fields and equality, simplified
>Here's a simpler example using a fictitious field. Let's imagine that laureate documents had an extra field, "nicknames". This field stores an array of string values. Let's now find all laureates with a nickname of "JB". We could use a filter document like this. The filter matches all documents that have at least one value in the "nicknames" array field equal to "JB". This notation is familiar. If "nicknames" was not an array, the filter would match for the field value being equal to "JB". Because "nicknames" is an array, the filter matches if any member of the array matches.

4. Array fields and operators
>Let's go back to filtering on the real "category" field. This field is within subdocuments of the top-level "prizes" array field. As before, we can wrap filter document values with operators. For example, here we filter for laureates with no prize in physics. Note that on an array field, the "not-equal" operator matches only when no element equals the value, so a laureate with one prize in physics and another in chemistry does not match. Another example. Here, we use the "in" operator to find laureates with at least one prize in these three categories. And here, we use the "not-in" operator to find laureates with no prize in any of these three categories. The counts in the output below are consistent with this: each pair of complementary filters adds up to the same total (209 + 725 = 604 + 330 = 934).

5. Enter \$elemMatch
>But what if we want to filter on more than one field within a prize subdocument? Let's try something like this to count laureates who won unshared prizes in physics. Hmm, that's not quite what we want. This filter matches prize subdocuments that have two and only two fields. No laureates have a prize subdocument that looks exactly like this. All prize subdocuments also have a year field, for instance. This next filter is better, but it's not quite what we want. This filter matches laureate documents satisfying two conditions. The first is that the prizes field has at least one subdocument with a "category" field equal to "physics". The second is that the prizes field has at least one subdocument with a "share" field equal to "1". The prizes that match for a laureate could be different prizes. This is where the "element match" - or "elemMatch" - operator comes in. Finally, we count all laureates that have at least one unshared prize in physics. Within the "elemMatch" operation, as with any operation, we can continue to drill down. Operations can nest to make finer-grained queries. Here, we extend the last filter to include laureates only if they won a solo prize in physics before 1945.

6. Onward and array-ward!
>Let's make sure you understand how to filter arrays with "elemMatch". We'll do this in concert with other operators. And we'll learn some Nobel prize statistics along the way.

In [13]:
# Array fields and equality
criteria = {"prizes.category": "physics"}
print(criteria, db.laureates.count_documents(criteria))

# Array fields and equality, simplified
criteria = {'nicknames': {'$exists': True}}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

# Array fields and operators
criteria = {"prizes.category": "physics"}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

criteria = {"prizes.category": {"$ne": "physics"}}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

criteria = {"prizes.category": {"$in": ["physics", "chemistry", "medicine"]}}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

criteria = {"prizes.category": {"$nin": ["physics", "chemistry", "medicine"]}}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

# Enter $elemMatch
print('\nCount laureates who won unshared prizes in physics:')
criteria = {"prizes": {"category": "physics", "share": "1"}}
print(f'Incorrect --> {criteria}', db.laureates.count_documents(criteria)) # Structure not found

criteria = {"prizes.category": "physics", "prizes.share": "1"}
print(f'Incorrect --> {criteria}', db.laureates.count_documents(criteria)) # The two conditions may be met by different subdocuments.

criteria = {"prizes": {"$elemMatch":{"category": "physics", "share": "1"}}}
print(criteria, db.laureates.count_documents(criteria)) # Perfect!

criteria = {"prizes": {"$elemMatch": {"category": "physics",
                                      "share": "1",
                                      "year": {"$lt": "1945"},}}}
print(f'\n{criteria}', db.laureates.count_documents(criteria))

{'prizes.category': 'physics'} 209

{'nicknames': {'$exists': True}} 0

{'prizes.category': 'physics'} 209

{'prizes.category': {'$ne': 'physics'}} 725

{'prizes.category': {'$in': ['physics', 'chemistry', 'medicine']}} 604

{'prizes.category': {'$nin': ['physics', 'chemistry', 'medicine']}} 330

Count laureates who won unshared prizes in physics:
Incorrect --> {'prizes': {'category': 'physics', 'share': '1'}} 0
Incorrect --> {'prizes.category': 'physics', 'prizes.share': '1'} 48
{'prizes': {'$elemMatch': {'category': 'physics', 'share': '1'}}} 47

{'prizes': {'$elemMatch': {'category': 'physics', 'share': '1', 'year': {'$lt': '1945'}}}} 29
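
The gap between 48 and 47 above comes from where the two conditions are allowed to match. The following pure-Python sketch imitates both filters on three sample documents. It is an analogy, not MongoDB's matching engine: `dotted_match` and `elem_match` are hypothetical helpers, and the "Solo" and "Shared" laureates are invented.

```python
sample_laureates = [
    # Shared physics prize plus an unshared chemistry prize.
    {"surname": "Curie", "prizes": [
        {"category": "physics", "share": "4", "year": "1903"},
        {"category": "chemistry", "share": "1", "year": "1911"}]},
    # Invented laureate with an unshared physics prize.
    {"surname": "Solo", "prizes": [
        {"category": "physics", "share": "1", "year": "1921"}]},
    # Invented laureate with a shared physics prize only.
    {"surname": "Shared", "prizes": [
        {"category": "physics", "share": "2", "year": "1950"}]},
]


def dotted_match(doc):
    """Imitates {"prizes.category": "physics", "prizes.share": "1"}:
    each condition may be satisfied by a different prize."""
    prizes = doc["prizes"]
    return (any(p["category"] == "physics" for p in prizes)
            and any(p["share"] == "1" for p in prizes))


def elem_match(doc):
    """Imitates {"prizes": {"$elemMatch": {...}}}:
    a single prize must satisfy both conditions."""
    return any(p["category"] == "physics" and p["share"] == "1"
               for p in doc["prizes"])


dotted = [d["surname"] for d in sample_laureates if dotted_match(d)]
elem = [d["surname"] for d in sample_laureates if elem_match(d)]
print(dotted)
print(elem)
```

In this sketch the first document passes the dotted filter because its physics prize satisfies one condition and its chemistry prize the other. A single document of that shape in the collection would be enough to explain a one-laureate difference between the two counts.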


## 02.09 Sharing in physics after World War II

**Instructions**
What is the approximate ratio of the number of laureates who won an unshared (__{"share": "1"}__) prize in physics after World War II (__{"year": {"$gte": "1945"}}__) to the number of laureates who won a shared prize in physics after World War II?

For reference, the code below determines the number of laureates who won a shared prize in physics before 1945.

<code>db.laureates.count_documents({
    "prizes": {"\$elemMatch": {
        "category": "physics",
        "share": {"\$ne": "1"},
        "year": {"\$lt": "1945"}}}})</code>
        
**Possible Answers**
1. 0.06
2. __0.13__
3. 0.33
4. 0.50

**Results**

<font color=darkgreen>Right-o! There has been significant sharing of physics prizes since World War II</font>

In [14]:
criteria = {"prizes": {"$elemMatch": {"category": "physics", 
                                      "share": "1", 
                                      "year": {"$gte": "1945"}}}}
unshared_prize = db.laureates.count_documents(criteria)
print('Unshared prize:', unshared_prize)

criteria = {"prizes": {"$elemMatch": {"category": "physics", 
                                      "share": {"$ne": "1"}, 
                                      "year": {"$gte": "1945"}}}}
shared_prize = db.laureates.count_documents(criteria)
print('Shared prize:', shared_prize)

print('Ratio:', unshared_prize/shared_prize)

Unshared prize: 18
Shared prize: 143
Ratio: 0.1258741258741259
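
The year filters in these exercises compare strings, not numbers, because the data stores years as strings such as "1945". A short sketch of why that still works, assuming every year is a four-digit string (true for the documents shown here, but an assumption in general):

```python
years = ["1903", "1944", "1945", "2018"]

# Strings compare character by character, which agrees with numeric
# order when all values have the same number of digits.
before = [y for y in years if y < "1945"]
in_or_after = [y for y in years if y >= "1945"]
print(before, in_or_after)

# The agreement breaks once the lengths differ: as a number 999 is
# less than 1945, but as a string "999" sorts after "1945".
numeric_says = 999 < 1945
string_says = "999" < "1945"
print(numeric_says, string_says)
```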


## 02.10 Meanwhile, in other categories...

We learned in the last exercise that there has been significantly more sharing of physics prizes since World War II: the ratio of the number of laureates who won an unshared prize in physics in or after 1945 to the number of laureates who shared a prize in physics in or after 1945 is approximately 0.13. What is this ratio for prize categories other than physics, chemistry, and medicine?

**Instructions**

1. Save an \$elemMatch filter unshared to count laureates with unshared prizes in categories other than ("not in") ["physics", "chemistry", "medicine"] in or after 1945.
2. Save an \$elemMatch filter shared to count laureates with shared (i.e., "share" is not "1") prizes in categories other than ["physics", "chemistry", "medicine"] in or after 1945.

**Results**

<font color=darkgreen>Wow! This ratio is a ten-fold jump over the ratio for physics!</font>

In [15]:
# Save a filter for laureates with unshared prizes
unshared = {
    "prizes": {'$elemMatch': {
        'category': {'$nin': ["physics", "chemistry", "medicine"]},
        "share": "1",
        "year": {'$gte': "1945"},
    }}}

# Save a filter for laureates with shared prizes
shared = {
    "prizes": {'$elemMatch': {
        'category': {'$nin': ["physics", "chemistry", "medicine"]},
        "share": {'$ne': "1"},
        "year": {'$gte': "1945"},
    }}}

ratio = db.laureates.count_documents(unshared) / db.laureates.count_documents(shared)
print(ratio)

1.3653846153846154


## 02.11 Organizations and prizes over time

How many organizations won prizes before 1945 versus in or after 1945?

**Instructions**

1. You won't need the \$elemMatch operator at all for this exercise.
2. Save a filter before to count organization laureates with prizes won before 1945. Recall that organization status is encoded with the "gender" field, and that dot notation is needed to access a laureate's "year" field within its "prizes" array.
3. Save a filter in_or_after to count organization laureates with prizes won in or after 1945.


**Results**

<font color=darkgreen>Cool! Even though fewer than two thirds of Nobel prizes were awarded in 1945 and later, over 80% of organizations won prizes then.</font>

In [16]:
# Save a filter for organization laureates with prizes won before 1945
before = {
    'gender': 'org',
    'prizes.year': {'$lt': "1945"},
    }

# Save a filter for organization laureates with prizes won in or after 1945
in_or_after = {
    'gender': 'org',
    'prizes.year': {'$gte': "1945"},
    }

n_before = db.laureates.count_documents(before)
n_in_or_after = db.laureates.count_documents(in_or_after)
ratio = n_in_or_after / (n_in_or_after + n_before)
print(ratio)

0.84


## 02.12 Distinct As You Like It

1. Distinct As You Like It: Filtering with Regular Expressions
>We've seen how to construct filters comparing a field's value exactly. For string-valued fields, we may want instead to match a field's value to a pattern. We may want to match a substring. We may want to constrain that substring to appear at the start or end of a field's value. Or, we may want something more complex. Regular expressions are a powerful way to express such filters. Let's see how MongoDB supports them.

2. Finding a substring with \$regex
>Let's look at the laureate document for Marie Curie. Recall that she discovered a new element and named it polonium. She did this to publicize her native land's lack of independence. We see here that Poland is a substring of her document's "bornCountry". How can we filter for values of "bornCountry" that contain Poland as a substring? We can use MongoDB's regular expression operator, regex. Here I use the regex operator on the string "Poland" in a filter document. This expression gets distinct values of "bornCountry" that contain "Poland" as a substring. The results show that some laureates were born in places that at the time were not part of Poland but today are. Others were born in places that at the time were part of Poland but today are not. And finally, some were born in places that both at the time were and today are part of Poland.

3. Flag options for regular expressions
>We can use the regex operator together with the options operator. This will change the conditions for matching. For example, the "i" option ensures case-insensitive matching. The string passed to regex in the second statement is "poland", all lower case. The assertion here is true - Poland is always capitalized for this field. MongoDB also supports compiled regular expression objects. The pymongo driver includes a bson package with a Regex class, which you can import and use as shown. Finally, using native Python regular expression objects is possible. I do not recommend this, though. Use of the bson Regex class is more robust for MongoDB.

4. Beginning and ending (and escaping)
>The syntax of regular expressions is rich. For the exercises, though, you only need to know a few tricks. First, you need to know how to match the beginning or end of a field's value. Second, you need to know how to escape a special character so that you match the character itself. To match the beginning of a field's value, use the caret character. Anchor it to the beginning of the string you pass to regex. This expression returns distinct values of the "bornCountry" field that start with Poland. To escape a character, use a backslash. A paren functions to capture groups in regular expressions. Because we want to match a literal open paren and not use this function, we escape it with a backslash. This expression returns "bornCountry" values for countries that used to be Poland. Finally, to match the end of a field's value, use the dollar sign. Anchor it to the end of what you pass to regex. This expression returns all countries that became Poland after a laureate's birth. What you see here is all you need for the exercises. Use a caret to match the beginning of a field, a dollar sign to match the end, and a backslash to escape parentheses.

5. Let's practice!
>We have new tools to answer questions about string-valued fields in MongoDB collections. Let's practice!

In [17]:
# Exploring
criteria = {"firstname": "Marie"}
print(f'Filter: {criteria} \nFound elements:')
pprint(db.laureates.find_one(criteria))

# Finding a substring with $regex
criteria = {"bornCountry": {"$regex": "Poland"}}
print(f'\nFilter: {criteria} \nFound elements:')
pprint(db.laureates.distinct('bornCountry', criteria))

Filter: {'firstname': 'Marie'} 
Found elements:
{'_id': ObjectId('6035898c6109195e81d36a29'),
 'born': '1867-11-07',
 'bornCity': 'Warsaw',
 'bornCountry': 'Russian Empire (now Poland)',
 'bornCountryCode': 'PL',
 'died': '1934-07-04',
 'diedCity': 'Sallanches',
 'diedCountry': 'France',
 'diedCountryCode': 'FR',
 'firstname': 'Marie',
 'gender': 'female',
 'id': '6',
 'prizes': [{'affiliations': [[]],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services they '
                           'have rendered by their joint researches on the '
                           'radiation phenomena discovered by Professor Henri '
                           'Becquerel"',
             'share': '4',
             'year': '1903'},
            {'affiliations': [{'city': 'Paris',
                               'country': 'France',
                               'name': 'Sorbonne University'}],
             'category': 'chemistry',
             'motiva

In [18]:
# Flag options for regular expressions - using $regex
case_sensitive = db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": "Poland"}})
pprint(case_sensitive)

case_insensitive = db.laureates.distinct("bornCountry", {"bornCountry": {"$regex": "poland", "$options": "i"}})
pprint(case_insensitive)

assert set(case_sensitive) == set(case_insensitive)

# Flag options for regular expressions - using bson.regex (the best option)
bson_option = db.laureates.distinct("bornCountry", {"bornCountry": Regex("poland", "i")})

assert set(case_sensitive) == set(bson_option)

# Flag options for regular expressions - using re (not recommended)
re_option = db.laureates.distinct("bornCountry", {"bornCountry": re.compile("poland", re.I)})

assert set(case_sensitive) == set(re_option)

['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']
['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']
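
The effect of the `"i"` option can also be checked without a database connection. The following is only a local sketch: it uses Python's `re` module as a stand-in for MongoDB's regex engine (an assumption that holds for simple patterns like these), and the `matching` helper is a hypothetical name introduced for illustration. The values are copied from the output above.

```python
import re

# Distinct "bornCountry" values copied from the output above
values = [
    "Austria-Hungary (now Poland)",
    "Free City of Danzig (now Poland)",
    "German-occupied Poland (now Poland)",
    "Germany (now Poland)",
    "Poland",
    "Poland (now Belarus)",
    "Poland (now Lithuania)",
    "Poland (now Ukraine)",
    "Prussia (now Poland)",
    "Russian Empire (now Poland)",
]


def matching(pattern, flags=0):
    """Hypothetical helper: values in which `pattern` occurs anywhere (unanchored)."""
    rx = re.compile(pattern, flags)
    return [v for v in values if rx.search(v)]


case_sensitive = matching("Poland")                   # like {"$regex": "Poland"}
lowercase_no_flag = matching("poland")                # lowercase pattern, no flag
case_insensitive = matching("poland", re.IGNORECASE)  # like "$options": "i"

print(len(case_sensitive), len(lowercase_no_flag), len(case_insensitive))  # 10 0 10
```

Without the flag, the lowercase pattern matches nothing, because every value spells "Poland" with a capital "P".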


In [19]:
# Beginning and ending (and escaping)
print('\nBegin with "Poland":')
pprint(db.laureates.distinct("bornCountry", {"bornCountry": Regex("^Poland")}))

print('\nBegin with "Poland (now"')
pprint(db.laureates.distinct("bornCountry", {"bornCountry": Regex(r"^Poland \(now")}))

print('\nEnd with "Poland)"')
pprint(db.laureates.distinct("bornCountry", {"bornCountry": Regex(r"now Poland\)$")}))


Begin with "Poland":
['Poland',
 'Poland (now Belarus)',
 'Poland (now Lithuania)',
 'Poland (now Ukraine)']

Begin with "Poland (now"
['Poland (now Belarus)', 'Poland (now Lithuania)', 'Poland (now Ukraine)']

End with "Poland)"
['Austria-Hungary (now Poland)',
 'Free City of Danzig (now Poland)',
 'German-occupied Poland (now Poland)',
 'Germany (now Poland)',
 'Prussia (now Poland)',
 'Russian Empire (now Poland)']
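
Parentheses have to be escaped because they otherwise open and close a group. As a local sketch (Python's `re` stands in for the database's regex engine here, which is an assumption), `re.escape` can build the escaped literal for you, and an unescaped `(` fails to compile at all. The sample values are taken from the outputs above; `begins` and `ends` are names introduced only for this illustration.

```python
import re

# A few "bornCountry" values taken from the outputs above
values = [
    "Poland",
    "Poland (now Belarus)",
    "Poland (now Lithuania)",
    "Poland (now Ukraine)",
    "Russian Empire (now Poland)",
]

# An unescaped "(" starts a group that is never closed, so compilation fails
try:
    re.compile("^Poland (now")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape() backslash-escapes the special characters in a literal string
starts_with = re.compile("^" + re.escape("Poland (now"))
ends_with = re.compile(re.escape("now Poland)") + "$")

begins = [v for v in values if starts_with.search(v)]
ends = [v for v in values if ends_with.search(v)]

print(unescaped_ok)  # False
print(begins)
print(ends)
```

Writing the pattern as a raw string, such as `r"^Poland \(now"`, is the manual equivalent and avoids Python's warning about invalid escape sequences in ordinary string literals.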


## 02.13 Glenn, George, and others in the G.S. crew

**Instructions**

There are two laureates with Berkeley, California as a prize affiliation city who have the initials G.S.: Glenn Seaborg and George Smoot. How many laureates in total have a first name beginning with "G" and a surname beginning with "S"?

Evaluate the expression

<code>db.laureates.count_documents({"firstname": Regex(____), "surname": Regex(____)})</code>

in the console, filling in the blanks appropriately.

**Possible Answers**

1. __9 laureates__
2. 12 laureates
3. 50 laureates

**Results**

<font color=darkgreen>Correct! The filter {"firstname": Regex("^G"), "surname": Regex("^S")} gives us the right answer.</font>

In [20]:
db.laureates.count_documents({"firstname": Regex('^G'), "surname": Regex('^S')})

9
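
A filter with two fields requires both conditions to hold for the same document. Here is a minimal local sketch of that logic, using Python's `re` in place of the database (an assumption). The names are simplified from those mentioned in this notebook and are not the collection's exact field values.

```python
import re

# (firstname, surname) pairs simplified from names mentioned in this notebook
names = [
    ("Glenn", "Seaborg"),
    ("George", "Smoot"),
    ("John", "Bardeen"),
    ("William Bradford", "Shockley"),
    ("Walter Houser", "Brattain"),
]

first_rx = re.compile("^G")  # first name begins with "G"
last_rx = re.compile("^S")   # surname begins with "S"

# Both conditions must hold, mirroring the two-field filter
gs_crew = [
    (first, last)
    for first, last in names
    if first_rx.search(first) and last_rx.search(last)
]

print(gs_crew)  # [('Glenn', 'Seaborg'), ('George', 'Smoot')]
```

Shockley is excluded even though his surname begins with "S", because his first name does not begin with "G".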

## 02.14 Germany, then and now

Just as we saw with Poland, there are laureates who were born somewhere that was in Germany at the time but is now not, and others born somewhere that was not in Germany at the time but now is.

**Instructions**

1. Use a regular expression object to filter for laureates with "Germany" in their "bornCountry" value.
2. Use a regular expression object to filter for laureates with a "bornCountry" value starting with "Germany".
3. Use a regular expression object to filter for laureates born in what was at the time Germany but is now another country.
4. Use a regular expression object to filter for laureates born in what is now Germany but at the time was another country.

**Results**

<font color=darkgreen>Wunderbar! There are twelve distinct values that represent countries that were or became part of Germany. Also, some laureates were born in parts of modern-day Poland, France, and Russia that were at the time part of Germany. Finally, it's true – the home of Oktoberfest, Bavaria, was really its own country at one time!</font>

In [21]:
# Filter for laureates with "Germany" in their "bornCountry" value
criteria = {"bornCountry": Regex('Germany')}
print(f'Filter: {criteria} \nFound born country:')
pprint(set(db.laureates.distinct("bornCountry", criteria)))

# Filter for laureates with a "bornCountry" value starting with "Germany"
criteria = {"bornCountry": Regex('^Germany')}
print(f'\nFilter: {criteria} \nFound born country:')
pprint(set(db.laureates.distinct("bornCountry", criteria)))

# Filter for laureates born in what was then Germany but is now another country
criteria = {"bornCountry": Regex(r"^Germany \(now")}
print(f'\nFilter: {criteria} \nFound born country:')
pprint(set(db.laureates.distinct("bornCountry", criteria)))

# Filter for laureates born in what is now Germany but was then another country
criteria = {"bornCountry": Regex(r"now Germany\)$")}
print(f'\nFilter: {criteria} \nFound born country:')
pprint(set(db.laureates.distinct("bornCountry", criteria)))

Filter: {'bornCountry': Regex('Germany', 0)} 
Found born country:
{'Bavaria (now Germany)',
 'East Friesland (now Germany)',
 'Germany',
 'Germany (now France)',
 'Germany (now Poland)',
 'Germany (now Russia)',
 'Hesse-Kassel (now Germany)',
 'Mecklenburg (now Germany)',
 'Prussia (now Germany)',
 'Schleswig (now Germany)',
 'W&uuml;rttemberg (now Germany)',
 'West Germany (now Germany)'}

Filter: {'bornCountry': Regex('^Germany', 0)} 
Found born country:
{'Germany',
 'Germany (now France)',
 'Germany (now Poland)',
 'Germany (now Russia)'}

Filter: {'bornCountry': Regex('^Germany \\(now', 0)} 
Found born country:
{'Germany (now Russia)', 'Germany (now Poland)', 'Germany (now France)'}

Filter: {'bornCountry': Regex('now Germany\\)$', 0)} 
Found born country:
{'Bavaria (now Germany)',
 'East Friesland (now Germany)',
 'Hesse-Kassel (now Germany)',
 'Mecklenburg (now Germany)',
 'Prussia (now Germany)',
 'Schleswig (now Germany)',
 'W&uuml;rttemberg (now Germany)',
 'West Germany (now 

## 02.15 The prized transistor

Three people shared a Nobel prize "for their researches on semiconductors and their discovery of the transistor effect". We can filter on "transistor" as a substring of a laureate's "prizes.motivation" field value to find these laureates.

**Instructions**

1. Save a filter criteria that finds laureates with prizes.motivation values containing "transistor" as a substring. The substring can appear anywhere within the value, so no anchoring characters are needed.
2. Save to first and last the field names corresponding to a laureate's first name and last name (i.e. "surname") so that we can print out the names of these laureates.

**Results**

<font color=darkgreen>Great! Shockley, Bardeen, and Brattain were a great team.</font>

In [22]:
# Save a filter for laureates with prize motivation values containing "transistor" as a substring
criteria = {'prizes.motivation': Regex('transistor')}

# Save the field names corresponding to a laureate's first name and last name
first, last = 'firstname', 'surname'
pprint([(laureate[first], laureate[last]) for laureate in db.laureates.find(criteria)])

[('William Bradford', 'Shockley'),
 ('John', 'Bardeen'),
 ('Walter Houser', 'Brattain')]
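
The filter on "prizes.motivation" matches a laureate if any element of the "prizes" array has a matching motivation. Below is a rough local sketch of that behaviour with plain dictionaries. The documents are heavily trimmed versions of ones shown in this notebook, `any_motivation_matches` is a hypothetical helper, and Python's `re` stands in for the database's regex engine.

```python
import re

# Heavily trimmed documents, using only values that appear in this notebook
laureates = [
    {
        "firstname": "John",
        "surname": "Bardeen",
        "prizes": [
            {"motivation": '"for their researches on semiconductors and '
                           'their discovery of the transistor effect"'},
        ],
    },
    {
        "firstname": "Marie",
        "prizes": [
            {"category": "physics",
             "motivation": '"in recognition of the extraordinary services they '
                           'have rendered by their joint researches on the '
                           'radiation phenomena discovered by Professor Henri '
                           'Becquerel"'},
            {"category": "chemistry"},  # motivation omitted in this sketch
        ],
    },
]


def any_motivation_matches(doc, pattern):
    """Hypothetical helper: True if any prize's motivation contains `pattern`."""
    rx = re.compile(pattern)
    return any(rx.search(prize.get("motivation", ""))
               for prize in doc.get("prizes", []))


matches = [doc["firstname"] for doc in laureates
           if any_motivation_matches(doc, "transistor")]
print(matches)  # ['John']
```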


# Additional material
- DataCamp course: https://learn.datacamp.com/courses/introduction-to-using-mongodb-for-data-science-with-python