# Working with Distinct Values, $elemMatch, and Regex: Exercises

In [None]:
from bson.regex import Regex
from pymongo import MongoClient

client = MongoClient()
db = client.nobel

## Categorical data validation

What expression asserts that the distinct Nobel Prize categories catalogued by the "prizes" collection are the same as those catalogued by the "laureates" collection?

Remember to explore example documents via e.g. `db.prizes.find_one()` and `db.laureates.find_one()`.

1. `assert db.prizes.distinct("category") == db.laureates.distinct("prizes.category")`
2. `assert db.prizes.distinct("laureates.category") == db.laureates.distinct("prizes.category")`
3. `assert set(db.prizes.distinct("category")) == set(db.laureates.distinct("prizes.category"))`
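To see why the comparison strategy matters, here is a minimal pure-Python sketch of how `distinct` gathers values through dot notation. The helper `distinct_values` and the sample documents are hypothetical stand-ins, not part of PyMongo or the Nobel data, and the ordering they produce is only an assumption about what a server might return. The helper does not handle numeric index paths such as `"laureates.2"`.

```python
def distinct_values(docs, path):
    """Hypothetical stand-in for collection.distinct(path) on in-memory dicts."""

    def walk(value, keys):
        # Arrays are traversed element by element, as dot notation does.
        if isinstance(value, list):
            for item in value:
                yield from walk(item, keys)
        elif not keys:
            yield value
        elif isinstance(value, dict) and keys[0] in value:
            yield from walk(value[keys[0]], keys[1:])

    found = []
    for doc in docs:
        for value in walk(doc, path.split(".")):
            if value not in found:
                found.append(value)
    return found


# Made-up documents shaped like the two collections.
prizes = [{"category": "physics"}, {"category": "peace"}]
laureates = [
    {"prizes": [{"category": "peace"}]},
    {"prizes": [{"category": "physics"}, {"category": "peace"}]},
]

from_prizes = distinct_values(prizes, "category")
from_laureates = distinct_values(laureates, "prizes.category")

print(from_prizes)
print(from_laureates)
print(from_prizes == from_laureates)            # list equality
print(set(from_prizes) == set(from_laureates))  # set equality
```

List equality is order-sensitive, whereas set equality only compares membership, so the two checks can disagree on the same values.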

## Never from there, but sometimes there at last

There are some recorded countries of death (`"diedCountry"`) that do not appear as a country of birth (`"bornCountry"`) for laureates. One such country is "East Germany".

- Return a set of all such countries as `countries`.
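As a reminder of how set difference behaves, here is a small sketch on made-up lists standing in for two `distinct` results. The values are illustrative only and are not taken from the database.

```python
# Made-up values standing in for two distinct() results.
died = ["Germany", "East Germany", "France"]
born = ["Germany", "France", "Poland"]

# Set difference keeps what is in the left set but not in the right one.
only_died = set(died) - set(born)
only_born = set(born) - set(died)

print(only_died)
print(only_born)
```

The operation is not symmetric, so the order of the two operands decides which "only in one place" set you get.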

In [None]:
# Countries recorded as countries of death but not as countries of birth
countries = set(____) - set(____)
print(countries)

## Countries of affiliation

We saw in the last exercise that countries can be associated with a laureate as their country of birth and as their country of death. For each prize a laureate received, they may also have been affiliated with an institution at the time, located in a country.

- Determine the number of distinct countries recorded as part of an affiliation for laureates' prizes. Save this as `count`.

In [None]:
# The number of distinct countries of laureate affiliation for prizes
count = ____(db.laureates.____(____))
print(count)

## Born here, went there

In which countries have USA-born laureates had affiliations for their prizes?

1. Australia, Denmark, United Kingdom, USA
2. Australia, France, Sweden, United Kingdom, USA
3. Australia, Canada, Israel, United Kingdom, USA

## Triple plays (mostly) all around

All prize categories but one – literature – have had prizes shared by three or more laureates.

- Save a filter document `criteria` that, when passed to `db.prizes.distinct`, returns all prize categories in which at least one prize was shared by three or more laureates. That is, `"laureates.2"` must exist for such documents.
- Save these prize categories as a Python `set` called `triple_play_categories`.
- Confirm via an assertion that "literature" is the only prize category with no prizes shared by three or more laureates.
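The path `"laureates.2"` addresses the third element of the `laureates` array (indexes start at zero), so requiring it to exist is one way of saying "at least three elements". A rough in-memory analogue follows, using made-up prize documents and a hypothetical helper `has_element`; neither comes from the Nobel data or from PyMongo.

```python
def has_element(doc, field, index):
    """True if doc[field] is a list with an element at the zero-based index."""
    value = doc.get(field)
    return isinstance(value, list) and len(value) > index


# Made-up prize documents; the ids are placeholders.
prizes = [
    {"category": "physics", "laureates": [{"id": "a"}, {"id": "b"}, {"id": "c"}]},
    {"category": "literature", "laureates": [{"id": "d"}]},
    {"category": "peace", "laureates": [{"id": "e"}, {"id": "f"}]},
    {"category": "peace", "laureates": [{"id": "g"}, {"id": "h"}, {"id": "i"}]},
]

all_categories = {p["category"] for p in prizes}
triple = {p["category"] for p in prizes if has_element(p, "laureates", 2)}

print(sorted(triple))
print(all_categories - triple)
```

A category counts as soon as any one of its prizes has a third laureate, which is why "peace" qualifies here despite also having a two-laureate prize.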

In [None]:
# Save a filter for prize documents with three or more laureates
criteria = {____: {____: ____}}

# Save the set of distinct prize categories in documents satisfying the criteria
triple_play_categories = ____(db.prizes.distinct(____, criteria))

# Confirm literature as the only category not satisfying the criteria.
assert set(db.prizes.distinct(____)) - triple_play_categories == {____}

## Sharing in physics after World War II

What is the approximate ratio of the number of laureates who won an *unshared* prize in physics after World War II to the number of laureates who won a *shared* prize in physics after World War II? Here, "unshared" means `{"share": "1"}` and "after World War II" means `{"year": {"$gte": "1945"}}`.

1. 0.06
2. 0.13
3. 0.33
4. 0.50

## Meanwhile, in other categories...

What is the aforementioned ratio for prize categories other than physics, chemistry, and medicine?

- Save an `$elemMatch` filter `unshared` to count laureates with unshared prizes in categories other than ("not in") `["physics", "chemistry", "medicine"]` in or after 1945.
- Save an `$elemMatch` filter `shared` to count laureates with shared (i.e., "share" is not "1") prizes in categories other than `["physics", "chemistry", "medicine"]` in or after 1945.
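`$elemMatch` requires a single array element to satisfy every condition at once, whereas separate dot-notation conditions may each be satisfied by a different element. The sketch below mimics both behaviours in plain Python. The helper functions and the sample laureate are illustrative assumptions, not real data or PyMongo calls.

```python
def any_element_matches_all(array, conditions):
    """Analogue of $elemMatch: one element must satisfy every condition."""
    return any(all(cond(elem) for cond in conditions) for elem in array)


def each_condition_matches_some(array, conditions):
    """Analogue of separate dot-notation conditions on the same array."""
    return all(any(cond(elem) for elem in array) for cond in conditions)


# A made-up laureate with one shared physics prize and one unshared chemistry prize.
laureate = {
    "prizes": [
        {"category": "physics", "share": "4", "year": "1903"},
        {"category": "chemistry", "share": "1", "year": "1911"},
    ]
}

# "Won an unshared prize in physics", expressed as two conditions.
conditions = [
    lambda prize: prize["category"] == "physics",
    lambda prize: prize["share"] == "1",
]

loose = each_condition_matches_some(laureate["prizes"], conditions)
strict = any_element_matches_all(laureate["prizes"], conditions)
print(loose)
print(strict)
```

The loose check accepts this laureate because the two conditions are met by different prizes; the strict check rejects it because no single prize is both in physics and unshared.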



In [None]:
# Save a filter for laureates with unshared prizes
unshared = {
    "prizes": {____: {
        ____: {____: ["physics", "chemistry", "medicine"]},
        "share": "1",
        "year": {____: "1945"},
    }}}

# Save a filter for laureates with shared prizes
shared = {
    "prizes": {____: {
        ____: {____: ["physics", "chemistry", "medicine"]},
        "share": {____: "1"},
        "year": {____: "1945"},
    }}}

ratio = db.laureates.____(____) / db.laureates.____(____)
print(ratio)

## Organizations and prizes over time

How many organizations won prizes before 1945 versus in or after 1945?

- Save a filter `before` to count organization laureates with prizes won before 1945. Recall that organization status is encoded with the "gender" field, and that dot notation is needed to access a laureate's "year" field within its "prizes" array.
- Save a filter `in_or_after` to count organization laureates with prizes won in or after 1945.
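The years in these filters are strings, so range operators compare them as text rather than as numbers. For four-digit years the two orderings coincide, which Python's own string comparison can illustrate. This assumes, as a simplification, that the server orders these digit-only strings the same way Python does.

```python
years = ["1901", "1944", "1945", "2001"]

# Lexicographic comparison agrees with numeric order when every string
# has the same number of digits.
before = [y for y in years if y < "1945"]
in_or_after = [y for y in years if y >= "1945"]
print(before)
print(in_or_after)

# The agreement breaks once the lengths differ.
text_result = "999" < "1945"               # compared character by character
number_result = int("999") < int("1945")   # compared as numbers
print(text_result, number_result)
```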

In [None]:
# Save a filter for organization laureates with prizes won before 1945
before = {
    ____: ____,
    ____: {____: "1945"},
    }

# Save a filter for organization laureates with prizes won in or after 1945
in_or_after = {
    ____: ____,
    ____: {____: "1945"},
    }

n_before = db.laureates.count_documents(before)
n_in_or_after = db.laureates.count_documents(in_or_after)
ratio = n_in_or_after / (n_in_or_after + n_before)
print(ratio)

## Glenn, George, and others in the G.B. crew

There are two laureates with Berkeley, California as a prize affiliation city who have the initials G.S.: Glenn Seaborg and George Smoot.

How many laureates in total have a first name beginning with "G" and a surname beginning with "S"?

Evaluate the following expression, filling in the blanks appropriately:
```python
db.laureates.count_documents({"firstname": Regex(____), "surname": Regex(____)})
```

1. 9 laureates
2. 12 laureates
3. 50 laureates

## Germany, then and now

Just as we saw with Poland, there are laureates who were born somewhere that was in Germany at the time but is now not, and others born somewhere that was not in Germany at the time but now is.
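The anchoring and escaping ideas needed for such values can be tried out locally with Python's `re` module before sending a `Regex` to the server. The sample strings below are made up in the style the exercise hints at, and the real `"bornCountry"` values may differ. For patterns this simple, `re` is assumed to behave like the server's regular-expression engine.

```python
import re

# Made-up values; the real data may be formatted differently.
countries = [
    "Germany",
    "East Germany",
    "Germany (now Poland)",
    "Prussia (now Germany)",
]


def matching(pattern):
    """Return the sample values in which the pattern is found."""
    return [c for c in countries if re.search(pattern, c)]


print(matching("Germany"))      # unanchored: the substring may appear anywhere
print(matching("^Germany"))     # ^ anchors the match at the start of the value
print(matching(r"Germany\)$"))  # $ anchors at the end; \) is a literal parenthesis
print(re.escape("(now"))        # escaping a literal fragment programmatically
```

Parentheses are grouping metacharacters in a regular expression, so a literal `(` or `)` in the data has to be escaped with a backslash in the pattern.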

- Use a regular expression object to filter for laureates with "Germany" in their "bornCountry" value.

In [None]:
# Filter for laureates with "Germany" in their "bornCountry" value
criteria = {"bornCountry": Regex(____)}
print(set(db.laureates.distinct("bornCountry", criteria)))

- Use a regular expression object to filter for laureates with a "bornCountry" value starting with "Germany".

In [None]:
# Filter for laureates with a "bornCountry" value starting with "Germany"
criteria = {"bornCountry": ____(____)}
print(set(db.laureates.distinct("bornCountry", criteria)))

- Use a regular expression object to filter for laureates born in what was at the time Germany but is now another country.

In [None]:
# Fill in a string value to be sandwiched between the strings "^" and "now"
criteria = {"bornCountry": ____("^" + ____ + "now")}
print(set(db.laureates.distinct("bornCountry", criteria)))

- Use a regular expression object to filter for laureates born in what is now Germany but at the time was another country.

In [None]:
# Filter for currently-Germany countries of birth.
# Fill in a string value to be sandwiched between the strings "now" and "$"
criteria = {"bornCountry": ____("now" + ____ + "$")}
print(set(db.laureates.distinct("bornCountry", criteria)))

## The prized transistor

Three people shared a Nobel prize "for their researches on semiconductors and their discovery of the transistor effect". We can filter on "transistor" as a substring of a laureate's "prizes.motivation" field value to find these laureates.

- Save a filter `criteria` that finds laureates with `prizes.motivation` values containing "transistor" as a substring. The substring can appear anywhere within the value, so no anchoring characters are needed.
- Save to `first` and `last` the field names corresponding to a laureate's first name and last name (i.e. "surname") so that we can print out the names of these laureates.



In [None]:
# Save a filter for laureates with prize motivation values containing "transistor" as a substring
criteria = {____: Regex(____)}

# Save the field names corresponding to a laureate's first name and last name
first, last = ____, ____
print([(laureate[first], laureate[last]) for laureate in db.laureates.find(criteria)])