# 1. Flexibly Structured Data
This chapter is about getting a bird's-eye view of the Nobel Prize data's structure. You will relate MongoDB documents, collections, and databases to JSON and Python types. You'll then use filters, operators, and dot notation to explore substructure.
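
As a rough illustration of how JSON maps onto Python types, the sketch below parses a small JSON string with the standard `json` module. The document is only loosely shaped like a prize record; the `overallMotivation`, `awarded`, and `amount` values are hypothetical and were added to show more types.

```python
import json

# A small JSON document shaped loosely like a prize record.
# "overallMotivation", "awarded" and "amount" hold hypothetical values.
raw = """
{
  "year": "1901",
  "category": "physics",
  "laureates": [{"id": "1", "surname": "Röntgen", "share": "1"}],
  "overallMotivation": null,
  "awarded": true,
  "amount": 100
}
"""

doc = json.loads(raw)

# JSON object -> dict, array -> list, string -> str,
# number -> int or float, true/false -> bool, null -> None
for field, value in doc.items():
    print(f"{field:<18}{type(value).__name__}")
```

A MongoDB document reaches Python as a dictionary of the same kind, so nested objects become nested dictionaries and arrays become lists.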

In [1]:
# Testing mongosh
!mongosh --help


  $ mongosh [options] [db address] [file names (ending in .js or .mongodb)]

  Options:

    -h, --help                                 Show this usage information
    -f, --file [arg]                           Load the specified mongosh script
        --host [arg]                           Server to connect to
        --port [arg]                           Port to connect to
        --build-info                           Show build information
        --version                              Show version information
        --quiet                                Silence output from the shell during the connection process
        --shell                                Run the shell after executing files
        --nodb                                 Don't connect to mongod on startup - no 'db address' [arg] expected
        --norc                                 Will not run the '.mongoshrc.js' file on start up
        --eval [arg]                           Evaluate javascript
        -

In [2]:
# Importing libraries
import requests
import json

from pymongo import MongoClient
from pprint import pprint

In [3]:
# Client connects to "localhost" by default
client = MongoClient()

## 1.1 Intro to MongoDB and the Nobel Prize dataset

### The Nobel Prize API data(base)

In [4]:
# Create local "nobel" database on the fly
db = client["nobel"]

for collection_name in ["nobelPrizes", "laureates"]:
    # Collect the data from the API
    url = f"https://api.nobelprize.org/2.1/{collection_name}"
    response = requests.get(url,
                            headers={'User-Agent': 'Mozilla/5.0'})

    # Parse the JSON response and pull out the list of documents
    documents = response.json()[collection_name]

    # Create collections on the fly.
    # The insert is disabled because the current API returns only 25 documents per call.
    # db[collection_name].insert_many(documents)
    print(f'Collection {collection_name} added')

Collection nobelPrizes added
Collection laureates added


In [5]:
!dir data /b

laureates.json
prizes.json


In [6]:
# Adding data from files
db = client["nobel"]

for collection_name in ["prizes", "laureates"]:
    with open(f'data/{collection_name}.json') as f:
        documents = json.load(f)

    # Create collections on the fly
    # db[collection_name].insert_many(documents)
    print(f'Collection {collection_name} added')

Collection prizes added
Collection laureates added


### Accessing databases and collections

#### Using []

In [7]:
# A client behaves like a dictionary of databases
db = client["nobel"]

# A database behaves like a dictionary of collections
prizes_collection = db["prizes"]

#### Using .

In [8]:
# databases are attributes of a client
db = client.nobel

# collections are attributes of databases
laureates_collection = db.laureates

### Count documents in a collection

In [9]:
# Use an empty document {} as a filter to match every document
filter_doc = {}

# Count documents in a collection
n_prizes = db.prizes.count_documents(filter_doc)
n_laureates = db.laureates.count_documents(filter_doc)

print('Documents in prizes:', n_prizes)
print('Documents in laureates:', n_laureates)

Documents in prizes: 590
Documents in laureates: 934


### Listing databases and collections

In [10]:
# Save a list of names of the databases managed by client
db_names = client.list_database_names()
print(db_names)

# Save a list of names of the collections managed by the "nobel" database
nobel_coll_names = client.nobel.list_collection_names()
print(nobel_coll_names)

['admin', 'config', 'local', 'nobel']
['prizes', 'laureates']


### List fields of a document

The `.find_one()` method of a collection retrieves a single document. It accepts an optional filter argument that specifies the pattern the document must match. If you omit the filter or pass an empty document (`{}`), MongoDB returns the first document in the collection's internal order.

This method is useful when you want to learn the structure of documents in the collection.

In Python, the returned document takes the form of a dictionary:
```
    sample_doc = {'id' : 12345, 'name':'Donny Winston', 'instructor': True}
```

The **keys of the dictionary** are the (root-level) "fields" of the document, e.g. `'id', 'name','instructor'`.
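
Root-level keys tell only part of the story when a document contains nested values. The sketch below uses a trimmed copy of the Lorentz document and a hypothetical helper, `field_paths`, to list both the root-level fields and the nested ones as dotted paths. It is an illustration in plain Python, not a pymongo feature.

```python
def field_paths(value, prefix=""):
    """Yield a dotted path for every field reachable from a document."""
    if isinstance(value, dict):
        for key, sub in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from field_paths(sub, path)
    elif isinstance(value, list):
        # Array elements share the path of the array itself
        for item in value:
            yield from field_paths(item, prefix)


# Trimmed copy of a laureate document
laureate = {
    "firstname": "Hendrik Antoon",
    "surname": "Lorentz",
    "prizes": [{
        "category": "physics",
        "year": "1902",
        "affiliations": [{"name": "Leiden University", "city": "Leiden"}],
    }],
}

# Root-level fields only
print(list(laureate.keys()))

# Root-level and nested fields
print(sorted(set(field_paths(laureate))))
```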

In [11]:
# Retrieve sample prize and laureate documents
prize = db.prizes.find_one({})
laureate = db.laureates.find_one({})

# Print the sample prize and laureate documents
pprint(prize)
pprint(laureate)
print(type(laureate))

{'_id': ObjectId('6706d88371ea025ecc350d20'),
 'category': 'physics',
 'laureates': [{'firstname': 'Arthur',
                'id': '960',
                'motivation': '"for the optical tweezers and their application '
                              'to biological systems"',
                'share': '2',
                'surname': 'Ashkin'},
               {'firstname': 'Gérard',
                'id': '961',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Mourou'},
               {'firstname': 'Donna',
                'id': '962',
                'motivation': '"for their method of generating high-intensity, '
                              'ultra-short optical pulses"',
                'share': '4',
                'surname': 'Strickland'}],
 'overallMotivation': '“for groundbreaking inventions in the field of laser '
                   

In [12]:
# Get the fields present in each type of document
prize_fields = list(prize.keys())
laureate_fields = list(laureate.keys())

print(prize_fields)
print(laureate_fields)

['_id', 'year', 'category', 'overallMotivation', 'laureates']
['_id', 'id', 'firstname', 'surname', 'born', 'died', 'bornCountry', 'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode', 'gender', 'prizes']


## 1.2 Finding documents

### An example "laureates" document

In [13]:
laureate = db.laureates.find_one({})
pprint(laureate)

{'_id': ObjectId('6706d88371ea025ecc350f6e'),
 'born': '1853-07-18',
 'bornCity': 'Arnhem',
 'bornCountry': 'the Netherlands',
 'bornCountryCode': 'NL',
 'died': '1928-02-04',
 'diedCountry': 'the Netherlands',
 'diedCountryCode': 'NL',
 'firstname': 'Hendrik Antoon',
 'gender': 'male',
 'id': '2',
 'prizes': [{'affiliations': [{'city': 'Leiden',
                               'country': 'the Netherlands',
                               'name': 'Leiden University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary service they '
                           'rendered by their researches into the influence of '
                           'magnetism upon radiation phenomena"',
             'share': '2',
             'year': '1902'}],
 'surname': 'Lorentz'}


### Filters as (sub)documents

In [14]:
filter_doc = {
    'born': '1845-03-27',
    'diedCountry': 'Germany',
    'gender': 'male',
    'surname': 'Röntgen'
}
db.laureates.count_documents(filter_doc)

1

In [15]:
pprint(db.laureates.find_one(filter_doc))

{'_id': ObjectId('6706d88371ea025ecc350fae'),
 'born': '1845-03-27',
 'bornCity': 'Lennep (now Remscheid)',
 'bornCountry': 'Prussia (now Germany)',
 'bornCountryCode': 'DE',
 'died': '1923-02-10',
 'diedCity': 'Munich',
 'diedCountry': 'Germany',
 'diedCountryCode': 'DE',
 'firstname': 'Wilhelm Conrad',
 'gender': 'male',
 'id': '1',
 'prizes': [{'affiliations': [{'city': 'Munich',
                               'country': 'Germany',
                               'name': 'Munich University'}],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services he '
                           'has rendered by the discovery of the remarkable '
                           'rays subsequently named after him"',
             'share': '1',
             'year': '1901'}],
 'surname': 'Röntgen'}
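
One way to read a filter document is as a subdocument: a document matches when it contains every field/value pair in the filter. The sketch below imitates that idea in plain Python on two trimmed laureate documents. The `matches` helper is hypothetical and handles only simple equality on root-level fields, so it is far narrower than MongoDB's real matching.

```python
def matches(document, filter_doc):
    """True if every field/value pair in filter_doc appears in document."""
    return all(
        field in document and document[field] == expected
        for field, expected in filter_doc.items()
    )


# Trimmed copies of two laureate documents
rontgen = {"born": "1845-03-27", "diedCountry": "Germany",
           "gender": "male", "surname": "Röntgen"}
lorentz = {"born": "1853-07-18", "diedCountry": "the Netherlands",
           "gender": "male", "surname": "Lorentz"}

filter_doc = {
    "born": "1845-03-27",
    "diedCountry": "Germany",
    "gender": "male",
    "surname": "Röntgen",
}

# Count the matching documents, in the spirit of count_documents()
count = sum(matches(doc, filter_doc) for doc in [rontgen, lorentz])
print(count)

# An empty filter places no constraints, so it matches everything
print(matches(lorentz, {}))
```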


### Simple filters

In [16]:
db.laureates.count_documents({'gender': 'female'})

51

In [17]:
db.laureates.count_documents({'diedCountry': 'France'})

50

In [18]:
db.laureates.count_documents({'bornCity': 'Warsaw'})

2

### Composing filters

In [19]:
filter_doc = {
    'gender': 'female',
    'diedCountry': 'France',
    'bornCity': 'Warsaw'
}
db.laureates.count_documents(filter_doc)

1

In [20]:
pprint(db.laureates.find_one(filter_doc))

{'_id': ObjectId('6706d88371ea025ecc350fb2'),
 'born': '1867-11-07',
 'bornCity': 'Warsaw',
 'bornCountry': 'Russian Empire (now Poland)',
 'bornCountryCode': 'PL',
 'died': '1934-07-04',
 'diedCity': 'Sallanches',
 'diedCountry': 'France',
 'diedCountryCode': 'FR',
 'firstname': 'Marie',
 'gender': 'female',
 'id': '6',
 'prizes': [{'affiliations': [[]],
             'category': 'physics',
             'motivation': '"in recognition of the extraordinary services they '
                           'have rendered by their joint researches on the '
                           'radiation phenomena discovered by Professor Henri '
                           'Becquerel"',
             'share': '4',
             'year': '1903'},
            {'affiliations': [{'city': 'Paris',
                               'country': 'France',
                               'name': 'Sorbonne University'}],
             'category': 'chemistry',
             'motivation': '"in recognition of her services to the a

### Query operators
#### Value in a list: `$in: <list>`

In [21]:
db.laureates.count_documents({
    'diedCountry': {
        '$in': ['France', 'USA']
    }
})

259

#### Not equal: `$ne: <value>`

In [22]:
db.laureates.count_documents({
    'diedCountry': {
        '$ne': 'France'
    }
})

884

#### Comparison: `> : $gt`, `≥ : $gte`, `< : $lt`, and `≤ : $lte`
Strings are compared lexicographically.

In [23]:
db.laureates.count_documents({
    'diedCountry': {
        '$gt': 'Belgium',
        '$lte': 'USA'
    }
})

455
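
The operators above can be pictured as ordinary Python comparisons applied to a field's value. The sketch below is a simplified model with a hypothetical `satisfies` helper, run on a handful of country names. It assumes Python's string ordering agrees with MongoDB's default ordering for these plain ASCII names.

```python
import operator

# Each query operator maps to a plain Python comparison
OPERATORS = {
    "$in": lambda value, options: value in options,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def satisfies(value, condition):
    """Check one field value against a literal or an operator document."""
    if isinstance(condition, dict):
        # Every operator in the condition must hold
        return all(OPERATORS[op](value, operand)
                   for op, operand in condition.items())
    return value == condition


countries = ["Austria", "Belgium", "France", "Germany", "USA"]

in_list = [c for c in countries if satisfies(c, {"$in": ["France", "USA"]})]
not_france = [c for c in countries if satisfies(c, {"$ne": "France"})]
in_range = [c for c in countries
            if satisfies(c, {"$gt": "Belgium", "$lte": "USA"})]

print(in_list)
print(not_france)
print(in_range)
```

The model leaves out missing fields. In MongoDB, `$ne` also matches documents that lack the field altogether, which is consistent with the counts above: 934 laureates in total minus the 50 who died in France gives 884.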

### Exercises

In [24]:
# Count the laureates born before the year 1900
db.laureates.count_documents({"born": {"$lt": "1900"}})

324

In [25]:
# How many laureates were born before 1800?
db.laureates.count_documents({"born": {"$lt": "1800"}})

38

In [26]:
# What about before 1700?
db.laureates.count_documents({"born": {"$lt": "1700"}})

38
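
The identical counts for 1800 and 1700 follow from how strings are compared. The `born` field holds `YYYY-MM-DD` strings, so a comparison against `"1900"` proceeds character by character, and the placeholder `"0000-00-00"` used for unknown birthdates (see the exercises at the end of this chapter) sorts before every real year. A minimal sketch in plain Python, using a few `born` values from the documents shown so far:

```python
# A few "born" values, plus the placeholder for an unknown birthdate
born_values = ["1845-03-27", "1853-07-18", "1923-03-09", "0000-00-00"]


def born_before(values, year):
    """Keep the date strings that sort before the given year string."""
    return [v for v in values if v < year]


print(born_before(born_values, "1900"))
print(born_before(born_values, "1700"))
```

Only the placeholder survives the second comparison, so a count of laureates "born before 1700" really counts unknown birthdates.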

In [27]:
# Create a filter for Germany-born laureates who died in the USA 
# and with the first name "Albert"
criteria = {'diedCountry': 'USA', 
            'bornCountry': 'Germany', 
            'firstname': 'Albert'}

# Save the count
count = db.laureates.count_documents(criteria)
print(count)

1


In [28]:
# Save a filter for laureates born in the USA, Canada, or Mexico
criteria = { 'bornCountry': 
                { "$in": ['USA', 'Canada', 'Mexico']}
             }

# Count them and save the count
count = db.laureates.count_documents(criteria)
print(count)

291


In [29]:
# Save a filter for laureates who died in the USA and were not born there
criteria = { 'diedCountry': 'USA',
             'bornCountry': { "$ne": 'USA'}, 
           }

# Count them
count = db.laureates.count_documents(criteria)
print(count)

69


## 1.3 Dot notation: reach into substructure

### A functional density

In [30]:
walter_kohn = db.laureates.find_one({
    "firstname": "Walter",
    "surname": "Kohn"
})
pprint(walter_kohn)

{'_id': ObjectId('6706d88371ea025ecc3510e6'),
 'born': '1923-03-09',
 'bornCity': 'Vienna',
 'bornCountry': 'Austria',
 'bornCountryCode': 'AT',
 'died': '2016-04-19',
 'diedCity': 'Santa Barbara, CA',
 'diedCountry': 'USA',
 'diedCountryCode': 'US',
 'firstname': 'Walter',
 'gender': 'male',
 'id': '290',
 'prizes': [{'affiliations': [{'city': 'Santa Barbara, CA',
                               'country': 'USA',
                               'name': 'University of California'}],
             'category': 'chemistry',
             'motivation': '"for his development of the density-functional '
                           'theory"',
             'share': '2',
             'year': '1998'}],
 'surname': 'Kohn'}


In [31]:
# The parentheses are redundant: ("...") is a plain string, not a tuple
db.laureates.count_documents({
    "prizes.affiliations.name": ("University of California")
})

34

In [32]:
db.laureates.count_documents({
    "prizes.affiliations.name": "University of California"
})

34

In [33]:
db.laureates.count_documents({
    "prizes.affiliations.city": "Berkeley, CA"
})

19
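
Dot notation can be thought of as walking a path through nested dictionaries, fanning out whenever the path crosses an array. The sketch below models that walk in plain Python on a trimmed copy of the Walter Kohn document. The `resolve` helper is hypothetical and only approximates how MongoDB evaluates such a path.

```python
def resolve(value, path):
    """Collect every value reachable by following a dotted path."""
    if not path:
        return [value]
    if isinstance(value, list):
        # Fan out: try the same path on each array element
        results = []
        for item in value:
            results.extend(resolve(item, path))
        return results
    if isinstance(value, dict):
        key, _, rest = path.partition(".")
        if key in value:
            return resolve(value[key], rest)
    return []


# Trimmed copy of the Walter Kohn document
walter_kohn = {
    "firstname": "Walter",
    "surname": "Kohn",
    "prizes": [{
        "affiliations": [{"city": "Santa Barbara, CA",
                          "country": "USA",
                          "name": "University of California"}],
        "category": "chemistry",
        "year": "1998",
    }],
}

names = resolve(walter_kohn, "prizes.affiliations.name")
print(names)

# An equality filter on the path matches if any resolved value is equal
print("University of California" in names)
```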

### Operator `$exists`

In [34]:
# No bornCountry for Naipaul
pprint(db.laureates.find_one({'surname': 'Naipaul'}))

{'_id': ObjectId('6706d88371ea025ecc3511bb'),
 'born': '1932-08-17',
 'died': '2018-08-11',
 'diedCity': 'London',
 'diedCountry': 'United Kingdom',
 'diedCountryCode': 'GB',
 'firstname': 'Sir Vidiadhar Surajprasad',
 'gender': 'male',
 'id': '747',
 'prizes': [{'affiliations': [[]],
             'category': 'literature',
             'motivation': '"for having united perceptive narrative and '
                           'incorruptible scrutiny in works that compel us to '
                           'see the presence of suppressed histories"',
             'share': '1',
             'year': '2001'}],
 'surname': 'Naipaul'}


In [35]:
db.laureates.count_documents({
    "bornCountry": {"$exists": False}
})

33

In [36]:
# Total Documents
db.laureates.count_documents({})

934

In [37]:
# Documents with prize data - All documents have prize data
db.laureates.count_documents({"prizes": {"$exists": True}})

934

In [38]:
# Accessing the first element of prize data
# All laureates have at least one prize
db.laureates.count_documents({"prizes.0": {"$exists": True}})

934

In [39]:
# How many laureates have more than one prize?
db.laureates.count_documents({"prizes.1": {"$exists": True}})

6
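
A numeric path segment such as the `1` in `prizes.1` addresses an array position, so `$exists` on that path amounts to asking whether the array is long enough. The sketch below models this in plain Python with a hypothetical `path_exists` helper, using trimmed copies of two documents from this chapter. It ignores edge cases such as embedded fields whose names are digits.

```python
def path_exists(value, path):
    """True if a dotted path, possibly with array indexes, resolves."""
    if not path:
        return True
    key, _, rest = path.partition(".")
    if isinstance(value, list):
        if key.isdigit():
            # Numeric segment: treat it as an array position
            index = int(key)
            return index < len(value) and path_exists(value[index], rest)
        # Otherwise the path may resolve inside any element
        return any(path_exists(item, path) for item in value)
    if isinstance(value, dict):
        return key in value and path_exists(value[key], rest)
    return False


# Trimmed copies: Naipaul has one prize and no bornCountry,
# the Red Cross has three prizes
naipaul = {"surname": "Naipaul",
           "prizes": [{"category": "literature", "year": "2001"}]}
red_cross = {"gender": "org",
             "prizes": [{"category": "peace", "year": "1917"},
                        {"category": "peace", "year": "1944"},
                        {"category": "peace", "year": "1963"}]}

print(path_exists(naipaul, "bornCountry"))   # the field is absent
print(path_exists(naipaul, "prizes.1"))      # no second prize
print(path_exists(red_cross, "prizes.2"))    # a third prize exists
```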

### Exercises

In [40]:
# Count the laureates born in Austria for whom no prize
# affiliation country is Austria
criteria = {
    'bornCountry': 'Austria',
    'prizes.affiliations.country': {"$ne": 'Austria'},
}
db.laureates.count_documents(criteria)

10
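
When a dotted path crosses arrays, `$ne` holds only if none of the values reached by the path equals the given value, so a laureate with affiliations in both Austria and another country would not be counted. The sketch below spells out that reading in plain Python. The helper names are hypothetical, and the `two_countries` and `no_affiliation` documents are made up for illustration.

```python
def affiliation_countries(laureate):
    """List every prizes.affiliations.country value in a document."""
    return [
        aff["country"]
        for prize in laureate.get("prizes", [])
        for aff in prize.get("affiliations", [])
        if isinstance(aff, dict) and "country" in aff
    ]


def born_in_austria_affiliated_elsewhere(laureate):
    """Mirror the filter: Austria-born, and no affiliation in Austria."""
    return (laureate.get("bornCountry") == "Austria"
            and "Austria" not in affiliation_countries(laureate))


# Trimmed copy of the Walter Kohn document
walter_kohn = {"bornCountry": "Austria",
               "prizes": [{"affiliations": [{"country": "USA"}]}]}

# Hypothetical documents, not taken from the dataset
two_countries = {"bornCountry": "Austria",
                 "prizes": [{"affiliations": [{"country": "Austria"},
                                              {"country": "USA"}]}]}
no_affiliation = {"bornCountry": "Austria",
                  "prizes": [{"affiliations": [[]]}]}

print(born_in_austria_affiliated_elsewhere(walter_kohn))
print(born_in_austria_affiliated_elsewhere(two_countries))
print(born_in_austria_affiliated_elsewhere(no_affiliation))
```

The last case shows that a laureate with no recorded affiliation also passes the `$ne` test, since there is no value that could equal `'Austria'`.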

In [41]:
# When a birthdate is unknown, the "born" field has the value "0000-00-00"
# Count the laureates with an unknown birthdate
db.laureates.count_documents({"born": "0000-00-00"})

38

In [42]:
# Filter for documents without a "born" field
criteria = {'born': {"$exists": False}}
count = db.laureates.count_documents(criteria)
print(count)

0


In [43]:
# Find a document for a laureate with at least three 
# elements in its "prizes" array
criteria = {"prizes.2": {"$exists": True}}
doc = db.laureates.find_one(criteria)
pprint(doc)

{'_id': ObjectId('6706d88371ea025ecc3510a4'),
 'born': '0000-00-00',
 'died': '0000-00-00',
 'firstname': 'Comité international de la Croix Rouge (International Committee '
              'of the Red Cross)',
 'gender': 'org',
 'id': '482',
 'prizes': [{'affiliations': [[]],
             'category': 'peace',
             'share': '1',
             'year': '1917'},
            {'affiliations': [[]],
             'category': 'peace',
             'share': '1',
             'year': '1944'},
            {'affiliations': [[]],
             'category': 'peace',
             'share': '2',
             'year': '1963'}]}


--------------------------------

In [44]:
# close MongoDB connection
client.close()

----------------------------------