# Scripting Week 9: MongoDB and Semi-Structured Data

In [30]:
import pandas as pd

## Agenda

- Lab questions from last week
   - Skills questions: are you curious about doing more with regular expressions?
- JSON
- MongoDB
- Work time - lab and final projects

## Announcements

- RMIS conference: please apply, please.
- Andrea Thomer talk, Thursday at 6:20
  - Data curation expert from UMich
- Alert Level Clear! - with caveat...

# Semi-structured data

Semi-structured data has a structure, but it is specific to the needs of the data.

Tabular data always has two dimensions - rows and columns (aka 'records' and 'fields') - while semi-structured data is less predictable.

### Common Semi-Structured Formats

JSON
- Hierarchical
- Organized around key/value pairs and lists of values

XML
- Hierarchical, enclosed
- Organized with tags, properties, and values

### XML, briefly

XML uses tags and attributes to mark up elements and their content.

![Example XML](../images/xml1.png)

TAGS

- `<catalog>`
- `<book>`
- `<author>`

ELEMENTS
- Everything from `<book>` to `</book>`

ATTRIBUTES
- The ‘id’ of `<book>`

CONTENT
- “Gambardella, Matthew”
- “XML Developer’s Guide”
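
The same vocabulary can be seen by parsing a small document with Python's standard-library `xml.etree.ElementTree`. This is only a sketch: the XML string below is a hypothetical stand-in for the pictured example, and the `id` value `bk101` and the `<title>` tag name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML modeled on the slide's example; id value and <title> tag are assumed.
xml_text = """
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
</catalog>
""".strip()

catalog = ET.fromstring(xml_text)   # root ELEMENT, with TAG 'catalog'
book = catalog.find("book")         # first <book> element inside it

print(catalog.tag)                  # the tag name of the root
print(book.attrib["id"])            # ATTRIBUTE on <book>
print(book.find("author").text)     # CONTENT of <author>
```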

## JSON

JSON is made up of a few data value types:

- string
- number (like a float in Python)
- object (named collection, like a dictionary in Python)
- array  (unnamed collection, like a list in Python)
- boolean
- null

Our 'containers' are objects and arrays. They can hold any of the other data types, *including other arrays and objects*.


### Object

Key-value pairs, surrounded by curly braces.

- `:` separates a key from its value; `,` separates the pairs.
- Keys have to be strings, values can be anything.

### Arrays

Nearly identical to lists in Python. Values, surrounded by square brackets.

- `,` separates items.
- Values can be any data type.

### Objects 

Basic form:

```json
{
  "name": "Jill"
}
```

Adding key-value pairs:
    
```json
{
  "name": "Jill",
  "profession": "Juggler"
}
```

Values can be other types:
    
```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51
}
```

JSON is language-agnostic, so some things differ from Python, e.g. *true / false* (lowercase) rather than Python's *True / False*:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true
}
```
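
A quick way to see the difference is to pass JSON text through Python's built-in `json` module. In this sketch the `manager` key is a hypothetical addition, included only to show how `null` is handled.

```python
import json

# JSON text uses lowercase true/false and null ("manager" is a made-up key)
text = '{"name": "Jill", "active": true, "manager": null, "age": 51}'

person = json.loads(text)        # JSON text -> Python dict
print(person["active"])          # JSON true becomes Python True
print(person["manager"])         # JSON null becomes Python None

round_trip = json.dumps(person)  # Python dict -> JSON text again
print(round_trip)
```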

The values can be arrays and objects - that's how things start to get deep:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"]
}
```

Nested Object:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": {
      "Tri-City Juggling Symposium 2010": "Technical Program, Silver",
      "Ballympics 2009": "Best Overall, Runner-up"
  }
}
```

List of objects, why not?

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": [
        {
            "event": "Tri-City Juggling Symposium",
             "year": 2010,
             "awards": ["Technical Program, Silver"]
        },
        {
            "event": "Ballympics",
             "year": 2009,
             "awards": ["Best Overall, Runner-up"]
        }
    ]
}
```

Whitespace doesn't matter:

```json
{"name": "Jill", "profession": "Juggler", "age": 51, "active": true, "tools": ["ball", "handkerchief", "flaming motorcycle"], "awards": [{"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]}, {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}]}
```
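
To check this for yourself, here is a small sketch using Python's `json` module with a shortened version of Jill's record. Whitespace *between* tokens is ignored; whitespace *inside* a string, such as the space in `"flaming motorcycle"`, is part of the value.

```python
import json

pretty = """
{
  "name": "Jill",
  "tools": ["ball", "handkerchief", "flaming motorcycle"]
}
"""
compact = '{"name":"Jill","tools":["ball","handkerchief","flaming motorcycle"]}'

# Both strings parse to the same Python object
same = json.loads(pretty) == json.loads(compact)
print(same)

# indent= adds whitespace back when writing JSON out
print(json.dumps(json.loads(compact), indent=2))
```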

### Reading in Python

In [36]:
person = {"name": "Jill", "profession": "Juggler", "age": 51, "active": True,
          "tools": ["ball", "handkerchief", "flaming motorcycle"], 
          "awards": [
              {"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]},
              {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}
          ]}

In [None]:
person['name']

In [41]:
a = person['tools']
a[0]

'ball'

In [None]:
person['awards'][0]['event']
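
Because the containers nest, loops nest too. The sketch below repeats the `person` dictionary so that it runs on its own, then walks every award; the `email` key is hypothetical and is only there to show `.get()` on a missing key.

```python
person = {"name": "Jill", "profession": "Juggler", "age": 51, "active": True,
          "tools": ["ball", "handkerchief", "flaming motorcycle"],
          "awards": [
              {"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]},
              {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}
          ]}

# Outer loop: one dict per event; inner loop: the list of awards won there
for entry in person["awards"]:
    for title in entry["awards"]:
        print(entry["year"], entry["event"], "-", title)

# .get() returns a default instead of raising KeyError when a key is absent
email = person.get("email", "no email on file")
print(email)
```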

### Exercises

Develop a JSON structure for representing the following:

- a directory of current mayors, organized by cities and subdivided by state
    - Try with 3 cities from 2 states
- a collection of movies, each with basic metadata (e.g. title, director, year) and with viewer opinion information (ratings or reviews)
   - Try with *Black Panther* and *Avengers: Endgame*

# MongoDB

A semi-structured, non-relational (NoSQL) database, which uses a JSON-like format for storing information.

### Terminology

<center>**Database Management System (MongoDB)**</center>
<center>*can have one or more*</center>

<center>**Databases**</center>

<center>*can have one or more*</center>
<center>**Collections**</center>

<center>*can have one or more*</center>
<center>**Documents**</center>

Compare to relational databases - How do the terms align?

*Documents* are JSON-like objects; *Collections* are lists of those objects.

```
[
    {
      "name": "Jill",
      "profession": "Juggler"
    },
    { "name": "Jack",
      "profession": "Unemployed"
    }
]
```

Before we learn about MongoDB, let's get it ready on our systems so we can be hands-on!

- Turn to Lab and do the small JSON primer

Connecting to a database called 'week9slides'. I'll put all my collections there this week.

In [1]:
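from urllib.parse import quote_plus
from pymongo import MongoClient

# Loading my credentials from a file (three lines: user, password, cluster URL)
with open('credentials.txt', mode='r') as f:
    user, mongopw, cluster_url = [l.strip() for l in f.readlines()]

# quote_plus escapes characters such as '@' or ':' that would otherwise break the URI
client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(
    quote_plus(user), quote_plus(mongopw), cluster_url))
db = client.week9slides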

In [78]:
# Reset for future years
x = db.drop_collection('example')
x = db.drop_collection('example2')

New `example` collection:

In [4]:
collection = db.example

## Insert One

In [43]:
doc = {
  "name": "Jill",
  "age": 51,
  "profession": "Juggler"
}
collection.insert_one(doc) 

<pymongo.results.InsertOneResult at 0x7f1ee6b69e40>

## Retrieving Documents

MongoDB retrieves documents with `find` or `find_one`, using a JSON-like query document.

`find_one` returns a single matching document as a Python dictionary (or `None` if nothing matches):

In [44]:
collection.find_one(
    {"age": 51}
)

{'_id': ObjectId('5ec72e9c5138fc0870d52557'),
 'name': 'Jill',
 'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}

`find` returns a `Cursor`, an object that can be iterated over:

In [45]:
collection.find({'name': 'Jill'})

<pymongo.cursor.Cursor at 0x7f1ee6b6d7f0>

In [53]:
results = collection.find({'name': 'Jill'})
for result in results:
    print(result)

{'_id': ObjectId('5ec72e9c5138fc0870d52557'), 'name': 'Jill', 'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}
{'_id': ObjectId('5ecd9bf8db075a3e645f62ab'), 'name': 'Jill', 'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}
{'_id': ObjectId('621e98743b4e98756d5b88df'), 'name': 'Jill', 'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}
{'_id': ObjectId('621ebd843b4e98756d5b88e2'), 'name': 'Jill', 'age': 51, 'profession': 'Juggler'}


*Try iterating over the `results` variable again without doing another search. What happens?*

In [47]:
for result in results:
    print(result)

The `cursor` is a pointer to the database, so you're not holding all the data in memory - you're just pointing to it. Good for really large datasets!
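
The single-pass behavior can be imitated without a database. In this sketch, `fake_find` is a hypothetical stand-in for `collection.find`: a plain Python generator that hands out matches lazily. It is not how pymongo is implemented, but it runs dry after one pass in the same way.

```python
def fake_find(docs, name):
    """Hypothetical stand-in for collection.find: yields matches one at a time."""
    for doc in docs:
        if doc["name"] == name:
            yield doc

docs = [{"name": "Jill", "age": 51}, {"name": "Jack", "age": 50}]

results = fake_find(docs, "Jill")
first_pass = [doc for doc in results]    # consumes the generator
second_pass = [doc for doc in results]   # nothing left to hand out

print(len(first_pass), len(second_pass))
```

With a real pymongo cursor, you can run the query again or call the cursor's `rewind()` method to start over.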

Alternatively, you can convert everything to a list, but don't do this if you have a lot of data!

In [48]:
results = collection.find({'name': 'Jill'})
list_of_results = list(results)
list_of_results

[{'_id': ObjectId('5ec72e9c5138fc0870d52557'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5ecd9bf8db075a3e645f62ab'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('621e98743b4e98756d5b88df'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('621ebd843b4e98756d5b88e2'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'}]

## Insert Many

In [55]:
docs = [{ "name": "Jack", "age": 50, "profession": "Unemployed" },
        { "name": "Jun Ho", "age": 34, "profession": "Juggler" }
       ]
collection.insert_many(docs)

<pymongo.results.InsertManyResult at 0x7f1ee6b57f40>

In [80]:
results = collection.find({"profession": "Juggler"})
list(results)

[]

### Count

In [79]:
collection.estimated_document_count()

0

## Retrieving Everything

*What would our search be to get everything?*

In [58]:
results = collection.find({})
list(results)

[{'_id': ObjectId('5ec72e9c5138fc0870d52557'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5ec72e9d5138fc0870d52558'),
  'name': 'Jack',
  'profile': {'age': 50, 'hobbies': ['internet commenting']}},
 {'_id': ObjectId('5ec72e9e5138fc0870d52559'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}},
 {'_id': ObjectId('5ecd9bf8db075a3e645f62ab'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5ecd9bfcdb075a3e645f62ac'),
  'name': 'Jack',
  'profile': {'age': 50, 'hobbies': ['internet commenting']}},
 {'_id': ObjectId('5ecd9c19db075a3e645f62ad'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}},
 {'_id': ObjectId('621e98743b4e98756d5b88df'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('621e98753b4e98756d5b88e0'),
  'name': 'Jack',
  'profile': {'age': 50, 'h

## `Find` by Example

Exact match:

In [59]:
results = collection.find({
            "age": 50
        })
list(results)

[{'_id': ObjectId('621ebfa03b4e98756d5b88e3'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

Multiple conditions:

In [16]:
results = collection.find({
            "age": 50,
            "profession": "Unemployed"
        })
list(results)

[{'_id': ObjectId('621e983e3b4e98756d5b88dd'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

Greater than (`$gt`), less than (`$lt`), greater than or equal to (`$gte`), less than or equal to (`$lte`):

In [61]:
results = collection.find({
            "age": { "$gt": 50 }
        })
list(results)

[{'_id': ObjectId('621ebd843b4e98756d5b88e2'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'}]

In [18]:
results = collection.find({
            "age": { "$gte": 50 }
        })
list(results)

[{'_id': ObjectId('621e981a3b4e98756d5b88dc'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('621e983e3b4e98756d5b88dd'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

```json
{ "$gt": 50 }
```

The `$` tells Mongo that this is a special query operator, not simply a field named `gt`.

In [69]:
results = collection.find({
            "age": { "$gte": 50 }
        })
list(results)

[{'_id': ObjectId('621ebd843b4e98756d5b88e2'),
  'name': 'Jill',
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('621ebfa03b4e98756d5b88e3'),
  'name': 'Jack',
  'age': 50,
  'profession': 'Unemployed'}]

And more:
    
- Combine expressions with `$or`, `$and`
- Match against multiple values with `$in`, `$all`
    
*What do these do? Can we figure out how to use them?*
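
One way to reason about these operators is a toy matcher in plain Python. Everything below is a hypothetical illustration of the semantics for a small subset of operators (`$all` and nested fields are left out). It is not MongoDB's implementation, and the `matches` helper is invented for this sketch. The `either` and `names` dictionaries, though, are written in the same shape you would pass to `collection.find`.

```python
import operator

# Comparison operators, mapped to plain Python functions
COMPARISONS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
}

def matches(doc, query):
    """Toy matcher: True if doc satisfies a small subset of Mongo-style queries."""
    for key, condition in query.items():
        if key == "$or":       # at least one sub-query must match
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif key == "$and":    # every sub-query must match
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # e.g. {"age": {"$gte": 50}}: every operator listed must hold
            if key not in doc:
                return False
            for op, operand in condition.items():
                if not COMPARISONS[op](doc[key], operand):
                    return False
        elif doc.get(key) != condition:   # plain exact match
            return False
    return True

people = [
    {"name": "Jill", "age": 51, "profession": "Juggler"},
    {"name": "Jack", "age": 50, "profession": "Unemployed"},
    {"name": "Jun Ho", "age": 34, "profession": "Juggler"},
]

either = {"$or": [{"age": {"$lt": 40}}, {"profession": "Unemployed"}]}
names = {"name": {"$in": ["Jill", "Jun Ho"]}}

or_hits = [p["name"] for p in people if matches(p, either)]
in_hits = [p["name"] for p in people if matches(p, names)]
print(or_hits)
print(in_hits)
```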

### Specifying the return fields

The 2nd argument to `find` or `find_one` specifies which fields to include (`_id` is returned unless explicitly excluded):

In [71]:
results = collection.find({}, { "name": 1 })
list(results)

[{'_id': ObjectId('5ec72e9c5138fc0870d52557'), 'name': 'Jill'},
 {'_id': ObjectId('5ec72e9d5138fc0870d52558'), 'name': 'Jack'},
 {'_id': ObjectId('5ec72e9e5138fc0870d52559'), 'name': 'Ju Ho'},
 {'_id': ObjectId('5ecd9bf8db075a3e645f62ab'), 'name': 'Jill'},
 {'_id': ObjectId('5ecd9bfcdb075a3e645f62ac'), 'name': 'Jack'},
 {'_id': ObjectId('5ecd9c19db075a3e645f62ad'), 'name': 'Ju Ho'},
 {'_id': ObjectId('621e98743b4e98756d5b88df'), 'name': 'Jill'},
 {'_id': ObjectId('621e98753b4e98756d5b88e0'), 'name': 'Jack'},
 {'_id': ObjectId('621e98773b4e98756d5b88e1'), 'name': 'Ju Ho'},
 {'_id': ObjectId('621ebd843b4e98756d5b88e2'), 'name': 'Jill'},
 {'_id': ObjectId('621ebfa03b4e98756d5b88e3'), 'name': 'Jack'},
 {'_id': ObjectId('621ebfa03b4e98756d5b88e4'), 'name': 'Jun Ho'}]

In [None]:
results = collection.find({}, { "age": 1, "name": 1 })
list(results)  

Or by exclusion:

In [22]:
results = collection.find({}, { "_id": 0 })
list(results)

[{'name': 'Jill', 'age': 51, 'profession': 'Juggler'},
 {'name': 'Jack', 'age': 50, 'profession': 'Unemployed'},
 {'name': 'Jun Ho', 'age': 34, 'profession': 'Juggler'}]

*Why use Mongo rather than an SQL DB?*

# <center>Nested Fields and Arrays</center>

New collection:

In [81]:
collection = db.example2

In [82]:
doc = {
  "name": "Jill",
  "profile": {
      "age": 51,
      "hobbies": ["jogging", "juggling"]
  },
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x7f1ee6b1a280>

In [85]:
doc = {
  "name": "Jack",
  "profile": {
      "age": 50,
      "hobbies": ["internet commenting"]
  }
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x7f1ee6b0a140>

In [86]:
doc = {
  "name": "Ju Ho",
  "profile": {
      "age": 34,
      "hobbies": ["petting dogs", "juggling"]
  }
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x7f1ee6b1a840>

In [87]:
collection.estimated_document_count()

3

'dot' notation: separate the names of nested fields with a period, e.g. `profile.age`

In [88]:
results = collection.find( { "profile.age": 51} )
list(results)

[{'_id': ObjectId('621ec6b63b4e98756d5b88e5'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}]

In [29]:
results = collection.find( { "profile.hobbies": "juggling"} )
list(results)

[{'_id': ObjectId('5ec72e9c5138fc0870d52557'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5ec72e9e5138fc0870d52559'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}},
 {'_id': ObjectId('5ecd9bf8db075a3e645f62ab'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5ecd9c19db075a3e645f62ad'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}},
 {'_id': ObjectId('621e98743b4e98756d5b88df'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('621e98773b4e98756d5b88e1'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}}]
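
Two things happen in the queries above: the dotted path is followed down through the nested object, and when the field holds an array, the query matches if *any* element equals the value. The sketch below imitates both in plain Python. `get_path` and `field_matches` are hypothetical helpers written for illustration, not part of pymongo.

```python
def get_path(doc, dotted):
    """Follow a dotted path such as 'profile.age' through nested dicts."""
    value = doc
    for part in dotted.split("."):
        value = value[part]
    return value

def field_matches(doc, dotted, wanted):
    """Equality test; if the field holds a list, any element may match."""
    value = get_path(doc, dotted)
    if isinstance(value, list):
        return wanted in value
    return value == wanted

jill = {"name": "Jill", "profile": {"age": 51, "hobbies": ["jogging", "juggling"]}}
jack = {"name": "Jack", "profile": {"age": 50, "hobbies": ["internet commenting"]}}

print(get_path(jill, "profile.age"))
print(field_matches(jill, "profile.hobbies", "juggling"))
print(field_matches(jack, "profile.hobbies", "juggling"))
```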