# Scripting Week 7: MongoDB and Semi-Structured Data

In [18]:
import pandas as pd

## Announcements

## Review

Load sample data:

In [164]:
movies = pd.read_csv('../data/movielens_small.csv')
df = movies.sample(n=5, random_state=12345).set_index('title')
df

title,userId,rating,genres,timestamp,year
Bringing Up Baby,481,4.0,Comedy,1437001472,1938
"Long, Hot Summer, The",311,3.5,Drama,1061927755,1958
"Net, The",191,3.0,Action,839925608,1995
City Lights,648,4.5,Comedy,1176754888,1931
Eagle vs Shark,132,4.0,Comedy,1284496709,2007


In [178]:
df

title,userId,rating,genres,timestamp,year
Bringing Up Baby,481,4.0,Comedy,1437001472,1938
"Long, Hot Summer, The",311,3.5,Drama,1061927755,1958
"Net, The",191,3.0,Action,839925608,1995
City Lights,648,4.5,Comedy,1176754888,1931
Eagle vs Shark,132,4.0,Comedy,1284496709,2007


How would you select:
    
- Rows 2:4
- The row titled "City Lights"
- The first, second, and fifth rows
- The columns `genres` and `year`
- The column `genres` as a Series
- The column `genres` as a DataFrame
- The rows where the year is `> 1990`
- The rows where the `genres` value is `Action` or `Drama`

### Selecting DataFrames

Everything follows the pattern:

`df[ ... ]`

Except selecting rows by index name, which uses:

`df.loc[ ... ]`

**Selecting rows by numeric index**

Provide a slice in `x:y` notation: `df[10:14]`

**Selecting rows by index name**

Provide the name to `.loc[]`: `df.loc['Sherlock Holmes']`

**Selecting rows by inclusion criteria**

Provide any collection (e.g. a list or Series) of True/False values:

```
df[[True, False, False, True, True]]
```

```
df[df.year > 1996]
```

**Selecting multiple columns**

Provide a collection of strings, referencing the column names:

```
df[['genres', 'year']]
```
    
**Selecting single column (as Series)**

```
df['year']
```

Or:

```
df.year
```

Consider the latter as the shortcut, not the main way.

The output is a Series. To select a single column as a DataFrame, use a list with only one value: `df[['genres']]`.
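As a rough recap, the selection patterns above can be tried on the five sample rows. This sketch rebuilds the rows by hand so it runs without the CSV file; the variable names are only illustrative, and `.iloc` (selection by position) is an addition that the notes above do not cover.

```python
import pandas as pd

# Rebuild the five sample rows by hand so the sketch runs without the CSV file.
df = pd.DataFrame(
    {
        "title": ["Bringing Up Baby", "Long, Hot Summer, The", "Net, The",
                  "City Lights", "Eagle vs Shark"],
        "userId": [481, 311, 191, 648, 132],
        "rating": [4.0, 3.5, 3.0, 4.5, 4.0],
        "genres": ["Comedy", "Drama", "Action", "Comedy", "Comedy"],
        "timestamp": [1437001472, 1061927755, 839925608, 1176754888, 1284496709],
        "year": [1938, 1958, 1995, 1931, 2007],
    }
).set_index("title")

rows_2_to_4 = df[2:4]                                # positional slice: positions 2 and 3
city_lights = df.loc["City Lights"]                  # one row by index name (a Series)
picked = df.iloc[[0, 1, 4]]                          # first, second, and fifth rows by position
picked_mask = df[[True, True, False, False, True]]   # the same rows via a True/False list
two_cols = df[["genres", "year"]]                    # several columns (a DataFrame)
genres_series = df["genres"]                         # one column as a Series
genres_frame = df[["genres"]]                        # one column as a DataFrame
after_1990 = df[df.year > 1990]                      # boolean Series as the filter
action_or_drama = df[df.genres.isin(["Action", "Drama"])]  # match any value in a list
```

Note that `df[2:4]` follows Python slicing rules, so it stops before position 4.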

## Selecting by Index
    
In addition to passing a string to `.loc[]`:

In [47]:
df.loc['City Lights']

userId              648
rating              4.5
genres           Comedy
timestamp    1176754888
year               1931
Name: City Lights, dtype: object

You can pass a list of index names:

In [48]:
df.loc[['City Lights', 'Bringing Up Baby']]

title,userId,rating,genres,timestamp,year
City Lights,648,4.5,Comedy,1176754888,1931
Bringing Up Baby,481,4.0,Comedy,1437001472,1938


### Setting and resetting an index

In [51]:
df2 = df.reset_index()
df2

Unnamed: 0,title,userId,rating,genres,timestamp,year
0,Bringing Up Baby,481,4.0,Comedy,1437001472,1938
1,"Long, Hot Summer, The",311,3.5,Drama,1061927755,1958
2,"Net, The",191,3.0,Action,839925608,1995
3,City Lights,648,4.5,Comedy,1176754888,1931
4,Eagle vs Shark,132,4.0,Comedy,1284496709,2007


In [52]:
df2.set_index('genres')

genres,title,userId,rating,timestamp,year
Comedy,Bringing Up Baby,481,4.0,1437001472,1938
Drama,"Long, Hot Summer, The",311,3.5,1061927755,1958
Action,"Net, The",191,3.0,839925608,1995
Comedy,City Lights,648,4.5,1176754888,1931
Comedy,Eagle vs Shark,132,4.0,1284496709,2007


In [55]:
df3 = df2.set_index(['genres', 'title']).sort_index()
df3

genres,title,userId,rating,timestamp,year
Action,"Net, The",191,3.0,839925608,1995
Comedy,Bringing Up Baby,481,4.0,1437001472,1938
Comedy,City Lights,648,4.5,1176754888,1931
Comedy,Eagle vs Shark,132,4.0,1284496709,2007
Drama,"Long, Hot Summer, The",311,3.5,1061927755,1958


In [62]:
df3.loc[('Comedy')]

title,userId,rating,timestamp,year
Bringing Up Baby,481,4.0,1437001472,1938
City Lights,648,4.5,1176754888,1931
Eagle vs Shark,132,4.0,1284496709,2007


In [61]:
df3.loc[('Comedy', 'City Lights')]

userId       6.480000e+02
rating       4.500000e+00
timestamp    1.176755e+09
year         1.931000e+03
Name: (Comedy, City Lights), dtype: float64

# Semi-structured data

Semi-structured data has a structure, but it is specific to the needs of the data.

Tabular data always has two dimensions, rows and columns (aka 'records' and 'fields'); semi-structured data is less predictable.

### Common Semi-Structured Formats

JSON
- Hierarchical
- Organized around key/value pairs and lists of values

XML
- Hierarchical, enclosed
- Organized with tags, properties, and values

### XML, briefly

XML is built from elements. Each element is marked by tags, can carry attributes, and wraps content.

![Example XML](../images/xml1.png)

TAGS

- `<catalog>`
- `<book>`
- `<author>`

ELEMENTS
- Everything from `<book>` to `</book>`

ATTRIBUTES
- The ‘id’ of `<book>`

CONTENT
- “Gambardella, Matthew”
- “XML Developer’s Guide”

## JSON

JSON is made up of a few data value types:

- string
- number (like a float in Python)
- object (named collection, like a dictionary in Python)
- array  (unnamed collection, like a list in Python)
- boolean
- null

Our 'containers' are objects and arrays. They can hold any of the other data types, *including other arrays and objects*.


### Object

Key value pairs, surrounded by curly braces.

- `:` separates key and value, `,` separates the item pairs.
- Keys have to be strings, values can be anything.

### Arrays

Nearly identical to lists in Python. Values, surrounded by square brackets.

- `,` separates items.
- Values can be any data type.

### Objects 

Basic form:

```json
{
  "name": "Jill"
}
```

Adding key-value pairs:
    
```json
{
  "name": "Jill",
  "profession": "Juggler"
}
```

Values can be other types:
    
```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51
}
```

JSON is language-agnostic, so some things are different than in Python. e.g. *true / false* (lowercase) rather than Python's *True / False*:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true
}
```

The values can be arrays and objects - that's how things start to get deep:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"]
}
```

Nested Object:

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": {
      "Tri-City Juggling Symposium 2010": "Technical Program, Silver",
      "Ballympics 2009": "Best Overall, Runner-up"
  }
}
```

List of objects, why not?

```json
{
  "name": "Jill",
  "profession": "Juggler",
  "age": 51,
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "awards": [
        {
            "event": "Tri-City Juggling Symposium",
             "year": 2010,
             "awards": ["Technical Program, Silver"]
        },
        {
            "event": "Ballympics",
             "year": 2009,
             "awards": ["Best Overall, Runner-up"]
        }
    ]
}
```

Whitespace doesn't matter:

```json
{"name": "Jill", "profession": "Juggler", "age": 51, "active": true, "tools": ["ball", "handkerchief", "flaming motorcycle"], "awards": [{"event": "Tri-City Juggling Symposium", "year": 2010, "awards": ["Technical Program, Silver"]}, {"event": "Ballympics", "year": 2009, "awards": ["Best Overall, Runner-up"]}]}
```
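To see how these JSON types land in Python, here is a small sketch using the standard-library `json` module. It parses a pretty-printed and a compact version of a shortened "Jill" object and compares them; the `manager` key is made up here just to show `null`.

```python
import json

# The same object written twice: once with whitespace, once without.
pretty = """
{
  "name": "Jill",
  "active": true,
  "tools": ["ball", "handkerchief", "flaming motorcycle"],
  "manager": null
}
"""
compact = ('{"name":"Jill","active":true,'
           '"tools":["ball","handkerchief","flaming motorcycle"],"manager":null}')

person = json.loads(pretty)        # JSON object -> Python dict
same_person = json.loads(compact)

print(person == same_person)               # whitespace makes no difference
print(type(person).__name__)               # object -> dict
print(type(person["tools"]).__name__)      # array  -> list
print(person["active"], person["manager"]) # true -> True, null -> None

# Going the other way: Python dict -> JSON text.
print(json.dumps(person, indent=2))
```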

### Exercises

Develop a JSON structure for representing the following:

- a directory of current mayors, organized by cities and subdivided by state
    - Try with 3 cities from 2 states
- a collection of movies, each with basic metadata (e.g. title, director, year) and with viewer opinion information (ratings or reviews)
   - Try with *Black Panther* and *Avengers: Infinity War*

# MongoDB

A semi-structured, non-relational (NoSQL) database, which uses a JSON-like format for storing information.

### Terminology

<center>**Database Management System (MongoDB)**</center>
<center>*can have one or more*</center>

<center>**Databases**</center>

<center>*can have one or more*</center>
<center>**Collections**</center>

<center>*can have one or more*</center>
<center>**Documents**</center>

Compare to relational databases - How do the terms align?

*Documents* are JSON-like objects, *Collections* are lists of those objects.

```
[
    {
      "name": "Jill",
      "profession": "Juggler"
    },
    { "name": "Jack",
      "profession": "Unemployed"
    }
]
```

Before we learn about MongoDB, let's get it ready on our systems so we can be hands on!

- Turn to Lab and do the small JSON primer
- Install and connect to Mongo according to lab instructions.

Connecting to a database called 'week7'. I'll put all my collections there this week.

In [64]:
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.week7
db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'week7')

New `example` collection:

In [65]:
collection = db.example

## Insert One

In [89]:
doc = {
  "name": "Jill",
  "age": 51,
  "profession": "Juggler"
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x111e1d1eb40>

## Retrieving Documents

MongoDB does retrieval with `find` or `find_one`, using a JSON-like query document.

`find_one` returns a single matching document as a Python dictionary (or `None` if nothing matches):

In [91]:
collection.find_one({'name': 'Jill'})

{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
 'age': 51,
 'name': 'Jill',
 'profession': 'Juggler'}

`find` returns an instance of an object that can be iterated over:

In [92]:
collection.find({'name': 'Jill'})

<pymongo.cursor.Cursor at 0x111e1e046d8>

In [179]:
results = collection.find({'name': 'Jill'})
for result in results:
    print(result)

{'_id': ObjectId('5af1ca774b6d021e1098fa64'), 'name': 'Jill', 'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}


*Try iterating over the `results` variable again without doing another search. What happens?*

In [191]:
for result in results:
    print(result)

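The cursor is a pointer into the database, so you're not holding all the data in memory. Good for really large datasets! The trade-off is that a cursor can only be walked once: after the first loop it is exhausted, which is why the second loop prints nothing.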
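The one-pass behaviour can be imitated without a database. The sketch below uses a plain Python generator as a stand-in for a cursor; `fake_cursor` is a hypothetical helper, not part of PyMongo, and a real cursor does more (it fetches from the server as you iterate).

```python
def fake_cursor(documents):
    """Hypothetical stand-in for a cursor: hands out documents one at a time."""
    for document in documents:
        yield document

results = fake_cursor([{"name": "Jill"}, {"name": "Jack"}])

# The first loop consumes every document.
first_pass = [doc["name"] for doc in results]

# The second loop finds nothing left to iterate over.
second_pass = [doc["name"] for doc in results]

print(first_pass)
print(second_pass)
```

To go through the results again, run the search again.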

Alternately, you can convert everything to a list, but don't do this if you have a lot of data!

In [95]:
results = collection.find({'name': 'Jill'})
list_of_results = list(results)
list_of_results

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'name': 'Jill',
  'profession': 'Juggler'}]

## Insert Many

In [96]:
docs = [{ "name": "Jack", "age": 50, "profession": "Unemployed" },
        { "name": "Jun Ho", "age": 34, "profession": "Juggler" }
       ]
collection.insert_many(docs)

<pymongo.results.InsertManyResult at 0x111e1d21a20>

In [97]:
results = collection.find({"profession": "Juggler"})
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'name': 'Jill',
  'profession': 'Juggler'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5b'),
  'age': 34,
  'name': 'Jun Ho',
  'profession': 'Juggler'}]

### Count

In [125]:
collection.count_documents({})  # count() was removed in PyMongo 4

3

## Retrieving Everything

*What would our search be to get everything?*

In [98]:
results = collection.find({})
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'name': 'Jill',
  'profession': 'Juggler'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'),
  'age': 50,
  'name': 'Jack',
  'profession': 'Unemployed'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5b'),
  'age': 34,
  'name': 'Jun Ho',
  'profession': 'Juggler'}]

## `Find` by Example

Exact match:

In [105]:
results = collection.find({
            "age": 50
        })
list(results)

[{'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'),
  'age': 50,
  'name': 'Jack',
  'profession': 'Unemployed'}]

Multiple conditions:

In [110]:
results = collection.find({
            "age": 50,
            "profession": "Unemployed"
        })
list(results)

[{'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'),
  'age': 50,
  'name': 'Jack',
  'profession': 'Unemployed'}]

Greater than (`$gt`), less than (`$lt`), greater than or equal to (`$gte`), less than or equal to (`$lte`):

In [104]:
results = collection.find({
            "age": { "$gt": 50 }
        })
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'name': 'Jill',
  'profession': 'Juggler'}]

In [106]:
results = collection.find({
            "age": { "$gte": 50 }
        })
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'name': 'Jill',
  'profession': 'Juggler'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'),
  'age': 50,
  'name': 'Jack',
  'profession': 'Unemployed'}]

```json
{ "$gt": 50 }
```

The `$` tells Mongo that `$gt` is a query operator, not a field that happens to be named `gt`.

In [197]:
results = collection.find({
            "age": { "$gte": 50 }
        })
list(results)

[]

And more:
    
- Combine expressions with `$or`, `$and`
- Match against multiple values with `$in`, `$all`
    
*What do these do? Can we figure out how to use them?*
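One possible starting point: the queries are ordinary Python dictionaries, so they can be built and inspected before sending them to Mongo. The variable names, the `run_query` helper, and the top-level `hobbies` array field are assumptions made for this sketch.

```python
# $or takes a list of conditions; a document matches if any one of them holds.
juggler_or_under_40 = {"$or": [{"profession": "Juggler"}, {"age": {"$lt": 40}}]}

# $and takes a list of conditions; a document matches only if all of them hold.
# (Listing both keys in a single dict has the same effect.)
juggler_over_40 = {"$and": [{"profession": "Juggler"}, {"age": {"$gt": 40}}]}

# $in: the field equals any one of the listed values.
jack_or_jill = {"name": {"$in": ["Jack", "Jill"]}}

# $all: an array field contains every one of the listed values.
# Assumes documents with a hypothetical `hobbies` array.
jogging_jugglers = {"hobbies": {"$all": ["jogging", "juggling"]}}

def run_query(collection, query):
    """Return every matching document as a list. `collection` is a PyMongo collection."""
    return list(collection.find(query))
```

With a live connection, `run_query(collection, jack_or_jill)` would be the call to try first.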

### Specifying the return fields

The second argument to `find` or `find_one` specifies which fields to return. `_id` comes back unless you exclude it:

In [114]:
results = collection.find({}, { "name": 1})
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'), 'name': 'Jill'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'), 'name': 'Jack'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5b'), 'name': 'Jun Ho'}]

In [115]:
results = collection.find({}, { "name": 1, "age": 1})
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'), 'age': 51, 'name': 'Jill'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'), 'age': 50, 'name': 'Jack'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5b'), 'age': 34, 'name': 'Jun Ho'}]

Or by exclusion:

In [116]:
results = collection.find({}, { "name": 0 })
list(results)

[{'_id': ObjectId('5af1c28e4b6d021e1098fa59'),
  'age': 51,
  'profession': 'Juggler'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5a'),
  'age': 50,
  'profession': 'Unemployed'},
 {'_id': ObjectId('5af1c2bb4b6d021e1098fa5b'),
  'age': 34,
  'profession': 'Juggler'}]
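Every result above still carries `_id`. A sketch of how to drop it: `_id` is the one field that can be set to `0` alongside included fields. The `list_names` helper is hypothetical and expects a PyMongo collection like the one used above.

```python
# Include `name`, and explicitly leave out `_id`.
names_only = {"name": 1, "_id": 0}

def list_names(collection):
    """Return just the `name` of every document in the collection."""
    return [doc["name"] for doc in collection.find({}, names_only)]
```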

*Why use Mongo rather than an SQL DB?*

# <center>Nested Fields and Arrays</center>

New collection:

In [198]:
collection = db.example2

In [134]:
doc = {
  "name": "Jill",
  "profile": {
      "age": 51,
      "hobbies": ["jogging", "juggling"]
  }
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x111e1e1f798>

In [135]:
doc = {
  "name": "Jack",
  "profile": {
      "age": 50,
      "hobbies": ["internet commenting"]
  }
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x111e1d3f2d0>

In [136]:
doc = {
  "name": "Ju Ho",
  "profile": {
      "age": 34,
      "hobbies": ["petting dogs", "juggling"]
  }
}
collection.insert_one(doc)

<pymongo.results.InsertOneResult at 0x111e1e1f2d0>

In [199]:
collection.count_documents({})  # count() was removed in PyMongo 4

3

'Dot' notation: separate nested field names with a period, e.g. `profile.age`.

In [147]:
results = collection.find( { "profile.age": 51} )
list(results)

[{'_id': ObjectId('5af1ca774b6d021e1098fa64'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}}]

In [146]:
results = collection.find( { "profile.hobbies": "juggling"} )
list(results)

[{'_id': ObjectId('5af1ca774b6d021e1098fa64'),
  'name': 'Jill',
  'profile': {'age': 51, 'hobbies': ['jogging', 'juggling']}},
 {'_id': ObjectId('5af1ca794b6d021e1098fa66'),
  'name': 'Ju Ho',
  'profile': {'age': 34, 'hobbies': ['petting dogs', 'juggling']}}]
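To make the matching rule concrete, here is a plain-Python imitation of what the two searches above do. It is only a sketch of the idea, not how MongoDB works internally, and `get_path` and `matches` are hypothetical helpers. The point to notice is that an array field matches when it contains the value.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'profile.age' through nested dicts."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def matches(document, dotted_path, wanted):
    """Equality match; a list field matches if it contains the wanted value."""
    value = get_path(document, dotted_path)
    if isinstance(value, list):
        return wanted in value
    return value == wanted

people = [
    {"name": "Jill", "profile": {"age": 51, "hobbies": ["jogging", "juggling"]}},
    {"name": "Jack", "profile": {"age": 50, "hobbies": ["internet commenting"]}},
    {"name": "Ju Ho", "profile": {"age": 34, "hobbies": ["petting dogs", "juggling"]}},
]

aged_51 = [p["name"] for p in people if matches(p, "profile.age", 51)]
jugglers = [p["name"] for p in people if matches(p, "profile.hobbies", "juggling")]

print(aged_51)   # ['Jill']
print(jugglers)  # ['Jill', 'Ju Ho']
```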