---
Even more NoSQL 
===
<img src="images/nosql-bar-joke.jpeg" style="width: 400px;"/>

---
By the end of this session, you should be able to:
---

- Explain how __MongoDB__ is a Document datastore
- Describe how MongoDB stores data
- Write code to query MongoDB

<br>
<br> 
<br>

<img src="http://www.theodo.fr/uploads/blog//2015/11/mongodb.png" style="width: 400px;"/>

<img src="images/no_value.png" style="width: 400px;"/>

### Student Survey

Who has used MongoDB?

---
What is MongoDB?
----

The name comes from:
> hu__mongo__us

----
MongoDB: The Good, The Bad, & The Ugly
-----

__The Good__:
- Document-based datastore
    - Good for storing unstructured data (schema-less)
    - JSON-style, stored as BSON
    - Documents (objects) map nicely to programming language data types
- Open Source
- Atomic operations within a single document
- Indexing
- MapReduce support
- Fully-consistent reads

__The Bad__:

- Semi-scalable
- No multi-document transactions (added only in MongoDB 4.0)
- No schema
- No native joins before MongoDB 3.2

__The Ugly__:

- Built with developers in mind, not ops
- Available on all platforms, with client library bindings for almost all languages (Erlang, Haskell, Lua, Smalltalk, Prolog, ...)
- Suboptimal for complicated queries 

<img src="images/clients.png" style="width: 400px;"/>

---
Check for understanding
---

<details><summary>
Which Python data structure most logically maps to JSON/BSON?
</summary>
`dict`
</details>
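
As a minimal sketch of that mapping, the snippet below round-trips a nested Python `dict` through JSON text using only the standard library. The document contents (`user_doc` and its fields) are made up for illustration. BSON itself is a binary encoding handled by the driver and is not shown here.

```python
import json

# A hypothetical user document: a dict with a nested dict and a list,
# mirroring the embedded documents and arrays a document store allows.
user_doc = {
    "name": "Jon",
    "car": {"make": "Honda", "year": 2012},
    "friends": ["Henry", "Boaz"],
}

# Serialize the dict to JSON text, then parse it back into a dict.
as_json = json.dumps(user_doc, sort_keys=True)
round_tripped = json.loads(as_json)

print(as_json)
print(round_tripped == user_doc)
```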

---
RDBMS vs. MongoDB
---


### Operations

<img src="images/crud.png" style="width: 400px;"/>

----
Terms
----

| Concept | SQL | MongoDB |
|:-------|:------|:---|
| DB server | mysqld | mongod |
| DB client | mysql | mongo |
| Highest storage unit | database | database |
| Logical data structure | table | collection |
| Row | row | document |
| Column | column | field |
| Join | join | embedded document |
| Distribute | partition | shard |
| Queries return | table | cursor |

> When it comes to analytics and reporting, however, it is possible that the data you need to access spans multiple collections. For example, where the _id field of multiple documents from the products collection is included in a document from the orders collection. 

> For a query to analyze orders and details about their associated products, it must fetch the order document from the orders collection and then use the embedded references to read multiple documents from the products collection. 

> Prior to MongoDB 3.2, this work is implemented in application code. However, this adds complexity to the application and requires multiple round trips to the database, which can impact performance.
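
The application-side join described in the quote can be sketched in plain Python. Lists of dicts stand in for the `orders` and `products` collections, and every name and value here (`orders`, `products`, `product_ids`, `order_with_products`) is a hypothetical stand-in, not MongoDB's API. With a real driver, each lookup would be a separate round trip to the server.

```python
# In-memory stand-ins for two collections (hypothetical data).
products = [
    {"_id": 1, "name": "espresso beans", "price": 12.5},
    {"_id": 2, "name": "grinder", "price": 40.0},
    {"_id": 3, "name": "filter papers", "price": 3.0},
]
orders = [
    {"_id": 100, "customer": "Jon", "product_ids": [1, 3]},
]


def order_with_products(order, product_collection):
    """Resolve an order's product references in application code."""
    # Index products by _id so each reference is a dictionary lookup.
    by_id = {p["_id"]: p for p in product_collection}
    resolved = [by_id[pid] for pid in order["product_ids"] if pid in by_id]
    # Return a copy of the order with the referenced documents attached.
    return dict(order, products=resolved)


joined = order_with_products(orders[0], products)
total = sum(p["price"] for p in joined["products"])
print(joined["customer"], [p["name"] for p in joined["products"]], total)
```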

---
Cursing Cursors
-----

<img src="images/cursors_more.png" style="width: 400px;"/>

---
Data Model
---

<img src="http://www.thevisualist.org/wp-content/uploads/2013/05/Butcher_GodsMasters_HighRes.jpg" style="width: 400px;"/>

MongoDB shares many concepts with Postgres; you can use the official documentation as a reference to map them.  There are databases with collections (tables), each of which has documents (rows) containing multiple fields (columns).

---
Design
---

<img src="images/design.png" style="width: 400px;"/>

---
Data Object
---

<img src="images/data.png" style="width: 400px;"/>

---
Mongo is Flexible
----

Mongo can create databases, collections, documents, etc. on the fly.  

For example, to create a new database, simply use a database you haven't created yet: `use my_new_database`

And __POOF__ it comes into existence as soon as you store data in it (this also happens with collections)!

---
MongoDB workflow
---

<img src="images/workflow.png" style="width: 400px;"/>

---
<img src="images/summary.png" style="width: 400px;"/>

<img src="images/no_good.png" style="width: 400px;"/>

---
Summary
---

- MongoDB is 😜 - Be wary
    - Document datastore
    - Great for toy projects
    - __Do not__ build a company around it

<br>
<br>
---

---
Bonus Materials
===

---
MongoDB demo
---

![](http://i.imgur.com/SnxSuwI.gif)

```bash
mkdir db
mongod --dbpath db &
```

In [1]:
! conda install pymongo -y

Output (abridged): all requested packages are already installed; the `dsci6007` environment has pymongo 3.3.0 (py27_0).


In [2]:
import pymongo

First things first, let us set up our database.  Just like Postgres, MongoDB has a server process which you connect to with a client:

In [3]:
mdb = pymongo.MongoClient('localhost', 27017)

#### Where is the Data?

MongoDB, just like Postgres (or any database, really), stores its data in a binary format.  The data lives in a data directory (by default `/data/db`), which you can change in its configuration.  But you should not think of it in this way: the client-server abstraction is quite powerful, and any time you need to put data into MongoDB or take data out, you must go through the gatekeeper (the client, e.g. `mongo`).

In [4]:
test_db = mdb.test_db

In [5]:
users = test_db.users
users

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'test_db'), u'users')

In [6]:
users.find_one()

{u'_id': ObjectId('5870080c0ffde23a41a3bcfc'), u'name': u'Boaz'}

### Querying

MongoDB has a somewhat unusual way of querying and selecting data.  The selector is a JSON object that MongoDB uses to pattern match the appropriate documents:

```javascript
// find by single field
db.users.find({ name: 'Jon'})

// find by presence of field
db.users.find({ car: { $exists : true } })

// find by value in array
db.users.find({ friends: 'Henry' })

// field selection (only return name)
db.users.find({}, { name: true })
```

In the first case we are simply searching for all users whose `name` field is equal to `Jon`.  Nothing too special here.  In the next case we retrieve all documents that have the `car` field present.  The third line does something quite convenient: arrays are first-class citizens in MongoDB, so this line finds any document whose `friends` field contains 'Henry' (since it is an array).  The last line shows how to return only certain fields of a document: the first argument to `find()` is the query, and the second is the set of fields to return.  This line gives us the name of every user (the `_id` field is also returned unless you exclude it).
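
To make the pattern-matching idea concrete, here is a small pure-Python sketch that mimics the selector forms above on a list of dicts. It is an illustration only, not how MongoDB evaluates queries: `matches`, `find`, and the sample `users` data are hypothetical, and only equality, array membership, and `$exists` are handled.

```python
def matches(doc, selector):
    """Return True if doc satisfies every condition in selector."""
    for field, condition in selector.items():
        if isinstance(condition, dict) and "$exists" in condition:
            # {'field': {'$exists': True/False}} tests for presence only.
            if (field in doc) != condition["$exists"]:
                return False
        elif field not in doc:
            return False
        elif isinstance(doc[field], list):
            # A scalar condition matches an array field that contains it.
            if condition not in doc[field] and condition != doc[field]:
                return False
        elif doc[field] != condition:
            return False
    return True


def find(collection, selector, fields=None):
    """Yield matching documents, optionally keeping only some fields."""
    for doc in collection:
        if matches(doc, selector):
            if fields is None:
                yield doc
            else:
                yield {k: v for k, v in doc.items() if k in fields}


users = [
    {"name": "Jon", "car": "Civic", "friends": ["Henry", "Ann"]},
    {"name": "Boaz", "friends": ["Jon"]},
]

print(list(find(users, {"name": "Jon"})))
print(list(find(users, {"car": {"$exists": True}})))
print(list(find(users, {"friends": "Henry"})))
print(list(find(users, {}, fields={"name"})))
```

An empty selector (`{}`) matches every document, which is why the last call returns one entry per user.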

In [7]:
post_id = users.insert_one({'name': 'Boaz'})
post_id

<pymongo.results.InsertOneResult at 0x10471b6e0>

In [8]:
post_id.inserted_id

ObjectId('587008700ffde23a58d8b129')

In [9]:
users.find_one({'_id': post_id.inserted_id})

{u'_id': ObjectId('587008700ffde23a58d8b129'), u'name': u'Boaz'}

In [10]:
users.find_one({'_id': '5591a3866291224d1d7afb40'})

This returns `None`: the `_id` values in this collection are `ObjectId`s, so a query for a plain string does not match them.

In [11]:
users

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'test_db'), u'users')

#### Iteration

Along with its flexible data model, MongoDB has a flexible shell/driver.  If you need to take some action based on a query, or update documents, you can iterate over the cursor document by document.  In the JavaScript shell we can do this with `forEach`.  Unlike Python's `for ... in`, `forEach` takes a more functional approach to this type of iteration and requires that you pass a callback (similar to `map()` and `reduce()`).

```javascript
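// For every tweet, copy each of its URLs into the `urls` collection
db.tweets.coffee.find().forEach(function(doc){
    doc.entities.urls.forEach(function(url) {
        // one small document per URL, tagged with the tweet's user
        db.urls.insert({ 'url':url, 'user': doc.user });
    });
});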
```

Here we are finding all the URLs contained in our tweets and adding them to a new collection.  You most likely would do something more interesting, like actually downloading those URLs, but for simplicity's sake we just add them to another collection.  With `forEach` the sky's really the limit!
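
For comparison, the same document-by-document pattern in Python is an ordinary `for ... in` loop. The sketch below uses plain lists of dicts in place of the tweet and URL collections; the sample tweets and the `url_docs` list are made up for illustration. A pymongo cursor is also iterable, so the outer loop would have the same shape against a live collection.

```python
# Hypothetical tweets shaped like the documents in the shell example.
tweets = [
    {"user": "ann", "entities": {"urls": ["http://a.example/1"]}},
    {"user": "bob", "entities": {"urls": []}},
    {"user": "cy", "entities": {"urls": ["http://c.example/1", "http://c.example/2"]}},
]

# Stand-in for the target collection: one small document per URL.
url_docs = []
for doc in tweets:
    for url in doc["entities"]["urls"]:
        url_docs.append({"url": url, "user": doc["user"]})

for entry in url_docs:
    print(entry["user"], entry["url"])
```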

#### Aggregation

In SQL, some of the most useful features of the language are the aggregates used to compute statistics and metrics.  MongoDB offers the same kind of aggregates, but in my opinion its querying/aggregation ability is not as elegant as SQL's.

Let us say we want to find out how many total retweets our tweets about coffee had in each country.

```javascript
db.coffee.aggregate( [ { $group :
    {
        _id: "$place.country_code",
        total: { $sum: "$retweet_count" }
    }
}])

db.coffee.find({ retweet_count : { $gt : 0 }  })
```

First we group by `place.country_code`, then we sum the `retweet_count`.  And if we do a more specific find, we see that there actually were not any retweets!
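
To see what the `$group` stage computes, the following pure-Python sketch performs the same group-and-sum over a list of dicts. The sample tweets and the helper name `sum_retweets_by_country` are hypothetical; this illustrates the semantics and is not how MongoDB executes the pipeline.

```python
from collections import defaultdict


def sum_retweets_by_country(tweets):
    """Group tweets by place.country_code and sum retweet_count."""
    totals = defaultdict(int)
    for tweet in tweets:
        # Tweets without a place are grouped under None, much as $group
        # uses a null _id when the grouping field is missing.
        place = tweet.get("place") or {}
        totals[place.get("country_code")] += tweet.get("retweet_count", 0)
    return [{"_id": code, "total": total} for code, total in totals.items()]


# Hypothetical sample data.
tweets = [
    {"place": {"country_code": "US"}, "retweet_count": 2},
    {"place": {"country_code": "US"}, "retweet_count": 0},
    {"place": {"country_code": "BR"}, "retweet_count": 5},
    {"place": None, "retweet_count": 1},
]

for row in sum_retweets_by_country(tweets):
    print(row)
```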


<br>
<br> 
<br>

----