# Introduction to MongoDB with PyMongo and NOAA Data

This notebook provides a basic walkthrough of how to use MongoDB and is based on a tutorial originally by [Alberto Negron](http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/).

Metadata records are frequently stored as JSON, and almost anything you get from an API will be JSON. For example, check out the [metadata records](https://data.noaa.gov/data.json) for the National Oceanic and Atmospheric Administration. 

MongoDB is a great tool to use with JSON data because it stores structured data as JSON-like documents (in a binary format called BSON) and uses dynamic schemas rather than predefined ones.

In MongoDB, an element of data is called a document, and documents are stored in collections. One collection may have any number of documents. Collections are a bit like tables in a relational database, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.
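To make the contrast with relational rows concrete, here is a minimal sketch in plain Python (no database required). Both dictionaries are made-up examples of documents that could sit side by side in one collection. The second one and its field values are illustrative assumptions, not real NOAA data.

```python
# Two hypothetical documents destined for the same collection.
# Unlike rows in a table, they do not have to share the same fields.
course = {"class": "XBUS-502", "instructor": "Bengfort", "classroom": "C222"}
dataset = {"title": "Sea surface temperature", "keywords": ["ocean", "sst"]}

documents = [course, dataset]

# Collect the field names used by each document and see what they share.
field_sets = [set(doc) for doc in documents]
shared_fields = set.intersection(*field_sets)

print(shared_fields)  # empty: these two documents have no fields in common
```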

In PyMongo, documents are represented as Python dictionaries with strings as keys. Values can be various primitive types (int, float, string, datetime) as well as other documents (Python dicts) and arrays (Python lists).
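As a rough illustration of those value types, the sketch below builds one document as an ordinary dictionary. The `hours`, `topics`, and nested `instructor` fields are invented for the example, and the date assumes that "03-05-2016" means March 5, 2016.

```python
from datetime import datetime

# A hypothetical document mixing the value types mentioned above.
doc = {
    "class": "XBUS-502",                      # string
    "roster_count": 25,                       # int
    "hours": 7.5,                             # float (made-up value)
    "date": datetime(2016, 3, 5),             # datetime (assumed month-day-year)
    "instructor": {"last_name": "Bengfort"},  # embedded document (dict)
    "topics": ["MongoDB", "PyMongo"],         # array (list)
}

# Nested values are reached with ordinary dict and list indexing.
print(doc["instructor"]["last_name"])
print(doc["topics"][0])
```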

## Getting started
First we need to import `json` and `pymongo`:

In [3]:
import json
import pymongo

## Connect    
Just as with the relational database example with `sqlite`, we need to begin by setting up a connection. With MongoDB, we will be using `pymongo`, though MongoDB also comes with a [console API that uses JavaScript](https://docs.mongodb.org/manual/tutorial/write-scripts-for-the-mongo-shell/).


To make our connection, we will use the PyMongo class `MongoClient`. Called with no arguments, it connects to a MongoDB server on `localhost`, port 27017:

In [4]:
conn = pymongo.MongoClient()
conn

## Create and access a database    
MongoDB creates databases and collections automatically if they don't already exist (they are actually created when the first document is inserted). A single instance of MongoDB can support multiple independent databases. When working with PyMongo, we access databases using attribute-style access on the client:

In [6]:
db = conn.mydb
db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'mydb')

## Collections    
A collection is a group of documents stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database:

In [8]:
collection = db.my_collection
collection

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'mydb'), u'my_collection')

## Insert data   
To insert some data into MongoDB, all we need to do is create a dict and call `insert_one` on the collection object. The `inserted_id` attribute of the result holds the `_id` that MongoDB assigned to the new document:

In [9]:
doc = {"class": "XBUS-502", "date": "03-05-2016", "instructor": "Bengfort", "classroom": "C222", "roster_count": "25"}
collection.insert_one(doc).inserted_id

ObjectId('56d6d299be18b1126609c8e6')

In [10]:
conn.list_database_names()

[u'local', u'mydb']

In [11]:
db.list_collection_names()

[u'my_collection', u'system.indexes']

## A practical example

Let's say you wanted to gather up a bunch of JSON metadata records and store them for analysis. 

```python
import requests

NOAA_URL = "https://data.noaa.gov/data.json"

def load_data(url):
    """
    Fetches the JSON document at url and returns it as parsed Python objects.
    """
    r = requests.get(url)
    r.raise_for_status()  # fail early on an HTTP error response
    data = r.json()
    return data
    
noaa = load_data(NOAA_URL)
```

This takes a long time, so I've created a file for you that contains a small chunk of the records to use for today's workshop.

In [None]:
# load JSON from file
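
The cell above is a placeholder, and the workshop file is not named here. As a hedged sketch of the loading step, the example below writes a tiny made-up sample to a temporary file and reads it back with `json.load`. The sample record, the `load_json_file` helper, and the `noaa_sample` name are all assumptions for illustration. With the real file you would pass its path instead.

```python
import json
import os
import tempfile

# Made-up stand-in for the workshop file, whose name is not given here.
sample_records = [
    {"title": "Example dataset", "description": "Placeholder", "keyword": ["demo"]},
]

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample_records, f)
    sample_path = f.name


def load_json_file(path):
    """Read a JSON file and return the parsed Python object."""
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


noaa_sample = load_json_file(sample_path)
os.remove(sample_path)  # clean up the temporary file

print(len(noaa_sample))
```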

Now let's print out just one record to examine the structure.

In [None]:
noaa[0]

We want to enter these records into our database. First, we'll define our database for the NOAA records:

In [None]:
db = conn.noaa_results

Next we define the collection where we'll insert the NOAA metadata records:

In [None]:
records = db.records

Then we loop through each record in the noaa dataset and insert just the target information for each into the collection:

In [None]:
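def insert(collection, source_records):
    # Copy only the fields we care about from each record into the collection.
    for record in source_records:
        data = {}
        data["title"] = record["title"]
        data["description"] = record["description"]
        data["keywords"] = record["keyword"]

        collection.insert_one(data)

In [None]:
insert(records, noaa)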
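
The loop above writes one trimmed record at a time. A batch variant using PyMongo's `insert_many` might look like the sketch below. The helper names `to_document` and `insert_all` are hypothetical, and the use of `.get` assumes that some records may lack a field, which the source data may or may not require.

```python
def to_document(record):
    """Keep only the fields of interest from one metadata record."""
    return {
        "title": record.get("title"),
        "description": record.get("description"),
        "keywords": record.get("keyword", []),
    }


def insert_all(collection, source_records):
    """Insert the trimmed records in one batch and return how many were written.

    `collection` is expected to be a PyMongo collection, such as `db.records`.
    """
    documents = [to_document(r) for r in source_records]
    if not documents:  # insert_many raises an error on an empty list
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)
```

With the names used in this notebook, the call would be `insert_all(records, noaa)`, where `records` is the collection and `noaa` is the list of loaded metadata records.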

## Querying with `find`
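
As a starting point for this topic, here is a minimal sketch of a `find` query against the `records` collection built earlier. It assumes documents shaped like the ones inserted above, with a `title` string and a `keywords` list. The helper name `titles_with_keyword` and the search term in the usage note are made up.

```python
def titles_with_keyword(collection, keyword, limit=5):
    """Return up to `limit` titles of documents whose keywords include `keyword`."""
    query = {"keywords": keyword}        # matches when the array contains the value
    projection = {"title": 1, "_id": 0}  # return only the title field
    cursor = collection.find(query, projection).limit(limit)
    return [doc.get("title") for doc in cursor]
```

A call such as `titles_with_keyword(records, "ocean")` would return a list of matching titles. `find` itself returns a cursor, which is only iterated inside the list comprehension.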

## Querying with `find_one`
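
In PyMongo the method is spelled `find_one` (`findOne` is the JavaScript shell's name for it). It returns a single document as a dict, or `None` when nothing matches. The sketch below assumes the `title` and `description` fields inserted earlier. The helper names `get_by_title` and `describe` are hypothetical.

```python
def get_by_title(collection, title):
    """Return the first document with this exact title, or None if absent."""
    return collection.find_one({"title": title})


def describe(collection, title):
    """Return the description for a title, with a fallback when nothing matches."""
    doc = get_by_title(collection, title)
    if doc is None:  # find_one returns None when no document matches
        return "no matching record"
    return doc.get("description", "")
```

Checking for `None` before indexing into the result avoids a `TypeError` when the title is not in the collection.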

## The aggregation pipeline
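
An aggregation pipeline is a list of stages that documents pass through in order. As a sketch, the pipeline below counts how often each keyword appears across the stored records, assuming each document has the `keywords` list inserted earlier. The function names `top_keywords_pipeline` and `top_keywords` are hypothetical.

```python
def top_keywords_pipeline(n=10):
    """Build a pipeline that ranks keywords by how many documents use them."""
    return [
        {"$unwind": "$keywords"},                                # one row per keyword
        {"$group": {"_id": "$keywords", "count": {"$sum": 1}}},  # count each keyword
        {"$sort": {"count": -1}},                                # most frequent first
        {"$limit": n},                                           # keep the top n
    ]


def top_keywords(collection, n=10):
    """Run the pipeline and return (keyword, count) pairs."""
    rows = collection.aggregate(top_keywords_pipeline(n))
    return [(row["_id"], row["count"]) for row in rows]
```

Building the pipeline in its own function keeps it a plain list of dicts that can be printed and inspected before it is sent to the server with `records.aggregate(...)`.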