# 3.4 Working with MongoDB

## 3.4.1 Intro to MongoDB
MongoDB is a NoSQL database that is widely used in big data applications.

It is also called a **Document Database**. By document, we don't mean PDFs or .doc files but associative arrays such as JSON objects, PHP arrays, Python dictionaries or Ruby hashes.

MongoDB stores these hierarchical data structures directly in the database as individual items or documents. It uses JSON-like syntax.

**Data Modeling in MongoDB**

Example: Tesla Motors. The model is built up in three steps:
* Scalar values
* Array fields
* Embedded documents

-> MongoDB natively supports JSON-like documents, so data items like this can be stored as they are and queried directly, e.g. retrieve all documents where the assembly city is Fremont, even when that field is nested a few layers deep.
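
As a rough illustration of the three building blocks (scalar values, array fields, embedded documents), the sketch below models a car as a plain Python dictionary and resolves a dotted path such as `assembly.city`. The `values_at` helper is hypothetical: it only mimics, in memory, how dot notation reaches into nested data and is not how the server evaluates queries.

```python
# A document with a scalar field, an array field, and embedded documents.
car = {
    "manufacturer": "Tesla Motors",            # scalar value
    "production": [2012, 2013, 2014],          # array field
    "assembly": [                              # array of embedded documents
        {"country": "United States", "city": "Fremont"},
        {"country": "The Netherlands", "city": "Tilburg"},
    ],
}

def values_at(doc, dotted_path):
    """Hypothetical helper: collect the values found along a dotted path."""
    current = [doc]
    for key in dotted_path.split("."):
        found = []
        for item in current:
            # Look inside each element when the current value is a list.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    found.append(candidate[key])
        current = found
    return current

# The corresponding query document that could be passed to find():
fremont_query = {"assembly.city": "Fremont"}

print(values_at(car, "assembly.city"))  # ['Fremont', 'Tilburg']
```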

(Opt: Insert code from data modelling in MongoDB video)

**Why MongoDB?**
* Flexible schema -> handles flat data (e.g. CSV) as easily as hierarchical data
* Oriented towards programmers: JSON maps onto native structures such as Python dictionaries
* Drivers (client libraries) are available for most languages -> they translate to and from the native datatypes of the language
* Flexible deployment: runs on one laptop or on multiple servers with several daemons
* Designed for big data: scales horizontally on commodity hardware and includes native support for MapReduce
* Includes an aggregation framework that enables efficient analytics applications

In [None]:
# Exercise
"""
Your task is to successfully run the exercise to see how pymongo works
and how easy it is to start using it.
You don't actually have to change anything in this exercise,
but you can change the city name in the add_city function if you like.

Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine,
you have to install MongoDB (see Instructor comments for link to installation information)
and uncomment the get_db function.
"""

def add_city(db):
    # Changes to this function will be reflected in the output. 
    # All other functions are for local use only.
    # Try changing the name of the city to be inserted
    db.cities.insert({"name" : "Chicago"})
    
def get_city(db):
    return db.cities.find_one()

def get_db():
    # For local use
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    # 'examples' here is the database name. It will be created if it does not exist.
    db = client.examples
    return db

if __name__ == "__main__":
    # For local use
    # db = get_db() # uncomment this line if you want to run this locally
    add_city(db)
    print(get_city(db))

### 3.4.1.1 Flexible Schema

A flexible schema deals with two problems:
* Some documents will have fields that others do not.
* Data models usually go through several iterations.

E.g. person infobox data: some people have one child, some have several, some have none.
The indexing and query execution systems take this into account, so you can, for example, query for people with two or more children.

When designing your collections, choose a schema that is easy to work with.
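
To make the first problem concrete, here is a small in-memory sketch with made-up people documents whose fields differ. The field name `children` is an assumption about the dataset, and the query document at the end is one common way to express "two or more children" (the array must have an element at index 1).

```python
# Hypothetical documents: the "children" field is optional and varies in length.
people = [
    {"name": "Person A", "children": ["c1", "c2", "c3"]},
    {"name": "Person B"},                       # no "children" field at all
    {"name": "Person C", "children": ["c1", "c2"]},
    {"name": "Person D", "children": ["c1"]},
]

# Plain-Python equivalent of "people with 2+ children".
with_two_plus = [p["name"] for p in people if len(p.get("children", [])) >= 2]
print(with_two_plus)  # ['Person A', 'Person C']

# A query document with the same intent, to be passed to find():
two_plus_children_query = {"children.1": {"$exists": True}}
```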

### 3.4.1.2 Intro to PyMongo
PyMongo is one of MongoDB's drivers (client libraries).

Find out more about other drivers at api.mongodb.org

#### Application Architecture
The PyMongo module communicates with the database using a wire protocol. Data is exchanged in a format called BSON (a binary encoding of JSON-like documents).
mongod is the MongoDB server daemon.


In [None]:
from pymongo import MongoClient
import pprint

# Create client object and specify connection string
client = MongoClient('mongodb://localhost:27017/')

# Encoding of Tesla as a Python dictionary (client side)
tesla_s = {
    "manufacturer": "Tesla Motors",
    "class" : "full-size luxury",
    "body" : "5-door liftback",
    "production" : [2012, 2013, 2014, 2015, 2016],
    "model years" : [2012, 2013, 2014, 2015, 2016],
    "layout" : ["rear-motor", "rear-wheel drive", "dual motor all-wheel drive"],
    "designer" : {
        "firstname" : "Franz",
        "lastname" : "von Holzhausen"
    },
    "assembly" : [
        {
            "country" : "United States",
            "city" : "Fremont",
            "state" : "California"
        },
        {
            "country" : "The Netherlands",
            "city" : "Tilburg"
        }
    ]
}

# Use examples database
db = client.examples
# Insert this document in autos collection for examples database
db.autos.insert(tesla_s)

# Do a find query. Gives us back cursor for all documents.
for a in db.autos.find():
    # Print doc out for every doc we get back
    pprint.pprint(a)

# Each document comes back with an additional "_id" : ObjectId(...) field.
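
Note that `insert` was deprecated in PyMongo 3 and removed in PyMongo 4. A minimal sketch of the same round trip with `insert_one`, assuming a recent PyMongo and a `collection` object such as `db.autos`; the function name `insert_and_fetch` is made up for illustration.

```python
def insert_and_fetch(collection, document):
    """Insert one document and read it back by its generated _id."""
    result = collection.insert_one(document)   # returns an InsertOneResult
    # inserted_id holds the _id value (an ObjectId unless the document supplied one).
    return collection.find_one({"_id": result.inserted_id})

# Usage against a running server (not executed here):
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://localhost:27017/").examples
#   print(insert_and_fetch(db.autos, {"manufacturer": "Tesla Motors"}))
```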

## 3.4.2 Field Queries
* Single field queries
* Multiple field queries
* Projection queries

In [None]:
# Querying using field selection

from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017")
db = client.examples

def find():
    # Construct query document with field and value for field(s) 
    # we'd like to see for every document in our result set.
    autos = db.autos.find({"manufacturer": "Toyota"})
    for a in autos:
        pprint.pprint(a)

if __name__ == '__main__':
    find()

In [None]:
# Exercise

#!/usr/bin/env python
"""
Your task is to complete the 'porsche_query' function and in particular the query
to find all autos where the manufacturer field matches "Porsche".
Please modify only 'porsche_query' function, as only that will be taken into account.

Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine,
you have to install MongoDB and download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials at
the following link:
https://www.udacity.com/wiki/ud032
"""

def porsche_query():
    # Please fill in the query to find all autos manufactured by Porsche.
    query = {"manufacturer" : "Porsche"}
    return query


# Do not edit code below this line in the online code editor.
# Code here is for local use on your own computer.
def get_db(db_name):
    # For local use
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def find_porsche(db, query):
    # For local use
    return db.autos.find(query)


if __name__ == "__main__":
    # For local use
    db = get_db('examples')
    query = porsche_query()
    results = find_porsche(db, query)

    print("Printing first 3 results\n")
    import pprint
    for car in results[:3]:
        pprint.pprint(car)

In [None]:
# Multiple field queries

from pymongo import MongoClient
import pprint

client = MongoClient("mongo")

db = client.examples

def find():
    autos = db.autos.find({"manufacturer" : "Toyota", "class": "mid-size car"})
    for a in autos:
        pprint.pprint(a)

if __name__ == "__main__":
    find()

In [None]:
# Projection Queries: Ability to specify projection doc as well as query doc
# Projection describes shape we'd like docs to take in the results set

# E.g. only interested in getting name of result documents.

def find():
    query = {"manufacturer" : "Toyota", "class" : "mid-size car"}
    # _id is included by default, so we exclude it explicitly with "_id": 0.
    projection = {"_id": 0, "name": 1}
    autos = db.autos.find(query, projection)
    for a in autos:
        pprint.pprint(a)
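
To show what the projection document does to each result, here is a rough in-memory imitation. The `project` helper is hypothetical and covers only simple inclusion projections on top-level fields; the sample document is made up.

```python
def project(doc, projection):
    """Hypothetical helper mimicking a simple inclusion projection."""
    included = {k for k, v in projection.items() if v and k != "_id"}
    result = {k: v for k, v in doc.items() if k in included}
    # _id is returned unless the projection explicitly switches it off.
    if projection.get("_id", 1) and "_id" in doc:
        result["_id"] = doc["_id"]
    return result

auto = {"_id": 42, "name": "Camry", "manufacturer": "Toyota", "class": "mid-size car"}

print(project(auto, {"_id": 0, "name": 1}))  # {'name': 'Camry'}
print(project(auto, {"name": 1}))            # 'name' plus the default '_id'
```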

## 3.4.3 Inserting documents into collections

* 'insert' command: database.collection.insert(a)
* mongoimport

In [None]:
# Script to clean autos data. 
# Output JSON docs to import into MongoDB -> Usually recommended strategy.

client = MongoClient("mongodb://localhost:27017")

db = client.examples

num_autos = db.myautos.find().count()
print("num_autos before: " + str(num_autos))

# Loop through all autos created earlier. 
# Prev created dictionary for each auto
for a in autos:
    # Call insert for all autos. Pymongo transforms each into BSON encoding.
    db.myautos.insert(a)

num_autos = db.myautos.find().count()
print("num autos after: " + str(num_autos))
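
For the mongoimport route, the cleaned records first have to be written out as JSON. A minimal sketch, assuming the records are plain dictionaries; the sample records and the file name `autos.json` are placeholders. It writes one document per line, which is the layout mongoimport reads by default.

```python
import json
import os
import tempfile

def write_json_lines(documents, path):
    """Write one JSON document per line."""
    with open(path, "w") as out:
        for doc in documents:
            out.write(json.dumps(doc) + "\n")

# Hypothetical cleaned records standing in for the autos data.
autos = [
    {"name": "Model S", "manufacturer": "Tesla Motors"},
    {"name": "Camry", "manufacturer": "Toyota"},
]
path = os.path.join(tempfile.mkdtemp(), "autos.json")
write_json_lines(autos, path)

# Then, from a terminal (database and collection names as used in these notes):
#   mongoimport --db examples --collection myautos --file autos.json
```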

In [None]:
# Exercise

#!/usr/bin/env python
""" 
Add a single line of code to the insert_autos function that will insert the
automobile data into the 'autos' collection. The data variable that is
returned from the process_file function is a list of dictionaries, as in the
example in the previous video.
"""

from autos import process_file


def insert_autos(infile, db):
    data = process_file(infile)
    # Add your code here. Insert the data in one command.
    # The legacy insert method also accepts a list of documents.
    db.autos.insert(data)
  
if __name__ == "__main__":
    # Code here is for local use on your own computer.
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017")
    db = client.examples

    insert_autos('autos-small.csv', db)
    print(db.autos.find_one())

## 3.4.4 Operators

Problem: Need to match based on inexact criteria, e.g. all people above a certain age.

Use **operators**:
* Same idea as in programming languages
* Same syntax as field names
* Distinguished using $

### 3.4.4.1 Range Queries
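Comparison operators, which also support range queries:
* \$gt (greater than)
* \$lt (less than)
* \$gte (greater than or equal to)
* \$lte (less than or equal to)
* \$ne (not equal to)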

In [None]:
def find():
    
    query = {"population" : {"$gt": 250000, "$lte": 500000}}
    cities = db.cities.find(query)
    
    num_cities = 0
    for c in cities:
        pprint.pprint(c)
        num_cities += 1
        
    print("\nNumber of cities matching: %d\n" % num_cities)

In [None]:
# Range query for strings: All cities with names starting with X
query = {"name" : {"$gte": "X", "$lt": "Y"}}

# For dates: All cities with founding dates in 1837
# (requires: from datetime import datetime)
query = {"foundingDate" : {"$gte" : datetime(1837,1,1), 
                           "$lte": datetime(1837,12,31)}}

# Not equal to:
query = {"country" : {"$ne" : "United States"}}

In [None]:
# Exercise
#!/usr/bin/env python
"""
Your task is to write a query that will return all cities
that are founded in 21st century.
Please modify only 'range_query' function, as only that will be taken into account.

Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine,
you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
"""

from datetime import datetime
    
def range_query():
    # Modify the below line with your query.
    # You can use datetime(year, month, day) to specify date in the query
    query = {"foundingDate" : {"$gte": datetime(2001, 1, 1)}}
    return query

# Do not edit code below this line in the online code editor.
# Code here is for local use on your own computer.
def get_db():
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client.examples
    return db

if __name__ == "__main__":
    # For local use
    db = get_db()
    query = range_query()
    cities = db.cities.find(query)

    print("Found cities: %d" % cities.count())
    import pprint
    pprint.pprint(cities[0])

### 3.4.4.2 Exists
\$exists operator: query based on the structure of the documents (whether a field is present at all), not just on field values.
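
A minimal sketch of an \$exists query document, with an in-memory equivalent for comparison. The helper name `exists_query`, the field name `governmentType`, and the sample documents are assumptions for illustration.

```python
def exists_query(field, should_exist=True):
    """Build a query document that tests for the presence of a field."""
    return {field: {"$exists": should_exist}}

# Hypothetical city documents; only some carry a "governmentType" field.
cities = [
    {"name": "City A", "governmentType": "council"},
    {"name": "City B"},
]

# In-memory equivalent of the $exists test, for illustration only.
have_field = [c["name"] for c in cities if "governmentType" in c]

print(exists_query("governmentType"))  # {'governmentType': {'$exists': True}}
print(have_field)                      # ['City A']

# With PyMongo the query document is passed to find(), e.g.
#   db.cities.find(exists_query("governmentType"))
```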

#### Using the Mongo Shell



### 3.4.4.3 Regex Operator
Looking for patterns in strings.

\$regex (Regular expressions queries) is based on a regular expression library: the PCRE.


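Note on matching: a \$regex pattern such as [Ff]riendship is unanchored, so it matches every motto that contains friendship (with or without an initial capital). A plain string value such as "friendship" is presumably not treated as a pattern at all but as an equality test, which is why it only matches when friendship is the entire motto. Wrapping it in \$regex matches mottos that contain the lowercase word anywhere.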
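
A small in-memory sketch of the difference between an equality match and an unanchored pattern match. The mottos are made up, and Python's `re` module stands in for PCRE here, which is close enough for a pattern this simple.

```python
import re

mottos = ["Friendship and unity", "In friendship we trust", "friendship", "Industry"]

# Equality match: only a motto that is exactly this string qualifies.
exact_query = {"motto": "friendship"}
# Pattern match: any motto containing the pattern qualifies.
regex_query = {"motto": {"$regex": "[Ff]riendship"}}

exact_hits = [m for m in mottos if m == exact_query["motto"]]
# re.search, like an unanchored $regex, matches anywhere in the string.
regex_hits = [m for m in mottos if re.search(regex_query["motto"]["$regex"], m)]

print(exact_hits)  # ['friendship']
print(regex_hits)  # ['Friendship and unity', 'In friendship we trust', 'friendship']
```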

**RegEx tutorials**
* Live RegEx tester at regexpal.com
* MongoDB \$regex Manual
* Official Python Regular Expression HOWTO
* Another good Python Regular Expressions page


### 3.4.4.4 Querying Arrays using Scalars

Query against fields that are not simply scalar values such as strings or integers but are themselves structured data values such as arrays.

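\$in operator: matches documents where the field equals, or for array fields contains, at least one of the listed values.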

In [None]:
# Exercise
#!/usr/bin/env python
"""
Your task is to write a query that will return all cars manufactured by
"Ford Motor Company" that are assembled in Germany, United Kingdom, or Japan.
Please modify only 'in_query' function, as only that will be taken into account.

Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine,
you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
"""


def in_query():
    # Modify the below line with your query; try to use the $in operator.
    query = {"manufacturer": "Ford Motor Company", 
    "assembly" : {"$in": ["Germany", "United Kingdom", "Japan"]}}
    
    return query


# Do not edit code below this line in the online code editor.
# Code here is for local use on your own computer.
def get_db():
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client.examples
    return db


if __name__ == "__main__":

    db = get_db()
    query = in_query()
    autos = db.autos.find(query, {"name":1, "manufacturer":1, "assembly": 1, "_id":0})

    print("Found autos: %d" % autos.count())
    import pprint
    for a in autos:
        pprint.pprint(a)


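Counterpart of \$in: the \$all operator, which requires an array field to contain all of the listed values rather than at least one.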
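
The two operators are easy to mix up, so here is an in-memory imitation of their matching rules. `matches_in` and `matches_all` are hypothetical helpers and the `assembly` value is made up; the real query documents would look like {"assembly": {"$in": [...]}} and {"assembly": {"$all": [...]}}.

```python
def matches_in(field_value, wanted):
    """Mimic $in: true if the field (scalar or array) shares any value with wanted."""
    values = field_value if isinstance(field_value, list) else [field_value]
    return any(v in wanted for v in values)

def matches_all(field_value, wanted):
    """Mimic $all: true if the field contains every value in wanted."""
    values = field_value if isinstance(field_value, list) else [field_value]
    return all(w in values for w in wanted)

assembly = ["Germany", "Spain"]

print(matches_in(assembly, ["Germany", "United Kingdom", "Japan"]))   # True
print(matches_all(assembly, ["Germany", "United Kingdom", "Japan"]))  # False
print(matches_all(assembly, ["Germany", "Spain"]))                    # True
```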

Other ways to query:
Some fields contain nested docs, e.g. dimensions.
Query inside nested docs using dot notation.


In [None]:
# Exercise
#!/usr/bin/env python
"""
Your task is to write a query that will return all cars with width dimension
greater than 2.5. Please modify only the 'dot_query' function, as only that
will be taken into account.

Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine, you will need to install
MongoDB, download and insert the dataset. For instructions related to MongoDB
setup and datasets, please see the Course Materials.
"""


def dot_query():
    # Edit the line below with your query - try to use dot notation.
    # You can check out example_auto.txt for an example of the document
    # structure in the collection.
    query = {"dimensions.width": {"$gt" : 2.5}}
    return query


# Do not edit code below this line in the online code editor.
# Code here is for local use on your own computer.
def get_db():
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client.examples
    return db


if __name__ == "__main__":
    db = get_db()
    query = dot_query()
    cars = db.cars.find(query)

    print("Printing first 3 results\n")
    import pprint
    for car in cars[:3]:
        pprint.pprint(car)

## 3.4.5 Updating documents in a collection

**save** command
* In Pymongo: method on collections objects
* -> Creates new doc or replaces existing one depending on whether or not ID exists.

find_one returns first doc found as opposed to a cursor (which find returns).

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017")

db = client.examples

def main():
    city = db.cities.find_one({ "name" : "München",
                                "country" : "Germany"})
    city["isoCountryCode"] = "DEU"
    db.cities.save(city)

**update** command
* Query document as first parameter
* Update document as second parameter: the operation MongoDB should perform on the first document found matching this query.

**\$set** operator
* If document does not already contain the field specified here, then the field should be added with value specified. 
* If the field already exists, update field value to value specified.

**\$unset** operator
* Inverse of \$set
* For whatever doc matches query: if the document has the field specified, remove the field.
* If the document does not have the field specified, the operator has no effect.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017")

db = client.examples

def main():
    db.cities.update({ "name" : "München",
                        "country" : "Germany"},
                    { "$set" : {
                        "isoCountryCode" : "DEU"
                            }
                    })

In [None]:
def main():
    city = db.cities.update({ "name" : "München",
                                "country" : "Germany"},
                            { "$unset" : {
                                "isoCountryCode" : ""
                            }})

In [None]:
# COMMON MISTAKE: If you leave out the operator

def main():
    db.cities.update({ "name" : "München",
                        "country" : "Germany"},
                    {"isoCountryCode" : "DEU"})
# Result: the entire document is replaced, so that 
# it contains the _id field and the isoCountryCode field only.
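
The three behaviours (\$set, \$unset, and replacement when the operator is left out) can be compared side by side with a small in-memory model. `apply_update` is a hypothetical helper that imitates the legacy `update` semantics described here; it does not talk to MongoDB.

```python
import copy

def apply_update(doc, update):
    """Hypothetical in-memory model of the legacy update() behaviour."""
    if not any(key.startswith("$") for key in update):
        # No operator: the whole document is replaced, only _id survives.
        replaced = copy.deepcopy(update)
        replaced["_id"] = doc["_id"]
        return replaced
    result = copy.deepcopy(doc)
    for field, value in update.get("$set", {}).items():
        result[field] = value        # add the field or overwrite it
    for field in update.get("$unset", {}):
        result.pop(field, None)      # remove the field if present
    return result

city = {"_id": 1, "name": "München", "country": "Germany"}

print(apply_update(city, {"$set": {"isoCountryCode": "DEU"}}))  # field added
print(apply_update(city, {"$unset": {"country": ""}}))          # field removed
print(apply_update(city, {"isoCountryCode": "DEU"}))            # name and country are lost
```

PyMongo 3 and later split these behaviours into separate methods: `update_one` / `update_many` for operator-based updates and `replace_one` for whole-document replacement, which makes this mistake harder to make.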

### 3.4.5.1 Updating multiple docs at once
Global modification to all docs matching certain criteria.
* Specify third parameter **multi=True**.

In [None]:
def main():
    db.cities.update({ "name" : "München",
                        "country" : "Germany"},
                     { "$set" : {
                        "isoCountryCode" : "DEU"
                        }
                     },
                     multi=True)
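
`update(..., multi=True)` belongs to the legacy API. In PyMongo 3 and later the closest equivalent is `update_many`. A minimal sketch, assuming a `collection` object such as `db.cities`; the function name `set_country_code` is made up for illustration.

```python
def set_country_code(collection, country, code):
    """Set isoCountryCode on every document for a country; return the modified count."""
    result = collection.update_many(
        {"country": country},                  # query document
        {"$set": {"isoCountryCode": code}},    # update document
    )
    return result.modified_count

# Usage against a running server (not executed here):
#   from pymongo import MongoClient
#   db = MongoClient("mongodb://localhost:27017").examples
#   print(set_country_code(db.cities, "Germany", "DEU"))
```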

## 3.4.6 Removing documents