# Module 14: NoSQL Databases & MongoDB

The term “__NoSQL__” applies to __any non-relational database system__, i.e., one in which data is stored in a format other than a relational, table-oriented structure.

A NoSQL database is a storage space into which a user can deposit __virtually any type of digital object__.

NoSQL databases excel at the task of __storing and managing unstructured data__ such as digital images, videos, audio clips, and collections of documents that may share a common subject matter but not a common format.

Unlike relational databases, __NoSQL databases__:

- __Do not__ share a common query language;


- __Do not__ share a common methodology for storing, retrieving, and managing data;


- __Do not__ share a common set of methods for adding, modifying and deleting their contents;


- __Do not__ require that data be "normalized";


- Have a __much shorter history of usage__ than do relational databases.


(*However, we can, if we so choose, store and manage relational-oriented data within a NoSQL database, with the caveat being that the methodology required for managing relationships within any type of NoSQL database is quite different from the methodology used within a relational database*).


In general, a NoSQL database provides a simple user-defined index to each object, and __virtually no internal data integrity/consistency checks__ are performed on the database’s contents as items are added, modified or deleted. Such an environment can be very useful for purposes such as software application development, storage of web-based audio/video, and for storing collections of documents that share no common format. 

Because they enforce few data integrity constraints, NoSQL databases are comparatively easy to "scale out", i.e., to distribute across multiple (possibly remote) servers.

There are __four primary types of NoSQL databases__: 


- __Key-Value Pair__: Similar to a Python dictionary object, data is stored in key-value pairs. Each key __must__ be unique and the "value" can be any type of data, including strings, images, BLOBs (Binary Large OBjects), and videos. Examples include the __Redis__, __Dynamo__, and __Riak__ NoSQL databases.


- __Column-based__: Allow users to organize key-value pairs into __columns__, i.e., we can create a separate column for a given type of attribute and then store key-value pairs within a "column". Key values __within__ a column __must__ be unique. This approach enables __high performance on aggregation queries__ like SUM, COUNT, AVG, MIN, etc. Examples include the __HBase__, __Cassandra__, and __Hypertable__ NoSQL databases.


- __Graph__: Store information as __nodes__ (i.e., entities) and __edges__ (i.e., relationships), and allow us to define one or more "relationships" between nodes. Every "node" and "edge" is assigned a unique identifier. Performance tends to be very high since relationships between nodes do not need to be calculated at the time of a query (as is the case in a relational database). Often used for social networks, logistics, and spatial data, but usage is currently expanding to many other fields as well. Examples include the __Neo4j, InfiniteGraph, OrientDB__, and __FlockDB__ graph database systems.


- __Document-Oriented__: Relies on key-value pairs, but the "value" consists of a "document" which often takes the form of an __XML__ or __JSON__ document. __The content of the document "values" can be queried directly by the user__. Very flexible storage and retrieval methodology often used for content management systems, blogging platforms, real-time analytics and e-commerce applications. However, Document-Oriented NoSQL databases should not be used for complex transactions which require multiple operations or queries against varying aggregate structures. Examples include the __CouchDB__, __MongoDB__, and __Riak__ document-oriented NoSQL databases.


(For more details, see https://www.guru99.com/nosql-tutorial.html)
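To make the four storage models more concrete, the sketch below mimics each one with plain Python data structures. It is only an analogy and does not reflect how any of the products named above store data internally; all keys and sample values are invented for illustration.

```python
# Plain-Python analogies for the four NoSQL data models (illustrative only).

# Key-value: one unique key maps to an opaque value.
key_value_store = {
    "user:1001": b"\x89PNG...",          # e.g., raw image bytes
    "user:1002": "any string payload",
}

# Column-based: each column holds its own key-value pairs, so an
# aggregate over one attribute only needs to touch that column.
column_store = {
    "price": {"row1": 9.99, "row2": 14.50, "row3": 3.25},
    "quantity": {"row1": 2, "row2": 1, "row3": 10},
}
total_quantity = sum(column_store["quantity"].values())

# Graph: nodes and edges are stored explicitly, each with its own id.
nodes = {"n1": {"name": "Ana"}, "n2": {"name": "Ben"}}
edges = {"e1": {"from": "n1", "to": "n2", "type": "FOLLOWS"}}
followed_by_ana = [nodes[edge["to"]]["name"] for edge in edges.values()
                   if edge["from"] == "n1" and edge["type"] == "FOLLOWS"]

# Document-oriented: the value is a structured document whose
# fields can be inspected directly.
document_store = {
    "post:1": {"author": "Mike", "tags": ["mongodb", "python"]},
    "post:2": {"author": "Lucas", "url": "https://realpython.com/python-json/"},
}
posts_by_mike = [key for key, doc in document_store.items()
                 if doc.get("author") == "Mike"]

print(total_quantity, followed_by_ana, posts_by_mike)
```

Note how only the document-oriented analogy filters on the *content* of the stored value; in the key-value analogy the value is treated as an opaque payload.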

### NoSQL Database Disadvantages

- Lack of a common query language. For example, this serves as a major impediment to migrating a NoSQL-based application from, say, MongoDB to a different NoSQL platform such as HBase.


- Lack of built-in data integrity tools makes them unsuitable for many types of applications. 


- Many NoSQL databases __have proprietary methods for adding, modifying, deleting, and querying their content__. This serves as a major impediment for application developers. For example, MongoDB requires that all queries and any data being added to its platform be converted to BSON format, which itself is not widely used outside of the MongoDB environment. Within some programming languages, converting to BSON format requires that numeric or character data first be converted to JSON format, and then converted to BSON. Requiring users to perform multiple data format conversions prior to adding data to a database is woefully inefficient and wasteful. Similarly, when retrieving data back from MongoDB, in some programming languages the user receives that data in BSON format, which of course subsequently must be converted to a data structure that is suitable for manipulation within the given programming language.

## MongoDB

- Used for high-volume data storage & retrieval


- Instead of tables, rows, & columns, a MongoDB database is organized around __collections__ of __documents__, each of which contains zero or more __fields__.


- Documents are housed within __collections__, which provide a high-level method of segregating documents by, for example, topic or date or document type or whatever type of organizing principle we choose.


- Each document within a collection has a __dynamic schema__ and can contain zero or more __fields__, which are __key-value__ pairs. There is no requirement for each document within a collection to conform to a fixed structure.


- The content of MongoDB __documents__ is constructed / managed using __BSON__, which is a binary-encoded form of __JSON__.


- Components of MongoDB databases share some similarities with the components of a typical RDBMS. See https://www.tutorialspoint.com/mongodb/mongodb_overview.htm and https://docs.mongodb.com/manual/reference/sql-comparison/


- __MongoDB Data Management Hierarchy__: Database >> Collection (similar to RDBMS Table) >> Document (similar to RDBMS Table Row) >> Field (similar to RDBMS Column)
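As a mental model of this hierarchy, the sketch below uses nested Python containers. It is only an analogy, not MongoDB's actual storage layout, and the database, collection, and field names are hypothetical.

```python
# Mental model of the MongoDB hierarchy using nested Python containers.
# Database -> Collection -> Document -> Field (all names are hypothetical).
server = {
    "school": {                               # database
        "students": [                         # collection (similar to a table)
            {"name": "Ana", "year": 2},       # document (similar to a row)
            {"name": "Ben", "major": "CS"},   # documents may differ in fields
        ],
        "courses": [],                        # a collection may be empty
    }
}


def field_names(collection):
    """Return the set of all field names used by any document."""
    names = set()
    for document in collection:
        names.update(document.keys())
    return names


students = server["school"]["students"]
print(sorted(field_names(students)))
```

Because each document carries its own field names, the "columns" of a collection are simply the union of whatever fields its documents happen to use.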

## MongoDB Data Types

https://www.tutorialspoint.com/mongodb/mongodb_datatype.htm

## MongoDB Data Modeling

Examples of how to go about constructing a data model for a MongoDB database: https://www.tutorialspoint.com/mongodb/mongodb_data_modeling.htm



## MongoDB Commands + Syntax

A handy "cheat sheet" providing examples of valid MongoDB commands + syntax is available here: https://gist.github.com/bradtraversy/f407d642bdc3b31681bc7e56d95485b6


## How to Create a New MongoDB Database

https://www.tutorialspoint.com/mongodb/mongodb_create_database.htm

## How to Create a Collection

https://www.tutorialspoint.com/mongodb/mongodb_create_collection.htm

## How to Create an Index for a Collection

https://www.tutorialspoint.com/mongodb/mongodb_indexing.htm
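From Python, an index can be created with PyMongo's `create_index()` method. The following is a minimal sketch, assuming `collection` is a PyMongo collection object (for example `db["posts"]`); the helper name `index_author_field` and the choice of the `author` field are hypothetical.

```python
# Sketch: create a single-field ascending index on a PyMongo collection.
# `collection` is assumed to be a pymongo Collection object, e.g. db["posts"].
ASCENDING = 1  # same value as pymongo.ASCENDING


def index_author_field(collection):
    """Create an ascending index on the 'author' field; return the index name."""
    # The key specification is a list of (field name, direction) pairs.
    return collection.create_index([("author", ASCENDING)])


# Example usage (requires a running MongoDB server):
# index_name = index_author_field(db["posts"])
# print(index_name)
```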

## Add a Document to a Collection

https://www.tutorialspoint.com/mongodb/mongodb_insert_document.htm

## How to Query MongoDB Documents

https://www.tutorialspoint.com/mongodb/mongodb_query_document.htm
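In PyMongo, queries are expressed as Python dictionaries passed to `find()`. The sketch below shows one equality filter with a projection and one filter using the `$in` operator. It assumes `collection` is a PyMongo collection object; the helper names and the field names (`author`, `title`, `tags`) are hypothetical.

```python
# Sketch: query a PyMongo collection with filter documents.
# `collection` is assumed to be a pymongo Collection object, e.g. db["posts"].


def titles_by_author(collection, author):
    """Return the 'title' field of every document written by `author`.

    The second argument to find() is a projection: it keeps 'title'
    and suppresses the '_id' field that is otherwise always returned.
    """
    query_filter = {"author": author}
    projection = {"_id": 0, "title": 1}
    return list(collection.find(query_filter, projection))


def posts_with_any_tag(collection, tags):
    """Return documents whose 'tags' array contains at least one of `tags`."""
    return list(collection.find({"tags": {"$in": list(tags)}}))


# Example usage (requires a running MongoDB server):
# titles_by_author(db["posts"], "Lucas")
# posts_with_any_tag(db["posts"], ["python", "pymongo"])
```

Wrapping the result in `list()` consumes the cursor that `find()` returns, which is convenient for small result sets but loads every matching document into memory.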

## How to Update a MongoDB Document

https://www.tutorialspoint.com/mongodb/mongodb_update_document.htm
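From Python, a single document can be modified with PyMongo's `update_one()` method together with the `$set` update operator. The following is a minimal sketch, assuming `collection` is a PyMongo collection object; the helper name `set_post_text` and the field names are hypothetical.

```python
# Sketch: update one field of one document in a PyMongo collection.
# `collection` is assumed to be a pymongo Collection object, e.g. db["posts"].


def set_post_text(collection, author, new_text):
    """Set 'text' on the first document whose 'author' matches.

    Returns a (matched_count, modified_count) tuple taken from the
    UpdateResult object that update_one() returns.
    """
    result = collection.update_one(
        {"author": author},             # filter: which document to update
        {"$set": {"text": new_text}},   # update: change only the 'text' field
    )
    return result.matched_count, result.modified_count


# Example usage (requires a running MongoDB server):
# matched, modified = set_post_text(db["posts"], "Mike", "Edited post text")
```

Using `$set` leaves every other field of the document untouched, whereas replacing the whole document would discard any field not restated in the replacement.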

## How to Use MongoDB from within Python

Both __PyMongo__ and __MongoEngine__ are suitable for enabling interaction with a MongoDB server from within a Jupyter Notebook.

### PyMongo Installation + Tutorial:

__PyMongo__ Installation instructions for Anaconda can be found here:

https://anaconda.org/anaconda/pymongo

After installation, refer to the tutorial for guidance on usage:

https://pymongo.readthedocs.io/en/stable/tutorial.html


### MongoEngine Installation + Tutorial: 

__MongoEngine__ Installation instructions for Anaconda can be found here:

https://anaconda.org/conda-forge/mongoengine

After installation, refer to the tutorial for guidance on usage:

http://docs.mongoengine.org/tutorial.html

## PyMongo Usage Example

Get started by loading the PyMongo library and establishing a connection to your (installed and running) MongoDB server.

In [1]:
# Load the pymongo library
import pymongo

In [2]:
# establish a connection with your local MongoDB server
# Make sure your MongoDB server is up + running!

from pymongo import MongoClient
client = MongoClient()

Once connected to your MongoDB server from within your Jupyter Notebook, you can retrieve a list of any databases that already exist within your MongoDB server using the __list_database_names()__ function. NOTE: The list of databases you see displayed will likely vary from that shown below.

In [55]:
# get a list of databases on your local server
# note your list may vary from the output shown here
client.list_database_names()

['local', 'test', 'zips']

To __connect to a specific pre-existing MongoDB database__, use the syntax shown below. Note that for this example we are connecting to the pre-existing 'test' database.

In [56]:
# connect to a specific MongoDB database
db = client.test

To __display a list of all collections__ that exist within a database, use the __list_collection_names()__ function (see example below). In this instance, we see that the 'test' database contains one collection, and the name of that collection is 'restaurants'. Note that we could also assign the output of the function to a Python variable, and that variable would be a Python list that contains the names of each collection housed within the MongoDB database we are currently using.

In [57]:
# list collections contained within a database
db.list_collection_names()

['restaurants']

### Creating A New MongoDB Database via PyMongo

To __create a new database__ within MongoDB via PyMongo, use the syntax shown in the example below. In the example, we are creating a new MongoDB database named 'AIM5001'. Note the need to surround the name of the database with quotes (i.e., it needs to be a Python string) and square brackets (a requirement of PyMongo's dictionary-style access syntax).

In [58]:
# create a new database
# Here, we are creating a new database named 'AIM5001'

db = client['AIM5001']

To __use__ the new MongoDB database within your Python environment, use the syntax shown in the example below, wherein a pointer to the MongoDB database 'AIM5001' is being assigned to a Python variable. That variable can then be used to access and manipulate the content of the indicated database.

In [59]:
# To use a MongoDB database via PyMongo, simply assign it to a Python variable.
db = client.AIM5001

To list the names of all __collections__ within a MongoDB database, use the __list_collection_names()__ function. Note that in our example here we have not yet added any collections to our new AIM5001 database, so the list of results is empty.

In [60]:
# list the names of all collections within a database
# note that we currently have none since we've just created the database
db.list_collection_names()

[]

To __create a new MongoDB collection__ within the database, use the syntax shown below. In this example we are adding a collection having the name of __posts__ to our new AIM5001 database.

In [61]:
# add a new collection to the new AIM5001 database
posts_collection = db["posts"]

There are a variety of ways to __add a new document to a MongoDB collection__. In the example below, we are defining the content of a new MongoDB document "by hand" using a Python dictionary object. __REMEMBER__: MongoDB __documents__ are comprised of __fields__, and each __field__ is a __key : value__ pair.

In [62]:
# create a sample document to add to the new collection
post = {"author": "Mike",
        "text": "My first blog post!",
        "tags": ["mongodb", "python", "pymongo"]}

Now that we have our new document defined, we add it to our __posts__ collection via the PyMongo __insert_one()__ function. We capture the output of the function solely for purposes of verifying that a successful insertion occurred.

In [63]:
# insert a new document into our "posts" collection
post_id = posts_collection.insert_one(post).inserted_id
post_id

ObjectId('605620e9536d8c84414c3375')

Now that our AIM5001 MongoDB database contains some actual content, the name of our new database will appear in the output of the __list_database_names()__ function. 

(*__NOTE__: The name of a database will NOT appear in the output of the __list_database_names()__ function if the database contains no collections or documents.*) 

In [64]:
# now that we've added content to the new database, it will appear 
# in our list of MongoDB databases
client.list_database_names()

['AIM5001', 'local', 'test', 'zips']

To __retrieve the content of all documents housed within a MongoDB collection__, use the __find()__ function via the syntax shown below.

In [65]:
# retrieve + display all documents in a collection
for document in posts_collection.find():
    print(document)

{'_id': ObjectId('605620e9536d8c84414c3375'), 'author': 'Mike', 'text': 'My first blog post!', 'tags': ['mongodb', 'python', 'pymongo']}


To __count the number of documents within a MongoDB collection__, use the __count_documents({})__ function. Be sure to include the empty braces __{}__ between the parentheses when invoking the function: they represent an empty filter, which matches every document in the collection. Failure to do so will result in an error message.

In [66]:
# count the number of documents in a collection

posts_collection.count_documents({})

1

Now let's check the list of __collections__ within our AIM5001 database. As shown below, we see that our new __posts__ collection now appears as output of the __list_collection_names()__ function.

In [67]:
# list collections within current database
db.list_collection_names()

['posts']

### Retrieving Documents from a MongoDB Collection via PyMongo

We can now retrieve the document we stored within our __posts__ collection and store it within a __Pandas__ dataframe. To do so, we load the __pandas__ library and then assign the output of the PyMongo __find()__ function to a Python variable. We then convert that variable to a Python list, the content of which will be individual Python dictionary objects, one dictionary for each MongoDB document contained within the collection.

In [68]:
# retrieve documents from a collection and store them within a Pandas
# dataframe

import pandas as pd

# save the output of the find() function to a variable
cursor = posts_collection.find()

# convert the content of the output of the find() function to a Python list
entries = list(cursor)

# convert the content of the python list to a pandas dataframe
df = pd.DataFrame(entries)
df.head()

| | _id | author | tags | text |
|---|---|---|---|---|
| 0 | 605620e9536d8c84414c3375 | Mike | [mongodb, python, pymongo] | My first blog post! |


As shown in the output above, the dataframe contains the content of the document we had added to the __posts__ collection. Note that the unique MongoDB object id (shown in the first column) is included in the output of the PyMongo __find()__ function. The subsequent dataframe columns correspond to the content of the key:value pairs we specified when we defined the document (see above).

Now let's add a second document to our AIM5001 __posts__ collection. For this example, we use a document whose keys are __NOT__ identical to those of the first document we added (see above).

In [45]:
# define a 2nd document to add to the new database
# Note how the structure of the document is NOT identical to 
# that of the first document

tutorial1 = {
    "title": "Working With JSON Data in Python",
    "author": "Lucas",
    "contributors": [
        "Aldren",
        "Dan",
        "Joanna"
    ],
    "url": "https://realpython.com/python-json/"
}

In [46]:
# Add the 2nd document to the collection
post_id2 = posts_collection.insert_one(tutorial1).inserted_id
post_id2

ObjectId('60561160536d8c84414c3374')

When we now retrieve and display the content of the __posts__ collection, we can clearly see that both documents have been successfully stored within the collection, despite the fact that their structure is not identical.

This is a great example of one of the advantages of using MongoDB: Documents stored within any given database or collection __DO NOT__ need to adhere to an identical format or structure.

In [47]:
# retrieve + display all documents in a collection
for document in posts_collection.find():
    print(document)

{'_id': ObjectId('6055ee49536d8c84414c3373'), 'author': 'Mike', 'text': 'My first blog post!', 'tags': ['mongodb', 'python', 'pymongo']}
{'_id': ObjectId('60561160536d8c84414c3374'), 'title': 'Working With JSON Data in Python', 'author': 'Lucas', 'contributors': ['Aldren', 'Dan', 'Joanna'], 'url': 'https://realpython.com/python-json/'}


Let's retrieve all documents from the __posts__ collection once again and convert the results to a __pandas__ dataframe. Note once again how the content of each MongoDB document is converted to a Python dictionary object by the PyMongo __find()__ function. After converting the output of the __find()__ function to a Python list of dictionaries, we can directly convert that Python list to a __pandas__ dataframe:

In [49]:
# retrieve all documents from collection and convert the result to a Python list

cursor = posts_collection.find()
entries = list(cursor)

# As shown below, we now have a list of Python dictionaries. 
# The content of each document from the collection is converted to a 
# Python dictionary object. Recall from Module 6 that 
# Pandas can seamlessly create a dataframe using a list of dictionaries
entries

[{'_id': ObjectId('6055ee49536d8c84414c3373'),
  'author': 'Mike',
  'text': 'My first blog post!',
  'tags': ['mongodb', 'python', 'pymongo']},
 {'_id': ObjectId('60561160536d8c84414c3374'),
  'title': 'Working With JSON Data in Python',
  'author': 'Lucas',
  'contributors': ['Aldren', 'Dan', 'Joanna'],
  'url': 'https://realpython.com/python-json/'}]

In [50]:
# Convert the list of dictionaries to a Pandas dataframe
# Note how Pandas assigns NaN values wherever the document key values 
# do not align

df = pd.DataFrame(entries)
df.head()

| | _id | author | contributors | tags | text | title | url |
|---|---|---|---|---|---|---|---|
| 0 | 6055ee49536d8c84414c3373 | Mike | NaN | [mongodb, python, pymongo] | My first blog post! | NaN | NaN |
| 1 | 60561160536d8c84414c3374 | Lucas | [Aldren, Dan, Joanna] | NaN | NaN | Working With JSON Data in Python | https://realpython.com/python-json/ |


Note in the above output how __Pandas__ assigns NaN values wherever the keys of the MongoDB documents do not align. The two sample documents created above share only the `_id` and `author` keys, so every other column contains a NaN value in one of the two rows.

However, if we were porting the content of an individual SQL table into MongoDB, all document key values would be identical, since each row of a given SQL table could be represented by a single MongoDB document within a MongoDB collection used to store/manage the content of the SQL table. As such, the dataframe resulting from the process shown above would not be populated with disparate key values or 'NaN' data values.
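The row-to-document mapping described above can be sketched with Python's built-in `sqlite3` module. The table name, column names, and sample rows are hypothetical, and the helper `table_to_documents` is an illustration rather than part of any library.

```python
import sqlite3


def table_to_documents(connection, table_name):
    """Return every row of `table_name` as a dict keyed by column name."""
    connection.row_factory = sqlite3.Row
    # Table names cannot be bound as query parameters, so only pass
    # trusted, hard-coded names to this helper.
    cursor = connection.execute(f'SELECT * FROM "{table_name}"')
    return [dict(row) for row in cursor]


# Build a tiny in-memory table (hypothetical data) to demonstrate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO authors (name, city) VALUES (?, ?)",
                 [("Mike", "Boston"), ("Lucas", None)])
documents = table_to_documents(conn, "authors")
conn.close()

print(documents)
# Each dictionary could then be stored as one MongoDB document, e.g. by
# passing the list to insert_many() on a PyMongo collection.
```

Every dictionary produced this way has the same keys, because each one is derived from the same set of table columns; a SQL NULL simply becomes a Python `None` under that key.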

### Converting a Pandas Dataframe to a Python List of Dictionaries

As mentioned above, a MongoDB __document__ is comprised of key:value pairs, and we can populate a MongoDB __collection__ from our Python environment by arranging the data we want to store as __documents__ within a MongoDB __collection__ in the form of a Python list of dictionaries, with each dictionary within the list corresponding to a single "document". 

__Pandas__ provides us with a quick and easy way to convert the content of a dataframe to a Python list of dictionaries. Specifically, we can use the __to_dict()__ function as shown below. Be sure to pass __'records'__ as the function's `orient` argument, e.g., __df.to_dict('records')__. This ensures that each dataframe row will be converted to a distinct dictionary object. Therefore, the output of the function will be a list of dictionaries, with each dictionary corresponding to the content of a single row within the dataframe.

The resulting list of dictionaries can then easily be used with __PyMongo__ to populate the content of a MongoDB __collection__.

In [52]:
# convert the content of a dataframe to a Python list of dictionaries.
# A list of dictionaries can then be used to populate a MongoDB collection
# with documents, wherein each dictionary within the list represents the
# content of a single MongoDB document
df.to_dict('records') 

[{'_id': ObjectId('6055ee49536d8c84414c3373'),
  'author': 'Mike',
  'contributors': nan,
  'tags': ['mongodb', 'python', 'pymongo'],
  'text': 'My first blog post!',
  'title': nan,
  'url': nan},
 {'_id': ObjectId('60561160536d8c84414c3374'),
  'author': 'Lucas',
  'contributors': ['Aldren', 'Dan', 'Joanna'],
  'tags': nan,
  'text': nan,
  'title': 'Working With JSON Data in Python',
  'url': 'https://realpython.com/python-json/'}]
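Note that the records above contain `nan` placeholders for keys that were absent from the original documents, as well as the original `_id` values. Inserting them unchanged would store those NaN values as real fields, and reusing an `_id` that already exists in the target collection raises a duplicate-key error. One possible clean-up step is sketched below; the helper name `drop_missing` and the sample records are hypothetical.

```python
import math


def drop_missing(records, drop_id=True):
    """Remove NaN-valued fields (and optionally '_id') from each record."""
    cleaned = []
    for record in records:
        document = {}
        for key, value in record.items():
            if drop_id and key == "_id":
                continue  # let MongoDB assign a fresh _id on insert
            if isinstance(value, float) and math.isnan(value):
                continue  # field was absent in the original document
            document[key] = value
        cleaned.append(document)
    return cleaned


# Hypothetical records shaped like the output of df.to_dict('records')
records = [
    {"author": "Mike", "title": float("nan"), "tags": ["mongodb", "python"]},
    {"author": "Lucas", "title": "Working With JSON Data in Python",
     "tags": float("nan")},
]
documents = drop_missing(records)
print(documents)
# The cleaned list could then be passed to posts_collection.insert_many(documents)
```

Dropping the NaN fields restores the original "ragged" document shapes, so each stored document contains only the keys it actually has values for.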

### Closing your MongoDB Connection

Always be sure to close your connection to your MongoDB server prior to exiting your Python environment, using the __PyMongo__ syntax shown below.

In [69]:
# Lastly, be sure to terminate your connection with your MongoDB server
# prior to exiting your Python environment
client.close()