# MongoDB

MongoDB is a NoSQL database for storing JSON-style documents (which correspond to nested dictionaries and lists in Python).

https://docs.mongodb.com/manual/tutorial/getting-started/

In contrast to relational databases, MongoDB documents (the counterpart of table rows) are grouped in collections (the counterpart of tables) and have no fixed structure. However, it makes sense to keep all documents in one collection as similar as possible, because this makes queries and other operations on them easier.
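
As a minimal sketch of what "no fixed structure" means, the two dictionaries below could live in the same collection even though their fields differ. The field values and the `shared_fields` helper are made up for illustration and are not part of any MongoDB API.

```python
# Two hypothetical documents that could be stored in one collection.
doc_a = {'name': 'file1', 'tags': ['tax', 'bill'], 'status': 'paid'}
doc_b = {'name': 'file2', 'pages': 3}  # no 'tags' or 'status', extra 'pages'


def shared_fields(*documents):
    """Return the sorted field names that every given document contains."""
    keys = [set(d) for d in documents]
    return sorted(set.intersection(*keys))


# Only fields present in all documents can be queried uniformly.
print(shared_fields(doc_a, doc_b))  # ['name']
```

The fewer fields the documents share, the more special cases every query has to handle, which is the practical argument for keeping them similar.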

MongoDB runs both on a single machine and distributed across multiple machines. Distributed setups are not discussed here.

This tutorial requires a running MongoDB server, preferably in a Docker container started using the given docker-compose file.

In [2]:
!conda install pymongo mongoengine --yes # installs low-level and high-level APIs

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - mongoengine
    - pymongo


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    mongoengine-0.18.2         |           py37_0         157 KB  conda-forge
    pymongo-3.8.0              |   py37he1b5a44_0         966 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.1 MB

The following NEW packages will be INSTALLED:

  mongoengine        conda-forge/linux-64::mongoengine-0.18.2-py37_0
  pymongo            conda-forge/linux-64::pymongo-3.8.0-py37he1b5a44_0



Downloading and Extracting Packages
mongoengine-0.18.2   | 157 KB    | ##################################### | 100% 
pymongo-3.8.0        | 966 KB    | ##########################

In [2]:
import pymongo
import gridfs
import mimetypes

## Connecting

In [33]:
url = 'mongo_db' # Docker alias
port = 27017
db_name = 'tutorial_db'
con_str = f'mongodb://{url}:{port}'
client = pymongo.MongoClient(con_str)
db = client[db_name] # create a new database or open existing database

In [49]:
db.list_collection_names() # should be empty for new database

['doc']

## Storing and Retrieving Documents

In [13]:
docs = db.docs

In [14]:
entry = {'name': 'file1',
        'tags': ['tax', 'bill'],
        'Status': 'payed'}
result = docs.insert_one(entry)

In [15]:
result.inserted_id

ObjectId('5d24e6ff8e16a0e4fa656ca9')

In [16]:
docs.find_one()

{'_id': ObjectId('5d24e6ff8e16a0e4fa656ca9'),
 'name': 'file1',
 'tags': ['tax', 'bill'],
 'Status': 'payed'}

## GridFS

GridFS is an API for storing and retrieving files of any size in MongoDB.

https://docs.mongodb.com/manual/core/gridfs/

In [17]:
fs = gridfs.GridFS(db)

### Store files in db

In [18]:
filename = '../code/test_data.csv' # uses test data from Pandas tutorial
metadata = {'description': 'NYC taxi data',
            'tags': ['test', 'large'],
            }
content_type = mimetypes.guess_type(filename)[0] # automatically infer data type
content_type

'text/csv'

In [19]:
with open(filename,'rb') as f:
    fs.put(f, filename=filename, metadata=metadata, content_type=content_type)

### Check db content

In [21]:
fs.list() # list of all stored files

['../code/test_data.csv']

In [22]:
db.list_collection_names() # note the 2 collections for fs

['docs', 'fs.files', 'fs.chunks']

In [20]:
list(db.fs.files.find()) # storage of file metadata

[{'_id': ObjectId('5d24e7848e16a0e4fa656caa'),
  'filename': '../code/test_data.csv',
  'metadata': {'description': 'NYC taxi data', 'tags': ['test', 'large']},
  'contentType': 'text/csv',
  'md5': '8c2bdc4b7a49464147d557568a2979bf',
  'chunkSize': 261120,
  'length': 70930317,
  'uploadDate': datetime.datetime(2019, 7, 9, 19, 14, 15, 263000)}]

In [29]:
db.fs.chunks.count_documents({}) # here the actual data is stored in chunks

272
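
The chunk count can be checked by hand: GridFS splits each file into pieces of at most `chunkSize` bytes, so the number of chunks is the file length divided by the chunk size, rounded up. A small sketch using the values from the `fs.files` entry above (the variable names are chosen for illustration, and it assumes this is the only file stored):

```python
import math

length = 70930317    # 'length' from the fs.files document, in bytes
chunk_size = 261120  # 'chunkSize' from the fs.files document (255 * 1024 bytes)

n_chunks = math.ceil(length / chunk_size)
last_chunk_bytes = length - (n_chunks - 1) * chunk_size

print(n_chunks)          # 272, matching count_documents({})
print(last_chunk_bytes)  # 166797 bytes in the final, partially filled chunk
```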

### Get Files from db

In [30]:
with open('test_output.csv','wb') as f:
    f.write(fs.get_last_version(filename).read())
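
Calling `read()` without arguments loads the whole file into memory, which is about 70 MB here. A possible alternative is to copy in pieces with `shutil.copyfileobj`, which works on any object that offers a `read(size)` method. The sketch below uses `io.BytesIO` objects as stand-ins so that it runs without a database; with GridFS, the source would be the object returned by `fs.get_last_version(filename)`, assuming it behaves like a regular binary file object.

```python
import io
import shutil

# Stand-in for the file object returned by GridFS (assumption for this sketch).
source = io.BytesIO(b'a,b\n1,2\n' * 1000)
target = io.BytesIO()  # stand-in for open('test_output.csv', 'wb')

# Copy in 256 KiB pieces instead of reading everything at once.
shutil.copyfileobj(source, target, length=256 * 1024)

print(target.getvalue() == source.getvalue())  # True
```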

### Delete all Files

In [31]:
for f in fs.find():
    print(f._id)
    fs.delete(f._id)

5d24e7848e16a0e4fa656caa


In [32]:
%rm test_output.csv

# MongoEngine

MongoEngine is a Document-Object Mapper (think ORM, but for document databases) for working with MongoDB from Python.

It allows you to define a (flexible) schema and validation rules for documents.

http://mongoengine.org/

http://docs.mongoengine.org/

In [38]:
import mongoengine

## Connecting

In [39]:
mongoengine.connect(db=db_name, host=url, port=port)

MongoClient(host=['mongo_db:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())

## Defining Document Schema

In [40]:
class Doc(mongoengine.Document):
    name = mongoengine.StringField(required=True)
    status = mongoengine.StringField(choices=['open', 'done'])
    tags = mongoengine.ListField(mongoengine.StringField())

## Storing Data

In [41]:
doc1 = Doc(name='bill1', status='open', tags=['tax'])
doc2 = Doc(name='bill2', status='done', tags=['tax', 'handyman'])

In [42]:
doc1.save()
doc2.save()

<Doc: Doc object>

In [46]:
doc3 = Doc(name='invalid_bill', status='invalid', tags=['tax', 'handyman'])
try:
    doc3.save()
except mongoengine.ValidationError as e:
    print(e)

ValidationError (Doc:None) (Value must be one of ['open', 'done']: ['status'])
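
To make the two rules in the schema concrete, the following plain-Python sketch performs comparable checks by hand. It is not MongoEngine's implementation; the function `validate_doc` and its error messages are hypothetical.

```python
ALLOWED_STATUS = ['open', 'done']  # mirrors choices=[...] in the schema


def validate_doc(doc):
    """Return a list of problems found in a document given as a dict."""
    errors = []
    # Counterpart of required=True: the field must be present.
    if doc.get('name') is None:
        errors.append('name is required')
    # Counterpart of choices=[...]: an optional field with a fixed value set.
    status = doc.get('status')
    if status is not None and status not in ALLOWED_STATUS:
        errors.append(f'status must be one of {ALLOWED_STATUS}')
    return errors


print(validate_doc({'name': 'bill1', 'status': 'open'}))  # []
print(validate_doc({'name': 'invalid_bill', 'status': 'invalid'}))
```

The second call reports the same kind of problem that made `doc3.save()` fail above.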


## Retrieving Data

In [47]:
for doc in Doc.objects(tags='handyman'):
    print(doc.name)

bill2
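
The query `tags='handyman'` matches every document whose `tags` list contains that value, not only documents whose list equals it. The matching rule can be imitated in plain Python as follows; this only illustrates the semantics on copies of the two saved documents and is not how MongoEngine or MongoDB evaluate queries internally. The helper `with_tag` is hypothetical.

```python
# Plain dictionaries mirroring the two documents saved above.
records = [
    {'name': 'bill1', 'status': 'open', 'tags': ['tax']},
    {'name': 'bill2', 'status': 'done', 'tags': ['tax', 'handyman']},
]


def with_tag(items, tag):
    """Return the names of all records whose 'tags' list contains tag."""
    return [item['name'] for item in items if tag in item.get('tags', [])]


print(with_tag(records, 'handyman'))  # ['bill2']
print(with_tag(records, 'tax'))       # ['bill1', 'bill2']
```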


# Pandas Interaction

In [79]:
import pandas as pd
import json
from datetime import datetime

In [60]:
df_taxi = pd.read_csv('../code/test_data.csv', parse_dates=[1, 2])
df_taxi.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type
0,2,2018-01-01 00:18:50,2018-01-01 00:24:39,N,1,236,236,5,0.7,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2,1.0
1,2,2018-01-01 00:30:26,2018-01-01 00:46:42,N,1,43,42,5,3.5,14.5,0.5,0.5,0.0,0.0,,0.3,15.8,2,1.0
2,2,2018-01-01 00:07:25,2018-01-01 00:19:45,N,1,74,152,1,2.14,10.0,0.5,0.5,0.0,0.0,,0.3,11.3,2,1.0
3,2,2018-01-01 00:32:40,2018-01-01 00:33:41,N,1,255,255,1,0.03,-3.0,-0.5,-0.5,0.0,0.0,,-0.3,-4.3,3,1.0
4,2,2018-01-01 00:32:40,2018-01-01 00:33:41,N,1,255,255,1,0.03,3.0,0.5,0.5,0.0,0.0,,0.3,4.3,2,1.0


In [90]:
start_time = datetime.utcnow()
db.taxi.insert_many(json.loads(df_taxi.to_json(orient='records'))) # inserting ca. 800k rows!
print(f'elapsed time: {datetime.utcnow() - start_time}')

elapsed time: 0:01:00.772149


Loading data from MongoDB into a Pandas DataFrame.

In [93]:
db.taxi.find_one({'VendorID': 1})

{'_id': ObjectId('5d25b9ee8e16a0e4fad56157'),
 'VendorID': 1,
 'lpep_pickup_datetime': 1514765260000,
 'lpep_dropoff_datetime': 1514765720000,
 'store_and_fwd_flag': 'N',
 'RatecodeID': 1,
 'PULocationID': 225,
 'DOLocationID': 37,
 'passenger_count': 1,
 'trip_distance': 1.9,
 'fare_amount': 8.0,
 'extra': 0.5,
 'mta_tax': 0.5,
 'tip_amount': 3.0,
 'tolls_amount': 0.0,
 'ehail_fee': None,
 'improvement_surcharge': 0.3,
 'total_amount': 12.3,
 'payment_type': 1,
 'trip_type': 1.0}

In [91]:
start_time = datetime.utcnow()
df_taxi_from_db = pd.DataFrame(list(db.taxi.find()))
print(f'elapsed time: {datetime.utcnow() - start_time}')
df_taxi_from_db.head()

elapsed time: 0:00:27.026934


Unnamed: 0,DOLocationID,PULocationID,RatecodeID,VendorID,_id,ehail_fee,extra,fare_amount,improvement_surcharge,lpep_dropoff_datetime,lpep_pickup_datetime,mta_tax,passenger_count,payment_type,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,trip_type
0,236,236,1,2,5d25b9ee8e16a0e4fad56148,,0.5,6.0,0.3,1514766279000,1514765930000,0.5,5,2,N,0.0,0.0,7.3,0.7,1.0
1,42,43,1,2,5d25b9ee8e16a0e4fad56149,,0.5,14.5,0.3,1514767602000,1514766626000,0.5,5,2,N,0.0,0.0,15.8,3.5,1.0
2,152,74,1,2,5d25b9ee8e16a0e4fad5614a,,0.5,10.0,0.3,1514765985000,1514765245000,0.5,1,2,N,0.0,0.0,11.3,2.14,1.0
3,255,255,1,2,5d25b9ee8e16a0e4fad5614b,,-0.5,-3.0,-0.3,1514766821000,1514766760000,-0.5,1,3,N,0.0,0.0,-4.3,0.03,1.0
4,255,255,1,2,5d25b9ee8e16a0e4fad5614c,,0.5,3.0,0.3,1514766821000,1514766760000,0.5,1,2,N,0.0,0.0,4.3,0.03,1.0


Checking the number of rows retrieved from MongoDB.

In [92]:
len(df_taxi_from_db)

793529

In [89]:
db.drop_collection('taxi')

{'ns': 'tutorial_db.taxi', 'nIndexesWas': 1, 'ok': 1.0}

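Data transfer between Pandas DataFrames and MongoDB is reasonably fast: in the run above, inserting the roughly 794,000 rows took about one minute, and reading them back into a DataFrame took about 27 seconds.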

Note that some data type conversions require special attention: the datetime columns above, for example, come back from the database as integers. Additionally, using (tabular) DataFrames only makes sense if the data is somewhat table-like.
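
The datetime issue arises because `to_json` writes timestamps as milliseconds since the Unix epoch by default, which is why the values read back from MongoDB look like `1514765930000`. A minimal sketch of converting such a value back with the standard library, assuming the timestamps are meant as UTC (for a whole DataFrame column, `pd.to_datetime(column, unit='ms')` should do the same):

```python
from datetime import datetime, timezone

pickup_ms = 1514765930000  # 'lpep_pickup_datetime' of the first row read back

# Epoch milliseconds -> timezone-aware datetime in UTC.
pickup = datetime.fromtimestamp(pickup_ms / 1000, tz=timezone.utc)

print(pickup.strftime('%Y-%m-%d %H:%M:%S'))  # 2018-01-01 00:18:50
```

The result agrees with the pickup time of the first row in the original CSV data.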

# Cleanup

In [36]:
client.drop_database(db_name)