# MongoDB

MongoDB is a NoSQL database for storing JSON-style documents (which correspond to dictionaries and lists in Python).

https://docs.mongodb.com/manual/tutorial/getting-started/

In contrast to relational databases, MongoDB documents (corresponding to rows in relational tables) in collections (corresponding to tables) have no fixed structure. However, it makes sense to keep all documents in one collection as similar as possible, because this makes queries and other operations on them easier.
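
To make the "no fixed structure" point concrete, here is a small sketch using plain Python dictionaries, which is the form in which pymongo accepts documents. The example data and the helper `shared_fields` are made up for illustration, and no database is needed to run it.

```python
# Two documents that could live in the same collection even though
# their fields differ (hypothetical example data).
doc_a = {'name': 'file1', 'tags': ['tax', 'bill'], 'status': 'paid'}
doc_b = {'name': 'file2', 'pages': 3}


def shared_fields(documents):
    """Return the set of field names present in every document."""
    key_sets = [set(d) for d in documents]
    return set.intersection(*key_sets) if key_sets else set()


common = shared_fields([doc_a, doc_b])
print(sorted(common))
```

A filter on a concrete value of a field outside that shared set skips every document that lacks the field, which is one reason a consistent layout pays off.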

MongoDB runs both on a single machine and distributed across multiple machines in a cluster. Distributed setups are not discussed here.

This tutorial requires a running MongoDB server, preferably in a Docker container started using the given docker-compose file.

In [1]:
!conda install pymongo mongoengine --yes # installs low-level and high-level APIs


In [1]:
import pymongo
import gridfs
import mimetypes

## Connecting

In [22]:
url = 'mongo_db' # Docker alias
port = 27017
db_name = 'tutorial_db'
con_str = f'mongodb://{url}:{port}'
client = pymongo.MongoClient(con_str)
db = client[db_name] # create a new database or open existing database

In [23]:
db.list_collection_names() # should be empty for new database

[]

## Storing and Retrieving Documents

In [5]:
docs = db.docs

In [6]:
entry = {'name': 'file1',
        'tags': ['tax', 'bill'],
        'Status': 'payed'}
result = docs.insert_one(entry)

In [7]:
result.inserted_id

ObjectId('5d2636403471587e35da0a5b')

In [8]:
docs.find_one()

{'_id': ObjectId('5d2636403471587e35da0a5b'),
 'name': 'file1',
 'tags': ['tax', 'bill'],
 'Status': 'payed'}

## GridFS

GridFS is an API to store and retrieve files of any size in a MongoDB database.

https://docs.mongodb.com/manual/core/gridfs/

In [9]:
fs = gridfs.GridFS(db)

### Store files in db

In [10]:
filename = '../code/test_data.csv' # uses test data from Pandas tutorial
metadata = {'description': 'NYC taxi data',
            'tags': ['test', 'large'],
            }
content_type = mimetypes.guess_type(filename)[0] # automatically infer data type
content_type

'text/csv'

In [11]:
with open(filename,'rb') as f:
    fs.put(f, filename=filename, metadata=metadata, content_type=content_type)

### Check db content

In [12]:
fs.list() # list of all stored files

['../code/test_data.csv']

In [13]:
db.list_collection_names() # note the 2 collections for fs

['docs', 'fs.files', 'fs.chunks']

In [14]:
list(db.fs.files.find()) # storage of file metadata

[{'_id': ObjectId('5d2636553471587e35da0a5c'),
  'filename': '../code/test_data.csv',
  'metadata': {'description': 'NYC taxi data', 'tags': ['test', 'large']},
  'contentType': 'text/csv',
  'md5': '8c2bdc4b7a49464147d557568a2979bf',
  'chunkSize': 261120,
  'length': 70930317,
  'uploadDate': datetime.datetime(2019, 7, 10, 19, 2, 47, 981000)}]

In [15]:
db.fs.chunks.count_documents({}) # here the actual data is stored in chunks

272
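
The chunk count follows directly from the `length` and `chunkSize` values shown in the `fs.files` entry above. The following lines are only a worked check of that arithmetic; the variable names are chosen for illustration.

```python
import math

chunk_size = 261120   # 'chunkSize' from the fs.files entry above (255 KiB)
length = 70930317     # 'length' of the stored file in bytes

# Every chunk is full except possibly the last one.
n_chunks = math.ceil(length / chunk_size)
last_chunk = length - (n_chunks - 1) * chunk_size

print(n_chunks)    # 272
print(last_chunk)  # 166797
```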

### Get Files from db

In [16]:
with open('test_output.csv','wb') as f:
    f.write(fs.get_last_version(filename).read())
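
One way to check that a retrieved file is intact is to compare its MD5 digest with the `md5` value shown in the `fs.files` entry above. This is a minimal sketch using only the standard library; the helper name `file_md5` and its `block_size` parameter are assumptions for illustration, and newer driver versions may no longer write the `md5` field.

```python
import hashlib


def file_md5(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops once read() returns empty bytes
        for block in iter(lambda: f.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()
```

With the file written above, `file_md5('test_output.csv')` should equal the stored `md5` string if the round trip was lossless.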

### Delete all Files

In [17]:
for f in fs.find():
    print(f._id)
    fs.delete(f._id)

5d2636553471587e35da0a5c


In [18]:
%rm test_output.csv

# MongoEngine

MongoEngine is a Document-Object Mapper (think ORM, but for document databases) for working with MongoDB from Python.

It allows you to define a (flexible) schema and validation rules for documents.

http://mongoengine.org/

http://docs.mongoengine.org/

In [264]:
import mongoengine
import datetime

## Connecting

In [313]:
client = mongoengine.connect(db=db_name, host=url, port=port)

In [314]:
client.get_database(db_name)

Database(MongoClient(host=['mongo_db:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary()), 'tutorial_db')

## Defining Document Schema

In [315]:
class HistorizedBase(mongoengine.Document):
    """
    Base class for historized collections,
    use historize(doc, db) for historization before any update!
    """
    meta = {'abstract': True} # allows inheritance from this class
    version = mongoengine.IntField(default=0)
    last_update = mongoengine.DateTimeField()
    update_user = mongoengine.StringField(required=True)
    creation_date = mongoengine.DateTimeField(default=datetime.datetime.utcnow)
    history_original_id = mongoengine.ObjectIdField() # only set for historized documents
    
    def save(self, *args, **kwargs):
        """Override save to keep track of version number and last update."""
        self.version += 1
        self.last_update = datetime.datetime.utcnow()  # call the function to store a timestamp
        return super().save(*args, **kwargs)
        
def historize(doc, db, hist_table=None):
    """
    Historizes Mongoengine document (must inherit from HistorizedBase class) into history table.
    Parameters: 
    * doc: Mongoengine document instance
    * db: pymongo database
    * hist_table: name of history table. If not set, the name is generated appending '_history'
    to original table name.
    returns name of history table
    """
    ref = doc.to_dbref()
    hist_table = f'{ref.collection}_history' if hist_table is None else hist_table
    data = db[ref.collection].find_one({'_id': ref.id}) # read from the document's own collection
    data.pop('_id', None) # remove id field to avoid duplicate key errors
    data['history_original_id'] = ref.id
    db[hist_table].insert_one(data)
    return hist_table

In [316]:
class Doc(HistorizedBase):
    name = mongoengine.StringField(required=True)
    status = mongoengine.StringField(choices=['open', 'done'])
    tags = mongoengine.ListField(mongoengine.StringField())

## Storing Data

In [317]:
doc1 = Doc(name='bill1', status='open', tags=['tax'], update_user='user1')
doc2 = Doc(name='bill2', status='done', tags=['tax', 'handyman'], update_user='user1')

In [318]:
doc1.save()
doc2.save()

In [319]:
doc3 = Doc(name='invalid_bill', status='invalid', tags=['tax', 'handyman'], 
           update_user='user1')
try:
    doc3.save()
except mongoengine.ValidationError as e:
    print(e)

ValidationError (Doc:None) (Value must be one of ['open', 'done']: ['status'])


## Retrieving Data

In [320]:
for doc in Doc.objects(tags='handyman'):
    print(doc.name)

bill2


In [321]:
list(db.doc.find())

[{'_id': ObjectId('5d26ebf840c925f417f072be'),
  'version': 1,
  'last_update': datetime.datetime(2019, 7, 11, 7, 57, 44, 337000),
  'update_user': 'user1',
  'creation_date': datetime.datetime(2019, 7, 11, 7, 57, 43, 717000),
  'name': 'bill1',
  'status': 'open',
  'tags': ['tax']},
 {'_id': ObjectId('5d26ebf840c925f417f072bf'),
  'version': 1,
  'last_update': datetime.datetime(2019, 7, 11, 7, 57, 44, 939000),
  'update_user': 'user1',
  'creation_date': datetime.datetime(2019, 7, 11, 7, 57, 43, 717000),
  'name': 'bill2',
  'status': 'done',
  'tags': ['tax', 'handyman']}]

## Updating Existing Documents

In [322]:
doc_to_update = Doc.objects(name='bill2')[0]
doc_to_update

<Doc: Doc object>

In [323]:
historize(doc_to_update, db)

'doc_history'

In [324]:
doc_to_update.status = 'open'
doc_to_update.update_user = 'user2'
doc_to_update.save()

In [325]:
list(db.doc.find())

[{'_id': ObjectId('5d26ebf840c925f417f072be'),
  'version': 1,
  'last_update': datetime.datetime(2019, 7, 11, 7, 57, 44, 337000),
  'update_user': 'user1',
  'creation_date': datetime.datetime(2019, 7, 11, 7, 57, 43, 717000),
  'name': 'bill1',
  'status': 'open',
  'tags': ['tax']},
 {'_id': ObjectId('5d26ebf840c925f417f072bf'),
  'version': 2,
  'last_update': datetime.datetime(2019, 7, 11, 7, 57, 55, 109000),
  'update_user': 'user2',
  'creation_date': datetime.datetime(2019, 7, 11, 7, 57, 43, 717000),
  'name': 'bill2',
  'status': 'open',
  'tags': ['tax', 'handyman']}]

In [326]:
list(db.doc_history.find())

[{'_id': ObjectId('5d26ec0140c925f417f072c0'),
  'version': 1,
  'last_update': datetime.datetime(2019, 7, 11, 7, 57, 44, 939000),
  'update_user': 'user1',
  'creation_date': datetime.datetime(2019, 7, 11, 7, 57, 43, 717000),
  'name': 'bill2',
  'status': 'done',
  'tags': ['tax', 'handyman'],
  'history_original_id': ObjectId('5d26ebf840c925f417f072bf')}]
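
The transformation that turns a stored document into its history entry can be sketched on plain dictionaries, without a database. The helper `make_history_record` and the sample data are hypothetical; they only mirror the steps described in the `historize` docstring above.

```python
import copy


def make_history_record(stored_doc):
    """Return a copy of a stored document prepared for the history collection."""
    record = copy.deepcopy(stored_doc)     # leave the original untouched
    original_id = record.pop('_id', None)  # drop _id so MongoDB assigns a new one
    record['history_original_id'] = original_id
    return record


# Hypothetical stand-in for a document as returned by pymongo
stored = {'_id': 'abc123', 'version': 1, 'name': 'bill2', 'status': 'done'}
snapshot = make_history_record(stored)
print(snapshot)
```

Removing `_id` is what allows several snapshots of the same document to coexist in the history collection, while `history_original_id` keeps the link back to the live document.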

# Pandas Interaction

In [260]:
import pandas as pd
import json
from datetime import datetime

In [60]:
df_taxi = pd.read_csv('../code/test_data.csv', parse_dates=[1, 2])
df_taxi.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type
0,2,2018-01-01 00:18:50,2018-01-01 00:24:39,N,1,236,236,5,0.7,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2,1.0
1,2,2018-01-01 00:30:26,2018-01-01 00:46:42,N,1,43,42,5,3.5,14.5,0.5,0.5,0.0,0.0,,0.3,15.8,2,1.0
2,2,2018-01-01 00:07:25,2018-01-01 00:19:45,N,1,74,152,1,2.14,10.0,0.5,0.5,0.0,0.0,,0.3,11.3,2,1.0
3,2,2018-01-01 00:32:40,2018-01-01 00:33:41,N,1,255,255,1,0.03,-3.0,-0.5,-0.5,0.0,0.0,,-0.3,-4.3,3,1.0
4,2,2018-01-01 00:32:40,2018-01-01 00:33:41,N,1,255,255,1,0.03,3.0,0.5,0.5,0.0,0.0,,0.3,4.3,2,1.0


In [90]:
start_time = datetime.utcnow()
db.taxi.insert_many(json.loads(df_taxi.to_json(orient='records'))) # inserting ca. 800k rows!
print(f'elapsed time: {datetime.utcnow() - start_time}')

elapsed time: 0:01:00.772149


Loading data from MongoDB into Pandas DataFrame.

In [93]:
db.taxi.find_one({'VendorID': 1})

{'_id': ObjectId('5d25b9ee8e16a0e4fad56157'),
 'VendorID': 1,
 'lpep_pickup_datetime': 1514765260000,
 'lpep_dropoff_datetime': 1514765720000,
 'store_and_fwd_flag': 'N',
 'RatecodeID': 1,
 'PULocationID': 225,
 'DOLocationID': 37,
 'passenger_count': 1,
 'trip_distance': 1.9,
 'fare_amount': 8.0,
 'extra': 0.5,
 'mta_tax': 0.5,
 'tip_amount': 3.0,
 'tolls_amount': 0.0,
 'ehail_fee': None,
 'improvement_surcharge': 0.3,
 'total_amount': 12.3,
 'payment_type': 1,
 'trip_type': 1.0}

In [91]:
start_time = datetime.utcnow()
df_taxi_from_db = pd.DataFrame(list(db.taxi.find()))
print(f'elapsed time: {datetime.utcnow() - start_time}')
df_taxi_from_db.head()

elapsed time: 0:00:27.026934


Unnamed: 0,DOLocationID,PULocationID,RatecodeID,VendorID,_id,ehail_fee,extra,fare_amount,improvement_surcharge,lpep_dropoff_datetime,lpep_pickup_datetime,mta_tax,passenger_count,payment_type,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,trip_type
0,236,236,1,2,5d25b9ee8e16a0e4fad56148,,0.5,6.0,0.3,1514766279000,1514765930000,0.5,5,2,N,0.0,0.0,7.3,0.7,1.0
1,42,43,1,2,5d25b9ee8e16a0e4fad56149,,0.5,14.5,0.3,1514767602000,1514766626000,0.5,5,2,N,0.0,0.0,15.8,3.5,1.0
2,152,74,1,2,5d25b9ee8e16a0e4fad5614a,,0.5,10.0,0.3,1514765985000,1514765245000,0.5,1,2,N,0.0,0.0,11.3,2.14,1.0
3,255,255,1,2,5d25b9ee8e16a0e4fad5614b,,-0.5,-3.0,-0.3,1514766821000,1514766760000,-0.5,1,3,N,0.0,0.0,-4.3,0.03,1.0
4,255,255,1,2,5d25b9ee8e16a0e4fad5614c,,0.5,3.0,0.3,1514766821000,1514766760000,0.5,1,2,N,0.0,0.0,4.3,0.03,1.0


Retrieving data from MongoDB

In [92]:
len(df_taxi_from_db)

793529

In [89]:
db.drop_collection('taxi')

{'ns': 'tutorial_db.taxi', 'nIndexesWas': 1, 'ok': 1.0}

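In this run, transferring data between Pandas DataFrames and MongoDB was reasonably fast: inserting the 793,529 rows took about 61 seconds (roughly 13,000 rows per second), and loading them back into a DataFrame took about 27 seconds (roughly 29,000 rows per second).

Note that some data type conversions require special attention. For example, the datetime columns above were stored as integers (milliseconds since the Unix epoch) because of the round trip through JSON. Additionally, using (tabular) DataFrames only makes sense if the data is somewhat table-like.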
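
As an illustration of such a conversion, the integer timestamps shown in the `find_one` output above can be turned back into datetimes. This is a minimal sketch using only the standard library; the helper name `from_epoch_ms` is an assumption, and the values are interpreted as UTC.

```python
from datetime import datetime, timezone


def from_epoch_ms(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# 'lpep_pickup_datetime' value from the find_one output above
pickup = from_epoch_ms(1514765260000)
print(pickup.isoformat())  # 2018-01-01T00:07:40+00:00
```

For a whole DataFrame column, `pd.to_datetime(column, unit='ms')` performs the same conversion in one call.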

# Cleanup

In [312]:
client.drop_database(db_name)