****
# MongoDB connection using PyMongo
****

## About this notebook: 
Notebook prepared by **Jesus Perez Colino** Version 0.1, First Released: 01/12/2014, Alpha.  

- This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This work is offered for free, with the hope that it will be useful.


- **Summary**: This notebook contains a brief introduction to **MongoDB** and **PyMongo**, with data scraping examples using **Scrapy**.


- **Python & packages versions** to reproduce the results of this notebook: 

In [1]:
from datetime import datetime, timedelta
import pymongo
import scrapy
from sys import version
from pymongo import MongoClient
print ' Reproducibility conditions for this notebook '.center(90,'-')
print 'Python version:       ' + version
print 'Pymongo version:      ' + pymongo.version
print 'Scrapy version:       ' + scrapy.__version__
print '-'*90

---------------------- Reproducibility conditions for this notebook ----------------------
Python version:       2.7.10 |Anaconda 2.3.0 (x86_64)| (default, Sep 15 2015, 14:29:08) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pymongo version:      3.0.3
Scrapy version:       0.20.2
------------------------------------------------------------------------------------------


# Basics about MongoDB with PyMongo

First, open a connection to the MongoDB server:

In [2]:
try:
    client = MongoClient("localhost", 27017)
    # PyMongo 3 connects lazily, so force a round trip to detect failures
    client.admin.command("ismaster")
    print "Connected to MongoDB as:", client
except pymongo.errors.ConnectionFailure as e:
    print "Could not connect to MongoDB: %s" % e

Connected to MongoDB as: MongoClient('localhost', 27017)


MongoDB is a document-oriented database. It differs from a relational database in two significant ways. First, the entries (documents) in a collection need not all adhere to the same schema. Second, you can embed documents inside one another.
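Both differences can be pictured with plain Python dictionaries, which is how PyMongo represents documents. The sketch below is only an illustration and does not talk to a server; the documents and field names (`name`, `address`, `sku`, and so on) are made up for this example.

```python
# Two documents that could live in the same collection.
# They do not share a schema, and the first one embeds another document.
person = {
    "name": "alice",
    "address": {"city": "Springfield", "zip": "00000"},  # embedded document
    "tags": ["author", "tester"],
}
gadget = {"sku": 42, "price": 9.5}  # entirely different fields

# Embedded fields are reached by walking the nested dictionaries.
print(person["address"]["city"])  # Springfield

# The two documents have no field names in common.
shared_fields = set(person) & set(gadget)
print(len(shared_fields))  # 0
```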

MongoDB creates **databases** and **collections** automatically, the first time data is stored in them, if they don't already exist. A single instance of MongoDB *can support multiple independent databases*.

When working with PyMongo, you access databases using attribute-style access:

In [3]:
db = client.test_database
print db

Database(MongoClient('localhost', 27017), u'test_database')


A **collection** is a *group of documents* stored in MongoDB, and can be thought of as roughly the equivalent of a table in a relational database. 

In [66]:
# Drop existing collections to avoid collisions with data left over from previous sessions:

for name in db.collection_names():
    if name != 'system.indexes':
        db.drop_collection(name)

db.collection_names()

[u'system.indexes']

Getting a collection in PyMongo works the same as getting a database:

In [67]:
db.create_collection("test")

document = {"x": "jpcolino", "tags": ["author", "developer", "tester"]}

db.test.insert_one(document)

<pymongo.results.InsertOneResult at 0x1065f12d0>

In [68]:
print '-'*75
print 'Databases open in client: ', client.database_names()
print 'Collection names in db:   ', db.collection_names()
print '-'*75

---------------------------------------------------------------------------
Databases open in client:  [u'local', u'test_database']
Collection names in db:    [u'system.indexes', u'test']
---------------------------------------------------------------------------


MongoDB is sometimes referred to as a *“schemaless” database*, meaning that it does not enforce a particular structure on documents in a collection. It is perfectly legal (though of questionable utility) to store every object in your application in the same collection, regardless of its structure. In a well-designed application, however, it is more frequently the case that a collection will contain documents of identical, or closely related, structure. When all the documents in a collection are similarly, but not identically, structured, we call this a **polymorphic schema**.
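One way to see whether a group of documents is polymorphic is to compare their sets of field names. The following is a minimal sketch in plain Python, with no server involved; the sample documents are illustrative only.

```python
from collections import Counter

docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"y": 4, "tags": []},
]

# Count how many documents use each distinct set of field names ("shape").
shapes = Counter(frozenset(doc) for doc in docs)
for fields, n in shapes.items():
    print("%s: %d document(s)" % (sorted(fields), n))

# Several shapes that still share some fields suggest a polymorphic schema.
common_fields = frozenset.intersection(*shapes)
print(sorted(common_fields))  # ['tags']
```

Here two shapes appear (`x` plus `tags`, and `y` plus `tags`), and `tags` is common to both, so the documents are similarly but not identically structured.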

In [None]:
result = db.test.insert_many([{"x": 1, "tags": ["dog", "cat"]},
                              {"x": 2, "tags": ["cat"]},
                              {"x": 2, "tags": ["mouse", "cat", "dog"]},
                              {"x": 3, "tags": []},
                              {"y": 4, "tags": []}])

# Basic operations with a document: update with $rename

db.test.update_one({"y": 4},{"$rename": {"y":"x"}})

In [73]:
for doc in db.test.find():
    print doc

{u'x': u'jpcolino', u'_id': ObjectId('56260c91c47fab036c5273da'), u'tags': [u'author', u'developer', u'tester']}
{u'x': 1, u'_id': ObjectId('56260c93c47fab036c5273db'), u'tags': [u'dog', u'cat']}
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dc'), u'tags': [u'cat']}
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dd'), u'tags': [u'mouse', u'cat', u'dog']}
{u'x': 3, u'_id': ObjectId('56260c93c47fab036c5273de'), u'tags': []}
{u'x': 4, u'_id': ObjectId('56260c93c47fab036c5273df'), u'tags': []}
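The effect of `$rename` on the last document can be mimicked on an in-memory dictionary. This is a rough sketch of the behaviour, not of how the server performs the update; `rename_fields` is a hypothetical helper written for this example.

```python
def rename_fields(doc, mapping):
    """Rename keys of doc in place, skipping keys that are absent."""
    for old, new in mapping.items():
        if old in doc:
            doc[new] = doc.pop(old)
    return doc

doc = {"y": 4, "tags": []}
rename_fields(doc, {"y": "x"})
print(doc == {"x": 4, "tags": []})  # True

# A document without the old field is left untouched.
other = {"x": 1, "tags": ["dog", "cat"]}
rename_fields(other, {"y": "x"})
print(other == {"x": 1, "tags": ["dog", "cat"]})  # True
```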


In [75]:
# Basic operations with a document: delete with delete_one

db.test.delete_one({'x':'jpcolino'})

for doc in db.test.find():
    print doc

{u'x': 1, u'_id': ObjectId('56260c93c47fab036c5273db'), u'tags': [u'dog', u'cat']}
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dc'), u'tags': [u'cat']}
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dd'), u'tags': [u'mouse', u'cat', u'dog']}
{u'x': 3, u'_id': ObjectId('56260c93c47fab036c5273de'), u'tags': []}
{u'x': 4, u'_id': ObjectId('56260c93c47fab036c5273df'), u'tags': []}


In [95]:
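# Basic information about the collection:

print 'Name of the collection: \n', db.test.name
print '-'*75
# Note: attribute or item access on a collection (db.test.acknowledged,
# db.test['x'], db.test['tags']) does not read a field; it returns the
# sub-collection "test.<name>", as the output shows.
print 'Full descriptions: \n', db.test.acknowledged
print '-'*75
print result.inserted_ids
print '-'*75
print db.test.find_one()
print '-'*75
for d in db.test.find()[1:]:
    print d
print '-'*75
print db.test['x']
print '-'*75
print db.test['tags']
print '-'*75

Name of the collection: 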
test
---------------------------------------------------------------------------
Full descriptions: 
Collection(Database(MongoClient('localhost', 27017), u'test_database'), u'test.acknowledged')
---------------------------------------------------------------------------
[ObjectId('56260c93c47fab036c5273db'), ObjectId('56260c93c47fab036c5273dc'), ObjectId('56260c93c47fab036c5273dd'), ObjectId('56260c93c47fab036c5273de'), ObjectId('56260c93c47fab036c5273df')]
---------------------------------------------------------------------------
{u'x': 1, u'_id': ObjectId('56260c93c47fab036c5273db'), u'tags': [u'dog', u'cat']}
---------------------------------------------------------------------------
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dc'), u'tags': [u'cat']}
{u'x': 2, u'_id': ObjectId('56260c93c47fab036c5273dd'), u'tags': [u'mouse', u'cat', u'dog']}
{u'x': 3, u'_id': ObjectId('56260c93c47fab036c5273de'), u'tags': []}
{u'x': 4, u'_id': ObjectId('56260c93c47fab0

Here are some **query operators** in use:

In [96]:
print 'Number of Documents: ', db.test.count()
print '-'*75
print "Number of Documents with field 'x':", db.test.find({'x': {'$exists': True}}).count()
print '-'*75
print 'Number of Documents where x == 2: ', db.test.find({'x': 2}).count()
print '-'*75
print 'Number of Documents with x >= 2: ', db.test.find({'x':{'$gte': 2}}).count()
print '-'*75
# print 'Number of Documents with not x >= 2: ', db.test.find({'x': {'$not': {'$gte': 2}}}).count()

Number of Documents:  5
---------------------------------------------------------------------------
Number of Documents with field 'x': 5
---------------------------------------------------------------------------
Number of Documents where x == 2:  2
---------------------------------------------------------------------------
Number of Documents with x >= 2:  4
---------------------------------------------------------------------------


Queries can also use special query operators, each written with a leading `$` in the query document (as in `$gte` above). These operators include **gt, gte, lt, lte, ne, nin, regex, exists, not, or**, and many more.
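To make the meaning of a few of these operators concrete, here is a toy evaluator in plain Python. It is an illustration under simplifying assumptions (scalar fields only, a handful of operators) and not how MongoDB implements matching; `matches` and `OPS` are names invented for this sketch.

```python
import operator

# Comparison operators handled by this toy evaluator.
OPS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in query."""
    for field, cond in query.items():
        if not isinstance(cond, dict):
            # Plain value means an equality test, e.g. {"x": 2}.
            if field not in doc or doc[field] != cond:
                return False
            continue
        for op, arg in cond.items():
            if op == "$exists":
                if (field in doc) != bool(arg):
                    return False
            elif field not in doc or not OPS[op](doc[field], arg):
                return False
    return True

docs = [{"x": 1}, {"x": 2}, {"x": 2}, {"x": 3}, {"x": 4}]

print(sum(matches(d, {"x": 2}) for d in docs))                  # 2
print(sum(matches(d, {"x": {"$gte": 2}}) for d in docs))        # 4
print(sum(matches(d, {"x": {"$exists": True}}) for d in docs))  # 5
```

The counts for `x == 2` and `x >= 2` agree with the results returned by the server for the five documents in the collection.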

Additionally, we can use **regular expressions**:

In [97]:
# Use a regex to count the documents that have a tag containing "cat"

import re
regex = re.compile(r'cat')
rstats = db.test.find({"tags":regex}).count()
print 'Number of Documents where you will find a "cat": ', rstats

Number of Documents where you will find a "cat":  3
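When a regular expression is applied to an array field such as `tags`, a document matches if at least one element of the array matches. The sketch below reproduces that rule in plain Python with the `re` module on a copy of the sample documents; MongoDB uses its own regex engine, so treat this as an approximation of the behaviour rather than an exact equivalent.

```python
import re

pattern = re.compile(r"cat")

docs = [
    {"x": 1, "tags": ["dog", "cat"]},
    {"x": 2, "tags": ["cat"]},
    {"x": 2, "tags": ["mouse", "cat", "dog"]},
    {"x": 3, "tags": []},
    {"x": 4, "tags": []},
]

# A document counts as a match when at least one tag contains the pattern.
hits = [d for d in docs if any(pattern.search(t) for t in d["tags"])]
print(len(hits))  # 3

# An unanchored pattern also matches longer tags such as "catfish";
# anchoring with ^ and $ restricts the match to the exact tag.
exact = re.compile(r"^cat$")
print(bool(pattern.search("catfish")))  # True
print(bool(exact.search("catfish")))    # False
```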


In [98]:
from pymongo import DESCENDING
# Drop any existing secondary indexes, then index x in descending order
db.test.drop_indexes()
db.test.create_index([("x", DESCENDING)], name="id_x")

'id_x'

In [99]:
print db.test.index_information()
# Iterating over the returned dict yields the index names
for index_name in db.test.index_information():
    print index_name

{u'id_x': {u'ns': u'test_database.test', u'key': [(u'x', -1)], u'v': 1}, u'_id_': {u'ns': u'test_database.test', u'key': [(u'_id', 1)], u'v': 1}}
id_x
_id_
