# SQL vs. NoSQL

What's the difference between **relational** and **non-relational** databases?

Back in 2000 it was enough to smother your website with static text. Brochureware was cool. Not today, though. You have to have a dizzying variety of text, video, audio, images and social media to get someone's attention. But it's hard to add new content to a relational database. Or new features. Or new attributes. Not without disrupting performance or taking your database offline.

With non-relational databases you can store any type of content. Incorporate any kind of data in a single database. Build any feature. Faster. With less money.

### Relational (SQL)

* **Stuck**. Data now includes rich data types – tweets, videos, podcasts, animated gifs – which are hard, if not impossible, to store in a relational database. Development slows to a crawl, and ops is caught playing whack-a-mole.

* **Can’t Scale**. Your audience is global, in many countries, speaking many languages, accessing content on many devices. Scaling a relational database is not trivial. And it isn’t cheap.

* **Expensive**. Large teams tied up for long periods of time make these applications expensive to build and maintain. Proprietary software and hardware, plus separate databases and file systems needed to manage your content, add to the cost.

### Non-Relational (NoSQL)

* **Do the Impossible**. NoSQL can incorporate literally any type of data, while providing all the features needed to build content-rich apps.

* **Scale Big**. Scaling is built into the database. It is automatic and transparent. You can scale as your audience grows, both within a data center and across regions.

* **Cheap**. More productive teams, plus commodity hardware, make your projects cost 10% of what they would with a relational database.


## Resources

Check the **resources** folder for a collection of useful PDFs:

* *Top_5_NoSQL_Considerations.pdf*: 5 reasons to choose a NoSQL database
* *AWS_&_MongoDB.pdf*: MongoDB on Amazon Web Services
* *Mongodb_&_ApacheSpark.pdf*: How to use Apache Spark and MongoDB


# MongoDB

 * Schema Free
 * Document Based
 * Supports Indexing
 * Not Transactional
 * Does not support relations (no JOIN)
 * Supports Autosharding
 * Automatic Replication and Failover
 * Relies on System Memory Manager
 * Has an Aggregation Pipeline
 * Builtin support for MapReduce

For Python, MongoDB support is provided by the ``PyMongo`` library, which can be installed with:


    pip install pymongo
    

## Nexus Architecture

MongoDB’s design philosophy is focused on combining the critical capabilities of relational databases (left side) with the innovations of NoSQL technologies (right side).

<img src="nexus-architecture.png" width="640"/>


## Installing MongoDB


Installing MongoDB is as simple as downloading it from http://www.mongodb.org/downloads.

Create a data directory, then start ``mongod`` from inside the downloaded package, pointing ``--dbpath`` at that directory (without ``--dbpath``, ``mongod`` defaults to ``/data/db``):


    curl -O 'https://fastdl.mongodb.org/osx/mongodb-osx-x86_64-3.0.4.tgz' 
    tar zxvf mongodb-osx-x86_64-3.0.4.tgz 
    cd mongodb-osx-x86_64-3.0.4
    mkdir data
    ./bin/mongod --dbpath=./data



## Using MongoDB


A ``MongoClient`` instance provides a connection to a MongoDB server. Each server can host multiple databases, which can be retrieved with ``client.database_name``; each database can in turn contain multiple ``collections`` holding different documents.

In [1]:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')

In [2]:
db = client.phonebook
print db.collection_names()

[u'blog', u'people']


Once the database is retrieved, collections can be accessed as attributes of the database itself.

In PyMongo a MongoDB document is just a Python dictionary, so inserting a document is as simple as telling PyMongo to insert the dictionary into the collection. Each document can have its own structure and contain different data, and you are not required to declare any structure for the collection. Collections that do not exist yet are created automatically when the first document is inserted.

In [3]:
data = {'name': 'Alex', 'phone': '+39123456789'}
db.people.insert(data)



ObjectId('581c77f0e14afe6d4157eaeb')

In [4]:
print db.collection_names()

[u'blog', u'people']


Each inserted document receives an **ObjectId**, which is a unique identifier for the document. The ObjectId is built from data such as the current **timestamp**, a **server identifier**, the **process id** and other data, so that it is unique across multiple servers.
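
The timestamp part can be read back from an ObjectId without talking to the server. The following is a minimal sketch, assuming the standard 12-byte ObjectId layout in which the first 4 bytes are a big-endian count of seconds since the Unix epoch. The helper name ``objectid_timestamp`` is made up for illustration and is not part of PyMongo.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) hold the seconds since the
    # Unix epoch, stored big-endian.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# ObjectId returned by the insert above
print(objectid_timestamp('581c77f0e14afe6d4157eaeb'))
```

With PyMongo installed, the ``generation_time`` attribute of a ``bson.ObjectId`` exposes the same information.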

Being designed to work in a distributed and multinode environment, MongoDB handles "write safety" by the number of servers that are expected to have saved the document before considering the insert command "completed".

This is handled by the **w** option, which indicates the number of servers that must have saved the document before the insert command returns. Setting it to **0** makes MongoDB work in *fire and forget* mode, which is useful when inserting a lot of documents quickly. As most drivers generate the ObjectId on the client that performs the insertion, you will receive an ObjectId even before the document has been written.

In [5]:
db.people.insert({'name': 'Pippo', 'phone': '+39123456788', 'other_phone': '+3933332323'}, w=0)



ObjectId('581c77f2e14afe6d4157eaec')

In [6]:
try:
    db.people.insert({'name': 'Jonny', 'phone': '+39123456789'}, w=2)
except Exception as e:
    print e

cannot use 'w' > 1 when a host is not replicated




Fetching inserted documents back can be done with the ``find`` and ``find_one`` methods of a collection. Both methods accept a query expression that filters the returned documents. Omitting it retrieves all the documents (or, in the case of ``find_one``, the first document).

In [7]:
res = db.people.find_one({'name': 'Alex'})
print res
print type(res)

{u'phone': u'+39123456789', u'_id': ObjectId('581c6968e14afe70495514a7'), u'name': u'Alex'}
<type 'dict'>


Filters in MongoDB are themselves described by documents, so in PyMongo they are dictionaries too.
A filter can be specified in the form ``{'field': value}``. 
By default filtering is performed by *equality* comparison; this can be changed by specifying a query operator in place of the value.

Query operators by convention start with a ``$`` sign and can be specified as ``{'field': {'$operator': value}}``.
The full list of query operators is available at http://docs.mongodb.org/manual/reference/operator/query/

For example, to find a person whose ObjectId is greater than ``5818c8549718d42038af68ad`` we can use:

In [8]:
from bson import ObjectId
db.people.find_one({'_id': {'$gt':  ObjectId('5818c8549718d42038af68ad')}})

{u'_id': ObjectId('581c6968e14afe70495514a7'),
 u'name': u'Alex',
 u'phone': u'+39123456789'}
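
To make the two filter forms concrete, here is a pure-Python sketch that mimics how a filter is evaluated against a single document. It is only a toy model of the semantics described above, not how MongoDB or PyMongo implement matching. The ``matches`` function, the ``age`` field and the sample data are assumptions made for illustration, and only a handful of operators are covered.

```python
import operator

# Comparison operators supported by this sketch (a small subset of MongoDB's).
OPERATORS = {
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$ne': operator.ne,
    '$in': lambda value, choices: value in choices,
}


def matches(document, query):
    """Return True if `document` satisfies every condition in `query`."""
    for field, condition in query.items():
        value = document.get(field)
        is_operator_form = (isinstance(condition, dict) and condition
                            and all(key.startswith('$') for key in condition))
        if is_operator_form:
            # {'field': {'$op': operand}}: every operator must hold.
            for op, operand in condition.items():
                try:
                    if not OPERATORS[op](value, operand):
                        return False
                except TypeError:
                    # Values that cannot be compared (e.g. a missing field).
                    return False
        elif value != condition:
            # {'field': value}: plain equality comparison.
            return False
    return True


# Hypothetical sample documents
people = [
    {'name': 'Alex', 'age': 30},
    {'name': 'Pippo', 'age': 25},
    {'name': 'Jonny', 'age': 41},
]

print([p['name'] for p in people if matches(p, {'name': 'Alex'})])
print([p['name'] for p in people if matches(p, {'age': {'$gt': 28}})])
print([p['name'] for p in people if matches(p, {'age': {'$in': [25, 41]}})])
```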

Updating Documents
---------------------

Updating documents in MongoDB can be performed with the ``update`` method of the collection. Updating is one of the major sources of issues for new users: by default it does not change values inside the document as an SQL ``UPDATE`` would, but instead replaces the whole document with a new one.

Also note that the update operation is not applied to every document matched by the query; by default only the first matching document is updated. To apply it to multiple documents you must explicitly pass the ``multi=True`` option.

What you usually want is the ``$set`` operator, which changes fields of the existing document instead of replacing it with a new one.

In [9]:
doc = db.people.find_one({'name': 'Alex'})
print '\nBefore Updated:', doc

db.people.update({'name': 'Alex'}, {'name': 'John Doe'})
doc = db.people.find_one({'name': 'John Doe'})
print '\nAfter Update:', doc

# Go back to previous state
db.people.update({'name': 'John Doe'}, {'$set': {'phone': '+39123456789'}})
print '\nAfter $set phone:', db.people.find_one({'name': 'John Doe'})
db.people.update({'name': 'John Doe'}, {'$set': {'name': 'Alex'}})
print '\nAfter $set name:', db.people.find_one({'name': 'Alex'})



Before Updated: {u'phone': u'+39123456789', u'_id': ObjectId('581c6968e14afe70495514a7'), u'name': u'Alex'}

After Update: {u'_id': ObjectId('581c6968e14afe70495514a7'), u'name': u'John Doe'}

After $set phone: {u'phone': u'+39123456789', u'_id': ObjectId('581c6968e14afe70495514a7'), u'name': u'John Doe'}

After $set name: {u'phone': u'+39123456789', u'_id': ObjectId('581c6968e14afe70495514a7'), u'name': u'Alex'}
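
The difference between the two update forms can be modelled on plain dictionaries. The sketch below is an illustration of the semantics shown in the output above, not PyMongo code: ``apply_update`` is a hypothetical helper, it supports only the ``$set`` operator, and the ``_id`` value is a placeholder.

```python
import copy


def apply_update(document, update):
    """Return the document as it would look after `update` is applied."""
    if any(key.startswith('$') for key in update):
        # Operator form: keep the existing document and change some fields.
        result = copy.deepcopy(document)
        for field, value in update.get('$set', {}).items():
            result[field] = value
        return result
    # Replacement form: only _id survives, everything else is swapped out.
    replacement = copy.deepcopy(update)
    replacement['_id'] = document['_id']
    return replacement


alex = {'_id': 1, 'name': 'Alex', 'phone': '+39123456789'}

# Replacement: the phone field is lost
print(apply_update(alex, {'name': 'John Doe'}))
# $set: the phone field is kept
print(apply_update(alex, {'$set': {'name': 'John Doe'}}))
```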




SubDocuments
--------------

The real power of MongoDB is unleashed when you use subdocuments.

As each MongoDB document is a JSON object (actually BSON, but that doesn't change much for the user), it can contain any data that is valid in JSON, including other documents and arrays. In many use cases this replaces "relations" between collections, and it is often considerably more efficient because all the data is returned by a single query instead of requiring multiple queries to retrieve related data.

As MongoDB fully supports subdocuments, it is also possible to query on subdocument fields, and even on arrays, using the ``dot notation``.

For example, if you want to store a blog post in MongoDB you might store everything, including author data and tags, inside the blog post itself:

In [10]:
db.blog.insert({'title': 'MongoDB intro!',
                'author': {'name': 'Alex',
                           'surname': 'Comu',
                           'nickname': 'alexcomu'},
                'tags': ['mongodb', 'web', 'new-hair-cut']})



ObjectId('581c77f6e14afe6d4157eaee')

In [11]:
db.blog.find_one({'title': 'MongoDB intro!'})

{u'_id': ObjectId('581c6f4fe14afe70495514ac'),
 u'author': {u'name': u'Alex', u'nickname': u'alexcomu', u'surname': u'Comu'},
 u'tags': [u'mongodb', u'web', u'new-hair-cut'],
 u'title': u'MongoDB intro!'}

In [12]:
db.blog.find_one({'tags': 'mongodb'})

{u'_id': ObjectId('581c6f4fe14afe70495514ac'),
 u'author': {u'name': u'Alex', u'nickname': u'alexcomu', u'surname': u'Comu'},
 u'tags': [u'mongodb', u'web', u'new-hair-cut'],
 u'title': u'MongoDB intro!'}

In [13]:
db.blog.find_one({'author.name': 'Alex'})

{u'_id': ObjectId('581c6f4fe14afe70495514ac'),
 u'author': {u'name': u'Alex', u'nickname': u'alexcomu', u'surname': u'Comu'},
 u'tags': [u'mongodb', u'web', u'new-hair-cut'],
 u'title': u'MongoDB intro!'}
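
The three queries above rely on two rules: a dotted path walks into subdocuments, and an equality filter on an array field matches when any element equals the value. A rough pure-Python sketch of these rules follows. The helper names ``resolve`` and ``field_equals`` are made up for this example, and numeric array positions such as ``tags.0`` are deliberately left out.

```python
def resolve(document, path):
    """Return every value reachable from `document` by following a dotted path."""
    values = [document]
    for key in path.split('.'):
        next_values = []
        for value in values:
            # Arrays of subdocuments are searched element by element.
            candidates = value if isinstance(value, list) else [value]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_values.append(candidate[key])
        values = next_values
    return values


def field_equals(document, path, expected):
    """True if the field equals `expected` or is an array containing it."""
    for value in resolve(document, path):
        if value == expected:
            return True
        if isinstance(value, list) and expected in value:
            return True
    return False


post = {'title': 'MongoDB intro!',
        'author': {'name': 'Alex', 'surname': 'Comu', 'nickname': 'alexcomu'},
        'tags': ['mongodb', 'web', 'new-hair-cut']}

print(field_equals(post, 'author.name', 'Alex'))  # subdocument field
print(field_equals(post, 'tags', 'mongodb'))      # array membership
```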

In [14]:
# Create some random posts
TAGS = ['mongodb', 'web', 'scaling', 'cooking']

import random
for postnum in range(1, 5):
    db.blog.insert({'title': 'Post %s' % postnum,
                    'author': {'name': 'Alex',
                               'surname': 'Comu',
                               'nickname': 'alexcomu'},
                    'tags': random.sample(TAGS, 2)})



In [15]:
for post in db.blog.find({'tags': {'$in': ['scaling', 'cooking']}}):
    print post['title'], '->', ', '.join(post['tags'])


Post 1 -> scaling, cooking
Post 4 -> cooking, web
Post 1 -> scaling, cooking
Post 2 -> web, scaling
Post 3 -> web, scaling
Post 4 -> cooking, web


Indexing
----------

Indexing is one of the most important parts of working with MongoDB.

MongoDB has great support for indexing, and it supports single key, multi key, compound and hashed indexes. Each index type has its specific use case and can be used both for querying and sorting.

 * Single Key -> Those are plain indexes on a field
 * Multi Key -> Those are indexes created on an array field
 * Compound -> Those are indexes that cover more than one field.
 * Hashed -> Those are indexes optimized for equality comparison, they actually store the hash of the indexed value and are usually used for sharding.
 
Compound indexes can also be used when only part of the query filter is present in the index. There is also a special case called *covering indexes*, which happens when all the fields you are asking for are available in the index. In that case MongoDB won't even access the collection and will serve the data directly from the index. An index cannot be both a multi key index and a covering index.

Indexes are also ordered, so they can be created *ASCENDING* or *DESCENDING*.
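
As a mental model only (MongoDB's real indexes are B-trees, which this does not reproduce), an ascending index can be pictured as a sorted list of ``(key, document position)`` pairs searched by bisection instead of scanning every document. The sketch below also shows why an index on an array field is a multi key index: it stores one entry per array element. The names ``build_index`` and ``lookup`` and the sample posts are assumptions made for illustration.

```python
import bisect


def build_index(documents, field):
    """Return a sorted list of (key, position) pairs for `field`."""
    entries = []
    for position, doc in enumerate(documents):
        value = doc[field]
        # A multi key index stores one entry per array element.
        keys = value if isinstance(value, list) else [value]
        for key in keys:
            entries.append((key, position))
    entries.sort()
    return entries


def lookup(index, key):
    """Return positions of documents whose indexed field equals `key`."""
    # The 1-tuple (key,) sorts before every (key, position) pair, so
    # bisect_left lands on the first entry for `key`, if there is one.
    start = bisect.bisect_left(index, (key,))
    positions = []
    for entry_key, position in index[start:]:
        if entry_key != key:
            break
        positions.append(position)
    return positions


# Hypothetical sample posts
posts = [
    {'title': 'Post 1', 'tags': ['scaling', 'cooking']},
    {'title': 'Post 2', 'tags': ['web', 'scaling']},
    {'title': 'Post 3', 'tags': ['mongodb', 'web']},
]

tags_index = build_index(posts, 'tags')
print([posts[i]['title'] for i in lookup(tags_index, 'scaling')])
```

In the same picture, a covered query is one that can be answered from the index entries alone, without reading ``posts`` at all.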

Indexes can be created with the ``ensure_index`` method (deprecated since PyMongo 3.0 in favor of ``create_index``):

In [16]:
db.blog.ensure_index([('tags', 1)])



u'tags_1'

Checking which index MongoDB uses to perform a query can be done with the ``explain`` method; forcing a specific index for a query can be done with the ``hint`` method.

As MongoDB uses a statistical optimizer, using ``hint`` in queries can sometimes provide a performance boost because it avoids the optimizer's cost of looking up the "best option".

In [17]:
db.blog.find({'tags': 'mongodb'}).explain()['queryPlanner']['winningPlan']

{u'inputStage': {u'direction': u'forward',
  u'indexBounds': {u'tags': [u'["mongodb", "mongodb"]']},
  u'indexName': u'tags_1',
  u'indexVersion': 1,
  u'isMultiKey': True,
  u'isPartial': False,
  u'isSparse': False,
  u'isUnique': False,
  u'keyPattern': {u'tags': 1},
  u'stage': u'IXSCAN'},
 u'stage': u'FETCH'}

In [18]:
db.blog.find({'tags': 'mongodb'}).hint([('_id', 1)]).explain()['queryPlanner']['winningPlan']
# hint tells MongoDB which index to use

{u'filter': {u'tags': {u'$eq': u'mongodb'}},
 u'inputStage': {u'direction': u'forward',
  u'indexBounds': {u'_id': [u'[MinKey, MaxKey]']},
  u'indexName': u'_id_',
  u'indexVersion': 1,
  u'isMultiKey': False,
  u'isPartial': False,
  u'isSparse': False,
  u'isUnique': True,
  u'keyPattern': {u'_id': 1},
  u'stage': u'IXSCAN'},
 u'stage': u'FETCH'}

In [None]:
db.blog.find({'title': 'Post 1'}).explain()['queryPlanner']['winningPlan']

In [None]:
db.blog.ensure_index([('author.name', 1), ('title', 1)])
db.blog.find({'author.name': 'Alex'}, {'title': True, '_id': False}).explain()['queryPlanner']['winningPlan']

# Import Atleti DB

Here you can find a simple MongoDB connector that fills the ``olimpiadi`` database with a collection named ``atleti`` containing all the athletes from the CSV source file.

To run the script, execute from a terminal:
    
    python import_athletes.py athletes_sochi.csv

In [None]:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')

db = client.olimpiadi

# DATA FORMAT
# age,birthdate,gender,height,name,weight,
# gold_medals,silver_medals,bronze_medals,
# total_medals,sport,country

with open('athletes_sochi.csv', 'r') as csv_file:
    for line in csv_file:
        # Strip the line terminator (\n or \r\n) before splitting the columns
        splitted = line.rstrip('\r\n').split(',')
        # Skip the header row
        if splitted[0] != 'age':
            data = {'age': splitted[0],
                    'birthdate': splitted[1],
                    'gender': splitted[2],
                    'height': splitted[3],
                    'name': splitted[4],
                    'weight': splitted[5],
                    'gold_medals': int(splitted[6]),
                    'silver_medals': int(splitted[7]),
                    'bronze_medals': int(splitted[8]),
                    'total_medals': int(splitted[9]),
                    'sport': splitted[10],
                    'country': splitted[11]}
            db.atleti.insert(data)


## Mongo IMPORT / EXPORT

Instead of using the Python script we can also use **mongoimport**, provided by MongoDB itself:

    mongoimport --db olimpiadi --collection atleti --file atleti.json

To export the entire collection we can use **mongoexport**:

    mongoexport --db olimpiadi --collection atleti --out atleti.json
    
Short version of *mongoexport*:
    
    mongoexport -d olimpiadi -c atleti -o atleti.json
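
The ``atleti.json`` file given to ``mongoimport`` has to come from somewhere. One possible way to produce it from the CSV is sketched below, assuming the CSV starts with a header row containing the column names listed in the import script, and relying on the fact that ``mongoimport`` reads one JSON document per line by default. The function name ``csv_to_json_lines`` and the sample row are hypothetical.

```python
import csv
import io
import json

MEDAL_FIELDS = ('gold_medals', 'silver_medals', 'bronze_medals', 'total_medals')

# Column names taken from the DATA FORMAT comment of the import script.
HEADER = ('age,birthdate,gender,height,name,weight,gold_medals,silver_medals,'
          'bronze_medals,total_medals,sport,country')


def csv_to_json_lines(csv_file):
    """Yield one JSON document per CSV row, using the header row for field names."""
    for row in csv.DictReader(csv_file):
        for field in MEDAL_FIELDS:
            # Medal counts are stored as integers, as in the import script.
            row[field] = int(row[field])
        yield json.dumps(row, sort_keys=True)


# Made-up sample row standing in for athletes_sochi.csv
sample = io.StringIO(HEADER + '\n24,1/1/1990,Female,1.70,Jane Doe,60,1,0,0,1,Curling,Italy\n')
for line in csv_to_json_lines(sample):
    print(line)
```

To create the real file, open ``athletes_sochi.csv``, write each yielded line followed by a newline to ``atleti.json``, then run the ``mongoimport`` command shown above.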

Aggregation Pipeline
----------------------

The aggregation pipeline provided by the aggregation framework is a powerful MongoDB feature that makes it possible to perform complex data analysis by passing the documents through a pipeline of operations.

MongoDB was created with the core philosophy that you store your documents according to the way you are going to read them. So to properly design your schema you need to know how you are going to use the documents. While this approach provides great performance benefits and is more practical in the case of web applications, it might not always be feasible.

In case you need to perform some kind of analysis your documents are not optimized for, you can rely on the aggregation framework to create a pipeline that transforms them into a shape more practical for the kind of analysis you need.

### How it works

The aggregation pipeline is a list of operations that are executed one after the other on the documents of a collection. The first operation is performed on all the documents, while each subsequent operation is performed on the result of the previous step.

If steps are able to take advantage of **indexes** they will; that is the case for a **match** or **sort** operator if it appears at the beginning of the pipeline. All operators start with a ``$`` sign.

### Stage Operators


* **project**: Reshapes each document in the stream, such as by adding new fields or removing existing fields. For each input document, outputs one document.
* **match**: Filters the document stream to allow only matching documents to pass unmodified into the next pipeline stage. **match** uses standard MongoDB queries. For each input document, outputs either one document (a match) or zero documents (no match).
* **limit**: Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).
* **skip**: Skips the first n documents where n is the specified skip number and passes the remaining documents unmodified to the pipeline. For each input document, outputs either zero documents (for the first n documents) or one document (if after the first n documents).
* **unwind**: Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.
* **group**: Groups input documents by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input documents and outputs one document per each distinct group. The output documents only contain the identifier field and, if specified, accumulated fields.
* **sort**: Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.
* **geoNear**: Returns an ordered stream of documents based on the proximity to a geospatial point. Incorporates the functionality of **match**, **sort**, and **limit** for geospatial data. The output documents include an additional distance field and can include a location identifier field.
* **out**: Writes the resulting documents of the aggregation pipeline to a collection. To use the **out** stage, it must be the last stage in the pipeline.
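
To see how the stages chain together and how many documents each one emits, here is a toy in-memory interpreter for a few of them. It is a sketch of the behaviour listed above under strong simplifications, not the aggregation framework itself: ``$match`` only does equality, ``$group`` only supports ``$sum``, ``$sort`` takes a single key, and all function names and sample data are made up for the example.

```python
from collections import defaultdict


def run_pipeline(documents, pipeline):
    """Apply each stage to the output of the previous one."""
    for stage in pipeline:
        (name, spec), = stage.items()
        documents = STAGES[name](list(documents), spec)
    return list(documents)


def match(documents, spec):
    # Equality-only filter: zero or one output document per input document.
    return [d for d in documents if all(d.get(k) == v for k, v in spec.items())]


def unwind(documents, spec):
    # One output document per array element; an empty array yields nothing.
    field = spec.lstrip('$')
    return [dict(d, **{field: item}) for d in documents for item in d[field]]


def group(documents, spec):
    # Supports only {'_id': '$field', 'name': {'$sum': 1 or '$field'}}.
    key_field = spec['_id'].lstrip('$')
    totals = defaultdict(lambda: defaultdict(int))
    for d in documents:
        sums = totals[d[key_field]]
        for out_field, accumulator in spec.items():
            if out_field != '_id':
                operand = accumulator['$sum']
                if isinstance(operand, str):
                    operand = d[operand.lstrip('$')]
                sums[out_field] += operand
    # One output document per distinct group key.
    return [dict({'_id': key}, **sums) for key, sums in totals.items()]


def sort(documents, spec):
    # Single-key sort: 1 is ascending, -1 is descending.
    (field, direction), = spec.items()
    return sorted(documents, key=lambda d: d[field], reverse=direction < 0)


def limit(documents, spec):
    return documents[:spec]


STAGES = {'$match': match, '$unwind': unwind, '$group': group,
          '$sort': sort, '$limit': limit}

# Hypothetical sample posts
posts = [
    {'title': 'Post 1', 'tags': ['scaling', 'cooking']},
    {'title': 'Post 2', 'tags': ['web', 'scaling']},
    {'title': 'Post 3', 'tags': []},
]

# Most used tag: unwind the arrays, count per tag, keep the top one.
print(run_pipeline(posts, [
    {'$unwind': '$tags'},
    {'$group': {'_id': '$tags', 'posts': {'$sum': 1}}},
    {'$sort': {'posts': -1}},
    {'$limit': 1},
]))
```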

#### Expression Operators

Each stage operator can work with one or more **expression operators**, which make it possible to perform actions during that stage. For a list of expression operators see http://docs.mongodb.org/manual/reference/operator/aggregation/#expression-operators




## Pipeline Examples

In [19]:
db = client.olimpiadi

# How many people from Italy and France?
print len(list(db.atleti.aggregate([
                {'$match': {'country': {'$in':['Italy', 'France']}}}
            ])))

print "\n-----\n"

# Count them using only the pipeline 
print db.atleti.aggregate([
        {'$match': {'country': {'$in':['Italy', 'France']}}},
        {'$group': {'_id': 'count', 'count': {'$sum': 1}}}
    ]).next()

print "\n-----\n"

# Countries whose athletes total more than 20 medals, sorted by medal count
print list(db.atleti.aggregate([
        {'$project': {'country': 1, 'total_medals': 1, '_id': 0}},
        {'$group': {'_id': '$country', 'medals': {'$sum': '$total_medals'}}},
        {'$match': {'medals': {'$gt': 20}}},
        {'$sort': {'medals': 1}}
    ]))


print "\n-----\n"

# Same query as before, this time iterating over the cursor
countries_with_more_than_20_medals = db.atleti.aggregate([
        {'$project': {'country': 1, 'total_medals': 1, '_id': 0}},
        {'$group': {'_id': '$country', 'medals': {'$sum': '$total_medals'}}},
        {'$match': {'medals': {'$gt': 20}}},
        {'$sort': {'medals': 1}}
    ])
for entry in countries_with_more_than_20_medals:
    print entry



226

-----

{u'count': 226, u'_id': u'count'}

-----

[{u'_id': u'Germany', u'medals': 26}, {u'_id': u'United States', u'medals': 28}, {u'_id': u'Canada', u'medals': 28}, {u'_id': u'Russian Fed.', u'medals': 38}]

-----

{u'_id': u'Germany', u'medals': 26}
{u'_id': u'United States', u'medals': 28}
{u'_id': u'Canada', u'medals': 28}
{u'_id': u'Russian Fed.', u'medals': 38}


MapReduce
----------

MongoDB is powered by the V8 JavaScript engine, which means that each mongod node is able to run JavaScript code.
With a high enough number of mongod nodes, you end up with a powerful execution environment for distributed code that also copes with the major problem of data locality.

For this reason MongoDB exposes a **mapreduce** function which can be leveraged in sharded environments to run map reduce jobs.
Note that the Aggregation Pipeline is usually faster than the mapReduce feature and scales with the number of nodes just as mapReduce does, so you should rely on MapReduce only when the algorithm cannot be efficiently expressed with the Aggregation Pipeline.

In [21]:
db.atleti.map_reduce(map='''function(){
            var country = this.country;
            var medals = parseInt(this.total_medals);
            emit(country, medals);
        }''', reduce='''function(key, values){
            return Array.sum(values)
        }''',out='medalsfrequency')

for entry in db.medalsfrequency.find().sort('value'):
    if entry['value'] > 0:
        print entry

{u'_id': u'Croatia', u'value': 1.0}
{u'_id': u'Kazakhstan', u'value': 1.0}
{u'_id': u'Slovakia', u'value': 1.0}
{u'_id': u'Ukraine', u'value': 1.0}
{u'_id': u'Great Britain', u'value': 2.0}
{u'_id': u'Australia', u'value': 3.0}
{u'_id': u'Poland', u'value': 4.0}
{u'_id': u'Finland', u'value': 5.0}
{u'_id': u'Belarus', u'value': 6.0}
{u'_id': u'China', u'value': 6.0}
{u'_id': u'Czech Republic', u'value': 6.0}
{u'_id': u'Slovenia', u'value': 6.0}
{u'_id': u'Latvia', u'value': 7.0}
{u'_id': u'Korea', u'value': 8.0}
{u'_id': u'Italy', u'value': 9.0}
{u'_id': u'Japan', u'value': 9.0}
{u'_id': u'Switzerland', u'value': 9.0}
{u'_id': u'France', u'value': 11.0}
{u'_id': u'Austria', u'value': 13.0}
{u'_id': u'Sweden', u'value': 15.0}
{u'_id': u'Norway', u'value': 18.0}
{u'_id': u'Netherlands', u'value': 20.0}
{u'_id': u'Germany', u'value': 26.0}
{u'_id': u'Canada', u'value': 28.0}
{u'_id': u'United States', u'value': 28.0}
{u'_id': u'Russian Fed.', u'value': 38.0}
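
What the ``map`` and ``reduce`` JavaScript functions above do can be traced in a single Python process. The sketch below mirrors the medal count job with three made-up documents. It only illustrates the map, shuffle and reduce phases and says nothing about how MongoDB distributes the work; the function names are assumptions for this example.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn):
    """Run a single-process map/shuffle/reduce over `documents`."""
    grouped = defaultdict(list)
    for doc in documents:
        # Map: each document emits zero or more (key, value) pairs.
        for key, value in map_fn(doc):
            # Shuffle: collect the values emitted under the same key.
            grouped[key].append(value)
    # Reduce: collapse each list of values into a single result.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def emit_medals(doc):
    yield doc['country'], doc['total_medals']


def sum_values(key, values):
    return sum(values)


# Hypothetical sample documents, shaped like those in the atleti collection
athletes = [
    {'country': 'Italy', 'total_medals': 1},
    {'country': 'France', 'total_medals': 2},
    {'country': 'Italy', 'total_medals': 0},
]

print(map_reduce(athletes, emit_medals, sum_values))
```

In MongoDB the reduce function may be invoked more than once for the same key, on partial results, so it has to return a value of the same shape as the values it receives. A plain sum satisfies that requirement.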


In [22]:
db.atleti.map_reduce(map='''function() {
                    var country = this.country;
                    var medals = this.total_medals;
                    emit(country,medals);
                    }''',
                    reduce='''function(country,medals){
                    return Array.sum(medals);
                    }''',
                    out="medalsfrequency")

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'olimpiadi'), u'medalsfrequency')

## Exercise 1

Using the dataset, compute the number of athletes for each age.

## Exercise 2

Starting from the previous exercise, return the age with the highest number of occurrences.

## Exercise 3

Compute the number of men and women for each age.


In [40]:
# Exercise 2 (Exercise 1 is solved further below)
db.atleti.map_reduce(map='''function() {
                    var age = this.age;
                    emit(age,1);
                    }''',
                    reduce='''function(age,number){
                    return Array.sum(number);
                    }''',
                    out="agefrequency")

max(db.agefrequency.find(), key=lambda x:x['value'])
#for x in db.agefrequency.find():
#   print x['value']


{u'_id': u'24', u'value': 234.0}

In [41]:
# Exercise 2, revisited
db.atleti.map_reduce(map='''function() {
                    var age = this.age;
                    emit(age,1);
                    }''',
                    reduce='''function(age,number){
                    return Array.sum(number);
                    }''',
                    out="agefrequency")
# Sort by count in descending order and print only the top entry
for entry in db.agefrequency.find().sort('value', -1).limit(1):
    print entry
      


{u'_id': u'24', u'value': 234.0}


In [42]:
# Exercise 1
db.atleti.map_reduce(map='''function() {
                    var age = this.age;
                    emit(age,1);
                    }''',
                    reduce='''function(age,number){
                    return Array.sum(number);
                    }''',
                    out="agefrequency")

# The key field is called _id, so that is the field to sort on
for entry in db.agefrequency.find().sort('_id'):
    print entry



{u'_id': u'15', u'value': 9.0}
{u'_id': u'16', u'value': 18.0}
{u'_id': u'17', u'value': 39.0}
{u'_id': u'18', u'value': 85.0}
{u'_id': u'19', u'value': 100.0}
{u'_id': u'20', u'value': 127.0}
{u'_id': u'21', u'value': 181.0}
{u'_id': u'22', u'value': 220.0}
{u'_id': u'23', u'value': 229.0}
{u'_id': u'24', u'value': 234.0}
{u'_id': u'25', u'value': 211.0}
{u'_id': u'26', u'value': 225.0}
{u'_id': u'27', u'value': 207.0}
{u'_id': u'28', u'value': 199.0}
{u'_id': u'29', u'value': 165.0}
{u'_id': u'30', u'value': 135.0}
{u'_id': u'31', u'value': 110.0}
{u'_id': u'32', u'value': 83.0}
{u'_id': u'33', u'value': 73.0}
{u'_id': u'34', u'value': 46.0}
{u'_id': u'35', u'value': 52.0}
{u'_id': u'36', u'value': 32.0}
{u'_id': u'37', u'value': 17.0}
{u'_id': u'38', u'value': 13.0}
{u'_id': u'39', u'value': 13.0}
{u'_id': u'40', u'value': 8.0}
{u'_id': u'41', u'value': 10.0}
{u'_id': u'42', u'value': 5.0}
{u'_id': u'43', u'value': 3.0}
{u'_id': u'44', u'value': 4.0}
{u'_id': u'45', u'value': 1.0}
{

In [43]:
# Exercise 1

db.atleti.map_reduce(map='''function(){
            emit(this.age, 1);
        }''', reduce='''function(key, values){
            return Array.sum(values)
        }''',out='agefrequency')

for entry in db.agefrequency.find().sort('_id'):
    print entry

{u'_id': u'15', u'value': 9.0}
{u'_id': u'16', u'value': 18.0}
{u'_id': u'17', u'value': 39.0}
{u'_id': u'18', u'value': 85.0}
{u'_id': u'19', u'value': 100.0}
{u'_id': u'20', u'value': 127.0}
{u'_id': u'21', u'value': 181.0}
{u'_id': u'22', u'value': 220.0}
{u'_id': u'23', u'value': 229.0}
{u'_id': u'24', u'value': 234.0}
{u'_id': u'25', u'value': 211.0}
{u'_id': u'26', u'value': 225.0}
{u'_id': u'27', u'value': 207.0}
{u'_id': u'28', u'value': 199.0}
{u'_id': u'29', u'value': 165.0}
{u'_id': u'30', u'value': 135.0}
{u'_id': u'31', u'value': 110.0}
{u'_id': u'32', u'value': 83.0}
{u'_id': u'33', u'value': 73.0}
{u'_id': u'34', u'value': 46.0}
{u'_id': u'35', u'value': 52.0}
{u'_id': u'36', u'value': 32.0}
{u'_id': u'37', u'value': 17.0}
{u'_id': u'38', u'value': 13.0}
{u'_id': u'39', u'value': 13.0}
{u'_id': u'40', u'value': 8.0}
{u'_id': u'41', u'value': 10.0}
{u'_id': u'42', u'value': 5.0}
{u'_id': u'43', u'value': 3.0}
{u'_id': u'44', u'value': 4.0}
{u'_id': u'45', u'value': 1.0}
{

In [44]:
# Exercise 2

db.atleti.map_reduce(map='''function(){
            emit(this.age, 1);
        }''', reduce='''function(key, values){
            return Array.sum(values)
        }''',out='agefrequency')

max(db.agefrequency.find(), key=lambda x:x['value'])


{u'_id': u'24', u'value': 234.0}

In [45]:
# Exercise 3

db.atleti.map_reduce(map='''function(){
            emit(this.gender + " " + this.age, 1);
        }''', reduce='''function(key, values){
            return Array.sum(values)
        }''',out='genderfrequency')

for entry in db.genderfrequency.find().sort('_id'):
    if entry['value'] > 10:
        print entry

{u'_id': u'Female 16', u'value': 14.0}
{u'_id': u'Female 17', u'value': 23.0}
{u'_id': u'Female 18', u'value': 45.0}
{u'_id': u'Female 19', u'value': 48.0}
{u'_id': u'Female 20', u'value': 53.0}
{u'_id': u'Female 21', u'value': 92.0}
{u'_id': u'Female 22', u'value': 91.0}
{u'_id': u'Female 23', u'value': 102.0}
{u'_id': u'Female 24', u'value': 108.0}
{u'_id': u'Female 25', u'value': 83.0}
{u'_id': u'Female 26', u'value': 88.0}
{u'_id': u'Female 27', u'value': 80.0}
{u'_id': u'Female 28', u'value': 74.0}
{u'_id': u'Female 29', u'value': 55.0}
{u'_id': u'Female 30', u'value': 44.0}
{u'_id': u'Female 31', u'value': 38.0}
{u'_id': u'Female 32', u'value': 25.0}
{u'_id': u'Female 33', u'value': 22.0}
{u'_id': u'Female 34', u'value': 13.0}
{u'_id': u'Female 35', u'value': 19.0}
{u'_id': u'Male 17', u'value': 16.0}
{u'_id': u'Male 18', u'value': 40.0}
{u'_id': u'Male 19', u'value': 52.0}
{u'_id': u'Male 20', u'value': 74.0}
{u'_id': u'Male 21', u'value': 89.0}
{u'_id': u'Male 22', u'value': 12

Scaling - Sharding
====================

Sharding, or horizontal scaling, divides the data set and distributes the data over multiple servers, or shards. Each shard is an independent database, and collectively, the shards make up a single logical database.

**Chunk**
The whole data set is divided into chunks, which are then distributed as evenly as possible across all the nodes.

**Shard Key**
The shard key is the *document* property that determines the chunks: the range of possible shard key values is divided into chunks, and each chunk is assigned to a node. Documents with nearby shard key values end up in the same chunk, and therefore on the same node.

**Shard**
Each MongoDB node or replica set that contains part of the sharded data.

**Router**
The router is the interface to the cluster: every query and operation is performed against the router. The router is then in charge of forwarding the operation to one or more shards and gathering the results.

**Config Server**
The config servers keep track of chunk distribution: they know which shard contains which chunk and which values are kept inside each chunk. Whenever the router has to perform an operation or split chunks that have become too big, it reads and writes the chunk distribution from the config servers.
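
The relationship between shard key, chunks and shards can be pictured with a small range lookup. This is a toy model of range-based routing only: the chunk boundaries, the shard names and the ``route`` function are invented for illustration, and in a real cluster this mapping lives in the config servers and is consulted by the router.

```python
import bisect

# Hypothetical chunk layout: each chunk covers the shard key range from its
# lower bound (inclusive) up to the next chunk's lower bound (exclusive).
CHUNK_LOWER_BOUNDS = ['', 'g', 'n', 't']
CHUNK_TO_SHARD = ['shard-a', 'shard-b', 'shard-a', 'shard-c']


def route(shard_key_value):
    """Return (chunk number, shard name) responsible for a shard key value."""
    # bisect_right finds the last chunk whose lower bound is <= the value.
    chunk = bisect.bisect_right(CHUNK_LOWER_BOUNDS, shard_key_value) - 1
    return chunk, CHUNK_TO_SHARD[chunk]


for name in ['alex', 'alice', 'pippo', 'zoe']:
    print(name, '->', route(name))
```

Nearby key values such as ``alex`` and ``alice`` fall into the same chunk and therefore land on the same shard, while one shard can own several chunks.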

Setting Up a Sharded Cluster
================================================

To properly set up a sharded environment I suggest you check [THIS REPOSITORY](https://github.com/alexcomu/mongodb_howto). It is a collection of *How To* guides for:

___01 - Install & Play with MongoDB___

___02 - Replica Set tutorial___

___03 - Arbiter configuration___

___04 - Sharding___

___05 - From Replica Set to Sharding - Tutorial___

## Map Reduce and MongoDB

# Exercise

Use the FakeFriends dataset from chapter 05 to create a MapReduce script whose output writes one entry into MongoDB for each row of the dataset. Each entry must contain all the information present in the dataset.