# 3.5 Analysing Data

Exploring data using the MongoDB aggregation framework: initial analysis and additional cleaning.

e.g. Twitter Data Set

Examples of questions: Understand behaviour of users and networks involved.

Aggregation Framework provides a powerful tool for analysing data.
E.g.: Determine which user has produced the most tweets. 
Process:
1. Group tweets by user
2. Count each user's tweets
3. Sort into descending order (of number of tweets)
4. Select user at the top (the one with most tweets)
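
The four steps above can be sketched in plain Python before turning to MongoDB. This is only an illustration of the logic, not how the aggregation framework is implemented; `sample_tweets` and `most_active_user` are hypothetical names, and the sample documents are made up.

```python
from collections import Counter

# Hypothetical sample documents; real tweets carry many more fields.
sample_tweets = [
    {"user": {"screen_name": "alice"}, "text": "first"},
    {"user": {"screen_name": "bob"}, "text": "second"},
    {"user": {"screen_name": "alice"}, "text": "third"},
]

def most_active_user(tweets):
    # Steps 1 and 2: group tweets by screen name and count each group.
    counts = Counter(tweet["user"]["screen_name"] for tweet in tweets)
    # Steps 3 and 4: order by count, descending, and keep the top entry.
    screen_name, count = counts.most_common(1)[0]
    return {"_id": screen_name, "count": count}

top = most_active_user(sample_tweets)
```

The returned dictionary deliberately mirrors the `_id` / `count` shape that a "\$group" stage produces.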

Aggregation queries in MongoDB are issued with the `aggregate` method, which takes a pipeline: a list of stages applied in order.

In [None]:
from pymongo import MongoClient
import pprint

client = MongoClient("mongodb://localhost:27017")
db = client.twitter

def most_tweets():
    # Issue aggregation query
    result = db.tweets.aggregate([
            # Group on the screen_name field of the "user" subdocument.
            # The "$" prefix in "$user.screen_name" marks a field path, not an operator:
            # it means "use the value of user.screen_name".
            { "$group" : { "_id" : "$user.screen_name",
                          # Accumulator operator "$sum": For all docs that have the same value for _id,
                          # Increment count by 1.
                           "count" : { "$sum" : 1 } } },
            # Sort docs passed into this stage (output of "$group")
            # based on the count in descending order
            { "$sort" : { "count" : -1} } ])
    return result

if __name__ == "__main__":
    result = most_tweets()
    # In PyMongo 3+, aggregate() returns a cursor; convert it to a list to print the documents
    pprint.pprint(list(result))

### Aggregation Pipeline

* diagram
* e.g. "\$group" -> "\$sort" 
* The collection is fed into the "\$group" stage, which groups tweets by user and accumulates a count for each group.
* Depending on the operator used, a stage may reshape the data. Tweets in the collection have dozens of fields; passing them through the "\$group" stage produces documents with just 2 fields (`_id` and `count`).
* Use aggregation operators to produce stages.
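
As a sketch of how stages chain together, the pipeline below adds a third stage so that only the top user survives. "\$limit" is a standard MongoDB stage, but it does not appear in the example in these notes, so treat its use here as an addition; `top_user_pipeline` is a hypothetical helper name.

```python
def top_user_pipeline():
    # Each dictionary is one stage; documents flow through the stages in order.
    return [
        # Stage 1: one output document per screen name, with a tweet count.
        {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
        # Stage 2: order the grouped documents by count, descending.
        {"$sort": {"count": -1}},
        # Stage 3: keep only the first document (the most active user).
        {"$limit": 1},
    ]

pipeline = top_user_pipeline()
# The single key of each stage dictionary is the operator that defines the stage.
stage_names = [next(iter(stage)) for stage in pipeline]
```

The list could then be passed to `db.tweets.aggregate(pipeline)`, assuming a `tweets` collection like the one used earlier.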

### Exercise: Fix

In [None]:
#!/usr/bin/env python
"""
The tweets in our twitter collection have a field called "source". This field describes the application
that was used to create the tweet. Following the examples for using the $group operator, your task is 
to modify the 'make_pipeline' function to identify the most used applications for creating tweets.
As a check on your query, 'web' is listed as the most frequently used application.
'Ubertwitter' is the second most used. The number of counts should be stored in a field named 'count'
(see the assertion at the end of the script).

Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline
that can be passed to the MongoDB aggregate function. As in our examples in this lesson, the aggregation 
pipeline should be a list of one or more dictionary objects. 
Please review the lesson examples if you are unsure of the syntax.

Your code will be run against a MongoDB instance that we have provided. 
If you want to run this code locally on your machine, you have to install MongoDB, 
download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.

Please note that the dataset you are using here is a smaller version of the twitter dataset 
used in examples in this lesson. 
If you attempt some of the same queries that we looked at in the lesson examples,
your results will be different.
"""


def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [{"$group" : {"_id" : "$source",  # $group requires the grouping key to be named "_id"
                             "count" : {"$sum" : 1} } },
                {"$sort" : {"count" : -1} } ]
    return pipeline

def tweet_sources(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = tweet_sources(db, pipeline)
    import pprint
    pprint.pprint(result[0])
    assert result[0] == {u'count': 868, u'_id': u'web'}


## Aggregation Operators: An overview
### 1. Project
The "\$project" stage reshapes documents,
e.g. by selecting only the fields you are interested in.
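
A minimal sketch of a "\$project" stage, assuming tweets have a `text` field alongside the `user.screen_name` field used earlier; `screen_name_and_text_pipeline` is a hypothetical helper name.

```python
def screen_name_and_text_pipeline():
    return [
        {"$project": {
            # Drop the default _id field from the output.
            "_id": 0,
            # Create a top-level field from a nested one (assumed field path).
            "screen_name": "$user.screen_name",
            # Pass the text field through unchanged (assumed to exist).
            "text": 1,
        }},
    ]

project_pipeline = screen_name_and_text_pipeline()
```

Each output document would then contain only `screen_name` and `text`, however many fields the input tweet had.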