# Getting familiar with Aggregation Pipeline

----

### Connecting to MongoDB using PyMongo
----

In [1]:
# Importing the required libraries
import pymongo
import pprint as pp

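# pprint sorts dictionary keys by default; override the module-level `sorted`
# it uses so documents print in their stored field order
pp.sorted = lambda x, key=None: x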

---
### HR Analytics

An analytics training company wants to connect its enrollees with clients who are looking to hire employees working in the same domain. The collection contains student information covering demographics, education, and experience, as well as training-related features.


---

In [2]:
# # mongorestore HR data
# !mongorestore --db training /home/avadmin/Desktop/Mongo/Aggregation/training/training/

In [3]:
# Connect to local server
client = pymongo.MongoClient('mongodb://localhost:27017/')

In [4]:
# training database
db = client.training

In [5]:
# Collections
db.list_collection_names()

['hr']

In [6]:
# Sample document
pp.pprint(
    db.hr.find_one())

{'_id': ObjectId('60bc95fb12d1778df87722e2'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 8, 4, 8, 4, 14, 780000),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
---

**HR document**

- `enrollee_id` - Unique ID for enrollee

- `gender` - Gender

- `date_of_enrollment` - Date of enrollment

- `city` - Embedded document related to city of enrollee.
> - `name` - City code
> - `development_index` - Development index of the city

- `education` - Embedded document about education of enrollee.
> - `level` - Education level
> - `discipline` - Major discipline

- `experience` - Embedded document about working experience of enrollee.
> - `years` - Total experience in years
> - `company_type` - Type of current employer
> - `last_new_job` - Difference in years between previous job and current job
> - `relevent_experience` - Relevant experience (the field name is spelled this way in the data)

- `training_hours` - Training hours completed

----

----
### find() vs aggregate() method

----

The `find()` method returns a cursor over the documents in a collection.

-----

In [7]:
# find()
cur = db.hr.find()

for doc in cur:
    pp.pprint(doc)
    break

{'_id': ObjectId('60bc95fb12d1778df87722e2'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 8, 4, 8, 4, 14, 780000),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
---
[**aggregate method**](https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/#mongodb-method-db.collection.aggregate)

- The aggregate() method uses the aggregation pipeline to process documents into aggregated results. 

- An aggregation pipeline consists of stages with each stage processing the documents as they pass along the pipeline. 

- Documents pass through the stages in sequence.

- It returns a cursor.

**Syntax -** `db.collection.aggregate(pipeline, options)`
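
The idea of documents flowing through stages in sequence can be illustrated with a toy, in-memory model. This is only a sketch of the concept, not how MongoDB implements pipelines: the functions `keep_matching`, `keep_fields`, and `run_pipeline` are hypothetical helpers, and `sample_docs` is made-up data rather than documents from the `hr` collection.

```python
# Toy model of a pipeline: each stage takes a list of documents and
# returns a new list, which becomes the input of the next stage.
# All names and data below are hypothetical, for illustration only.

def keep_matching(docs, query):
    """Keep documents whose fields equal every value in `query`."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


def keep_fields(docs, fields):
    """Keep only the listed top-level fields of each document."""
    return [{k: d[k] for k in fields if k in d} for d in docs]


def run_pipeline(docs, stages):
    """Apply the stages one after another, in the order given."""
    for stage, arg in stages:
        docs = stage(docs, arg)
    return docs


sample_docs = [
    {'enrollee_id': 1, 'gender': 'Male', 'training_hours': 150},
    {'enrollee_id': 2, 'gender': 'Female', 'training_hours': 106},
    {'enrollee_id': 3, 'gender': 'Female', 'training_hours': 150},
]

result = run_pipeline(sample_docs, [
    (keep_matching, {'training_hours': 150}),          # stage 1: filter
    (keep_fields, ['enrollee_id', 'training_hours']),  # stage 2: select fields
])
print(result)
# [{'enrollee_id': 1, 'training_hours': 150}, {'enrollee_id': 3, 'training_hours': 150}]
```

The order of the stages matters: the second stage only ever sees the documents that the first stage let through.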


---

In [8]:
# Aggregate method
db.hr.aggregate([])

<pymongo.command_cursor.CommandCursor at 0x7f52085f8dd8>

In [9]:
# Iterate over the result
cur = db.hr.aggregate([])

for doc in cur:
    pp.pprint(doc)
    break

{'_id': ObjectId('60bc95fb12d1778df87722e2'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 8, 4, 8, 4, 14, 780000),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
### Filtering documents

----

A query expression passed to the `find()` method filters the documents.

For example, filter documents where the `training_hours` field value is 150.

----

In [10]:
# Filter documents
cur = db.hr.find({
                    'training_hours':150
                })

for doc in cur:
    pp.pprint(doc)
    break

{'_id': ObjectId('60bc95fb12d1778df877266c'),
 'enrollee_id': 17610,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 12, 1, 17, 56, 44, 493000),
 'city': {'name': 'city_41', 'development_index': 0.827},
 'education': {'level': 'Graduate', 'discipline': 'No Major'},
 'experience': {'years': 8,
                'company_type': 'Pvt Ltd',
                'last_new_job': 3,
                'relevent_experience': 1},
 'training_hours': 150}


---
**`$match` stage**

- Filtering in an aggregation pipeline is done using the [$match](https://docs.mongodb.com/manual/reference/operator/aggregation/match/) stage operator.

- It filters the documents on the given condition before passing them to the next stage in the pipeline.

**Syntax -** `{ $match: { <query> } }`
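
The `<query>` inside `$match` uses the same query syntax as `find()`, so comparison operators such as `$gte` and conditions on several fields can be used as well. Below is a minimal sketch that only builds such a pipeline as a plain Python list; the helper name `trained_at_least` is hypothetical and not part of PyMongo.

```python
def trained_at_least(min_hours, gender=None):
    """Build a one-stage pipeline (hypothetical helper).

    Matches enrollees with `training_hours` >= min_hours and,
    optionally, a given `gender`. Conditions on different fields
    in the same query document are combined with a logical AND.
    """
    query = {'training_hours': {'$gte': min_hours}}
    if gender is not None:
        query['gender'] = gender
    return [{'$match': query}]


pipeline = trained_at_least(150, gender='Male')
print(pipeline)
# [{'$match': {'training_hours': {'$gte': 150}, 'gender': 'Male'}}]
```

Assuming the `db` handle from the cells above, the pipeline could then be run with `db.hr.aggregate(pipeline)`.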

---

In [11]:
# Filter documents using pipeline
cur = db.hr.aggregate(
                      # Pipeline
                      [
                        # Stage
                        {
                            '$match':{
                                        'training_hours':150
                                     }
                        }
                      ])

for doc in cur:
    pp.pprint(doc)
    break

{'_id': ObjectId('60bc95fb12d1778df877266c'),
 'enrollee_id': 17610,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 12, 1, 17, 56, 44, 493000),
 'city': {'name': 'city_41', 'development_index': 0.827},
 'education': {'level': 'Graduate', 'discipline': 'No Major'},
 'experience': {'years': 8,
                'company_type': 'Pvt Ltd',
                'last_new_job': 3,
                'relevent_experience': 1},
 'training_hours': 150}


---
### Projecting fields

----
Projection in the `find()` method returns only the required fields from the documents.

For example, project `enrollee_id` and `training_hours`, and suppress the `_id` field, for documents where `training_hours` is 150.

---

In [12]:
# Projecting fields
cur = db.hr.find({
                    'training_hours':150
                 },
                 {
                     'enrollee_id':1,
                     'training_hours':1,
                     '_id':0
                 }
                )

for doc in cur:
    pp.pprint(doc)
    break

{'enrollee_id': 17610, 'training_hours': 150}


---
**`$project` stage**

- Field projection is achieved using the [$project](https://docs.mongodb.com/manual/reference/operator/aggregation/project/) stage operator.

- It passes along the documents with the requested fields to the next stage in the pipeline.

**Syntax -** `{ $project: { <specification(s)> } }`
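
Fields of embedded documents, such as `city.name` or `education.level` in the HR document, can be projected with dot notation. The sketch below only constructs the stage; `project_fields` is a hypothetical helper written for this example, not a PyMongo function.

```python
def project_fields(*fields, keep_id=False):
    """Build a $project stage that includes the given fields (hypothetical helper).

    `_id` is included by default in MongoDB projections, so it is
    suppressed explicitly unless keep_id is True.
    """
    spec = {field: 1 for field in fields}
    if not keep_id:
        spec['_id'] = 0
    return {'$project': spec}


pipeline = [
    # Stage 1: filter
    {'$match': {'training_hours': 150}},
    # Stage 2: keep a top-level field and two embedded fields
    project_fields('enrollee_id', 'city.name', 'education.level'),
]
print(pipeline[1])
# {'$project': {'enrollee_id': 1, 'city.name': 1, 'education.level': 1, '_id': 0}}
```

Assuming the `db` handle from the cells above, `db.hr.aggregate(pipeline)` should return documents that keep the nesting of the projected fields, for example a `city` sub-document containing only `name`.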

----

In [13]:
# Projecting document fields using pipeline
cur = db.hr.aggregate(
                      # Pipeline
                      [
                        # Stage 1
                        {
                            '$match':{
                                        'training_hours':150
                                     }
                        },
                        # Stage 2
                        {
                            '$project':{
                                            'enrollee_id':1,
                                            'training_hours':1,
                                            '_id':0
                                        }
                        }
                      ])

for doc in cur:
    pp.pprint(doc)
    break

{'enrollee_id': 17610, 'training_hours': 150}


----
### Aggregation Stages

----

In [14]:
# Aggregation stages in pipelines
cur = db.hr.aggregate(
                      # Pipeline
                      [
                        # Stage 1
                        {
                            '$match':{
                                        'training_hours':150
                                     }
                        },
                        # Stage 2
                        {
                            '$project':{
                                            'enrollee_id':1,
                                            'training_hours':1,
                                            '_id':0
                                        }
                        }
                        # Aggregation Stages
                      ])

for doc in cur:
    pp.pprint(doc)
    break