# Stages in Aggregation Pipeline

----

- [$match](https://docs.mongodb.com/manual/reference/operator/aggregation/match/#-match--aggregation-) - Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage.

- [$count](https://docs.mongodb.com/manual/reference/operator/aggregation/count/) - Passes a document to the next stage that contains a count of the number of documents input to the stage.

- [$skip](https://docs.mongodb.com/manual/reference/operator/aggregation/skip/) - Skips over the specified number of documents that pass into the stage. 

- [$limit](https://docs.mongodb.com/manual/reference/operator/aggregation/limit/) - Limits the number of documents passed to the next stage in the pipeline.

---
### Connecting to MongoDB using PyMongo
----

In [1]:
# Importing the required libraries
import pymongo
import pprint as pp

# Monkey-patch the `sorted` name inside the pprint module so dict keys print in insertion order
pp.sorted = lambda x, key=None: x

In [2]:
# Connect to local server
client = pymongo.MongoClient('mongodb://localhost:27017/')

In [3]:
# training database
db = client.training

In [4]:
# Sample hr document
pp.pprint(
    db.hr.find_one()
)

{'_id': ObjectId('60af5db0b2f5ad99212f9464'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 1, 23, 0, 0),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
### **`$match` stage**

[$match](https://docs.mongodb.com/manual/reference/operator/aggregation/match/) filters the documents on the given condition before passing them to the next stage in the pipeline.

**Syntax -** `{ $match: { <query> } }`

----

For example, we can filter and retrieve all those documents that have the `gender` field as `Female`.

---

In [5]:
# Keeping documents where `gender` is Female using `$match` 

result = db.hr.aggregate(
                                # Pipeline 
                                [
                                    # Stage 1
                                    {
                                        '$match':{'gender':'Female'}
                                    }
                                ]
                            )

# Print result
for doc in result:
    pp.pprint(doc)
    break

{'_id': ObjectId('60af5db0b2f5ad99212f9469'),
 'enrollee_id': 13342,
 'gender': 'Female',
 'date_of_enrollment': datetime.datetime(2020, 1, 8, 0, 0),
 'city': {'name': 'city_21', 'development_index': 0.624},
 'education': {'level': 'Graduate', 'discipline': 'Other'},
 'experience': {'years': 8,
                'company_type': 'Pvt Ltd',
                'last_new_job': 2,
                'relevent_experience': 1},
 'training_hours': 34}


---
---
**We can include multiple filter conditions with `$match` within the same pipeline.**

We can also query on embedded document fields.

For example, we retrieve only those documents where `gender` is `Female` and `education.level` is `Masters`.

---

In [6]:
# Multiple filter conditions using `$match`

result = db.hr.aggregate(
                                # Pipeline
                                [
                                    # Stage 1
                                    {
                                        '$match':{
                                                    'gender':'Female',
                                                    'education.level': 'Masters'
                                                }
                                    }
                                ]
                            )

# Print result
for doc in result:
    pp.pprint(doc)
    break

{'_id': ObjectId('60af5db0b2f5ad99212f950e'),
 'enrollee_id': 23114,
 'gender': 'Female',
 'date_of_enrollment': datetime.datetime(2017, 7, 18, 0, 0),
 'city': {'name': 'city_160', 'development_index': 0.92},
 'education': {'level': 'Masters', 'discipline': 'STEM'},
 'experience': {'years': 20,
                'company_type': nan,
                'last_new_job': 2,
                'relevent_experience': 1},
 'training_hours': 92}


---
**Multiple stages**

We can even create multiple stages for different filter conditions.

For example:
- Stage 1 = retrieve documents where `gender` is `Female`
- Stage 2 = retrieve documents where `education.level` is `Masters`.

----

In [7]:
# Multiple `$match` stages, one filter condition each

result = db.hr.aggregate(
                                # Pipeline
                                [
                                    # Stage 1
                                    {
                                        '$match':{'gender':'Female'}
                                    },
                                    # Stage 2
                                    {
                                        '$match':{'education.level':'Masters'}
                                    }
                                ]
                            )

# Print result
for doc in result:
    pp.pprint(doc)
    break

{'_id': ObjectId('60af5db0b2f5ad99212f950e'),
 'enrollee_id': 23114,
 'gender': 'Female',
 'date_of_enrollment': datetime.datetime(2017, 7, 18, 0, 0),
 'city': {'name': 'city_160', 'development_index': 0.92},
 'education': {'level': 'Masters', 'discipline': 'STEM'},
 'experience': {'years': 20,
                'company_type': nan,
                'last_new_job': 2,
                'relevent_experience': 1},
 'training_hours': 92}


---

***Note: in such a case, where two `$match` stages follow each other, the two stages can be coalesced into a single `$match`. This is an [optimization](https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#-match----match-coalescence) done by MongoDB.***
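
The effect of this coalescence can be sketched in plain Python. The helper below is only an illustration of the idea (adjacent `$match` stages combined with `$and`), not MongoDB's actual optimizer code; the name `coalesce_matches` is hypothetical, and no database connection is needed to run it.

```python
def coalesce_matches(pipeline):
    """Merge runs of adjacent $match stages into one $match joined by $and."""
    merged = []
    for stage in pipeline:
        prev = merged[-1] if merged else None
        if '$match' in stage and prev is not None and '$match' in prev:
            prev_query = prev['$match']
            # Reuse the $and list from an earlier merge, otherwise start one.
            if set(prev_query) == {'$and'}:
                clauses = list(prev_query['$and'])
            else:
                clauses = [prev_query]
            merged[-1] = {'$match': {'$and': clauses + [stage['$match']]}}
        else:
            merged.append(stage)
    return merged


two_stage = [
    {'$match': {'gender': 'Female'}},
    {'$match': {'education.level': 'Masters'}},
]
print(coalesce_matches(two_stage))
```

The single resulting stage selects the same documents as the two-stage pipeline, which is why both versions above return the same first document.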

---
**Query operators can be used within the `$match` stage.**

For example, we retrieve documents where `gender` is `Male` and where `training_hours >= 100`.

----

In [8]:
# Query operator can be used inside $match

result = db.hr.aggregate(
                        [
                            # Stage 1
                            {
                                '$match':{
                                            'gender':'Male',
                                            'training_hours':{'$gte':100}
                                        }
                            }
                        ]
                    )

# Print results
for doc in result:
    pp.pprint(doc)
    break

{'_id': ObjectId('60af5db0b2f5ad99212f9464'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 1, 23, 0, 0),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
For example, we retrieve documents where `gender` is `Female` and where either `experience.years <= 5` or `experience.years >= 7`.

----

In [9]:
# The logical `$or` operator can be combined with other conditions inside $match

result = db.hr.aggregate(
                    [
                        # Stage 1
                        {
                            '$match':{
                                        'gender':'Female',
                                        '$or':[
                                                    {'experience.years':{'$lte':5}},
                                                    {'experience.years':{'$gte':7}}
                                              ]
                                    }
                        }
                    ]
                )


# Print results
for doc in result:
    pp.pprint(doc)
    break

{'_id': ObjectId('60af5db0b2f5ad99212f9469'),
 'enrollee_id': 13342,
 'gender': 'Female',
 'date_of_enrollment': datetime.datetime(2020, 1, 8, 0, 0),
 'city': {'name': 'city_21', 'development_index': 0.624},
 'education': {'level': 'Graduate', 'discipline': 'Other'},
 'experience': {'years': 8,
                'company_type': 'Pvt Ltd',
                'last_new_job': 2,
                'relevent_experience': 1},
 'training_hours': 34}
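
To make the query semantics used so far concrete, here is a deliberately simplified in-memory model of how a `$match` query is evaluated against one document. It is a sketch under stated assumptions, not MongoDB's implementation: it covers only equality, dotted paths into embedded documents, `$gt`/`$gte`/`$lt`/`$lte`, `$in`, and `$or`, and the helper names (`get_path`, `check_operators`, `matches`) are hypothetical. The sample document is a trimmed copy of the output above.

```python
import operator

# Comparison operators covered by this sketch (a small subset of MongoDB's).
COMPARATORS = {'$gt': operator.gt, '$gte': operator.ge,
               '$lt': operator.lt, '$lte': operator.le}
MISSING = object()  # sentinel for "field not present"


def get_path(doc, dotted):
    """Follow a dotted path such as 'experience.years' through nested dicts."""
    value = doc
    for key in dotted.split('.'):
        if not isinstance(value, dict) or key not in value:
            return MISSING
        value = value[key]
    return value


def check_operators(value, condition):
    """Evaluate an operator document such as {'$gte': 100} against one value."""
    for op, operand in condition.items():
        if op == '$in':
            ok = value in operand
        else:
            try:
                ok = value is not MISSING and COMPARATORS[op](value, operand)
            except TypeError:  # e.g. comparing a string with a number
                ok = False
        if not ok:
            return False
    return True


def matches(doc, query):
    """Return True if doc satisfies every top-level condition (implicit AND)."""
    for field, condition in query.items():
        if field == '$or':
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict) and condition and all(
                key.startswith('$') for key in condition):
            if not check_operators(get_path(doc, field), condition):
                return False
        elif get_path(doc, field) != condition:
            return False
    return True


sample = {'gender': 'Female',
          'education': {'level': 'Graduate', 'discipline': 'Other'},
          'experience': {'years': 8},
          'training_hours': 34}
or_query = {'gender': 'Female',
            '$or': [{'experience.years': {'$lte': 5}},
                    {'experience.years': {'$gte': 7}}]}
print(matches(sample, or_query))                                             # True
print(matches(sample, {'gender': 'Male', 'training_hours': {'$gte': 100}}))  # False
```

The sample document passes the `$or` query because 8 years of experience satisfies the second branch, and fails the second query on the `gender` condition alone.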


---
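**Note -**
- Place `$match` as early in the pipeline as possible.
- Since `$match` filters the documents, it reduces the number of documents that the subsequent stages have to process.
- When `$match` is the first stage of the pipeline, it can also take advantage of indexes, like a regular query.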

----

----
### `$count` stage

The [$count](https://docs.mongodb.com/manual/reference/operator/aggregation/count/) stage passes a single document containing the count of its input documents to the next stage of the pipeline.

We provide the name of the output field as a string.

**Syntax -** `{ $count: <string> }`

----

For example, we can count the number of documents in the `hr` collection.

----

In [10]:
# Count documents

result = db.hr.aggregate(
                            [
                                # Stage 1
                                {
                                    '$count': 'Total_docs'
                                }
                            ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Total_docs': 18359}

---
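Compare it to counting with a regular query. (`Cursor.count()`, which we used in querying, is deprecated since PyMongo 3.7 and removed in PyMongo 4.0; `count_documents()` is its replacement.)

---

In [11]:
# Count documents
db.hr.count_documents({})

18359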

---
For example, we can retrieve all the documents where the `gender` is `Female` and then count the number of documents that are retrieved.


---

In [12]:
# Count filtered documents

result = db.hr.aggregate(
                            [
                                # Stage 1 - filter
                                {
                                    '$match':{'gender':'Female'}
                                },
                                # Stage 2 - count
                                {
                                    '$count': 'Female_candidates'
                                }
                            ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Female_candidates': 1188}


----
### `$skip` stage

The [$skip](https://docs.mongodb.com/manual/reference/operator/aggregation/skip/) stage skips over the specified number of documents that pass into it and passes the remaining documents to the next stage in the pipeline.

**Syntax -** `{ $skip: <positive integer> }`

----

For example, we can skip a few documents and check the count of the documents returned.

----

In [13]:
# Count documents

result = db.hr.aggregate(
                            [
                                # Stage 1
                                {
                                    '$count': 'Total_docs'
                                }
                            ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Total_docs': 18359}


In [14]:
# Skip documents

result = db.hr.aggregate(
                            [
                                # Stage 1 - skip
                                {
                                    '$skip': 10
                                },
                                # Stage 2 - count
                                {
                                    '$count': 'Altered_count'
                                }
                            ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Altered_count': 18349}
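
The arithmetic behind this result (18359 - 10 = 18349) can be checked without a database. The following is a minimal sketch that models only `$skip` and `$count` over a plain Python iterable; `run_stages` is a hypothetical helper, and `range(18359)` merely stands in for the 18359 documents of the `hr` collection.

```python
from itertools import islice


def run_stages(docs, pipeline):
    """Apply $skip and $count stages, in order, to an iterable of documents."""
    stream = iter(docs)
    for stage in pipeline:
        name, arg = next(iter(stage.items()))
        if name == '$skip':
            # Drop the first `arg` documents and pass the rest along.
            stream = islice(stream, arg, None)
        elif name == '$count':
            total = sum(1 for _ in stream)
            # With no input documents, $count emits nothing rather than a zero.
            stream = iter([{arg: total}] if total else [])
        else:
            raise ValueError(f'stage not covered by this sketch: {name}')
    return list(stream)


stand_in_docs = range(18359)  # placeholder for the documents in `hr`
print(run_stages(stand_in_docs, [{'$skip': 10}, {'$count': 'Altered_count'}]))
# [{'Altered_count': 18349}]
```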


----
### `$limit` stage

The [$limit](https://docs.mongodb.com/manual/reference/operator/aggregation/limit/) stage limits the number of documents passed to the next stage in the pipeline.

**Syntax -** `{ $limit: <positive integer> }`

----

For example, return only the first 5 documents from the collection where `gender` is `Other`.

----

In [23]:
# Limit documents

result = db.hr.aggregate(
                            [
                                # Stage 1 - Filter
                                {
                                    '$match':{'gender':'Other'}
                                },
                                # Stage 2 - limit
                                {
                                    '$limit': 5
                                }
                            ])

# Print results
for doc in result:
    pp.pprint(doc)

{'_id': ObjectId('60af5db0b2f5ad99212f94c3'),
 'enrollee_id': 27425,
 'gender': 'Other',
 'date_of_enrollment': datetime.datetime(2017, 4, 10, 0, 0),
 'city': {'name': 'city_75', 'development_index': 0.939},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 16,
                'company_type': nan,
                'last_new_job': 0,
                'relevent_experience': 1},
 'training_hours': 34}
{'_id': ObjectId('60af5db0b2f5ad99212f94d5'),
 'enrollee_id': 15775,
 'gender': 'Other',
 'date_of_enrollment': datetime.datetime(2019, 3, 19, 0, 0),
 'city': {'name': 'city_103', 'development_index': 0.92},
 'education': {'level': 'Graduate', 'discipline': 'Arts'},
 'experience': {'years': 4,
                'company_type': 'Funded Startup',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 31}
{'_id': ObjectId('60af5db0b2f5ad99212f9566'),
 'enrollee_id': 13501,
 'gender': 'Other',
 'date_of_enrollment': datetime

---
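
`$skip` and `$limit` are often combined with `$match` to fetch results one page at a time. The helper below is a minimal sketch of how such a pipeline could be assembled; `page_pipeline` and its parameters `page` and `page_size` are hypothetical names, not part of PyMongo. Stable paging usually also needs a sort stage, which is outside the scope of this section.

```python
def page_pipeline(query, page, page_size):
    """Build the stages for one page of documents matching `query`.

    `page` is 1-based. This is an illustrative helper, not a PyMongo API.
    """
    if page < 1 or page_size < 1:
        raise ValueError('page and page_size must be positive integers')
    stages = [{'$match': query}]
    offset = (page - 1) * page_size
    if offset:
        # Skip the documents that belong to the earlier pages.
        stages.append({'$skip': offset})
    stages.append({'$limit': page_size})
    return stages


print(page_pipeline({'gender': 'Other'}, page=1, page_size=5))
# [{'$match': {'gender': 'Other'}}, {'$limit': 5}]
print(page_pipeline({'gender': 'Other'}, page=3, page_size=5))
# [{'$match': {'gender': 'Other'}}, {'$skip': 10}, {'$limit': 5}]
```

The returned list can be passed directly to `db.hr.aggregate(...)`. The first call produces the same pipeline as the `$limit` example above.
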
----
### Question -

Count the number of enrollees that are from the `STEM` discipline.

---

In [16]:
# Question
result = db.hr.aggregate(
                            [
                                # Stage 1
                                {
                                    '$match': {'education.discipline': 'STEM'}
                                },
                                # Stage 2
                                {
                                    '$count': 'STEM_students_count'
                                }
                            ]
    )

for doc in result:
    pp.pprint(doc)

{'STEM_students_count': 13738}


----
### Question - 

How many enrollees have either more than 5 years of experience or an education level of Graduate, Masters, or Phd?

----

In [17]:
# Distinct education levels
db.hr.distinct('education.level')

['Graduate', 'High School', 'Masters', 'Phd', 'Primary School']

In [18]:
# Question
result = db.hr.aggregate(
        [
            # Stage 1
            {
                '$match': {
                            '$or':[
                                    {'experience.years':{'$gt':5}},
                                    {
                                        'education.level':{
                                                            '$in': ['Graduate',
                                                                    'Masters',
                                                                    'Phd']
                                                        }
                                    }
                                  ]
                }
            },
            # Stage 2
            {
                '$count': 'Answer'
            }
        ]
    )

for doc in result:
    pp.pprint(doc)

{'Answer': 17019}


----
----
### Exercise 1 - 

How many female enrollees with experience of less than 5 years work for an NGO?

----

----
### Exercise 2 - 

How many enrollees have either a relevant experience of at least 1 year or more than 100 hours of training?

----