# Grouping data in MongoDB

----

- [$group](https://docs.mongodb.com/manual/reference/operator/aggregation/group/) - Groups documents by a specified expression.

- [Accumulator operators](https://docs.mongodb.com/manual/reference/operator/aggregation/group/#accumulator-operator) - Used within the `$group` stage to compute a value for each group.

- [$bucket](https://docs.mongodb.com/manual/reference/operator/aggregation/bucket/#-bucket--aggregation-) - Categorizes documents into several groups, or buckets. The operations are then performed on each of these buckets.


---
### Connecting to MongoDB using Pymongo
----

In [1]:
# Importing the required libraries
import pymongo
import pprint as pp

# Override `sorted` inside the pprint module so dictionary keys print in insertion order
pp.sorted = lambda x, key=None: x

In [2]:
# Connect to a local MongoDB instance (replace the URI to use an Atlas cluster)
client = pymongo.MongoClient('mongodb://localhost:27017/')

In [3]:
# training dataset
db = client.training

In [4]:
# Sample data
pp.pprint(
    db.hr.find_one()
)

{'_id': ObjectId('60bc95fb12d1778df87722e2'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 8, 4, 8, 4, 14, 780000),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
---
### $group stage

[$group](https://docs.mongodb.com/manual/reference/operator/aggregation/group/) groups the documents as per the specified condition.

- Groups input documents by the specified *_id* expression and for each distinct grouping, outputs a document. 

- The *_id* expression specifies the field (or expression) to group the documents on.

- The *_id* field of each output document contains the unique group by value. 

- The output documents contain computed fields that hold the values of some [accumulator expression](https://docs.mongodb.com/manual/reference/operator/aggregation/group/#accumulator-operator).


**Syntax -**

`{$group:{_id: <expression>,<field1>: { <accumulator1> : <expression1> },...}}`
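
To make the semantics concrete, here is a minimal plain-Python sketch of what a `$group` stage with an `$avg` accumulator computes. It runs on a small hypothetical list of dictionaries rather than on the `hr` collection, and the helper name `group_avg` is an assumption made for illustration, not part of PyMongo.

```python
from collections import defaultdict

# Hypothetical sample documents (not taken from the hr collection)
docs = [
    {'discipline': 'STEM', 'training_hours': 10},
    {'discipline': 'STEM', 'training_hours': 30},
    {'discipline': 'Arts', 'training_hours': 50},
]

def group_avg(documents, key, field):
    """Emulate {'$group': {'_id': '$<key>', 'avg': {'$avg': '$<field>'}}}."""
    values_by_key = defaultdict(list)
    for doc in documents:
        values_by_key[doc[key]].append(doc[field])
    # One output document per distinct key value
    return [{'_id': k, 'avg': sum(v) / len(v)} for k, v in values_by_key.items()]

result = group_avg(docs, 'discipline', 'training_hours')
print(result)
```

The sketch returns groups in first-seen order. MongoDB itself makes no promise about the order of `$group` output, so add a `$sort` stage when the order matters.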
 
 
 ---
 
 For example, group documents on `education.discipline`.
 
 ----

In [5]:
# Group on a field
cur = db.hr.aggregate([
                        # Stage 1 - Group
                        {
                            '$group': {
                                        # Field to group on
                                        '_id': '$education.discipline'
                                        }
                        }
                    ])

for doc in cur:
    pp.pprint(doc)

{'_id': 'Business Degree'}
{'_id': 'No Major'}
{'_id': 'Arts'}
{'_id': 'Other'}
{'_id': 'STEM'}
{'_id': 'Humanities'}


---
### [Accumulator operators](https://docs.mongodb.com/manual/reference/operator/aggregation/group/#accumulator-operator) 

Accumulator operators are used inside the `$group` stage to compute a value, such as a sum or an average, for each group.

----
For example, group documents by the `education.discipline` field and take the average of the `training_hours` field. Here we have to make use of the `$avg` accumulator operator.

---

In [6]:
# Group on a field
cur = db.hr.aggregate([
                        {
                            '$group': {
                                        # Field to group on
                                        '_id': '$education.discipline',
                                        # Accumulator expression
                                        'avg_hours':{'$avg':'$training_hours'}
                            }
                        }
                    ])

for doc in cur:
    pp.pprint(doc)

{'_id': 'Business Degree', 'avg_hours': 64.59609120521172}
{'_id': 'No Major', 'avg_hours': 64.92718446601941}
{'_id': 'Arts', 'avg_hours': 62.0}
{'_id': 'Other', 'avg_hours': 66.65073876139579}
{'_id': 'STEM', 'avg_hours': 65.80615810161595}
{'_id': 'Humanities', 'avg_hours': 66.50436046511628}


In [7]:
# Count the documents in each group
cur = db.hr.aggregate([
                        {
                            '$group': {
                                        # Field to group on
                                        '_id': '$education.discipline',
                                        # Accumulator expression - adds 1 per document
                                        'count':{'$sum':1}
                            }
                        }
                    ])

for doc in cur:
    pp.pprint(doc)

{'_id': 'Business Degree', 'count': 307}
{'_id': 'No Major', 'count': 206}
{'_id': 'Arts', 'count': 239}
{'_id': 'Other', 'count': 3181}
{'_id': 'STEM', 'count': 13738}
{'_id': 'Humanities', 'count': 688}
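
The `{'$sum': 1}` accumulator adds 1 for every document in a group, so it yields the number of documents per group. A rough plain-Python equivalent, using a hypothetical list of discipline values in place of the collection, might look like this:

```python
from collections import Counter

# Hypothetical discipline values, one per document
disciplines = ['STEM', 'Arts', 'STEM', 'Other', 'STEM']

# Adding 1 per document is the same as counting documents per group
counts = Counter()
for value in disciplines:
    counts[value] += 1

print(dict(counts))
```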


---
To retrieve the maximum training hours, we can use the `$max` accumulator operator.

----

In [7]:
# Group on a field

cur = db.hr.aggregate([
                        {
                            '$group': {
                                        # Field to group on
                                        '_id': '$education.discipline',
                                        # Accumulator expression
                                        'max_hours':{'$max':'$training_hours'}
                                    }
                        }
                    ])

for doc in cur:
    pp.pprint(doc)

{'_id': 'STEM', 'max_hours': 336}
{'_id': 'Business Degree', 'max_hours': 312}
{'_id': 'Humanities', 'max_hours': 336}
{'_id': 'Other', 'max_hours': 336}
{'_id': 'No Major', 'max_hours': 314}
{'_id': 'Arts', 'max_hours': 322}


---
We can even define multiple accumulator expressions on the grouped data.

----

In [8]:
# Multiple accumulator expressions

cur = db.hr.aggregate([
                        {
                            '$group': {
                                        # Field to group on
                                        '_id': '$education.discipline',
                                        # Accumulator expression
                                        'max_hours':{'$max':'$training_hours'},
                                        'min_hours':{'$min':'$training_hours'}
                                    }
                        }
                    ])

for doc in cur:
    pp.pprint(doc)

{'_id': 'STEM', 'max_hours': 336, 'min_hours': 1}
{'_id': 'Business Degree', 'max_hours': 312, 'min_hours': 1}
{'_id': 'Humanities', 'max_hours': 336, 'min_hours': 1}
{'_id': 'Other', 'max_hours': 336, 'min_hours': 1}
{'_id': 'No Major', 'max_hours': 314, 'min_hours': 2}
{'_id': 'Arts', 'max_hours': 322, 'min_hours': 2}


---
**We can even group on multiple fields from the documents.**

For example, group documents by `education.level` and `education.discipline` fields and determine the average `training_hours`.
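
A plain-Python analogy may help here: grouping on a compound `_id` behaves like using a tuple of field values as a dictionary key, so every distinct combination becomes its own group. The sample documents below are hypothetical and only mirror the shape of the `education` sub-document.

```python
from collections import defaultdict

# Hypothetical documents shaped like those in the hr collection
docs = [
    {'education': {'level': 'Graduate', 'discipline': 'STEM'}, 'training_hours': 40},
    {'education': {'level': 'Graduate', 'discipline': 'STEM'}, 'training_hours': 60},
    {'education': {'level': 'Masters', 'discipline': 'STEM'}, 'training_hours': 20},
]

hours_by_group = defaultdict(list)
for doc in docs:
    # The (level, discipline) pair plays the role of the compound _id
    key = (doc['education']['level'], doc['education']['discipline'])
    hours_by_group[key].append(doc['training_hours'])

averages = {key: sum(hours) / len(hours) for key, hours in hours_by_group.items()}
print(averages)
```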

----

In [9]:
# Group on fields

cur = db.hr.aggregate(
            [
                {
                    '$group': {
                                # Fields to group on
                                '_id': {
                                            'Level':'$education.level',
                                            'Discipline':'$education.discipline'
                                        },
                                # Accumulator expression
                                'avg_hours':{'$avg':'$training_hours'}
                    }
                }
            ])

for doc in cur:
    pp.pprint(doc)

{'_id': {'Level': 'Graduate', 'Discipline': 'Other'},
 'avg_hours': 69.68888888888888}
{'_id': {'Level': 'Phd', 'Discipline': 'STEM'}, 'avg_hours': 69.04106280193237}
{'_id': {'Level': 'Masters', 'Discipline': 'No Major'},
 'avg_hours': 51.03448275862069}
{'_id': {'Level': 'Graduate', 'Discipline': 'Humanities'},
 'avg_hours': 66.4696261682243}
{'_id': {'Level': 'Phd', 'Discipline': 'Other'}, 'avg_hours': 43.55555555555556}
{'_id': {'Level': 'High School', 'Discipline': 'Other'},
 'avg_hours': 66.24753937007874}
{'_id': {'Level': 'Phd', 'Discipline': 'Humanities'},
 'avg_hours': 75.37037037037037}
{'_id': {'Level': 'Graduate', 'Discipline': 'STEM'},
 'avg_hours': 66.41476014760147}
{'_id': {'Level': 'Primary School', 'Discipline': 'Other'},
 'avg_hours': 65.61609907120743}
{'_id': {'Level': 'Masters', 'Discipline': 'STEM'},
 'avg_hours': 63.953633758791355}
{'_id': {'Level': 'Graduate', 'Discipline': 'Arts'},
 'avg_hours': 62.35820895522388}
{'_id': {'Level': 'Phd', 'Discipline': 'Busi

----
**We can group and then sort the results.**

For example, group documents by `education.level` and `education.discipline` fields and determine the average `training_hours`. Then return the documents in decreasing order of their average training hours.
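
In a `$sort` stage, `-1` requests descending order and `1` ascending order. As a small illustration on made-up grouped results (the numbers below are hypothetical, not output from the collection), the same ordering can be expressed in plain Python:

```python
# Hypothetical grouped results (values are made up for illustration)
grouped = [
    {'_id': {'Level': 'Graduate', 'Discipline': 'Arts'}, 'avg_hours': 62.0},
    {'_id': {'Level': 'Phd', 'Discipline': 'STEM'}, 'avg_hours': 69.0},
    {'_id': {'Level': 'Masters', 'Discipline': 'STEM'}, 'avg_hours': 64.0},
]

# Equivalent of {'$sort': {'avg_hours': -1}}: highest average first
ordered = sorted(grouped, key=lambda doc: doc['avg_hours'], reverse=True)

for doc in ordered:
    print(doc['_id']['Level'], doc['avg_hours'])
```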

----

In [10]:
# Group and Sort

cur = db.hr.aggregate(
            [
                # Stage 1
                {
                    '$group': {
                                # Fields to group on
                                '_id': {
                                            'Level':'$education.level',
                                            'Discipline':'$education.discipline'
                                        },
                                # Accumulator expression
                                'avg_hours':{'$avg':'$training_hours'}
                    }
                },
                # Stage 2
                {
                    '$sort': {'avg_hours':-1}
                }
            ])

for doc in cur:
    pp.pprint(doc)

{'_id': {'Level': 'Phd', 'Discipline': 'Humanities'},
 'avg_hours': 75.37037037037037}
{'_id': {'Level': 'Graduate', 'Discipline': 'Other'},
 'avg_hours': 69.68888888888888}
{'_id': {'Level': 'Phd', 'Discipline': 'STEM'}, 'avg_hours': 69.04106280193237}
{'_id': {'Level': 'Graduate', 'Discipline': 'No Major'},
 'avg_hours': 67.20338983050847}
{'_id': {'Level': 'Graduate', 'Discipline': 'Humanities'},
 'avg_hours': 66.4696261682243}
{'_id': {'Level': 'Graduate', 'Discipline': 'STEM'},
 'avg_hours': 66.41476014760147}
{'_id': {'Level': 'High School', 'Discipline': 'Other'},
 'avg_hours': 66.24753937007874}
{'_id': {'Level': 'Graduate', 'Discipline': 'Business Degree'},
 'avg_hours': 65.71627906976744}
{'_id': {'Level': 'Primary School', 'Discipline': 'Other'},
 'avg_hours': 65.61609907120743}
{'_id': {'Level': 'Masters', 'Discipline': 'Humanities'},
 'avg_hours': 65.54077253218884}
{'_id': {'Level': 'Phd', 'Discipline': 'Business Degree'}, 'avg_hours': 64.2}
{'_id': {'Level': 'Masters', '

---
### $bucket

[$bucket](https://docs.mongodb.com/manual/reference/operator/aggregation/bucket/#-bucket--aggregation-) categorizes documents into several groups, or buckets. The operations are then performed on each of these buckets.

**Syntax -** 

    {
      $bucket: {
          groupBy: <expression>,
          boundaries: [ <lowerbound1>, <lowerbound2>, ... ],
          default: <literal>,
          output: {<output1>: { <$accumulator expression> }}
      }
    }

- `groupBy` - The expression to group documents by.

- `boundaries` - An array that defines the boundaries of the buckets. Each adjacent pair of values acts as the inclusive lower boundary and the exclusive upper boundary of a bucket.

- `default` - A literal used as the `_id` of an extra bucket that collects documents falling outside the boundaries. Without it, such documents cause the stage to fail.

- `output` - Defines the accumulator operations to perform on each bucket.
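
The boundary rule (inclusive lower bound, exclusive upper bound) can be sketched in plain Python with the standard `bisect` module. The function name `bucket_id` is hypothetical, and treating `None` as "no default given" is a simplification of this sketch, since `$bucket` accepts any literal as its default.

```python
from bisect import bisect_right

def bucket_id(value, boundaries, default=None):
    """Return the _id of the bucket that `value` falls into.

    Each bucket covers [lower, upper). Values outside all buckets go to
    `default`, or raise an error when no default is given.
    """
    if value < boundaries[0] or value >= boundaries[-1]:
        if default is None:
            raise ValueError(f'{value} is outside the boundaries and no default is set')
        return default
    # The bucket _id is the lower boundary of the matching range
    return boundaries[bisect_right(boundaries, value) - 1]

print(bucket_id(0.448, [0, 0.6, 1]))         # lower bucket
print(bucket_id(0.6, [0, 0.6, 1]))           # lower bound is inclusive
print(bucket_id(0.9, [0, 0.6, 0.8], 'Other'))  # falls into the default bucket
```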

----

For example, we can bucket the `city.development_index` into several buckets and find the total number of documents in each bucket.

---

In [11]:
# Bucket

cur = db.hr.aggregate(
            [
                # Stage 1 - bucket
                {
                    '$bucket': {
                                    # Group by condition
                                    'groupBy': '$city.development_index',
                                    # Bucket boundaries
                                    'boundaries': [0, 0.6, 1],
                                    # Operation
                                    'output':{
                                                'Count': {'$sum': 1}
                                            }
                                }
                }
            ])

for doc in cur:
    pp.pprint(doc)

{'_id': 0, 'Count': 448}
{'_id': 0.6, 'Count': 17911}


---
The `default` option defines a bucket for documents that do not fall in the defined buckets.

---

In [12]:
# Bucket

cur = db.hr.aggregate(
            [
                {
                    '$bucket': {
                                    # Group by condition
                                    'groupBy': '$city.development_index',
                                    # Bucket boundaries
                                    'boundaries': [0, 0.6, 0.8],
                                    # Default 
                                    'default': 'Other',
                                    # Operation
                                    'output':{
                                                'Count': {'$sum': 1}
                                            }
                                }
                }
            ])

for doc in cur:
    pp.pprint(doc)

{'_id': 0, 'Count': 448}
{'_id': 0.6, 'Count': 4165}
{'_id': 'Other', 'Count': 13746}


---
**We can define multiple operations to be performed on each bucket.**

For example, we can bucket the `city.development_index` into several buckets and find the total number of documents and average training hours in each.

----


In [13]:
# Bucket

cur = db.hr.aggregate(
            [
                {
                    '$bucket': {
                                    # Group by condition
                                    'groupBy': '$city.development_index',
                                    # Bucket boundaries
                                    'boundaries': [0, 0.6, 1],
                                    # Operation
                                    'output':{
                                                'Count': {'$sum': 1},
                                                'Avg_training_hours':{'$avg':'$training_hours'}
                                            }
                                }
                }
            ])

for doc in cur:
    pp.pprint(doc)

{'_id': 0, 'Count': 448, 'Avg_training_hours': 68.05580357142857}
{'_id': 0.6, 'Count': 17911, 'Avg_training_hours': 65.84506727709228}


---
### $bucketAuto

[$bucketAuto](https://docs.mongodb.com/manual/reference/operator/aggregation/bucketAuto/) operator categorizes documents into a specific number of buckets, based on a specified expression. 

Bucket boundaries are automatically determined in an attempt to evenly distribute the documents into the specified number of buckets.

**Syntax -** 

    {
      $bucketAuto: {
          groupBy: <expression>,
          buckets: <number>,
          output: {<output1>: { <$accumulator expression> }}
      }
    }

- groupBy - Defines the grouping condition

- buckets - An integer that specifies the number of buckets into which input documents are grouped.

- output - Define the operations to perform on the buckets.
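
As a rough intuition only, the idea of spreading documents evenly can be sketched as an equal-count split of the sorted values. This is a simplification and not MongoDB's actual algorithm: the function name `auto_buckets` is hypothetical, the sketch reports each chunk's own largest value as `max` (whereas `$bucketAuto` reports an exclusive upper bound for all but the last bucket), and it ignores ties. Because buckets are value ranges, documents with equal values must end up in the same bucket, which is one reason real bucket counts can be uneven.

```python
def auto_buckets(values, n):
    """Split sorted values into n chunks of nearly equal size (simplified)."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    buckets, start = [], 0
    for i in range(n):
        # The first `extra` chunks take one additional value each
        end = start + size + (1 if i < extra else 0)
        chunk = ordered[start:end]
        if chunk:
            buckets.append({'min': chunk[0], 'max': chunk[-1], 'count': len(chunk)})
        start = end
    return buckets

# Hypothetical development index values
sample = [0.7, 0.1, 0.4, 0.2, 0.6, 0.3, 0.5]
for bucket in auto_buckets(sample, 3):
    print(bucket)
```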

----

For example, we can bucket the `city.development_index` into three buckets automatically.

---

In [14]:
# bucketAuto

cur = db.hr.aggregate(
        [
            {
            '$bucketAuto': {
                        # Group by condition
                        'groupBy': '$city.development_index',
                        # Number of buckets
                        'buckets': 3,
                        # Operation
                        'output':{
                                    'Count': {'$sum': 1},
                                    'Avg_training_hours':{'$avg':'$training_hours'}
                                }
                        }
            }
        ])

for doc in cur:
    pp.pprint(doc)

{'_id': {'min': 0.448, 'max': 0.865},
 'Count': 6152,
 'Avg_training_hours': 66.40848504551366}
{'_id': {'min': 0.865, 'max': 0.921},
 'Count': 9457,
 'Avg_training_hours': 65.862852913186}
{'_id': {'min': 0.921, 'max': 0.949},
 'Count': 2750,
 'Avg_training_hours': 64.88363636363637}


---
### Question - 

What are the average training hours delivered per year?

----

In [15]:
# Question
result = db.hr.aggregate(
                        [
                            {
                                '$project':{
                                                'Year':{'$year':'$date_of_enrollment'},
                                                'training_hours':1
                                            }
                            },
                            {
                                '$group':{
                                            '_id':'$Year',
                                            'avg_hours':{'$avg':'$training_hours'}
                                        }
                            }
                        ]
                )

for doc in result:
    pp.pprint(doc)

{'_id': 2016, 'avg_hours': 66.66699604743083}
{'_id': 2020, 'avg_hours': 65.10072178477691}
{'_id': 2017, 'avg_hours': 67.51004784688995}
{'_id': 2018, 'avg_hours': 66.30190538764784}
{'_id': 2019, 'avg_hours': 64.57091620476035}
{'_id': 2015, 'avg_hours': 65.20501815780786}
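
One possible variation, assuming the same field names, computes the year directly inside the `_id` expression so the `$project` stage is not needed, and adds a `$sort` stage so the years come back in order. The block below only builds the pipeline; it could then be passed to `db.hr.aggregate(pipeline)`.

```python
# Sketch of an alternative pipeline: group on the year expression directly
pipeline = [
    {
        '$group': {
            # Extract the year from the enrollment date while grouping
            '_id': {'$year': '$date_of_enrollment'},
            'avg_hours': {'$avg': '$training_hours'},
        }
    },
    # Return the years in ascending order
    {'$sort': {'_id': 1}},
]

print(pipeline)
```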


---
### Question - 

What are the average training hours delivered per quarter for all the years combined?

*Hint - Make use of $bucket operator.*

----

In [16]:
# Question
result = db.hr.aggregate(
        [
            {
            '$project':{
                            'Month':{'$month':'$date_of_enrollment'},
                            'training_hours':1
                        }
            },
            {
            '$bucket':{
                            'groupBy': '$Month',
                            'boundaries':[1, 4, 7, 10, 13],
                            'default':'Other',
                            'output':{
                                        'Avg_training_hours':{'$avg':'$training_hours'}
                                    }
                      }
            }
        ]
)

for doc in result:
    pp.pprint(doc)

{'_id': 1, 'Avg_training_hours': 66.155545735749}
{'_id': 4, 'Avg_training_hours': 64.65135559041003}
{'_id': 7, 'Avg_training_hours': 67.67432286023835}
{'_id': 10, 'Avg_training_hours': 65.10283911671924}
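
Each `_id` in this output is the lower month boundary of a bucket, so 1, 4, 7 and 10 stand for the first to fourth quarter. A small sketch of that mapping follows; the helper names are hypothetical.

```python
# Same boundaries as in the $bucket stage above
BOUNDARIES = [1, 4, 7, 10, 13]

def quarter_from_bucket_id(bucket_id):
    """Map a bucket _id (lower month boundary) to a quarter number 1-4."""
    return BOUNDARIES.index(bucket_id) + 1

def quarter_from_month(month):
    """Compute the quarter arithmetically from a month number 1-12."""
    return (month - 1) // 3 + 1

for bucket_id in BOUNDARIES[:-1]:
    print(bucket_id, '-> Q%d' % quarter_from_bucket_id(bucket_id))
```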


---
---
### Exercise - 

What is the average total experience in years for enrollees grouped by discipline and level of education? Find the group with the highest average total experience.

----

### Exercise - 

What are the average training hours delivered per month for all the years combined?

*Hint - Make use of $bucket operator.*

----