# MSDE631 SQL-NoSQL | Regis University
### Instructor: Dr. Busch

Title | Date | Author | Notes |
:-----:|:----:|:------:|:-------:|
Project 6: MongoDB Aggregate | 17 April 2023 | Ken Dizon | |

**TASK / Content**
1. Import empData.csv (file is attached).
    * `mongoimport --db db_A7 --collection employees --type=csv --file '/empData.csv' --headerline`

2. Write MongoDB aggregates using PyMongo to answer the following questions:
- Q1: All employees grouped by state, with the number of employees in each state
- Q2: Average age of people in each state, grouped by state
- Q3: Number of people and average age of people in each manager's group, grouped and sorted by manager id
- Q4: Average age and number of employees in each manager group, where the employees' ages are > 40, grouped and sorted by manager id

In [1]:
import pymongo
from pymongo import MongoClient
import pprint
import pandas as pd

## 1. Data Import

In [2]:
# loading data
df = pd.read_csv('empdata.csv')
df.head()

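| | pid | name | managerId | state | age | birth | salary |
|--:|--:|:--|--:|:--|--:|--:|--:|
| 0 | 1 | Mary1 | 1 | WI | 24 | 1995 | 46436 |
| 1 | 2 | Mary2 | 1 | MI | 38 | 1981 | 68949 |
| 2 | 3 | Mary3 | 1 | VT | 28 | 1991 | 25529 |
| 3 | 4 | Mary4 | 1 | MD | 38 | 1981 | 77397 |
| 4 | 5 | Mary5 | 1 | AZ | 32 | 1987 | 67701 |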


In [3]:
df.tail()

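| | pid | name | managerId | state | age | birth | salary |
|--:|--:|:--|--:|:--|--:|--:|--:|
| 1995 | 1996 | Mary1996 | 37 | CO | 20 | 1999 | 84413 |
| 1996 | 1997 | Mary1997 | 30 | IA | 50 | 1969 | 58411 |
| 1997 | 1998 | Mary1998 | 12 | AK | 43 | 1976 | 77986 |
| 1998 | 1999 | Mary1999 | 14 | NV | 47 | 1972 | 15958 |
| 1999 | 2000 | Mary2000 | 98 | KY | 39 | 1980 | 92540 |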


In [4]:
df.shape

(2000, 7)

In [5]:
# Convert the DataFrame into a list of dictionaries, one per row
empdata = df.to_dict(orient='records')

In [6]:
empdata[:5]

[{'pid': 1,
  'name': 'Mary1',
  'managerId': 1,
  'state': 'WI',
  'age': 24,
  'birth': 1995,
  'salary': 46436},
 {'pid': 2,
  'name': 'Mary2',
  'managerId': 1,
  'state': 'MI',
  'age': 38,
  'birth': 1981,
  'salary': 68949},
 {'pid': 3,
  'name': 'Mary3',
  'managerId': 1,
  'state': 'VT',
  'age': 28,
  'birth': 1991,
  'salary': 25529},
 {'pid': 4,
  'name': 'Mary4',
  'managerId': 1,
  'state': 'MD',
  'age': 38,
  'birth': 1981,
  'salary': 77397},
 {'pid': 5,
  'name': 'Mary5',
  'managerId': 1,
  'state': 'AZ',
  'age': 32,
  'birth': 1987,
  'salary': 67701}]

In [8]:
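# Connect to MongoDB (MongoClient() with no arguments uses localhost:27017)
client = MongoClient()
db = client.lab7         # database named "lab7"
employees = db.lab7      # collection inside it, also named "lab7"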

# Insert data into MongoDB collection
employees.insert_many(empdata)

<pymongo.results.InsertManyResult at 0x7fe10afa3b40>

In [11]:
emp_count = employees.count_documents({})
print(f"Number of documents in the collection: {emp_count}")

Number of documents in the collection: 2000
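
Re-running the insert cell above adds another 2000 documents each time, which would inflate every count in the aggregates that follow. A minimal sketch of one way to keep the load repeatable is shown here; the helper name `reload_collection` is hypothetical (not part of PyMongo), and it only assumes the object passed in offers `delete_many`, `insert_many`, and `count_documents`, as a PyMongo collection does.

```python
def reload_collection(collection, records):
    """Empty the collection, insert the records, and return the new document count."""
    # Remove documents left over from earlier runs of the notebook
    collection.delete_many({})
    # PyMongo's insert_many rejects an empty list, so skip the call in that case
    if records:
        collection.insert_many(records)
    return collection.count_documents({})
```

In this notebook the call would look like `reload_collection(employees, empdata)`, and the returned value should stay at 2000 no matter how many times the cell runs.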


## 2. Answer the questions
### Q1: All employees grouped by state, with the number of employees in each state

In [13]:
q1 = [
    {'$group': {'_id': '$state', 'count': {'$sum': 1}}},
    {'$sort': {'_id': 1}}
]
result_q1 = employees.aggregate(q1)
for state_num in result_q1:
    print(f"State: {state_num['_id']}, Number of employees: {state_num['count']}")

State: AK, Number of employees: 45
State: AL, Number of employees: 40
State: AZ, Number of employees: 69
State: CA, Number of employees: 45
State: CO, Number of employees: 38
State: CT, Number of employees: 43
State: DE, Number of employees: 39
State: FL, Number of employees: 40
State: GA, Number of employees: 45
State: HI, Number of employees: 34
State: IA, Number of employees: 42
State: ID, Number of employees: 47
State: IL, Number of employees: 43
State: IN, Number of employees: 42
State: KS, Number of employees: 29
State: KY, Number of employees: 39
State: LA, Number of employees: 45
State: MA, Number of employees: 44
State: MD, Number of employees: 44
State: ME, Number of employees: 36
State: MI, Number of employees: 37
State: MN, Number of employees: 31
State: MO, Number of employees: 42
State: MS, Number of employees: 46
State: MT, Number of employees: 41
State: NC, Number of employees: 47
State: ND, Number of employees: 28
State: NE, Number of employees: 30
State: NH, Number of
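
As a cross-check on the `$group` / `$sum` / `$sort` pipeline, the same counts can be computed in plain Python from the list of records. The sketch below assumes nothing about MongoDB; `count_by_field` is a hypothetical helper and `sample` is made-up data, not rows from empdata.csv.

```python
from collections import Counter


def count_by_field(records, field):
    """Return (value, count) pairs sorted by value, mirroring $group + {'$sum': 1} + $sort."""
    counts = Counter(record[field] for record in records)
    return sorted(counts.items())


# Hypothetical sample records, only for illustration
sample = [
    {'state': 'WI'}, {'state': 'AZ'}, {'state': 'WI'},
    {'state': 'MI'}, {'state': 'AZ'}, {'state': 'AZ'},
]

print(count_by_field(sample, 'state'))  # [('AZ', 3), ('MI', 1), ('WI', 2)]
```

Calling `count_by_field(empdata, 'state')` on the full list should give the same numbers as the aggregate, provided the collection holds exactly one copy of the data.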

### Q2: Average age of people in each state, grouped by state

In [15]:
q2 = [
    {'$group': {'_id': '$state', 'avg_age': {'$avg': '$age'}}},
    {'$sort': {'_id': 1}}
]
result_q2 = employees.aggregate(q2)
for avg_age_state in result_q2:
    print(f"State: {avg_age_state['_id']}, Average age: {avg_age_state['avg_age']}")

State: AK, Average age: 45.4
State: AL, Average age: 46.025
State: AZ, Average age: 46.79710144927536
State: CA, Average age: 40.13333333333333
State: CO, Average age: 42.05263157894737
State: CT, Average age: 40.23255813953488
State: DE, Average age: 39.17948717948718
State: FL, Average age: 36.65
State: GA, Average age: 40.666666666666664
State: HI, Average age: 41.64705882352941
State: IA, Average age: 43.95238095238095
State: ID, Average age: 41.212765957446805
State: IL, Average age: 43.46511627906977
State: IN, Average age: 41.166666666666664
State: KS, Average age: 43.55172413793103
State: KY, Average age: 44.02564102564103
State: LA, Average age: 41.977777777777774
State: MA, Average age: 38.27272727272727
State: MD, Average age: 43.54545454545455
State: ME, Average age: 42.25
State: MI, Average age: 43.75675675675676
State: MN, Average age: 38.16129032258065
State: MO, Average age: 42.595238095238095
State: MS, Average age: 40.630434782608695
State: MT, Average age: 40.0731707
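
The averages above print with up to 15 decimal places. One possible way to tidy them is to round inside the pipeline with a `$project` stage and the `$round` operator, which assumes a server running MongoDB 4.2 or later. The function name `avg_age_pipeline` and the `places` parameter are hypothetical; the sketch only builds the pipeline and does not connect to a server.

```python
def avg_age_pipeline(group_field, places=2):
    """Build a pipeline that averages 'age' per group and rounds the result."""
    return [
        # Same grouping as q2, with the group field passed in as a parameter
        {'$group': {'_id': f'${group_field}', 'avg_age': {'$avg': '$age'}}},
        # $round needs MongoDB 4.2+; _id is kept by $project by default
        {'$project': {'avg_age': {'$round': ['$avg_age', places]}}},
        {'$sort': {'_id': 1}},
    ]


q2_rounded = avg_age_pipeline('state')
```

The result would be passed to `employees.aggregate(q2_rounded)` in the same way as `q2`. Formatting on the Python side, for example `f"{value:.2f}"`, is an alternative that works on any server version.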

### Q3: Number of people and average age of people in each manager's group, grouped and sorted by manager id

In [17]:
q3 = [
    {'$group': {'_id': '$managerId', 'count': {'$sum': 1}, 'avg_age': {'$avg': '$age'}}},
    {'$sort': {'_id': 1}}
]
result_q3 = employees.aggregate(q3)
for ppl_group in result_q3:
    print(f"Manager ID: {ppl_group['_id']}, Number of people: {ppl_group['count']}, Average age: {ppl_group['avg_age']}")

Manager ID: 1, Number of people: 100, Average age: 41.64
Manager ID: 2, Number of people: 26, Average age: 39.80769230769231
Manager ID: 3, Number of people: 22, Average age: 43.5
Manager ID: 4, Number of people: 19, Average age: 45.78947368421053
Manager ID: 5, Number of people: 32, Average age: 43.40625
Manager ID: 6, Number of people: 17, Average age: 37.705882352941174
Manager ID: 7, Number of people: 20, Average age: 39.25
Manager ID: 8, Number of people: 19, Average age: 42.94736842105263
Manager ID: 9, Number of people: 18, Average age: 45.05555555555556
Manager ID: 10, Number of people: 13, Average age: 44.61538461538461
Manager ID: 11, Number of people: 21, Average age: 42.904761904761905
Manager ID: 12, Number of people: 16, Average age: 41.0
Manager ID: 13, Number of people: 18, Average age: 39.888888888888886
Manager ID: 14, Number of people: 25, Average age: 41.72
Manager ID: 15, Number of people: 21, Average age: 42.333333333333336
Manager ID: 16, Number of people: 18, Av
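
A quick sanity check on any unfiltered `$group` result is that the per-group counts add up to the number of documents in the collection (2000 here). The sketch below uses a hypothetical helper, `group_counts_match_total`, and feeds it the first three rows of the output above.

```python
def group_counts_match_total(groups, expected_total, count_key='count'):
    """Return True when the group counts sum to the expected document total."""
    return sum(group[count_key] for group in groups) == expected_total


# First three groups from the Q3 output: 100 + 26 + 22 = 148
sample_groups = [
    {'_id': 1, 'count': 100, 'avg_age': 41.64},
    {'_id': 2, 'count': 26, 'avg_age': 39.80769230769231},
    {'_id': 3, 'count': 22, 'avg_age': 43.5},
]

print(group_counts_match_total(sample_groups, 148))   # True
print(group_counts_match_total(sample_groups, 2000))  # False
```

Against the live collection, the check could be written as `group_counts_match_total(list(employees.aggregate(q3)), employees.count_documents({}))`. The cursor returned by `aggregate` can only be iterated once, so it is converted to a list before use.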

### Q4: Average age and number of employees in each manager group, where the employees' ages are > 40, grouped and sorted by manager id

In [18]:
q4 = [
    {'$match': {'age': {'$gt': 40}}},
    {'$group': {'_id': '$managerId', 'count': {'$sum': 1}, 'avg_age': {'$avg': '$age'}}},
    {'$sort': {'_id': 1}}
]
result_q4 = employees.aggregate(q4)
for ppl_group_40 in result_q4:
    print(f"Manager ID: {ppl_group_40['_id']}, Number of employees: {ppl_group_40['count']}, Average age: {ppl_group_40['avg_age']}")

Manager ID: 1, Number of employees: 51, Average age: 53.666666666666664
Manager ID: 2, Number of employees: 13, Average age: 52.07692307692308
Manager ID: 3, Number of employees: 14, Average age: 55.42857142857143
Manager ID: 4, Number of employees: 11, Average age: 54.63636363636363
Manager ID: 5, Number of employees: 19, Average age: 54.1578947368421
Manager ID: 6, Number of employees: 8, Average age: 49.0
Manager ID: 7, Number of employees: 8, Average age: 52.75
Manager ID: 8, Number of employees: 13, Average age: 51.76923076923077
Manager ID: 9, Number of employees: 11, Average age: 56.63636363636363
Manager ID: 10, Number of employees: 10, Average age: 48.5
Manager ID: 11, Number of employees: 11, Average age: 55.72727272727273
Manager ID: 12, Number of employees: 9, Average age: 51.888888888888886
Manager ID: 13, Number of employees: 9, Average age: 53.333333333333336
Manager ID: 14, Number of employees: 12, Average age: 55.333333333333336
Manager ID: 15, Number of employees: 11,
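
Because `$match` runs before `$group`, only employees older than 40 reach the grouping stage, so both the count and the average change compared with Q3. The plain-Python sketch below imitates that stage order on five rows taken from the `head()` and `tail()` output earlier in the notebook; `group_age_stats` is a hypothetical helper, not a PyMongo function.

```python
from collections import defaultdict


def group_age_stats(records, min_age=None):
    """Imitate $match (age > min_age), then $group by managerId, then $sort by _id."""
    ages_by_manager = defaultdict(list)
    for record in records:
        # Same effect as {'$match': {'age': {'$gt': min_age}}}
        if min_age is not None and record['age'] <= min_age:
            continue
        ages_by_manager[record['managerId']].append(record['age'])
    return [
        {'_id': manager_id, 'count': len(ages), 'avg_age': sum(ages) / len(ages)}
        for manager_id, ages in sorted(ages_by_manager.items())
    ]


# Rows copied from the DataFrame previews (Mary1, Mary2, Mary1997, Mary1998, Mary1999)
sample = [
    {'managerId': 1, 'age': 24},
    {'managerId': 1, 'age': 38},
    {'managerId': 30, 'age': 50},
    {'managerId': 12, 'age': 43},
    {'managerId': 14, 'age': 47},
]

for row in group_age_stats(sample, min_age=40):
    print(row)
```

In this small sample both of manager 1's employees are 40 or younger, so manager 1 does not appear in the filtered result at all. A group whose members all fail the `$match` is absent from the output rather than listed with a count of zero.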