# Big Data Modelling and Management 2022

Group number: 


1. Joana Tavares, m20210621

2. Laura Santos, r20181094

3. Maria Oliveira, m20210612

4. Mariana Ferreira, r20181071


## 🚚 BDMM Second Homework Assignment 🚚 

_Wide World Importers (WWI) is a wholesale novelty goods importer and distributor operating from the San Francisco Bay Area. In this assignment we will be working with their database._ 
You can get more information and details about the WWI database in the following link: https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-what-is?view=sql-server-ver15

The focus of the second assignment is modelling. We will use the Wide World Importers database and convert it to a document-based database. To that end, we will be leveraging concepts like data denormalization, indices, and MongoDB design patterns. 

More information on the extended data model can be found here: <br>  
https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-oltp-database-catalog?view=sql-server-ver15

## Problem Description

Your team has just arrived at WWI (a leading company in logistics). Welcome!   <br>
Even though business is thriving, the IT department is going through a bad time.   <br>
Digitalization was never a priority for the company and now the company's operational and analytical requirements are starting to grow beyond the capabilities of their existing data architecture.   <br>

WWI data are spread across different systems, but we've already managed to pull them all into a mongo dump file. This data file is an exact dump of the SQL data, so it keeps the same structure: the SQL tables become collections and the rows become documents. This means all the original SQL keys are included in the data.<br>
Currently, the costs to develop the necessary queries to collect data to answer questions asked by the different departments are too high. <br>

Management concluded it is the right time to revise and revamp the data architecture, in order to speed up operations. 

In that context, your team was tasked with merging all the company data into a single and coherent Mongo database. <br>
It is expected that, with your solution, WWI will have a better understanding of their business and that the different departments will be able to obtain efficiently the answers they need.

The WWI team shared with you an ERD of their current datamodel:<br>
![datamodel](./WWI.png)

**Note** You can open the file WWI.png that is in the same directory as this notebook to see the above image in more detail and zoom in as you need.

Additionally, the WWI team asked you to deliver the following outputs in **4 weeks**:
- Understand and model the database in MongoDB.
- Set up the database so that it performs well for the queries they have provided. You should include reasoning in comments for the decisions you make on modelling the database.
- Answer the questions (queries) on the data provided.  
- Submit the results by following the instructions.  

With these deliveries, you will have created a prototype that allows the management to decide whether MongoDB is a good solution that meets their requirements.

### Design Requirements

Note that WWI has the following query requirements for the database.

1. The web team needs to know:  
    1. Which state province do we have the most suppliers in?  
    2. How many people have three or more `OtherLanguage`? 
    3. Top 10 most common `OtherLanguage` for people records. 
    4. How many customer records are valid after `November 2015`? 
    5. What percentage of people records don't have the UserPreferences field? 

2. The warehouse group needs to know:  
    1. What is the average difference in days between OrderDate and ExpectedDeliveryDate for orders sold by (`SalespersonPersonID`) person with the name `Jack Potter`?
    2. Which items get ordered the most in bulk (largest average quantity ordered)?  
    3. Which two items get ordered together the most?
    4. For each customer category which 3 items have been ordered the most?
    5. What is the current stock of each stockgroup?

3. The CFO needs to know:  
    1. What is the monthly total order count for each month?  
    2. How many orders are there from the customer `Tailspin Toys (Head Office)`?
    3. What are the average monthly sales prices of all goods sold? 
    4. In each state province what is the average customer credit limit?   
    5. What are the yearly expenditures with each supplier (per supplier name)?  

4. Partnerships needs to know:  
    1. What is the most common payment type?  
    2. What percentage of people have their `Title` as `Team Member`?
    3. Which supplier of the category `Novelty Goods Supplier` has the most transactions?  
    4. What is the highest `CommissionRate` that a person has?

5. The marketing team needs to know:  
    1. What is the name of the sales person with the largest sum of invoice values in 2013 (person whose customers paid the most money)?
    2. Who are the most common `PickedByPersonID` person names for orders done by customer `Adriana Pena`?
    3. How many people have in their name the string `Sara`?
    4. What are the top 10 most common names (Primary or Surnames) of people?

Transform the mongo dump file provided with this notebook and model a database following MongoDB's best practices. You should adjust the data model to best fit the use cases provided above. Think about collections, embedding, linking, indexing, and the patterns learned in class. Provide justifications for each decision you make. What, if any, are the trade-offs or disadvantages of your approach?

Use MongoDB queries to answer the questions on your transformed database.

### Deliverables

1. Notebook with all DB creation operations and CRUD operations to create the data model. **Important** you should include in comments justification for your decisions on modelling the data;
2. Second notebook with all required queries and answers for the questions, **Important** please indicate with comments the steps in the data model you took to optimise each query;


# Additional Information

## Groups  

Students should form groups of at least 4 and at most 5. <br>

## Submission  Deadline

The submission includes two notebooks with outputs (cells must be run). 
Please make sure to indicate:
1. group number,
2. group members with student names and numbers,
3. the name of the database that you created. <br>

Upload the notebooks on Moodle before **23:59 on June 22nd**

## Evaluation   

The second homework assignment counts 40% towards your final mark of the curricular unit. <br>
The assignment will be scored from 0 to 20. <br>

Each group submission will be evaluated on three components:
1. correctness of results;
2. simplicity and performance characteristics of the solution;
3. justification of decisions.

50% -  Database design  
50% -  Query results including performance

Please note that all code delivered in this assignment will go through plagiarism automated checks. <br>
Groups with high similarity levels in their code will undergo investigation.


# Imports

In [33]:
import pandas as pd
from tqdm.notebook import tqdm
from pprint import pprint
import numpy as np
from pymongo import MongoClient
from datetime import datetime

# Connect to database

In [2]:
#!pip install pymongo

user ="r20181094"
password="password"
host="localhost"
port="27017"
protocol="mongodb"

client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}")

In [3]:
db = client.WideWorldImporters
print(f"Database info: {db}\n")

db.list_collection_names()

Database info: Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'WideWorldImporters')



['customercategories',
 'suppliers',
 'purchaseorders_embed',
 'customers',
 'stockitems',
 'invoicelines',
 'orders',
 'orders_emb',
 'paymentmethods',
 'colors',
 'customertransactions',
 'orderlines',
 'invoices_emb',
 'cities_embed',
 'supplier_transactions_embed',
 'stockitemsstockgroups',
 'suppliercategories',
 'cities',
 'suppliers_embed',
 'stockgroups',
 'stockitemstransactions',
 'customer_transactions_embed',
 'suppliertransactions',
 'purchaseorderlines',
 'purchaseorders',
 'supplier_embed',
 'customers_emb',
 'invoices',
 'suppliers_cities_embed',
 'people',
 'stateprovinces',
 'customer_embed',
 'transactiontypes',
 'packagetypes',
 'countries',
 'deliverymethods']

In [4]:
for collection in db.list_collection_names():
    print(collection)
    pprint(db[collection].find_one())
    print()

customercategories
{'CustomerCategoryID': 1,
 'CustomerCategoryName': 'Agent',
 'LastEditedBy': 1,
 'ValidFrom': datetime.datetime(2013, 1, 1, 0, 0),
 'ValidTo': datetime.datetime(9999, 12, 31, 23, 59, 59, 999000),
 '_id': ObjectId('6287c57e636e5a12693dc7ee')}

suppliers
{'AlternateContactPersonID': 22,
 'BankAccountBranch': 'Woodgrove Bank Zionsville',
 'BankAccountCode': '356981',
 'BankAccountName': 'A Datum Corporation',
 'BankAccountNumber': '8575824136',
 'BankInternationalCode': '25986',
 'DeliveryAddressLine1': 'Suite 10',
 'DeliveryAddressLine2': '183838 Southwest Boulevard',
 'DeliveryCityID': 38171,
 'DeliveryLocation': None,
 'DeliveryMethodID': 7,
 'DeliveryPostalCode': '46077',
 'FaxNumber': '(847) 555-0101',
 'InternalComments': None,
 'LastEditedBy': 1,
 'PaymentDays': 14,
 'PhoneNumber': '(847) 555-0100',
 'PostalAddressLine1': 'PO Box 1039',
 'PostalAddressLine2': 'Surrey',
 'PostalCityID': 38171,
 'PostalPostalCode': '46077',
 'PrimaryContactPersonID': 21,
 'Supplier

# Data Denormalization, Indexes and Modifications in Studio 3T

For data denormalization, information about indexes and alterations made in Studio 3T, please run the Jupyter notebook called "...._CRUD-OPERATIONS" first.

# Question 1

**A. Which state province do we have the most suppliers in?**

In [34]:
#we group by cities to access StateProvinceName to get the name of the provinces and count how many suppliers
#there are in each province
query_1a_1 = {
    '$group':{
            '_id':'$cities_embed.provinces.StateProvinceName',
            'count':{'$sum':1}
    }
}

#sort the output by count in descending order
query_1a_2 = {
    "$sort": {"count": -1}
}

#limits the results to 10
query_1a_3 = {
    '$limit': 10
}

pipeline = [query_1a_1, query_1a_2, query_1a_3]

q_1_a = list(db.suppliers_cities_embed.aggregate(pipeline))

q_1_a

[{'_id': 'California', 'count': 3},
 {'_id': 'Tennessee', 'count': 2},
 {'_id': 'New Jersey', 'count': 1},
 {'_id': 'Washington', 'count': 1},
 {'_id': 'Missouri', 'count': 1},
 {'_id': 'North Carolina', 'count': 1},
 {'_id': 'Minnesota', 'count': 1},
 {'_id': 'Indiana', 'count': 1},
 {'_id': 'South Dakota', 'count': 1},
 {'_id': 'Kentucky', 'count': 1}]

**Answer:** The state province with the largest number of suppliers is California.

**B. How many people have three or more ``OtherLanguage`` ?**

In [35]:
#we use $cond to compute the size of the OtherLanguages array, that is, how many other languages the person has,
#or to return 0 if the OtherLanguages field is empty (not an array)
query_1b_1 = {
    '$project' : {
        '_id' : 0,
        'OtherLanguages' : 1,
        'languages' : {'$cond': { 'if': {'$isArray': '$OtherLanguages'}, 'then': {'$size':"$OtherLanguages"}, 'else': 0 }}
    }
}

#We used $match to keep only the records whose number of other languages is greater than 2
query_1b_2 = {
    '$match' : {
        'languages' : {'$gt' : 2}
    }
}

#we count how many records exist with these conditions
query_1b_3 = {
    '$count' : 'People that have three or more Other languages'
}

pipeline = [query_1b_1, query_1b_2, query_1b_3]

q_1_b = list(db.people.aggregate(pipeline))

q_1_b

[{'People that have three or more Other languages': 4}]

**Answer:** 4 people have three or more OtherLanguage.
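The null handling in this query is the subtle part, so here is a minimal plain-Python sketch of the same logic. It does not touch MongoDB; the `people` list and the helper `language_count` are hypothetical stand-ins for the collection and the `$cond`/`$size` step.

```python
# Toy stand-ins for documents in the people collection (hypothetical data).
people = [
    {"FullName": "A", "OtherLanguages": ["Greek", "Dutch", "Finnish"]},
    {"FullName": "B", "OtherLanguages": ["Polish"]},
    {"FullName": "C", "OtherLanguages": None},  # field present but null
    {"FullName": "D"},                          # field missing
]


def language_count(doc):
    """Mirror of the $cond/$size step: array length, or 0 when there is no array."""
    langs = doc.get("OtherLanguages")
    return len(langs) if isinstance(langs, list) else 0


# Equivalent of the $match (> 2) followed by $count.
three_or_more = sum(1 for doc in people if language_count(doc) > 2)
print(three_or_more)
```

Calling `$size` directly on a null value raises an error in MongoDB, which is why the guard comes before the size is taken.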

**C. Top 10 most common ``OtherLanguage`` for people records.**

In [36]:
#we unwind the OtherLanguages field because this field has multiple languages inside the array
#and if we didn't do that it would return an array with the languages inside, and we want to count the languages separately
query_1c_1 = {'$unwind':"$OtherLanguages"}

#we count and order the OtherLanguages field
query_1c_2 = {
    '$group':{
            '_id':"$OtherLanguages",
            'count':{'$sum':1}
    }
}

#sort the output by count in descending order
query_1c_3 = {
    "$sort": {"count": -1}
}

#limits the results to 10
query_1c_4 = {
    '$limit': 10
}

pipeline = [query_1c_1, query_1c_2, query_1c_3, query_1c_4]

q_1_c = list(db.people.aggregate(pipeline))

q_1_c

[{'_id': 'Greek', 'count': 3},
 {'_id': 'Dutch', 'count': 3},
 {'_id': 'Finnish', 'count': 3},
 {'_id': 'Romanian', 'count': 2},
 {'_id': 'Polish', 'count': 2},
 {'_id': 'Slovak', 'count': 2},
 {'_id': 'Lithuanian', 'count': 2},
 {'_id': 'Croatian', 'count': 2},
 {'_id': 'Arabic', 'count': 2},
 {'_id': 'Bulgarian', 'count': 1}]

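**Answer:** In the output above, the 10 most common OtherLanguage values for people records are: Greek, Dutch, Finnish, Romanian, Polish, Slovak, Lithuanian, Croatian, Arabic and Bulgarian. The tenth entry has a count of only 1, so another language with a single occurrence (such as Bokmål) may take its place when the query is rerun.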
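As a rough illustration of what the `$unwind`, `$group`, `$sort` and `$limit` stages do together, the following sketch reproduces the same steps in plain Python on a hypothetical three-document sample (the data is made up and is not taken from the database).

```python
from collections import Counter

# Hypothetical sample of people documents.
people = [
    {"OtherLanguages": ["Greek", "Dutch"]},
    {"OtherLanguages": ["Greek"]},
    {"OtherLanguages": None},
]

# "$unwind": one entry per language; documents without an array contribute nothing.
unwound = [lang for doc in people for lang in (doc.get("OtherLanguages") or [])]

# "$group" + "$sort" + "$limit": count each language and keep the 10 most common.
top_languages = Counter(unwound).most_common(10)
print(top_languages)
```

Languages with equal counts have no guaranteed relative order after the `$sort` stage, so a secondary sort key (for example the language name) would make the cut-off at 10 reproducible.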

**D. How many customer records are valid after ``November 2015``?**

Partial indexes only index the documents in a collection that meet a specified filter expression. By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance (source: https://www.mongodb.com/docs/manual/core/index-partial/).

Since the web team needed to know how many customer records were valid after November 2015, we created a partial index that indexes only the documents whose ValidTo field is later than November 2015.
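To make the idea concrete, here is a small plain-Python sketch of which documents end up in a partial index and when a range query can rely on it. The helpers `in_partial_index` and `query_can_use_index` and the sample customers are hypothetical; they only model the rule that MongoDB uses a partial index when the query is guaranteed to match a subset of the indexed documents.

```python
from datetime import datetime

# Lower bound used in the partial filter expression of the index.
FILTER_BOUND = datetime(2015, 11, 30, 23, 59, 59, 999000)


def in_partial_index(doc, bound=FILTER_BOUND):
    """True when a document satisfies {"ValidTo": {"$gte": bound}} and is indexed."""
    valid_to = doc.get("ValidTo")
    return valid_to is not None and valid_to >= bound


def query_can_use_index(query_lower_bound, bound=FILTER_BOUND):
    """A range query on ValidTo can only use the partial index when every
    document it might match lies inside the indexed subset."""
    return query_lower_bound >= bound


# Hypothetical customer documents.
customers = [
    {"CustomerID": 1, "ValidTo": datetime(9999, 12, 31, 23, 59, 59, 999000)},
    {"CustomerID": 2, "ValidTo": datetime(2014, 6, 1)},
]

indexed_ids = [c["CustomerID"] for c in customers if in_partial_index(c)]
print(indexed_ids)
print(query_can_use_index(datetime(2016, 1, 1)))  # inside the indexed range
print(query_can_use_index(datetime(2015, 1, 1)))  # would need unindexed documents
```

This is why the bound in the query should be at least as late as the bound in the partial filter expression.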

In [31]:
db.customers.create_index(
    [('ValidTo', 1)],
    partialFilterExpression = {"ValidTo": {"$gte": datetime(2015, 11, 30, 23, 59, 59, 999000)}},
    name='ValidTo', 
)

'ValidTo'

In [37]:
#we count how many customer records are valid after November 2015
#we use datetime because the ValidTo field is stored as datetime.datetime
db.customers.count_documents({"ValidTo": {"$gt": datetime(2015, 11, 30, 23, 59, 59, 999000)}})

663

**Answer:** 663 customer records are valid after November 2015.

**E. What percentage of people records don't have the ``UserPreferences`` field?**

In [38]:
#we use $cond so that each document contributes 1 to nullcount if its UserPreferences field is null and 0 if the field is filled
query_1e_1 = {
    '$group': {
            '_id':0,
            'count': { '$sum': 1 },
            'nullcount': {'$sum': {'$cond': [{'$eq':["$UserPreferences",None]}, 1, 0]} }
        }
    }

#then we use $divide and $multiply to calculate nullcount as a percentage of the total
#number of documents in the people collection.
query_1e_2= {
    '$project': {
            'Percentage': {'$multiply': ["$nullcount", {'$divide': [100, "$count"]}] }
        }
}

pipeline = [query_1e_1, query_1e_2]

q_1_e  = list(db.people.aggregate(pipeline))

q_1_e

[{'_id': 0, 'Percentage': 83.61836183618362}]

**Answer:** 83.62% of people records do not have the UserPreferences field.

# Question 2

**A. What is the average difference in days between OrderDate and ExpectedDeliveryDate for orders sold by (`SalespersonPersonID`) person with the name `Jack Potter`?**


We need information about the order and expected delivery dates (orders collection) and about the salesperson (people collection). The steps are:
1. In the first query we use lookup to perform an equality match between the field PersonID of the people collection and the field SalespersonPersonID of the orders collection.
2. info_person is the new array field added to the joined documents.
3. We match the documents whose full name in the info_person array is equal to Jack Potter.
4. In query 3 we project the fields to pass to the next query: SalespersonPersonID, the full name of the salesperson and the difference between the two dates.
5. The difference is obtained by subtracting the order date from the expected delivery date.
6. The subtract operator returns the difference between the two dates in milliseconds.
7. We group by the name of the salesperson and compute the average difference over all the orders sold by him. In the final projection we divide this average by 86,400,000 to convert it from milliseconds to days (1 day = 24 × 60 × 60 × 1000 = 86,400,000 milliseconds).
8. We round the final value to 2 decimal places.
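The unit conversion in steps 5 to 8 can be checked with a short plain-Python sketch. The two orders below are hypothetical; only the arithmetic mirrors the pipeline.

```python
from datetime import datetime

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day

# Hypothetical orders with a 1-day and a 2-day gap.
orders = [
    {"OrderDate": datetime(2013, 1, 1), "ExpectedDeliveryDate": datetime(2013, 1, 2)},
    {"OrderDate": datetime(2013, 1, 5), "ExpectedDeliveryDate": datetime(2013, 1, 7)},
]

# $subtract on two dates yields milliseconds; a timedelta gives the same quantity.
diffs_ms = [
    (o["ExpectedDeliveryDate"] - o["OrderDate"]).total_seconds() * 1000
    for o in orders
]

# $avg over the group, then $divide by 86,400,000 and $round to 2 decimals.
average_days = round(sum(diffs_ms) / len(diffs_ms) / MS_PER_DAY, 2)
print(average_days)
```

Averaging in milliseconds and converting once at the end gives the same value as converting each difference first.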



The warehouse group was interested in the average difference in days between OrderDate and ExpectedDeliveryDate for orders sold by the person (SalespersonPersonID) named Jack Potter. The first step of our pipeline is a lookup that matches the field PersonID in the people collection with SalespersonPersonID in the orders collection. To speed up this step, we created indexes on these fields, which improved the query time.

In [48]:
db.people.create_index(
    [('PersonID', 1)],
    name='PersonID', 
)

'PersonID'

In [49]:
db.orders.create_index(
    [('SalespersonPersonID', 1)],
    name='SalespersonPersonID', 
)

'SalespersonPersonID'

In [51]:
query_2a_1 = {
    "$lookup":
    {
       'from': 'people',
       'localField': 'SalespersonPersonID',
       'foreignField': 'PersonID',
       'as': 'info_person'
     }
}

query_2a_2 ={
    '$match': {
        'info_person.FullName' : {'$eq' : 'Jack Potter'}
    }
}

query_2a_3 = {
    '$project' : {
        '_id' : False,
        'SalespersonPersonID' : 1,
        'info_person.FullName' : 1,
        'difference_days' : { '$subtract': [ '$ExpectedDeliveryDate', '$OrderDate' ] },
    }
}


query_2a_4 = {
    '$group': {
        '_id': {'FullName' : '$info_person.FullName'}, 
        'average_dif' : {'$avg' : '$difference_days'}            
    }
}

query_2a_5 = {
    '$project' : {
        '_id' : False,
        'Average difference between expected delivery date and order date in days' : {'$round' : [{'$divide' : ['$average_dif', 86400000]}, 2]}
    }
}

pipeline = [query_2a_1, query_2a_2, query_2a_3, query_2a_4, query_2a_5]
q_2_a = list(db.orders.aggregate(pipeline))
q_2_a

[{'Average difference between expected delivery date and order date in days': 1.45}]

**B. Which items get ordered the most in bulk (largest average quantity ordered)?**

In [52]:
#we grouped the invoice lines by their description and calculated the average quantity for each one
query_2b_1 ={
    '$group': {
        '_id': {'Description' : '$Description'},
        'Quantity' : {'$avg' : '$Quantity'}
    }
}

#sort the output by quantity in descending order
query_2b_2 = {
    '$sort' : { 'Quantity' : -1 }
}

#limits the results to 10
query_2b_3 = {
    '$limit': 10}


pipeline = [query_2b_1, query_2b_2, query_2b_3]

q_2_b = list(db.invoicelines.aggregate(pipeline))

q_2_b

[{'_id': {'Description': 'Black and orange fragile despatch tape 48mmx75m'},
  'Quantity': 199.35},
 {'_id': {'Description': 'Black and orange fragile despatch tape 48mmx100m'},
  'Quantity': 198.23950870010236},
 {'_id': {'Description': 'Clear packaging tape 48mmx75m'},
  'Quantity': 145.26190476190476},
 {'_id': {'Description': 'Shipping carton (Brown) 356x356x279mm'},
  'Quantity': 141.64338919925513},
 {'_id': {'Description': '3 kg Courier post bag (White) 300x190x95mm'},
  'Quantity': 141.48096564531104},
 {'_id': {'Description': 'Express post box 5kg (White) 350x280x130mm'},
  'Quantity': 140.68075117370893},
 {'_id': {'Description': 'Shipping carton (Brown) 413x285x187mm'},
  'Quantity': 139.18473138548538},
 {'_id': {'Description': 'Chocolate beetles 250g'},
  'Quantity': 138.9387755102041},
 {'_id': {'Description': 'Shipping carton (Brown) 279x254x217mm'},
  'Quantity': 138.81642512077295},
 {'_id': {'Description': 'Shipping carton (Brown) 480x270x320mm'},
  'Quantity': 137.48

**Answer:** The item with the largest average quantity ordered is "Black and orange fragile despatch tape 48mmx75m".

**C. Which two items get ordered together the most?**

The warehouse group wanted to know which two items get ordered together the most. This query was extremely slow because its first step looks up the OrderID field in orders_emb. To reduce the cost, we created an index on OrderID and another on the Description of the stock item (we use the description instead of the StockItemID to make the output easier to interpret). With the indexes the running time decreased significantly: before, the query took more than 25 minutes to run; with the indexes it took 8.94 s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), measured with %%timeit.

In [53]:
db.orders_emb.create_index(
    [('OrderID', 1)],
    name='OrderID', 
)

'OrderID'

In [25]:
db.orders_emb.create_index(
    [('order_lines.Description', 1)],
    name='Description', 
)

'Description'

In [55]:
# we start from the orders_emb collection, which has the order lines embedded in each order (with the indexes created before)
# query 1 uses a lookup on the same collection to add a field (orderlines_2) with the order lines of the same OrderID
# query 2 groups by OrderID and builds orderlines_2 with two fields, item_1 and item_2
# both fields hold the same array with the descriptions of all the items bought in that order
# query 3 unwinds orderlines_2, producing one document per element, so item_1 and item_2 are no longer inside an array of orderlines_2
# queries 4 and 5 unwind item_1 and then item_2: the first pairs each item with the whole item_2 array
# and the second outputs every combination of two items bought in each order (including an item paired with itself)
# we do not want pairs of two equal items, so query 6 adds the field Same_StockItems using $cmp, which compares two values
# $cmp returns 0 if the two compared values are equal, 1 if the first is greater than the second and -1 otherwise
# query 7 uses $match to pass on only the documents whose Same_StockItems is not 0, which excludes the pairs of equal items
# query 8 groups by the pair (item 1, item 2) and counts how many times each combination occurs
# query 9 sorts in descending order, so the combinations ordered together most often appear at the top
# query 10 limits the output to 1 document, that is, the pair of items with the highest "Number of times the items are ordered together"
# (there are other pairs with the same count, but the limit returns only the one that comes first)


query_2c_1 = {"$lookup": 
              {
             "from": "orders_emb",
            "localField" : "OrderID",
            "foreignField" : "OrderID",
            "as" : "orderlines_2"
              }
             }

query_2c_2 = {
    "$group":
    {
       "_id": "$OrderID",
        "orderlines_2": {
        "$push":{
        "item_1": "$order_lines.Description",
        "item_2": "$order_lines.Description",
        }
    }
}}

query_2c_3 = {"$unwind": "$orderlines_2"}
query_2c_4 = {"$unwind": "$orderlines_2.item_1"}
query_2c_5 = {"$unwind": "$orderlines_2.item_2"}

query_2c_6 = { "$addFields":{
        "Same_StockItems":{ "$cmp": [ "$orderlines_2.item_1", "$orderlines_2.item_2" ] }
    }
}


query_2c_7 = {"$match":{
    
    "Same_StockItems":{"$ne":0} 
      
}}


query_2c_8 = {
    
    "$group":{"_id":{"Item 1":"$orderlines_2.item_1","Item 2":"$orderlines_2.item_2"}, 
              
              "Number of times the items are ordered together":{                
                  "$sum":1 
              }
              
    }}
    


query_2c_9 = {"$sort":{"Number of times the items are ordered together":-1}} 

query_2c_10 = {"$limit":1}

pipeline=[query_2c_1, query_2c_2, query_2c_3, query_2c_4, query_2c_5, query_2c_6, query_2c_7, query_2c_8, query_2c_9, query_2c_10]

q_2_c_2 = list(db.orders_emb.aggregate(pipeline))
q_2_c_2

[{'_id': {'Item 1': 'Developer joke mug - fun was unexpected at this time (Black)',
   'Item 2': 'Air cushion film 200mmx200mm 325m'},
  'Number of times the items are ordered together': 30}]
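For comparison, the pair-counting idea can be sketched in plain Python with `itertools.combinations`. This is only an illustration on hypothetical orders, not the method used in the pipeline above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: OrderID -> descriptions of the items on the order.
orders = {
    1: ["mug", "tape", "carton"],
    2: ["mug", "tape"],
    3: ["tape", "carton"],
    4: ["mug", "tape"],
}

pair_counts = Counter()
for items in orders.values():
    # sorted(set(...)) drops duplicates and makes (a, b) and (b, a) the same key.
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

most_common_pair, times = pair_counts.most_common(1)[0]
print(most_common_pair, times)
```

One difference worth noting: the pipeline keeps ordered pairs, so each pair of items appears twice (once in each order) with the same count, whereas this sketch counts each unordered pair once.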

**D. For each customer category which 3 items have been ordered the most?**

We need information about the customer category (customer_embed collection) and about the ordered items (invoices_emb collection). The step-by-step resolution is:
1. In the first query, we use lookup to perform an equality match between the field CustomerID of the input documents and the same field in the joined collection, customer_embed.
2. info_customer is the new array field added to the joined documents.
3. In the second query, we specify the fields to pass along to the next query: CustomerCategoryName, which is inside the field customer_category, and StockItemID.
4. Next, we unwind the field invoicelines, because some invoices have multiple invoice lines. Without the unwind, the group stage would return an array with the stock items of each invoice, which we do not want.
5. After the unwind, we group by customer category and stock item ID, so that each item appears separately, and count how many order lines exist for each combination of item and customer category. This returns the number of orders for every combination, but we need a further group stage per category to be able to sort and keep 3 items for every category.
6. Therefore, we perform another group stage, grouping by customer category and pushing the item and its number of orders to an array.
7. To keep the 3 most ordered items of each category we first need to sort by number of orders, so we unwind the OrderInfo field (where the number of orders is located) and sort in descending order so that the most ordered items appear at the beginning.
8. We then perform another group stage, similar to the previous one, but now with the values in descending order.
9. Finally, we project the fields to appear in the output: the customer category and the 3 most ordered items. We use the slice operator, which returns the first 3 values of the array; since the array is sorted, these belong to the items with the largest number of orders. The projection slices OrderInfo.nr_orders, so the output lists the order counts of those three items.
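The group, sort and slice steps above amount to a "top 3 per group" computation. A minimal plain-Python sketch of that idea, on hypothetical (category, StockItemID) pairs, could look like this:

```python
from collections import Counter, defaultdict

# Hypothetical unwound invoice lines: (customer category, StockItemID).
invoice_lines = [
    ("Supermarket", 1), ("Supermarket", 1), ("Supermarket", 2),
    ("Supermarket", 3), ("Supermarket", 3), ("Supermarket", 3),
    ("Corporate", 2), ("Corporate", 2), ("Corporate", 5),
]

# First group: count lines per (category, item).
counts = defaultdict(Counter)
for category, item in invoice_lines:
    counts[category][item] += 1

# Second group + sort + slice: keep the 3 most ordered items of each category.
top3 = {category: c.most_common(3) for category, c in counts.items()}
print(top3)
```

Keeping the item identifier next to its count, as this sketch does, makes the result directly readable as "which items"; in the pipeline this would correspond to slicing the whole `OrderInfo` array rather than only its `nr_orders` values.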

Finally, we created indexes for the question about which 3 items have been ordered the most for each customer category. As before, the first step of the pipeline is a lookup aggregation on the field CustomerID, and this query was also taking too much time. Thus, we created 2 indexes on this field, one in the invoices_emb collection and another in the customer_embed collection. After creating the indexes, the performance improved significantly, with a running time of 7.88 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

In [58]:
db.invoices_emb.create_index(
    [('CustomerID', 1)],
    name='CustomerID', 
)

'CustomerID'

In [59]:
db.customer_embed.create_index(
    [('CustomerID', 1)],
    name='CustomerID', 
)

'CustomerID'

In [60]:
query_2d_1 = {
    "$lookup":
    {
       'from': 'customer_embed',
       'localField': 'CustomerID',
       'foreignField': 'CustomerID',
       'as': 'info_customer'
     }
}


query_2d_2 = {
    '$project' : 
    {
        '_id' : False,
        'info_customer.customer_category.CustomerCategoryName' : 1,
        'invoicelines.StockItemID':1
    }
}

query_2d_3 = {'$unwind':  "$invoicelines"}


query_2d_4 = {
    '$group': 
    {
        '_id': {'Customer Category' : '$info_customer.customer_category.CustomerCategoryName', 'item': '$invoicelines.StockItemID'}, 
        'number of order items' : {'$sum' : 1}
    }
}


query_2d_5 = {
    "$group": 
    {
    "_id": "$_id.Customer Category",
    "OrderInfo": 
        {"$push": 
            {"item": '$_id.item',"nr_orders": "$number of order items"}
        }
    }
}

query_2d_6 = {'$unwind': '$OrderInfo'}

query_2d_7 = {'$sort': {'OrderInfo.nr_orders' : -1 }}

query_2d_8 = { 
    "$group": 
    {
    "_id": "$_id",
    "OrderInfo": 
        {"$push": "$OrderInfo"}
    }
}


query_2d_9 = {
    '$project': 
    {
    "_id": False,
    "Customer Category": "$_id",
    "Most ordered items": { "$slice": [ "$OrderInfo.nr_orders", 3 ]}
    }
}


In [61]:
pipeline = [query_2d_1, query_2d_2, query_2d_3, query_2d_4, query_2d_5,query_2d_6, query_2d_7, query_2d_8, query_2d_9]
q_2_d = list(db.invoices_emb.aggregate(pipeline))
q_2_d

[{'Customer Category': ['Supermarket'], 'Most ordered items': [110, 106, 104]},
 {'Customer Category': ['Gift Store'], 'Most ordered items': [95, 89, 88]},
 {'Customer Category': ['Novelty Shop'],
  'Most ordered items': [821, 814, 806]},
 {'Customer Category': ['Corporate'], 'Most ordered items': [92, 91, 91]},
 {'Customer Category': ['Computer Store'], 'Most ordered items': [98, 94, 94]}]

**E. What is the current stock of each stockgroup?**

In [63]:
query_2e_1 = {
    '$project': {
        '_id': 1,
        'StockGroupID': 1,
        'StockItemID': 1,
        'GroupName': '$stockgroup.StockGroupName'
    }
}


query_2e_2 = {
    '$group': {
            '_id': '$GroupName',
            'NumberOfItems': {'$sum': 1}
    }
}


query_2e_3 = { 
        '$sort' : {'NumberOfItems': -1}
    }


pipeline = [query_2e_1, query_2e_2, query_2e_3]
q_2_e = list(db.itemstockgroups_emb.aggregate(pipeline))
q_2_e

[]

# Question 3

**A. What is the monthly total order count for each month?**

In [64]:
# we grouped by the month of each Order Date and computed the number of orders for each month using the sum aggregation function
query_3a_1 = {
    "$group": 
    {
    "_id": {"month": { "$month": { "$toDate": "$OrderDate"}}}, 
    "numberoforders": {"$sum": 1} 
    }
}

# sort the output by month number
query_3a_2 = {"$sort":{"_id":1}}


pipeline = [query_3a_1, query_3a_2]
q_3_a = list(db.orders.aggregate(pipeline))
q_3_a

[{'_id': {'month': 1}, 'numberoforders': 7239},
 {'_id': {'month': 2}, 'numberoforders': 6115},
 {'_id': {'month': 3}, 'numberoforders': 7129},
 {'_id': {'month': 4}, 'numberoforders': 7497},
 {'_id': {'month': 5}, 'numberoforders': 7722},
 {'_id': {'month': 6}, 'numberoforders': 5551},
 {'_id': {'month': 7}, 'numberoforders': 6167},
 {'_id': {'month': 8}, 'numberoforders': 4908},
 {'_id': {'month': 9}, 'numberoforders': 5319},
 {'_id': {'month': 10}, 'numberoforders': 5504},
 {'_id': {'month': 11}, 'numberoforders': 5014},
 {'_id': {'month': 12}, 'numberoforders': 5430}]
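The query above groups on the month number only, so orders from the same month of different years fall into one bucket. If one figure per calendar month is wanted instead, grouping on year and month is an alternative reading of the question. The difference is sketched below in plain Python on hypothetical order dates.

```python
from collections import Counter
from datetime import datetime

# Hypothetical order dates spanning two years.
order_dates = [
    datetime(2013, 1, 5), datetime(2013, 1, 20),
    datetime(2014, 1, 3), datetime(2014, 2, 11),
]

# Grouping as in the query above: month number only.
by_month = Counter(d.month for d in order_dates)

# Alternative: one bucket per (year, month).
by_year_month = Counter((d.year, d.month) for d in order_dates)

print(dict(by_month))
print(dict(by_year_month))
```

In an aggregation pipeline the second variant would, under this assumption, add a `$year` expression next to `$month` in the `_id` of the `$group` stage.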

**B. How many orders are there from the customer `Tailspin Toys (Head Office)`?**

In [65]:
# first, we matched the CustomerName field to Tailspin Toys (Head Office)
query_3b_1 = {'$match': {'CustomerName': 'Tailspin Toys (Head Office)'}}


# then we use the lookup to join the orders and customers collections by the common field - CustomerID
query_3b_2 = {
    "$lookup":
    {
       "from": "orders",
       "localField": "CustomerID",
       "foreignField": "CustomerID",
       "as": "orders"
     }
}


# then we projected the fields to appear in the output - the name of the customer we were interested in
# and the size of the array created by the lookup, which holds the orders of that customer, using the $size operator
query_3b_3 = {
    "$project":
    {
    "_id": 0,
    'Customer Name': '$CustomerName',
    "Number of Orders": {'$size':"$orders"}
    }
}


pipeline = [query_3b_1, query_3b_2, query_3b_3]
q_3_b = list( db.customers.aggregate(pipeline))
q_3_b

[{'Customer Name': 'Tailspin Toys (Head Office)', 'Number of Orders': 129}]

**Answer:** The customer ``Tailspin Toys (Head Office)`` made 129 orders.
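
To make explicit what the three stages do, here is a small in-memory sketch of the same `$match`, `$lookup`, and `$project` with `$size` sequence. The sample documents and the helper `lookup_and_count` are hypothetical; only the field names come from the pipeline above.

```python
# Hypothetical sample documents (not WWI data).
customers = [
    {"CustomerID": 1, "CustomerName": "Customer A"},
    {"CustomerID": 2, "CustomerName": "Customer B"},
]
orders = [
    {"OrderID": 10, "CustomerID": 1},
    {"OrderID": 11, "CustomerID": 1},
]


def lookup_and_count(customers, orders, name):
    """Mimic $match on CustomerName, $lookup on CustomerID, then $size."""
    result = []
    for customer in customers:
        if customer["CustomerName"] != name:  # $match
            continue
        # $lookup: collect every order sharing the customer's CustomerID.
        joined = [o for o in orders if o["CustomerID"] == customer["CustomerID"]]
        # $project with $size: keep the name and the length of the joined array.
        result.append(
            {"Customer Name": customer["CustomerName"], "Number of Orders": len(joined)}
        )
    return result


print(lookup_and_count(customers, orders, "Customer A"))
print(lookup_and_count(customers, orders, "Customer B"))
```

A customer without orders still appears, with a count of 0, because `$lookup` behaves like a left outer join and yields an empty array in that case.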

**C. What are the average monthly sales prices of all goods sold?** 

In [67]:
# we grouped the purchase orders by month of the Order Date and calculated the average monthly sales
# using the average operator on the field Expected Unit Price Per Outer in each Purchase order lines field

query_3c_1 = {
    "$group":
    {
    "_id": {"month": { "$month": { "$toDate": "$OrderDate"}}}, 
    "Avg monthly Sales": {"$avg": "$purchaseorderslines.ExpectedUnitPricePerOuter"} 
    }
}

# next we sorted by month and projected the output fields, rounding the average monthly sales to 2 decimal places
query_3c_2 = {"$sort":{"_id":1}}

query_3c_3 = {
    '$project':
    {
    '_id' : True,
    'Average Monthly Sales' : {'$round' : ["$Avg monthly Sales", 2]}
    }
}

pipeline = [query_3c_1, query_3c_2, query_3c_3]
q_3_c = list(db.purchaseorders_embed.aggregate(pipeline))
q_3_c

[{'_id': {'month': 1}, 'Average Monthly Sales': Decimal128('124.99')},
 {'_id': {'month': 2}, 'Average Monthly Sales': Decimal128('127.87')},
 {'_id': {'month': 3}, 'Average Monthly Sales': Decimal128('129.00')},
 {'_id': {'month': 4}, 'Average Monthly Sales': Decimal128('127.69')},
 {'_id': {'month': 5}, 'Average Monthly Sales': Decimal128('127.29')},
 {'_id': {'month': 6}, 'Average Monthly Sales': Decimal128('125.70')},
 {'_id': {'month': 7}, 'Average Monthly Sales': Decimal128('127.41')},
 {'_id': {'month': 8}, 'Average Monthly Sales': Decimal128('128.56')},
 {'_id': {'month': 9}, 'Average Monthly Sales': Decimal128('129.18')},
 {'_id': {'month': 10}, 'Average Monthly Sales': Decimal128('129.39')},
 {'_id': {'month': 11}, 'Average Monthly Sales': Decimal128('129.94')},
 {'_id': {'month': 12}, 'Average Monthly Sales': Decimal128('131.24')}]

**D. In each state province what is the average customer credit limit?**  

In [74]:
# we used the State Province of the city that corresponds to the Postal City, from the customer_embed collection
# we grouped the customers by Province Name and calculated the average credit limit for each province

query_3d_1 = {
    "$group": 
    {
    "_id": {"Province": "$postal_cities.provinces.StateProvinceName"},
    "credit limit":{"$avg":"$CreditLimit"}
    }
}

# sorted the values by the Province Name
query_3d_2 = {"$sort":{"_id":1}}


# projected the fields to appear in the output: the _id that corresponds to the Province and the average credit limit rounded to 2 decimal places
query_3d_3 = {
    '$project':
    {
    '_id' : True,
    'Average Customer Credit Limit' : {'$round' : ["$credit limit", 2]}
    }
}

pipeline = [query_3d_1, query_3d_2, query_3d_3]
q_3_d = list(db.customer_embed.aggregate(pipeline))
q_3_d

[{'_id': {'Province': 'Alabama'},
  'Average Customer Credit Limit': Decimal128('1985.00')},
 {'_id': {'Province': 'Alaska'},
  'Average Customer Credit Limit': Decimal128('2858.33')},
 {'_id': {'Province': 'Arizona'},
  'Average Customer Credit Limit': Decimal128('2950.00')},
 {'_id': {'Province': 'Arkansas'},
  'Average Customer Credit Limit': Decimal128('3044.17')},
 {'_id': {'Province': 'California'},
  'Average Customer Credit Limit': Decimal128('2787.31')},
 {'_id': {'Province': 'Colorado'},
  'Average Customer Credit Limit': Decimal128('2033.33')},
 {'_id': {'Province': 'Connecticut'},
  'Average Customer Credit Limit': Decimal128('3500.00')},
 {'_id': {'Province': 'Florida'},
  'Average Customer Credit Limit': Decimal128('2800.00')},
 {'_id': {'Province': 'Georgia'},
  'Average Customer Credit Limit': Decimal128('2631.75')},
 {'_id': {'Province': 'Hawaii'}, 'Average Customer Credit Limit': None},
 {'_id': {'Province': 'Idaho'},
  'Average Customer Credit Limit': Decimal128('270
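
The `None` reported for Hawaii is consistent with how `$avg` behaves: it skips non-numeric and missing values, and returns null when a group contains no numeric value at all. A minimal sketch of that rule in plain Python follows; the helper `avg_ignoring_missing` and the credit limits are hypothetical.

```python
def avg_ignoring_missing(values):
    """Average the numeric entries, skipping None; return None if none are numeric."""
    numeric = [v for v in values if v is not None]
    if not numeric:
        return None
    return sum(numeric) / len(numeric)


# Hypothetical credit limits per province (not WWI data).
limits_by_province = {
    "Province A": [2000.0, None, 3000.0],
    "Province B": [None, None],
}

averages = {p: avg_ignoring_missing(v) for p, v in limits_by_province.items()}
print(averages)
```

Note that customers without a credit limit do not pull the average down; they are simply left out of it.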

**E. What are the yearly expenditures with each supplier (per supplier name)?**

In [78]:
# we assume the yearly expenditures are the Transactions Amounts relative to the Suppliers Invoice
# the first query matches the transactions whose TransactionTypeID is 5 because they correspond to the transactions of "Suppliers Invoices"
# which is what we are interested in to compute the yearly expenditure
query_3e_1 = {'$match': {'TransactionTypeID': 5}}

# in the second query we grouped the transactions by the year of the Transaction Date and by the supplier name and calculated the sum of the transactions
query_3e_2 = {
    "$group": 
    {
    "_id": {"year": {"$year" : "$TransactionDate"}, "supplier_name":"$supplier.SupplierName"},
    "Expenditure":{"$sum":"$TransactionAmount"}
    }}


# finally, ordered the output by the id
query_3e_3 = {"$sort":{"_id":1}}

pipeline = [ query_3e_1, query_3e_2, query_3e_3]
q_3_e = list(db.supplier_embed.aggregate(pipeline))
q_3_e

[{'_id': {'year': 2013}, 'Expenditure': Decimal128('70428667.37')},
 {'_id': {'year': 2014}, 'Expenditure': Decimal128('258734867.13')},
 {'_id': {'year': 2015}, 'Expenditure': Decimal128('484868081.11')},
 {'_id': {'year': 2016}, 'Expenditure': Decimal128('271961897.45')}]
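
The result above carries no `supplier_name` in its `_id`, which is consistent with the path `$supplier.SupplierName` not resolving to a field in `supplier_embed`: a key whose expression evaluates to a missing field is left out of the grouped `_id`, so the totals collapse to one row per year. The sketch below mimics this in plain Python. The sample documents and the helper `group_sum` are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions; the embedded supplier sits under "supplier_info".
transactions = [
    {"year": 2013, "supplier_info": {"SupplierName": "Supplier A"}, "TransactionAmount": 100.0},
    {"year": 2013, "supplier_info": {"SupplierName": "Supplier B"}, "TransactionAmount": 50.0},
]


def group_sum(docs, supplier_field):
    """Sum TransactionAmount per (year, supplier name), dropping a missing supplier key."""
    totals = defaultdict(float)
    for doc in docs:
        key = {"year": doc["year"]}
        supplier = doc.get(supplier_field)
        if supplier is not None:  # a path that does not exist adds nothing to the key
            key["supplier_name"] = supplier["SupplierName"]
        # Dicts are not hashable, so the key is stored as a sorted tuple of items.
        totals[tuple(sorted(key.items()))] += doc["TransactionAmount"]
    return dict(totals)


print(group_sum(transactions, "supplier"))       # wrong path: one row per year
print(group_sum(transactions, "supplier_info"))  # right path: one row per year and supplier
```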

In [93]:
# suppliers_embed only contains suppliers which have supplier transactions
# transaction type ID 5 corresponds to "Suppliers Invoices", which is what we are interested in to compute the yearly expenditures

query_3e_1 = {'$match': {'TransactionTypeID': 5}}

query_3e_2 = {
    "$group": {
        "_id": {"year": {"$year" : "$TransactionDate"}, "supplier_name":"$supplier_info.SupplierName"},
        "Expenditure":{"$sum":"$TransactionAmount"}
    }}

query_3e_3 = {"$sort":{"_id":1}}

pipeline = [ query_3e_1, query_3e_2, query_3e_3]
result_3e = list(db.supplier_embed.aggregate(pipeline))
result_3e

[{'_id': {'year': 2013, 'supplier_name': 'Contoso, Ltd.'},
  'Expenditure': Decimal128('360.53')},
 {'_id': {'year': 2013, 'supplier_name': 'Fabrikam, Inc.'},
  'Expenditure': Decimal128('60480866.71')},
 {'_id': {'year': 2013, 'supplier_name': 'Graphic Design Institute'},
  'Expenditure': Decimal128('7462.45')},
 {'_id': {'year': 2013, 'supplier_name': 'Litware, Inc.'},
  'Expenditure': Decimal128('9790895.67')},
 {'_id': {'year': 2013, 'supplier_name': 'Northwind Electric Cars'},
  'Expenditure': Decimal128('90639.00')},
 {'_id': {'year': 2013, 'supplier_name': 'The Phone Company'},
  'Expenditure': Decimal128('58443.01')},
 {'_id': {'year': 2014, 'supplier_name': 'Fabrikam, Inc.'},
  'Expenditure': Decimal128('192272508.90')},
 {'_id': {'year': 2014, 'supplier_name': 'Litware, Inc.'},
  'Expenditure': Decimal128('66462358.23')},
 {'_id': {'year': 2015, 'supplier_name': 'Fabrikam, Inc.'},
  'Expenditure': Decimal128('339016809.90')},
 {'_id': {'year': 2015, 'supplier_name': 'Litware,

# Question 4

**A. What is the most common payment type?**

In [80]:
#First thing, we will group the payments by method name
query_4a_1 = {
    '$group': {
        '_id' : '$paymentmethods.PaymentMethodName',
        'count' : {'$sum' : 1}
    }
}

#Then, we will use $lookup to perform a left outer join of supplier_transactions_embed to paymentmethods
query_4a_2 = {
    "$lookup":{
        "from": "supplier_transactions_embed",
        "localField": "_id",
        "foreignField": "paymentmethods.PaymentMethodName",
        "as": "supplier_transactions_embed"
     }
}


#After, use unwind to take the array of information from supplier_transactions_embed 
query_4a_3 = {
    "$unwind": "$supplier_transactions_embed"
}


#For each payment type, count the number of payments made by customers and suppliers
query_4a_4 = {
    '$group': {
        '_id' : '$_id',
        'customers' : { '$first': '$count' },
        'suppliers' : {'$sum' : 1}
    }
}

#Finally, sum the total number of payments
query_4a_5 = {
    '$project': {
        '_id' : 0,
        'payment_type' : '$_id',
        'count': { '$sum': [ "$customers", "$suppliers" ] }
    }
}



pipeline = [query_4a_1, query_4a_2, query_4a_3, query_4a_4, query_4a_5]

q_4_a = list(db.customer_transactions_embed.aggregate(pipeline))
q_4_a

[{'payment_type': 'EFT', 'count': 29075}]

**Answer:** The most common payment type is ``EFT``, with a total of 29075 transactions made with it.
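
The pipeline above has no `$sort` or `$limit` stage, so it identifies the most common type only because a single payment type survives the `$lookup` and `$unwind`. As a sketch of the general case, the counts from both sides can be combined and the top entry picked explicitly. The payment lists below are hypothetical, apart from the name `EFT`.

```python
from collections import Counter

# Hypothetical payment method names, one entry per transaction (not WWI data).
customer_payments = ["EFT", "EFT", "Method B"]
supplier_payments = ["EFT", "Method C", "Method C"]

# Adding two Counters sums the counts per payment type.
combined = Counter(customer_payments) + Counter(supplier_payments)

# most_common(1) returns the single (name, count) pair with the highest count.
payment_type, count = combined.most_common(1)[0]
print(payment_type, count)
```

In the aggregation, a comparable effect could be achieved by appending `{'$sort': {'count': -1}}` and `{'$limit': 1}` after `query_4a_5`.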

**B. What percentage of people have their Title as Team Member?**

In [81]:
q_4_b = ((db.people.count_documents({'CustomFields.Title' : 'Team Member'}))/db.people.count_documents({}))*100
print("The percentage of people that have their Title as Team Member is:", q_4_b,'%')

The percentage of people that have their Title as Team Member is: 1.17011701170117 %
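
For reporting, the ratio could be wrapped in a small helper that rounds the result and guards against an empty collection. This is only a sketch: `percentage` is a hypothetical helper, and the counts 13 and 1111 are an assumption chosen because they reproduce the printed value; the notebook does not show the actual counts.

```python
def percentage(part, total, digits=2):
    """Return part/total as a percentage rounded to `digits`, or 0.0 if total is 0."""
    if total == 0:
        return 0.0
    return round(part / total * 100, digits)


# Assumed counts (not shown in the notebook): 13 matching people out of 1111.
team_member_share = percentage(13, 1111)
print(f"The percentage of people that have their Title as Team Member is: {team_member_share} %")
```

With a live connection, the two arguments would be the two `count_documents` calls used in the cell above.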


**C. Which supplier of the category `Novelty Goods Supplier` has the most transactions?**

In [85]:
#Use match to keep only the transactions whose SupplierID belongs to a supplier in suppliers_embed with SupplierCategoryName Novelty Goods Supplier
query_4c_1 = {
    '$match':{
                'SupplierID':{'$in' : list(db.suppliers_embed.distinct('SupplierID',
                                    {'suppliercategories.SupplierCategoryName':'Novelty Goods Supplier'}))
    }
}}


#We will set SupplierID as the id of a supplier
query_4c_2 = {
    '$project':{
                '_id':False,
                'supplier_id':'$SupplierID'
    }
}

#Queries 3 and 4 count the transactions of each supplier in said category and sort the counts in descending order
query_4c_3 = {
    '$group': {
        '_id': '$supplier_id', 
        'count' : {'$sum' : 1}            
}
}

query_4c_4 = {
    '$sort': {
        'count' : -1}            
}


#Once again, we will use $lookup to perform a left outer join
query_4c_5 =     {
        '$lookup': {
           "from": "suppliers_embed",
           "localField": "_id",
           "foreignField": "SupplierID",
           "as": "suppliers_embed"
        }}

#Finally, create a result set 
query_4c_6 = {
    '$project' :{
        '_id' : '$suppliers_embed.SupplierName',
        'count' : '$count'
    }
}


pipeline = [query_4c_1, query_4c_2, query_4c_3, query_4c_4, query_4c_5, query_4c_6]

q_4_c = list(db.suppliertransactions.aggregate(pipeline))
q_4_c

[{'_id': ['Graphic Design Institute'], 'count': 16},
 {'_id': ['A Datum Corporation'], 'count': 7},
 {'_id': ['The Phone Company'], 'count': 7},
 {'_id': ['Contoso, Ltd.'], 'count': 2}]

**Answer:** ``Graphic Design Institute`` is the supplier that has the most transactions (16), out of the category ``Novelty Goods Supplier``.

**D. What is the highest CommissionRate that a person has?**

In [86]:
query_4d_1 = {
    '$project' : {
        '_id' : 0,
        'CustomFields' : 1
    }
}

query_4d_2 = {
    '$match' : {
        'CustomFields.CommissionRate' : {'$exists': 1}
    }
}

query_4d_3 = {
    '$project' : {
        '_id' : 0,
        'CommissionRate' : '$CustomFields.CommissionRate'
    }
}

query_4d_4 = {
    '$sort' : {'CommissionRate' : -1}
}

query_4d_5 = {
    '$limit' : 1
}

pipeline = [query_4d_1, query_4d_2, query_4d_3, query_4d_4, query_4d_5]

q_4_d = list(db.people.aggregate(pipeline))

q_4_d

[{'CommissionRate': '4.55'}]
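
The returned value is a string (`'4.55'`), so the `$sort` stage compares `CommissionRate` values as text rather than as numbers. Text ordering agrees with numeric ordering only while every rate has the same number of integer digits. The sketch below shows the difference in plain Python; the list `rates` is hypothetical.

```python
# Hypothetical commission rates stored as strings (not WWI data).
rates = ["4.55", "0.98", "10.00"]

# String comparison is character by character, so "4.55" beats "10.00"
# because "4" sorts after "1".
lexicographic_max = max(rates)

# Converting to float before comparing gives the numeric maximum.
numeric_max = max(rates, key=float)

print(lexicographic_max)
print(numeric_max)
```

In the aggregation, one way to obtain numeric ordering would be to project `{'$toDouble': '$CustomFields.CommissionRate'}` in `query_4d_3`, assuming a MongoDB version that supports `$toDouble` (4.0 or later).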

# Question 5

5. The marketing team needs to know:  



**A. What is the name of the sales person with the largest sum of invoice values in 2013 (person whose customers paid the most money)?**

In [88]:
query_5a_1 = {
        '$unwind': '$invoicelines'
    }


query_5a_2 = {
    '$project': {
        '_id': 1,
        'SalespersonPersonID': 1,
        'InvoiceYear': {'$year': '$InvoiceDate'},
        'SalesAmount': {'$multiply': ['$invoicelines.Quantity','$invoicelines.UnitPrice']}
    }
}

query_5a_3 = { 
        '$match' : {
            'InvoiceYear': {'$eq': 2013}
        } 
    }

query_5a_4 = {
    '$group': {
            '_id': '$SalespersonPersonID',
            'TotalSalesAmount': {'$sum': '$SalesAmount'}
    }
}

query_5a_5 = {
        '$lookup':{
           'from': 'people',
           'localField': '_id',
           'foreignField': 'PersonID',
           'as': 'SalesPerson'
        }
    }


query_5a_6 = {
    '$project': {
        '_id': 0,
        'FullName': '$SalesPerson.FullName',
        'TotalSalesAmount': 1
    }
}

query_5a_7 = { 
        '$sort' : {'TotalSalesAmount': -1}
    }

query_5a_8 = {
    '$limit': 1
}

pipeline = [query_5a_1, query_5a_2, query_5a_3, query_5a_4, query_5a_5, query_5a_6, query_5a_7, query_5a_8]

q_5_a = list(db.invoices_emb.aggregate(pipeline))
q_5_a

[{'TotalSalesAmount': Decimal128('4864279.75'), 'FullName': ['Hudson Onslow']}]

**Answer:** The salesperson with the largest sum of invoice values in 2013 is Hudson Onslow, with a total of 4864279.75.
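
`FullName` comes back as a one-element list because `$lookup` always stores its matches in an array. A possible way to return a plain string is sketched below, once as an alternative projection stage and once as client-side post-processing. The names `query_5a_6_flat` and `first_or_none` are hypothetical and not part of the notebook.

```python
# Possible replacement for query_5a_6 (a sketch, not the notebook's stage):
# $arrayElemAt picks the first element of the array built by $lookup.
query_5a_6_flat = {
    "$project": {
        "_id": 0,
        "FullName": {"$arrayElemAt": ["$SalesPerson.FullName", 0]},
        "TotalSalesAmount": 1,
    }
}


def first_or_none(values):
    """Return the first element of a list, or None for an empty list."""
    return values[0] if values else None


# Client-side equivalent on a row shaped like the output above.
# The amount is written as a plain string here to avoid depending on bson.
row = {"TotalSalesAmount": "4864279.75", "FullName": ["Hudson Onslow"]}
flat_row = {**row, "FullName": first_or_none(row["FullName"])}
print(flat_row)
```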

**B. Who are the most common `PickedByPersonID` person names for orders done by customer `Adriana Pena`?**

In [89]:
query_5b_1 = {
        '$unwind': '$orders'
    }

query_5b_2 = { 
    '$match' : {
        'CustomerName': 'Adriana Pena'
        } 
    }

query_5b_3 = {
    '$project': {
        '_id': 0,
        "CustomerName": 1,
        'PickedByPersonID':'$orders.PickedByPersonID'
    }
}


query_5b_4 = {
        '$lookup':{
           'from': 'people',
           'localField': 'PickedByPersonID',
           'foreignField': 'PersonID',
           'as': 'People'
        }
    }


query_5b_5 = {
    '$unwind': '$People'
}


query_5b_6 = {
    '$project': {
        '_id': 0,
        "CustomerName": 1,
        'PickedByPersonID': 1,
        'PickedByPersonName':'$People.FullName',
        
    }
}

query_5b_7 = {
    '$group': {
        '_id': "$PickedByPersonName",
        'NumberOrders':{'$sum' : 1}
    }
}


query_5b_8 = { 
        '$sort' : {'NumberOrders': -1}
    }

query_5b_9 = {
    '$limit': 3
}

pipeline = [query_5b_1, query_5b_2, query_5b_3, query_5b_4, query_5b_5, query_5b_6, query_5b_7, query_5b_8, query_5b_9] 

q_5_b = list(db.customers_emb.aggregate(pipeline))
q_5_b

[{'_id': 'Piper Koch', 'NumberOrders': 3},
 {'_id': 'Katie Darwin', 'NumberOrders': 3},
 {'_id': 'Anthony Grosse', 'NumberOrders': 3}]
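
All three names in the result are tied at 3 orders, and `$limit: 3` cuts the list at a fixed length, so any further picker tied at the same count would be dropped without notice. A sketch of a tie-aware selection in plain Python follows; the `picks` list is hypothetical.

```python
from collections import Counter

# Hypothetical picker names, one entry per order (not WWI data).
picks = ["Picker A", "Picker B", "Picker A", "Picker B", "Picker C"]

counts = Counter(picks)

# Highest number of orders picked by any single person.
top_count = max(counts.values())

# Keep every picker that reaches that count, instead of a fixed number of rows.
most_common_pickers = sorted(name for name, n in counts.items() if n == top_count)
print(most_common_pickers, top_count)
```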

**C. How many people have in their name the string `Sara`?**

In [90]:
db.people.count_documents({"FullName":{'$regex' : '.*Sara.*'}})

5
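
Since `$regex` matches anywhere in the string unless anchored, the pattern `.*Sara.*` selects the same documents as the shorter `Sara`. Both are case-sensitive and also match longer names that merely contain `Sara`. The sketch below illustrates this with Python's `re` module, which behaves the same way for these simple patterns. Apart from `Sara Karlsson`, the names are hypothetical.

```python
import re

# "Sara Karlsson" appears in the data; the other names are hypothetical.
names = ["Sara Karlsson", "Sarah Example", "Kasara Example", "sara lower"]

# The wrapped and the plain pattern select the same names.
wrapped = [n for n in names if re.search(".*Sara.*", n)]
plain = [n for n in names if re.search("Sara", n)]

# Word boundaries restrict the match to "Sara" as a whole word.
whole_word = [n for n in names if re.search(r"\bSara\b", n)]

# Ignoring case also picks up "sara" inside or at the start of a name.
any_case = [n for n in names if re.search("sara", n, re.IGNORECASE)]

print(wrapped, plain, whole_word, any_case, sep="\n")
```

Which variant answers the question depends on whether `Sarah` or lower-case spellings should count, which the question does not specify.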

In [91]:
list(db.people.find({"FullName":{'$regex' : '.*Sara.*'}}))

[{'_id': ObjectId('6287c4dc636e5a12693da1f6'),
  'PersonID': 40,
  'FullName': 'Sara Karlsson',
  'PreferredName': 'Sara',
  'SearchName': 'Sara Sara Karlsson',
  'IsPermittedToLogon': False,
  'LogonName': 'NO LOGON',
  'IsExternalLogonProvider': False,
  'HashedPassword': None,
  'IsSystemUser': True,
  'IsEmployee': False,
  'IsSalesperson': False,
  'UserPreferences': '{"theme":"le-frog","dateFormat":"mm/dd/yy","timeZone": "PST","table":{"pagingType":"numbers","pageLength": 10},"favoritesOnDashboard":true}',
  'PhoneNumber': '(201) 555-0100',
  'FaxNumber': '(201) 555-0106',
  'EmailAddress': 'sarak@northwindelectriccars.com',
  'Photo': None,
  'CustomFields': None,
  'OtherLanguages': None,
  'LastEditedBy': 1,
  'ValidFrom': datetime.datetime(2016, 5, 31, 23, 14),
  'ValidTo': datetime.datetime(9999, 12, 31, 23, 59, 59, 999000)},
 {'_id': ObjectId('6287c4dc636e5a12693da375'),
  'PersonID': 1377,
  'FullName': 'Sara Charlton',
  'PreferredName': 'Sara',
  'SearchName': 'Sara Sara

**D. What are the top 10 most Common Names (Primary or Surnames) of people?**

In [92]:
query_5d_1 = {
    '$project': {
        '_id': 1,
        'PersonID': 1,
        'FullName': 1,
        'SplitName': {'$split': [ "$FullName", " " ]}        
    }
}




query_5d_2 = {
    '$limit': 3
}

pipeline = [query_5d_1, query_5d_2]

q_5_d = list(db.people.aggregate(pipeline))
q_5_d

[{'_id': ObjectId('6287c4dc636e5a12693da1cf'),
  'PersonID': 1,
  'FullName': 'Data Conversion Only',
  'SplitName': ['Data', 'Conversion', 'Only']},
 {'_id': ObjectId('6287c4dc636e5a12693da1d0'),
  'PersonID': 2,
  'FullName': 'Kayla Woodcock',
  'SplitName': ['Kayla', 'Woodcock']},
 {'_id': ObjectId('6287c4dc636e5a12693da1d1'),
  'PersonID': 3,
  'FullName': 'Hudson Onslow',
  'SplitName': ['Hudson', 'Onslow']}]
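
The cell above stops after splitting the names and only previews three documents. One way to get from there to a top 10 is to unwind `SplitName`, group by each name part, sort by count, and limit to 10. The stages and the helper `top_names` below are a sketch under that assumption, not the notebook's own solution; the Python helper mirrors the same logic on an in-memory list so it can be checked without a database.

```python
from collections import Counter

# Stages that could follow query_5d_1 (sketch, not executed here).
query_5d_unwind = {"$unwind": "$SplitName"}
query_5d_group = {"$group": {"_id": "$SplitName", "count": {"$sum": 1}}}
query_5d_sort = {"$sort": {"count": -1}}
query_5d_limit = {"$limit": 10}


def top_names(full_names, n=10):
    """Count every space-separated name part and return the n most frequent."""
    counts = Counter(part for name in full_names for part in name.split(" "))
    return counts.most_common(n)


# Names taken from the outputs shown in this notebook.
sample = ["Sara Karlsson", "Sara Charlton", "Kayla Woodcock", "Hudson Onslow"]
print(top_names(sample, n=1))
```

Placeholder entries such as `Data Conversion Only` would be counted as names by both versions, so they may need to be filtered out first.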