# Big Data Modeling and Management 2021


## 🚚 BDMM Third Homework Assignment 🚚 

_Wide World Importers (WWI) is a wholesale novelty goods importer and distributor operating from the San Francisco Bay Area. In this assignment we will be working with their database._ 
More information about the WWI database can be found at the following link: https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-what-is?view=sql-server-ver15

The focus of the third assignment is modelling. We will use the same data source as in the previous assignment, the Wide World Importers database, and convert it to a document-based database. To that end, we will leverage concepts such as data denormalization, indexes, and MongoDB design patterns. 

More information on the extended data model can be found here: <br>  
https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-oltp-database-catalog?view=sql-server-ver15
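
To make the idea of denormalization concrete, the following minimal sketch embeds a parent record into each child record, which is one way to turn foreign-key relationships into nested documents. The helper name `embed_parent` and the sample records are hypothetical; the field names (`CountryID`, `StateProvinceID`, `CityID`) follow the WWI schema. It is an illustration under these assumptions, not a required design.

```python
def embed_parent(children, parents, key, field):
    """Return copies of `children` with the matching parent embedded under `field`.

    `key` is the foreign-key column shared by child and parent records.
    Children without a matching parent get `None` in `field`.
    """
    parents_by_id = {parent[key]: parent for parent in parents}
    embedded = []
    for child in children:
        doc = dict(child)  # shallow copy so the input records stay untouched
        doc[field] = parents_by_id.get(child.get(key))
        embedded.append(doc)
    return embedded


# Hypothetical sample records, shaped like rows of the relational tables
countries = [{"CountryID": 230, "CountryName": "United States"}]
states = [{"StateProvinceID": 39, "StateProvinceCode": "PA", "CountryID": 230}]
cities = [{"CityID": 1, "CityName": "Aaronsburg", "StateProvinceID": 39}]

# Embed Country into StateProvince, then StateProvince into City
states_embedded = embed_parent(states, countries, "CountryID", "Country")
cities_embedded = embed_parent(cities, states_embedded, "StateProvinceID", "StateProvince")
print(cities_embedded[0])
```

Embedding trades storage and update cost for faster reads, so it tends to suit small, rarely changing lookup tables such as countries better than large, frequently updated ones.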

## Problem Description

Your team has just arrived at WWI (a leading company in logistics). Welcome!   <br>
Even though business is thriving, the IT department is going through a bad time.   <br>
Digitalization was never a priority for the company, and now the company's operational and analytical requirements are starting to grow beyond the capabilities of its existing data architecture.   <br>

WWI data is spread across different systems, namely an old SQL database, data extracted through an API, and data stored in CSV files. <br>
Currently, developing the queries needed to answer the questions asked by the different departments is too costly. <br>
Management concluded that it is the right time to revise and revamp the data architecture in order to speed up operations. 

In that context, your team was tasked with merging all the company data into a single and coherent Mongo database. <br>
It is expected that, with your solution, WWI will have a better understanding of their business and that the different departments will be able to efficiently obtain the answers they desperately need.

The WWI team shared with you an ERD of their current data model:<br>
![datamodel](WWI.png)

Additionally, the WWI team asked you to deliver the following outputs in **10 days**:
- Understand and model the database.  
- Migrate all data to the database.  
- Answer the questions.  
- Submit the results by following the instructions.  
- Prepare a short oral presentation to explain your design choices and the results you obtained.

With these deliverables, you will have created a prototype that allows management to decide whether MongoDB is a good solution that meets their requirements.

### Design Requirements

You have been informed that WWI has the following query requirements for the database.

The web team needs:  
- Which state province are our suppliers from?   
- From which state province are the customers who have a higher credit limit?  


The warehouse group needs:  
- To know which items get ordered together the most?   
- Which items get ordered the most in bulk (bigger amounts)?  
- Which customers have delivery addresses under 10km of distance?  
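
For the distance question, a possible starting point is sketched below. It assumes that `Location`-style hex strings (such as the `'0xE6100000010C...'` value shown in the sample city document later in this notebook) hold a single point laid out as a 4-byte SRID, a 1-byte version, a 1-byte flags field, and then latitude and longitude as little-endian doubles. That layout, and the helper names `decode_point`, `to_geojson_point` and `haversine_km`, are assumptions made for illustration; verify them against the actual data before relying on them.

```python
import math
import struct
from typing import Tuple


def decode_point(location_hex: str) -> Tuple[float, float]:
    """Decode a single-point hex string into (latitude, longitude).

    Assumed layout: SRID (uint32), version (uint8), flags (uint8),
    latitude (float64), longitude (float64), all little-endian.
    """
    hex_digits = location_hex[2:] if location_hex.lower().startswith("0x") else location_hex
    raw = bytes.fromhex(hex_digits)
    _srid, _version, _flags, lat, lon = struct.unpack_from("<IBBdd", raw, 0)
    return lat, lon


def to_geojson_point(lat: float, lon: float) -> dict:
    """Build a GeoJSON point; GeoJSON lists longitude before latitude."""
    return {"type": "Point", "coordinates": [lon, lat]}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres, using a mean Earth radius of 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


# The sample value from the Application_Cities document (Aaronsburg)
lat, lon = decode_point("0xE6100000010C07E11B542C73444087C09140035D53C0")
print(lat, lon)
print(to_geojson_point(lat, lon))
```

Under the assumed layout, the sample value decodes to roughly latitude 40.90 and longitude -77.45. Storing locations as GeoJSON points would also allow MongoDB's geospatial operators to be used instead of computing distances in Python.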

The CFO:  
- Would like to know the monthly order count?  
- Would like to know the average monthly sales prices?  
- Would like to know the yearly expenditures with suppliers (per supplier name)?  
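
As an illustration of how a time-based question such as the monthly order count could be expressed, the sketch below builds an aggregation pipeline as plain Python data. It assumes that orders live in a collection such as `Sales_Orders` and that the `OrderDate` field is stored as a BSON date (the `$year` and `$month` operators do not accept plain strings). The function name `monthly_order_count_pipeline` is hypothetical.

```python
def monthly_order_count_pipeline(date_field="OrderDate"):
    """Build a pipeline that counts documents per (year, month) of `date_field`."""
    return [
        {
            "$group": {
                "_id": {
                    "year": {"$year": f"${date_field}"},
                    "month": {"$month": f"${date_field}"},
                },
                "orders": {"$sum": 1},  # one per order document
            }
        },
        # Chronological order: by year, then by month
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]


pipeline = monthly_order_count_pipeline()
print(pipeline)

# Possible usage, assuming `db` is a connected pymongo Database:
# for row in db["Sales_Orders"].aggregate(pipeline):
#     print(row)
```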

Partnerships:  
- Would like to know what's the most common payment type?  
- Which supplier in the `Novelty Goods Supplier` category has the most transactions?  

The marketing team:  
- Wants to make an appreciation post and needs the name of the salesperson with the most invoices in 2013 (the person whose customers brought in the most money).

---

Transform the SQL tables, API results and CSV files provided in the annex with this file, and model a database following MongoDB's best practices.

Write MongoDB queries to answer the above-mentioned questions.

Take advantage of database indexes to improve your query speeds.
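
A minimal sketch of how indexes could be declared in one place and applied to the database is shown below. The index plan itself is an assumption: `OrderDate` and `DeliveryLocation` are columns of the WWI schema, but whether they exist under those names in your collections (and whether `DeliveryLocation` holds GeoJSON, which a `2dsphere` index requires) depends on your design. The names `INDEX_PLAN` and `ensure_indexes` are hypothetical.

```python
# Hypothetical plan: collection name -> list of index key specifications.
# 1 means ascending; "2dsphere" is MongoDB's geospatial index type.
INDEX_PLAN = {
    "Sales_Orders": [[("OrderDate", 1)]],
    "Sales_Customers": [[("DeliveryLocation", "2dsphere")]],
}


def ensure_indexes(db, plan=INDEX_PLAN):
    """Create every index in `plan` and return the index names reported back.

    `db` is expected to behave like a pymongo Database: `db[name]` returns a
    collection whose `create_index(keys)` returns the name of the index.
    """
    created = []
    for collection_name, key_specs in plan.items():
        for keys in key_specs:
            created.append(db[collection_name].create_index(keys))
    return created


# Possible usage, assuming `db` is a connected pymongo Database:
# print(ensure_indexes(db))
```

To check whether a query actually uses an index, the `explain()` method of a pymongo cursor returns the query plan chosen by the server.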

### Deliverables

1. Notebook with all DB creation operations and CRUD operations;
2. Second notebook with all required 'queries to

### Data Source Materials

For the development of this assignment you will have access to the RDBMS/SQL database hosting the original WWI database. To connect to the database use the following credentials:
```
# !pip install mysql-connector-python
import mysql.connector

# Read-only credentials for the original WWI database
host = "rhea.isegi.unl.pt"
user = "wwi-read-only-user"
password = "jGp2GCqrss6nfTEu5ZawhW3mksLsQYQb"
database = "WWI"

mydb = mysql.connector.connect(host=host, user=user, database=database, port=3306, password=password)
mycursor = mydb.cursor()

# List the available tables
mycursor.execute('SHOW TABLES;')
print(f"Tables: {mycursor.fetchall()}")

# Inspect the schema of one table
mycursor.execute('DESCRIBE Purchasing_PurchaseOrderLines;')
print(f"Purchasing_PurchaseOrderLines schema: {mycursor.fetchall()}")
```
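
When migrating rows fetched through the cursor above, each row tuple has to become a dictionary before it can be inserted into MongoDB. The sketch below shows one way to do that. It also converts two Python types that pymongo cannot encode directly, `decimal.Decimal` and `datetime.date`. The helper names are hypothetical, and converting decimals to `float` is a simplifying assumption that can lose precision; `bson.decimal128.Decimal128` is an alternative when exact monetary values matter.

```python
import datetime
from decimal import Decimal


def to_bson_friendly(value):
    """Convert values that pymongo cannot encode into encodable equivalents."""
    if isinstance(value, Decimal):
        return float(value)  # assumption: float precision is acceptable
    if isinstance(value, datetime.datetime):
        return value  # already encodable; checked before date (its parent class)
    if isinstance(value, datetime.date):
        return datetime.datetime(value.year, value.month, value.day)
    return value


def rows_to_documents(column_names, rows):
    """Turn DB-API row tuples into a list of dictionaries keyed by column name."""
    return [
        {name: to_bson_friendly(value) for name, value in zip(column_names, row)}
        for row in rows
    ]


# Possible usage with the cursor defined above (table name taken from this notebook):
# mycursor.execute("SELECT * FROM Sales_Orders;")
# columns = [col[0] for col in mycursor.description]
# documents = rows_to_documents(columns, mycursor.fetchall())
# db["Sales_Orders"].insert_many(documents)
```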

Additionally, you have access to the following documents.

CSV with Warehouse Data  
**https://liveeduisegiunl-my.sharepoint.com/:f:/g/personal/fpinheiro_novaims_unl_pt/Eh8Mj-m6r4dOt84tPDGUnhUBd5oMC0CJKAeyJm3urNB-8g?e=JuPMuW**

API with Application data  
**http://rhea.isegi.unl.pt:8080/**

## Additional Information

#### Groups  

This is a group activity. <br>
Students should form groups of at least 4 and at most 5. <br>
We will use the groups that were established during the previous assignments and that are identified on Moodle.

#### MongoDB database access  

Each group will have access to its own MongoDB instance.<br>
Each group will receive an email with their access credentials. <br>
You will use the database to store your results. <br>

Connection details will have the following template:<br>
```
Host: rhea.isegi.unl.pt:27017  
Username: {groups_username}  
Password: {groups_password}  
```
Which then can be used as follows:
```
client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}/")
```
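
If a username or password contains reserved characters such as `@`, `:` or `/`, interpolating it directly into the connection string produces an invalid URI. A minimal sketch of a helper that percent-escapes the credentials first is shown below; the function name `build_mongo_uri` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, protocol="mongodb"):
    """Build a MongoDB connection URI with percent-escaped credentials."""
    return f"{protocol}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


# Hypothetical credentials containing reserved characters
uri = build_mongo_uri("group_user", "p@ss/word", "rhea.isegi.unl.pt", 27017)
print(uri)

# Possible usage:
# from pymongo import MongoClient
# client = MongoClient(uri)
```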

#### Submission  Deadline

The submission must contain a notebook with the queries and their results. Also indicate the name of the database that you created. <br>
Upload the notebook to Moodle before **23:59 on May 30th**.

#### Evaluation   

The third homework assignment counts 20% towards your final mark of the curricular unit. <br>
The assignment will be scored from 0 to 20. <br>
Your final task will be to present your database proposal to the owner of the company and explain how it would satisfy everyone. <br>

Each group submission will be evaluated on two components:
1. correctness of results;
2. simplicity of the solution;

50% -  Database design  
50% -  Query results  
*    25% - Correctness of queries   
*    25% - Right results

Please note that all code delivered in this assignment will go through automated plagiarism checks. <br>
Groups with high similarity levels in their code will be investigated.

**Presentations**

Presentations will be held between the 2nd and 3rd of June and you need to sign up your group in this calendly link:<br>
https://calendly.com/d/m9sj-qwpk/presentations (Please try to avoid empty windows)

## Imports

In [3]:
import pandas as pd
from tqdm.notebook import tqdm
from pprint import pprint
import numpy as np
from pymongo import MongoClient

## Connect to mongo database

In [4]:
# Connect to Mongo server
host="rhea.isegi.unl.pt"
port="27049"
user="GROUP_32"
password="bRG2XjRZhrRA9IfpmENyXxMlWQDUJdzL"
protocol="mongodb"
client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}")

# Connect to mongo db
db = client.denormalised

In [7]:
# List collections of this db
db.list_collection_names()

['Application_Countries',
 'Warehouse_Colors',
 'Sales_OrderLines',
 'Warehouse_StockItemTransactions',
 'Application_StateProvinces',
 'Sales_Invoices',
 'Sales_Orders',
 'Purchasing_PurchaseOrders',
 'Application_DeliveryMethods',
 'Application_Cities',
 'Application_TransactionTypes',
 'Warehouse_StockItems',
 'Sales_InvoiceLines',
 'Warehouse_StockItemStockGroups',
 'Purchasing_PurchaseOrderLines',
 'Purchasing_SupplierCategories',
 'Application_PaymentMethods',
 'Sales_Customers',
 'Sales_CustomerCategories',
 'Purchasing_SupplierTransactions',
 'Application_People',
 'Warehouse_PackageTypes',
 'Purchasing_Suppliers',
 'Warehouse_StockGroups',
 'Sales_CustomerTransactions']

In [8]:
# Print one document
pprint(db['Application_Cities'].find_one())

{'CityID': 1,
 'CityName': 'Aaronsburg',
 'LatestRecordedPopulation': 613,
 'Location': '0xE6100000010C07E11B542C73444087C09140035D53C0',
 'StateProvince': {'Country': {'Continent': 'North America',
                               'CountryID': 230,
                               'CountryName': 'United States',
                               'CountryType': 'UN Member State',
                               'FormalName': 'United States of America',
                               'IsoAlpha3Code': 'USA',
                               'IsoNumericCode': 840,
                               'LatestRecordedPopulation': 313973000,
                               'Region': 'Americas',
                               'Subregion': 'Northern America'},
                   'CountryID': 230,
                   'LatestRecordedPopulation': 13284753,
                   'SalesTerritory': 'Mideast',
                   'StateProvinceCode': 'PA',
                   'StateProvinceID': 39,
                   'Stat