# Big Data Modeling and Management 2021


## 🚚 BDMM Third Homework Assignment 🚚 

_Wide World Importers (WWI) is a wholesale novelty goods importer and distributor operating from the San Francisco Bay Area. In this assignment we will be working with their database._ 
More information and details about the WWI database can be found at the following link: https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-what-is?view=sql-server-ver15

The focus of the third assignment is modelling. We will use the same data source that was used in the previous assignment, the Wide World Importers database, and convert it to a document-based database. To that end, we will leverage concepts such as data denormalization, indexes, and MongoDB design patterns. 
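
As a rough illustration of denormalization, the sketch below embeds order lines into their parent order document instead of keeping them in a separate collection. The field names follow the WWI naming style, but the sample rows are made up and the helper `embed_lines` is hypothetical; it is one possible shape, not a prescribed design.

```python
from collections import defaultdict


def embed_lines(orders, order_lines):
    """Return one document per order with its lines embedded as a list."""
    lines_by_order = defaultdict(list)
    for line in order_lines:
        # Drop the foreign key: it is implied by the parent document.
        embedded = {k: v for k, v in line.items() if k != "OrderID"}
        lines_by_order[line["OrderID"]].append(embedded)
    return [
        {**order, "OrderLines": lines_by_order.get(order["OrderID"], [])}
        for order in orders
    ]


# Made-up sample rows, for illustration only.
orders = [{"OrderID": 1, "CustomerID": 832}, {"OrderID": 2, "CustomerID": 803}]
order_lines = [
    {"OrderLineID": 1, "OrderID": 1, "StockItemID": 67, "Quantity": 10},
    {"OrderLineID": 2, "OrderID": 1, "StockItemID": 50, "Quantity": 9},
]
documents = embed_lines(orders, order_lines)
```

Embedding tends to suit data that is always read together with its parent; data that is shared across many parents is usually better referenced by id.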

More information on the extended data model can be found here: <br>  
https://docs.microsoft.com/en-us/sql/samples/wide-world-importers-oltp-database-catalog?view=sql-server-ver15

## Problem Description

Your team has just arrived at WWI (a leading company in logistics). Welcome!   <br>
Even though business is thriving, the IT department is going through a bad time.   <br>
Digitalization was never a priority for the company, and now its operational and analytical requirements are starting to grow beyond the capabilities of the existing data architecture.   <br>

WWI data is spread across different systems: an old SQL database, data extracted through an API, and data stored in CSV files. <br>
Currently, the costs to develop the necessary queries to collect data to answer questions asked by the different departments are too high. <br>
Management concluded it is the right time to revise and revamp the data architecture, in order to speed up operations. 

In that context, your team was tasked with merging all the company data into a single and coherent Mongo database. <br>
It is expected that, with your solution, WWI will have a better understanding of their business and that the different departments will be able to efficiently obtain the answers they desperately need.

The WWI team shared with you an ERD of their current datamodel:<br>
![datamodel](WWI.png)

Additionally, the WWI team asked you to deliver the following outputs in **10 days**:
- Understand and model the database.  
- Migrate all data to the database
- Answer the questions.  
- Submit the results by following the instructions.  
- Prepare a short oral presentation to explain your design choices and the results you obtained.

With these deliverables, you will have created a prototype that allows management to decide whether MongoDB is a good solution that meets their requirements.

### Design Requirements

You have been informed that WWI has the following query requirements for the database.

The web team needs:  
- Which state provinces are our suppliers from?   
- Which state provinces are the customers with a higher credit limit from?  


The warehouse group needs:  
- To know which items get ordered together the most?   
- Which items get ordered the most in bulk (bigger amounts)?  
- Which customers have delivery addresses under 10km of distance?  
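
One way to approach the distance question is a geospatial filter. The sketch below only builds the filter document; it assumes the delivery location is stored as a GeoJSON point in a field named `DeliveryLocation` (a hypothetical name here), that a `2dsphere` index exists on that field, and that the distance is measured from a reference point you choose.

```python
def within_distance_filter(longitude, latitude, max_km, field="DeliveryLocation"):
    """Build a MongoDB filter matching points within max_km of a reference point."""
    return {
        field: {
            "$nearSphere": {
                # GeoJSON points list longitude first, then latitude.
                "$geometry": {"type": "Point", "coordinates": [longitude, latitude]},
                # For GeoJSON points, $maxDistance is expressed in meters.
                "$maxDistance": max_km * 1000,
            }
        }
    }


# Reference coordinates are an assumption (roughly San Francisco).
geo_filter = within_distance_filter(-122.42, 37.77, 10)
# Possible usage: db["Sales_Customers"].find(geo_filter)
```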

The CFO:  
- Would like to know the monthly order count?  
- Would like to know the average monthly sales prices?  
- Would like to know the yearly expenditures with suppliers (per supplier name)?  
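
A monthly count can be expressed as an aggregation pipeline. The sketch below assumes the date is stored as a `YYYY-MM-DD` string in a field named `OrderDate` (an assumption about the migrated data); if dates are stored as real date values, date operators would be the better fit.

```python
def monthly_count_pipeline(date_field="OrderDate"):
    """Build a pipeline that counts documents per 'YYYY-MM' month."""
    return [
        {
            "$group": {
                # The first 7 characters of 'YYYY-MM-DD' give 'YYYY-MM'.
                "_id": {"$substrCP": [f"${date_field}", 0, 7]},
                "orderCount": {"$sum": 1},
            }
        },
        # Sort chronologically; 'YYYY-MM' strings sort in date order.
        {"$sort": {"_id": 1}},
    ]


pipeline = monthly_count_pipeline()
# Possible usage: list(db["Sales_Orders"].aggregate(pipeline))
```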

Partnerships:  
- Would like to know what's the most common payment type?  
- Which supplier of `Novelty Goods Supplier` has the most transactions?  

The marketing team:  
- Wants to make an appreciation post and needs the name of the salesperson with the most invoices in 2013 (the person whose customers brought in the most money).

---

Transform the SQL tables, API results and CSV files provided in the annex to this file, and model a database following MongoDB best practices.

Write MongoDB queries to answer the questions listed above.

Take advantage of database indexes to improve your query speeds.
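
As a minimal sketch of index creation, the helper below creates a single-field and a compound index with PyMongo's `create_index`. The helper name `create_query_indexes` is hypothetical, and the chosen collection and fields (`Sales_Invoices`, `CustomerID`, `InvoiceDate`) are only examples; which indexes actually help depends on your final model and queries.

```python
def create_query_indexes(db):
    """Create example indexes and return the index names reported by the driver."""
    invoices = db["Sales_Invoices"]
    created = []
    # Single-field index for lookups by customer (1 means ascending).
    created.append(invoices.create_index([("CustomerID", 1)]))
    # Compound index for date-range queries that also filter by customer.
    created.append(invoices.create_index([("InvoiceDate", 1), ("CustomerID", 1)]))
    return created


# Possible usage, given a connected PyMongo database object `db`:
# print(create_query_indexes(db))
```

Calling `explain()` on a query cursor is a common way to check whether an index is actually used.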

### Deliverables

1. Notebook with all DB creation operations and CRUD operations;
2. Second notebook with all required 'queries to

### Data Source Materials

For the development of this assignment you will have access to the RDBMS/SQL database hosting the original WWI database. To connect to the database use the following credentials:
```
host = "rhea.isegi.unl.pt"
user = "wwi-read-only-user"
password = "jGp2GCqrss6nfTEu5ZawhW3mksLsQYQb"
database = "WWI"

# !pip install mysql-connector-python
import mysql.connector

# Open a connection using the credentials above
mydb = mysql.connector.connect(host=host, user=user, database=database, port=3306, password=password)
mycursor = mydb.cursor()

# List all tables in the database
mycursor.execute('SHOW TABLES;')
print(f"Tables: {mycursor.fetchall()}")

# Inspect the columns of one table
mycursor.execute('DESCRIBE Purchasing_PurchaseOrderLines;')
print(f"Purchasing_PurchaseOrderLines schema: {mycursor.fetchall()}")
```

Additionally, you have access to the following documents.

CSV with Warehouse Data  
**https://liveeduisegiunl-my.sharepoint.com/:f:/g/personal/fpinheiro_novaims_unl_pt/Eh8Mj-m6r4dOt84tPDGUnhUBd5oMC0CJKAeyJm3urNB-8g?e=JuPMuW**

API with Application data  
**http://rhea.isegi.unl.pt:8080/**

## Additional Information

#### Groups  

This is a group activity. <br>
Students should form groups of at least 4 and at most 5. <br>
We will use the currently defined groups that were established during the previous assignments and that are identified on Moodle.

#### MongoDB database access  

Each group will have access to its own MongoDB instance.<br>
Each group will receive an email with their access credentials. <br>
You will use the database to store your results. <br>

Connection details will have the following template:<br>
```
Host: rhea.isegi.unl.pt:27017  
Username: {groups_username}  
Password: {groups_password}  
```
Which then can be used as follows:
```
client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}/")
```
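
If a username or password contains characters such as `@`, `:` or `/`, the connection string needs them percent-escaped. The sketch below shows one way to build the URI with `urllib.parse.quote_plus`; the helper name `build_mongo_uri` and the sample credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, protocol="mongodb"):
    """Build a MongoDB connection URI with escaped credentials."""
    return f"{protocol}://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"


# Made-up credentials, for illustration only.
uri = build_mongo_uri("GROUP_00", "p@ss/word", "rhea.isegi.unl.pt", 27017)
# Possible usage: client = MongoClient(uri)
```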

#### Submission  Deadline

The submission must contain a notebook with the queries and their results. Also indicate the name of the database that you created. <br>
Upload the notebook to Moodle before **23:59 on May 30th**.

#### Evaluation   

The third homework assignment counts 20% towards your final mark of the curricular unit. <br>
The assignment will be scored from 0 to 20. <br>
Your final task will be to present your database proposal to the owner of the company and explain how it would satisfy everyone. <br>

Each group submission will be evaluated on two components:
1. correctness of results;
2. simplicity of the solution;

50% -  Database design  
50% -  Query results  
*    25% - Correctness of queries   
*    25% - Right results

Please note that all code delivered in this assignment will go through automated plagiarism checks. <br>
Groups with high similarity levels in their code will undergo investigation.

**Presentations**

Presentations will be held on the 2nd and 3rd of June, and you need to sign up your group using this Calendly link:<br>
https://calendly.com/d/m9sj-qwpk/presentations (Please try to avoid empty windows)

## Imports

In [1]:
from tqdm.notebook import tqdm
from pprint import pprint

## Connection

In [26]:
# Connect to mySQL server
import mysql.connector

def connect_mysql():

    mydb = mysql.connector.connect(host='rhea.isegi.unl.pt', 
                                   user='wwi-read-only-user', 
                                   database='WWI',
                                   password='jGp2GCqrss6nfTEu5ZawhW3mksLsQYQb',
                                   port=3306
                                  )
    mycursor = mydb.cursor()
    
    return mydb, mycursor

    # Getting the table names
    #mycursor.execute('SHOW TABLES;')
    #print(f"Tables: {mycursor.fetchall()}")

    # Getting a table's column descriptions
    #mycursor.execute('DESCRIBE Purchasing_PurchaseOrderLines;')
    #print(f"Purchasing_PurchaseOrderLines schema: {mycursor.fetchall()}")

In [4]:
# Connect to Mongo server

from pymongo import MongoClient

host="rhea.isegi.unl.pt"
port="27049"
user="GROUP_32"
password="bRG2XjRZhrRA9IfpmENyXxMlWQDUJdzL"
protocol="mongodb"
client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}")

In [5]:
# Create a database
db = client.philipp

# Open the MySQL connection and list the source tables
mydb, mycursor = connect_mysql()
mycursor.execute('SHOW TABLES;')
tables = mycursor.fetchall()

## For deleting everything and flushing the cursor

In [6]:
db.list_collection_names()

['Purchasing_SupplierCategories',
 'Sales_CustomerCategories',
 'Sales_Invoices',
 'Sales_CustomerTransactions',
 'Sales_InvoiceLines',
 'Sales_OrderLines',
 'Sales_Customers',
 'Purchasing_PurchaseOrderLines',
 'Purchasing_SupplierTransactions',
 'Purchasing_Suppliers',
 'Purchasing_PurchaseOrders']

In [6]:
#for i in range(len(tables)):
#    
#    table_name = tables[i][0]
#    
#    db[table_name].drop()

In [7]:
db.list_collection_names()

['Purchasing_SupplierCategories',
 'Sales_CustomerCategories',
 'Sales_Invoices',
 'Sales_CustomerTransactions',
 'Sales_InvoiceLines',
 'Sales_OrderLines',
 'Sales_Customers',
 'Purchasing_PurchaseOrderLines',
 'Purchasing_SupplierTransactions',
 'Purchasing_Suppliers',
 'Purchasing_PurchaseOrders']

In [8]:
mycursor.fetchall()

[]

## Get row counts of each table

In [7]:
for i in range(len(tables)):
    
    # Get MySQL table name
    table_name = tables[i][0]
    
    # Let the server count the rows instead of transferring every row
    query = "SELECT COUNT(*) FROM " + table_name
    
    # Execute query
    mycursor.execute(query)
    
    # Print table name and row count
    print(table_name)
    print(mycursor.fetchone()[0])

Purchasing_PurchaseOrderLines
8367
Purchasing_PurchaseOrders
2074
Purchasing_SupplierCategories
9
Purchasing_SupplierTransactions
2438
Purchasing_Suppliers
13
Sales_CustomerCategories
8
Sales_CustomerTransactions
52000
Sales_Customers
663
Sales_InvoiceLines
30000
Sales_Invoices
70510
Sales_OrderLines
76000
Sales_Orders
73595


## Migrate data from MySQL database to mongo database

In [9]:
# Adapted from:
# https://nicksardo.wordpress.com/2015/11/24/transferring-data-between-mysql-and-mongodb/

def migrate_table(table_index):
    
    # Get MySQL table name
    table_name = tables[table_index][0]
    
    # Print table name
    print('Migrating table', table_name)
    
    # Get names of the columns of this table
    describe = 'DESCRIBE ' + table_name + ';'
    mycursor.execute(describe)
    describe_out = mycursor.fetchall()

    cols = []
    for col_index in range(len(describe_out)):
        col = describe_out[col_index][0]
        cols.append(col)
    
    # Create mongodb collection
    collection = db[table_name]
    
    # Get row count and print it
    mycursor.execute("SELECT COUNT(*) FROM " + table_name)
    print('Rows in this table:', mycursor.fetchone()[0])
    
    # Select all rows of the table
    mycursor.execute("SELECT * FROM " + table_name)

    # Custom sequential record id rather than MongoDB's default ObjectId
    cid = 0

    # Cycle through each MySQL row
    for row in tqdm(mycursor):
        cid += 1

        # Build a fresh dict per row so that values from earlier rows
        # cannot leak into fields that are NULL in this row
        cus = {'_id': cid}

        for i in range(len(row)):
            # Skip NULL fields
            if row[i] is None:
                continue
            # Store every value as a string under its column name
            cus[cols[i]] = str(row[i])

        # We've completed processing this row, insert it into MongoDB
        collection.insert_one(cus)
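
The function above stores every value as a string, which makes numeric and date comparisons in MongoDB harder later on. A possible alternative, sketched below under the assumption that the MySQL driver returns `decimal.Decimal` and `datetime.date` objects for numeric and date columns, converts only the types that cannot be stored directly. The helper names `to_bson_friendly` and `row_to_document` are hypothetical.

```python
import datetime
import decimal


def to_bson_friendly(value):
    """Convert values that cannot be stored as-is; leave the rest untouched."""
    if isinstance(value, decimal.Decimal):
        # float loses exactness; acceptable for a prototype, not for accounting.
        return float(value)
    if isinstance(value, datetime.date) and not isinstance(value, datetime.datetime):
        # Plain dates become datetimes at midnight.
        return datetime.datetime(value.year, value.month, value.day)
    return value


def row_to_document(columns, row, doc_id):
    """Build one document from a row, skipping NULL fields."""
    doc = {"_id": doc_id}
    for name, value in zip(columns, row):
        if value is not None:
            doc[name] = to_bson_friendly(value)
    return doc


# Made-up row, for illustration only.
example = row_to_document(
    ["InvoiceID", "InvoiceDate", "Total", "Comments"],
    (1, datetime.date(2013, 1, 1), decimal.Decimal("12.50"), None),
    doc_id=1,
)
```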

In [15]:
def migrate_table_10k_entries(table_index, ten_k_step):
    
    from_entry = (ten_k_step * 10000) + 1
    till_entry = ((ten_k_step + 1) * 10000) + 1
    print('Migrate entries larger or equal to', str(from_entry), 'and smaller than', str(till_entry))
    
    # Get MySQL table name
    table_name = tables[table_index][0]
    
    # Print table name
    print('Migrating table', table_name)
    
    # Get names of the columns of this table
    describe = 'DESCRIBE ' + table_name + ';'
    mycursor.execute(describe)
    describe_out = mycursor.fetchall()

    cols = []
    for col_index in range(len(describe_out)):
        col = describe_out[col_index][0]
        cols.append(col)
    
    # Create mongodb collection
    collection = db[table_name]
    
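    # Remove only the documents in this window so that a re-run
    # does not delete chunks that were migrated afterwards
    collection.delete_many({'_id': {"$gte": from_entry, "$lt": till_entry}})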
    
    # Get row count and print it
    mycursor.execute("SELECT COUNT(*) FROM " + table_name)
    print('Rows in this table:', mycursor.fetchone()[0])
    
    # Select all rows of the table
    mycursor.execute("SELECT * FROM " + table_name)

    # Custom sequential record id rather than MongoDB's default ObjectId
    cid = 0

    # Cycle through each MySQL row
    for row in tqdm(mycursor):
        cid += 1

        # Only migrate rows that fall inside the requested window
        if from_entry <= cid < till_entry:

            # Build a fresh dict per row so that values from earlier rows
            # cannot leak into fields that are NULL in this row
            cus = {'_id': cid}

            for i in range(len(row)):
                # Skip NULL fields
                if row[i] is None:
                    continue
                # Store every value as a string under its column name
                cus[cols[i]] = str(row[i])

            # We've completed processing this row, insert it into MongoDB
            collection.insert_one(cus)
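
Inserting one document per round trip is slow for the larger tables. A possible alternative, sketched below, reads rows in batches with the cursor's `fetchmany` and writes each batch with the collection's `insert_many`. The helper `copy_in_batches` is hypothetical and keeps the string conversion used above.

```python
def copy_in_batches(cursor, collection, columns, batch_size=1000):
    """Copy all remaining rows of an executed cursor into a collection.

    Returns the number of documents inserted.
    """
    inserted = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        docs = []
        for row in rows:
            inserted += 1
            # Sequential _id, then every non-NULL field as a string.
            doc = {"_id": inserted}
            doc.update({c: str(v) for c, v in zip(columns, row) if v is not None})
            docs.append(doc)
        # One round trip to MongoDB per batch instead of one per row.
        collection.insert_many(docs)
    return inserted


# Possible usage inside the notebook, after mycursor.execute("SELECT * FROM ..."):
# copy_in_batches(mycursor, db[table_name], cols)
```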

In [11]:
db['Sales_OrderLines'].count_documents({})

10000

In [12]:
db['Sales_OrderLines'].delete_many({'_id' : { "$gt": 10000}})

<pymongo.results.DeleteResult at 0x7fa8dff18c40>

In [13]:
db['Sales_OrderLines'].count_documents({})

10000

In [11]:
for i in range(len(tables)):
    migrate_table(i)

Migrating table Purchasing_PurchaseOrderLines
Rows in this table: 8367


0it [00:00, ?it/s]

Migrating table Purchasing_PurchaseOrders
Rows in this table: 2074


0it [00:00, ?it/s]

Migrating table Purchasing_SupplierCategories
Rows in this table: 9


0it [00:00, ?it/s]

Migrating table Purchasing_SupplierTransactions
Rows in this table: 2438


0it [00:00, ?it/s]

Migrating table Purchasing_Suppliers
Rows in this table: 13


0it [00:00, ?it/s]

Migrating table Sales_CustomerCategories
Rows in this table: 8


0it [00:00, ?it/s]

Migrating table Sales_CustomerTransactions
Rows in this table: 52000


0it [00:00, ?it/s]

Migrating table Sales_Customers
Rows in this table: 663


0it [00:00, ?it/s]

Migrating table Sales_InvoiceLines
Rows in this table: 30000


0it [00:00, ?it/s]

Migrating table Sales_Invoices
Rows in this table: 70510


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query
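
Rather than restarting by hand after each lost connection, the migration step could be wrapped in a small retry helper. The sketch below is generic: `run_with_retry` is a hypothetical helper, `task` is whatever callable performs the work, and `reconnect` is a callable that re-establishes the connection.

```python
import time


def run_with_retry(task, reconnect, attempts=3, delay_seconds=0, retry_on=(Exception,)):
    """Run task(); on a listed exception, reconnect and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts:
                raise
            reconnect()
            time.sleep(delay_seconds)


# Possible usage in the notebook (names other than run_with_retry come from above):
# def reconnect():
#     global mydb, mycursor
#     mydb, mycursor = connect_mysql()
#
# run_with_retry(lambda: migrate_table(9), reconnect,
#                retry_on=(mysql.connector.errors.OperationalError,))
```

Because a failed run leaves a partially filled collection, the task would also need to drop or clean that collection before inserting again.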

In [26]:
# Drop the table that was last added
table_name = tables[i][0]
print(table_name)
db[table_name].drop()

Sales_Invoices


In [27]:
i

9

In [28]:
# Starting again at that table
for i in range(9,len(tables)):
    migrate_table(i)

Migrating table Sales_Invoices
Rows in this table: 70510


0it [00:00, ?it/s]

Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [54]:
i

10

In [15]:
# Drop the table that was last added
table_name = tables[i][0]
print(table_name)
db[table_name].drop()

Sales_OrderLines


In [17]:
# Starting again at that table
for i in range(10,len(tables)):
    migrate_table(i)

Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [16]:
i=10
i

10

In [28]:
migrate_table_10k_entries(i, 1)

Migrate entries larger or equal to 10001 and smaller than 20001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [30]:
migrate_table_10k_entries(i, 2)

Migrate entries larger or equal to 20001 and smaller than 30001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [32]:
migrate_table_10k_entries(i, 3)

Migrate entries larger or equal to 30001 and smaller than 40001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [34]:
migrate_table_10k_entries(i, 4)

Migrate entries larger or equal to 40001 and smaller than 50001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [36]:
migrate_table_10k_entries(i, 5)

Migrate entries larger or equal to 50001 and smaller than 60001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

In [37]:
migrate_table_10k_entries(i, 6)

Migrate entries larger or equal to 60001 and smaller than 70001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

In [38]:
migrate_table_10k_entries(i, 7)

Migrate entries larger or equal to 70001 and smaller than 80001
Migrating table Sales_OrderLines
Rows in this table: 76000


0it [00:00, ?it/s]

In [39]:
i+=1
i

11

In [40]:
migrate_table_10k_entries(i, 0)

Migrate entries larger or equal to 1 and smaller than 10001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [42]:
migrate_table_10k_entries(i, 1)

Migrate entries larger or equal to 10001 and smaller than 20001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [44]:
migrate_table_10k_entries(i, 2)

Migrate entries larger or equal to 20001 and smaller than 30001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [46]:
migrate_table_10k_entries(i, 3)

Migrate entries larger or equal to 30001 and smaller than 40001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

OperationalError: 2013 (HY000): Lost connection to MySQL server during query

In [48]:
migrate_table_10k_entries(i, 4)

Migrate entries larger or equal to 40001 and smaller than 50001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

In [49]:
migrate_table_10k_entries(i, 5)

Migrate entries larger or equal to 50001 and smaller than 60001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

In [50]:
migrate_table_10k_entries(i, 6)

Migrate entries larger or equal to 60001 and smaller than 70001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

In [51]:
migrate_table_10k_entries(i, 7)

Migrate entries larger or equal to 70001 and smaller than 80001
Migrating table Sales_Orders
Rows in this table: 73595


0it [00:00, ?it/s]

In [47]:
mydb, mycursor = connect_mysql()

## Check newly created mongo database

In [52]:
db.list_collection_names()

['Purchasing_SupplierCategories',
 'Sales_CustomerCategories',
 'Sales_Invoices',
 'Sales_CustomerTransactions',
 'Sales_InvoiceLines',
 'Sales_OrderLines',
 'Sales_Customers',
 'Purchasing_PurchaseOrderLines',
 'Purchasing_SupplierTransactions',
 'Purchasing_Suppliers',
 'Purchasing_PurchaseOrders',
 'Sales_Orders']

In [53]:
for collection in db.list_collection_names():
    print(collection)
    pprint(db[collection].find_one())
    print()

Purchasing_SupplierCategories
{'SupplierCategoryID': '1',
 'SupplierCategoryName': 'Other Wholesaler',
 '_id': 1}

Sales_CustomerCategories
{'CustomerCategoryID': '1', 'CustomerCategoryName': 'Agent', '_id': 1}

Sales_Invoices
{'AccountsPersonID': '3032',
 'BillToCustomerID': '832',
 'ConfirmedDeliveryTime': '2013-01-02 07:05:00',
 'ConfirmedReceivedBy': 'Aakriti Byrraju',
 'ContactPersonID': '3032',
 'CustomerID': '832',
 'CustomerPurchaseOrderNumber': '12126',
 'DeliveryInstructions': 'Suite 24, 1345 Jun Avenue',
 'DeliveryMethodID': '3',
 'DeliveryRun': '',
 'InvoiceDate': '2013-01-01',
 'InvoiceID': '1',
 'IsCreditNote': '0',
 'OrderID': '1',
 'PackedByPersonID': '14',
 'ReturnedDeliveryData': '{"Events": [{ "Event":"Ready for '
                         'collection","EventTime":"2013-01-01T12:00:00","ConNote":"EAN-125-1051"},{ '
                         '"Event":"DeliveryAttempt","EventTime":"2013-01-02T07:05:00","ConNote":"EAN-125-1051","DriverID":15,"Latitude":41.3617214,"Longitu