# Big Data Modelling and Management - 2nd Project

**Group 24 members:** <br>
Filipe Lourenço (R20170799), <br>
Guilherme Neves (R20170749), <br>
Rui Monteiro (R20170796), <br>
Vasco Pestana (R20170803)

**MSc:** Data Science and Advanced Analytics - Nova IMS <br>
2020/2021

In [1]:
from pymongo import MongoClient
import datetime

### Connecting to the MongoDB database

In [2]:
host="rhea.isegi.unl.pt"
port="27017"
user="mongo_group_24"
password="T0b5Sez8prB8jo9ohdb907GL5U0S24GX"
protocol="mongodb"
database="worldwideimporters"

client = MongoClient(f"{protocol}://{user}:{password}@{host}:{port}/{database}")

In [3]:
group_24_db = client.worldwideimporters

print(f"Database info: {group_24_db}\n")

Database info: Database(MongoClient(host=['rhea.isegi.unl.pt:27017'], document_class=dict, tz_aware=False, connect=True), 'worldwideimporters')



### Questions

0. __Example Question__ _How many orders exist in the database?_

**Answer:** 73595

In [5]:
# Count all documents in the 'orders' collection on the server, without loading them into a Python list

group_24_db.orders.count_documents({})

73595

1. How many people records don't have the UserPreferences field?

**Answer:** 929

In [6]:
# Count the people documents that do not have the UserPreferences field

group_24_db.people.count_documents(
    {'UserPreferences': {'$exists': False}})

929

2. How many customer records are valid after `November 2015`?

**Answer:** 1111

In [7]:
# Filter on the 'ValidTo' field: a customer record registered before November 2015 may still be valid afterwards,
# so a record counts as valid after November 2015 when its 'ValidTo' is later than 2015-11-30 23:59:59

group_24_db.people.count_documents(
    {"ValidTo": {"$gt": datetime.datetime(2015, 11, 30, 23, 59, 59)}})

1111
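The boundary behaviour of the `$gt` filter can be checked locally. This is a minimal sketch in plain Python, assuming a few made-up `ValidTo` values (only the open-ended `9999-12-31` value appears in the notebook output); it mirrors the comparison rather than querying the database.

```python
import datetime

# Cutoff used in the query above: the last second of November 2015.
cutoff = datetime.datetime(2015, 11, 30, 23, 59, 59)

# Hypothetical ValidTo values chosen to probe the boundary.
sample_valid_to = [
    datetime.datetime(2015, 11, 30, 23, 59, 59),  # exactly at the cutoff: excluded by a strict ">"
    datetime.datetime(2015, 12, 1, 0, 0, 0),      # first instant of December: included
    datetime.datetime(9999, 12, 31, 23, 59, 59),  # open-ended record: included
]

# Same strict comparison that "$gt" applies to each document.
valid_after_november = [d for d in sample_valid_to if d > cutoff]
print(len(valid_after_november))
```

An equivalent filter would be `"$gte"` with `datetime.datetime(2015, 12, 1)`, which avoids spelling out the last second of the month.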

3. How many people have their `Title` equal to `Team Member`?

**Answer:** 13

In [8]:
# Count the people whose 'Title' (stored inside 'CustomFields') equals 'Team Member'

group_24_db.people.count_documents(
    {"CustomFields.Title": {"$eq": "Team Member"}})

13

4. How many people have in their name the string `Sara`?

**Answer:** 5

In [9]:
# Use a regular expression to count the people whose 'FullName' contains the substring 'Sara'

group_24_db.people.count_documents(
    {"FullName": {"$regex": ".*Sara.*"}})

5
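Because `$regex` patterns are unanchored, the leading and trailing `.*` in the query are optional. The sketch below illustrates this with Python's `re` module, which is not the engine MongoDB uses but behaves the same way for this simple pattern; the last name in the list is a made-up value added to show case sensitivity.

```python
import re

# Three names from the notebook output plus one hypothetical lowercase entry.
names = ["Sara Karlsson", "Saraswati Beniwal", "Kayla Woodcock", "sara lowercase"]

# Pattern as written in the query, and the shorter unanchored form.
padded = [n for n in names if re.search(".*Sara.*", n)]
bare = [n for n in names if re.search("Sara", n)]

print(padded == bare)  # both forms select the same names
print(bare)
```

Both patterns are case-sensitive, so a name stored as `sara ...` would not be counted unless a case-insensitive option were added.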

5. Return 5 full names that have in their name the string `Sara`?

**Answer:** Sara Karlsson, Sara Charlton, Saraswati Beniwal, Sara Huiting and Sara Walkky

In [10]:
# Use a regular expression that matches every 'FullName' containing the substring 'Sara' and project only 'FullName'
# (the previous question showed that exactly 5 people match, so there is no need to limit the results)

list(group_24_db.people.find(
    {"FullName": {"$regex": ".*Sara.*"}},
    {"_id":0, "FullName":1}))

[{'FullName': 'Sara Karlsson'},
 {'FullName': 'Sara Charlton'},
 {'FullName': 'Saraswati Beniwal'},
 {'FullName': 'Sara Huiting'},
 {'FullName': 'Sara Walkky'}]

6. What is the highest `CommissionRate` that a person has?

**Answer:** 4.55

In [5]:
list(group_24_db.people.find().limit(4))

[{'_id': ObjectId('6091cbcf45ad05f8e5c847b9'),
  'PersonID': 1,
  'FullName': 'Data Conversion Only',
  'PreferredName': 'Data Conversion Only',
  'SearchName': 'Data Conversion Only Data Conversion Only',
  'IsPermittedToLogon': False,
  'LogonName': 'NO LOGON',
  'IsExternalLogonProvider': False,
  'IsSystemUser': False,
  'IsEmployee': False,
  'IsSalesperson': False,
  'UserPreferences': '{"theme":"blitzer","dateFormat":"yy-mm-dd","timeZone": "PST","table":{"pagingType":"full_numbers","pageLength": 25},"favoritesOnDashboard":true}',
  'Photo': nan,
  'LastEditedBy': 1,
  'ValidFrom': datetime.datetime(2016, 5, 31, 23, 14),
  'ValidTo': datetime.datetime(9999, 12, 31, 23, 59, 59)},
 {'_id': ObjectId('6091cbcf45ad05f8e5c847ba'),
  'PersonID': 2,
  'FullName': 'Kayla Woodcock',
  'PreferredName': 'Kayla',
  'SearchName': 'Kayla Kayla Woodcock',
  'IsPermittedToLogon': True,
  'LogonName': 'kaylaw@wideworldimporters.com',
  'IsExternalLogonProvider': False,
  'HashedPassword': '0x616E9

In [4]:
# Use '$project' to keep only the 'CommissionRate' sub-field from the 'CustomFields' field.
# Then sort by this sub-field in descending order (-1) and keep only the highest 'CommissionRate' with the '$limit' operator

list(group_24_db.people.aggregate([{ '$project' : {'CustomFields.CommissionRate': 1}},
                                   { '$sort': {'CustomFields.CommissionRate': -1}},
                                   { '$limit': 1},
                                   { '$project': {'_id': 0,
                                                  'CommissionRate':'$CustomFields.CommissionRate'}}]))

[{'CommissionRate': '4.55'}]
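The output shows that `CommissionRate` comes back as the string `'4.55'`, so the `$sort` stage compares text rather than numbers. The following sketch, using hypothetical rates (only `'4.55'` comes from the output above), shows where the two orderings would diverge.

```python
# Hypothetical commission rates stored as strings; only '4.55' is taken from the notebook.
rates = ["4.55", "0.25", "10.00", "3.50"]

# Text comparison goes character by character, so '4...' beats '10...'.
max_as_text = max(rates)

# Converting to float first gives the numeric maximum.
max_as_number = max(rates, key=float)

print(max_as_text, max_as_number)
```

If every stored rate is below 10 and uses the same format, both orderings agree and the answer stands. Otherwise the values would need converting before sorting, for example with the `$toDouble` aggregation operator, assuming the server is MongoDB 4.0 or later.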

7. And what are the top 10 most Common Names (Primary or Surnames)?

**Answer:** Bose, Ganguly, Roman, Thakur, Prabhupāda (stored mis-encoded as `PrabhupÄ\x81da`), De, David, Mukherjee, Dhanishta and Van

In [12]:
# Split 'FullName' on spaces (" ") and use '$sortByCount' to group and count each resulting name part
# The '$unwind' operator deconstructs the array of name parts, so that each part (name) becomes a separate document
# The most frequent value was an empty string, so 11 results are requested and the first one is skipped

list(group_24_db.people.aggregate( [ { '$project': { 'split_name' : { '$split': ["$FullName", " "] } } },
                                     { '$unwind': '$split_name' },
                                     { '$sortByCount': "$split_name" },
                                     { '$limit': 11 }]))[1:11]

[{'_id': 'Bose', 'count': 8},
 {'_id': 'Ganguly', 'count': 7},
 {'_id': 'Roman', 'count': 6},
 {'_id': 'Thakur', 'count': 6},
 {'_id': 'PrabhupÄ\x81da', 'count': 5},
 {'_id': 'De', 'count': 5},
 {'_id': 'David', 'count': 5},
 {'_id': 'Mukherjee', 'count': 5},
 {'_id': 'Dhanishta', 'count': 5},
 {'_id': 'Van', 'count': 5}]
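The split, unwind and count steps can be reproduced locally to see where an empty string can come from. This is a sketch with made-up spacing in the names (a double space and a trailing space are assumptions, not something confirmed in the data).

```python
from collections import Counter

# Hypothetical names: one has a double space, one has a trailing space.
full_names = ["Sara Karlsson", "Sara  Charlton", "Kayla Woodcock ", "Sara Huiting"]

# Equivalent of '$split' followed by '$unwind': one token per name part.
tokens = [part for name in full_names for part in name.split(" ")]

# Equivalent of '$sortByCount': count every token, including empty strings.
counts = Counter(tokens)

# Dropping empty tokens explicitly instead of skipping the first result.
filtered = Counter(part for part in tokens if part)

print(counts[""], filtered.most_common(1))
```

In the pipeline itself, a `{'$match': {'split_name': {'$ne': ''}}}` stage after `$unwind` would have the same effect and would not depend on the empty string being ranked first.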

8. How many orders has the Customer `Tailspin Toys (Head Office)`?

**Answer:** 129

In [13]:
# Start by grouping and counting the CustomerIDs in the 'orders' collection, do the lookup with customers using the CustomerID
# field, match the customer called 'Tailspin Toys (Head Office)' and, in the end, project the CustomerID, CustomerName and count

query_1 = {
    "$group": {
        "_id": {"cust_id": "$CustomerID"},
        "count": {"$sum": 1}
    }
}

query_2 = {
    "$lookup":
    {
       "from": "customers",
       "localField": "_id.cust_id",
       "foreignField": "CustomerID",
       "as": "cust_id"
     }
}

query_3 = {
    "$project":
    {
       "fields": {"$arrayElemAt": ["$cust_id", 0]},
       "count": "$count",
     }
}

query_4 = {
    '$match': {
        'fields.CustomerName' : {'$eq': 'Tailspin Toys (Head Office)'}
    }
}

query_5 = {
    "$project":
    {
       "_id": 0,
       "CustomerID": "$fields.CustomerID",
       "CustomerName": "$fields.CustomerName",
       "count": 1,
     }
}

pipeline = [query_1, query_2, query_3, query_4, query_5]

r = group_24_db.orders.aggregate(pipeline)

result = list(r)

result

[{'count': 129,
  'CustomerID': 1,
  'CustomerName': 'Tailspin Toys (Head Office)'}]
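The pipeline above counts orders for every customer before joining and filtering. The same answer can be reached by resolving the customer first and counting only that customer's orders, which is the approach the notebook takes for questions 11 and 12. A minimal in-memory sketch, with hypothetical order and customer records apart from the `CustomerID` and `CustomerName` shown in the output:

```python
# Hypothetical documents; CustomerID 1 and its name match the notebook output.
customers = [
    {"CustomerID": 1, "CustomerName": "Tailspin Toys (Head Office)"},
    {"CustomerID": 2, "CustomerName": "Hypothetical Customer"},
]
orders = [
    {"OrderID": 1, "CustomerID": 1},
    {"OrderID": 2, "CustomerID": 2},
    {"OrderID": 3, "CustomerID": 1},
]

# Step 1: look up the ID(s) of the target customer.
target_ids = {
    c["CustomerID"] for c in customers
    if c["CustomerName"] == "Tailspin Toys (Head Office)"
}

# Step 2: count only the orders that belong to that customer.
order_count = sum(1 for o in orders if o["CustomerID"] in target_ids)
print(order_count)
```

Against the database, step 1 would be a `find` on `customers` and step 2 a count on `orders` filtered by the returned `CustomerID`.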

9. How many people have three or more `OtherLanguage` values?

**Answer:** 4

In [14]:
# 'OtherLanguages.2' refers to the third element (index 2) of the 'OtherLanguages' array, so '$exists': True matches only the
# documents whose array holds at least 3 languages; documents without the field are excluded. The matches are then counted

group_24_db.people.count_documents({'OtherLanguages.2': {'$exists': True}})

4
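The index-based test is equivalent to checking the array length. The sketch below mirrors it in plain Python on hypothetical people documents; `has_index` is an illustrative helper, not part of pymongo.

```python
# Hypothetical documents covering the relevant cases.
people = [
    {"FullName": "Person A", "OtherLanguages": ["Greek", "Dutch", "Polish"]},
    {"FullName": "Person B", "OtherLanguages": ["Finnish"]},
    {"FullName": "Person C"},  # field missing entirely
    {"FullName": "Person D", "OtherLanguages": ["Arabic", "Slovak", "Croatian", "Chinese"]},
]


def has_index(doc, field, index):
    """Return True when doc[field] is a list that has an element at position index."""
    value = doc.get(field)
    return isinstance(value, list) and len(value) > index


# Index 2 exists only when the list has three or more entries.
three_or_more = [p["FullName"] for p in people if has_index(p, "OtherLanguages", 2)]
print(three_or_more)
```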

10. Top 10 most common `OtherLanguage` for people records?

**Answer:** Greek, Finnish, Dutch, Lithuanian, Arabic, Polish, Romanian, Croatian, Slovak and Chinese

In [15]:
# Use '$project' to keep only the 'OtherLanguages' field. The '$unwind' operator then emits one document per language in
# each 'OtherLanguages' array, '$sortByCount' counts the languages and sorts them in descending order, and '$limit' keeps
# only the 10 languages with the highest count

list(group_24_db.people.aggregate( [ { '$project': {'OtherLanguages': 1 }},
                                     { '$unwind': '$OtherLanguages' },
                                     { '$sortByCount': "$OtherLanguages" },
                                     { '$limit': 10 },
                                     { '$project': {'_id': 0, 
                                                    'OtherLanguage':'$_id', 
                                                    'count': 1}}]))

[{'count': 3, 'OtherLanguage': 'Greek'},
 {'count': 3, 'OtherLanguage': 'Finnish'},
 {'count': 3, 'OtherLanguage': 'Dutch'},
 {'count': 2, 'OtherLanguage': 'Lithuanian'},
 {'count': 2, 'OtherLanguage': 'Arabic'},
 {'count': 2, 'OtherLanguage': 'Polish'},
 {'count': 2, 'OtherLanguage': 'Romanian'},
 {'count': 2, 'OtherLanguage': 'Croatian'},
 {'count': 2, 'OtherLanguage': 'Slovak'},
 {'count': 1, 'OtherLanguage': 'Chinese'}]

11. Who is the most common `PickedByPersonID` person name for orders done by customer `Adriana Pena`?

**Answer:** Three people have 'count' equal to 3, which is the highest value: Anthony Grosse, Piper Koch and Katie Darwin

In [16]:
# First, find the record for Adriana Pena in the 'customers' collection to get her 'PrimaryContactPersonID', which
# corresponds to the 'ContactPersonID' in the 'orders' collection

# We fetch Adriana Pena's ID up front with this query and then use it in the aggregate query (next cell). This improved
# the computational efficiency of the queries because, instead of doing another '$lookup', the orders are filtered with
# a '$match' at the beginning.

list(group_24_db.customers.find({'CustomerName':'Adriana Pena'},
                                {'_id':0, 'CustomerName':1, 'PrimaryContactPersonID':1}))

[{'CustomerName': 'Adriana Pena', 'PrimaryContactPersonID': 3255}]

In [17]:
# Use '$match' to select only the orders whose 'ContactPersonID' equals 3255, then '$sortByCount' to count them per
# 'PickedByPersonID' in descending order. '$lookup' joins the result with the 'people' collection, matching '_id' (which
# holds the 'PickedByPersonID') against 'PersonID'. '$project' keeps only 'count' and the 'FullName' sub-field of the joined
# 'FullPersonInformation' array. '$unwind' turns that array into a plain value and drops groups with no matching person
# (empty array). '$limit' keeps 3 documents, since three people tie for the highest count.
 
list(group_24_db.orders.aggregate([{'$match': {'ContactPersonID': 3255}},
                                   {'$sortByCount': "$PickedByPersonID" },
                                   {'$lookup': {
                                                'from': 'people',
                                                'localField': '_id',
                                                'foreignField': 'PersonID',
                                                'as': 'FullPersonInformation'}},
                                   {'$project': {'_id': 0, 
                                                 'count': 1,
                                                 'PersonFullName':'$FullPersonInformation.FullName'}},
                                   {'$unwind': '$PersonFullName'},
                                   {'$limit': 3}
                                   ]))

[{'count': 3, 'PersonFullName': 'Anthony Grosse'},
 {'count': 3, 'PersonFullName': 'Piper Koch'},
 {'count': 3, 'PersonFullName': 'Katie Darwin'}]
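The `$limit` of 3 works here because the result was inspected and three people share the top count. A sketch of selecting every tied name without hard-coding that number, using the three names and counts from the output plus one hypothetical lower-ranked picker:

```python
from collections import Counter

# One entry per picked order; "Hypothetical Picker" is invented for contrast.
picked_by = (
    ["Anthony Grosse"] * 3
    + ["Piper Koch"] * 3
    + ["Katie Darwin"] * 3
    + ["Hypothetical Picker"] * 2
)

counts = Counter(picked_by)

# Highest count, then every name that reaches it.
top_count = max(counts.values())
most_common_pickers = sorted(name for name, c in counts.items() if c == top_count)

print(top_count, most_common_pickers)
```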

12. What is the average difference in days between OrderDate and ExpectedDeliveryDate for orders sold by (`SalespersonPersonID`) person with name `Jack Potter`?

**Answer:** Negative 1.4 days (approximately)

In [18]:
# Find the PersonID of Jack Potter

# As in question 11, we fetch Jack Potter's ID up front with this query and then use it in the aggregate query (next cell).
# This improved the computational efficiency of the queries because, instead of doing a '$lookup', the orders are
# filtered with a '$match' at the beginning.

list(group_24_db.people.find({'FullName': 'Jack Potter'},
                             {"_id":0, "FullName":1, "PersonID":1}))

[{'PersonID': 20, 'FullName': 'Jack Potter'}]

In [19]:
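# Match the orders sold by Jack Potter (PersonID 20) and average the difference between 'OrderDate' and 'ExpectedDeliveryDate'.
# Subtracting two dates returns milliseconds, so we divide by 1000 * 60 * 60 * 24 to convert to days. The result is negative
# because the expression is 'OrderDate' minus 'ExpectedDeliveryDate' and, on average, the order date comes first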

list(group_24_db.orders.aggregate([{'$match': {'SalespersonPersonID': 20}},
                                  { "$group": {
                                    "_id": None,  # constant key, so all matched orders form a single group
                                    "avg_time": {
                                      "$avg": { 
                                            '$divide': [{ 
                                                '$subtract': ['$OrderDate', '$ExpectedDeliveryDate'] }, 1000 * 60 * 60 * 24]    
                                      }
                                    }
                                  }},
                                  {'$project': {
                                        "_id": 0,
                                        "Average Time": "$avg_time"
                                    }}
                                ]))

[{'Average Time': -1.4490320833897388}]
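The millisecond-to-day conversion and the sign of the result can be verified with a small local calculation. This sketch uses two hypothetical orders (the dates are invented) and mirrors the `$subtract`, `$divide` and `$avg` steps.

```python
import datetime

MS_PER_DAY = 1000 * 60 * 60 * 24  # same divisor as in the pipeline

# Hypothetical orders: delivery expected 1 and 3 days after the order date.
orders = [
    {"OrderDate": datetime.datetime(2013, 1, 1),
     "ExpectedDeliveryDate": datetime.datetime(2013, 1, 2)},
    {"OrderDate": datetime.datetime(2013, 1, 4),
     "ExpectedDeliveryDate": datetime.datetime(2013, 1, 7)},
]

# OrderDate minus ExpectedDeliveryDate, expressed in milliseconds.
diffs_ms = [
    (o["OrderDate"] - o["ExpectedDeliveryDate"]).total_seconds() * 1000
    for o in orders
]

# Average the differences and convert to days.
avg_days = sum(diffs_ms) / len(diffs_ms) / MS_PER_DAY
print(avg_days)
```

Swapping the operands of the subtraction would report the same figure as a positive number of days between ordering and expected delivery.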