![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Big Data Management - Assignment 1
## Document Stores

### by Luis Francisco Alvarez Poli, Mikel Gallo, Clarice Mottet

0. **[Part 0: Set Up](#part0)**
- **Objective**: Initialize programming environment and generate data.

1. **[Part 1: Model Creation](#part1)**
- **Objective**: Create three modeling alternatives using MongoDB.
- **Tasks:**
  - Model1: Two types of documents, one for each class and referenced fields.
  - Model2: One document for “Person” with “Company” as embedded document.
  - Model3: One document for “Company” with “Person” as embedded documents.

2. **[Part 2: Query Execution](#part2)**
- **Objective**: Execute four queries and log run time for each model.
- **Tasks:**
  - Query1: For each person, retrieve their full name and their company’s name.
  - Query2: For each company, retrieve its name and the number of employees.
  - Query3: For each person born before 1988, update their age to “30”.
  - Query4: For each company, update its name to include the word “Company”.

3. **[Part 3: Results & Discussion](#part3)**
- **Objective**: Compare run times for query execution across the three models.
- **Tasks:** 
  - Question1: Order queries from best to worst for Q1. Which model performs best? Why?
  - Question2: Order queries from best to worst for Q2. Which model performs best? Why?
  - Question3: Order queries from best to worst for Q3. Which model performs best? Why?
  - Question4: Order queries from best to worst for Q4. Which model performs best? Why?
  - Question5: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?


## <a id='part0'>Part 0: Set Up</a>
- **Objective**: Initialize programming environment and generate fake data.

In [7]:
#libraries
import datetime
import time
import json
from pymongo import MongoClient
from bson.objectid import ObjectId
from faker import Faker
import pandas as pd
import numpy as np

#global variables
NUMBER_OF_COMPANIES = 2
NUMBER_OF_EMPLOYEES = 4

## <a id='part1'>Part 1: Model Creation</a>
- **Objective**: Create three modeling alternatives using MongoDB.
- **Tasks:**
  - Model1: Two types of documents, one for each class and referenced fields.
  - Model2: One document for “Person” with “Company” as embedded document.
  - Model3: One document for “Company” with “Person” as embedded documents.
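
The three alternatives can be pictured as plain Python dictionaries, which is how PyMongo represents documents. The sketch below is illustrative only: the placeholder values and the reduced field set are assumptions, as are the embedded field name `company` in Model2 and the reuse of `staff` for embedded persons in Model3. Only `worksIn`, `staff`, `firstName`, `secondName` and `name` in Model1 come from the code in this notebook.

```python
# Illustrative document shapes (placeholder values, not generated data).

# Model1: separate person and company documents linked by reference.
m1_company = {"_id": "c1", "type": "company", "name": "Acme", "staff": ["p1"]}
m1_person = {
    "_id": "p1", "type": "person",
    "firstName": "Ada", "secondName": "Lovelace",
    "worksIn": "c1",  # reference to the company's _id
}

# Model2 (assumed layout): each person embeds a copy of its company.
m2_person = {
    "_id": "p1", "firstName": "Ada", "secondName": "Lovelace",
    "company": {"name": "Acme"},
}

# Model3 (assumed layout): each company embeds its persons.
m3_company = {
    "_id": "c1", "name": "Acme",
    "staff": [{"firstName": "Ada", "secondName": "Lovelace"}],
}
```

In Model1 a read that needs both names must follow the reference, while Model2 and Model3 carry the related data inside the document itself.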

### **Model1**: Two types of documents, one for each class and referenced fields.

In [32]:
#model1

class Model1:
    """Model1: person and company documents in one collection, linked by reference."""

    def __init__(self, host='127.0.0.1', port=27017, dbname='test'):
        # Keep the client on the instance so every method can reuse it.
        self.client = MongoClient(host, port)
        self.db = self.client[dbname]
        self.db.drop_collection("model1")
        self.collection = self.db.create_collection('model1')

    # Note: the methods below do not wrap their work in `with self.client:`.
    # Leaving such a block closes the client, and PyMongo 4 and later do not
    # allow a closed client to be reused by the next method call.
    def close(self):
        #close the connection once, after all queries have run
        self.client.close()

    def data_generator(self, n_company, n_person):
        #create sample data
        fake = Faker(['en_US'])

        #create sample company data
        for c in range(n_company):
            company = {
                'type': 'company',
                'domain': fake.domain_word(),
                'email': fake.ascii_company_email(),
                'name': fake.company(),
                'url': fake.uri(),
                'vatNumber': fake.nic_handles(),
                'staff': []
            }
            company_id = self.collection.insert_one(company).inserted_id
            list_persons = []

            #create sample staff data
            for p in range(n_person):
                today = datetime.date.today()
                dob = pd.to_datetime(fake.date_of_birth(minimum_age = 18, maximum_age = 67))
                age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
                person = {
                    'type': 'person',
                    'age': age,
                    'companyEmail': fake.ascii_company_email(),
                    'dateOfBirth': dob,
                    'email': fake.email(),
                    'firstName': fake.first_name(),
                    'secondName': fake.last_name(),
                    'job': fake.job(),
                    'worksIn': company_id
                }
                list_persons.append(person)

            #insert the staff, then store their ids on the company document
            inserted_persons = self.collection.insert_many(list_persons)
            self.collection.update_one(
                {'_id': company_id},
                {'$set': {'staff': inserted_persons.inserted_ids}}
            )

    def query1(self):
        #Q1: full name of each person and the name of their company
        start_time = time.time()
        for person in self.collection.find({"type": "person"}):
            full_name = person['firstName'] + ' ' + person['secondName']
            #follow the reference: one extra lookup per person
            company = self.collection.find_one({'_id':person['worksIn']})
            company_name = company['name']
            print(" ")
            print("Full Name:",full_name)
            print("Company:",company_name)
        end_time = time.time()
        run_time = end_time - start_time
        return run_time

    def query2(self):
        #Q2: name and number of employees of each company
        start_time = time.time()
        for company in self.collection.find({"type": "company"}):
            company_name = company['name']
            number_of_employees = len(company['staff'])
            print(" ")
            print("Company:",company_name)
            print("Number of Employees:",number_of_employees)
        end_time = time.time()
        run_time = end_time - start_time
        return run_time

    def query3(self):
        #Q3: set age to 30 for each person born before 1988
        start_time = time.time()
        for person in self.collection.find({"type": "person"}):
            dob = person['dateOfBirth']
            if dob < pd.to_datetime('1988-01-01'):
                print(" ")
                print(" Pre - Age Change:",person['age'])
                result = self.collection.update_one(
                            {'_id': person['_id']},
                            {'$set': {'age': 30}}
                        )
                person_ = self.collection.find_one({'_id':person['_id']})
                print("Post - Age Change:",person_['age'])
        end_time = time.time()
        run_time = end_time - start_time
        return run_time

    def query4(self):
        #Q4: prefix each company name with the word "Company"
        start_time = time.time()
        for company in self.collection.find({"type": "company"}):
            company_name = company['name']
            print(" ")
            print(" Pre - Name Change:",company_name)
            result = self.collection.update_one(
                        {'_id': company['_id']},
                        {'$set': {'name': "Company "+company_name}}
                    )
            company_ = self.collection.find_one({'_id':company['_id']})
            print("Post - Name Change:",company_['name'])
        end_time = time.time()
        run_time = end_time - start_time
        return run_time

In [63]:
#test code

n_company = NUMBER_OF_COMPANIES
n_person = NUMBER_OF_EMPLOYEES

client = MongoClient('127.0.0.1', 27017)
db = client['test']
db.drop_collection("model1")
collection = db.create_collection('model1')


#create sample data
fake = Faker(['en_US'])

#create sample company data
for c in range(n_company):
    company = {
        'type': 'company',
        'domain': fake.domain_word(),
        'email': fake.ascii_company_email(),
        'name': fake.company(),
        'url': fake.uri(),
        'vatNumber': fake.nic_handles(),
        'staff': []
    }
    company_id = collection.insert_one(company).inserted_id
    list_persons = []
    
    #create sample staff data
    for p in range(n_person):
        today = datetime.date.today()
        dob = pd.to_datetime(fake.date_of_birth(minimum_age = 18, maximum_age = 67))
        age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
        person = {
            'type': 'person',
            'age': age,
            'companyEmail': fake.ascii_company_email(),
            'dateOfBirth': dob,
            'email': fake.email(),
            'firstName': fake.first_name(),
            'secondName': fake.last_name(),
            'job': fake.job(),
            'worksIn': company_id
        }
        list_persons.append(person)
    
    inserted_persons = collection.insert_many(list_persons)
    collection.update_one(
        {'_id': company_id},
        {'$set': {'staff': inserted_persons.inserted_ids}}
    )


In [64]:
#Q1: print full name and print company name

start_time = time.time()
for person in collection.find({"type": "person"}):
    full_name = person['firstName'] + ' ' + person['secondName']
    company = collection.find_one({'_id':person['worksIn']})
    company_name = company['name']
    print(" ")
    print("Full Name:",full_name)
    print("Company:",company_name)
end_time = time.time()
run_time = end_time - start_time


 
Full Name: Kimberly Simpson
Company: Lee, Hernandez and Martinez
 
Full Name: John Lopez
Company: Lee, Hernandez and Martinez
 
Full Name: James Skinner
Company: Lee, Hernandez and Martinez
 
Full Name: Kristin Boone
Company: Lee, Hernandez and Martinez
 
Full Name: Michelle Norman
Company: Parker-Barber
 
Full Name: Francisco Little
Company: Parker-Barber
 
Full Name: Michelle Serrano
Company: Parker-Barber
 
Full Name: Erik Flores
Company: Parker-Barber


In [65]:
#Q2: for each company, retrieve its name and the number of employees

start_time = time.time()
for company in collection.find({"type": "company"}):
    company_name = company['name']
    number_of_employees = len(company['staff'])
    print(" ")
    print("Company:",company_name)
    print("Number of Employees:",number_of_employees)
end_time = time.time()
run_time = end_time - start_time


 
Company: Lee, Hernandez and Martinez
Number of Employees: 4
 
Company: Parker-Barber
Number of Employees: 4


In [70]:
#Q3: for each person born before 1988, update their age to "30"

start_time = time.time()
for person in collection.find({"type": "person"}):
    dob = person['dateOfBirth']
    if dob < pd.to_datetime('1988-01-01'):
        print(" ")
        print(" Pre - Age Change:",person['age'])
        result = collection.update_one(
                    {'_id': person['_id']},
                    {'$set': {'age': 30}}
        )
        person_ = collection.find_one({'_id':person['_id']})
        print("Post - Age Change:",person_['age'])
end_time = time.time()
run_time = end_time - start_time


 
 Pre - Age Change: 30
Post - Age Change: 30
 
 Pre - Age Change: 30
Post - Age Change: 30
 
 Pre - Age Change: 30
Post - Age Change: 30
 
 Pre - Age Change: 30
Post - Age Change: 30


In [71]:
#Q4: for each company, update its name to include the word company

start_time = time.time()
for company in collection.find({"type": "company"}):
    company_name = company['name']
    print(" ")
    print(" Pre - Name Change:",company_name)
    result = collection.update_one(
                {'_id': company['_id']},
                {'$set': {'name': "Company "+company_name}}
            )
    company_ = collection.find_one({'_id':company['_id']})
    print("Post - Name Change:",company_['name'])
end_time = time.time()
run_time = end_time - start_time


{'_id': ObjectId('662c0a37813fe61339781181'), 'type': 'company', 'domain': 'young', 'email': 'jessicagarcia@brewer-norris.net', 'name': 'Lee, Hernandez and Martinez', 'url': 'http://hardy.biz/app/tags/blogsearch.asp', 'vatNumber': ['TN172-NFDS'], 'staff': [ObjectId('662c0a37813fe61339781182'), ObjectId('662c0a37813fe61339781183'), ObjectId('662c0a37813fe61339781184'), ObjectId('662c0a37813fe61339781185')]}
 
 Pre - Name Change: Lee, Hernandez and Martinez
Post - Name Change: Company Lee, Hernandez and Martinez
{'_id': ObjectId('662c0a37813fe61339781186'), 'type': 'company', 'domain': 'henderson', 'email': 'samuellittle@dunlap.biz', 'name': 'Parker-Barber', 'url': 'http://www.chandler-evans.net/tags/tag/wp-contentpost.htm', 'vatNumber': ['SZ21689-WJKB'], 'staff': [ObjectId('662c0a37813fe61339781187'), ObjectId('662c0a37813fe61339781188'), ObjectId('662c0a37813fe61339781189'), ObjectId('662c0a37813fe6133978118a')]}
 
 Pre - Name Change: Parker-Barber
Post - Name Change: Company Parker-Ba

### **Model2**: One document for “Person” with “Company” as embedded document.

In [None]:
#model2
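
Model2 is not implemented yet. As a starting point, the sketch below shows one way to build its documents in plain Python before inserting them. The helper name `build_model2_documents` and the embedded field name `company` are hypothetical choices for illustration, not part of the assignment.

```python
import copy


def build_model2_documents(company, persons):
    """Return one document per person, each carrying its own copy of the company.

    Hypothetical helper: the embedded field name 'company' is an assumption.
    """
    docs = []
    for person in persons:
        doc = dict(person)                       # do not mutate the caller's dict
        doc["company"] = copy.deepcopy(company)  # duplicated in every person
        docs.append(doc)
    return docs


# Placeholder data standing in for the Faker output used in Model1.
company = {"name": "Acme", "domain": "acme"}
persons = [
    {"firstName": "Ada", "secondName": "Lovelace"},
    {"firstName": "Alan", "secondName": "Turing"},
]
docs = build_model2_documents(company, persons)
```

A list like `docs` could then be passed to `collection.insert_many(docs)`. Because the company is duplicated in every person, renaming a company means touching every person document that embeds it.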


### **Model3**: One document for “Company” with “Person” as embedded documents.

In [None]:
#model3
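
Model3 is not implemented yet either. The sketch below builds one company document with its persons embedded as a list. The helper name `build_model3_document` and the choice to reuse the `staff` field for the embedded persons are assumptions made for illustration.

```python
def build_model3_document(company, persons):
    """Return a single company document with its persons embedded.

    Hypothetical helper: storing the embedded persons under 'staff'
    is an assumption that mirrors the field name used in Model1.
    """
    doc = dict(company)                              # shallow copy of the company fields
    doc["staff"] = [dict(person) for person in persons]
    return doc


# Placeholder data standing in for the Faker output used in Model1.
company = {"name": "Acme", "domain": "acme"}
persons = [
    {"firstName": "Ada", "secondName": "Lovelace", "age": 36},
    {"firstName": "Alan", "secondName": "Turing", "age": 41},
]
company_doc = build_model3_document(company, persons)
```

With this layout the number of employees is simply `len(company_doc["staff"])`, and each company is stored exactly once.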

## <a id='part2'>Part 2: Query Execution</a>
- **Objective**: Execute four queries and log run time for each model.
- **Tasks:**
  - Query1: For each person, retrieve their full name and their company’s name.
  - Query2: For each company, retrieve its name and the number of employees.
  - Query3: For each person born before 1988, update their age to “30”.
  - Query4: For each company, update its name to include the word “Company”.

In [None]:
#initialize a dataframe to hold all query run times for each model here
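
One possible way to collect the run times, sketched with the standard library only. The names `run_times` and `record_run_time` are hypothetical, and the lambda stands in for a real query method such as `Model1.query1`.

```python
import time

MODELS = ["model1", "model2", "model3"]
QUERIES = ["query1", "query2", "query3", "query4"]

# run_times[model][query] holds seconds, or None until that query has been timed.
run_times = {model: {query: None for query in QUERIES} for model in MODELS}


def record_run_time(model_name, query_name, query_fn):
    """Time one call to query_fn and store the elapsed seconds."""
    start = time.perf_counter()
    query_fn()
    elapsed = time.perf_counter() - start
    run_times[model_name][query_name] = elapsed
    return elapsed


# Stand-in workload; a real call would pass a query method instead.
record_run_time("model1", "query1", lambda: sum(range(1000)))
```

A nested dictionary of this shape can be handed to `pd.DataFrame(run_times)`, which yields one column per model and one row per query.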

### **Query1**: For each person, retrieve their full name and their company’s name.
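
For Model1, this query can also be expressed as a single aggregation pipeline so that the join runs on the server. The sketch below is an alternative to the per-person `find_one` loop, not the approach taken in the class above. The output field names `fullName` and `companyName` are assumptions.

```python
# Aggregation pipeline for Query1 on Model1 (both document types live in 'model1').
query1_pipeline = [
    # Keep only person documents.
    {"$match": {"type": "person"}},
    # Join each person with the company its 'worksIn' reference points to.
    {"$lookup": {
        "from": "model1",
        "localField": "worksIn",
        "foreignField": "_id",
        "as": "company",
    }},
    # $lookup returns an array; flatten it to a single embedded document.
    {"$unwind": "$company"},
    # Return only the two requested values.
    {"$project": {
        "_id": 0,
        "fullName": {"$concat": ["$firstName", " ", "$secondName"]},
        "companyName": "$company.name",
    }},
]
```

It would be run with `collection.aggregate(query1_pipeline)`, replacing one round trip per person with a single request.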


### **Query2**: For each company, retrieve its name and the number of employees.
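
For Model1, the employee count can be computed on the server from the `staff` array of references. The sketch below is one possible formulation, and the output field name `numberOfEmployees` is an assumption.

```python
# Aggregation pipeline for Query2 on Model1.
query2_pipeline = [
    # Keep only company documents.
    {"$match": {"type": "company"}},
    # Return the name and the length of the 'staff' array.
    {"$project": {
        "_id": 0,
        "name": 1,
        "numberOfEmployees": {"$size": "$staff"},
    }},
]
```

It would be run with `collection.aggregate(query2_pipeline)`. This avoids transferring the full list of staff ids to the client just to count them.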


### **Query3**: For each person born before 1988, update their age to “30”.
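
For Model1, the date condition can be pushed into the filter so that a single `update_many` call replaces the read-then-update loop. The sketch below is an alternative formulation under that assumption. The helper `matches_query3` is hypothetical and only mirrors the filter locally to show which documents it selects.

```python
import datetime

# Persons born strictly before 1 January 1988.
CUTOFF = datetime.datetime(1988, 1, 1)

query3_filter = {"type": "person", "dateOfBirth": {"$lt": CUTOFF}}
query3_update = {"$set": {"age": 30}}


def matches_query3(doc):
    """Local mirror of query3_filter, for illustration only."""
    return doc.get("type") == "person" and doc["dateOfBirth"] < CUTOFF


# Placeholder documents, not generated data.
older = {"type": "person", "dateOfBirth": datetime.datetime(1980, 5, 17)}
younger = {"type": "person", "dateOfBirth": datetime.datetime(1995, 2, 3)}
selected = [doc for doc in (older, younger) if matches_query3(doc)]
```

The server-side version would be `collection.update_many(query3_filter, query3_update)`.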

### **Query4**: For each company, update its name to include the word “Company”.
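
For Model1, the rename can be done in one request with an update that uses an aggregation pipeline, which lets the new value refer to the current `name`. This form requires MongoDB 4.2 or later. The sketch is an alternative to the per-company loop, not the approach taken in the class above.

```python
# Update every company document in one call.
query4_filter = {"type": "company"}

# Pipeline-style update: the new name is built from the existing one.
query4_update = [
    {"$set": {"name": {"$concat": ["Company ", "$name"]}}},
]
```

It would be run with `collection.update_many(query4_filter, query4_update)`. As with the loop version, running it twice prepends the word twice.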

## <a id='part3'>Part 3: Results & Discussion</a>
- **Objective**: Compare run times for query execution across the three models.
- **Tasks:** 
  - Question1: Order queries from best to worst for Q1. Which model performs best? Why?
  - Question2: Order queries from best to worst for Q2. Which model performs best? Why?
  - Question3: Order queries from best to worst for Q3. Which model performs best? Why?
  - Question4: Order queries from best to worst for Q4. Which model performs best? Why?
  - Question5: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?

In [None]:
#table of all the run times of queries goes here

### **Question1**: Order queries from best to worst for Q1. Which model performs best? Why?

### **Question2**: Order queries from best to worst for Q2. Which model performs best? Why?

### **Question3**: Order queries from best to worst for Q3. Which model performs best? Why?

### **Question4**: Order queries from best to worst for Q4. Which model performs best? Why?

### **Question5**: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?