![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Big Data Management - Assignment 1
## Document Stores

### by Luis Francisco Alvarez Poli, Mikel Gallo, Clarice Mottet

0. **[Part 0: Set Up](#part0)**
- **Objective**: Initialize programming environment.

1. **[Part 1: Model Creation](#part1)**
- **Objective**: Create three modeling alternatives using MongoDB.
- **Tasks:**
  - Model1: Two types of documents, one per class, linked by referenced fields.
  - Model2: One document for “Person” with “Company” as an embedded document.
  - Model3: One document for “Company” with “Person” as embedded documents.

2. **[Part 2: Query Execution](#part2)**
- **Objective**: Execute four queries and log run time for each model.
- **Tasks:**
  - Query1: For each person, retrieve their full name and their company’s name.
  - Query2: For each company, retrieve its name and the number of employees.
  - Query3: For each person born before 1988, update their age to “30”.
  - Query4: For each company, update its name to include the word “Company”.

3. **[Part 3: Results & Discussion](#part3)**
- **Objective**: Compare run times for query execution across the three models.
- **Tasks:** 
  - Question1: Order queries from best to worst for Q1. Which model performs best? Why?
  - Question2: Order queries from best to worst for Q2. Which model performs best? Why?
  - Question3: Order queries from best to worst for Q3. Which model performs best? Why?
  - Question4: Order queries from best to worst for Q4. Which model performs best? Why?
  - Question5: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?


## <a id='part0'>Part 0: Set Up</a>
- **Objective**: Initialize programming environment.

In [88]:
#libraries
import datetime
import time
import json
from pymongo import MongoClient
from bson.objectid import ObjectId
from faker import Faker
import pandas as pd
import numpy as np

#global variables
NUMBER_OF_COMPANIES = 2
NUMBER_OF_EMPLOYEES = 4

## <a id='part1'>Part 1: Model Creation</a>
- **Objective**: Create three modeling alternatives using MongoDB.
- **Tasks:**
  - Model1: Two types of documents, one per class, linked by referenced fields.
  - Model2: One document for “Person” with “Company” as an embedded document.
  - Model3: One document for “Company” with “Person” as embedded documents.

### **Model1**: Two types of documents, one per class, linked by referenced fields.

In [93]:
#Model1

class Model1:
    #initialize the collection to hold two types of documents
    def __init__(self, host='127.0.0.1', port=27017, dbname='test'):
        self.client = MongoClient(host, port)
        self.db = self.client[dbname]
        self.db.drop_collection("model1")
        self.collection = self.db.create_collection('model1')

    #create a function that generates data and stores it in the collection
    def data_generator(self, n_company, n_person):
        #create sample data
        fake = Faker(['en_US'])

        #create sample company data
        for c in range(n_company):
            company = {
                'type': 'company',
                'domain': fake.domain_word(),
                'email': fake.ascii_company_email(),
                'name': fake.company(),
                'url': fake.uri(),
                'vatNumber': fake.nic_handles(),
                'staff': []
            }
            company_id = self.collection.insert_one(company).inserted_id
            list_persons = []
            
            #create sample staff data
            for p in range(n_person):
                today = datetime.date.today()
                dob = pd.to_datetime(fake.date_of_birth(minimum_age = 18, maximum_age = 67))
                age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
                person = {
                    'type': 'person',
                    'age': age,
                    'companyEmail': fake.ascii_company_email(),
                    'dateOfBirth': dob,
                    'email': fake.email(),
                    'firstName': fake.first_name(),
                    'secondName': fake.last_name(),
                    'job': fake.job(),
                    'worksIn': company_id
                }
                list_persons.append(person)
            
            inserted_persons = self.collection.insert_many(list_persons)
            self.collection.update_one(
                {'_id': company_id},
                {'$set': {'staff': inserted_persons.inserted_ids}}
            )

    #query1: prints full name and company for all persons
    def query1(self):
        start_time = time.time()
        for person in self.collection.find({"type": "person"}):
            full_name = person['firstName'] + ' ' + person['secondName']
            company = self.collection.find_one({'_id':person['worksIn']})
            company_name = company['name']
            print(" ")
            print("Full Name:",full_name)
            print("Company:",company_name)
        end_time = time.time()
        run_time = end_time - start_time
        return run_time
    
    #query2: prints the name and number of employees for all companies
    def query2(self):
        start_time = time.time()
        for company in self.collection.find({"type": "company"}):
            company_name = company['name']
            number_of_employees = len(company['staff'])
            print(" ")
            print("Company:",company_name)
            print("Number of Employees:",number_of_employees)
        end_time = time.time()
        run_time = end_time - start_time
        return run_time
        
    #query3: update the age to be 30 for all persons whose date of birth is before 1988-01-01
    def query3(self):
        start_time = time.time()
        for person in self.collection.find({"type": "person"}):
            dob = person['dateOfBirth']
            if dob < pd.to_datetime('1988-01-01'):
                print(" ")
                print(" Pre - Age Change:",person['age'])
                result = self.collection.update_one(
                            {'_id': person['_id']},
                            {'$set': {'age': 30}}
                        )
                person_ = self.collection.find_one({'_id':person['_id']})
                print("Post - Age Change:",person_['age'])
        end_time = time.time()
        run_time = end_time - start_time
        return run_time

    #query4: update the company name to include the word "Company"    
    def query4(self):
        start_time = time.time()
        for company in self.collection.find({"type": "company"}):
            company_name = company['name']
            print(" ")
            print(" Pre - Name Change:",company_name)
            result = self.collection.update_one(
                        {'_id': company['_id']},
                        {'$set': {'name': "Company "+company_name}}
                    )
            company_ = self.collection.find_one({'_id':company['_id']})
            print("Post - Name Change:",company_['name'])
        end_time = time.time()
        run_time = end_time - start_time
        return run_time
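
The Model1 queries above loop over documents in Python and make one extra round trip per document (a `find_one` per person in `query1`, an `update_one` plus a `find_one` per match in `query3` and `query4`). For comparison, the sketch below expresses the same work as server-side operations. It is not part of the assignment code: the helper names `query1_pipeline`, `query3_bulk`, and `query4_bulk` are hypothetical, the collection name `model1` is taken from the class above, and the pipeline-style update in `query4_bulk` assumes MongoDB 4.2 or newer.

```python
import datetime

# Cutoff used by Query3: persons born before this date get age 30.
CUTOFF = datetime.datetime(1988, 1, 1)


def query1_pipeline(collection_name="model1"):
    """Build an aggregation pipeline that joins each person to their company."""
    return [
        {"$match": {"type": "person"}},
        # Self-join: company documents live in the same collection in Model1.
        {"$lookup": {
            "from": collection_name,
            "localField": "worksIn",
            "foreignField": "_id",
            "as": "company",
        }},
        {"$unwind": "$company"},
        {"$project": {
            "_id": 0,
            "fullName": {"$concat": ["$firstName", " ", "$secondName"]},
            "companyName": "$company.name",
        }},
    ]


def query3_bulk(collection):
    """Set age to 30 for every person born before CUTOFF in one request."""
    result = collection.update_many(
        {"type": "person", "dateOfBirth": {"$lt": CUTOFF}},
        {"$set": {"age": 30}},
    )
    return result.modified_count


def query4_bulk(collection):
    """Prefix every company name with 'Company ' in one request.

    Assumption: the server supports aggregation-pipeline updates (MongoDB 4.2+).
    """
    result = collection.update_many(
        {"type": "company"},
        [{"$set": {"name": {"$concat": ["Company ", "$name"]}}}],
    )
    return result.modified_count
```

With a live collection these could be called as `list(model1.collection.aggregate(query1_pipeline()))` and `query3_bulk(model1.collection)`. Run times measured this way would not be directly comparable with the loop-based timings reported later, since the amount of client-side work differs.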


### **Model2**: One document for “Person” with “Company” as an embedded document.

In [None]:
#Model2
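
The Model2 cell is still empty. As a rough illustration of the document shape the task describes, the sketch below embeds a copy of the company inside each person document. It assumes the same field names as Model1; the helper names `to_model2_document` and `model2_query2_pipeline` are hypothetical and not part of the assignment code, and the sample values are placeholders.

```python
import copy


def to_model2_document(person, company):
    """Return a person document with its company embedded under 'worksIn'."""
    doc = copy.deepcopy(person)
    doc.pop("type", None)  # only one kind of document exists in Model2
    # Copy the company fields, leaving out Model1 bookkeeping fields.
    doc["worksIn"] = {
        key: copy.deepcopy(value)
        for key, value in company.items()
        if key not in ("_id", "type", "staff")
    }
    return doc


def model2_query2_pipeline():
    """Count employees per company by grouping on the embedded company name.

    Assumption: company names are unique; otherwise group on another field
    such as 'worksIn.vatNumber'.
    """
    return [{"$group": {"_id": "$worksIn.name", "employees": {"$sum": 1}}}]


# Placeholder sample values, only to show the resulting shape.
person = {"type": "person", "firstName": "Lance", "secondName": "Jenkins", "age": 40}
company = {"type": "company", "name": "Mayer, Rice and Campbell", "staff": []}
doc = to_model2_document(person, company)
print(doc["firstName"], doc["secondName"], "works in", doc["worksIn"]["name"])
```

In this shape Query1 needs no second lookup, while Query4 has to rewrite the company name in every person document that embeds it.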


### **Model3**: One document for “Company” with “Person” as embedded documents.

In [None]:
#Model3
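
The Model3 cell is also empty. The sketch below shows one possible shape for it: a company document whose `staff` array holds full person documents rather than references. The helper names `to_model3_document` and `model3_query3` are hypothetical, the field names are assumed to match Model1, and the `array_filters` update assumes MongoDB 3.6 or newer.

```python
import copy
import datetime


def to_model3_document(company, persons):
    """Return a company document with its persons embedded in 'staff'."""
    doc = {k: copy.deepcopy(v) for k, v in company.items() if k != "type"}
    doc["staff"] = [
        # The back-reference 'worksIn' is redundant once the person is embedded.
        {k: copy.deepcopy(v) for k, v in p.items() if k not in ("type", "worksIn")}
        for p in persons
    ]
    return doc


def model3_query3(collection, cutoff=datetime.datetime(1988, 1, 1)):
    """Set age to 30 for every embedded person born before the cutoff.

    Returns the number of company documents modified, not the number of persons.
    """
    result = collection.update_many(
        {"staff.dateOfBirth": {"$lt": cutoff}},
        {"$set": {"staff.$[p].age": 30}},
        array_filters=[{"p.dateOfBirth": {"$lt": cutoff}}],
    )
    return result.modified_count
```

In this shape Query2 reduces to reading `len(company['staff'])`, as in Model1, while Query3 has to reach into an array inside each company document.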


## <a id='part2'>Part 2: Query Execution</a>
- **Objective**: Execute four queries and log run time for each model.
- **Tasks:**
  - Query1: For each person, retrieve their full name and their company’s name.
  - Query2: For each company, retrieve its name and the number of employees.
  - Query3: For each person born before 1988, update their age to “30”.
  - Query4: For each company, update its name to include the word “Company”.
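
One caveat when logging run times: the query methods print inside the timed region, so terminal output is included in each measurement, and a single `time.time()` reading can be noisy at millisecond scale. The helper below is a minimal sketch of an alternative, not the method used in this notebook. The name `time_query` and its parameters are hypothetical; it uses `time.perf_counter()`, repeats the call, and can capture printed output while timing.

```python
import contextlib
import io
import time


def time_query(query_fn, repeats=3, silence_output=True):
    """Run query_fn several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        # Capture print() output in memory so terminal I/O does not dominate the timing.
        if silence_output:
            sink = contextlib.redirect_stdout(io.StringIO())
        else:
            sink = contextlib.nullcontext()
        with sink:
            start = time.perf_counter()
            query_fn()
            timings.append(time.perf_counter() - start)
    return min(timings)
```

A call such as `time_query(model1.query1)` would then time the read query. The update queries change the data on their first run, so they would need `repeats=1` or freshly generated data before each repeat.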

In [94]:
#Initialize models and data generation

#Model1
model1 = Model1()
model1.data_generator(NUMBER_OF_COMPANIES, NUMBER_OF_EMPLOYEES)

#Model2

#Model3


### **Query1**: For each person, retrieve their full name and their company’s name.


In [95]:
#Model1 - Query1
print("Model1 - Query1==========")
q1_run_time = model1.query1()

#Model2 - Query1
print("Model2 - Query1==========")
#

#Model3 - Query1
print("Model3 - Query1==========")
#


 
Full Name: Lance Jenkins
Company: Mayer, Rice and Campbell
 
Full Name: Wayne Allen
Company: Mayer, Rice and Campbell
 
Full Name: Melinda Moore
Company: Mayer, Rice and Campbell
 
Full Name: Haley Jackson
Company: Mayer, Rice and Campbell
 
Full Name: Theresa Thomas
Company: Stark Group
 
Full Name: Johnathan Rogers
Company: Stark Group
 
Full Name: Mark Rowe
Company: Stark Group
 
Full Name: Kelly Campos
Company: Stark Group


### **Query2**: For each company, retrieve its name and the number of employees.


In [96]:
#Model1 - Query2
print("Model1 - Query2==========")
q2_run_time = model1.query2()

#Model2 - Query2
print("Model2 - Query2==========")
#

#Model3 - Query2
print("Model3 - Query2==========")
#


 
Company: Mayer, Rice and Campbell
Number of Employees: 4
 
Company: Stark Group
Number of Employees: 4


### **Query3**: For each person born before 1988, update their age to “30”.

In [97]:
#Model1 - Query3
print("Model1 - Query3==========")
q3_run_time = model1.query3()

#Model2 - Query3
print("Model2 - Query3==========")
#

#Model3 - Query3
print("Model3 - Query3==========")
#


 
 Pre - Age Change: 64
Post - Age Change: 30
 
 Pre - Age Change: 66
Post - Age Change: 30
 
 Pre - Age Change: 46
Post - Age Change: 30
 
 Pre - Age Change: 51
Post - Age Change: 30
 
 Pre - Age Change: 46
Post - Age Change: 30
 
 Pre - Age Change: 45
Post - Age Change: 30


### **Query4**: For each company, update its name to include the word “Company”.

In [98]:
#Model1 - Query4
print("Model1 - Query4==========")
q4_run_time = model1.query4()

#Model2 - Query4
print("Model2 - Query4==========")
#

#Model3 - Query4
print("Model3 - Query4==========")
#


 
 Pre - Name Change: Mayer, Rice and Campbell
Post - Name Change: Company Mayer, Rice and Campbell
 
 Pre - Name Change: Stark Group
Post - Name Change: Company Stark Group


### Compile run time results into a dataframe for presentation

In [101]:
#compile run times for Model1
df_m1_run_times = pd.DataFrame([['Model1',q1_run_time, q2_run_time, q3_run_time, q4_run_time]], columns = ['Model','Q1_run_time','Q2_run_time','Q3_run_time','Q4_run_time'])
print(df_m1_run_times)



    Model  Q1_run_time  Q2_run_time  Q3_run_time  Q4_run_time
0  Model1       0.0067     0.001056       0.0126     0.004297


## <a id='part3'>Part 3: Results & Discussion</a>
- **Objective**: Compare run times for query execution across the three models.
- **Tasks:** 
  - Question1: Order queries from best to worst for Q1. Which model performs best? Why?
  - Question2: Order queries from best to worst for Q2. Which model performs best? Why?
  - Question3: Order queries from best to worst for Q3. Which model performs best? Why?
  - Question4: Order queries from best to worst for Q4. Which model performs best? Why?
  - Question5: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?

In [None]:
#table of all the run times of queries goes here
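
Once all three models have been timed, the ordering asked for in Questions 1 to 4 can be derived mechanically. The sketch below is one way to do that with plain dictionaries. The function name `rank_models` is hypothetical; only the Model1 values are real (copied from the output above), and the Model2 and Model3 entries are left empty because they have not been measured.

```python
def rank_models(run_times, query):
    """Return model names sorted from fastest to slowest for one query.

    Models with no measurement for that query are skipped.
    """
    measured = {
        model: times[query]
        for model, times in run_times.items()
        if times.get(query) is not None
    }
    return sorted(measured, key=measured.get)


run_times = {
    # Measured values from the Model1 run above.
    "Model1": {"Q1": 0.0067, "Q2": 0.001056, "Q3": 0.0126, "Q4": 0.004297},
    # Not measured yet: fill in after implementing Model2 and Model3.
    "Model2": {},
    "Model3": {},
}

for query in ("Q1", "Q2", "Q3", "Q4"):
    print(query, rank_models(run_times, query))
```

With only Model1 measured, each line lists just `['Model1']`; the ranking becomes informative once the other two rows are filled in.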

### **Question1**: Order queries from best to worst for Q1. Which model performs best? Why?

### **Question2**: Order queries from best to worst for Q2. Which model performs best? Why?

### **Question3**: Order queries from best to worst for Q3. Which model performs best? Why?

### **Question4**: Order queries from best to worst for Q4. Which model performs best? Why?

### **Question5**: What are your conclusions about denormalization or normalization of data in MongoDB? In the case of updates, which offers better performance?