## <center>**MONGODB Databases : Implementation**</center>
This notebook contains the code to establish the databases in Mongodb and to perform the required queries and note their times.

According to my roll number, the csv files I picked for creating these databases were :
- A-100
- A-1000
- A-10000
- B-100-3-1
- B-100-5-2
- B-100-10-1
- B-1000-5-2
- B-1000-10-4
- B-1000-50-2
- B-10000-5-1
- B-10000-50-2
- B-10000-500-1


The 9 databases formed according to question were therefore:
- **A_100, B_100_3_1 (db1)**
- **A_100, B_100_5_2 (db2)**
- **A_100, B_100_10_1 (db3)**
- **A_1000, B_1000_5_2 (db4)**
- **A_1000, B_1000_10_4 (db5)**
- **A_1000, B_1000_50_2 (db6)**
- **A_10000, B_10000_5_1 (db7)**
- **A_10000, B_10000_50-2 (db8)**
- **A-10000, B_10000_500_1 (db9)**

Please install the given packages before running in case they dont already exist.

In [1]:
#!pip3 uninstall pymongo
#!pip install pymongo==2.8
import pymongo
import pandas as pd
from pymongo import MongoClient
from time import time
import numpy as np
import csv
import pprint

#### **Function to create a database**
This function creates a connection to the specified database and if it doesn't already exist, it creates that database.<br>
**NOTE:** This will cerate the database in the same directory as the notebook. IF you want a different path, specify it while passing the argument to the create_database function.

In [2]:
def create_connection(db):
    
    client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
    database = client[db]
    if client:
        print(db)
    #client.list_database_names()
    client.close()

In [3]:
create_connection('db1')
create_connection("db2")
create_connection("db3")
create_connection("db4")
create_connection("db5")
create_connection("db6")
create_connection("db7")
create_connection("db8")
create_connection("db9")

db1
db2
db3
db4
db5
db6
db7
db8
db9


### <center>**Creating collections and Importing data in db**</center>
#### **Function to create collections in database**
After creating the databases, next we create the collections inside the database and import data into those collections. Below is the function for that which takes as argument the database, and the two collections you want to import in that database .i.e. A and B. <br>
**NOTE:** Please take care while of the path while passing argumwnts to this function. While running, I had the csv files in the exact same directory. Mention the path correctly in case there is a different path.

In [4]:
def create_tables(db, table1, table2):
    
    client = MongoClient("mongodb://127.0.0.1:27017/")
    database = client[db]
    
    colA = database["A"]
    colB = database["B"]
    
    with open(table1, newline = '') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)
        for row in reader:
            dic = {"A1" : int(row[0]), "A2" : row[1]}
            colA.insert(dic)
            #print(x.insert_id)
    
    with open(table2, newline = '') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)
        for row in reader:
            dic = {"B1" : int(row[0]), "B2" : int(row[1]), "B3" : row[2]}
            colB.insert(dic)
            #print(x.insert_id)
    client.close()

#### **Database 1 - db1.db**

In [12]:
create_tables('db1', 'A-100.csv', 'B-100-3-1.csv')

#### **Database 2 - db2.db**

In [13]:
create_tables('db2', 'A-100.csv', 'B-100-5-2.csv')

#### **Database 3 - db3.db**

In [14]:
create_tables('db3', 'A-100.csv', 'B-100-10-1.csv')

#### **Database 4 - db4.db**

In [15]:
create_tables('db4', 'A-1000.csv', 'B-1000-5-2.csv')

#### **Database 5 - db5.db**

In [16]:
create_tables('db5', 'A-1000.csv', 'B-1000-10-4.csv')

#### **Database 6 - db6.db**

In [17]:
create_tables('db6', 'A-1000.csv', 'B-1000-50-2.csv')

#### **Database 7 - db7.db**

In [18]:
create_tables('db7', 'A-10000.csv', 'B-10000-5-1.csv')

#### **Database 8 - db8.db**

In [19]:
create_tables('db8', 'A-10000.csv', 'B-10000-50-2.csv')

#### **Database 9 - db9.db**

In [20]:
create_tables('db9', 'A-10000.csv', 'B-10000-500-1.csv')

### <center>**Querying the databases and finding time**</center>
#### **Function to run a query and report time**
This function loops 7 times and creates connection to each of the 9 databases one by one to execute the query passed on to it. <br>
*Since while executing on shell directly executes and then prints the query, so to counter the overhead the same thing has been followed while noting time. Data has been fetched and collected. Since the method has been appointed in all 4 versions, the overhead is cancelled*

In [6]:
def run_query(query):
    
    t = [[0] * 9] * 7
    for i in range(7):
        client = MongoClient("mongodb://localhost:27017")
        database = client['db1']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][0] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][0] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db2']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][1] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][1] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db3']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][2] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][2] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db4']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][3] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][3] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db5']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][4] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][4] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db6']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][5] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][5] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db7']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][6] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][6] = toc - tic
        client.close()
        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db8']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][7] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][7] = toc - tic
        client.close()        
        
        client = MongoClient("mongodb://localhost:27017")
        database = client['db9']
        colA = database["A"]
        colB = database["B"]
        if query == query1:
            tic = time()
            cursor = colA.aggregate(query)
            toc = time()
            t[i][8] = toc - tic
        else:
            tic = time()
            cursor = colB.aggregate(query, allowDiskUse = True)
            toc = time()
            t[i][8] = toc - tic
        client.close()
    return t

### **Defining queries and time arrays**

In [11]:
query1 = [{ "$match" : { "A1" : { "$lte" :  50 }}}]
query2 = [{ "$sort" : { "B3" : 1 }}]
query3 = [{ "$group" : { '_id' : "$B2", "Average" : { "$sum" : 1 }}}, { "$group" : { '_id' : "null", "Avg per A1" : { "$avg" : "$Average" }}}]
query4 = [{ "$lookup" : { 'from' : "A", 'localField' : "B2", 'foreignField' : "A1", 'as' : "nA"}}, {"$unwind" : "$nA"}, { "$project":{"nA.A2" : 1, 'B1' : 1,'B2' : 1, 'B3' : 1}}]

In [12]:
time_q1 = [[0] * 9] * 7
time_q2 = [[0] * 9] * 7
time_q3 = [[0] * 9] * 7
time_q4 = [[0] * 9] * 7

### **Query 1 across all databases**

In [13]:
time_q1 = np.round(run_query(query1), 6)
print(time_q1)

[[0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]]


In [14]:
final_t1 = (np.sum(time_q1, axis = 0) - np.max(time_q1, axis = 0) - np.max(time_q1, axis = 0)) / 5
final_t1

array([0.00188 , 0.011152, 0.013667, 0.007534, 0.003017, 0.002637,
       0.009667, 0.015387, 0.014923])

### **Query 2 across all databases**

In [15]:
time_q2 = np.round(run_query(query2), 6)
print(time_q2)

[[5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]]


In [16]:
final_t2 = (np.sum(time_q2, axis = 0) - np.max(time_q2, axis = 0) - np.max(time_q2, axis = 0)) / 5
final_t2

array([5.002000e-03, 4.111000e-03, 3.882000e-03, 9.643000e-03,
       1.548900e-02, 5.661100e-02, 7.358300e-02, 6.104810e-01,
       8.235404e+00])

### **Query 3 across all databases**

In [17]:
time_q3 = np.round(run_query(query3), 6)
print(time_q3)

[[5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]
 [5.518000e-03 1.032800e-02 3.473000e-03 8.393000e-03 1.654300e-02
  5.174300e-02 8.173800e-02 5.198410e-01 4.160537e+00]]


In [18]:
final_t3 = (np.sum(time_q3, axis = 0) - np.max(time_q3, axis = 0) - np.max(time_q3, axis = 0)) / 5
final_t3

array([5.518000e-03, 1.032800e-02, 3.473000e-03, 8.393000e-03,
       1.654300e-02, 5.174300e-02, 8.173800e-02, 5.198410e-01,
       4.160537e+00])

### **Query 4 across all databases**

In [19]:
time_q4 = np.round(run_query(query4), 6)
print(time_q4)

[[0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]
 [0.023502 0.029408 0.019482 0.09086  0.110093 0.165954 1.240149 1.019666
  0.85443 ]]


In [20]:
final_t4 = (np.sum(time_q4, axis = 0) - np.max(time_q4, axis = 0) - np.max(time_q4, axis = 0)) / 5
final_t4

array([0.023502, 0.029408, 0.019482, 0.09086 , 0.110093, 0.165954,
       1.240149, 1.019666, 0.85443 ])

In [21]:
print(time_q1)
print(time_q2)
print(time_q3)
print(time_q4)

[[0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]
 [0.00188  0.011152 0.013667 0.007534 0.003017 0.002637 0.009667 0.015387
  0.014923]]
[[5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.882000e-03 9.643000e-03 1.548900e-02
  5.661100e-02 7.358300e-02 6.104810e-01 8.235404e+00]
 [5.002000e-03 4.111000e-03 3.8

### **Saving times collected to csv**

In [22]:
import csv
with open("t1.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q1)

In [23]:
with open("t2.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q2)

In [24]:
with open("t3.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q3)

In [25]:
with open("t4.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q4)