## <center>**MARIADB Databases : Implementation and Querying (Without Indexing)**</center>
This notebook contains the code to establish the databases in Mariadb and to perform the required queries and note their times.

According to my roll number, the csv files I picked for creating these databases were :
- A-100
- A-1000
- A-10000
- B-100-3-1
- B-100-5-2
- B-100-10-1
- B-1000-5-2
- B-1000-10-4
- B-1000-50-2
- B-10000-5-1
- B-10000-50-2
- B-10000-500-1


The 9 databases formed according to question were therefore:
- **A_100, B_100_3_1 (db1)**
- **A_100, B_100_5_2 (db2)**
- **A_100, B_100_10_1 (db3)**
- **A_1000, B_1000_5_2 (db4)**
- **A_1000, B_1000_10_4 (db5)**
- **A_1000, B_1000_50_2 (db6)**
- **A_10000, B_10000_5_1 (db7)**
- **A_10000, B_10000_50-2 (db8)**
- **A-10000, B_10000_500_1 (db9)**



We start off by importing the required packages. <br>
**NOTE:** You will have to first install the packages by uncommenting the lines below if they are not already installed on your system.

In [1]:
#!pip3 install mariadb
import mariadb
import pandas as pd
from mariadb import Error
from time import time
import numpy as np
import csv

#### **Function to create a database**
This function creates a connection to the specified database and if it doesn't already exist, it creates that database.<br>
**NOTE:** This will cerate the database in the same directory as the notebook. IF you want a different path, specify it while passing the argument to the create_database function. Also change the username and password as per requirement

In [31]:
def create_connection():
    
    connection = None
    try:
        connection = mariadb.connect(user="user1",
        password= "password1")
        c = connection.cursor()
        c.execute('CREATE DATABASE db1;')
        c.execute('CREATE DATABASE db2;')
        c.execute('CREATE DATABASE db3;')
        c.execute('CREATE DATABASE db4;')
        c.execute('CREATE DATABASE db5;')
        c.execute('CREATE DATABASE db6;')
        c.execute('CREATE DATABASE db7;')
        c.execute('CREATE DATABASE db8;')
        c.execute('CREATE DATABASE db9;')
    except Error as e:
        print(e)
    finally:
        if connection :
            connection.close()

In [32]:
create_connection()

db1
db2
db3
db4
db5
db6
db7
db8
db9


### <center>**Creating database tables and Importing data in db**</center>
#### **Function to create tables in database**
After creating the databases, next we create the tables inside the database and import data into those tables. Below is the function for that which takes as argument the database, and the two tables you want to import in that database .i.e. A and B. <br>
**NOTE:** Please take care while of the path while passing arguments to this function. While running, I had the csv files in the exact same directory. Mention the path correctly in case there is a different path. And take care of the username and password fields.

In [4]:
def create_tables(db, table1, table2):
    
    connection = mariadb.connect(user="user1",
    password= "password1",
    database= db)

    c = connection.cursor()
    
    c.execute('DROP TABLE if EXISTS B')
    c.execute('DROP TABLE if Exists A')

    c.execute('''CREATE TABLE A( A1 int,
                A2 text );''')

    c.execute('CREATE TABLE B( B1 int, B2 int, B3 varchar(255))')

    connection.commit()
    
    with open(table1, newline = '') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)
        for row in reader:
            A1 = int(row[0])
            A2 = row[1]
            c.execute(f'INSERT INTO A(A1, A2) VALUES("{A1}", "{A2}");')
    
    with open(table2, newline = '') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)
        for row in reader:
            B1 = int(row[0])
            B2 = int(row[1])
            B3 = row[2]
            c.execute(f'INSERT INTO B(B1, B2, B3) VALUES("{B1}", "{B2}", "{B3}");')
    
    connection.commit()
    connection.close()

#### **Database 1 - db1.db**

In [5]:
create_tables('db1', 'A-100.csv', 'B-100-3-1.csv')

#### **Database 2 - db2.db**

In [6]:
create_tables('db2', r'A-100.csv', r'B-100-5-2.csv')

#### **Database 3 - db3.db**

In [7]:
create_tables('db3', r'A-100.csv', r'B-100-10-1.csv')

#### **Database 4 - db4.db**

In [8]:
create_tables('db4', r'A-1000.csv', r'B-1000-5-2.csv')

#### **Database 5 - db5.db**

In [9]:
create_tables('db5', r'A-1000.csv', r'B-1000-10-4.csv')

#### **Database 6 - db6.db**

In [10]:
create_tables('db6', r'A-1000.csv', r'B-1000-50-2.csv')

#### **Database 7 - db7.db**

In [11]:
create_tables('db7', r'A-10000.csv', r'B-10000-5-1.csv')

#### **Database 8 - db8.db**

In [12]:
create_tables('db8', r'A-10000.csv', r'B-10000-50-2.csv')

#### **Database 9 - db9.db**

In [13]:
create_tables('db9', r'A-10000.csv', r'B-10000-500-1.csv')

### <center>**Querying the databases and finding time**</center>
#### **Function to run a query and report time**
This function loops 7 times and creates connection to each of the 9 databases one by one to execute the query passed on to it. <br>
*Since while executing on shell directly executes and then prints the query, so to counter the overhead the same thing has been followed while noting time. Data has been fetched and collected. Since the method has been appointed in all 4 versions, the overhead is cancelled*

In [2]:
def run_query(query):
    
    t = [[0] * 9] * 7
    for i in range(7):
        connection = mariadb.connect(user="user1", password= "password1", database = 'db1')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][0] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db2')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][1] = toc - tic        
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db3')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][2] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db4')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][3] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db5')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][4] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db6')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][5] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db7')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][6] = toc - tic
        connection.close()
        
        connection = mariadb.connect(user="user1", password= "password1", database = 'db8')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][7] = toc - tic
        connection.close()
                
        connection = mariadb.connect(user="user1", password= "password1", database = 'db9')
        c = connection.cursor()
        tic = time()
        c.execute(query)
        x = pd.DataFrame(c.fetchall())
        toc = time()
        #print(toc - tic)
        t[i][8] = toc - tic
        connection.close()
        
    return t

### **Defining queries and time arrays**

In [3]:
query1 = 'SELECT * FROM A WHERE A1 <= 50'
query2 = 'SELECT * FROM B ORDER BY B3'
query3 = 'SELECT AVG(X.COL) FROM (SELECT COUNT(B2) AS COL FROM B GROUP BY B2) AS X'
query4 = 'SELECT A2, B1, B2, B3 FROM A, B WHERE A.A1 = B.B2'

In [4]:
time_q1 = [[0] * 9] * 7
time_q2 = [[0] * 9] * 7
time_q3 = [[0] * 9] * 7
time_q4 = [[0] * 9] * 7

### **Query 1 across all databases**

In [5]:
time_q1 = np.round(run_query(query1), 6)
print(time_q1)

[[0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]
 [0.001313 0.001034 0.001014 0.001882 0.00181  0.001823 0.009714 0.009654
  0.009362]]


In [6]:
final_t1 = (np.sum(time_q1, axis = 0) - np.max(time_q1, axis = 0) - np.max(time_q1, axis = 0)) / 5
final_t1

array([0.001313, 0.001034, 0.001014, 0.001882, 0.00181 , 0.001823,
       0.009714, 0.009654, 0.009362])

### **Query 2 across all databases**

In [7]:
time_q2 = np.round(run_query(query2), 6)
print(time_q2)

[[9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]
 [9.7722000e-02 3.5650000e-03 4.7310000e-03 2.6086000e-02 6.2371000e-02
  2.0391900e-01 2.7257700e-01 2.2747610e+00 7.3418112e+01]]


In [8]:
final_t2 = (np.sum(time_q2, axis = 0) - np.max(time_q2, axis = 0) - np.max(time_q2, axis = 0)) / 5
final_t2

array([9.7722000e-02, 3.5650000e-03, 4.7310000e-03, 2.6086000e-02,
       6.2371000e-02, 2.0391900e-01, 2.7257700e-01, 2.2747610e+00,
       7.3418112e+01])

### **Query 3 across all databases**

In [9]:
time_q3 = np.round(run_query(query3), 6)
print(time_q3)

[[1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]
 [1.32600e-03 1.41200e-03 1.50600e-03 5.54400e-03 7.78700e-03 2.95570e-02
  4.77960e-02 2.77697e-01 2.34640e+00]]


In [10]:
final_t3 = (np.sum(time_q3, axis = 0) - np.max(time_q3, axis = 0) - np.max(time_q3, axis = 0)) / 5
final_t3

array([1.32600e-03, 1.41200e-03, 1.50600e-03, 5.54400e-03, 7.78700e-03,
       2.95570e-02, 4.77960e-02, 2.77697e-01, 2.34640e+00])

### **Query 4 across all databases**

In [None]:
time_q4 = np.round(run_query(query4), 6)
print(time_q4)

In [None]:
final_t4 = (np.sum(time_q4, axis = 0) - np.max(time_q4, axis = 0) - np.max(time_q4, axis = 0)) / 5
final_t4

In [None]:
print(time_q1)
print(time_q2)
print(time_q3)
print(time_q4)

### **Saving times collected to csv**

In [None]:
import csv
with open("t1.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q1)

In [None]:
with open("t2.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q2)

In [None]:
with open("t3.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q3)

In [None]:
with open("t4.csv", "w+") as my_csv:
    writer = csv.writer(my_csv, delimiter=',')
    writer.writerows(time_q4)