## Best Practices

Let us go through some best practices for performing a batch load.
* Minimize the number of connections to the database.
* Avoid executing queries with hard-coded values; prefer bind variables.
* Avoid committing too often, as every commit incurs overhead.
* When loading a considerable amount of data, consider committing every 1,000 or 10,000 records, or even more, depending on the capacity of the database.
* Most mainstream databases support direct path I/O or a batch load, which might perform better than looping, inserting, and committing one row at a time.
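
As a minimal sketch of the bind-variable and commit-frequency points above, the following uses Python's built-in sqlite3 module with an in-memory database so that it runs without a MySQL server. That substitution is an assumption made purely for illustration; the names demo_connection, demo_cursor, sample_orders and insert_query, as well as the sample rows, are hypothetical. Note that sqlite3 uses `?` as its placeholder, whereas mysql.connector uses `%s`.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL here.
demo_connection = sqlite3.connect(":memory:")
demo_cursor = demo_connection.cursor()
demo_cursor.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, order_date TEXT, "
    "order_customer_id INTEGER, order_status TEXT)"
)

# Hypothetical sample rows, used only for illustration.
sample_orders = [
    (1, "2020-01-01 00:00:00", 101, "CLOSED"),
    (2, "2020-01-01 00:00:00", 102, "PENDING"),
    (3, "2020-01-02 00:00:00", 103, "COMPLETE"),
]

# One statement with bind variables, reused for every row,
# instead of building a new SQL string with hard-coded values per row.
insert_query = "INSERT INTO orders VALUES (?, ?, ?, ?)"
demo_cursor.executemany(insert_query, sample_orders)

# A single commit for the whole batch rather than one commit per row.
demo_connection.commit()

demo_cursor.execute("SELECT count(*) FROM orders")
loaded_count = demo_cursor.fetchone()[0]
print(loaded_count)
```

The same connection and cursor are reused for every statement, which is in line with keeping the number of database connections small.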

In [1]:
import mysql.connector as mc
from mysql.connector import errorcode as ec
import pandas as pd
import datetime

In [2]:
def get_connection(user, password, host, db):
    # Stays None if the connection attempt fails.
    connection = None
    try:
        connection = mc.connect(user=user,
                                password=password,
                                host=host,
                                database=db
                               )
    except mc.Error as error:
        if error.errno == ec.ER_ACCESS_DENIED_ERROR:
            print("Invalid Credentials")
        else:
            print(error)
    return connection

In [3]:
def get_cursor(connection):
    return connection.cursor()

In [4]:
def get_orders():
    orders_path = "/Users/itversity/Research/data/retail_db/orders/orders.csv"
    orders_schema = [
        "order_id",
        "order_date",
        "order_customer_id",
        "order_status"
    ]
    orders = pd.read_csv(
        orders_path,
        header=None,
        names=orders_schema
    )
    return orders

In [17]:
def load_orders(connection, cursor, query, orders):
    for idx, order in orders.iterrows():
        cursor.execute(query, (order.order_id, order.order_date, order.order_customer_id, order.order_status))
        # Baseline: commit after every single insert.
        connection.commit()

In [18]:
connection = get_connection('demo_user', 'itversity', 'localhost', 'demo_db')

In [19]:
cursor = get_cursor(connection)

In [20]:
orders = get_orders()
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

In [21]:
query = ("""INSERT INTO orders
         (order_id, order_date, order_customer_id, order_status)
         VALUES
         (%s, %s, %s, %s)""")

In [22]:
%%time
load_orders(connection, cursor, query, orders)

CPU times: user 24.4 s, sys: 2.3 s, total: 26.7 s
Wall time: 1min 17s


* Truncate the table, then reduce the commit frequency. The next version of load_orders commits only once, after all the rows have been inserted.
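
The truncate step is not shown in the notebook. One possible way to do it, sketched here with a hypothetical helper named truncate_orders, is to run a TRUNCATE TABLE statement through the existing cursor before each timed load.

```python
def truncate_orders(cursor):
    # Hypothetical helper: empties the orders table before a reload.
    # In MySQL, TRUNCATE TABLE causes an implicit commit,
    # so no explicit commit is issued here.
    cursor.execute("TRUNCATE TABLE orders")
```

With the cursor created by get_cursor, this would be called as truncate_orders(cursor) before invoking load_orders.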

In [5]:
def load_orders(connection, cursor, query, orders):
    print(datetime.datetime.now())
    for idx, order in orders.iterrows():
        cursor.execute(query, (order.order_id, order.order_date, order.order_customer_id, order.order_status))
    # Commit once, after all rows have been inserted.
    connection.commit()

In [12]:
connection = get_connection('demo_user', 'itversity', 'localhost', 'demo_db')

In [13]:
cursor = get_cursor(connection)

In [14]:
orders = get_orders()
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

In [15]:
query = ("""INSERT INTO orders
         (order_id, order_date, order_customer_id, order_status)
         VALUES
         (%s, %s, %s, %s)""")

In [16]:
%%time
load_orders(connection, cursor, query, orders)

2020-05-24 06:31:51.038273
CPU times: user 19.7 s, sys: 1.35 s, total: 21 s
Wall time: 30.1 s


* Commit every 1,000 records using batches. Make sure to truncate the table before invoking this version of the load_orders function.
* In this case, one insert statement is used to insert 1,000 records at a time. This is more efficient than issuing 1,000 statements for 1,000 records (one statement per record).

In [23]:
def load_orders(connection, cursor, query, orders):
    print(datetime.datetime.now())
    orders_batch = []
    for idx, order in orders.iterrows():
        orders_batch.append(tuple(order))
        # Insert and commit once 1,000 rows have accumulated.
        if len(orders_batch) == 1000:
            cursor.executemany(query, orders_batch)
            connection.commit()
            orders_batch = []
    # Load whatever is left in the final, partial batch.
    if orders_batch:
        cursor.executemany(query, orders_batch)
        connection.commit()

In [24]:
connection = get_connection('demo_user', 'itversity', 'localhost', 'demo_db')

In [25]:
cursor = get_cursor(connection)

In [26]:
orders = get_orders()
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

In [27]:
query = ("""INSERT INTO orders
         (order_id, order_date, order_customer_id, order_status)
         VALUES
         (%s, %s, %s, %s)""")

In [28]:
%%time
load_orders(connection, cursor, query, orders)

2020-05-24 06:40:57.857892
CPU times: user 9 s, sys: 142 ms, total: 9.15 s
Wall time: 10.2 s
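
The batching logic inside load_orders can also be separated from the database calls. The sketch below is one way to do that; the chunked generator and the fake_rows stand-in data are hypothetical names, not part of the original notebook. It shows how 68,883 rows split into batches of 1,000.

```python
def chunked(rows, batch_size):
    """Yield successive lists of at most batch_size items from rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Yield the final, partial batch only if it is not empty.
    if batch:
        yield batch


# Stand-in for the 68,883 order tuples; no database is needed here.
fake_rows = [(i,) for i in range(68883)]
batch_sizes = [len(batch) for batch in chunked(fake_rows, 1000)]

# 68 full batches of 1,000 rows plus one last batch of 883 rows.
print(len(batch_sizes), batch_sizes[-1])  # prints: 69 883
```

In the notebook, each batch produced this way could be passed to cursor.executemany(query, batch), followed by connection.commit(), which results in 69 commits for this data set instead of 68,883.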
