<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
    MySQL and MariaDB for Python Developers
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:55%; left:10%;">
    David Mertz, Ph.D.
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:62%; left:10%;">
    Data Scientist
</h3>
</div>

# MySQL and MariaDB adapters for Python

Most Python developers who access MySQL databases use the `mysql.connector` driver/adapter that is provided by Oracle and the MySQL project.  It is well tested, reliable, and conforms well to the DB-API 2.0 standard.  

However, there are several other notable adapters that you may want to consider for your particular use case.

* `PyMySQL` is a pure-Python implementation that complies with DB-API 2.0.  In the base version, it has no dependencies outside of the standard library.  However, if you wish to use the authentication methods "sha256_password" or "caching_sha2_password", you will need an extra module.  If you work in a context where pure-Python is required, the base version suffices; the extra capabilities are an optional add-on.  Both can be installed with:

```bash
pip install PyMySQL[rsa]
```

* `aiomysql` is an `asyncio`-friendly modification of `PyMySQL` that allows you to use MySQL in asynchronous programs.  General programming styles and benefits of async are discussed in another INE course.  In a sentence though, for high-performance I/O-bound programs an asynchronous approach can be *vastly* faster than a thread-based or single-threaded one.  `aiomysql` is *mostly* similar to the DB-API, but because of the nature of asynchronous programming, some differences arise.

* `trio_mysql` is very similar, in concept, to `aiomysql`, but it wraps `PyMySQL` for use with the `Trio` async library rather than the Python standard library `asyncio`.  Trio certainly has advantages over `asyncio`, but they are far afield of this course, and hence examples of `trio_mysql` will not be presented in this lesson.

* `tormysql` is yet another asynchronous driver for MySQL.  It claims to be the highest performing driver, but I have not benchmarked it myself.  While `tormysql` can be used with `asyncio`, it focuses on another 3rd-party asynchronous library called `Tornado`.  Tornado likewise has good features, but they are outside the scope of this course.

## Object-relational mapping

Not covered in this course are object-relational mapping (ORM) libraries that convert SQL and tuple-level interfaces with MySQL into Python method calls.  In particular, `SQLAlchemy` is very popular among many Python developers.  Personally, I find the extra layer between code and database a distraction rather than a benefit.

In any event though, `SQLAlchemy` relies on the actual MySQL adapters we discuss in this lesson.  The abstractions it provides are not MySQL specific, but generic patterns for accessing RDBMS data.

## Pure Python

Let us use the `PyMySQL` adapter to load some data into our database.  Two of the enhancements in PyMySQL are more flexible cursors and providing connections and cursors as context managers.  Let us illustrate both of those in these examples.  In general, other than enhancements we might use, `PyMySQL` should be a drop-in replacement for `mysql.connector` (likely somewhat slower in pure-Python).

In [1]:
import pymysql
print(f"PyMySQL version | {pymysql.__version__}")
print(f"API level       | {pymysql.apilevel}")
print(f"Parameter style | {pymysql.paramstyle}")
print(f"Thread safety   | {pymysql.threadsafety}")

PyMySQL version | 0.9.3
API level       | 2.0
Parameter style | pyformat
Thread safety   | 1


We need credentials and host/port settings for any of these adapters, of course.

In [2]:
# `port` MUST BE an int not a string representing int for pymysql
credentials = dict(host='localhost',
                   user='ine_student',
                   password='ine-password',
                   database='ine',
                   port=3306)
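
Since `port` must be an integer, a value read from the environment or a configuration file needs an explicit conversion.  The following is a minimal sketch, assuming hypothetical environment variable names (`MYSQL_HOST`, `MYSQL_PORT`, and so on) that are not part of the course setup:

```python
import os

def credentials_from_env(environ=os.environ):
    """Build a credentials dict, converting the port to an int."""
    return dict(
        host=environ.get("MYSQL_HOST", "localhost"),
        user=environ.get("MYSQL_USER", "ine_student"),
        password=environ.get("MYSQL_PASSWORD", "ine-password"),
        database=environ.get("MYSQL_DATABASE", "ine"),
        # Environment values are always strings; the adapter wants an int
        port=int(environ.get("MYSQL_PORT", "3306")),
    )

# Hypothetical override, passed as a plain dict instead of os.environ
example = credentials_from_env({"MYSQL_PORT": "3307"})
print(example["port"], type(example["port"]).__name__)
```

The defaults simply mirror the values used in this lesson.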

In [3]:
from pprint import pprint
from pymysql.cursors import DictCursor

# Note: in PyMySQL 0.9.x (used here), `with connection` yields a cursor;
# from version 1.0 on it yields the connection itself
with pymysql.connect(charset='utf8', cursorclass=DictCursor, **credentials) as cur:
    cur.execute("SELECT username, created_on FROM users;")
    data = cur.fetchall()
        
pprint(data)

[{'created_on': datetime.datetime(2021, 1, 9, 22, 19, 3), 'username': 'Alice'},
 {'created_on': datetime.datetime(2021, 1, 9, 22, 19, 3), 'username': 'Bob'},
 {'created_on': datetime.datetime(2021, 1, 9, 22, 19, 3), 'username': 'Carlos'},
 {'created_on': datetime.datetime(2021, 1, 9, 22, 32, 16), 'username': 'Sybil'},
 {'created_on': datetime.datetime(2021, 1, 9, 22, 32, 16), 'username': 'Trudy'},
 {'created_on': datetime.datetime(2021, 1, 9, 22, 32, 16), 'username': 'Vanna'}]
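
`DictCursor` returns each row as a dictionary keyed by column name.  The idea can be sketched with any DB-API cursor by zipping the column names from `cursor.description` with each row tuple.  The sketch below uses the standard library `sqlite3` purely as a stand-in so that it runs without a server (note its `?` placeholders rather than `%s`).  It illustrates the concept only and is not PyMySQL's actual implementation; the table contents are made up.

```python
import sqlite3

# Stand-in database: in-memory SQLite, NOT MySQL
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE users (username TEXT, created_on TEXT)")
demo_cur.executemany("INSERT INTO users VALUES (?, ?)",
                     [("Alice", "2021-01-09"), ("Bob", "2021-01-09")])

demo_cur.execute("SELECT username, created_on FROM users ORDER BY username")
# DB-API: the first item of each `description` entry is the column name
columns = [col[0] for col in demo_cur.description]
dict_rows = [dict(zip(columns, row)) for row in demo_cur.fetchall()]
demo_conn.close()
print(dict_rows)
```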


## Deciding on schemata

As some data to load into the database, let us take some information on United States zip codes published by the U.S. Census Bureau.  We have two source files available: one tab-separated file that gives explanations of column names, and another that gives information per zip code.

In [4]:
!head -5 ../data/census-zipcodes-2018.fields

USPS	United States Postal Service State Abbreviation
GEOID	Geographic Identifier - fully concatenated geographic code (State FIPS and district number)
ALAND	Land Area (square meters) - Created for statistical purposes only
AWATER	Water Area (square meters) - Created for statistical purposes only
ALAND_SQMI	Land Area (square miles) - Created for statistical purposes only


In [5]:
!head -5 ../data/census-zipcodes-2018.tsv

USPS	ALAND	AWATER	ALAND_SQMI	AWATER_SQMI	INTPTLAT	INTPTLONG
00601	166659749	799292	64.348	0.309	18.180555	-66.749961
00602	79307535	4428429	30.621	1.71	18.361945	-67.175597
00603	81887185	181411	31.617	0.07	18.455183	-67.119887
00606	109579993	12487	42.309	0.005	18.158327	-66.932928


Before putting the data into tables, we should decide on good table layouts.  The field key is relatively straightforward.  MySQL is relatively strict in its permitted SQL syntax; for example, a column named `key` is not permitted unless it is quoted with backticks (because `KEY` is a reserved word), although some other RDBMSs allow this.  Here we declare `PRIMARY KEY` as a separate table-level clause; MySQL also accepts it as an attribute on the column definition.

In [6]:
conn = pymysql.connect(**credentials)
with conn.cursor() as cur:
    cur.execute('DROP TABLE IF EXISTS census_zipcode_fields;')
    sql_census_fields = """
    CREATE TABLE census_zipcode_fields (
      fieldname VARCHAR(15) NOT NULL,
      description TEXT,
      PRIMARY KEY (fieldname)
    );
    """
    cur.execute(sql_census_fields)
    conn.commit()

The data set describing fields is small, and can easily be read into memory.

In [7]:
with open('../data/census-zipcodes-2018.fields') as fields:
    rows = [tuple(line.strip().split('\t')) for line in fields]

sql = "INSERT INTO census_zipcode_fields VALUES (%s, %s)"

conn = pymysql.connect(**credentials)
with conn.cursor() as cur:
    cur.executemany(sql, rows)
    conn.commit()
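
Splitting on tabs works for this file.  If a line could be blank or a description empty, `line.strip().split('\t')` would yield a tuple of the wrong length, and the standard library `csv` module is somewhat more forgiving.  The following is a sketch of that alternative, run against two sample lines copied from the file head shown earlier rather than against the file itself:

```python
import csv
import io

# Two lines copied from the head of the fields file shown earlier
sample = (
    "USPS\tUnited States Postal Service State Abbreviation\n"
    "ALAND\tLand Area (square meters) - Created for statistical purposes only\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
field_rows = [tuple(record) for record in reader if record]  # skip blank lines

# The INSERT statement expects exactly two values per row
if not all(len(row) == 2 for row in field_rows):
    raise ValueError("unexpected number of columns in fields file")
print(field_rows[0])
```

With a real file, `io.StringIO(sample)` would be replaced by the open file handle.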

The main data lets us use a wider variety of MySQL data types.  

The `DECIMAL` or `NUMERIC` types in MySQL (and other SQL systems) use somewhat strange naming: "precision" means the total number of digits stored; "scale" means the number of those digits to the right of the decimal point.
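
To make the naming concrete, here is a small sketch using Python's `decimal` module to check whether a value fits a given precision and scale.  The helper `fits_decimal()` is hypothetical (it is not part of any adapter), and it only approximates how a `DECIMAL` column rounds and range-checks values.

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_decimal(value, precision, scale):
    """Return True if `value` fits DECIMAL(precision, scale) after rounding."""
    # Round to `scale` fractional digits
    quantum = Decimal(1).scaleb(-scale)          # e.g. Decimal('0.001')
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # Only (precision - scale) digits are available left of the point
    limit = Decimal(10) ** (precision - scale)
    return abs(rounded) < limit

print(fits_decimal("2176.058", 8, 3))    # 4 integer digits, 3 fractional
print(fits_decimal("123456.7", 8, 3))    # 6 integer digits
```

With `DECIMAL(8, 3)`, five digits remain for the integer part, so the largest storable magnitude is 99999.999.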

In [8]:
sql_geography = """
CREATE TABLE census_zipcode_geography (
  USPS CHAR(5),    
  ALAND BIGINT,              -- some zips are larger than 2e9 m^2
  AWATER BIGINT,
  ALAND_SQMI DECIMAL(8, 3),  -- largest zips need 5 to left of decimal
  AWATER_SQMI DECIMAL(8, 3), -- sizes with 3 digits of "scale"
  INTPTLAT REAL,             -- keep fields from key, although duplicative
  INTPTLONG REAL,
  location POINT,             -- use geometric type for lat/lon
  PRIMARY KEY (USPS)
);
"""
conn = pymysql.connect(**credentials)
with conn.cursor() as cur:
    cur.execute('DROP TABLE IF EXISTS census_zipcode_geography;')
    cur.execute(sql_geography)
    conn.commit()

We stipulate that this data is large enough that we do not want to load it all at once (really it is not).  Unfortunately, PyMySQL is somewhat limited in correctly templating MySQL, and MySQL itself is somewhat inflexible in accepting non-conforming SQL.

Specifically, if we pass the whole `ST_PointFromText(...)` call as a value in the second argument of `.execute()`, the cursor templating mechanism quotes it as a string, but MySQL will only accept the unquoted function call.  We can succeed using manual Python templating.  We also need to make sure zip codes are in extra quotes so they are not converted to integers (some have leading zeros).

In [9]:
# Plain string, filled in below with the `%` operator (no f-string needed)
sql_insert_geo = """
INSERT into census_zipcode_geography
VALUES ('%s', %s, %s, %s, %s, %s, %s, %s);
"""
conn = pymysql.connect(**credentials)
with conn.cursor() as cur:
    with open('../data/census-zipcodes-2018.tsv') as fh:
        next(fh)   # discard header line
        for line in fh:
            row = line.strip().split('\t')
            row.append(f"ST_PointFromText('POINT ({row[-2]} {row[-1]})')")
            sql = sql_insert_geo % tuple(row)
            cur.execute(sql)
    conn.commit()

# Show an example of the necessary SQL formatting
print(sql)


INSERT into census_zipcode_geography
VALUES ('99929', 5635963110, 637274792, 2176.058, 246.053, 56.370538, -131.693453, ST_PointFromText('POINT (56.370538 -131.693453)'));
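
One possible alternative to fully manual templating, offered as a sketch that has not been run against the course database: keep the `ST_PointFromText()` call in the SQL text and pass only its well-known-text argument as a parameter, so that quoting it as a string is exactly what is wanted.  The names `sql_insert_geo_params` and `geo_params()` are hypothetical, and the sketch assumes MySQL will coerce the quoted numeric strings to the column types.

```python
# Assumption: ST_PointFromText() stays in the SQL text; only its WKT
# string argument is a parameter, so the driver's quoting is harmless.
sql_insert_geo_params = """
INSERT INTO census_zipcode_geography
VALUES (%s, %s, %s, %s, %s, %s, %s, ST_PointFromText(%s));
"""

def geo_params(line):
    """Turn one TSV line into a parameter tuple for the statement above."""
    row = line.strip().split('\t')
    # Same 'POINT (lat lon)' order as the manual templating in the lesson
    row.append(f"POINT ({row[-2]} {row[-1]})")
    return tuple(row)

params = geo_params(
    "00601\t166659749\t799292\t64.348\t0.309\t18.180555\t-66.749961\n")
print(params)
# With an open cursor: cur.execute(sql_insert_geo_params, params)
```

Because every value stays a string, the zip code keeps its leading zeros without extra quotes in the template.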



Just to see that our data is in the database, and as a preview of the POINT data type, let us make a query for the land area of those zip codes that are close to where I live. For this short distance, the Pythagorean formula suffices; for larger distances, we should utilize Haversine distance (we return to that in a later lesson).

In [10]:
from collections import namedtuple

In [11]:
sql_near = """ -- My lat/lon: approx 45.1, -69.3
SELECT USPS, ALAND_SQMI, AWATER_SQMI
FROM census_zipcode_geography 
-- search within approximately 0.15 degrees 
-- but lat/lon are different sizes away from equator
WHERE SQRT ( POWER(ST_X(location)-45.1, 2) + 
             POWER(ST_Y(location)+69.3, 2) ) < 0.15;
"""
with pymysql.connect(**credentials) as cur:
    cur.execute(sql_near)
    results = cur.fetchall()
    Row = namedtuple("Row", [c[0] for c in cur.description])

for row in results:
    print(Row(*row))

Row(USPS='04443', ALAND_SQMI=Decimal('137.384'), AWATER_SQMI=Decimal('8.674'))
Row(USPS='04479', ALAND_SQMI=Decimal('38.441'), AWATER_SQMI=Decimal('1.359'))
Row(USPS='04930', ALAND_SQMI=Decimal('53.969'), AWATER_SQMI=Decimal('2.373'))
Row(USPS='04939', ALAND_SQMI=Decimal('37.670'), AWATER_SQMI=Decimal('0.269'))
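
The SQL comment about latitude and longitude being "different sizes" can be checked numerically.  The sketch below is a plain-Python Haversine function (not the version developed in the later lesson) that assumes a spherical Earth with a mean radius of 6371 km.  It compares a 0.15 degree step north with a 0.15 degree step east from the reference point used above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# 0.15 degrees north versus 0.15 degrees east of (45.1, -69.3)
north_km = haversine_km(45.1, -69.3, 45.25, -69.3)
east_km = haversine_km(45.1, -69.3, 45.1, -69.15)
print(f"north: {north_km:.1f} km, east: {east_km:.1f} km")
```

At this latitude the eastward step is noticeably shorter, so the 0.15 degree "circle" in the query is really an ellipse on the ground.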


## Asynchronous access

On modern computers, I/O is by far the slowest component.  Thread switches, let alone process switches, are relatively expensive.  Simply checking whether a given I/O operation is ready to provide more data is one or two orders of magnitude cheaper, and has almost no memory cost compared to allocating a thread.  

Using `aiomysql` (or `trio_mysql` or `tormysql`) allows your program to perform other work while waiting for the results to arrive from a query.  However, doing so *does* require becoming familiar with the `await` and `async` keywords, and generally shifting your thinking towards the styles of programming required by `asyncio` in the standard library. 

The simple examples in this lesson will not come remotely close to those where any of this matters.  But for much larger datasets, and for multi-tenancy of RDBMS access, the differences can be huge.

We first will import the `asyncio` scaffolding and the `aiomysql` module.  Because `aiomysql` does not claim to follow the DB-API, module attributes like `.apilevel` and `.paramstyle` do not exist.  However, *most* of the DB-API is still consistent; e.g. `.connect()`, `.cursor()`, `.execute()`, `.fetchall()` still have their familiar meanings.

In [12]:
import asyncio
from asyncio import get_event_loop, gather, as_completed
import aiomysql

Because this code is running inside a Jupyter notebook which already has its own `asyncio` event loop, we need to use a third-party module called `nest_asyncio` to patch the event loop and run async code in cells.  Outside of environments (Jupyter, web servers, GUI applications) that might create their own event loops, this is not necessary.

In [13]:
import nest_asyncio
nest_asyncio.apply()

We might want to check the zip codes near certain latitude/longitude locations.

In [14]:
Location = namedtuple("Location", "latitude longitude distance")
locs = [Location(40.0, -105.3, 0.15), Location(45.1, -69.3, 0.15), 
        Location(34.9, -82.4, 0.15), Location(42.6, -72.5, 0.15)]

In [15]:
# `port` MUST BE an int not a string representing int for pymysql
# The database name is passed as `db` rather than `database` for aiomysql
credentials = dict(host='localhost',
                   user='ine_student',
                   password='ine-password',
                   db='ine',
                   port=3306)

For an asynchronous adapter, we need to wrap our operation in a special coroutine function that is defined with `async def` rather than plain `def`.  Each of the steps has an extra `await` keyword to indicate that the event loop is free to do other work between each such line.  The logic, however, is very much the same as we have seen with other adapters.
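
To see what "free to do other work" means without a database, here is a standard-library-only sketch in which `asyncio.sleep()` stands in for waiting on a query.  The coroutine `fake_query()` and its delays are made up for illustration, and the sketch is written as a standalone script (`asyncio.run()` creates its own event loop).

```python
import asyncio

log = []

async def fake_query(name, delay):
    """Stand-in for a database round trip; sleeps instead of doing I/O."""
    log.append(f"{name} sent")
    await asyncio.sleep(delay)   # the event loop may run other coroutines here
    log.append(f"{name} done")
    return name

async def main():
    # Both coroutines are in flight at once; results keep argument order
    return await asyncio.gather(fake_query("slow", 0.10),
                                fake_query("fast", 0.02))

returned = asyncio.run(main())
print(returned)
print(log)
```

The log shows both "queries" being sent before either finishes, which is the interleaving that each `await` makes possible.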

In [16]:
async def near_location(loc):
    conn = await aiomysql.connect(**credentials)
    cur = await conn.cursor()
    await cur.execute(sql_near % loc)
    results = await cur.fetchall()
    return (loc, results)

We cannot simply run this function, but need instead to tell the event loop to manage it.  In fact, let us let the event loop handle several such coroutines, each for a different reference location.

In [17]:
sql_near = """ -- templatized where near and distance
SELECT USPS, ALAND_SQMI, AWATER_SQMI
FROM census_zipcode_geography 
-- search within approximately 0.15 degrees 
-- but lat/lon are different sizes away from equator
WHERE SQRT ( POWER(ST_X(location)-%f, 2) + 
             POWER(ST_Y(location)-%f, 2) ) < %f;
"""

In [18]:
aws = [near_location(loc) for loc in locs]

loop = get_event_loop()
for ret in loop.run_until_complete(gather(*aws)):
    loc, results = ret
    print(loc, "...", len(results), "tuples\n", "-"*70)
    for tup in results[:2]:
        print(tup)
    print()

Location(latitude=40.0, longitude=-105.3, distance=0.15) ... 10 tuples
 ----------------------------------------------------------------------
('80025', Decimal('11.722'), Decimal('0.004'))
('80027', Decimal('19.462'), Decimal('0.196'))

Location(latitude=45.1, longitude=-69.3, distance=0.15) ... 4 tuples
 ----------------------------------------------------------------------
('04443', Decimal('137.384'), Decimal('8.674'))
('04479', Decimal('38.441'), Decimal('1.359'))

Location(latitude=34.9, longitude=-82.4, distance=0.15) ... 12 tuples
 ----------------------------------------------------------------------
('29601', Decimal('4.280'), Decimal('0.024'))
('29605', Decimal('25.579'), Decimal('0.175'))

Location(latitude=42.6, longitude=-72.5, distance=0.15) ... 13 tuples
 ----------------------------------------------------------------------
('01054', Decimal('22.790'), Decimal('0.164'))
('01301', Decimal('25.494'), Decimal('0.541'))
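
It can help to look at the SQL text that `sql_near % loc` actually produces.  The sketch below renders only the `WHERE` clause of the template for one location; it involves no database and uses only string formatting.

```python
from collections import namedtuple

Location = namedtuple("Location", "latitude longitude distance")

# Only the WHERE clause of the lesson's template, for brevity
where_template = ("WHERE SQRT ( POWER(ST_X(location)-%f, 2) + "
                  "POWER(ST_Y(location)-%f, 2) ) < %f;")

# A namedtuple is a tuple, so it can feed the % operator directly
rendered = where_template % Location(40.0, -105.3, 0.15)
print(rendered)
```

A negative longitude produces a doubled minus sign, which MySQL reads as subtracting a negative number, because a `--` comment requires whitespace after the dashes.  Since every placeholder is `%f`, only numeric values can be interpolated, which limits the usual risks of manual templating.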



The `gather()`-based code above has two limitations.  

* Each time a new coroutine is created, it makes a new connection.  A more efficient approach is to create a *connection pool* and share connections as they are requested, returning them to the pool rather than closing them when a function ends.
* We wait for all the results from the various coroutines to be complete, because `gather()` inside `loop.run_until_complete()` returns only once every coroutine has finished.  If the 4th query is ready early, we cannot process it while waiting for the 1st query to complete.

As a secondary concern, by doing a `.fetchall()` on the results, we cannot process each result tuple as soon as it arrives.
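
The second limitation is addressed by `asyncio.as_completed()`, which yields results in the order they finish rather than the order they were submitted.  Here is a database-free sketch of that behavior, with made-up names and delays standing in for queries of different durations:

```python
import asyncio

async def fake_query(name, delay):
    """Stand-in for a query whose result arrives after `delay` seconds."""
    await asyncio.sleep(delay)
    return name

async def main():
    delays = {"first": 0.15, "second": 0.05, "third": 0.10}
    pending = [fake_query(name, delay) for name, delay in delays.items()]
    finished = []
    # Results are yielded in completion order, not submission order
    for future in asyncio.as_completed(pending):
        finished.append(await future)
    return finished

completion_order = asyncio.run(main())
print(completion_order)
```

The shortest "query" is handled first even though it was submitted second.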

In [19]:
async def one_location(pool, loc):
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(sql_near % loc)
            results = []
            async for row in cur:
                # might process each tuple as soon as received
                results.append(row)
    return (loc, results)

In [20]:
async def near_locations(locs):
    async with aiomysql.create_pool(**credentials, maxsize=10) as pool:
        queries = [one_location(pool, loc) for loc in locs]
        for future in as_completed(queries):
            loc, results = await future
            print(loc, "...", len(results), "tuples\n", "-"*70)
            for tup in results[:2]:
                print(tup)
            print()
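
The `maxsize` argument caps how many connections the pool will hand out at once; further requests wait until one is released.  The toy class below is a hypothetical stand-in built on `asyncio.Queue` that imitates only that behavior.  It is not how `aiomysql` implements its pool, and the "connections" are just strings.

```python
import asyncio

class ToyPool:
    """Hypothetical stand-in for a connection pool (not aiomysql's class)."""
    def __init__(self, maxsize):
        self._free = asyncio.Queue()
        for n in range(maxsize):
            self._free.put_nowait(f"conn-{n}")   # pretend connections
        self.in_use = 0
        self.peak = 0

    async def run(self, delay):
        conn = await self._free.get()            # waits if none are free
        self.in_use += 1
        self.peak = max(self.peak, self.in_use)
        try:
            await asyncio.sleep(delay)           # pretend query
            return conn
        finally:
            self.in_use -= 1
            self._free.put_nowait(conn)          # hand it back for reuse

async def main():
    pool = ToyPool(maxsize=2)
    used = await asyncio.gather(*(pool.run(0.02) for _ in range(6)))
    return pool.peak, set(used)

peak, distinct = asyncio.run(main())
print(peak, sorted(distinct))
```

Six tasks share two pretend connections, and no more than two are ever in use at the same time.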

In [21]:
loop = asyncio.get_event_loop()  
loop.run_until_complete(near_locations(locs))

Location(latitude=45.1, longitude=-69.3, distance=0.15) ... 4 tuples
 ----------------------------------------------------------------------
('04443', Decimal('137.384'), Decimal('8.674'))
('04479', Decimal('38.441'), Decimal('1.359'))

Location(latitude=42.6, longitude=-72.5, distance=0.15) ... 13 tuples
 ----------------------------------------------------------------------
('01054', Decimal('22.790'), Decimal('0.164'))
('01301', Decimal('25.494'), Decimal('0.541'))

Location(latitude=40.0, longitude=-105.3, distance=0.15) ... 10 tuples
 ----------------------------------------------------------------------
('80025', Decimal('11.722'), Decimal('0.004'))
('80027', Decimal('19.462'), Decimal('0.196'))

Location(latitude=34.9, longitude=-82.4, distance=0.15) ... 12 tuples
 ----------------------------------------------------------------------
('29601', Decimal('4.280'), Decimal('0.024'))
('29605', Decimal('25.579'), Decimal('0.175'))



## Summary

For users who want to venture beyond the standard `mysql.connector` adapter, several other good options are available.  For heavy workloads, using one of the asynchronous adapters can be a big win. 