<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
    MySQL and MariaDB for Python Developers
</h1>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:55%; left:10%;">
    David Mertz, Ph.D.
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:62%; left:10%;">
    Data Scientist
</h3>
</div>

# Data types in MySQL

Columns in MySQL may utilize a wide range of data types.  Many of these are features in SQL standards.  Others are MySQL custom data types with particular functions and syntax provided to work with them.  Still others—such as support for JSON and spatial objects—are supported in the most recent SQL standards, but were added in somewhat different forms earlier and independently within MySQL.

We have used `TIMESTAMP` in an earlier lesson.  `DATE`, `DATETIME`, and `YEAR` are also available.  Within Python, these all interact seamlessly with the Python `datetime` module, so no special discussion is contained in this lesson.

In [1]:
import mysql.connector
import pandas as pd
cred = dict(user='ine_student', password='ine-password', database='ine', host='localhost')
conn = mysql.connector.connect(**cred)
cur = conn.cursor()

## Numeric types

MySQL numeric types consist of one-, two-, three-, four-, and eight-byte integers, signed or unsigned; four- and eight-byte floating-point numbers; and selectable-precision decimals.  If neither `UNSIGNED` nor `SIGNED` is specified, `SIGNED` is assumed.

In [17]:
sql = """
CREATE TABLE numbers (
    name TEXT,
    a TINYINT UNSIGNED DEFAULT NULL, -- 0 to +255
    b SMALLINT SIGNED DEFAULT NULL,  -- -32,768 to 32,767
    c MEDIUMINT DEFAULT NULL,        -- -8,388,608 to 8,388,607
    d INT DEFAULT NULL,              -- approx -2e9 to +2e9
    e BIGINT UNSIGNED DEFAULT NULL,  -- 0 to approx 1.8e19
    f DECIMAL(10,9) DEFAULT NULL,    -- 10 total digits, 9 trailing (up to 65 digits permitted)
    g DECIMAL(40,25) DEFAULT NULL,   -- 40 total digits, 25 trailing
    h FLOAT DEFAULT NULL,            -- 32-bit float
    i DOUBLE PRECISION DEFAULT NULL  -- 64-bit float
    );
"""
cur.execute('DROP TABLE IF EXISTS numbers;')
cur.execute(sql)
conn.commit()
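
The ranges noted in the column comments follow directly from the byte widths.  As a small sketch (the helper name `int_range` is ours, not part of any MySQL API), they can be derived in plain Python:

```python
# Byte widths of MySQL's integer types
INT_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}

def int_range(type_name, unsigned=False):
    """Return the (min, max) values storable in the given integer type."""
    bits = 8 * INT_BYTES[type_name]
    if unsigned:
        return 0, 2**bits - 1
    # Signed types use two's complement: one bit is spent on the sign
    return -(2**(bits - 1)), 2**(bits - 1) - 1

for name in INT_BYTES:
    print(f"{name:<10} signed {int_range(name)}  "
          f"unsigned {int_range(name, unsigned=True)}")
```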

Let us insert a fairly precise version of pi into all the columns.

In [3]:
pi = '3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534'
cur.execute("INSERT INTO numbers (name) VALUES ('pi')")

for col in 'abcdefghi':
    try:
        cur.execute(f"UPDATE numbers SET {col}={pi} WHERE name='pi';")
    except Exception as err:
        print(f"Column {col}: {str(err).strip()}")
    else:
        cur.execute(f'SELECT {col} FROM numbers;')
        print(f"Column {col}: {cur.fetchall()[-1]}")
    finally:
        conn.commit()

Column a: (3,)
Column b: (3,)
Column c: (3,)
Column d: (3,)
Column e: (3,)
Column f: (Decimal('3.141592654'),)
Column g: (Decimal('3.1415926535897932384626434'),)
Column h: (3.14159,)
Column i: (3.141592653589793,)
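
The `DECIMAL` and floating-point results above can be approximated client-side.  The following is only a sketch using the standard library, assuming the server rounds exact decimals half away from zero; the helper names are hypothetical.

```python
import struct
from decimal import Decimal, ROUND_HALF_UP

pi_dec = Decimal("3.14159265358979323846264338327950288419716939937510")

def round_to_scale(value, scale):
    """Round a Decimal to `scale` fractional digits, halves away from zero."""
    return value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)

def as_float32(x):
    """Round-trip a Python float through a 4-byte IEEE 754 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(round_to_scale(pi_dec, 9))    # compare with column f
print(round_to_scale(pi_dec, 25))   # compare with column g
print(as_float32(float(pi_dec)))    # only about 7 significant digits survive
print(float(pi_dec))                # 64-bit float; compare with column i
```

The 32-bit value printed by Python shows more digits than MySQL displays for column `h`, but the digits past roughly the seventh are not meaningful in either case.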


In [4]:
bignum = '1000000000000'
cur.execute(f"INSERT INTO numbers (name) VALUES ('{bignum}')")
for col in 'abcdefghi':
    try:
        cur.execute(f"UPDATE numbers SET {col}={bignum} WHERE name='{bignum}';")
    except Exception as err:
        print(f"Column {col}: {str(err).strip()}")
    else:
        cur.execute(f'SELECT {col} FROM numbers;')
        print(f"Column {col}: {cur.fetchall()[-1]}")
    finally:
        conn.commit()

Column a: 1264 (22003): Out of range value for column 'a' at row 2
Column b: 1264 (22003): Out of range value for column 'b' at row 2
Column c: 1264 (22003): Out of range value for column 'c' at row 2
Column d: 1264 (22003): Out of range value for column 'd' at row 2
Column e: (1000000000000,)
Column f: 1264 (22003): Out of range value for column 'f' at row 2
Column g: (Decimal('1000000000000.0000000000000000000000000'),)
Column h: (1000000000000.0,)
Column i: (1000000000000.0,)
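
Column `f` rejects the value because `DECIMAL(10,9)` leaves only one digit to the left of the decimal point, while `DECIMAL(40,25)` leaves fifteen.  A rough sketch of that check follows; `fits_decimal` is a hypothetical helper, and it ignores the edge case where rounding the fraction carries into a new integer digit.

```python
from decimal import Decimal

def fits_decimal(value, precision, scale):
    """True if the integer part of `value` fits in DECIMAL(precision, scale).

    Only the integer part is checked: excess fractional digits are
    rounded rather than rejected.
    """
    max_int_digits = precision - scale
    int_part = abs(int(Decimal(value)))
    return int_part < 10 ** max_int_digits

bignum = "1000000000000"
print(fits_decimal(bignum, 10, 9))    # column f
print(fits_decimal(bignum, 40, 25))   # column g
```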


In [5]:
cur.execute("SELECT * FROM numbers")
pd.DataFrame(cur.fetchall(), 
             columns=[c[0] for c in cur.description]).T

Unnamed: 0,0,1
name,pi,1000000000000.0
a,3,
b,3,
c,3,
d,3,
e,3,1000000000000.0
f,3.141592654,
g,3.1415926535897932384626434,1000000000000.0
h,3.14159,1000000000000.0
i,3.14159,1000000000000.0


## Character and binary data

There are numerous character and binary data types in MySQL, of both fixed and varying length. They are `CHAR`, `VARCHAR`, `BINARY`, `VARBINARY`, `BLOB`, `TEXT`, `ENUM`, and `SET`.  Even more types actually exist, since `TINYTEXT`, `TINYBLOB`, `MEDIUMTEXT`, `MEDIUMBLOB`, `LONGTEXT` and `LONGBLOB` are `TEXT` and `BLOB` with different maximum sizes.




Having so many to choose from can make the decision a little tricky.  `CHAR`, `VARCHAR`, `BINARY`, and `VARBINARY` columns can be indexed directly; `TEXT` and `BLOB` columns can only be indexed on a declared prefix length (or, for text, with a `FULLTEXT` index).  `VARCHAR` stores a one- or two-byte length prefix with each value, whereas `CHAR` pads each value with spaces to the declared length; this means that either might be smaller overall, but for most kinds of textual content, `VARCHAR` will win.  On the flip side, `CHAR` columns produce somewhat faster indices.  All of these are small differences.
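
To make the storage trade-off concrete, here is a deliberately simplified model.  It assumes a single-byte character set and ignores storage-engine details such as row formats; the function names are ours.  `VARCHAR` is modeled as a length prefix (one byte when the column can hold at most 255 bytes, otherwise two) plus the data.

```python
def char_bytes(value, n):
    """Approximate bytes used by CHAR(n): always padded to n."""
    if len(value) > n:
        raise ValueError("value longer than declared length")
    return n

def varchar_bytes(value, n):
    """Approximate bytes used by VARCHAR(n): length prefix plus data."""
    if len(value) > n:
        raise ValueError("value longer than declared length")
    prefix = 1 if n <= 255 else 2
    return prefix + len(value)

words = ["MySQL", "has", "many", "types"]
print(sum(char_bytes(w, 5) for w in words))      # fixed width
print(sum(varchar_bytes(w, 5) for w in words))   # variable width
```

When most values fill the declared width, as with these short words, `CHAR` comes out slightly ahead; when values vary widely in length, the padding dominates and `VARCHAR` is smaller.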

The `TEXT` and `BLOB` types are very similar to `VARCHAR`, but with fixed maximum lengths of $2^8-1$, $2^{16}-1$, $2^{24}-1$, and $2^{32}-1$ bytes for the `TINY`, unprefixed, `MEDIUM`, and `LONG` variants respectively.  Each value carries a length prefix of 1, 2, 3, or 4 bytes, so beyond those few bytes no space is wasted by choosing a larger `TEXT` type.  By implication, the largest size of a single data item in MySQL is just under 4 GiB.
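
The size limits follow from the width of the length prefix.  A short sketch (helper names are hypothetical) that computes them and picks the smallest type able to hold a given number of bytes:

```python
# Length-prefix width in bytes for each TEXT variant (BLOB variants match)
TEXT_PREFIX_BYTES = {"TINYTEXT": 1, "TEXT": 2, "MEDIUMTEXT": 3, "LONGTEXT": 4}

def max_length(type_name):
    """Largest value, in bytes, that the length prefix can describe."""
    return 2 ** (8 * TEXT_PREFIX_BYTES[type_name]) - 1

def smallest_text_type(n_bytes):
    """Smallest TEXT variant that can hold n_bytes of data."""
    for name in TEXT_PREFIX_BYTES:   # dicts preserve insertion order
        if n_bytes <= max_length(name):
            return name
    raise ValueError("too large for any TEXT type")

for name in TEXT_PREFIX_BYTES:
    print(f"{name:<11}{max_length(name):>14,} bytes")
```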

In [6]:
sql = """
CREATE TABLE characters (
    a CHAR(5) CHARACTER SET utf8,
    b CHAR(50),
    c VARCHAR(5),
    d VARCHAR(50),
    e TEXT CHARACTER SET latin1 COLLATE latin1_general_cs
    );
"""
cur.execute('DROP TABLE IF EXISTS characters;')
cur.execute(sql)
conn.commit()

UTF-8 should be the default encoding for any sensibly configured MySQL installation, but legacy settings sometimes exist.  Unless you have a compelling reason not to, use UTF-8 for all text; in MySQL, the `utf8mb4` character set covers all of Unicode, while the name `utf8` is an alias for the three-byte `utf8mb3`.  Within Python, strings are always sequences of abstract Unicode code points, so no special conversion is needed.
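
The distinction between characters and bytes is easy to see from Python.  This sketch uses Python's own codecs only (Python's `latin-1` codec is close to, but not identical with, MySQL's `latin1`), and the sample strings are our own.

```python
s = "na\u00efve caf\u00e9"          # 'naïve café', with precomposed accents

print(len(s))                       # code points, which is what CHAR(n) counts
print(len(s.encode("utf-8")))       # accented letters take two bytes each
print(len(s.encode("latin-1")))     # one byte per character

# Characters outside the legacy encoding cannot be represented at all
try:
    "\u03ba".encode("latin-1")      # Greek small letter kappa
except UnicodeEncodeError as err:
    print("Not encodable:", err.reason)
```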

In [7]:
s = 'MySQL sure has a lot of character types'
for col in 'abcde':
    try:
        cur.execute(f"INSERT INTO characters ({col}) VALUES ('{s}');")
    except Exception as err:
        print(f"Column {col}: {str(err).strip()}")
    else:
        cur.execute(f'SELECT {col} FROM characters;')
        print(f"Column {col}: {cur.fetchall()[0]}")
    finally:
        conn.rollback()

Column a: 1406 (22001): Data too long for column 'a' at row 1
Column b: ('MySQL sure has a lot of character types',)
Column c: 1406 (22001): Data too long for column 'c' at row 1
Column d: ('MySQL sure has a lot of character types',)
Column e: ('MySQL sure has a lot of character types',)


## Binary data



To illustrate binary data, we might create a few Python pickles as something plausible to store in a database.

In [8]:
from pickle import dumps, loads
cur.execute("DROP TABLE IF EXISTS pickles;")
cur.execute("CREATE TABLE pickles (name TEXT, bytes BLOB);")

data = [('tuple', dumps(("a string", 1+2j))), 
        ('dict', dumps({'this': 4, 'that': 1.23})),
        ('complex', dumps(3+4j))]

cur.executemany("INSERT INTO pickles VALUES (%s, %s)", data)

In [9]:
cur.execute("SELECT * FROM pickles;")
for row in cur:
    print(row[0], loads(row[1]))
    print(bytes(row[1]))
    print()

tuple ('a string', (1+2j))
b'\x80\x04\x95;\x00\x00\x00\x00\x00\x00\x00\x8c\x08a string\x94\x8c\x08builtins\x94\x8c\x07complex\x94\x93\x94G?\xf0\x00\x00\x00\x00\x00\x00G@\x00\x00\x00\x00\x00\x00\x00\x86\x94R\x94\x86\x94.'

dict {'this': 4, 'that': 1.23}
b'\x80\x04\x95\x1e\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x04this\x94K\x04\x8c\x04that\x94G?\xf3\xae\x14z\xe1G\xaeu.'

complex (3+4j)
b'\x80\x04\x95.\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x07complex\x94\x93\x94G@\x08\x00\x00\x00\x00\x00\x00G@\x10\x00\x00\x00\x00\x00\x00\x86\x94R\x94.'
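
One caution: unpickling data from an untrusted source can execute arbitrary code, so pickles in a database are only appropriate when every writer is trusted.  As a sketch of the trade-off, plain JSON is a safer serialization for simple records, though it cannot represent everything pickle can:

```python
import json
import pickle

record = {"this": 4, "that": 1.23}

blob = pickle.dumps(record)     # bytes, suitable for a BLOB column
print(pickle.loads(blob) == record)

text = json.dumps(record)       # str, suitable for a TEXT column
print(json.loads(text) == record)

# Complex numbers are one case pickle handles and plain JSON does not
try:
    json.dumps(3 + 4j)
except TypeError as err:
    print("JSON cannot encode complex:", err)
```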



## Set and Enum

Two additional string types are available: `SET` and `ENUM`.  Both are stored and accessed more efficiently than ordinary strings, but are mostly equivalent to strings semantically.  We saw earlier that `ENUM` is simply a string restricted to a fixed list of distinct values (at most 65,535).  A `SET` holds zero or more members drawn from a list of at most 64, which you interact with simply as comma-separated values.  Internally, a `SET` value is stored as a bit field of 1, 2, 3, 4, or 8 bytes, which can be much smaller than the full text it represents.

In [10]:
sql_items = """
CREATE TABLE items (
  onecolor ENUM('red', 'green', 'blue', 'orange', 'white', 'black', 'purple'),
  many SET('red', 'green', 'blue', 'orange', 'white', 'black', 'purple')
);
"""
cur.execute("DROP TABLE IF EXISTS items")
cur.execute(sql_items)
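
The bit-field representation of a `SET` can be mimicked in Python.  This is an illustrative sketch, not MySQL's actual code; it relies on the documented rule that the first declared member corresponds to the low-order bit.

```python
MEMBERS = ['red', 'green', 'blue', 'orange', 'white', 'black', 'purple']

def set_to_bits(value):
    """Encode a comma-separated SET string as an integer bit field."""
    bits = 0
    for item in filter(None, value.split(',')):
        bits |= 1 << MEMBERS.index(item)
    return bits

def bits_to_set(bits):
    """Decode a bit field back to members, in declaration order."""
    return ",".join(m for i, m in enumerate(MEMBERS) if (bits >> i) & 1)

print(set_to_bits("red,blue"))                # bits 0 and 2 are set
print(bits_to_set(set_to_bits("blue,red")))   # order follows the declaration
print(set_to_bits(",".join(MEMBERS)))         # all seven members fit in a byte
```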

In [11]:
from random import shuffle, randrange
colors = ["red", "green", "blue", "orange", "white", "black", "purple"]

for _ in range(100):
    shuffle(colors)
    onecolor = colors[-1]
    many = ",".join(colors[:randrange(8)])
    cur.execute(f"INSERT INTO items VALUES ('{onecolor}', '{many}')")

In [12]:
cur.execute("SELECT * FROM items")
pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])

Unnamed: 0,onecolor,many
0,orange,{black}
1,green,"{purple, blue, white, orange}"
2,blue,"{white, black}"
3,green,"{red, green, orange, blue, white, purple, black}"
4,purple,{}
...,...,...
95,black,{}
96,white,{orange}
97,black,"{green, blue, white}"
98,green,"{red, orange, blue, white, black}"


## Geometric types

MySQL offers a wide range of geometric/spatial types.  All of them describe some particular subset of a Cartesian plane.  The type `POINT` was used in prior lessons.  We also have: `GEOMETRY`, `LINESTRING`, `POLYGON`, `MULTIPOINT`, `MULTILINESTRING`, `MULTIPOLYGON`, and `GEOMETRYCOLLECTION`.  Working with these gets into some specialized GIS (Geographic Information Systems) issues that will not be addressed in this course.

If your work is in this area, MySQL provides R-Trees with quadratic splitting for `SPATIAL` indexes on spatial columns.  R-tree indices can greatly improve performance for many kinds of queries.

## Full text indexing

A `FULLTEXT` index is not a data type *per se*, but it is worth noting here because it is closely tied to the character types: it can be built on `CHAR`, `VARCHAR`, or `TEXT` columns.  MySQL is very capable at efficient full-text search.  For this example, let us put the Project Gutenberg book _Introduction to the study of the history of language_, by Herbert Augustus Strong, Willem Sijbrand Logeman, and Benjamin Ide Wheeler (1891), into a MySQL table, in a special way.

In [13]:
# Create one text for each paragraph (paragraphs are separated by blank lines)
with open('../data/58650-0.txt', encoding='utf-8') as fh:
    paras = fh.read().split('\n\n')
len(paras)

3247

The format used here will allow many books to be loaded.  For the example, only one is used.

In [14]:
sql_books = """
CREATE TABLE books (
    book_id TEXT,
    para_num INTEGER,
    para_text TEXT,
    FULLTEXT idx (para_text)
) ENGINE=InnoDB;
"""
cur.execute("DROP TABLE IF EXISTS books;")
cur.execute(sql_books)
conn.commit()

In [15]:
for n, para in enumerate(paras):
    # One row per paragraph: book id, paragraph number, paragraph text
    sql = "INSERT INTO books VALUES (%s, %s, %s);"
    cur.execute(sql, ('58650-0.txt', n, para))
conn.commit()

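This becomes interesting when we want to search for patterns.  For example, the linguistics book sometimes discusses what proper names refer to.  The search we perform is case insensitive under the default collation.  MySQL's built-in full-text parser matches whole words and does not stem, so `name` and `names` are indexed as distinct terms.  Note also that the `+` prefixes in the search string are operators only in `BOOLEAN MODE`; in natural language mode they carry no special meaning, and paragraphs are simply ranked by relevance to the two words.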
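
If the intent is to require that every term be present, MySQL's `BOOLEAN MODE` gives a leading `+` that meaning.  The following is a hedged sketch of such a variant, not the query used in this lesson; `boolean_all_terms` and `SQL_BOOLEAN_SEARCH` are hypothetical names, and the terms are assumed to be plain words free of operator characters.

```python
def boolean_all_terms(terms):
    """Build a boolean-mode search string requiring every term."""
    return " ".join(f"+{t}" for t in terms)

SQL_BOOLEAN_SEARCH = """
SELECT para_num, para_text,
       MATCH(para_text) AGAINST (%s IN BOOLEAN MODE) AS score
FROM books
WHERE MATCH(para_text) AGAINST (%s IN BOOLEAN MODE)
ORDER BY score DESC
LIMIT 2;
"""

search = boolean_all_terms(["proper", "names"])
params = (search, search)   # one value per %s placeholder
print(search)

# With the open cursor from the earlier cells, this would run as:
# cur.execute(SQL_BOOLEAN_SEARCH, params)
```

Relevance scores are computed differently in boolean mode, so they are not directly comparable to natural language mode scores.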

In [16]:
import re
sql_search = """
SELECT para_num, para_text, 
       MATCH(para_text) AGAINST ('+proper +names' IN NATURAL LANGUAGE MODE) AS score 
FROM books
ORDER BY score DESC
LIMIT 2;
"""
cur.execute(sql_search)
for para, text, score in cur.fetchall():
    pat = re.compile(r'(proper|names|name|refer)', re.I)
    text = re.sub(pat, r'❰\1❱', text)
    print(f"Paragraph: {para}\nScore: {score}")
    print(text[:1000], "...")
    print()

Paragraph: 194
Score: 66.00066375732422
❰Proper❱ ❰names❱ owe their origin to the change of the ‘occasional’ concrete
meanings of certain words into ‘usual’ meanings. All ❰names❱ of persons
and places took their origin from ❰names❱ of species; and the usage
κατ’ ἐξοχήν was the starting-point for this process. We are
able to observe it distinctly in numerous instances of ❰names❱ both of
persons and of places. Such ordinary ❰names❱ as the following are very
instructive for our purpose: _Field_, _Hill_, _Bridges_, _Townsend_,
_Hedges_, _Church_, _Stone_, _Meadows_, _Newton_, _Villeneuve_,
_Newcastle_, _Neuchâtel_, _Neuburg_, _Milltown_, etc. Such ❰names❱ as
these served in the first instance merely to indicate to neighbours a
certain person or town: and they were sufficient to distinguish such
person or town from others in the neighbourhood. They passed into
regular ❰proper❱ ❰names❱ as soon as they were apprehended in this concrete
sense by neighbours too far removed to judge of the reason