# BMI 535/635: Management & Processing of Large-scale Data

#### Author: Michael Mooney (mooneymi@ohsu.edu)

## Week 3: Data Storage and Querying Solutions in Python

1. Introduction
2. Learning Objectives
3. Resource Profiling
4. Review of Basic Python Data Types
5. Data from dbSNP
6. Connecting to Relational DBs
7. Pandas
8. Bcolz and bdot
9. HDF5 (PyTables)

Requirements:
- Python modules:
    - `os`
    - `time`
    - `timeit`
    - `memory_profiler`
    - `shutil`
    - `numpy`
    - `pandas`
    - `bcolz`
    - `bdot`
    - `pytables (tables)`
    - `pymysql`
- Data files:
    - dbSNP annotations (chromosome 1 only): `./data/chr1_reducedCols.txt.gz`
    - A MySQL config file containing connection parameters: `mysqlconfig.py`

In [1]:
import os
import shutil
import numpy as np
import pandas as pd
import bcolz
import bdot
import tables
import pymysql as pym
import pymysql.cursors

## Introduction

Below are some common problems encountered when working with large datasets.

1. Data does not fit into memory
    - In particular, this can be a problem when setting up parallel computations, where each process needs the full data
2. Accessing (querying) the data is slow
3. Data files on-disk are very large (i.e. not easily portable)

Potential Solutions:

1. Use on-disk storage that is optimized for fast read/write access
2. Use data storage that allows for multiple concurrent reads (i.e. can be shared across multiple processes)
3. Use data compression

## Learning Objectives

1. You will learn some basic methods for profiling the amount of resources and time used by computational tasks
2. You will learn how to store large datasets in various "high-performance" Python data structures
3. You will learn how to query data in each of the data structures
4. You will learn how to convert between these various data storage solutions


## Resource Profiling

In [2]:
## Note: this is not a python command (only needed in the Jupyter notebook)
%load_ext memory_profiler

In [3]:
import time
import timeit
from memory_profiler import memory_usage

In [4]:
## A dummy function that creates a large list
def foo(a, n=100):
    time.sleep(2)
    b = [a] * n
    time.sleep(1)
    return None

## Use the time and memory_profiler modules to profile the foo function
t0 = time.time()
mem1 = memory_usage((foo, (1,10000000)), max_usage=True, timestamps=True)[0]
print "Elapsed time: %.3f seconds\n Memory used: %.3f Mb" % (mem1[1]-t0, mem1[0])

Elapsed time: 3.126 seconds
 Memory used: 153.695 Mb


In [5]:
## Use timeit to profile foo
timeit.timeit('foo(1,10000000)', setup='from __main__ import foo', number=1) 

3.0711960792541504

In [6]:
## Use timeit to profile multiple function calls
## Default is 3 repeats (repeat=3) in Python 2.7; Python 3.7+ defaults to 5
timeit.repeat('foo(1,10000000)', setup='from __main__ import foo', number=1) 

[3.078933000564575, 3.0812008380889893, 3.0734341144561768]

#### Note: in a Jupyter notebook the %memit, %time, and %timeit magics are available

Use the following to see the docstrings:

%memit?

%time?

%timeit?

In [7]:
%memit foo(1, 10000000)

peak memory: 153.79 MiB, increment: 0.01 MiB


In [8]:
%time foo(1, 10000000)

CPU times: user 62.8 ms, sys: 14.6 ms, total: 77.4 ms
Wall time: 3.08 s


In [9]:
%timeit -n 1 -r 3 foo(1, 10000000)

1 loop, best of 3: 3.08 s per loop


**Note: Be cautious when using these Jupyter magics for things such as opening files. Your code may be executed multiple times, which could cause problems (e.g. multiple open file handles).**
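
To make the note above concrete, here is a minimal sketch (standard library only, written for Python 3 rather than the Python 2.7 used in the notebook cells) that counts how many times a statement actually runs under `timeit`. The `open_resource` function and the `calls` counter are hypothetical stand-ins for something with side effects, such as opening a file.

```python
import timeit

calls = {"n": 0}  # hypothetical counter standing in for open file handles

def open_resource():
    """Pretend to open a file; just record that we were called."""
    calls["n"] += 1

# timeit.repeat runs the statement `number` times in each of `repeat` rounds
timeit.repeat(open_resource, number=10, repeat=3)
runs_many = calls["n"]
print(runs_many)  # 10 executions x 3 rounds = 30

# Limiting both to 1 guarantees a single execution
calls["n"] = 0
timeit.repeat(open_resource, number=1, repeat=1)
runs_once = calls["n"]
print(runs_once)  # 1
```

In a notebook, `%timeit -n 1 -r 1` is the analogous way to force a single execution, and `%time` runs the statement only once.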

## Review of Basic Python Data Types

Basic Python data types and when to use them:

**Lists/Tuples**: Use these when you need to iterate over a collection of items.

**Dictionaries**: Use these when you need to repeatedly access individual data elements (e.g. searching by a key value). 

**Sets**: Use these when you need to test for membership in a collection of items. Note: dictionaries also work well for this.

In [10]:
%%memit
## Create some example data
import random
LIST1 = random.sample(xrange(1000000), 1000000)
DICT1 = dict([(i,idx) for idx, i in enumerate(LIST1)])
SET1 = set(LIST1)

peak memory: 328.26 MiB, increment: 174.39 MiB


In [11]:
## How long does it take to find an item?
## Using a list
t0 = time.time()
idx = LIST1.index(567890)
print idx
print "Elapsed time for list: %.3f seconds\n" % (time.time()-t0,)

## Using a dictionary
t0 = time.time()
idx = DICT1[567890]
print idx
print "Elapsed time for dictionary: %.3f seconds\n" % (time.time()-t0,)

501355
Elapsed time for list: 0.031 seconds

501355
Elapsed time for dictionary: 0.000 seconds



In [12]:
%time LIST1.index(567890)

CPU times: user 32.2 ms, sys: 725 µs, total: 32.9 ms
Wall time: 33 ms


501355

In [13]:
%time DICT1[567890]

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.82 µs


501355

In [14]:
## How long does it take to determine if an item exists?
x = 567890
## Using a list
print "List: "
%time x in LIST1

## Using a dictionary
print "Dictionary: "
%time x in DICT1

## Using a set
print "Set: "
%time x in SET1

List: 
CPU times: user 31.2 ms, sys: 76 µs, total: 31.3 ms
Wall time: 31.3 ms
Dictionary: 
CPU times: user 4 µs, sys: 4 µs, total: 8 µs
Wall time: 8.82 µs
Set: 
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.01 µs


True

## dbSNP Dataset

For the following examples, we'll be using data from dbSNP describing single nucleotide polymorphisms (SNPs) on human chromosome 1. The data file is tab-delimited with four columns: the 'rs' number of the SNP, the chromosome, the position, and a comma-separated list of genes at the same location. Note: the file begins with a multi-line header.
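
As a rough illustration of this layout, the sketch below parses a small made-up sample with the standard library only (Python 3). It assumes a 7-line header, matching the `IGNORE 7 LINES` and `skiprows=7` settings used later in this notebook. The `read_snps` helper and the row with rs number 999 are hypothetical and are not part of the course data.

```python
import csv
import io

# Made-up sample mimicking the file layout: a 7-line header followed by
# tab-delimited rows (rs, chr, pos, loci)
sample = (
    "dbSNP Chromosome Report\n"
    "Refer to the dbSNP readme for documentation\n"
    "\n"
    "rs#\tchr\tchr\tlocal\n"
    "\t\tpos\tloci\n"
    "\n"
    "\n"
    "171\t1\t175261679\t\n"
    "538\t1\t6160958\tKCNAB2\n"
    "999\t1\t \tGENE1,GENE2\n"   # hypothetical row with a missing position
)

def read_snps(handle, header_lines=7):
    """Yield (rs, chr, pos, genes) tuples; pos is None when missing."""
    for _ in range(header_lines):
        next(handle)  # skip the multi-line header
    for rs, chrom, pos, loci in csv.reader(handle, delimiter="\t"):
        pos = int(pos) if pos.strip() else None
        genes = loci.split(",") if loci else []
        yield int(rs), chrom, pos, genes

rows = list(read_snps(io.StringIO(sample)))
for row in rows:
    print(row)
```

For the real file, a handle from `gzip.open('./data/chr1_reducedCols.txt.gz', 'rt')` could be passed to the same function. Reading all 12 million rows into Python tuples this way would use far more memory than the approaches shown in the rest of the notebook.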

In [15]:
!head ./data/chr1_reducedCols.txt

dbSNP Chromosome Report
Refer to ftp://ftp.ncbi.nlm.nih.gov/snp/00readme for documentation on tabular data below

rs#	chr	chr	local
		pos	loci


171	1	175261679	
242	1	20869461	
538	1	6160958	KCNAB2


## Connecting to Relational DBs in Python

The following MySQL examples assume a local database server, with a database called 'bmi535'. The following commands were run to create a table and load data:

    CREATE TABLE snps (rs int(10), 
                       chr int(10), 
                       pos int(10), 
                       loci varchar(80));
    
    LOAD DATA LOCAL INFILE '~/Documents/BMI535/Lectures/data/chr1_reducedCols.txt' 
    INTO TABLE snps FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    IGNORE 7 LINES (rs, chr, pos, loci);
    
The following command was run to clean cases of missing data (un-mapped SNPs):

    UPDATE snps SET pos = NULL WHERE pos = 0;

I've also created a second table with the same data, but this time I've created an index on the 'pos' column.

    CREATE TABLE snps_idx SELECT * FROM snps;
    
    CREATE INDEX pos ON snps_idx (pos); 
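
If no MySQL server is available, the effect of an index can be sketched with Python's built-in `sqlite3` module instead. This swaps SQLite in for MySQL, so details such as column types and query-plan output differ. The rows are synthetic, and the index name `idx_pos` and the `plan` helper are made up for this example (Python 3).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snps (rs INTEGER, chr INTEGER, pos INTEGER, loci TEXT)")

# Synthetic rows (not real dbSNP data)
conn.executemany(
    "INSERT INTO snps VALUES (?, ?, ?, ?)",
    ((i, 1, i * 10, "") for i in range(100000)),
)

# Copy the table, then index the 'pos' column of the copy
conn.execute("CREATE TABLE snps_idx AS SELECT * FROM snps")
conn.execute("CREATE INDEX idx_pos ON snps_idx (pos)")

def plan(table):
    """Return SQLite's description of how it would run a lookup by position."""
    sql = "EXPLAIN QUERY PLAN SELECT * FROM {} WHERE pos = 500000".format(table)
    # The last column of each plan row holds the human-readable detail
    return " ".join(row[-1] for row in conn.execute(sql))

plan_plain = plan("snps")
plan_indexed = plan("snps_idx")
print(plan_plain)    # reports a scan of the whole table
print(plan_indexed)  # reports a search that uses the index
conn.close()
```

MySQL offers a similar `EXPLAIN SELECT ...` statement for checking whether a query will use an index.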


In [16]:
## Import database connection settings
import mysqlconfig as cfg

In [17]:
## Connect to the MySQL database
## Note: the 'cursorclass' parameter is optional. Here it specifies
## that query results are returned as dictionaries rather than the default tuples
conn = pym.connect(host=cfg.mysql['host'], user=cfg.mysql['user'], password=cfg.mysql['password'], 
                   database=cfg.mysql['db'], cursorclass=pymysql.cursors.DictCursor)

In [18]:
%%time
query = "SELECT * FROM snps WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
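## Use the cursor as a context manager; this avoids relying on the
## connection's own 'with' behavior, which changed in later PyMySQL releases
with conn.cursor() as cur:
    cur.execute(query)
    result = cur.fetchone()
    print result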


{u'chr': 1, u'loci': 'DNAH14', u'pos': 225512846, u'rs': 189425743}
CPU times: user 1.53 ms, sys: 1.85 ms, total: 3.38 ms
Wall time: 5.56 s


In [19]:
%%time
## Now let's look at how an indexed table affects performance
query = "SELECT * FROM snps_idx WHERE chr = 1 AND pos = 225512846 AND loci = 'DNAH14';"
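## Use the cursor as a context manager; this avoids relying on the
## connection's own 'with' behavior, which changed in later PyMySQL releases
with conn.cursor() as cur:
    cur.execute(query)
    result = cur.fetchone()
    print result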


{u'chr': 1, u'loci': 'DNAH14', u'pos': 225512846, u'rs': 189425743}
CPU times: user 1.35 ms, sys: 1.1 ms, total: 2.45 ms
Wall time: 11.5 ms


In [20]:
conn.close()

In [21]:
## Note: you can use the following connection attribute to test if the connection is open
conn.open

False

## Pandas

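Pandas is a Python module that provides the DataFrame, an in-memory, labeled, tabular data structure, along with tools for reading, cleaning, and querying data.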

### Load Data into a Pandas DataFrame

In [22]:
## Load SNP Data
## Note we can load data directly from a compressed file (gzip)
%memit snps = pd.read_csv('./data/chr1_reducedCols.txt.gz', compression='gzip', sep='\t', header=None, skiprows=7, names=['rs','chr','pos','loci'])

peak memory: 1493.11 MiB, increment: 1222.81 MiB


In [23]:
## Check the dimensions of the dataframe
snps.shape

(12237943, 4)

In [24]:
## Print info about the data (data types, etc.)
snps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12237943 entries, 0 to 12237942
Data columns (total 4 columns):
rs      int64
chr     int64
pos     object
loci    object
dtypes: int64(2), object(2)
memory usage: 373.5+ MB


In [25]:
## Print the first few rows of the data
snps.head()

Unnamed: 0,rs,chr,pos,loci
0,171,1,175261679,
1,242,1,20869461,
2,538,1,6160958,KCNAB2
3,546,1,93617546,TMED5
4,549,1,15546825,TMEM51


In [26]:
## Do some data cleaning ...
## Some SNP positions are missing (stored as spaces), so convert the column
## to numeric data; values that cannot be parsed become NaN (errors='coerce').
## Also, fill NaNs in the loci column with empty strings. This will improve 
## compatibility with other Python modules (e.g. conversion of data types)
snps['pos'] = pd.to_numeric(snps['pos'], errors='coerce', downcast='integer')
snps = snps.fillna({'loci':''})

In [27]:
snps.head()

Unnamed: 0,rs,chr,pos,loci
0,171,1,175261679.0,
1,242,1,20869461.0,
2,538,1,6160958.0,KCNAB2
3,546,1,93617546.0,TMED5
4,549,1,15546825.0,TMEM51


In [28]:
snps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12237943 entries, 0 to 12237942
Data columns (total 4 columns):
rs      int64
chr     int64
pos     float64
loci    object
dtypes: float64(1), int64(2), object(1)
memory usage: 373.5+ MB
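
The `info()` output above shows `pos` as `float64` even though `downcast='integer'` was requested. The small sketch below shows why, on a hypothetical three-row frame (not the course data; written for Python 3): a NaN cannot be stored in a NumPy integer column, so the column stays floating point.

```python
import pandas as pd

# Small hypothetical frame: one position is a blank string, two loci are missing
df = pd.DataFrame({
    "rs": [171, 538, 999],
    "pos": ["175261679", "6160958", " "],
    "loci": [None, "KCNAB2", None],
})

# Values that cannot be parsed become NaN; NaN forces a floating-point dtype,
# so downcast='integer' cannot produce an integer column here
df["pos"] = pd.to_numeric(df["pos"], errors="coerce", downcast="integer")

# Replace missing loci with empty strings
df = df.fillna({"loci": ""})

print(df.dtypes)
print(df)
```

Newer Pandas versions also offer a nullable `Int64` dtype that can hold missing values. Whether downstream tools such as Bcolz accept it is not something this sketch checks.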


### Perform Query Using Pandas

In [29]:
%time pandas_result = snps.query("(chr==1) & (pos==225512846) & (loci=='DNAH14')")

CPU times: user 403 ms, sys: 141 ms, total: 543 ms
Wall time: 568 ms


In [30]:
pandas_result

Unnamed: 0,rs,chr,pos,loci
3456788,189425743,1,225512846.0,DNAH14


### Save to HDF5 for Use Later

**Note:** The Pandas documentation for creating HDF5 files from dataframes (the `to_hdf()` method) can be confusing. According to the docs, the `data_columns` parameter specifies which columns should be indexed in the HDF5 file (PyTables format only). It does this, but it also determines which columns can be (easily) queried in the HDF5 table. Ultimately, whether indexes are actually created is controlled by the `index` parameter. My recommendation is to always use `data_columns=True` so that all columns can be queried, and to control indexing with the `index` parameter. It is probably better still to create indexes later, as needed, using the PyTables module (see below). Creating indexes on all columns is costly and probably unnecessary in most cases.

In [None]:
## Note: Use index=False to avoid creating any indexes in the HDF5 file.
%time snps.to_hdf('./data/snps_pandas_hdf.h5', mode='w', key='/snps', format='table', data_columns=True, index=False, complib='blosc:lz4', complevel=9)

In [None]:
## How much space is used on disk?
!du -sh ./data/snps_pandas_hdf.h5

In [None]:
## Save an HDF5 with zlib compression for compatibility with R
## This is much slower than above, so I've lowered the compression level
%time snps.to_hdf('./data/snps_pandas_hdf_zlib.h5', mode='w', key='/snps', format='table', data_columns=True, index=False, complib='zlib', complevel=6)

In [None]:
## How much space is used on disk?
!du -sh ./data/snps_pandas_hdf_zlib.h5

## Bcolz

Bcolz is a Python module for storing large data sets on-disk or in memory with compression. 

### Load Data into a Bcolz ctable (in-memory)

Bcolz Tutorials:
[http://bcolz.readthedocs.io/en/latest/tutorial.html](http://bcolz.readthedocs.io/en/latest/tutorial.html)

**Note**: If your dataframe has a string column with missing values (NaNs), the conversion to a Bcolz ctable may be very slow. Fill in the NaNs with empty strings (or some other value) so that Bcolz can more easily convert the Pandas 'object' data type to fixed-length strings (we did this above).

In [31]:
## Get info about Bcolz and set some parameters
bcolz.print_versions()
bcolz.defaults.cparams['cname'] = 'lz4'
bcolz.defaults.cparams['clevel'] = 9
bcolz.defaults.cparams['shuffle'] = bcolz.BITSHUFFLE
bcolz.set_nthreads(1)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     1.1.2
NumPy version:     1.13.3
Blosc version:     1.11.2 ($Date:: 2017-01-27 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
Numexpr version:   2.6.2
Dask version:      0.15.0
Python version:    2.7.11 |Anaconda custom (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Platform:          darwin-x86_64
Byte-ordering:     little
Detected cores:    8
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


8

In [32]:
%memit snps_bcolz1 = bcolz.ctable.fromdataframe(snps)

You can access infer_dtype as pandas.api.types.infer_dtype
  inferred_type = pd.lib.infer_dtype(vals)


peak memory: 3129.13 MiB, increment: 1229.25 MiB


In [33]:
## Indexing is much like numpy
snps_bcolz1[0:5]

array([(171, 1,   1.75261679e+08, ''), (242, 1,   2.08694610e+07, ''),
       (538, 1,   6.16095800e+06, 'KCNAB2'),
       (546, 1,   9.36175460e+07, 'TMED5'),
       (549, 1,   1.55468250e+07, 'TMEM51')],
      dtype=[('rs', '<i8'), ('chr', '<i8'), ('pos', '<f8'), ('loci', 'S76')])

In [36]:
## You can also easily iterate over a ctable
## Here we are iterating over the first 10 rows
[row.loci for row in snps_bcolz1.iter(0,10)]

['',
 '',
 'KCNAB2',
 'TMED5',
 'TMEM51',
 'ATP2B4',
 'FUCA1',
 'C1orf123,CPT2',
 'SERPINC1',
 '']

### Load Data into a Bcolz ctable (on-disk)

In [37]:
## Create an on-disk Bcolz ctable from a Pandas DataFrame
## First remove the data directory if it already exists
bcolz_dir = './data/bcolz_data'
if os.path.exists(bcolz_dir):
    shutil.rmtree(bcolz_dir)

In [38]:
%%memit
snps_bcolz2 = bcolz.ctable.fromdataframe(snps, rootdir=bcolz_dir)

peak memory: 2748.48 MiB, increment: 1005.21 MiB


In [39]:
%memit

peak memory: 1746.95 MiB, increment: 0.00 MiB


In [40]:
## How much space is used on disk?
!du -sh $bcolz_dir

182M	./data/bcolz_data


### Perform Query Using Bcolz (in-memory)

In [41]:
%time bcolz_result = snps_bcolz1["(chr==1) & (pos==225512846) & (loci=='DNAH14')"]

CPU times: user 1.29 s, sys: 25.2 ms, total: 1.32 s
Wall time: 1.32 s


In [42]:
bcolz_result

array([(189425743, 1,   2.25512846e+08, 'DNAH14')],
      dtype=(numpy.record, [('rs', '<i8'), ('chr', '<i8'), ('pos', '<f8'), ('loci', 'S76')]))

### Perform Query Using Bcolz (on-disk)

In [43]:
%time bcolz_result = snps_bcolz2["(chr==1) & (pos==225512846) & (loci=='DNAH14')"]

CPU times: user 1.39 s, sys: 130 ms, total: 1.52 s
Wall time: 1.93 s


In [44]:
bcolz_result

array([(189425743, 1,   2.25512846e+08, 'DNAH14')],
      dtype=(numpy.record, [('rs', '<i8'), ('chr', '<i8'), ('pos', '<f8'), ('loci', 'S76')]))

In [45]:
## Another way to query the bcolz ctable
%time bcolz_result2 = [row for row in snps_bcolz2.where("(chr==1) & (pos==225512846) & (loci=='DNAH14')")]

CPU times: user 1.42 s, sys: 125 ms, total: 1.54 s
Wall time: 1.75 s


In [46]:
bcolz_result2

[row(rs=189425743, chr=1, pos=225512846.0, loci='DNAH14')]

### Save to HDF5 for Use Later

In [None]:
## Note: this will use the cparams specified above (i.e. same compression as Pandas)
## The in-memory ctable (snps_bcolz1) is exported here
%time snps_bcolz1.tohdf5('./data/snps_bcolz_hdf.h5', mode='w', nodepath='/snps')

In [None]:
## How much space is used on disk?
!du -sh ./data/snps_bcolz_hdf.h5

### Bcolz carrays and `bdot`

`carrays` are very similar to numpy multi-dimensional arrays (ndarrays), but have features such as compression and on-disk storage that make them useful for large data sets. They are good for homogeneous data, in contrast to `ctables` (above), which are better for heterogeneous data tables.

The `bdot` module is built on Bcolz and allows for fast dot products on carrays.

https://github.com/tailwind/bdot
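
The general idea behind a dot product on chunked data can be sketched in plain NumPy. This is not taken from bdot's source; it only illustrates that a matrix-vector product can be computed one block of rows at a time, so the full matrix never has to be held uncompressed in memory at once. The `chunked_dot` function and its `chunk_rows` parameter are hypothetical names for this example (Python 3).

```python
import numpy as np

def chunked_dot(matrix, vector, chunk_rows=1000):
    """Matrix-vector product computed one block of rows at a time."""
    out = np.empty(matrix.shape[0], dtype=np.result_type(matrix, vector))
    for start in range(0, matrix.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Only this block of rows needs to be available in memory
        out[start:stop] = np.dot(matrix[start:stop], vector)
    return out

rng = np.random.RandomState(0)
m = rng.uniform(0, 1, size=(10000, 100))
v = m[0]                      # first row, shape (100,)
x = chunked_dot(m, v)
print(x.shape)
print(np.allclose(x, m.dot(v)))
```

Because the function only slices blocks of rows, it should work on any array-like object whose slices come back as NumPy arrays.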

Other resources for further reading:

Blaze: [http://blaze.pydata.org/](http://blaze.pydata.org/)

Dask: [https://dask.pydata.org/en/latest/](https://dask.pydata.org/en/latest/)

In [57]:
## Create an on-disk Bcolz carray from a numpy ndarray
## First remove the data directory if it already exists
bcolz_dir2 = './data/bcolz_data2'
if os.path.exists(bcolz_dir2):
    shutil.rmtree(bcolz_dir2)

In [58]:
carray1 = bdot.carray(np.random.uniform(0, 1, size=(10000, 100)), rootdir=bcolz_dir2)
carray1.flush()

In [64]:
## How much space is used on disk?
!du -sh $bcolz_dir2

7.6M	./data/bcolz_data2


In [59]:
carray1.shape

(10000, 100)

In [60]:
carray1[0:5, 0:5]

array([[ 0.14734498,  0.82226933,  0.21698654,  0.35471016,  0.63703963],
       [ 0.34430996,  0.29680343,  0.64645604,  0.05586492,  0.66312949],
       [ 0.06034305,  0.09151643,  0.76518275,  0.92991039,  0.06438756],
       [ 0.08673422,  0.04884272,  0.45266602,  0.6098255 ,  0.22949568],
       [ 0.25815289,  0.32575835,  0.62686822,  0.94105001,  0.82776109]])

In [61]:
## Extract the first row (indexing a carray returns a numpy array)
v = carray1[0]
v.shape

(100,)

In [65]:
## Perform a dot product 
%memit x = carray1.dot(v)

peak memory: 1766.63 MiB, increment: -0.50 MiB


In [63]:
x.shape

(10000,)

## PyTables

PyTables is a Python module that provides an interface to the HDF5 library. It extends the basic HDF5 functionality with improved querying, such as the in-kernel searches and column indexes used below.

[http://www.pytables.org/usersguide/tutorials.html](http://www.pytables.org/usersguide/tutorials.html)

### HDF5 from Pandas (PyTables format)

In [67]:
## Set filename and determine whether the file is in PyTables format
pandas_hdf5 = './data/snps_pandas_hdf.h5'
tables.is_pytables_file(pandas_hdf5)

'2.1'

In [68]:
%%memit
## Let's load the HDF5 file exported by Pandas
h5file = tables.open_file(pandas_hdf5)
h5_snps_pandas = h5file.root.snps

In [69]:
h5_snps_pandas.table

/snps/table (Table(12237943,), shuffle, blosc:lz4(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "rs": Int64Col(shape=(), dflt=0, pos=1),
  "chr": Int64Col(shape=(), dflt=0, pos=2),
  "pos": Float64Col(shape=(), dflt=0.0, pos=3),
  "loci": StringCol(itemsize=76, shape=(), dflt='', pos=4)}
  byteorder := 'little'
  chunkshape := (4854,)

In [70]:
## Is the table indexed?
h5_snps_pandas.table.colindexes

{
  }

In [71]:
## Let's search the HDF5 file
## The first column in the table is an index; I'm excluding it from the results
%time pytables_result = [row[1:] for row in h5_snps_pandas.table.iterrows() if row['chr']==1 and row['pos']==225512846 and row['loci']=='DNAH14']

CPU times: user 3.85 s, sys: 101 ms, total: 3.95 s
Wall time: 4 s


In [72]:
pytables_result

[(189425743, 1, 225512846.0, 'DNAH14')]

In [73]:
## Now let's run a query using the PyTables in-kernel method
%time pytables_result2 = [row[1:] for row in h5_snps_pandas.table.where("""(chr==1) & (pos==225512846) & (loci=='DNAH14')""")]

CPU times: user 1.08 s, sys: 45.9 ms, total: 1.12 s
Wall time: 1.12 s


In [74]:
pytables_result2

[(189425743, 1, 225512846.0, 'DNAH14')]

In [75]:
## Close the file
h5file.close()

### HDF5 from Bcolz

In [86]:
bcolz_hdf5 = './data/snps_bcolz_hdf.h5'
tables.is_pytables_file(bcolz_hdf5)

'2.1'

In [87]:
## And now the HDF5 file exported by bcolz
bcolz_hdf5 = './data/snps_bcolz_hdf.h5'
h5file2 = tables.open_file(bcolz_hdf5)
h5_snps_bcolz = h5file2.root.snps

In [88]:
h5_snps_bcolz

/snps (Table(12237943,), shuffle, blosc:lz4(9)) ''
  description := {
  "rs": Int64Col(shape=(), dflt=0, pos=0),
  "chr": Int64Col(shape=(), dflt=0, pos=1),
  "pos": Float64Col(shape=(), dflt=0.0, pos=2),
  "loci": StringCol(itemsize=76, shape=(), dflt='', pos=3)}
  byteorder := 'little'
  chunkshape := (5242,)

In [89]:
h5_snps_bcolz.colindexes

{
  }

In [90]:
## Let's search the HDF5 file (from Bcolz)
%time hdf5_result = [row[:] for row in h5_snps_bcolz.iterrows() if row['chr']==1 and row['pos']==225512846 and row['loci']=='DNAH14']

CPU times: user 3.73 s, sys: 48.7 ms, total: 3.78 s
Wall time: 3.78 s


In [91]:
hdf5_result

[(189425743, 1, 225512846.0, 'DNAH14')]

In [92]:
## Now let's run a query using the PyTables in-kernel method (again, on the Bcolz HDF5 file)
%time hdf5_result2 = [row[:] for row in h5_snps_bcolz.where("""(chr==1) & (pos==225512846) & (loci=='DNAH14')""")]

CPU times: user 1.08 s, sys: 48.1 ms, total: 1.13 s
Wall time: 1.13 s


In [93]:
## Close the file
h5file2.close()

In [94]:
## Are any files still open?
len(tables.file._open_files.get_handlers_by_name('./data/snps_pandas_hdf.h5'))

0

In [95]:
## If so, close them
tables.file._open_files.close_all()

### PyTables: Creating an Index

In [96]:
## Open the file in append mode
h5file = tables.open_file(pandas_hdf5, mode='a')
h5_snps_pandas = h5file.root.snps

In [97]:
## Let's take a look at the table
h5_snps_pandas.table

/snps/table (Table(12237943,), shuffle, blosc:lz4(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "rs": Int64Col(shape=(), dflt=0, pos=1),
  "chr": Int64Col(shape=(), dflt=0, pos=2),
  "pos": Float64Col(shape=(), dflt=0.0, pos=3),
  "loci": StringCol(itemsize=76, shape=(), dflt='', pos=4)}
  byteorder := 'little'
  chunkshape := (4854,)

In [98]:
## Now create the index
## Note: a csindex (completely sorted) is the most optimized index
## and therefore is likely to take longer to create and consume
## more disk space. You can create other types of indexes with
## the create_index() method
%time h5_snps_pandas.table.cols.pos.create_csindex()

CPU times: user 26.9 s, sys: 738 ms, total: 27.7 s
Wall time: 28.3 s


12237943

In [99]:
## Now you should see that an index has been added
h5_snps_pandas.table

/snps/table (Table(12237943,), shuffle, blosc:lz4(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "rs": Int64Col(shape=(), dflt=0, pos=1),
  "chr": Int64Col(shape=(), dflt=0, pos=2),
  "pos": Float64Col(shape=(), dflt=0.0, pos=3),
  "loci": StringCol(itemsize=76, shape=(), dflt='', pos=4)}
  byteorder := 'little'
  chunkshape := (4854,)
  autoindex := True
  colindexes := {
    "pos": Index(9, full, shuffle, zlib(1)).is_csi=True}

In [100]:
## Determine whether your query will use an index
## This returns a frozenset of the column names whose indexes can be used,
## or an empty frozenset if none
h5_snps_pandas.table.will_query_use_indexing("""(chr==1) & (pos==225512846) & (loci=='DNAH14')""")

frozenset({'pos'})

In [101]:
## Let's run a query using the PyTables in-kernel method (now using an index)
%time pandas_result2 = [row[1:] for row in h5_snps_pandas.table.where("""(chr==1) & (pos==225512846) & (loci=='DNAH14')""")]

CPU times: user 5.75 ms, sys: 3.06 ms, total: 8.81 ms
Wall time: 10.6 ms


**This is comparable to the query on the indexed MySQL table (10.6 ms vs. 11.5 ms wall time).**

In [102]:
pandas_result2

[(189425743, 1, 225512846.0, 'DNAH14')]

In [103]:
## Close the file
h5file.close()

## What did we learn?

- Some basic ways to measure the performance (i.e. the resources used) of computational tasks
- There are multiple solutions for storing large datasets in Python, each with different capabilities
- Indexing can greatly improve query performance

### A Quick Summary

- For storing data in memory:
    - Pandas (Numpy)
    - Bcolz
- For storing data on-disk:
    - Bcolz
    - HDF5 (PyTables)


## In-Class Exercises

In [None]:
## Exercise 1.
## 

In [None]:
## Exercise 2.
##

#### Last Updated: 25-Oct-2017