In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas
import pyodbc

# Code Quality Assurance Practices

## Summary
 * How do you define quality code?
 * What are the common QA practices?
 * How do you conduct a peer review?
 * What kinds of tools can we use?
 * Common pitfalls

## How do you define quality code?
### Basic Tenets:
 * DRY: Don't Repeat Yourself
 * KISS: Keep it Simple, Stupid!
 * SRP: Single Responsibility Principle
 * Don't reinvent the wheel
   * But don't blindly trust other wheels!
 * Be clear:
   * function names should mean something
   * variable names should mean something
   * lines shouldn't be too long
   * functions should have descriptions if sufficiently complex
 * Results: Does the code produce expected results from known input?
 * Error handling: are edge and corner cases properly handled?

## Work through a basic example: retrieving values from a database 
(`import pyodbc` is implied)

In [2]:
def db_locs():
    ''' Gets a list of locations from the database
    '''
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    query = "select locname from locations"
    cur.execute(query)
    
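    # pyodbc rows are indexed by position or attribute, not by column-name string;
    # the query selects a single column, so take position 0
    results = [row[0] for row in cur]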
    cur.close()
    cnn.close()
    return results

In [3]:
def db_samps():
    ''' Gets a list of samples from the database
    '''
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    query = "select samplename from samples"
    cur.execute(query)
    
    # single-column query: take position 0 (pyodbc rows reject string keys)
    samples = [row[0] for row in cur]
    cur.close()
    cnn.close()
    return samples

### What do the names mean? Are we getting or inserting?
### Apply some DRY and clear up names

In [4]:
def connectToDB(cmd=None):
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    if cmd is not None:
        cur.execute(cmd)
        
    return cnn, cur

def closeConnections(cnn, cur):
    cur.close()
    cnn.close()
    
def getLocations():
    query = "select locname from locations"
    cnn, cur = connectToDB(cmd=query)
    results = [row[0] for row in cur]  # single-column query: position 0
    closeConnections(cnn, cur)
    return results
    
def getSamples():
    query = "select samplename from samples"
    cnn, cur = connectToDB(cmd=query)
    samples = [row[0] for row in cur]  # single-column query: position 0
    closeConnections(cnn, cur)
    return samples
    

### And more DRY still...

In [5]:
def connectToDB(cmd=None):
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    if cmd is not None:
        cur.execute(cmd)
    return cnn, cur

def closeConnections(cnn, cur):
    cur.close()
    cnn.close()

def _get_single_col_from_table(column, table):
    query = "select {} from {}".format(column, table)
    cnn, cur = connectToDB(cmd=query)
    values = [row[0] for row in cur]  # single-column query: position 0
    closeConnections(cnn, cur)
    return values

def getLocations():
    return _get_single_col_from_table('locname', 'locations')

def getSamples():
    return _get_single_col_from_table('samplename', 'samples')

## Now use previously invented wheels

In [6]:
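import contextlib

import pyodbc
import pandas

def connectToDB():
    return pyodbc.connect(database='prjTEST', server='pmtester-02')

def _get_single_col_from_table(column, table):
    query = "select {} from {}".format(column, table)
    # pyodbc's own `with` block commits but does not close the connection,
    # so wrap it in contextlib.closing to release it when done
    with contextlib.closing(connectToDB()) as cnn:
        values = pandas.read_sql(query, cnn)
    return values[column].tolist()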

def getLocations():
    return _get_single_col_from_table('locname', 'locations')

def getSamples():
    return _get_single_col_from_table('samplename', 'samples')

## Recall where we started

In [7]:
def db_locs():
    ''' Gets a list of locations from the database
    '''
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    query = "select locname from locations"
    cur.execute(query)
    
    results = [row[0] for row in cur]  # single-column query: position 0
    cur.close()
    cnn.close()
    return results

def db_samps():
    ''' Gets a list of samples from the database
    '''
    cnn = pyodbc.connect(database='prjTEST', server='pmtester-02')
    cur = cnn.cursor()
    query = "select samplename from samples"
    cur.execute(query)
    
    samples = [row[0] for row in cur]  # single-column query: position 0
    cur.close()
    cnn.close()
    return samples

## So the code looks nice, but can we trust it?

### The main goal of things like DRY, SRP, and KISS is to facilitate QA.

### QA comes in a few primary forms:
 * Peer review
   * Code is read far more than it is written
   * Intent should be communicated through function/variable names
   * Code should execute in a very linear fashion, telling a story
     where classes/variables are nouns acted upon by verbs 
     (methods/functions)
 * Unit testing
   * Small, simple functions are easier to test:
     * range of possible inputs and outputs shrinks
     * error handling simplifies
 * Continuous Integration (CI)
   * As code evolves, CI systems automatically run tests
   * Test failures trigger an alert
   * Code metrics (such as test coverage) can be reported automatically

## To quote Wes McKinney (author of pandas)

> The test suite is where a library hangs its dirty laundry

After the main code base has been reviewed for the
main concepts, it's time to really dig into the test
suite.

## Q: What's a test suite?
## A: Test suites are collections of special functions and classes that confirm that each library function and class behaves as expected. They also guard against small errors propagating into other parts of a code base.
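
As a minimal sketch of what such a suite might look like, the cell below uses the standard-library `unittest` module. The function under test, `build_query`, is hypothetical: it is the query-building step from `_get_single_col_from_table` pulled out on its own so it can be checked without a database.

```python
import unittest

def build_query(column, table):
    # Hypothetical helper: the query-building step from
    # _get_single_col_from_table, isolated for testing
    return "select {} from {}".format(column, table)

class TestBuildQuery(unittest.TestCase):
    def test_locations(self):
        self.assertEqual(build_query('locname', 'locations'),
                         "select locname from locations")

    def test_samples(self):
        self.assertEqual(build_query('samplename', 'samples'),
                         "select samplename from samples")

# Load and run the suite explicitly so this also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBuildQuery)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method checks one expectation, so a failure points directly at the behavior that broke.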

## Q: But I wrote a bunch of code without a test suite? How are you going to review it?
## A1: I'm going to make you write a test suite
## A2: You should have written your tests before you wrote your code

## Caveat: A test suite is only as good as you make it.

### You should strive to really stress your functions and seek out edge and corner cases
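
To illustrate what stressing a function can mean, here is a sketch built around a hypothetical `median` helper. The implementation and the chosen cases are assumptions for illustration. The checks cover the typical case plus the corners that tend to get forgotten: an even number of values, a single value, repeated values, and empty input.

```python
def median(values):
    """Hypothetical helper: median of a sequence of numbers."""
    if len(values) == 0:
        # Fail loudly rather than return a misleading value
        raise ValueError("median() requires at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Typical case: odd count, unsorted input
assert median([1, 3, 2]) == 2
# Corner case: even count averages the two middle values
assert median([4, 1, 3, 2]) == 2.5
# Edge case: a single value is its own median
assert median([7]) == 7
# Edge case: repeated values
assert median([5, 5, 5, 5]) == 5
# Edge case: empty input must raise, not silently return something
try:
    median([])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for empty input")
```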

## Test-driven development Example

### PM needs a database summarized. Main output should be a table with the following columns:
   1. Site Area
   2. Pollutant
   3. Median concentration, three sig figs with qualifier
   4. Maximum concentration, three sig figs with qualifier
   

   
### So we need functions to 
   1. connect to the database
   2. retrieve data from the database
   3. format a number of any order of magnitude to 3 sig figs
   4. find the maximum result for a given site area and its qualifier
   5. find the median results ...
   6. combine the formatted result and qualifier into a string
   7. write all of the output to a table
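
A test-first pass at items 4 and 6 might look like the sketch below. The tests are written before the functions they exercise. Everything specific here is an assumption made for illustration: the names `maxResult` and `combineResult`, the `(result, qualifier)` tuple layout, the qualifier symbols, and the choice to put the qualifier in front of the number.

```python
# Step 1: write the tests, stating the behavior we want
def test_max_result_keeps_its_qualifier():
    records = [(1.2, '='), (5.0, '<'), (3.3, '=')]
    assert maxResult(records) == (5.0, '<')

def test_combine_puts_qualifier_first():
    assert combineResult('0.500', '<') == '<0.500'
    assert combineResult('12.3', '') == '12.3'

# Step 2: write just enough code to make the tests pass
def maxResult(records):
    # records: list of (result, qualifier) tuples (assumed layout);
    # returns the tuple with the largest result
    return max(records, key=lambda rec: rec[0])

def combineResult(formatted, qualifier):
    # formatted: result already rendered as a string (assumed input)
    return "{}{}".format(qualifier, formatted)

# Step 3: run the tests
test_max_result_keeps_its_qualifier()
test_combine_puts_qualifier_first()
```

Writing the tests first forces decisions, such as what happens to the qualifier of the maximum value, to be made explicitly instead of discovered later.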

## Formatting by significant figures