# Debugging and Testing
**Based on materials contributed by Anthony Scopatz, Patrick Fuller, Katy Huff, and Rachel Slaybaugh**

## What is debugging?

The process of debugging is one of the most crucial parts of every piece of code you will ever write. Quite simply, it is finding and correcting errors in a program.

## Why does debugging matter?

Unless you're perfect, you are bound to make errors. Especially early 
on, expect to spend much more time debugging than actually coding. The process 
fits the Pareto principle - you're going to spend \~20% of your time writing 
~80% of your code, and the other ~80% of your time will be spent screaming 
obscenities at your computer (I think that's what the Pareto principle says, 
anyway). Remember to keep calm, and **LEARN** from your mistakes.

## Debugging Basics: Exceptions, Errors, and Tracebacks

When your code errors, Python will stop and return an *exception* that attempts 
to tell you what's up. There are approximately 165 exceptions in the Python standard 
library, and you'll be seeing many of them very soon. Exceptions to know
include:

    SyntaxError # You're probably missing a parenthesis or colon
    NameError   # There's probably a variable name typo somewhere
    TypeError   # You're doing something with incompatible variable types
    ValueError  # You're passing an argument of the right type but with an unacceptable value
    IOError     # You're trying to use a file that doesn't exist (FileNotFoundError in Python 3)
    IndexError  # You're trying to reference a list element that doesn't exist
    KeyError    # Similar to an IndexError, but for dictionaries
    Exception   # This means "an error of any type" - hopefully you don't see it often

When code returns an exception, we say that the exception was **thrown** or
**raised**. These exceptions may be **handled** or **caught** by the code. Speaking
of, you can handle exceptions in Python like so:

In [3]:
try:
    a = 1.0 / 0.0
except ZeroDivisionError:
    print("Going from zero to hero.")
    a = 1.0

Going from zero to hero.


That being said, there are some things you should keep in mind.

* Exception handling in your own code should be seen as a last resort. _Never_
  use exception handling where another approach would work just as well.
* If you have to handle exceptions, be specific in their type. Writing a blanket
  `except Exception` line provides a place for unintended bugs to hide.
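
As a minimal sketch of the second point, the hypothetical `parse_ratio()` helper below (not part of the lesson's code) catches only the one exception it anticipates. An unrelated mistake, such as passing a non-numeric string, still surfaces as an error instead of being silently swallowed.

```python
def parse_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    try:
        return float(numerator) / float(denominator)
    except ZeroDivisionError:
        # Only the failure we anticipated is handled here.
        return None


print(parse_ratio(1, 0))       # the anticipated case is handled

try:
    parse_ratio("one", 2)      # an unrelated bug is NOT hidden
except ValueError as err:
    print("Unhandled on purpose:", err)
```

Had the function used a blanket `except Exception`, the bad string would have produced `None` as well, and the bug would have gone unnoticed.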

When an exception is printed, it often comes with something called a
**traceback**. This is Python's attempt to tell you where the code errored. It
will look like gibberish for a while, but that impression will go away with time.
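
To make tracebacks a little less mysterious, here is a small sketch that triggers an error on purpose and captures the traceback text with the standard library's `traceback` module. The functions `inner()` and `outer()` are made up for illustration.

```python
import traceback


def inner(values):
    return values[5]           # fails: the list is too short


def outer():
    return inner([1, 2, 3])


try:
    outer()
except IndexError:
    report = traceback.format_exc()
    print(report)
```

Read the output from the bottom up: the last line names the exception type and message, and the lines above it list the chain of calls, with the most recent call last.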

So, when your code errors, Python tells you 

1. why it errored, and 
2. where it errored. 

"*Isn't that enough to debug?*", you might ask. 

Well, yeah. It is. However, if you debug only by running your code, you will spend a lot more
time in the screaming-obscenities-at-your-computer portion of coding. None of the tools
discussed below adds much in terms of functionality (they're still just
pointing out errors), but they all help decrease debugging time.

## Linting: Catching the Stupid Errors

As I said before, you can debug by simply attempting to run your code. This,
however, is very annoying. First off, the code will always stop at the first 
exception. This means that, if you have ten errors, you'll have to run the code 
ten times to find all of them. Now, imagine that this is long-running code.
Imagine waiting five minutes for your code to run, only to discover it breaks
because of a typo. Doesn't that sound terrible?

Enter linting. "Linting" is the process of discovering errors in code (typically 
typos and syntax errors, i.e., the dumb stuff) before the code is ever run or 
compiled. In Python, this can be accomplished with the 
[pyflakes](http://pypi.python.org/pypi/pyflakes/) 
library. It works by statically analyzing your code without running it. This 
means that it can find multiple errors at once, rather than stopping at the 
first exception.

You can run pyflakes on your code by typing:

    $ pyflakes my_code.py
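
To get a feel for what "analyzing without running" means, the sketch below uses the standard library's `ast` module to parse source text. This is only a stand-in for illustration: it catches syntax errors and nothing else, whereas pyflakes reports many more kinds of problems. The helper name `syntax_problem()` and the sample strings are assumptions.

```python
import ast

SOURCE_WITH_TYPO = "def area(r)\n    return 3.14159 * r ** 2\n"   # missing colon
SOURCE_OK = "def area(r):\n    return 3.14159 * r ** 2\n"


def syntax_problem(source):
    """Return a short message if the source does not parse, else None."""
    try:
        ast.parse(source)      # parses only; nothing is executed
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


print(syntax_problem(SOURCE_OK))
print(syntax_problem(SOURCE_WITH_TYPO))
```

Neither `area()` function is ever called, which is the whole point: the problem is found before any (possibly long-running) code executes.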
    
We can take this a step further with integrated development environments, or
*IDEs*. IDEs are (basically) glorified text editors that dynamically lint,
showing you typos as you write your code. You will find that coders generally
have strong opinions on the use of IDEs, either positive or negative. Regardless,
if you want to play around with one, I recommend [Eclipse](http://www.eclipse.org/) 
with the [PyDev](http://pydev.org/) plugin, or [Visual Studio Code](https://code.visualstudio.com/docs) with the Python extension in its extension marketplace.

## Coding standards: The Details Matter!

> The one skill that separates bad programmers from good programmers is attention to detail. 
>
> Zed Shaw, _Learn Python the Hard Way_

In a written natural language, there are many ways to express the same idea. To 
make the consumption of information easier, people define style guides to enforce 
particularly effective ways of writing. This is even more true in coding; 
consistent style choices make scripts much easier to read. They become absolutely 
essential as projects become large (>1 person).

Some programming languages, such as Java and C++, have multiple competing standards, 
and it's easy to imagine how messy this can get. Luckily, Python doesn't have 
this issue. The official standard, [PEP8](http://www.python.org/dev/peps/pep-0008/), 
is used everywhere. Unless you plan on hiding all the code you write from the 
outside world you should learn PEP8.

To help out coders, there are tools to test for compliance. The aptly named 
`pep8` library (renamed `pycodestyle` in later releases) shows you where your code deviates from the PEP8 standard, and
`autopep8` goes a step further by trying to fix all of your errors for you.
These are both run from the shell, as


    $ pep8 my_code.py
    
    $ autopep8 my_code.py > my_new_code.py

These libraries won't always pick up everything. Furthermore, due to
the desire to maintain backward compatibility, there is some wiggle room in PEP8 
(see [this powerpoint](http://www.python.org/doc/essays/ppt/regrets/PythonRegrets.ppt) 
of Python regrets, made by the creator of the language). Here are some additional
rules to remember:

**PEP8 conventions missed by the `pep8` checker:**

* Variables and functions should be named in `snake_case`. No capital letters.
  Classes are named in `CapCase`.
* Docstrings and multiline comments use `"""`, not `'''`.
* Private methods and variables should be prefixed with an underscore, i.e., 
  `_my_private_method()`.
 
**Special rules outside of PEP8**

* _Never_ use tabs. Ever.
* Use list comprehensions over `map()`, `reduce()`, and `filter()`.
* Avoid iterating through lists by index whenever possible.
 
These rules might seem arbitrary at first.  However, they
make collaborative coding much easier.
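
A short sketch of the last two rules in practice. The temperature data and variable names are invented for illustration.

```python
temperatures_c = [12.5, 18.0, 21.5, 30.0]

# Iterate directly over the items instead of over range(len(...)).
for temp in temperatures_c:
    print(temp)

# enumerate() supplies the index when it is genuinely needed.
for position, temp in enumerate(temperatures_c):
    print(position, temp)

# A list comprehension in place of map() and filter():
# convert to Fahrenheit, keeping only readings above 15 degrees C.
warm_f = [t * 9 / 5 + 32 for t in temperatures_c if t > 15]
print(warm_f)
```

The comprehension reads left to right as "this expression, for each item, if this condition holds", which is usually easier to scan than nested `map()` and `filter()` calls with lambdas.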

## Debuggers: for the deep-rooted errors

Linting will only catch the really obvious errors. For more complex issues
(i.e., bugs), you're going to want to follow the code's logic line by line. One
lazy way to do this is to put `print` statements everywhere, which allows you
to view variables over time. However, this gets messy quickly, and you lose
control of what variables you can see once you start executing.

This is where the Python DeBugger, or _pdb_, comes into play. With it, you can 
step through your code and watch as variables are changed.

All you have to do to use this is import the `pdb` module and call the 
`set_trace()` function.

In [4]:
import pdb
# ... 
# Your code here
# ...
pdb.set_trace()

Now, when you run the code, it will stop at the line where you put `set_trace()`.
You'll be prompted to give a command. Some common commands include:

 * `continue` (or `c`) resumes execution until the next `set_trace()` call or breakpoint is hit
 * `p <expression>` prints the current value of a variable or expression
 * `list` (or `l`) shows the source code around the current line
 * `args` (or `a`) prints the values of all the arguments in the current function

There are a lot more options, which can be found [here](https://docs.python.org/3/library/pdb.html), 
but these few should be enough to get you running with pdb.
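
As a rough sketch of how this looks inside real code, the hypothetical `running_total()` function below pauses in the debugger on every loop iteration, but only when asked to. It uses the built-in `breakpoint()` (Python 3.7 and later), which by default does the same thing as `pdb.set_trace()`. The `debug` flag is an assumption made for this example, not a pdb feature.

```python
def running_total(values, debug=False):
    """Sum the values, optionally pausing in pdb before each addition."""
    total = 0
    for value in values:
        if debug:
            breakpoint()       # drops into pdb; try `p total`, `n`, or `c`
        total += value
    return total


print(running_total([1, 2, 3]))               # runs straight through
# running_total([1, 2, 3], debug=True)        # uncomment to step through it
```

Guarding the call with a flag keeps the debugging hook out of the way during normal runs, so you don't have to delete and re-add it every time the bug comes back.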

## Profiling: making code fast

So, you've found your errors, those deep-rooted bugs, and even standardized
your code to conform to PEP8. But, for some reason, it's still really slow.
What can we do about this?

The first idea you might have is to time your code. Analogous to the `print`
statement debugging above, you could write some logic to print run times at
various points in your script.

In [5]:
from time import time

t_naught = time()
# code you want to  time
print(time() - t_naught)

4.1961669921875e-05
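
If you find yourself doing this in several places, one way to keep it tidy is to wrap the timing logic in a context manager. This is a sketch rather than a standard recipe: the `stopwatch()` name and the `timings` dictionary are made up, and it uses `time.perf_counter()`, which is designed for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Record the elapsed wall-clock seconds for the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the timed block raises an exception.
        results[label] = time.perf_counter() - start


timings = {}
with stopwatch("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

print(timings)
```

Each `with` block adds one labeled entry to `timings`, so the measurements end up in one place instead of scattered `print` calls.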


You can also time your entire script with the `time` BASH command:

    $ time python my_code.py

While these both work, they're either too messy or not detailed enough. What we
really want is a breakdown of how long the computer spends running each part
of our code.

*Profilers* provide a way to do just this. With Python, run your script in a
shell with this command

    $ python -m cProfile -s time my_code.py

This returns the number of function calls made during execution, along with a
breakdown of the time each function took. The `-s time` part sorts the output
by the time taken (which is usually what you care about). A sample output looks 
like:

    2530004 function calls in 0.789 seconds
    Ordered by: internal time
   
     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1980000    0.253    0.000    0.253    0.000 my_code.py:23(<genexpr>)
      10000    0.190    0.000    0.780    0.000 my_code.py:16(evolve)
     220000    0.182    0.000    0.436    0.000 {sum}
     270000    0.150    0.000    0.150    0.000 my_code.py:28(neighbors)
          1    0.009    0.009    0.789    0.789 my_code.py:9(my_func)
      50000    0.004    0.000    0.004    0.000 {method 'add' of 'set' objects}
          1    0.000    0.000    0.000    0.000 {range}

With this information, you can go back into your code and adjust the logic to 
improve the speed bottlenecks.
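
The profiler can also be driven from inside Python, which is handy when you only care about one part of a script. The sketch below uses the standard `cProfile` and `pstats` modules. The `slow_sum()` and `workload()` functions are invented stand-ins for your own code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    return sum(i * i for i in range(n))


def workload():
    return [slow_sum(20_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()              # start collecting
workload()
profiler.disable()             # stop collecting

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)    # five most expensive entries
report = buffer.getvalue()
print(report)
```

Sorting by `"tottime"` gives the same ordering as `-s time` on the command line: time spent inside each function itself, excluding the functions it calls.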

## Testing: For making sure it does what you think

Software testing is a process by which one or more expected behaviors
and results from a piece of software are exercised and confirmed. Well
chosen tests will confirm expected code behavior for the extreme
boundaries of the input domains, output ranges, parametric combinations,
and other behavioral **edge cases**.

## Why test software?

Unless you write flawless, bug-free, perfectly accurate, fully precise,
and predictable code *every time*, you must test your code in order to
trust it enough to answer in the affirmative to at least a few of the
following questions:

-  Does your code work?
-  **Always?**
-  Does it do what you think it does? ([Patriot Missile Failure](http://www.ima.umn.edu/~arnold/disasters/patriot.html))
-  Does it continue to work after changes are made?
-  Does it continue to work after system configurations or libraries
   are upgraded?
-  Does it respond properly for a full range of input parameters?
-  What about **edge and corner cases**?
-  What's the limit on that input parameter?
-  How will it affect your
   [publications](http://www.nature.com/news/2010/101013/full/467775a.html)?

### Verification

*Verification* is the process of asking, "Have we built the software
correctly?" That is, is the code bug free, precise, accurate, and
repeatable?

### Validation

*Validation* is the process of asking, "Have we built the right
software?" That is, is the code designed in such a way as to produce the
answers we are interested in, data we want, etc.

### Uncertainty Quantification

*Uncertainty Quantification* is the process of asking, "Given that our
algorithm may not be deterministic, was our execution within acceptable
error bounds?" This is particularly important for anything which uses
random numbers, e.g., Monte Carlo methods.
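
As a sketch of what such a check might look like, the test below estimates pi by Monte Carlo sampling and asserts that the answer lands within a statistical tolerance instead of matching exactly. The function name, the fixed seed, and the choice of five standard errors as the acceptable bound are all assumptions made for illustration.

```python
import math
import random


def estimate_pi(n_samples, rng):
    """Monte Carlo estimate of pi from random points in the unit square."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


def test_estimate_pi_within_tolerance():
    rng = random.Random(12345)          # fixed seed makes the test repeatable
    n = 100_000
    estimate = estimate_pi(n, rng)
    # Each point lands inside the quarter circle with probability p = pi/4,
    # so the standard error of the estimate is 4 * sqrt(p * (1 - p) / n).
    p = math.pi / 4
    standard_error = 4 * math.sqrt(p * (1 - p) / n)
    assert abs(estimate - math.pi) < 5 * standard_error


test_estimate_pi_within_tolerance()
```

The tolerance comes from the statistics of the method itself, so the test stays meaningful if the sample count changes.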

## Where are tests?

Say we have an averaging function:

In [6]:
def mean(numlist):
    total = sum(numlist)
    length = len(numlist)
    return total/length

Tests could be implemented as runtime *exceptions in the function*:

In [7]:
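def mean(numlist):
    try:
        total = sum(numlist)
        length = len(numlist)
    except TypeError:
        raise TypeError("The number list was not a list of numbers.")
    except Exception:
        # Report the problem, then re-raise so we never reach the
        # return statement with `total` or `length` undefined.
        print("There was a problem evaluating the number list.")
        raise
    return total/length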

Sometimes tests are functions alongside the function definitions
they are testing.

In [8]:
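def mean(numlist):
    try:
        total = sum(numlist)
        length = len(numlist)
    except TypeError:
        raise TypeError("The number list was not a list of numbers.")
    except Exception:
        # Report the problem, then re-raise so we never reach the
        # return statement with `total` or `length` undefined.
        print("There was a problem evaluating the number list.")
        raise
    return total/length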

def test_mean():
    assert mean([0, 0, 0, 0]) == 0
    assert mean([0, 200]) == 100
    assert mean([0, -200]) == -100
    assert mean([0]) == 0


def test_floating_mean():
    assert mean([1, 2]) == 1.5

Sometimes tests live in an executable independent of the main executable.

**Implementation File:** `mean.py`

In [9]:
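def mean(numlist):
    try:
        total = sum(numlist)
        length = len(numlist)
    except TypeError:
        raise TypeError("The number list was not a list of numbers.")
    except Exception:
        # Report the problem, then re-raise so we never reach the
        # return statement with `total` or `length` undefined.
        print("There was a problem evaluating the number list.")
        raise
    return total/length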

**Test File:** `test_mean.py`

In [10]:
from mean import mean

def test_mean():
    assert mean([0, 0, 0, 0]) == 0
    assert mean([0, 200]) == 100
    assert mean([0, -200]) == -100
    assert mean([0]) == 0


def test_floating_mean():
    assert mean([1, 2]) == 1.5

## When should we test?

The three right answers are:

- **ALWAYS!**
- **EARLY!**
- **OFTEN!**

The longer answer is that testing either before or after your software
is written will improve your code, but testing after your program is
used for something important is too late.

If we have a robust set of tests, we can run them before adding
something new and after adding something new. If the tests give the same
results (as appropriate), we can have some assurance that we didn't
break anything. The same idea applies to making changes in your system
configuration, updating support codes, etc.

Another important feature of testing is that it helps you remember what
all the parts of your code do. If you are working on a large project
over three years and you end up with 200 classes, it may be hard to
remember what the widget class does in detail. If you have a test that
checks all of the widget's functionality, you can look at the test to
remember what it's supposed to do.

## Who should test?

In a collaborative coding environment, where many developers contribute
to the same code base, developers should be responsible individually for
testing the functions they create and collectively for testing the code
as a whole.

Professionals often test their code, and take pride in test coverage,
the percent of their functions that they feel confident are
comprehensively tested.

## How are tests written?

The type of tests that are written is determined by the testing
framework you adopt. Don't worry, there are a lot of choices.

### Types of Tests

**Exceptions:** Exceptions can be thought of as a type of runtime test.
They alert the user to exceptional behavior in the code. Often,
exceptions are related to functions that depend on input that is unknown
at compile time. Checks that occur within the code to handle exceptional
behavior that results from this type of input are called Exceptions.

**Unit Tests:** Unit tests are a type of test which test the fundamental
units of a program's functionality. Often, this is on the class or
function level of detail. However, what counts as a *code unit* is not
formally defined.

To test functions and classes, the interfaces (API) - rather than the
implementation - should be tested. Treating the implementation as a
black box, we can probe the expected behavior with boundary cases for
the inputs.
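
For instance, a boundary-focused unit test might look like the sketch below. The `clamp()` function is a made-up example. The test only calls it through its interface and never looks at how it is written.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def test_clamp_boundaries():
    assert clamp(5, 0, 10) == 5        # interior point
    assert clamp(0, 0, 10) == 0        # exactly on the lower bound
    assert clamp(10, 0, 10) == 10      # exactly on the upper bound
    assert clamp(-1, 0, 10) == 0       # just below the range
    assert clamp(11, 0, 10) == 10      # just above the range


test_clamp_boundaries()
```

Because the test depends only on inputs and outputs, `clamp()` could be rewritten with `if` statements tomorrow and the same test would still apply.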

**System Tests:** System level tests are intended to test the code as a
whole. As opposed to unit tests, system tests ask for the behavior as a
whole. This sort of testing involves comparison with other validated
codes, analytical solutions, etc.

**Regression Tests:** A regression test ensures that new code does not
change existing behavior. If you change the default answer, for example, or add a
new question, you'll need to make sure that missing entries are still
found and fixed.

**Integration Tests:** Integration tests query the ability of the code
to integrate well with the system configuration and third party
libraries and modules. This type of test is essential for codes that
depend on libraries which might be updated independently of your code or
when your code might be used by a number of users who may have various
versions of libraries.

**Test Suites:** Putting a series of unit tests into a collection of
modules creates a test suite. Typically the suite as a whole is
executed (rather than each test individually) when verifying that the
code base still functions after changes have been made.

## Elements of a Test

**Behavior:** The behavior you want to test. For example, you might want
to test the fun() function.

**Expected Result:** This might be a single number, a range of numbers,
a new fully defined object, a system state, an exception, etc. When we
run the fun() function, we expect to generate some fun. If we don't
generate any fun, the fun() function should fail its test.
Alternatively, if it does create some fun, the fun() function should
pass this test. The expected result should be known *a priori*. For
numerical functions, this result is ideally analytically determined
even if the function being tested isn't.

**Assertions:** Require that some conditional be true. If the
conditional is false, the test fails.

**Fixtures:** Sometimes you have to do some legwork to create the
objects that are necessary to run one or many tests. These objects are
called fixtures as they are not really part of the tests themselves but
rather involve getting the computer into the appropriate state.

For example, since fun varies a lot between people, the fun() function
is a method of the Person class. In order to check the fun function,
then, we need to create an appropriate Person object on which to run
fun().

**Setup and teardown:** Creating fixtures is often done in a call to a
setup function. Deleting them and other cleanup is done in a teardown
function.

**The Big Picture:** Putting all this together, the testing algorithm is
often:

    setup()
    test()
    teardown()

But, sometimes it's the case that your tests change the fixtures. If so,
it's better for the setup() and teardown() functions to occur on either
side of each test. In that case, the testing algorithm should be:

    setup()
    test1()
    teardown()

    setup()
    test2()
    teardown()

    setup()
    test3()
    teardown()
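
Here is a sketch of that per-test pattern using the standard library's `unittest` module, which calls `setUp()` before and `tearDown()` after every test method. The `Person` class and its `fun()` method are hypothetical, following the running example above. They are not a real library.

```python
import unittest


class Person:
    """Hypothetical class used only for illustration."""

    def __init__(self, name):
        self.name = name
        self.fun_level = 0

    def fun(self):
        self.fun_level += 1
        return self.fun_level


class TestPersonFun(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: build a fresh fixture.
        self.person = Person("Ada")

    def tearDown(self):
        # Runs after every test method: discard the fixture.
        del self.person

    def test_fun_generates_fun(self):
        self.assertEqual(self.person.fun(), 1)

    def test_fun_accumulates(self):
        self.person.fun()
        self.assertEqual(self.person.fun(), 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPersonFun)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests modify the `Person` they are given, yet neither can affect the other, because each one starts from its own freshly built fixture.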

## Nose: A Python Testing Framework

The testing framework we'll discuss today is called nose. However, there are
several other testing frameworks available in most languages. Most notably there
is [JUnit](http://www.junit.org/) in Java, which is arguably credited with
inventing the testing framework. Google also provides a [test
framework](http://code.google.com/p/googletest/) for C++ applications (note, there's
also [CTest](http://cmake.org/Wiki/CMake/Testing_With_CTest)).  There
is at least one testing framework for R:
[testthat](http://cran.r-project.org/web/packages/testthat/index.html).

### Where do nose tests live?

Nose tests are files that begin with `Test-`, `Test_`, `test-`, or
`test_`. Specifically, these satisfy the testMatch regular expression
`[Tt]est[-_]`. (You can also teach nose to find tests by declaring them
in the unittest.TestCase subclasses that you create in your code. You
can also create test functions which are not unittest.TestCase
subclasses if they are named with the configured testMatch regular
expression.)

### Nose Test Syntax

To write a nose test, we make assertions.

    assert should_be_true()
    assert not should_not_be_true()

Additionally, nose itself defines a number of assert functions which can
be used to test more specific aspects of the code base.

    from nose.tools import *

    assert_equal(a, b)
    assert_almost_equal(a, b)
    assert_true(a)
    assert_false(a)
    assert_raises(exception, func, *args, **kwargs)
    assert_is_instance(a, b)
    # and many more!

Moreover, numpy offers similar testing functions for arrays:

    from numpy.testing import *

    assert_array_equal(a, b)
    assert_array_almost_equal(a, b)
    # etc.
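
To see why the "almost equal" variants exist, consider the sketch below. It uses only the standard library (`math.isclose()` is a stand-in here, not a nose function). The last line mirrors the rounding check that `unittest`'s `assertAlmostEqual` performs with its default of 7 decimal places.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of floating-point rounding.
assert total != 0.3

# A tolerance-based comparison succeeds.
assert math.isclose(total, 0.3)

# Rounding the difference to 7 decimal places, then comparing to zero.
assert round(abs(total - 0.3), 7) == 0
```

Whenever a test compares the result of floating-point arithmetic, a tolerance-based assertion is usually the safer choice.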

## Exercise: Writing tests for mean()

There are a few tests for the mean() function that we listed in this
lesson. What are some tests that should fail? Add at least three test
cases to this set. Edit the `test_mean.py` file which tests the mean()
function in `mean.py`.

*Hint:* Think about what form your input could take and what you should
do to handle it. Also, think about the type of the elements in the list.
What should be done if you pass a list of integers? What if you pass a
list of strings?

**Example**:

    $ nosetests test_mean.py

In [11]:
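def mean(numlist):
    try:
        total = sum(numlist)
        length = len(numlist)
    except TypeError:
        raise TypeError("The number list was not a list of numbers.")
    except Exception:
        # Report the problem, then re-raise so we never reach the
        # return statement with `total` or `length` undefined.
        print("There was a problem evaluating the number list.")
        raise
    return total/length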

def test_mean():
    assert mean([0, 0, 0, 0]) == 0
    assert mean([0, 200]) == 100
    assert mean([0, -200]) == -100
    assert mean([0]) == 0

from nose.tools import assert_raises

def test_floating_mean():
    assert mean([1, 2]) == 1.5

def test_string_mean():
    # Pass the function and its argument separately so that
    # assert_raises can call mean() itself and catch the exception.
    assert_raises(TypeError, mean, ["string"])
    assert_raises(TypeError, mean, ["string string"])
    assert_raises(TypeError, mean, [""])
    assert_raises(TypeError, mean, ["", ""])

def test_int_string_mean():
    assert_raises(TypeError, mean, ["string", 1, 0])
    assert_raises(TypeError, mean, [1, 0, "string"])
    assert_raises(TypeError, mean, [1, "string", 0])

def test_int_floating_mean():
    assert mean([0, 1.0, 2]) == 1.0
    assert mean([2.0, 1.0, 9]) == 4.0

## Test Driven Development

Test driven development (TDD) is a philosophy whereby the developer
creates code by **writing the tests first**. That is to say you write the
tests *before* writing the associated code!

This is an iterative process whereby you write a test then write the
minimum amount of code to make the test pass. If a new feature is needed,
another test is written and the code is expanded to meet this new use
case. This continues until the code does what is needed.

TDD operates on the YAGNI principle (You Ain't Gonna Need It). People
who diligently follow TDD swear by its effectiveness. This development
style was put forth most strongly by [Kent Beck in
2002](http://www.amazon.com/Test-Driven-Development-By-Example/dp/0321146530).

### A TDD Example

Say you want to write a std() function which computes the [Standard 
Deviation](http://en.wikipedia.org/wiki/Standard_deviation). You
would - of course - start by writing the test, possibly testing a single set of 
numbers:

In [12]:
from nose.tools import assert_equal, assert_almost_equal

def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

You would *then* go ahead and write the actual function:

In [13]:
def std(vals):
    # you snarky so-and-so
    return 1.0

And that is it, right?! Well, not quite. This implementation fails for
most other values. Adding tests we see that:

In [14]:
def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std2():
    obs = std([])
    exp = 0.0
    assert_equal(obs, exp)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert_equal(obs, exp)

These extra tests now require that we bother to implement at least a slightly more reasonable function:

In [15]:
def std(vals):
    # a little better
    if len(vals) == 0:
        return 0.0
    return vals[-1] / 2.0

However, this function still fails whenever vals has more than two elements or
the first element is not zero. Time for more tests!

In [16]:
def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std2():
    obs = std([])
    exp = 0.0
    assert_equal(obs, exp)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert_equal(obs, exp)

def test_std4():
    obs = std([1.0, 3.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std5():
    obs = std([1.0, 1.0, 1.0])
    exp = 0.0
    assert_equal(obs, exp)

At this point, we had better go ahead and try to do the right thing...

In [17]:
def std(vals):
    # finally, some math
    n = len(vals)
    if n == 0:
        return 0.0
    mu = sum(vals) / n
    var = 0.0
    for val in vals:
        var = var + (val - mu)**2
    return (var / n)**0.5

Here it becomes very tempting to take an extended coffee break or
possibly a power lunch. But then you remember those pesky infinite values!
Perhaps the right thing to do here is to just be undefined.  Infinity in 
Python may be represented by any literal float greater than or equal to 1e309.

In [18]:
def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std2():
    obs = std([])
    exp = 0.0
    assert_equal(obs, exp)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert_equal(obs, exp)

def test_std4():
    obs = std([1.0, 3.0])
    exp = 1.0
    assert_equal(obs, exp)

def test_std5():
    obs = std([1.0, 1.0, 1.0])
    exp = 0.0
    assert_equal(obs, exp)

def test_std6():
    obs = std([1e500])
    exp = NotImplemented
    assert_equal(obs, exp)

def test_std7():
    obs = std([0.0, 1e4242])
    exp = NotImplemented
    assert_equal(obs, exp)

This means that it is time to add the appropriate case to the function
itself:

In [19]:
def std(vals):
    # sequence and you shall find
    n = len(vals)
    if n == 0:
        return 0.0
    mu = sum(vals) / n
    if mu == 1e500:
        return NotImplemented
    var = 0.0
    for val in vals:
        var = var + (val - mu)**2
    return (var / n)**0.5
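
If you keep iterating, the next tests you write will probably involve negative infinity, which the comparison `mu == 1e500` does not catch. One possible next step, sketched here as an assumption and not as the lesson's official answer, is to check each input with `math.isinf()` and to cross-check finite results against `statistics.pstdev()` from the standard library.

```python
import math
import statistics


def std(vals):
    """Population standard deviation; NotImplemented for infinite input."""
    n = len(vals)
    if n == 0:
        return 0.0
    if any(math.isinf(val) for val in vals):
        return NotImplemented
    mu = sum(vals) / n
    return math.sqrt(sum((val - mu) ** 2 for val in vals) / n)


def test_std_matches_reference():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    assert math.isclose(std(data), statistics.pstdev(data))


def test_std_negative_infinity():
    assert std([0.0, -1e500]) is NotImplemented


test_std_matches_reference()
test_std_negative_infinity()
```

Comparing against an independent reference implementation is a cheap way to gain confidence once the hand-computed cases run out.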

## Further Statistics Tests
The `stats.py` and `test_stats.py` files in the `/notebooks/OtherFiles/` folder contain stubs for other simple statistics functions. Try your new test-driven development chops by implementing one or more of these functions along with their corresponding tests.

## Testing Exercise

Complete the files in `/OtherFiles/` called `random_thousand.py` and `test_random_thousand.py`. Make sure you are using test driven development by writing your tests first! These tests will require you to decide on and read documentation for several modules. View this as also an exercise in your googling abilities.