# Unit testing in data science #

I am a big proponent of test-driven development, so when I came to data science I was puzzled not only by the bad naming practices but also by the total lack of unit tests. This is surprising, given that the cost of an error can range from hours lost training an incorrect model to a bad model in production.

Here I concentrate on unit testing NumPy- and pandas-based solutions. My further experiments will be in TensorFlow and Apache Spark.

A Stack Overflow post, https://stackoverflow.com/questions/40172281/unit-tests-for-functions-in-a-jupyter-notebook, suggests the following approach:

In [2]:
def add(a, b):
    return a + b

In [3]:
import unittest

class TestNotebook(unittest.TestCase):

    def test_add(self):
        self.assertEqual(add(2, 2), 4)


unittest.main(argv=[''], verbosity=0, exit=False)

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK


<unittest.main.TestProgram at 0x1a29cd917f0>
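Note that `unittest.main` collects every `TestCase` defined so far in the notebook session, so the number of tests it runs grows as more cells are executed. A minimal sketch for running a single test class instead, using the standard `unittest.TestLoader` and `unittest.TextTestRunner`. The helper name `run_test_case` is hypothetical and not part of the Stack Overflow answer:

```python
import unittest


def add(a, b):
    return a + b


class TestNotebook(unittest.TestCase):

    def test_add(self):
        self.assertEqual(add(2, 2), 4)


def run_test_case(test_case_class):
    """Run only the tests defined in one TestCase class (hypothetical helper)."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case_class)
    runner = unittest.TextTestRunner(verbosity=0)
    return runner.run(suite)


# The returned TestResult can be inspected programmatically.
result = run_test_case(TestNotebook)
```

The trade-off is a little more boilerplate in exchange for a test run that does not depend on which other cells have been executed.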

Let's try to process the Titanic data set.

In [4]:
import numpy as np
import pandas as pd

In [6]:
titanic = pd.read_csv("data/titanic.csv")
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Let's say I want to write a function that removes the Name column. With test-driven development, we need to write a unit test _before_ writing the code:

In [19]:
class TestColumnRemoving(unittest.TestCase):

    def test_remove_name(self):
        test_df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [15,17]})
        self.assertEqual(2, len(test_df.columns))
        updated_df = remove_name(test_df)
        self.assertEqual(1, len(updated_df.columns))
        self.assertEqual('Age', updated_df.columns[0])

In [12]:
unittest.main(argv=[''], verbosity=0, exit=False)

ERROR: test_remove_name (__main__.TestColumnRemoving)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython-input-11-25be06be27a0>", line 6, in test_remove_name
    updated_df = remove_name(test_df)
NameError: name 'remove_name' is not defined

----------------------------------------------------------------------
Ran 2 tests in 0.002s

FAILED (errors=1)


<unittest.main.TestProgram at 0x1a29dfc0630>

The test run reports an error because we have not implemented the function yet!

In [13]:
def remove_name(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop(columns=['Name'], inplace=False)

In [20]:
# Now run the tests again
unittest.main(argv=[''], verbosity=0, exit=False)

----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK


<unittest.main.TestProgram at 0x1a29dfdc438>
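The test above checks only the column count and the remaining column name. A stricter variant, sketched here as one possible extension, compares the whole result against an expected frame with `pandas.testing.assert_frame_equal` and also checks that the input frame is left untouched. The class name `TestColumnRemovingStrict` and its test names are my own and not from the original notebook:

```python
import unittest

import pandas as pd
from pandas.testing import assert_frame_equal


def remove_name(df: pd.DataFrame) -> pd.DataFrame:
    # Same function as above; drop() returns a new frame by default.
    return df.drop(columns=['Name'])


class TestColumnRemovingStrict(unittest.TestCase):

    def test_remove_name_contents(self):
        test_df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [15, 17]})
        expected_df = pd.DataFrame({'Age': [15, 17]})
        # Compares column names, dtypes, index and values in one call.
        assert_frame_equal(remove_name(test_df), expected_df)

    def test_input_is_not_modified(self):
        test_df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [15, 17]})
        remove_name(test_df)
        # The original frame should still have both columns.
        self.assertEqual(['Name', 'Age'], list(test_df.columns))


# Run only this class so the result does not depend on other cells.
suite = unittest.TestLoader().loadTestsFromTestCase(TestColumnRemovingStrict)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Comparing whole frames catches mistakes that a column count would miss, such as reordered rows or changed values.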

This shows that the red-green-refactor cycle can be carried out right in the notebook.
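The walkthrough above covers the red and green steps. As an illustration of the refactor step, here is a sketch that assumes we want to generalize the function to drop arbitrary columns. The helper `remove_columns` is hypothetical. The existing test is kept unchanged and should stay green after the change:

```python
import unittest
from typing import List

import pandas as pd


def remove_columns(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Return a new frame without the given columns (hypothetical helper)."""
    return df.drop(columns=columns)


def remove_name(df: pd.DataFrame) -> pd.DataFrame:
    # Behavior is unchanged; the work is delegated to the general helper.
    return remove_columns(df, ['Name'])


class TestColumnRemoving(unittest.TestCase):

    def test_remove_name(self):
        test_df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [15, 17]})
        self.assertEqual(2, len(test_df.columns))
        updated_df = remove_name(test_df)
        self.assertEqual(1, len(updated_df.columns))
        self.assertEqual('Age', updated_df.columns[0])


# Re-run the original test against the refactored code.
suite = unittest.TestLoader().loadTestsFromTestCase(TestColumnRemoving)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```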