<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_testing_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to unit tests

Suppose we write a function intended to add two numbers together and return the result.

In [None]:
import unittest

In [None]:
def add_positive_numbers(a,b):
  return a-b # broken intentionally

In [None]:
add_positive_numbers(6,7)

How can we automatically check that this works correctly?




Here we write a unit test to see if the function correctly returns the sum of the two numbers.

Any method that meets both of the following criteria will be run:

*  it is defined within a class that inherits from `unittest.TestCase`
*  its name starts with `test` (by convention, `test_`)
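
As a rough illustration of these two criteria, the sketch below asks unittest's default loader which method names it would treat as tests. The class `DiscoveryDemo` and its method names are made up for this example.

```python
import unittest

def collected_test_names():
    # Defined inside a function so that later unittest.main() calls in this
    # notebook do not pick up the demo class as an extra test.
    class DiscoveryDemo(unittest.TestCase):
        def test_is_collected(self):
            self.assertTrue(True)

        def helper_is_ignored(self):
            raise RuntimeError("never called by the test runner")

    # Ask the default loader which method names it treats as tests.
    return list(unittest.TestLoader().getTestCaseNames(DiscoveryDemo))

names = collected_test_names()
print(names)
```

Only the method whose name starts with `test` is listed; the helper is left alone.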


*This test should fail.*

Define the test

We test to see if the function returns a result of 5.

This is normally what you get when you add 2 and 3 together....

## Test should fail.

In [None]:
class TestAddNumbers(unittest.TestCase):
    def test_add_positive_numbers(self):
        result = add_positive_numbers(2, 3)
        self.assertEqual(result, 5)  # Assert that the result is equal to 5

### Run the test.

In [None]:
unittest.main(argv=['first-arg-is-ignored'], exit=False)

-1 is truly not equal to 5. This test caught the fact that the function incorrectly added two numbers together.
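
If you want to inspect a failure in code rather than read it from the printed report, one option is to run a single test case through a loader and runner and look at the returned result object. This is only a sketch: `broken_add` and `BrokenAddTest` are hypothetical stand-ins for the function and test above.

```python
import io
import unittest

def broken_add(a, b):
    # Hypothetical stand-in for the intentionally broken function
    return a - b

def run_broken_add_test():
    # Defined locally so notebook-wide unittest.main() calls do not collect it.
    class BrokenAddTest(unittest.TestCase):
        def test_add(self):
            self.assertEqual(broken_add(2, 3), 5)

    suite = unittest.TestLoader().loadTestsFromTestCase(BrokenAddTest)
    # Send the usual report to an in-memory buffer to keep the output quiet.
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite)

result = run_broken_add_test()
print(result.wasSuccessful())   # False
print(len(result.failures))     # 1
```

Each entry in `result.failures` pairs the failing test with its traceback text, which includes the `-1 != 5` message.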

Now we attempt to fix the function. We re-define the function and run the test again.

This test should pass.

## Fixed function.

In [None]:
def add_positive_numbers(a, b):
    return a + b # fixed

### Rerun Test.

In [None]:
unittest.main(argv=['first-arg-is-ignored'], exit=False)

## Raising errors.

Good error handling means raising errors in certain cases.

What if we want to test for an error that SHOULD be raised?

Here we add a check that makes sure both a and b are positive (zero and negative values are rejected).

In [None]:
def add_positive_numbers(a, b):
    if (a <= 0) or (b <= 0):
        raise ValueError("Both numbers must be positive")  # zero is rejected too
    return a + b

In [None]:
class TestNegative(unittest.TestCase):
    def test_negative(self):
        with self.assertRaises(ValueError):  # Asserts that a ValueError is raised
            add_positive_numbers(-5, 10)

In [None]:
if __name__ == '__main__':
  unittest.main(argv=['first-arg-is-ignored'], exit=False)
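
`assertRaises` can also hand back the exception so the message can be checked, and `assertRaisesRegex` matches the message against a pattern. The sketch below uses a hypothetical copy of the function named `add_positive` and a bare `unittest.TestCase()` instance, which gives access to the assert helpers without registering a new test.

```python
import unittest

def add_positive(a, b):
    # Hypothetical copy of the function above, for a self-contained example
    if a <= 0 or b <= 0:
        raise ValueError("Both numbers must be positive")
    return a + b

check = unittest.TestCase()  # bare instance: only used for its assert* helpers

# Capture the exception to inspect its message.
with check.assertRaises(ValueError) as ctx:
    add_positive(0, 10)
message = str(ctx.exception)

# Or match the message against a regular expression in one step.
with check.assertRaisesRegex(ValueError, "positive"):
    add_positive(-5, 10)

print(message)
```

Checking the message as well as the type helps when a function can raise the same exception type for different reasons.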

## Adding names to tests.

What if we would like to know the names of the tests?
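
One built-in option is the runner's `verbosity` setting: at `verbosity=2` the text runner writes one line per test, including its name. A minimal sketch, with a made-up `VerboseDemo` class and the report captured in a string so it can be printed or searched:

```python
import io
import unittest

def run_verbose():
    # Local class so notebook-wide unittest.main() calls do not collect it.
    class VerboseDemo(unittest.TestCase):
        def test_passes(self):
            self.assertEqual(1 + 1, 2)

    suite = unittest.TestLoader().loadTestsFromTestCase(VerboseDemo)
    stream = io.StringIO()
    unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return stream.getvalue()

output = run_verbose()
print(output)
```

The same setting is available as `unittest.main(verbosity=2, ...)`. Custom result classes give more control over what is printed.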

We can override the addSuccess and addFailure methods.

Overriding allows us to provide custom functionality that we choose to implement. In this case we make sure to call the base method that would normally run, and we print some text along with the test id.

This code


```
print(f"Test Passed: {test.id()}")
```
and
```
print(f"Test Failed: {test.id()}")
```

is what gives us the test name/id

### Overriding addSuccess and addFailure

In [None]:
import unittest

class CustomTestResult(unittest.TestResult):
    def addSuccess(self, test):
        super().addSuccess(test)
        print(f"Test Passed: {test.id()}")

    def addFailure(self, test, err):
        super().addFailure(test, err)
        print(f"Test Failed: {test.id()}")

We must then pass our custom result class to the test runner so that it is used.

### Updating the test runner constructor.

In [None]:
unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
              argv=['first-arg-is-ignored'],
              exit=False)
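
Note that a test which raises an unexpected exception (anything other than a failed assertion) is reported through `addError`, not `addFailure`. The sketch below is one possible extension, not the notebook's class: `RecordingResult` and the `Demo` tests are hypothetical, and outcomes are stored in a list instead of printed.

```python
import io
import unittest

class RecordingResult(unittest.TestResult):
    """Hypothetical result class that records every outcome with its test id."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.outcomes = []  # (outcome, test id) pairs in run order

    def addSuccess(self, test):
        super().addSuccess(test)
        self.outcomes.append(("passed", test.id()))

    def addFailure(self, test, err):
        super().addFailure(test, err)
        self.outcomes.append(("failed", test.id()))

    def addError(self, test, err):
        super().addError(test, err)
        self.outcomes.append(("error", test.id()))

def run_demo():
    # Local class so notebook-wide unittest.main() calls do not collect it.
    class Demo(unittest.TestCase):
        def test_a_passes(self):
            self.assertTrue(True)

        def test_b_fails(self):
            self.assertEqual(1, 2)

        def test_c_errors(self):
            raise RuntimeError("unexpected exception")

    suite = unittest.TestLoader().loadTestsFromTestCase(Demo)
    runner = unittest.TextTestRunner(stream=io.StringIO(),
                                     resultclass=RecordingResult)
    return runner.run(suite)

result = run_demo()
kinds = [kind for kind, _ in result.outcomes]
print(kinds)
```

Test methods run in alphabetical order by name, so the three outcomes appear as passed, failed, error.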

## Evaluating expressions

There are many things we can check for.

What if we have an expression that should correctly check the length of a list and return either True or False?

In [None]:
array_length_five = [1,2,3,4,5]

len(array_length_five) < 6

In [None]:
len(array_length_five) < 4

In [None]:
class LengthTest(unittest.TestCase):
    def test_check_expression_true(self):
        self.assertTrue(len([1,2,3,4,5]) < 10) # is the length less than 10?
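
When `assertTrue` fails it can only report that `False` is not true. For comparisons, the more specific helpers such as `assertLess` include the compared values in the message. A small sketch using a bare `unittest.TestCase()` instance (so no extra test is registered):

```python
import unittest

check = unittest.TestCase()  # bare instance: only used for its assert* helpers
values = [1, 2, 3, 4, 5]

# These pass silently.
check.assertLess(len(values), 10)
check.assertEqual(len(values), 5)

# This one fails, and the message shows both sides of the comparison.
try:
    check.assertLess(len(values), 4)
except AssertionError as exc:
    message = str(exc)

print(message)  # 5 not less than 4
```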


In [None]:
if __name__ == '__main__':
  unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
                argv=['first-arg-is-ignored'],
                exit=False)

Notice how each time a new test method is defined, it is also run by unittest. Three test methods have now been run.

In [None]:
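import pandas as pd  # imported at module level so the test below can use pd.DataFrame

def return_dataframe():
  # Build and return a small example dataframe
  df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
  return df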

In [None]:
class TestDataframes(unittest.TestCase):
    def test_dataframes(self):
        df = return_dataframe()
        self.assertIsInstance(df, pd.DataFrame)  # Assert that the variable is a dataframe

In [None]:
if __name__ == '__main__':
  unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
                argv=['first-arg-is-ignored'],
                exit=False)
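
Checking the type is a start, but it says nothing about the contents. pandas ships `pandas.testing.assert_frame_equal` for comparing a dataframe against an expected one. The sketch below uses a hypothetical `build_frame` function in place of the one above.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def build_frame():
    # Hypothetical function under test
    return pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

expected = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Passes silently when columns, dtypes, index and values all match.
assert_frame_equal(build_frame(), expected)

# Raises AssertionError when a value differs.
different = expected.assign(col2=[3, 5])
try:
    assert_frame_equal(build_frame(), different)
    frames_match = True
except AssertionError:
    frames_match = False

print(frames_match)  # False
```

Inside a `unittest.TestCase` method, `assert_frame_equal` can be called directly; its `AssertionError` is reported as a normal test failure.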

# Testing Data Pipelines (multiple stages)

Testing in the context of Data Engineering.

How do we test a sequence of data transformations?

Imagine we have order dates and delivery dates.

## Transformations as individual calls in sequence.

In [None]:
import pandas as pd
from pandas import NaT  # public import path for NaT ("not a time", a missing date)
import numpy as np
import random
from datetime import datetime, timedelta

order_dates = [
    datetime(2023, 1, 1),
    datetime(2023, 2, 15),
    datetime(2023, 3, 8),
    datetime(2023, 4, 22),
    datetime(2023, 5, 10),
    NaT,
    datetime(2023, 7, 14),
    datetime(2023, 8, 19),
    datetime(2023, 9, 3),
    datetime(2023, 10, 31)
]

received_dates = [
    datetime(2023, 1, 13),
    datetime(2023, 1, 25),
    NaT,
    datetime(2023, 5, 12),
    datetime(2023, 6, 22),
    datetime(2023, 8, 21),
    datetime(2023, 8, 19),
    datetime(2023, 9, 13),
    datetime(2023, 10, 11),
    datetime(2023, 12, 11)
]

dates_df_orig = pd.DataFrame({
    'order_date' : order_dates,
    'received_date' : received_dates
})

dates_df = dates_df_orig.copy()

dates_df.head(3)

Our first step in computing the number of days from order to delivery might be to drop any null values.

In [None]:
dates_df = dates_df.dropna().copy()
dates_df
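
To see why the nulls are dropped first, here is a small sketch on made-up dates: a missing date propagates through the subtraction, so the elapsed-days value for that row is missing as well.

```python
import pandas as pd

dates = pd.DataFrame({
    'order_date': pd.to_datetime(['2023-01-01', None]),
    'received_date': pd.to_datetime(['2023-01-13', '2023-02-01']),
})

# The second row has no order date, so its difference is missing too.
elapsed = (dates['received_date'] - dates['order_date']).dt.days
print(elapsed)

# Dropping incomplete rows first leaves only rows with a real difference.
complete = dates.dropna()
print(len(complete))  # 1
```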

Our second step could then be to compute elapsed days.

In [None]:
dates_df.loc[:, 'elapsed_days'] = (dates_df['received_date'] - dates_df['order_date']).dt.days
dates_df

Our final step might then be to drop impossible values (such as negative elapsed days).

In [None]:
dates_df = dates_df[dates_df['elapsed_days'] >= 0]  # keep same-day deliveries (0 days); drop only negative values
dates_df

Now let's imagine this is a production pipeline in which each step is a function. We'd like to run multiple commands in sequence and, at each step, check the results to be sure our transformations are correct.

Let's create some helper functions.

## Using functions to create a pipeline.

In [None]:
# Step 1
def step_1_dropnull(df):
  df = df.dropna().copy()
  return df

# Step 2
def step_2_compute_days(df):
  df.loc[:, 'elapsed_days'] = (df['received_date'] - df['order_date']).dt.days
  return df

# Step 3
def step_3_dropnegative(df):
  # Keep same-day deliveries (0 days); drop only negative values
  df = df[df['elapsed_days'] >= 0]
  return df

In [None]:
df = step_1_dropnull(dates_df_orig)
df = step_2_compute_days(df)
df = step_3_dropnegative(df)
df
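
One way to check the results at each step is to run the steps on a tiny, hand-made dataframe where the expected output is easy to work out. The sketch below uses hypothetical copies of the three steps (`drop_nulls`, `compute_days`, `drop_negative`) and assumes that same-day deliveries (0 days) are valid and kept.

```python
import pandas as pd

def drop_nulls(df):
    return df.dropna().copy()

def compute_days(df):
    df = df.copy()  # avoid modifying the caller's dataframe
    df['elapsed_days'] = (df['received_date'] - df['order_date']).dt.days
    return df

def drop_negative(df):
    # Assumption: 0 days is a valid same-day delivery
    return df[df['elapsed_days'] >= 0]

sample = pd.DataFrame({
    'order_date': pd.to_datetime(
        ['2023-01-01', '2023-02-15', None, '2023-08-19']),
    'received_date': pd.to_datetime(
        ['2023-01-13', '2023-01-25', '2023-03-01', '2023-08-19']),
})

after_1 = drop_nulls(sample)
assert len(after_1) == 3                      # the row with a missing date is gone

after_2 = compute_days(after_1)
assert after_2['elapsed_days'].tolist() == [12, -21, 0]

after_3 = drop_negative(after_2)
assert after_3['elapsed_days'].tolist() == [12, 0]   # the -21 row is removed
```

Each assertion pins down what one step is supposed to do, so a later change that breaks a step shows up at that step and not at the end of the pipeline.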

There is a problem here though. Have we done ANY error handling? What happens if we pass bad data into our sequence?

### Poor error handling (really NO error handling)

In [None]:
df = step_1_dropnull(" Hi I'm NOT a dataframe. I'm a string! ")
df = step_2_compute_days(df)
df = step_3_dropnegative(df)
df

Good code and pipelines do more than process data when everything is perfect. They also handle the situations in which the pipeline fails, and they do so gracefully.

### Beginning to handle errors.

In [None]:
def step_1_dropnull(df):
  try:
    if isinstance(df, pd.DataFrame):
        # Process the DataFrame
        df = df.dropna().copy()
    else:
        raise TypeError("Input is not a DataFrame.")
  except TypeError as e:
    print(f"step_1_dropnull: {e}")
    raise

  return df

def step_2_compute_days(df):
  df.loc[:, 'elapsed_days'] = (df['received_date'] - df['order_date']).dt.days
  return df

def step_3_dropnegative(df):
  # Keep same-day deliveries (0 days); drop only negative values
  df = df[df['elapsed_days'] >= 0]
  return df

In [None]:
df = step_1_dropnull("hi i'm definitely not a dataframe")
df = step_2_compute_days(df)
df = step_3_dropnegative(df)
df

This is better: we've printed a custom error message that helps us diagnose the issue. But the code still crashes. We'd like to handle errors in such a way that the code doesn't crash, but rather fails gracefully.
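
The type check itself can be covered by a unit test. A minimal sketch, assuming a simplified, hypothetical version of the first step named `drop_nulls_checked`:

```python
import unittest
import pandas as pd

def drop_nulls_checked(df):
    # Hypothetical, simplified version of the first pipeline step
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input is not a DataFrame.")
    return df.dropna().copy()

check = unittest.TestCase()  # bare instance: only used for its assert* helpers

# Bad input should raise the custom TypeError.
with check.assertRaisesRegex(TypeError, "not a DataFrame"):
    drop_nulls_checked("hi i'm definitely not a dataframe")

# Good input should still be processed as before.
cleaned = drop_nulls_checked(pd.DataFrame({'a': [1.0, None]}))
print(len(cleaned))  # 1
```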

### Graceful error handling

Here we stop raising an error and instead return a number that represents failure. In this case we use -5, which means the pipeline failed (for example, because the passed data was not a dataframe). Now the error is handled gracefully, and we can send a useful message to a data engineer to troubleshoot.

In [None]:
def the_pipeline(df):
  try:
    # Run each step in order; any exception jumps to the handler below
    df = step_1_dropnull(df)
    df = step_2_compute_days(df)
    df = step_3_dropnegative(df)
    return df

  except Exception as e:
    print(f"Pipeline failed with error: {e}")
    return -5 # negative 5 means the pipeline failed.

In [None]:
the_pipeline("hi i'm definitely not a dataframe")
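
Whoever calls the pipeline then has to tell a dataframe apart from the failure code. The sketch below shows one way a caller might do that; `run_pipeline`, `handle`, `PIPELINE_FAILED` and the logger name are hypothetical, and the pipeline body is cut down to a single step.

```python
import logging
import pandas as pd

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name
PIPELINE_FAILED = -5  # same failure code as above

def run_pipeline(df):
    try:
        if not isinstance(df, pd.DataFrame):
            raise TypeError("Input is not a DataFrame.")
        return df.dropna().copy()
    except Exception as exc:
        # Log the problem instead of crashing, then signal failure.
        logger.error("Pipeline failed with error: %s", exc)
        return PIPELINE_FAILED

def handle(result):
    # Check the type first: comparing a dataframe to -5 with == would
    # produce an element-wise dataframe of booleans, not a single True/False.
    if isinstance(result, pd.DataFrame):
        return f"ok: {len(result)} rows"
    return f"failed with code {result}"

bad = handle(run_pipeline("hi i'm definitely not a dataframe"))
good = handle(run_pipeline(pd.DataFrame({'a': [1.0, None, 3.0]})))
print(bad)
print(good)
```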

# Test driven development

Write the Test first. Then write the Code.

This might seem like a meaningless change, but it causes you to think first about all the ways your code can and should fail.

In [None]:
class MultiplyNumbers(unittest.TestCase):
    def test_multiply_two_numbers(self):
        result = multiply_two_numbers(2, 4)
        self.assertEqual(result, 8)  # Assert that the result is equal to 8

In [None]:
def multiply_two_numbers(a,b):
  return

In [None]:
if __name__ == '__main__':
  unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
                argv=['first-arg-is-ignored'],
                exit=False)

The test failed as expected, so we know it actually checks something.

Now write the code so the test will pass.

In [None]:
def multiply_two_numbers(a,b):
  result = a*b
  return result

In [None]:
if __name__ == '__main__':
  unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
                argv=['first-arg-is-ignored'],
                exit=False)
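
Thinking about the ways the code can fail usually produces more than one input to try. A compact way to write those cases up front is a table of inputs and expected outputs run through `subTest`, which reports each failing case separately. This is a sketch with a hypothetical `multiply` function and made-up cases.

```python
import io
import unittest

def multiply(a, b):
    # Hypothetical stand-in for the function under test
    return a * b

def run_cases():
    # Local class so notebook-wide unittest.main() calls do not collect it.
    class MultiplyCases(unittest.TestCase):
        def test_table(self):
            cases = [(2, 4, 8), (0, 5, 0), (-3, 3, -9), (1.5, 2, 3.0)]
            for a, b, expected in cases:
                # Each case is reported on its own if it fails.
                with self.subTest(a=a, b=b):
                    self.assertEqual(multiply(a, b), expected)

    suite = unittest.TestLoader().loadTestsFromTestCase(MultiplyCases)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)

result = run_cases()
print(result.wasSuccessful())  # True
```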

# Mocking

Mocking replaces a real dependency, such as a network call, with a stand-in object so that a test can run without that dependency.



In [None]:
import requests

def get_data_from_api():
    response = requests.get('https://api.example.com/data')
    return response.json()

In [None]:
import unittest
from unittest.mock import patch
from unittest import mock
import requests

class TestGetDataFromAPI(unittest.TestCase):
    @patch('requests.get')
    def test_get_data_from_api(self, mock_get):
        mock_response = {'data': 'mocked data'}
        mock_get.return_value.json.return_value = mock_response

        result = get_data_from_api()

        mock_get.assert_called_once_with('https://api.example.com/data')
        self.assertEqual(result, mock_response)

    @patch('requests.get')
    def test_successful_api_response(self, mock_get):
        # Mock the API response
        mock_response = mock.Mock()
        mock_response.status_code = 200
        mock_response.json.return_value = {'data': 'example'}
        mock_get.return_value = mock_response

        # Call the function that interacts with the API
        result = get_data_from_api()

        # Assertions
        mock_get.assert_called_once_with('https://api.example.com/data')
        self.assertEqual(result, {'data': 'example'})

In [None]:
if __name__ == '__main__':
  unittest.main(testRunner=unittest.TextTestRunner(resultclass=CustomTestResult),
                argv=['first-arg-is-ignored'],
                exit=False)
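
Mocks can also simulate failures that are hard to trigger on demand. Setting `side_effect` to an exception makes the mocked call raise it. The sketch below is a variation, not the function above: `get_data` is a hypothetical version that takes the HTTP client as a parameter and returns `None` when the connection fails, and Python's built-in `ConnectionError` stands in for the library's own exception type.

```python
from unittest.mock import Mock

def get_data(client, url):
    """Hypothetical variant that receives the HTTP client as a parameter."""
    try:
        return client.get(url).json()
    except ConnectionError:
        # Assumption: callers treat None as "no data available".
        return None

URL = 'https://api.example.com/data'

# Failure path: the mocked get() raises instead of returning a response.
failing_client = Mock()
failing_client.get.side_effect = ConnectionError("network down")
failure_result = get_data(failing_client, URL)
failing_client.get.assert_called_once_with(URL)

# Success path: the mocked response's json() returns canned data.
working_client = Mock()
working_client.get.return_value.json.return_value = {'data': 'mocked data'}
success_result = get_data(working_client, URL)

print(failure_result)
print(success_result)
```

Passing the client in as an argument is a design choice for this sketch; with `@patch('requests.get')` the same `side_effect` attribute is available on the injected mock.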