# Unit Testing


Unit testing means testing individual functions to check that they are functioning as expected. Quirks in the inputs that we weren't expecting, or changes made to the function later, can make it behave in unexpected ways.

The function below is a simple pandas function that calculates the average value of a given column in a DataFrame.

In [1]:
!pip install pandas
import pandas as pd

In [2]:
def calculate_average(df, column_name):
    """
    Calculate the average of a specified column in a pandas DataFrame.

    :param df: Input DataFrame
    :param column_name: Name of the column to calculate the average for
    :return: The average value of the column
    """
    # Calculate the average value
    avg_value = df[column_name].mean()
    
    return avg_value

Looks simple enough, right? Let's give it some data to try out.

This DataFrame holds the names and purchase totals of our shop's customers. We'll calculate the average purchase amount across all customers.

In [4]:
# Example DataFrame
data = {
    "Name": ["Alice", "Bob", "Cathy"],
    "Total": [34.50, 45.50, 29.50]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)
df

|   | Name | Total |
|---|-------|-------|
| 0 | Alice | 34.5 |
| 1 | Bob | 45.5 |
| 2 | Cathy | 29.5 |


In [5]:
# Calculate the average total
average_total = calculate_average(df, "Total")
print(f"Average Total: {average_total}")

Average Total: 36.5


Looks good to me! Now we can implement this in our pipeline and completely forget about it. It works on today's data, so I'm sure nothing will ever go wrong.

### Empty Dataframe

What if one day no one bought anything? This has never happened before so we didn't really think of it when building the function. We still have a dataframe but it's completely empty.

In [6]:
# Create an empty DataFrame
empty_df = pd.DataFrame()

In [7]:
# Calculate the average purchase total
average_total = calculate_average(empty_df, "Total")
print(f"Average Total: {average_total}")

KeyError: 'Total'

Oh dear, we got an error. If the function were part of a larger pipeline, this error would grind everything to a halt.
We need to go back, fix the function, and consider what should happen in this situation.
You can choose what happens: you might want it to return 0, or None, or some other value, depending on what the average is used for.

In [8]:
def calculate_average(df, column_name):
    """
    Calculate the average of a specified column in a pandas DataFrame.

    :param df: Input DataFrame
    :param column_name: Name of the column to calculate the average for
    :return: The average value of the column
    """
    # Check if the DataFrame is empty
    if df.empty:
        print("Error: The DataFrame is empty.")
        return None
    
    # Calculate the average value
    avg_value = df[column_name].mean()
    
    return avg_value

In [9]:
# Create an empty DataFrame
empty_df = pd.DataFrame()

# Calculate the average purchase total
average_total = calculate_average(empty_df, "Total")
print(f"Average Total: {average_total}")

Error: The DataFrame is empty.
Average Total: None


This is definitely better than an error. Hopefully nothing else will go wrong!

### Non-existent column

You want to do some market research on your customer base and find out the average age of your customers. You're pretty sure you have that information, so you can just use your handy calculate_average() function to do it.

In [10]:
# Calculate the average age
average_age = calculate_average(df, "Age")
print(f"Average Age: {average_age}")

KeyError: 'Age'

Another error, which will halt our whole pipeline. It says there's no "Age" column, but you're sure you collect that information, so it must be dropped elsewhere. You'll need to update your function to handle this error.

In [11]:
def calculate_average(df, column_name):
    """
    Calculate the average of a specified column in a pandas DataFrame.

    :param df: Input DataFrame
    :param column_name: Name of the column to calculate the average for
    :return: The average value of the column
    """
    # Check if the DataFrame is empty
    if df.empty:
        print("Error: The DataFrame is empty.")
        return None
    
    # Check if column exists
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in the DataFrame.")
        return None
    
    # Calculate the average value
    avg_value = df[column_name].mean()
    
    return avg_value

In [12]:
# Calculate the average age
average_age = calculate_average(df, "Age")
print(f"Average Age: {average_age}")

Error: Column 'Age' not found in the DataFrame.
Average Age: None


Perfect - now that's handled and we no longer get a long error message.

### Non-numeric data

It's a new day with some new data. Let's take a look.

In [14]:
# Example DataFrame
data = {
    "Name": ["Daniel", "Erica", "Frankie"],
    "Total": ["£12.00", "£4.50", "£175.34"]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)
df

|   | Name | Total |
|---|---------|---------|
| 0 | Daniel | £12.00 |
| 1 | Erica | £4.50 |
| 2 | Frankie | £175.34 |


In [15]:
# Calculate the average total
average_total = calculate_average(df, "Total")
print(f"Average Total: {average_total}")

TypeError: Could not convert string '£12.00£4.50£175.34' to numeric

Hmm...

That's new. A new staff member has entered the totals with the currency symbol instead of as plain numbers, and now the function isn't working as intended. You'll need to make sure this case is handled.

In [16]:
def calculate_average(df, column_name):
    """
    Calculate the average of a specified column in a pandas DataFrame.

    :param df: Input DataFrame
    :param column_name: Name of the column to calculate the average for
    :return: The average value of the column
    """
    # Check if the DataFrame is empty
    if df.empty:
        print("Error: The DataFrame is empty.")
        return None
    
    # Check if column exists
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in the DataFrame.")
        return None
    
    # Remove non-numeric characters (like currency symbols) and convert to float.
    # The result is stored in a new variable so the caller's DataFrame is not modified.
    try:
        numeric_values = df[column_name].replace(r"[^0-9.]", "", regex=True).astype(float)
    except ValueError:
        print(f"Error: Unable to convert values in column '{column_name}' to numeric.")
        return None

    # Calculate the average value
    avg_value = numeric_values.mean()
    
    return avg_value


In [17]:
# Calculate the average total
average_total = calculate_average(df, "Total")
print(f"Average Total: {average_total}")

Average Total: 63.946666666666665


And done! Hopefully nothing else goes wrong now, and hopefully the edits we made won't affect the normal functionality...
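
To see what that cleaning step does to each value, here is a minimal sketch that applies the same pattern to plain strings with Python's built-in `re` module instead of pandas. The helper name `strip_non_numeric` and the sample values are made up for illustration; they are not part of the function above.

```python
import re


def strip_non_numeric(text):
    """Drop every character that is not a digit or a full stop."""
    return re.sub(r"[^0-9.]", "", text)


# Hypothetical inputs: a normal total, a thousands separator,
# a negative amount, and a placeholder for missing data
samples = ["£175.34", "£1,200.50", "-£5.00", "N/A"]

for raw in samples:
    cleaned = strip_non_numeric(raw)
    try:
        print(f"{raw!r} -> {float(cleaned)}")
    except ValueError:
        # An empty or malformed string cannot be converted to a float
        print(f"{raw!r} -> could not convert {cleaned!r}")
```

Two things are worth noticing when you run it: the minus sign in "-£5.00" is stripped along with the currency symbol, so a refund would be counted as a positive amount, and "N/A" collapses to an empty string that cannot be converted. Whether either matters depends on your data, but both are good candidates for extra test cases.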

All of these issues could have been avoided if we used **Unit Tests** when developing the function in the first place. Unit tests allow us to check the behaviour of our functions with both expected and unexpected inputs outside the production environment. If they fail, they won't crash your whole pipeline and they only take a few seconds to a few minutes to run.

In [18]:
%%writefile average_column_function_pandas.py
# This is writing this cell to a flat python file - this is the final function we are testing
import pandas as pd

def calculate_average(df, column_name):
    """
    Calculate the average of a specified column in a pandas DataFrame.

    :param df: Input DataFrame
    :param column_name: Name of the column to calculate the average for
    :return: The average value of the column
    """
    # Check if the DataFrame is empty
    if df.empty:
        print("Error: The DataFrame is empty.")
        return None
    
    # Check if column exists
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in the DataFrame.")
        return None
    
    # Remove non-numeric characters (like currency symbols) and convert to float.
    # The result is stored in a new variable so the caller's DataFrame is not modified.
    try:
        numeric_values = df[column_name].replace(r"[^0-9.]", "", regex=True).astype(float)
    except ValueError:
        print(f"Error: Unable to convert values in column '{column_name}' to numeric.")
        return None

    # Calculate the average value
    avg_value = numeric_values.mean()
    
    return avg_value

Writing average_column_function_pandas.py


In [19]:
%%writefile test_average_column_pandas.py 
# ^ It is very important to start your file name and function names with "test" - this is how pytest finds them!
import pandas as pd
from average_column_function_pandas import calculate_average # You will need to import your function to test it!


def test_calculate_average():
    """
    Test the calculate average function for expected behavior.
    """
    # Arrange
    data = {"value": ["100", "200", "300"]}
    input_df = pd.DataFrame(data)

    expected = 200.0

    # Act
    actual = calculate_average(input_df, "value")

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_calculate_average_empty_df():
    """
    Test the calculate average function for an empty DataFrame.
    """
    # Arrange
    data = {}  # Empty dictionary
    input_df = pd.DataFrame(data)

    expected = None

    # Act
    actual = calculate_average(input_df, "value")

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_calculate_average_no_column():
    """
    Test the calculate average function for a column that doesn't exist.
    """
    # Arrange
    data = {"value": ["100", "200", "300"]}
    input_df = pd.DataFrame(data)

    expected = None

    # Act
    actual = calculate_average(input_df, "age")

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_calculate_average_currency_inputs():
    """
    Test the calculate average function for a column containing currency values.
    """
    # Arrange
    data = {"value": ["£100", "£200", "£300"]}
    input_df = pd.DataFrame(data)

    expected = 200.0

    # Act
    actual = calculate_average(input_df, "value")

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


Writing test_average_column_pandas.py


In [22]:
!pytest

platform linux -- Python 3.12.1, pytest-8.3.3, pluggy-1.5.0
rootdir: /workspaces/rap_intro_to_python/workshop_6
plugins: anyio-4.6.0
collected 4 items

test_average_column_pandas.py ....                                       [100%]



If everything went as expected, we should have four passing tests!

Try forcing one to fail by editing the expected output, and see what happens.
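
If you want a preview of what a failure looks like before editing the test file, the sketch below runs the same kind of assertion in plain Python. `average_of_list` and `check_average` are hypothetical stand-ins that work on a list; they are not the pandas function or the tests above.

```python
def average_of_list(values):
    """Hypothetical stand-in for calculate_average that works on a plain list."""
    return sum(values) / len(values)


def check_average(expected):
    # Arrange
    values = [100, 200, 300]

    # Act
    actual = average_of_list(values)

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


# Correct expectation: the assertion passes silently
check_average(200.0)

# Deliberately wrong expectation: the assertion raises AssertionError
try:
    check_average(250.0)
except AssertionError as error:
    failure_message = str(error)
    print(failure_message)
```

Under pytest, a failing test shows up as an `F` in place of a `.` in the progress line, and the assertion message is included in the failure report.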

# Your Turn

Now it's your turn. Try writing a simple Python function, and then write some test cases for it, considering both expected and unexpected inputs. The cells below have the "cell magic" to create the files for you.
Use the Arrange, Act, Assert pattern to structure your tests.

Suggestions:

- A function to multiply two numbers
    - What happens if you input a string?
    - Does it work for big numbers? Small numbers? Negative numbers?
- A function to test if a number is even
    - Does it work as intended and return True for even and False for odd?
    - What about negative numbers? 0?
- A function to return the longest word in a list of words
    - What if the list is empty?
    - What if two words have the same length? What would you *want* to happen?
    - What if the list was full of numbers?
- A function to count the number of vowels in a string
    - What if the string is empty?
    - What if there are no vowels? Or only vowels?
    - What if there's a space in the word?
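
As one possible starting point, here is a sketch of the third suggestion, laid out with Arrange, Act, Assert. The design choices are assumptions, not the only right answer: an empty list returns None, and when two words tie, the first one wins. The function and test names are made up for illustration.

```python
def longest_word(words):
    """Return the longest word, or None if the list is empty.

    Assumption: when several words tie for longest, the first one wins.
    """
    if not words:
        return None
    # max() returns the first item with the largest key,
    # which gives the "first one wins" behaviour for ties
    return max(words, key=len)


def test_longest_word():
    # Arrange
    words = ["cat", "horse", "ox"]
    expected = "horse"

    # Act
    actual = longest_word(words)

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_longest_word_empty_list():
    # Arrange
    words = []
    expected = None

    # Act
    actual = longest_word(words)

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_longest_word_tie():
    # Arrange
    words = ["zebra", "horse"]
    expected = "zebra"

    # Act
    actual = longest_word(words)

    # Assert
    assert actual == expected, f"Expected {expected} but got {actual}"
```

The list-of-numbers case is left open on purpose: decide what you would want to happen, then write the test first and see whether the function agrees.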

In [28]:
%%writefile my_new_function.py

def multiply_numbers(input_1, input_2):
    if isinstance(input_1, str):
        print(f"Error: {input_1} is a string")
        return None
    if isinstance(input_2, str):
        print(f"Error: {input_2} is a string")
        return None
    output_val = input_1 * input_2
    return output_val

Overwriting my_new_function.py


In [27]:
def multiply_numbers(input_1, input_2):
    if isinstance(input_1, str):
        print(f"Error: {input_1} is a string")
        return None
    if isinstance(input_2, str):
        print(f"Error: {input_2} is a string")
        return None
    output_val = input_1 * input_2
    return output_val
multiply_numbers(2, "hello")


Error: hello is a string


In [31]:
%%writefile test_my_new_function.py

from my_new_function import multiply_numbers

def test_multiply_numbers():
    #testing for expected behaviour with positive integers
    # Arrange
    input_1 = 2
    input_2 = 5

    expected = 10
    
    # Act
    actual = multiply_numbers(input_1, input_2)

    #Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_multiply_numbers_small():
    #testing for expected behaviour with decimal values below 1
    # Arrange
    input_1 = 0.1
    input_2 = .5

    expected = 0.05
    
    # Act
    actual = multiply_numbers(input_1, input_2)

    #Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_multiply_numbers_negative():
    #testing for expected behaviour with a negative integer
    # Arrange
    input_1 = -2
    input_2 = 5

    expected = -10
    
    # Act
    actual = multiply_numbers(input_1, input_2)

    #Assert
    assert actual == expected, f"Expected {expected} but got {actual}"


def test_multiply_numbers_string():
    #testing for expected behaviour where one value is a string
    # Arrange
    input_1 = 2
    input_2 = "hello"

    expected = None
    
    # Act
    actual = multiply_numbers(input_1, input_2)

    #Assert
    assert actual == expected, f"Expected {expected} but got {actual}"

# If you edit the filename above, or change the function names, remember to change your import statement

Overwriting test_my_new_function.py


Hint: running just pytest will run ALL the tests it finds. If you only want to run one file, put the filename after it (including the .py).

In [32]:
!pytest test_my_new_function.py

platform linux -- Python 3.12.1, pytest-8.3.3, pluggy-1.5.0
rootdir: /workspaces/rap_intro_to_python/workshop_6
plugins: anyio-4.6.0
collected 4 items

test_my_new_function.py ....                                             [100%]
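
One more case worth trying for the "small numbers" question: exact equality on decimals happens to work for 0.1 and 0.5, but it does not hold for every pair of floats. The sketch below uses a simplified copy of `multiply_numbers` (without the string checks) and the standard library's `math.isclose`. The test name is made up for illustration.

```python
import math


def multiply_numbers(input_1, input_2):
    """Simplified copy of the function above, without the string checks."""
    return input_1 * input_2


def test_multiply_numbers_float_tolerance():
    # Arrange
    input_1 = 0.1
    input_2 = 3
    expected = 0.3

    # Act
    actual = multiply_numbers(input_1, input_2)

    # Assert
    # actual == expected is False here because of binary rounding,
    # so compare within a small tolerance instead
    assert math.isclose(actual, expected), f"Expected {expected} but got {actual}"
```

pytest also provides `pytest.approx` for the same purpose, if you would rather keep the comparison in `actual == expected` form.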

