<a href="https://colab.research.google.com/github/jmasonlee/efficiently_testing_etl_pipelines/blob/main/Right_SizingTests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Efficiencies of Right Sizing Tests


This exercise gives you a practical example of how the number of inputs to a test affects the number of tests you need to write and maintain in order to fully cover your system.

## Setup Notebook

In [None]:
!rm -rf efficiently_testing_etl_pipelines
!git clone https://github.com/jmasonlee/efficiently_testing_etl_pipelines.git
!cp /content/efficiently_testing_etl_pipelines/efficiencies_of_right_sizing_tests/src/diamond_pricing.py .
!cp /content/efficiently_testing_etl_pipelines/efficiencies_of_right_sizing_tests/tests/test_helpers/notebook_verification_helpers.py .
!rm -rf efficiently_testing_etl_pipelines
!rm -rf sample_data


### Install Dependencies

For this exercise, we need a few extra dependencies that let us run many tests in a notebook.

`ipytest` lets us run our tests in a notebook.



In [None]:
!pip install ipytest

`ipytest` is what allows us to run our tests in a notebook. The next cell is not needed if you write your tests in a separate pytest file.

In [None]:
import ipytest
ipytest.autoconfig()

We install `pyspark` because it does not come with the base Colab environment.

In [None]:
!pip install pyspark

`approvaltests` is what lets us run our tests with many combinations of inputs.

In [None]:
!pip install approvaltests

## Create a local SparkSession

Normally, Spark runs on a cluster of executors in the cloud. Since we want our tests to run on a single dev machine, we create a fixture that gives us a local `SparkSession`.

In [None]:
import pytest
from _pytest.fixtures import FixtureRequest
from pyspark import SparkConf
from pyspark.sql import SparkSession

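@pytest.fixture(scope="session")
def spark(request: FixtureRequest):
    # Run Spark in local mode so the tests work on a single machine.
    conf = (SparkConf()
        .setMaster("local")
        .setAppName("sample_pyspark_testing_starter"))

    # Reuse an existing session if one is already running.
    spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

    # Stop Spark once the whole test session has finished.
    request.addfinalizer(lambda: spark.stop())
    return spark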

## Exercise

In this exercise, we test a piece of code that replaces all null values in the `price` column of a DataFrame with an average price.

The average price is calculated from the price of other diamonds with the same cut, clarity and color.
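
To make the behaviour concrete, here is a minimal plain-Python sketch of the idea described above. It is not the code in `diamond_pricing.py`, which works on Spark DataFrames and may compute its averages differently. The function name `fill_null_prices`, the tuple layout, and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def fill_null_prices(diamonds):
    """Illustrative only: rows are (cut, clarity, color, price) tuples."""
    # Collect the known prices for each (cut, clarity, color) group.
    known_prices = defaultdict(list)
    for cut, clarity, color, price in diamonds:
        if price is not None:
            known_prices[(cut, clarity, color)].append(price)

    filled = []
    for cut, clarity, color, price in diamonds:
        group = known_prices.get((cut, clarity, color))
        # Only fill a missing price when the group has at least one known price.
        if price is None and group:
            price = mean(group)
        filled.append((cut, clarity, color, price))
    return filled

# Made-up sample rows.
diamonds = [
    ("Ideal", "SI2", "E", 326.0),
    ("Ideal", "SI2", "E", 328.0),
    ("Ideal", "SI2", "E", None),     # filled from the two rows above
    ("Premium", "VS1", "G", None),   # no comparable diamonds, so it stays None
]
filled_diamonds = fill_null_prices(diamonds)
```

In this sketch a diamond with no comparable diamonds keeps its null price. Whether the real function behaves the same way is exactly the kind of question the tests in this notebook help answer.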

## Initial State (1 test)

Run the cell below containing our first test. The test will fail. What does the failure look like here?

In [None]:
%%ipytest -qq
from diamond_pricing import replace_null_prices_with_floating_averages
from notebook_verification_helpers import verify_will_replace_null_values_with_floating_averages

def test_will_replace_null_prices_with_floating_averages(spark: SparkSession) -> None:
    price = [327]

    verify_will_replace_null_values_with_floating_averages(spark, price)


The test created two files. One file name ends in `approved.txt`. The other file name ends in `received.txt`.

Look at the `received.txt` file. If it looks good, approve it by running the cell below. Then rerun the cell containing the test. It should now pass.

In [None]:
!mv /content/test_one_test.received.txt /content/test_one_test.approved.txt
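
The approve-then-rerun cycle can be sketched in a few lines of plain Python. This is a simplified illustration of the workflow, not how `approvaltests` is implemented. The `verify` helper, the `test_example` name, and the file contents are hypothetical.

```python
import tempfile
from pathlib import Path

def verify(result: str, name: str, directory: Path) -> bool:
    """Hypothetical helper: compare `result` with the approved file."""
    received = directory / f"{name}.received.txt"
    approved = directory / f"{name}.approved.txt"
    if approved.exists() and approved.read_text() == result:
        # The output matches what a human approved earlier.
        received.unlink(missing_ok=True)
        return True
    # Approved file is missing or different: save the output for review.
    received.write_text(result)
    return False

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    first_run = verify("example output\n", "test_example", directory)
    # Approving is just a rename, like the `mv` cell above.
    (directory / "test_example.received.txt").rename(
        directory / "test_example.approved.txt"
    )
    second_run = verify("example output\n", "test_example", directory)
```

The first run fails because nothing has been approved yet. The second run passes because the output now matches the approved file.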

### Add a Price of None

Right now, we only have one test for a diamond with a price of 327. The price is wrapped in a list. Add a new test case by adding `None` to that list.

Because the new input produces output that has not been approved yet, the test will fail.

In [None]:
%%ipytest -qq
from diamond_pricing import replace_null_prices_with_floating_averages
from notebook_verification_helpers import verify_will_replace_null_values_with_floating_averages

def test_will_replace_null_prices_with_floating_averages(spark: SparkSession) -> None:
    price = [327]

    verify_will_replace_null_values_with_floating_averages(spark, price)


Compare the `test_will_replace_null_prices_with_floating_averages.received.txt` file to the `test_will_replace_null_prices_with_floating_averages.approved.txt` file. How is it different?

Each line in the files represents one test case, so you now have two tests. Run the cell below to update the expected output. Then re-run the test cell. It should pass.

In [None]:
!mv /content/test_will_replace_null_prices_with_floating_averages.received.txt /content/test_will_replace_null_prices_with_floating_averages.approved.txt
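
The idea that each line is one test case can be sketched as follows. This is an assumption-laden illustration of combination testing in general, not the code inside `notebook_verification_helpers.py`. The names `describe_combinations`, `price_or_default`, and `describe` are hypothetical, as is the output format.

```python
from itertools import product

def describe_combinations(function, *input_lists):
    """Hypothetical helper: call `function` with every combination of the
    inputs and return one text line per combination."""
    lines = []
    for args in product(*input_lists):
        lines.append(f"{list(args)} => {function(*args)}")
    return lines

def price_or_default(price):
    # Stand-in for the code under test.
    return 0 if price is None else price

def describe(price, cut):
    # Stand-in with two parameters.
    return f"{cut}:{price}"

one_input = describe_combinations(price_or_default, [327])
two_inputs = describe_combinations(price_or_default, [327, None])
# Two prices times three cuts gives six test cases from one test function.
grid = describe_combinations(describe, [327, None], ["Ideal", "Premium", "Good"])
```

Adding one value to a list adds test cases without adding test functions, and the number of cases is the product of the list lengths.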