<a href="https://colab.research.google.com/github/jmasonlee/efficiently_testing_etl_pipelines/blob/main/fixing_a_big_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup Notebook

In [2]:
!rm -rf efficiently_testing_etl_pipelines
!git clone https://github.com/jmasonlee/efficiently_testing_etl_pipelines.git
!cp -r /content/efficiently_testing_etl_pipelines/src/ .
!cp -r /content/efficiently_testing_etl_pipelines/tests/ .
!rm -rf efficiently_testing_etl_pipelines
!rm -rf tests/diamond_pricing_test*
!rm -rf tests/test_helpers/*verification_helpers.py
!rm -rf tests/conftest.py
!rm -rf sample_data


Cloning into 'efficiently_testing_etl_pipelines'...
remote: Enumerating objects: 550, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (115/115), done.
remote: Total 550 (delta 121), reused 111 (delta 70), pack-reused 359
Receiving objects: 100% (550/550), 246.07 KiB | 8.48 MiB/s, done.
Resolving deltas: 100% (327/327), done.


# Setup Tests

### Install Dependencies

For the exercise, we will need some special dependencies to allow us to run lots of tests in a notebook.

`ipytest` lets us run our tests in a notebook.



In [3]:
!pip install ipytest

Collecting ipytest
  Downloading ipytest-0.13.3-py3-none-any.whl (14 kB)
Collecting jedi>=0.16 (from ipython->ipytest)
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipytest
Successfully installed ipytest-0.13.3 jedi-0.18.2


The next cell configures `ipytest` for this notebook. It is not needed if you are writing tests in a separate pytest file.

In [4]:
import ipytest
ipytest.autoconfig()

We install `pyspark` because it doesn't come with the base Colab environment.

In [5]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285398 sha256=13b8460b2c01651b31ebdfda3c116e8d7519284d585e49e934055dee2caf3462
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


In [6]:
!pip install chispa

Collecting chispa
  Downloading chispa-0.9.2-py3-none-any.whl (28 kB)
Installing collected packages: chispa
Successfully installed chispa-0.9.2


## Create a local SparkSession

Spark normally runs on a cluster of executors in the cloud. Since we want our tests to run on a single dev machine, we create a fixture that gives us a local Spark session.

In [7]:
import pytest
from _pytest.fixtures import FixtureRequest
from pyspark import SparkConf
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark(request: FixtureRequest):
    conf = (SparkConf()
        .setMaster("local")
        .setAppName("sample_pyspark_testing_starter"))

    spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

    request.addfinalizer(lambda: spark.stop())
    return spark

## Create Helpers

This helper function loads our expected test output from the `expected.json` file.

In [8]:
import json

def expected_json():
    with open("tests/fixtures/expected.json") as f:
        return json.loads(f.read())

# The Test

In [16]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession

def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)
    assert data_frame_to_json(actual_df) == expected_json()

.                                                                                            [100%]


# Make The Assert Specific

Right now, our test compares everything in the output dataframe to everything in a large JSON file. That's a lot of rows to compare, and the assert is wrong anyway!

Let's make this test assert on the thing we actually care about - the output price of the diamond!

## Let's make our assert specific!
### We can do the next step in one of 3 ways:
#### With Chispa
- [ ] Add these imports to the top of the cell, below the `%%ipytest -qq` line:  
`from chispa import assert_column_equality`  
`from pyspark.sql.functions import lit`
- [ ] Filter the dataframe for the unique id of the diamond we care about:  
`actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')`
- [ ] Create a new column in our dataframe that contains our expected price:  
`actual_df=actual_df.withColumn('expected_price', lit(2960.0))`
- [ ] Assert the value in the price column matches the value we want:  
`assert_column_equality(actual_df, 'price', 'expected_price')`
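
The idea behind these steps can be sketched without Spark. The snippet below is a plain-Python stand-in, not chispa's implementation: the rows, ids, prices, and the helpers `with_column` and `assert_columns_equal` are all made up for illustration. It filters to one row, attaches the expected value as an extra column, and then compares the two columns row by row.

```python
# Plain-Python stand-in for a Spark dataframe: a list of row dicts.
# The ids and prices are hypothetical, not taken from diamonds.json.
rows = [
    {"id": "DI-01", "price": 1500.0},
    {"id": "DI-02", "price": 3200.0},
]


def with_column(rows, name, value):
    """Return new rows with a constant column added (similar in spirit to withColumn + lit)."""
    return [{**row, name: value} for row in rows]


def assert_columns_equal(rows, left, right):
    """Fail if any row holds different values in the two named columns."""
    mismatches = [row for row in rows if row[left] != row[right]]
    assert not mismatches, f"{left} != {right} in rows: {mismatches}"


# Filter to the one row we care about, attach the expected value, compare.
selected = [row for row in rows if row["id"] == "DI-02"]
selected = with_column(selected, "expected_price", 3200.0)
assert_columns_equal(selected, "price", "expected_price")
```

Filtering first matters: if the expected value were attached to every row, the comparison would fail on rows that were never part of the check.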


In [None]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession



def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)


.                                                                                            [100%]


#### With Pandas
- [ ] import pandas:  
`import pandas as pd`
- [ ] Filter the dataframe for the unique id of the diamond we care about:  
  `actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')`
- [ ] Create your expected dataframe using Pandas:  
 `expected = pd.DataFrame(({'id': ["DI-26-null-price"], 'price':[2690.0] }))`
- [ ] Select the columns you care about:  
  `actual_df=actual_df.select(['id', 'price'])`
- [ ] Assert for dataframe equality using pandas, converting the Spark dataframe with `toPandas()` first:  
  `pd.testing.assert_frame_equal(actual_df.toPandas(), expected)`

In [13]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit


import pandas as pd

def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)


F                                                                                            [100%]
_________________________________ test_prep_for_linear_regression __________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f09a86a93f0>

    def test_prep_for_linear_regression(spark: SparkSession):
        diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

        actual_df = transform(diamonds_df)
        actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')
        expected = pd.DataFrame(({'id': ["DI-26-null-price"], 'price':[2690.0

#### Assert on properties

- [ ] Filter the dataframe for the unique id of the diamond we care about:  
  `actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')`
- [ ] Convert your dataframe to JSON:  
`actual_df_json = data_frame_to_json(actual_df)`
- [ ] Assert the price property of the first object matches your expected price:  
`assert actual_df_json[0]['price'] == 2690.0`
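
To see why a property-level assert is more robust than comparing the whole output, here is a small plain-Python sketch. The rows are hypothetical stand-ins for what `data_frame_to_json(actual_df)` might return, and `price_of` is a helper invented for this example.

```python
import json

# Hypothetical output rows, standing in for data_frame_to_json(actual_df).
actual_json = json.loads("""
[
  {"id": "DI-01", "carat": 0.3, "price": 1500.0},
  {"id": "DI-02", "carat": 0.7, "price": 3200.0}
]
""")

# A stale "expected everything" snapshot: one unrelated field has drifted.
expected_snapshot = [
    {"id": "DI-01", "carat": 0.31, "price": 1500.0},
    {"id": "DI-02", "carat": 0.7, "price": 3200.0},
]


def price_of(rows, diamond_id):
    """Return the price of the single row with the given id."""
    matches = [row for row in rows if row["id"] == diamond_id]
    assert len(matches) == 1, f"expected one row for {diamond_id}, got {len(matches)}"
    return matches[0]["price"]


# The whole-output comparison fails for a reason unrelated to price...
whole_output_matches = actual_json == expected_snapshot
# ...while the targeted assert only looks at the property under test.
assert price_of(actual_json, "DI-02") == 3200.0
```

The targeted assert keeps passing when unrelated columns change, and when it does fail, the failure points directly at the price.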

In [14]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit


def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)



F                                                                                            [100%]
_________________________________ test_prep_for_linear_regression __________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f09a86c1c60>

    def test_prep_for_linear_regression(spark: SparkSession):
        diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

        actual_df = transform(diamonds_df)

        actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')
        actual_df_json = data_frame_to_json(actual_df)
>       assert actual_df_json[0]['price'] == 2690.0

# Reduce Duplicate Coverage and Fix the Bug

Right now, our test runs the entire `transform` function. Because `diamonds.json` contains multiple test cases, each one runs the same large block of code over and over again.

## The Code
This is the code that is executed when we run our test.

Right now, all of it is being run by every test case in our diamonds.json input file.

Execute both of these cells so that the functions are available in our test cell.


### The `replace_null_prices_with_floating_averages` Function

### The `transform` Function

In [19]:
from pyspark.sql import DataFrame, Window, Column
from pyspark.sql.functions import log, when, mean, col

from src.build_indep_vars import build_indep_vars
from src.diamond_pricing import replace_null_prices_with_floating_averages

def replace_null(orig: Column, average: Column):
    # Use the average wherever the original value is null
    return when(orig.isNull(), average).otherwise(orig)

def transform(df: DataFrame) -> DataFrame:

    df = df.withColumn('lprice', log('price'))
    # Moving average of price over the 3 rows before and after, ordered by price
    window = Window.partitionBy('cut', 'clarity').orderBy('price').rowsBetween(-3, 3)
    moving_avg = mean(df['price']).over(window)
    df = df.withColumn('moving_avg', moving_avg)

    # Fill null prices with the moving average
    df = df.withColumn('price', replace_null(col('price'), col('moving_avg')))
    df = df[['id', 'carat', 'clarity', 'color', 'price']]
    df = build_indep_vars(df, ['carat', 'clarity', 'color'],
                                      categorical_vars=['clarity', 'color'],
                                      keep_intermediate=False,
                                      summarizer=True)
    return df

## The Test

Let's reduce the duplicate coverage.

### Prep

- [ ] Run the test. It should be failing.
- [ ] Replace the line that calls the transform function with the body of the transform function.  
- [ ] Rename `df` to `actual_df`, except the first place it's used. (Find and Replace is `ctrl-m-h`).  
This line:  
`df = df.withColumn('lprice', log('price'))`  
should become  
`actual_df = diamonds_df.withColumn('lprice', log('price'))`
- [ ] Change your assert code so that it is testing _for_ the bug.
```
    actual_df=actual_df.withColumn('expected_price', lit(2960.0))
```
becomes
```
  actual_df=actual_df.withColumn('expected_price', lit(2460.0))
```


- [ ] Extract your assert code into a helper function, so that the assert in the test becomes a single line:
```
def assert_diamond_has_expected_price(actual_df):
    actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')
    actual_df=actual_df.withColumn('expected_price', lit(2460.0))
    assert_column_equality(actual_df, 'price', 'expected_price')
```


In [20]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from chispa import assert_column_equality

def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)

    actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')
    actual_df=actual_df.withColumn('expected_price', lit(2960.0))
    assert_column_equality(actual_df, 'price', 'expected_price')

F                                                                                            [100%]
_________________________________ test_prep_for_linear_regression __________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f09a83e0910>

    def test_prep_for_linear_regression(spark: SparkSession):
        diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

        actual_df = diamonds_df.withColumn('lprice', log('price'))
        actual_df = replace_null_prices_with_floating_averages(actual_df)
        actual_df = actual_df[['id', 'carat', '

#### Now we can start finding what's important!
Remember, our bug is that diamonds with the same cut and clarity but a different color are influencing the calculated price. Only diamonds with the same cut, clarity, and color should influence the calculated price for a diamond with a null price.

- [ ] Move your assert up one line at a time.  
 After each move, run your tests.  
 If it fails, figure out why it's failing.  
 If the test passes, the line wasn't important for the bug you wanted to catch. Delete it.
- [ ] Continue until you find the source of the bug
- [ ] You may need to rename columns in order to continue squeezing the bottom.
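
The bug described above can be reproduced in miniature without Spark. The sketch below is a plain-Python approximation of the moving-average fill, not the code under test: `fill_null_prices` and the five sample diamonds are invented for illustration, and it assumes null prices sort first within a partition. It shows how the choice of partition keys changes the price that gets filled in.

```python
from collections import defaultdict
from statistics import mean


def fill_null_prices(rows, partition_keys, before=3, after=3):
    """Replace None prices with the mean of the known prices in a row window.

    Rows are grouped by partition_keys and sorted by price (None first, an
    assumption of this sketch). Each row's window spans `before` rows behind
    it to `after` rows ahead of it within its group.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[key] for key in partition_keys)].append(row)

    filled = {}
    for group in partitions.values():
        group.sort(key=lambda r: (r["price"] is not None, r["price"] or 0.0))
        for i, row in enumerate(group):
            window = group[max(0, i - before): i + after + 1]
            known = [r["price"] for r in window if r["price"] is not None]
            price = row["price"]
            if price is None and known:
                price = mean(known)
            filled[row["id"]] = price
    return filled


# Hypothetical diamonds: same cut and clarity, two different colors.
diamonds = [
    {"id": "E-1", "cut": "Ideal", "clarity": "VS1", "color": "E", "price": 1000.0},
    {"id": "E-2", "cut": "Ideal", "clarity": "VS1", "color": "E", "price": 1200.0},
    {"id": "E-null", "cut": "Ideal", "clarity": "VS1", "color": "E", "price": None},
    {"id": "J-1", "cut": "Ideal", "clarity": "VS1", "color": "J", "price": 5000.0},
    {"id": "J-2", "cut": "Ideal", "clarity": "VS1", "color": "J", "price": 5200.0},
]

by_cut_clarity = fill_null_prices(diamonds, ["cut", "clarity"])
by_cut_clarity_color = fill_null_prices(diamonds, ["cut", "clarity", "color"])

print(by_cut_clarity["E-null"])        # 2400.0 (pulled up by a color-J diamond)
print(by_cut_clarity_color["E-null"])  # 1100.0 (only color-E diamonds count)
```

A test case built like this one, with a single null price and a deliberately different price range for the other color, makes the difference between the two partitionings easy to assert on.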

In [30]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession, DataFrame, Window, Column
from pyspark.sql.functions import lit, log, when, mean, col
from chispa import assert_column_equality

def assert_diamond_has_expected_price(actual_df):
    actual_df=actual_df.filter(actual_df.id == 'DI-26-null-price')
    actual_df=actual_df.withColumn('expected_price', lit(2460.0))
    assert_column_equality(actual_df, 'price', 'expected_price')

def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = diamonds_df.withColumn('lprice', log('price'))
    window = Window.partitionBy('cut', 'clarity').orderBy('price').rowsBetween(-3, 3)
    moving_avg = mean(actual_df['price']).over(window)
    actual_df = actual_df.withColumn('moving_avg', moving_avg)

    actual_df = actual_df.withColumn('price', replace_null(col('price'), col('moving_avg')))
    actual_df = actual_df[['id', 'carat', 'clarity', 'color', 'price']]
    actual_df = build_indep_vars(actual_df, ['carat', 'clarity', 'color'],
                                      categorical_vars=['clarity', 'color'],
                                      keep_intermediate=False,
                                      summarizer=True)

    assert_diamond_has_expected_price(actual_df)

F                                                                                            [100%]
_________________________________ test_prep_for_linear_regression __________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f09a85eef20>

    def test_prep_for_linear_regression(spark: SparkSession):
        diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

        actual_df = diamonds_df.withColumn('lprice', log('price'))
        window = Window.partitionBy('cut', 'clarity').orderBy('price').rowsBetween(-3