<a href="https://colab.research.google.com/github/jmasonlee/efficiently_testing_etl_pipelines/blob/main/fixing_a_big_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup Notebook

In [1]:
!rm -rf efficiently_testing_etl_pipelines
!git clone https://github.com/jmasonlee/efficiently_testing_etl_pipelines.git
!cp -r /content/efficiently_testing_etl_pipelines/src/ .
!cp -r /content/efficiently_testing_etl_pipelines/tests/ .
!rm -rf efficiently_testing_etl_pipelines
!rm -rf tests/diamond_pricing_test*
!rm -rf tests/test_helpers/*verification_helpers.py
!rm -rf tests/conftest.py
!rm -rf sample_data


Cloning into 'efficiently_testing_etl_pipelines'...
remote: Enumerating objects: 525, done.
remote: Counting objects: 100% (166/166), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 525 (delta 107), reused 113 (delta 70), pack-reused 359
Receiving objects: 100% (525/525), 239.28 KiB | 4.20 MiB/s, done.
Resolving deltas: 100% (313/313), done.


# Setup Tests

### Install Dependencies

This exercise needs a few extra dependencies so that we can run tests inside a notebook.

`ipytest` lets us run pytest tests directly from notebook cells.



In [2]:
!pip install ipytest

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ipytest
  Downloading ipytest-0.13.3-py3-none-any.whl (14 kB)
Collecting jedi>=0.16 (from ipython->ipytest)
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipytest
Successfully installed ipytest-0.13.3 jedi-0.18.2


The next cell configures `ipytest` for this notebook. It is not needed if you are writing tests in a separate pytest file.

In [3]:
import ipytest
ipytest.autoconfig()

We install `pyspark` because it doesn't come with the base Colab environment.

In [4]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=a734f02fdb61d8c789991111397d62729fd65605874e96f19384987cb756c3ca
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


## Create a local SparkSession

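Spark normally runs across a cluster of executors, often in the cloud. Because we want our tests to run on a single dev machine, we create a pytest fixture that provides a local `SparkSession`. The fixture is session-scoped, so the `SparkSession` is created once and stopped when the test run ends.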

In [5]:
import pytest
from _pytest.fixtures import FixtureRequest
from pyspark import SparkConf
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark(request: FixtureRequest):
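    # Run Spark in-process on this machine instead of connecting to a cluster.
    conf = (SparkConf()
        .setMaster("local")
        .setAppName("sample_pyspark_testing_starter"))

    spark = (SparkSession
        .builder
        .config(conf=conf)
        .getOrCreate())

    # Stop the SparkSession once the whole test session has finished.
    request.addfinalizer(spark.stop)
    return spark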

## Create Helpers

This helper function loads the expected test output from `tests/fixtures/expected.json`.

In [6]:
import json

def expected_json():
    # Parse the stored expected output into Python lists and dicts.
    with open("tests/fixtures/expected.json") as f:
        return json.load(f)
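
When two long lists of rows are compared with a single `==`, a mismatch can be hard to read. As a rough sketch only (it is not part of the course repository), the loading helper above could be paired with a small comparison helper that reports just the fields that differ in the first mismatching row. The names `load_json` and `first_row_diff` are hypothetical, and the rows below are made-up illustration data, not values from the diamonds fixtures.

```python
import json
import tempfile
from itertools import zip_longest
from pathlib import Path


def load_json(path):
    """Read a JSON fixture file and return the parsed Python objects."""
    with open(path) as f:
        return json.load(f)


def first_row_diff(actual, expected):
    """Return (index, {field: (actual_value, expected_value)}) for the first
    pair of rows that differ, or None when both lists of rows match.

    A row missing from the shorter list is treated as an empty dict, so all
    fields of its counterpart are reported as differing.
    """
    pairs = zip_longest(actual, expected, fillvalue={})
    for index, (actual_row, expected_row) in enumerate(pairs):
        if actual_row != expected_row:
            fields = sorted(set(actual_row) | set(expected_row))
            return index, {
                field: (actual_row.get(field), expected_row.get(field))
                for field in fields
                if actual_row.get(field) != expected_row.get(field)
            }
    return None


# Made-up rows for illustration only.
expected_rows = [
    {"id": "1", "carat": 0.23, "color_index": 1.0},
    {"id": "2", "carat": 0.21, "color_index": 1.0},
]
actual_rows = [
    {"id": "1", "carat": 0.23, "color_index": 1.0},
    {"id": "2", "carat": 0.21, "color_index": 2.0},
]

# Write the expected rows to a temporary fixture file, then load them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    fixture_path = Path(tmp_dir) / "expected.json"
    fixture_path.write_text(json.dumps(expected_rows))
    loaded = load_json(fixture_path)

print(first_row_diff(actual_rows, loaded))
# prints (1, {'color_index': (2.0, 1.0)})
```

In a test, such a helper could be used as `assert first_row_diff(actual, expected) is None`, so that a failure message names the row index and the differing fields instead of printing both lists in full.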

# The Test

In [7]:
%%ipytest -qq
from src.linear_regression_prep import transform
from tests.test_helpers.json_helpers import create_df_from_json, data_frame_to_json

from pyspark.sql import SparkSession

def test_prep_for_linear_regression(spark: SparkSession):
    diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

    actual_df = transform(diamonds_df)
    assert data_frame_to_json(actual_df) == expected_json()

F                                                                                            [100%]
_________________________________ test_prep_for_linear_regression __________________________________

spark = <pyspark.sql.session.SparkSession object at 0x7f2152662530>

    def test_prep_for_linear_regression(spark: SparkSession):
        diamonds_df = create_df_from_json("tests/fixtures/diamonds.json", spark)

        actual_df = transform(diamonds_df)
>       assert data_frame_to_json(actual_df) == expected_json()
E       AssertionError: assert [{'carat': 0....G', ...}, ...] == [{'carat': 0....8', ...}, ...]
E         At index 0 diff: {'id': '1', 'carat': 0.23, 'clarity': 'SI2', 'color': 'E', 'price': 326.0, 'clarity_index': 3.0, 'color_index': 1.0, 'inde