Update README to show use of new assert_equal function
sjwhalen committed Sep 30, 2019
1 parent 6097f98 commit 2ff06db
Showing 2 changed files with 47 additions and 23 deletions.
69 changes: 46 additions & 23 deletions README.md
This is not a complex method (a sketch follows the list below), but perhaps we would like to verify that:

1. Dates are in fact grouped into different weeks.
1. Units sold are actually averaged and not summed.
1. Gross sales are truly summed and not averaged.
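
The ``summarize_weekly_sales`` method itself is defined earlier in the README,
outside this excerpt. As a rough idea of the kind of transformation under
test, here is a plausible sketch, inferred from the test data and column names
below rather than copied from the actual implementation:

```python
# A plausible sketch only, inferred from the tests below; the README's
# actual implementation may differ.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def summarize_weekly_sales(df_to_average: DataFrame) -> DataFrame:
    return (df_to_average
            .groupBy(F.weekofyear('date').alias('week_of_year'))
            .agg(F.avg('units_sold').alias('avg_units_sold'),
                 F.sum('gross_sales').alias('gross_sales')))
```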

A unit testing purist might argue that each of these assertions should be
covered by a separate test method, but there are at least two reasons why one
might choose not to do that.

1. Practical experience tells us that detectable overhead is incurred for
each separate PySpark transformation test, so we may want to limit the number
of separate tests in order to keep the running time of the full test suite
reasonable.

1. When working with data sets as we do in Spark or SQL, particularly when
using aggregate, grouping and window functions,
interactions between different rows can be easy to overlook. Tweaking a
query to fix an aggregate function like a summation might inadvertently break
the intended behavior of a windowing function in the query.
A change to the query might allow a summation-only unit test to pass while leaving broken
window function behavior undetected because we have neglected to update
the window-function-only unit test (see the sketch after this list).
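
To make the second point concrete, consider a hypothetical transformation
(invented for illustration, not part of this package) that mixes an aggregate
with a window function. A test that asserts only the weekly sums could keep
passing even if the window specification were accidentally broken:

```python
# Hypothetical example: weekly totals plus a running total via a window.
# A sum-only assertion would not notice a broken window specification.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, Window


def weekly_sales_with_running_total(df: DataFrame) -> DataFrame:
    weekly = (df.groupBy(F.weekofyear('date').alias('week_of_year'))
                .agg(F.sum('gross_sales').alias('gross_sales')))
    running = (Window.orderBy('week_of_year')
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    return weekly.withColumn('running_total',
                             F.sum('gross_sales').over(running))
```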

If we accept that we'd like to use a single test to verify the three
requirements of our query, we need three rows in our input DataFrame: two
rows in the same week (so that an average is distinguishable from a sum) and
a third row in a different week (so that grouping is exercised).

Using unadorned pytest, our test might look like this:

```python
from datetime import datetime

from pyspark.sql import Row, SparkSession


def test_without_dataframe_show_reader(spark_session: SparkSession):
    # The diff elides the top of this test; the imports and input rows
    # here are assumed from the tabular version of the same test below.
    input_rows = [
        Row(date=datetime(2019, 1, 1), units_sold=10, gross_sales=100),
        Row(date=datetime(2019, 1, 2), units_sold=20, gross_sales=200),
        Row(date=datetime(2019, 1, 8), units_sold=80, gross_sales=800),
    ]
    input_df = spark_session.createDataFrame(input_rows)

    result = summarize_weekly_sales(input_df).collect()

    assert 2 == len(result)     # Number of rows
    assert 3 == len(result[0])  # Number of columns
    assert 1 == result[0]['week_of_year']
    assert 15 == result[0]['avg_units_sold']
    assert 300 == result[0]['gross_sales']
    assert 2 == result[1]['week_of_year']
    assert 80 == result[1]['avg_units_sold']
    assert 800 == result[1]['gross_sales']
```
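
Both versions of the test receive ``spark_session`` as a pytest fixture. A
typical session-scoped definition (an assumption for illustration, not
necessarily how this repo defines it) looks like this:

```python
# A typical session-scoped SparkSession fixture, assumed for illustration.
# Reusing one session across tests limits the per-test overhead noted above.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope='session')
def spark_session():
    spark = (SparkSession.builder
             .master('local[2]')
             .appName('dataframe-show-reader-tests')
             .getOrCreate())
    yield spark
    spark.stop()
```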

Using the DataFrame Show Reader, our test could instead look like this:

```python
def test_using_dataframe_show_reader(spark_session: SparkSession):
    # The diff elides the opening of this call; the header and type rows
    # here are assumed, while the data rows are from the original.
    input_df = show_output_to_df("""
    +-------------------+----------+-----------+
    |date               |units_sold|gross_sales|
    [timestamp          |int       |int        ]
    +-------------------+----------+-----------+
    |2019-01-01 00:00:00|10        |100        |
    |2019-01-02 00:00:00|20        |200        |
    |2019-01-08 00:00:00|80        |800        | This date is in week 2.
    +-------------------+----------+-----------+
    """, spark_session)

    expected_df = show_output_to_df("""
    +------------+--------------+-----------+
    |week_of_year|avg_units_sold|gross_sales|
    [int         |double        |double     ]
    +------------+--------------+-----------+
    |1           |15            |300        |
    |2           |80            |800        |
    +------------+--------------+-----------+
    """, spark_session)

    assert_equal(expected_df, summarize_weekly_sales(input_df))
```

In the second test example, the ``show_output_to_df`` function accepts as input
a string like the output of a Spark DataFrame ``show()`` call and returns a
DataFrame containing the values represented in the string. This lets us specify
input data in a more concise tabular form that may be easier for other
programmers (and our future selves) to digest when we need to maintain this
code down the road.
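
An easy way to produce such a table in the first place is Spark's own ``show``
method with truncation disabled, pasting its output into the test:

```python
# Print a DataFrame in the tabular format shown above, ready to paste
# into a test. truncate=False keeps full column values.
input_df.show(truncate=False)
```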

If the method under test were more complicated and required more rows and/or
columns in order to test adequately, the length of the first test format would
grow much more quickly than that of the test using the DataFrame Show Reader.

Notice also that the ``show_output_to_df`` function gives us a convenient way
to create an ``expected_df`` to pass to the ``assert_equal`` function (included
in the package) that checks DataFrame equality. In addition to allowing this
compact display of the expected rows, columns, and data, ``assert_equal``
checks that the DataFrame schemas match, which the first version of the test
does not do.
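
For intuition, the essence of such an equality check (a simplified sketch, not
the package's exact implementation) is a schema comparison plus a row
comparison:

```python
# Simplified sketch of a DataFrame equality assertion; the packaged
# assert_equal may differ in details such as ordering and error messages.
from pyspark.sql import DataFrame


def assert_equal(expected: DataFrame, actual: DataFrame) -> None:
    assert expected.schema == actual.schema  # names, types, nullability
    assert sorted(expected.collect()) == sorted(actual.collect())  # row values
```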

## Running the Tests

1. Clone the git repo.
1. ``cd`` into the root level of the repo.
1. At a terminal command line, run ``pytest``.

## Installation

To install the package for use in your own package, run:

`pip install py-dataframe-show-reader`

## Who Maintains DataFrame Show Reader?

DataFrame Show Reader is the work of the community. The core committers and
1 change: 1 addition & 0 deletions setup.py
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
    ],
    python_requires='>=3.6',
)
