## Importing shared computation logic from a library

A key concept in software engineering is testability: the ability to write and execute tests against business logic.  
Notebooks on their own are not well suited to this, because they mix data access, data processing, data ingestion, and data export in the same code unit, which makes writing isolated tests much harder. 
One approach to writing testable PySpark code in the Python ecosystem is to move business logic into a separate Python library file. That file can then be imported, executed, and tested independently of the notebook.  

This example shows how to separate a block of logic into a library file, import it into the notebook, and test it.

In [0]:
import os
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
print(notebook_path)
notebook_abs_dir = os.path.dirname(notebook_path)
print(notebook_abs_dir)

In [0]:
%reload_ext autoreload
%autoreload 2

## Load Datasets from external Data Source

These datasets are currently stored locally and packaged with this repo. However, they are representative of data obtained from external data interfaces, such as an API (e.g. HTTPS GET) or Delta Lake (e.g. spark.table).
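Because the loaders stand in for an external interface, it can help to keep the transport swappable so that tests never touch the network. The sketch below is an illustration only: `get_class_records`, its `fetch` parameter, and `fake_fetch` are hypothetical names, not the functions in the `library.fetch_data_from_api` module, and it uses plain Python lists of dicts rather than Spark DataFrames.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def get_class_records(fetch: Callable[[], List[Record]]) -> List[Record]:
    """Load class records through an injected callable and validate them.

    `fetch` is a hypothetical hook: in production it could wrap an HTTPS GET
    or a table read; in tests it is a plain function returning fixed data.
    """
    records = fetch()
    # Fail early if the source returns rows without the join key.
    bad = [r for r in records if "class_id" not in r]
    if bad:
        raise ValueError(f"{len(bad)} record(s) missing 'class_id'")
    return records


def fake_fetch() -> List[Record]:
    # Stand-in for the external call, used only in tests.
    return [{"class_id": "c1", "class_name": "Math"}]


def test_get_class_records_returns_rows():
    assert get_class_records(fake_fetch) == [
        {"class_id": "c1", "class_name": "Math"}
    ]
```

Passing the data source in as an argument is one possible design; patching the real call in the test would be another.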

In [0]:
# Pretend we are loading data from an API to get the Class Dataset
from library.fetch_data_from_api import get_class_data_from_api

df_class = get_class_data_from_api(spark)

display(df_class)

In [0]:
from library.fetch_data_from_api import get_score_data_from_api

df_score = get_score_data_from_api(spark)

display(df_score)

## Data Processing - Joining Class and Score columns


In [0]:
from library.class_business_logic import inner_join_dataframes

In [0]:
df_joined = inner_join_dataframes(df_class, df_score, "class_id")
display(df_joined)

## Calculate Class Level Statistics

This section provides two implementations of the same logic, at two levels of computational complexity, along with their corresponding test cases. The results should be the same. Both approaches have unit tests and integration tests.

To make code testable, there are a couple of important guidelines to follow:
1. Encapsulate logic in functions and place them in a separate library file, such as those found in the `library` folder. This allows pytest to import the logic for testing without automatically executing it.
2. Write logic in the smallest possible functional units: each function should perform a single, well-defined task. This makes it easier to wrap tests around individual pieces of logic, enabling faster error tracing and simpler debugging.
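
The two guidelines above can be sketched as follows. This is a minimal illustration in plain Python, not the contents of the `library` folder: `group_scores_by_class` and `score_summary` are hypothetical names, and the tests follow the naming pytest discovers by default (functions prefixed with `test_`).

```python
from statistics import mean


def group_scores_by_class(rows):
    """Group (class_id, score) pairs into {class_id: [score, ...]}."""
    grouped = {}
    for class_id, score in rows:
        grouped.setdefault(class_id, []).append(score)
    return grouped


def score_summary(scores):
    """Return the average, min and max of a non-empty list of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "score_average": mean(scores),
        "score_min": min(scores),
        "score_max": max(scores),
    }


# Each test targets exactly one function, so a failure points at one unit.
def test_group_scores_by_class():
    rows = [("c1", 80), ("c2", 70), ("c1", 100)]
    assert group_scores_by_class(rows) == {"c1": [80, 100], "c2": [70]}


def test_score_summary():
    assert score_summary([80, 90, 100]) == {
        "score_average": 90,
        "score_min": 80,
        "score_max": 100,
    }
```

Importing this module runs nothing beyond the function definitions, which is what lets a test runner load it safely.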

### Score Statistics with Simple Spark SQL

In [0]:
from library.class_business_logic import calculate_statistics_with_sql

df_result = calculate_statistics_with_sql(df_joined, columns=["score"], spark=spark)
# print(df_result)
display(df_result)

### Score Statistics with Pandas via `applyInPandas`

In [0]:
from library.class_business_logic import calculate_statistics_with_pandas


df_result = df_joined.groupBy("class_id", "class_name").applyInPandas(
    lambda pdf: calculate_statistics_with_pandas(pdf, columns=["score"]),
    schema="class_id string, class_name string, score_stats struct<score_average:double,score_min:double,score_max:double>",
)

# Flatten the score_stats struct into top-level columns
df_result = df_result.select("class_id", "class_name", "score_stats.*")
display(df_result)
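
For orientation, a function passed to `applyInPandas` receives one pandas DataFrame per group and returns a pandas DataFrame whose columns match the declared schema. The sketch below is a guess at that shape, not the actual `calculate_statistics_with_pandas` from the library: `calculate_statistics_sketch` is a hypothetical name, and representing the struct column as a Python dict is an assumption about how the runtime converts nested values.

```python
import pandas as pd


def calculate_statistics_sketch(pdf: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return one row for the group, with a nested stats dict per column."""
    # All rows in a group share the same keys, so take them from the first row.
    row = {
        "class_id": pdf["class_id"].iloc[0],
        "class_name": pdf["class_name"].iloc[0],
    }
    for col in columns:
        # Field names mirror the struct fields declared in the schema string.
        row[f"{col}_stats"] = {
            f"{col}_average": float(pdf[col].mean()),
            f"{col}_min": float(pdf[col].min()),
            f"{col}_max": float(pdf[col].max()),
        }
    return pd.DataFrame([row])


# The function only needs pandas, so it can be exercised without a Spark session.
sample = pd.DataFrame(
    {"class_id": ["c1", "c1"], "class_name": ["Math", "Math"], "score": [80.0, 90.0]}
)
result = calculate_statistics_sketch(sample, columns=["score"])
```

Because the function takes and returns plain pandas objects, a unit test can call it directly on a small hand-built DataFrame.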

In [0]:
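from pyspark.sql.functions import to_date

# NOTE: `df_sample_data` and `calculate_statistics` are not defined or imported
# in the cells above; this cell only runs if both are already in scope.
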
result = (
    df_sample_data.withColumn("pickup_date", to_date("tpep_pickup_datetime"))
    .groupBy("pickup_date")
    .applyInPandas(
        lambda pdf: calculate_statistics(pdf, columns=["trip_distance", "fare_amount"]),
        schema="pickup_date date, trip_distance_stats struct<mean:double,median:double,variance:double>, fare_amount_stats struct<mean:double,median:double,variance:double>",
    )
)

display(result)