## Manually Testing a Data Pipeline

Manual testing is done to:
- validate that data is extracted, transformed, and loaded as expected
- identify and fix data quality issues
- improve data reliability

### End-to-End Testing
End-to-end testing runs the full ETL process in a testing environment. To do it:
- confirm that the pipeline runs on repeated attempts
- validate data pipeline checkpoints
  - A checkpoint is a check placed after each component of the ETL process
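    ```python
    import pandas as pd

    # After loading, read the data back and compare it with the DataFrame that was loaded.
    # `query`, `engine`, and `my_df` are assumed to be defined earlier in the pipeline.
    loaded_data = pd.read_sql(query, engine)
    print(my_df.equals(loaded_data))
    ```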
  - You can also check the result of the transform step by comparing the cleaned data to the extracted raw data
- engage in peer review, incorporate feedback
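
The transform checkpoint described above could be sketched roughly as follows. The helper name `checkpoint_transform`, the column names, and the sample data are all hypothetical, and the `dropna`-based transform is only a stand-in for the real one. Which checks make sense depends on what the actual transform is supposed to do.

```python
import pandas as pd


def checkpoint_transform(raw_df: pd.DataFrame, clean_df: pd.DataFrame) -> None:
    """Compare transformed data against the raw extract (hypothetical checks)."""
    # Transforming should not create rows that were never extracted
    assert len(clean_df) <= len(raw_df), "transform added rows"
    # Cleaned data should have no missing values left
    assert not clean_df.isna().any().any(), "clean data still has NaNs"
    print(f"raw shape: {raw_df.shape}, clean shape: {clean_df.shape}")


# Made-up sample data standing in for the extracted raw data
raw_df = pd.DataFrame({"income": [100.0, None, 300.0], "tax": [10.0, 20.0, 30.0]})
# Stand-in for transform(): drop incomplete rows and add a derived column
clean_df = raw_df.dropna().assign(tax_rate=lambda df: df["tax"] / df["income"])

checkpoint_transform(raw_df, clean_df)
```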

An example of manual testing:

```python
import pandas as pd
# Assumes extract, transform, and load live in a local `pipeline` module
from pipeline import extract, transform, load

# Trigger the data pipeline to run three times
for attempt in range(3):
    print(f"Attempt: {attempt}")
    raw_tax_data = extract("raw_tax_data.csv")
    clean_tax_data = transform(raw_tax_data)
    load(clean_tax_data, "clean_tax_data.parquet")

    # Print the shape of the clean_tax_data DataFrame
    print(f"Shape of clean_tax_data: {clean_tax_data.shape}")

# Read in the loaded data, check the shape
to_validate = pd.read_parquet("clean_tax_data.parquet")
print(f"Final shape of cleaned data: {to_validate.shape}")
```
This confirms that the shape stays the same across runs and that the pipeline's output does not change even after repeated calls.
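
Instead of comparing the printed shapes by eye, the same repeated-run check can be turned into an assertion. The sketch below is one possible way to do that. The `transform` and `load` functions here are hypothetical stand-ins, the sample data is made up, and CSV is used in place of Parquet only so the example does not need a Parquet engine installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the real transform step
    return raw.dropna().reset_index(drop=True)


def load(clean: pd.DataFrame, path: Path) -> None:
    # Overwrites the target file, so repeated runs should not pile up rows
    clean.to_csv(path, index=False)


# Made-up sample data standing in for the extracted raw data
raw_tax_data = pd.DataFrame({"income": [100.0, None, 300.0], "tax": [10.0, 20.0, 30.0]})

shapes = []
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "clean_tax_data.csv"
    for attempt in range(3):
        load(transform(raw_tax_data), target)
        # Read the loaded data back and record its shape for this run
        shapes.append(pd.read_csv(target).shape)

# Every run should leave the loaded data with the same shape
assert len(set(shapes)) == 1, f"shape changed between runs: {shapes}"
print(f"Shape after each run: {shapes}")
```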

## Unit Testing a Data Pipeline

Testing in data engineering typically follows this order:

Unit testing -> end-to-end testing -> deploy

To write unit tests, we use `pytest`.

```python
import pandas as pd
from pipeline import extract, transform

# Build a test asserting the type of the resulting data after transforming
def test_transformed_data():
    raw_df = extract('raw.csv')
    clean = transform(raw_df)

    assert isinstance(clean, pd.DataFrame)
```

### Using Fixtures

When testing pipelines, we often reuse the same sample or mock data in multiple test cases.

To avoid repeating setup code, `pytest` provides **fixtures**, which are functions that return a fixed baseline input or environment for the tests to run against.

#### What is a fixture?

A fixture is a function decorated with `@pytest.fixture`.

When a test function includes a parameter with the same name as a fixture, `pytest` automatically calls the fixture and passes its return value to the test function.

#### Example: Testing a Data Pipeline with a Fixture

```python
import pytest
import pandas as pd
from pipeline import transform

# Define a fixture named 'raw_data'
@pytest.fixture
def raw_data():
    data = {
        'date': ['2024-01-01', '2024-01-02'],
        'value': [10, 20]
    }
    return pd.DataFrame(data)

# This test function uses the 'raw_data' fixture
def test_transform_output_type(raw_data):
    # raw_data is automatically passed in by pytest
    result = transform(raw_data)

    # Assert that the result is a DataFrame
    assert isinstance(result, pd.DataFrame)

# Another test using the same raw_data fixture
def test_transform_values(raw_data):
    result = transform(raw_data)

    # For example, check if a 'normalized_value' column was added
    assert 'normalized_value' in result.columns
```

In simpler terms, using a fixture like:

```python
@pytest.fixture
def raw_data():
    return pd.DataFrame({...})

def test_something(raw_data):
    # raw_data is already the DataFrame returned by raw_data()
    ...
```

is conceptually like writing:

```python
def test_something():
    # Conceptual only: pytest makes this call for you.
    # Calling a fixture function directly in a real test raises an error.
    data = raw_data()
```

In `pytest`, you don't need to call `raw_data()` yourself. `pytest` does that for you automatically, based on the parameter name in the test function.

### Unit Testing on Data Contents
You don't only check the type of the DataFrame as a whole. You can also check its contents, such as the number of columns or the range of values in a column:

```python
# `clean_data` is assumed to be a fixture that returns the transformed DataFrame
def test_transformed_data(clean_data):
    # Check the number of columns
    assert len(clean_data.columns) == 4

    # Check the lower bound of a column
    assert clean_data['col'].min() >= 0
```
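
For a fuller picture, here is a minimal sketch of content checks together with the fixture that feeds them. The column names, the sample values, and the `make_clean_data` helper are hypothetical. In a real test suite the fixture would more likely call the pipeline's own transform function on sample raw data.

```python
import pandas as pd
import pytest


def make_clean_data() -> pd.DataFrame:
    # Hypothetical sample of already-transformed data with four columns
    return pd.DataFrame(
        {
            "id": [1, 2, 3],
            "income": [100.0, 250.0, 300.0],
            "tax": [10.0, 25.0, 30.0],
            "tax_rate": [0.1, 0.1, 0.1],
        }
    )


@pytest.fixture
def clean_data() -> pd.DataFrame:
    # pytest injects this return value into tests that take `clean_data`
    return make_clean_data()


def test_column_count(clean_data):
    # Check the number of columns
    assert len(clean_data.columns) == 4


def test_tax_rate_range(clean_data):
    # A rate should stay between 0 and 1 inclusive
    assert clean_data["tax_rate"].between(0, 1).all()


def test_no_missing_values(clean_data):
    # Cleaned data should not contain missing values
    assert not clean_data.isna().any().any()
```

Keeping the data construction in a plain helper separate from the fixture is a design choice, not a requirement: it lets the same sample data be reused outside of `pytest`.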



---

## Data Pipeline Architecture Patterns
We may have a single file called `etl_pipeline.py` that includes both the definitions of the extract, transform, and load functions and the code that executes them.

However, that is not the best implementation. We may want to move the definitions into a separate file:

`>ls`
- `etl_pipeline.py`
- `pipeline_utils.py`
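
A rough sketch of how the two files might divide the work is shown below, written as a single block with comment markers so it can be read in one place. The function bodies, the `run_pipeline` name, and the file paths are assumptions made for illustration only.

```python
# --- pipeline_utils.py: only the definitions ---
import pandas as pd


def extract(file_path: str) -> pd.DataFrame:
    # Read the raw data from a CSV file
    return pd.read_csv(file_path)


def transform(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transformation: drop incomplete rows
    return raw_data.dropna()


def load(clean_data: pd.DataFrame, file_path: str) -> None:
    # Write the cleaned data to its destination
    clean_data.to_csv(file_path, index=False)


# --- etl_pipeline.py: only the execution ---
# In the real two-file layout this file would start with:
# from pipeline_utils import extract, transform, load


def run_pipeline(source_path: str, target_path: str) -> None:
    # Chain the three steps in order
    raw_data = extract(source_path)
    clean_data = transform(raw_data)
    load(clean_data, target_path)


# Example call (hypothetical file names):
# run_pipeline("raw_tax_data.csv", "clean_tax_data.csv")
```

With this split, tests can import `extract`, `transform`, and `load` from `pipeline_utils.py` without triggering a pipeline run.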