# Test a Kedro project

Testing a Kedro project verifies that its nodes and pipelines behave as we expect them to. In this section we look at some example tests for the rocketfuel project.

This section explains the following:

- How to test a Kedro node

- How to test a Kedro pipeline

- Testing best practices

In [2]:
# %pip install pytest

## Writing tests for Kedro nodes: Unit testing

- Kedro expects node functions to be pure functions: a pure function is one whose output depends solely on its inputs and that has no observable side effects.
- Testing these functions checks that a node behaves as expected: for a given set of input values, the node produces the expected output. These tests are referred to as unit tests.

Recommendation: structure each test following pytest's anatomy of a test (https://docs.pytest.org/en/7.1.x/explanation/anatomy.html#anatomy-of-a-test):
1. Arrange - prepare everything for the test.
2. Act - perform the state-changing action that kicks off the behavior we want to test.
3. Assert - inspect the resulting state and check that it looks as we expect.
4. Cleanup - the test picks up after itself, so that other tests are not accidentally influenced by it.

Let us explore what this looks like in practice. Consider the node function `split_data` defined in the data science pipeline:

In [3]:
import logging
from typing import Dict, Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """
    X = data[parameters["features"]]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test

In [4]:
# Example test case
def test_split_data():
    # Arrange
    dummy_data = pd.DataFrame(
        {
            "engines": [1, 2, 3],
            "crew": [4, 5, 6],
            "passenger_capacity": [5, 6, 7],
            "price": [120, 290, 30],
        }
    )

    dummy_parameters = {
        "model_options": {
            "test_size": 0.2,
            "random_state": 3,
            "features": ["engines", "passenger_capacity", "crew"],
        }
    }

    # Act
    X_train, X_test, y_train, y_test = split_data(dummy_data, dummy_parameters["model_options"])

    # Assert
    assert len(X_train) == 2
    assert len(y_train) == 2
    assert len(X_test) == 1
    assert len(y_test) == 1

In [5]:
test_split_data()

This test is an example of positive testing: it checks that a valid input produces the expected output. The inverse, checking that an invalid input is appropriately rejected, is called negative testing and is equally important.

Using the same steps as above, we can write the following test to validate that an error is raised when price data is not available:

In [6]:
import pytest


def test_split_data_missing_price():
    # Arrange
    dummy_data = pd.DataFrame(
        {
            "engines": [1, 2, 3],
            "crew": [4, 5, 6],
            "passenger_capacity": [5, 6, 7],
            # Note the missing price data
        }
    )

    dummy_parameters = {
        "model_options": {
            "test_size": 0.2,
            "random_state": 3,
            "features": ["engines", "passenger_capacity", "crew"],
        }
    }

    with pytest.raises(KeyError) as e_info:
        # Act
        X_train, X_test, y_train, y_test = split_data(dummy_data, dummy_parameters["model_options"])

    # Assert
    assert "price" in str(e_info.value) # checks that the error is about the missing price data

In [7]:

test_split_data_missing_price()
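
The two tests above need no Cleanup step, because `split_data` leaves nothing behind. As a rough sketch of what Cleanup looks like when a test does create something, the example below writes a temporary CSV file and removes it afterwards. The helper `count_rows` and the test `test_count_rows` are hypothetical and exist only for this illustration; in a pytest suite, the built-in `tmp_path` fixture would typically take care of the temporary files for you.

```python
import csv
import os
import tempfile


def count_rows(path: str) -> int:
    """Hypothetical helper: count the data rows in a CSV file, excluding the header."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))


def test_count_rows():
    # Arrange: create a temporary CSV file with a header and two data rows
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["engines", "price"])
            writer.writerow([1, 120])
            writer.writerow([2, 290])

        # Act
        n_rows = count_rows(path)

        # Assert
        assert n_rows == 2
    finally:
        # Cleanup: remove the file so that later tests are not affected
        os.remove(path)


test_count_rows()
```

Placing the cleanup in a `finally` block means the file is removed even when the assertion fails.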

## Writing tests for Kedro pipelines: Integration testing

Writing tests for each node ensures each node will behave as expected when run individually. However, we must also consider how nodes in a pipeline interact with each other - this is called integration testing. Integration testing combines individual units as a group and checks whether they communicate, share data, and work together as expected. Let us look at this in practice.

Consider the data science pipeline as a whole:

In [9]:
from kedro.pipeline import Pipeline, node, pipeline
import logging
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import max_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: dict) -> tuple:
    """Splits data into features and targets training and test sets.

    Args:
        data: Data containing features and target.
        parameters: Parameters defined in parameters/data_science.yml.
    Returns:
        Split data.
    """
    X = data[parameters["features"]]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Trains the linear regression model.

    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
    """
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor


def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
) -> dict[str, float]:
    """Calculates evaluation metrics and logs the coefficient of determination.

    Args:
        regressor: Trained model.
        X_test: Testing data of independent features.
        y_test: Testing data for price.

    Returns:
        Dictionary with the R^2 score, mean absolute error, and max error.
    """
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    me = max_error(y_test, y_pred)
    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f on test data.", score)
    return {"r2_score": score, "mae": mae, "max_error": me}

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=split_data,
                inputs=["model_input_table", "params:model_options"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )

The pipeline takes a pandas DataFrame and a dictionary of parameters as input, splits the data according to the parameters, and uses the result to train and evaluate a regression model.

With an integration test, we can validate that this sequence of nodes runs as expected.

As we did with our unit tests, we break this down into several steps:
1. Arrange: Prepare the runner and its inputs (the pipeline and the catalog), plus any additional test setup.
2. Act: Run the pipeline.
3. Assert: Ensure a successful run message was logged.

In [20]:
ds_pipeline = create_pipeline()

In [57]:
from kedro.io import DataCatalog
from kedro.runner import SequentialRunner

def test_data_science_pipeline(caplog):    # Note: caplog is passed as an argument
    # Arrange pipeline
    pipeline = ds_pipeline

    # Arrange data catalog
    catalog = DataCatalog()

    dummy_data = pd.DataFrame(
        {
            "engines": [1, 2, 3],
            "crew": [4, 5, 6],
            "passenger_capacity": [5, 6, 7],
            "price": [120, 290, 30],
        }
    )

    dummy_parameters = {
        "model_options": {
            "test_size": 0.2,
            "random_state": 3,
            "features": ["engines", "passenger_capacity", "crew"],
        }
    }

    catalog.add_feed_dict(
        {
            "model_input_table" : dummy_data,
            "params:model_options": dummy_parameters["model_options"],
        }
    )

    # Arrange the log testing setup
    caplog.set_level(logging.DEBUG, logger="kedro") # Ensure all logs produced by Kedro are captured
    successful_run_msg = "Pipeline execution completed successfully."

    # Act
    SequentialRunner().run(pipeline, catalog)

    # Assert
    assert successful_run_msg in caplog.text

In [58]:
# test_data_science_pipeline(caplog)
# Calling the test directly does not work here: caplog is a pytest fixture,
# so it is only available when pytest itself runs the test.
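
If you want to try the same kind of log assertion interactively, outside pytest, one option is to attach a standard-library logging handler yourself. The sketch below does not use Kedro at all: `ListHandler`, `fake_run`, and the logger name `demo` are hypothetical stand-ins, and `fake_run` merely logs the same message that the test above looks for.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collects formatted log messages in a list, roughly like caplog.text."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


def fake_run() -> None:
    """Hypothetical stand-in for a pipeline run that logs a success message."""
    logging.getLogger("demo").info("Pipeline execution completed successfully.")


# Arrange: attach the handler and make sure INFO records are let through
handler = ListHandler()
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    # Act
    fake_run()

    # Assert
    assert any("completed successfully" in m for m in handler.messages)
finally:
    # Cleanup: detach the handler so that later cells are unaffected
    logger.removeHandler(handler)
```

This is only a way to experiment in a notebook. For the real test, let pytest run the test file so that it can supply `caplog`.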

## Testing best practices

<b>1. Where to write your tests:</b> We recommend creating a `tests` directory within the root directory of your project, mirroring the structure of `src`:

    <pre><code>
        
    src
    │   ...
    └───spaceflights
    │   └───pipelines
    │       └───data_science
    │           │   __init__.py
    │           │   nodes.py
    │           │   pipeline.py
    │
    tests
    │   ...
    └───pipelines
    │   └───data_science
    │       │   test_data_science_pipeline.py
        
    </code></pre>

<br>

<b>2. Using fixtures:</b> In our tests, `dummy_data` and `dummy_parameters` are defined three times with (mostly) the same values. Instead, we can define them once, outside of our tests, as pytest fixtures:

```python
    
import pandas as pd
import pytest


@pytest.fixture
def dummy_data():
    return pd.DataFrame(
        {
            "engines": [1, 2, 3],
            "crew": [4, 5, 6],
            "passenger_capacity": [5, 6, 7],
            "price": [120, 290, 30],
        }
    )


@pytest.fixture
def dummy_parameters():
    parameters = {
        "model_options": {
            "test_size": 0.2,
            "random_state": 3,
            "features": ["engines", "passenger_capacity", "crew"],
        }
    }
    return parameters
    
```
<br>

We can then access these fixtures through the test arguments:

```python
def test_split_data(dummy_data, dummy_parameters):
    ...
```
<br>

<b>3. Pipeline slicing:</b> In the test `test_data_science_pipeline`, we check that the data science pipeline, as currently defined, runs successfully. However, pipelines are not static and may gain or lose nodes over time, so this test is not robust. Instead, we should be specific about the pipeline under test; we do this by using pipeline slicing to specify the pipeline's start and end:

```python
def test_data_science_pipeline(caplog):
    # Arrange pipeline: slice explicitly from the first to the last node under test
    pipeline = create_pipeline().from_nodes("split_data_node").to_nodes("evaluate_model_node")
    ...
```

## Exercise

1. Read through the `test_pipeline.py` file at https://github.com/kedro-org/kedro-academy/blob/main/kedro-databricks-bootcamp/03_intermediate/rocketfuel/tests/pipelines/data_science/test_pipeline.py
2. Run your test file by following the documentation at https://docs.kedro.org/en/stable/tutorial/test_a_project.html#run-your-tests
