This library enables functional testing for Keboola components and processors by comparing expected and real output directories.
# Basic installation
pip install keboola-datadirtest
# With VCR support (HTTP recording/replay)
pip install "keboola-datadirtest[vcr]"
# With all features
pip install "keboola-datadirtest[all]"In the tests folder create a directory structure mimicking the directory structure in production:
/path/to/project/tests
└─functional
└─test-name
├─expected-code
├─expected
│ └─data
│ └─out
│ ├─files
│ └─tables
├─source
│ └─data
│ ├─ config.json
│ ├─ set_up.py
│ ├─ tear_down.py
│ └─in
│ ├─files
│ ├─tables
source- contains data folder that would be on the input of the component- it may contain
set_up.pyandtear_down.pyscripts that are executed before and after each test respectively.
- it may contain
expected- contains data folder that is result of the execution against thesourcefolder. Include only folder that contain some files, e.g.expected/files/out/file.json
The DataDirTester looks for the component.py script and executes it against the specified source folders,
the component.py should expect the data folder path in the environment variable KBC_DATADIR.
By default it looks for the script at this path:
/path/to/project
└─src
└─component.py
Then create file test_functional.py in the /path/to/project/tests folder and input the following:
import unittest
from keboola.datadirtest import DataDirTester
class TestComponent(unittest.TestCase):
def test_functional(self):
functional_tests = DataDirTester()
functional_tests.run()
if __name__ == "__main__":
unittest.main()The DataDirTester class accepts the following parameters:
save_output(default:False): IfTrue, the resulting data folders are saved in theoutputfolder.selected_tests(list ofstr): If set, only the tests with the specified names are executed.
Both parameters can also be passed via environment variables: DIRTEST_SAVE_OUTPUT and DIRTEST_SELECTED_TESTS (comma-separated names).
Then run your tests as usual e.g. via python -m unittest discover from the root folder.
Alternatively run as:
python -m datadirtest /path/to/project/tests/functional [optionally path/to/project/script.py]
It is possible to use environment variables placeholders inside the test config.json.
To do so use following syntax: {{env.VARIABLE_NAME}}. The environment variable VARIABLE_NAME will be expected and it's value will replace the placeholder.
If the EVN variable is not present, the test will fail.
Example
{
"parameters": {
"#api_key": "{{env.API_KEY}}",
"since": "1 day ago"
}
}In the following example the #api_key value will be replaced by API_KEY ENV variable value.
In some cases you want to modify the DataDirTest behaviour for instance to prepare the environment for each test, or execute some script prior the actual test run.
To achieve this you may extend the datadirtest.TestDataDir class. This class is a Test container for each of the
test data folders and are being triggered as a part of a test suite via DataDirTester.run() method.
The DataDirTester class takes two additional arguments that allow to specify both the class extending the DataDirTester
with additional functionality and also a context (parameters) that are passed to each DataDirTester class instance.
test_data_dir_class: Type[TestDataDir]- a class with additional functionality to execute datadir tests. E.g.MyCustomDataTest(TestDataDir)context_parameters- a dictionary with arbitrary context parameters that are then available in eachTestDataDirinstance viaTestDataDir.context_parameters
** Example:**
The below code instantiates a pseudo SqlClient and runs the same sequence of queries before each DataDirtest execution.
import unittest
from keboola.datadirtest import DataDirTester, TestDataDir
class CustomDatadirTest(TestDataDir):
def setUp(self):
sql_client = self.context_parameters['sql_client']
sql_client.run_query('DROP TABLE IF EXISTS T;')
sql_client.run_query('CREATE TABLE T AS SELECT 1 AS COLUMN;')
class TestComponent(unittest.TestCase):
def test_functional(self):
sql_client = SqlClient("username", "password", "localhost")
functional_tests = DataDirTester(test_data_dir_class=CustomDatadirTest,
context_parameters={'sql_client': sql_client})
functional_tests.run()
if __name__ == "__main__":
unittest.main()You may specify custom scripts that are executed before or after the test execution. Place them into the source folder:
├─source
│ └─data
│ ├─ config.json
│ ├─ set_up.py
│ ├─ post_run.py
│ ├─ tear_down.py
│ └─in
│ ├─files
│ ├─tables
Usage
Each script (set_up.py, post_run.py and tear_down.py) must implement a run(context: TestDataDir) method. The context parameter then includes the parent
TestDataDir instance with access to context_parameters if needed. Both script files are optional. If file is found but there is no run() method defined,
the execution fails.
The set_up.py and tear_down.py are executed before and after the DataDirTest itself. The post_run.py is useful to run a script right after the component script, before the resulting data is modified. See the diagram below:
For instance, the set_up.py may contain following code:
from keboola.datadirtest import TestDataDir
def run(context: TestDataDir):
# get value from the context parameters injected via DataDirTester constructor
sql_client = context.context_parameters['sql_client']
sql_client.run_query('DROP TABLE IF EXISTS T;')
sql_client.run_query('CREATE TABLE T AS SELECT 1 AS COLUMN;')
print("Running before script")It will run the above script specific for the current test (folder) before the actual test execution
Chained tests are useful in scenarios when it is necessary to run several tests in a sequence and pass the component state between them.
You can chain tests by including them in additional folder that contains normal tests.
The folder can also contain the set_up and tear_down scripts that will be executed before and after the group of chained tests.
They also share all the other parameters like context.
e.g.:
/path/to/project/tests
└─functional
├─my-chained-test
│ ├─ set_up.py
│ ├─ tear_down.py
│ ├─01-first-test
│ │ ├─expected
│ │ │ └─data
│ │ ├─source
│ │ │ └─data
│ └─02-second-test
│ ├─expected
│ └─source
└─another-normal-test
The tests are executed in alphabetical order so it's recommended to prefix your tests with numbers, e.g. 01_,02_, ..
The execution flow of the example my-chained-test looks like this:
- run root
set_up.py - run
01-first-test - pass
01-first-testresultout/state.jsonto02-second-test/source/data/in/state.json- If
02-second-test/source/data/in/state.jsonalredy exists it is overridden!
- If
- run
02-second-test - run
tear_down_py
Note that the tests in the functional folder are executed in random order.
The VCR module enables recording and replaying HTTP interactions, making tests:
- Deterministic - Same responses every time
- Fast - No real HTTP calls during replay
- CI-friendly - No credentials needed in CI/CD pipelines
# Install with VCR support
pip install keboola-datadirtest[vcr]
# Install with pytest support
pip install keboola-datadirtest[pytest]
# Install all features
pip install keboola-datadirtest[all]python -m datadirtest tests/functional src/component.py --recordThis creates cassette files in source/data/cassettes/requests.json for each test.
python -m datadirtest tests/functional src/component.pyBy default, VCR automatically replays if cassettes exist.
tests/functional/
└── test_api_extraction/
├── source/
│ └── data/
│ ├── config.json # Config with placeholders
│ ├── config.secrets.json # Real credentials (gitignored)
│ ├── cassettes/
│ │ └── requests.json # Recorded HTTP interactions
│ └── sanitizers.py # Optional custom sanitizers
└── expected/
└── data/
└── out/
└── tables/
└── main.csv
Create config.secrets.json (add to .gitignore) with real credentials:
{
"token": "real_api_key_here",
"password": "secret_password"
}In config.json, use placeholders:
{
"parameters": {
"#token": "{{secret.token}}",
"#password": "{{secret.password}}"
}
}During recording, secrets are automatically merged and sanitized from cassettes.
# Run with auto mode (replay if cassettes exist)
python -m datadirtest tests/functional src/component.py
# Record new cassettes
python -m datadirtest tests/functional src/component.py --record
# Force replay (fail if no cassettes)
python -m datadirtest tests/functional src/component.py --replay
# Update existing cassettes
python -m datadirtest tests/functional src/component.py --update-cassettes
# Run without VCR (original behavior)
python -m datadirtest tests/functional src/component.py --no-vcr
# Run specific tests
python -m datadirtest tests/functional src/component.py --tests test_basic,test_advanced
# Verbose output with full diffs
python -m datadirtest tests/functional src/component.py --verbose
# Custom freeze time
python -m datadirtest tests/functional src/component.py --freeze-time 2024-06-15T10:00:00from keboola.datadirtest import VCRDataDirTester
# Run with automatic VCR
tester = VCRDataDirTester(
data_dir='tests/functional',
component_script='src/component.py',
vcr_mode='auto' # 'record', 'replay', or 'auto'
)
tester.run()Create source/data/sanitizers.py to customize how sensitive data is redacted:
from keboola.datadirtest.vcr import BaseSanitizer
class AccountIdSanitizer(BaseSanitizer):
def before_record_request(self, request):
# Replace account IDs in URLs
request.uri = request.uri.replace("act_123456", "act_REDACTED")
return request
def get_sanitizers(config: dict) -> list:
"""Return sanitizers for this test."""
return [AccountIdSanitizer()]Generate test folders from a definitions file:
# Create test structure and record cassettes
python -m datadirtest scaffold definitions.json tests/functional src/component.py
# Create structure only (no recording)
python -m datadirtest scaffold definitions.json tests/functional --no-recordExample definitions.json:
[
{
"name": "test_basic_extraction",
"config": {
"parameters": {"endpoint": "/api/data"}
},
"secrets": {"token": "real_api_key"},
"description": "Basic extraction test"
}
]Add to your conftest.py:
pytest_plugins = ["keboola.datadirtest.vcr.pytest_plugin"]Run with pytest:
# Auto mode
pytest tests/
# Record mode
pytest tests/ --vcr-record
# Replay mode (strict)
pytest tests/ --vcr-replay- Always gitignore secrets: Add
config.secrets.jsonto.gitignore - Commit cassettes: Cassettes should be committed to version control
- Re-record when API changes: Use
--update-cassetteswhen APIs change - Use meaningful test names: Helps identify which cassette belongs to which test
- Review cassettes: Check that no sensitive data leaked into recordings
