DEVELOPER.md


Developer Guide

Building Locally

Prerequisites:

  • Docker or Podman (note: unit tests run with Podman by default)
    • If using docker - make sure it's usable without sudo (guidelines)
    • If using podman - make sure it's set up to run rootless containers (guidelines)
  • Rust toolset
    • Install rustup
    • The correct toolchain version will be automatically installed based on the rust-toolchain file in the repository
  • Tools used by tests
    • Install jq - used to query and format JSON files
    • Install kubo (formerly known as go-ipfs) - for IPFS-related tests
  • Code generation tools (optional - needed if you will be updating schemas)
  • Cargo toolbelt
    • Prerequisites:
      • cargo install cargo-update - to easily keep your tools up-to-date
      • cargo install cargo-binstall - to install binaries without compiling
      • cargo binstall cargo-binstall --force -y - so that future updates of binstall use a precompiled version
    • Recommended:
      • cargo binstall cargo-nextest -y - advanced test runner
      • cargo binstall bunyan -y - for pretty-printing the JSON logs
      • cargo binstall cargo-llvm-cov -y - for test coverage
    • Optional - if you will be doing releases:
      • cargo binstall cargo-edit -y - for setting crate versions during release
      • cargo binstall cargo-update -y - for keeping up with major dependency updates
      • cargo binstall cargo-deny -y - for linting dependencies
      • cargo binstall cargo-udeps -y - for linting dependencies (detecting unused)
    • To keep all these cargo tools up-to-date use cargo install-update -a
  • Database tools (optional, unless modifying repositories is necessary):
    • Install Postgres command line client psql:
      • deb: sudo apt install -y postgresql-client
      • rpm: sudo dnf install -y postgresql
    • Install MariaDB command line client mariadb:
      • deb: sudo apt install -y mariadb-client
      • rpm: sudo dnf install -y mariadb
    • Install sqlx-cli: cargo binstall sqlx-cli -y

Clone the repository:

git clone git@github.com:kamu-data/kamu-cli.git

Build the project:

cd kamu-cli
cargo build

To use your locally-built kamu executable, link it like so:

ln -s $PWD/target/debug/kamu-cli ~/.local/bin/kamu

When needing to test against a specific official release, you can install it under a different alias:

curl -s "https://get.kamu.dev" | KAMU_ALIAS=kamu-release sh

New to Rust? Check out these IDE configuration tips.

Configure Podman as Default Runtime (Recommended)

Set podman as preferred runtime for your user:

cargo run -- config set --user engine.runtime podman

Tests and any kamu commands run from your user directory will now use the Podman runtime.

If you need to run some tests under Docker use:

KAMU_CONTAINER_RUNTIME_TYPE=docker cargo test <some_test>

Build with Databases

By default, we define the SQLX_OFFLINE=true environment variable to ensure compilation succeeds without access to a live database. The default mode is fine in most cases, as long as your assignment does not touch databases/repositories directly.

When databases have to be touched, set up local database containers using the following script:

make sqlx-local-setup

This mode:

  • creates Docker containers with empty databases
  • applies all database migrations from scratch
  • generates .env files in the affected crates that set DATABASE_URL variables pointing at the databases running in Docker containers, and disable the SQLX_OFFLINE variable in those crates

This setup ensures any SQL queries are automatically checked against the live database schema at compile time, which is highly useful when writing or modifying queries.

After the database-specific work is done, re-enable the default mode by running:

make sqlx-prepare
make sqlx-local-clean

The first step, make sqlx-prepare, analyzes the SQL queries in the code and regenerates the data used for offline query checking (the .sqlx directories). Commit these to version control to share the latest updates with other developers and to keep GitHub CI pipelines passing.

Note that running make lint will detect if re-generation is necessary before pushing changes. Otherwise, GitHub CI flows will likely fail to build the project due to database schema differences.

The second step, make sqlx-local-clean, reverses make sqlx-local-setup by:

  • stopping and removing Docker containers with the databases
  • removing .env files in database-specific crates, which re-enables SQLX_OFFLINE=true for the entire repository.

Database migrations

Any change to the database structure requires writing SQL migration scripts. The scripts are stored in ./migrations/<db-engine>/ folders and are unique per database type. The migration commands should be launched from within database-specific crate folders, such as ./src/database/sqlx-postgres; alternatively, you will need to define the DATABASE_URL variable manually.

Typical commands to work with migrations include:

  • sqlx migrate add --source <migrations_dir_path> <description> to add a new migration
  • sqlx migrate run --source <migrations_dir_path> to apply migrations to the database
  • sqlx migrate info --source <migrations_dir_path> to print information about migrations currently applied to the database

Run Linters

Use the following command:

make lint

This will do a number of highly useful checks:

  • Rust formatting check
  • License headers check
  • Dependencies check: detecting issues with existing dependencies, detecting unused dependencies
  • Rust coding practices checks (Clippy)
  • SQLX offline data check (sqlx data for offline compilation must be up-to-date with the database schema)

Run Tests

Before you run tests for the first time, you need to run:

make test-setup

This will download all necessary images for containerized tests.

You can run all tests except the database-specific ones with:

make test

In most cases, you can skip tests involving very heavy Spark and Flink engines and databases by running:

make test-fast

If testing with databases is required (including E2E tests), use:

make sqlx-local-setup # Start database-related containers 

make test-full # or `make test-e2e` for E2E only

make sqlx-local-clean

These are just wrappers on top of Nextest that control test concurrency and retries.

To run tests for a specific crate, e.g. opendatafabric use:

cargo nextest run -p opendatafabric

Build Speed Tweaks (Optional)

Building

Because Rust compiles natively, we often end up rebuilding very similar source code revisions (e.g. when switching between git branches).

This is where sccache can dramatically cut rebuild times by caching compilation results. After installing it in whatever way is convenient for you, configure it as follows in $CARGO_HOME/config.toml:

[build]
rustc-wrapper = "/path/to/sccache"

Alternatively you can use the environment variable RUSTC_WRAPPER:

export RUSTC_WRAPPER=/path/to/sccache # for your convenience, save it to your $SHELL configuration file (e.g. `.bashrc`, `.zshrc`, etc.)
cargo build

Linking

Consider configuring Rust to use lld linker, which is much faster than the default ld (may improve link times by ~10-20x).

To do so install lld, then update $CARGO_HOME/config.toml file with the following contents:

[build]
rustflags = ["-C", "link-arg=-fuse-ld=lld"]

Another alternative is the mold linker, which is also much faster than the default ld.

To do so, install mold (or build it from source with the clang++ compiler), then update your $CARGO_HOME/config.toml file with the following contents:

[build]
linker = "clang"
rustflags = ["-C", "link-arg=-fuse-ld=mold"]

Building with Web UI (Optional)

To build the tool with embedded Web UI you will need to clone and build the kamu-web-ui repo, or use a pre-built release. Then build the tool, enabling the optional feature and passing the location of the web root directory:

KAMU_WEB_UI_DIR=../../../../kamu-web-ui/dist/kamu-platform/ cargo build --features kamu-cli/web-ui

Note: we assume that kamu-web-ui repository directory will be at the same level as kamu-cli, for example:

.
└── kamu-data
    ├── kamu-cli
    └── kamu-web-ui

Note: in debug mode, the directory content is not actually embedded into the executable but is accessed from the specified directory.

Code Generation

Many core types in kamu are generated from schemas and IDLs in the open-data-fabric repository. If your work involves making changes to those - you will need to re-run the code generation tasks using:

make codegen

Make sure you have all related dependencies installed (see above) and that ODF repo is checked out in the same directory as kamu-cli repo.

Code Structure

This repository is built around our interpretation of Onion / Hexagonal / Clean Architecture patterns [1] [2].

In the /src directory you will find:

  • domain
    • Crates here contain implementation-agnostic domain model entities and interfaces for services and repositories
    • Crate directories are named after the domain they represent, e.g. task-system, while crate names will typically have the kamu-<domain> prefix
  • adapter
    • Crates here expose domain data and operations under different protocols
    • Crate directories are named after the protocol they are using, e.g. graphql, while crate names will typically have the kamu-adapter-<protocol> prefix
    • Adapters only operate on entities and interfaces defined in the domain layer, independent of specific implementations
  • infra
    • Crates here contain specific implementations of services and repositories (e.g. repository that stores data in S3)
    • Crate directories are named as <domain>-<technology>, e.g. object-repository-s3, while crate names will typically have the kamu-<domain>-<technology> prefix
    • The infrastructure layer only operates on entities and interfaces defined in the domain layer
  • app
    • Crates here combine all layers above into functional applications

Dependency Injection

This architecture relies heavily on the separation of interfaces from implementations and on the dependency inversion principle, so we use our homegrown dependency injection library dill to simplify gluing these pieces together.

Async

The system is built to be highly concurrent and, for better or worse, explicit async/await-style concurrency is now the most prevalent model in Rust libraries. Therefore:

  • Our domain interfaces (traits) use async fn for any non-trivial functions to allow concurrent implementations
  • Our domain traits are all Send + Sync
    • so implementations can be used as Arc<dyn Svc>
    • which requires implementations to use interior mutability

Error Handling

Our error handling approach is still evolving, but here are some basic design rules we settled on:

  • We don't return Box<dyn Error> or any fancier alternatives (like anyhow or error_stack) - we want users to be able to handle our errors precisely
  • We don't put all errors into a giant enum - this is as hard for users to handle as Box<dyn Error>
  • We are explicit about what can go wrong in every function - i.e. we define error types per function
  • Errors in domain interfaces typically carry Internal(_) enum variant for propagating errors that are not part of the normal flow
  • We never want ? operator to implicitly convert something into an InternalError - a decision that some error is not expected should be explicit
  • We want Backtraces everywhere, as close to the source as possible

With these ideas in mind:

  • We heavily use thiserror library to define errors per function and generate error type conversions
  • We use our own internal-error crate to concisely box unexpected errors into InternalError type

Test Groups

We use the homegrown test-group crate to organize tests in groups. The complete set of groups is:

  • containerized - for tests that spawn Docker/Podman containers
  • engine - for tests that involve any data engine or data framework (query, ingest, or transform paths), subsequently grouped by:
    • datafusion - tests that use Apache DataFusion
    • spark - tests that use Apache Spark
    • flink - tests that use Apache Flink
  • database - for tests that involve any database interaction, subsequently grouped by:
    • mysql - tests that use MySQL/MariaDB
    • postgres - tests that use PostgreSQL
  • ingest - tests that test data ingestion path
  • transform - tests that test data transformation path
  • query - tests that test data query path
  • flaky - special group for tests that sometimes fail and need to be retried (use very sparingly and create tickets)
  • setup - special group for tests that initialize the environment (e.g. pull container images) - this group is run by CI before executing the rest of the tests

Typical Workflows

Feature Branches

  1. Our policy is to keep the master branch always stable and ready to be released at any point in time, so all changes are developed on feature branches and merged to master only once they pass all the checks
  2. Continuous upkeep of our repo is every developer's responsibility, so before starting a feature branch check whether a major dependency update is due and, if so, perform it on a separate branch
  3. Please follow this convention for branch names:
    1. bug/invalid-url-path-in-s3-store
    2. feature/recursive-pull-flag
    3. refactor/core-crate-reorg
    4. ci/linter-improvements
    5. docs/expand-developer-docs
    6. chore/bump-dependencies
  4. Include a brief description of your changes under the ## Unreleased section of CHANGELOG.md in your PR
  5. (Recommended) Please configure git to sign your commits
  6. Branches should have coarse-grained commits with good descriptions - otherwise commits should be squashed
  7. Follow minor dependency update procedure - do it right before merging to avoid merge conflicts in Cargo.lock while you're maintaining your branch
  8. (Optional) We usually create a new release for every feature merged into master, so follow the release procedure
  9. Maintainers who merge branches should do so via git merge --ff-only and NOT rebasing to not lose commit signatures

Release Procedure

  1. Start by either creating a release branch or with an existing feature branch
  2. We try to stay up-to-date with all dependencies, so before every release we:
    1. Run cargo update to pull in any minor releases
    2. Run cargo upgrade --dry-run --incompatible and see which packages have major upgrades - either perform them or ticket them up
    3. Run cargo deny check to audit updated dependencies for licenses, security advisories etc.
  3. Bump the version using: make release-patch / make release-minor / make release-major
  4. Create a dated CHANGELOG entry for the new version
  5. Create PR, wait for tests, then merge as normal feature branch
  6. On master, tag the latest commit with the new version: git tag vX.Y.Z
  7. Push the tag to the repo: git push origin tag vX.Y.Z
  8. GitHub Actions will pick up the new tag and create a new GitHub release from it

Minor Dependencies Update

  1. Run cargo update to pull in any minor updates
  2. Run cargo deny check to audit new dependencies for duplicates and security advisories
  3. (Optional) Periodically run cargo clean to prevent your target dir from growing too big

Major Dependencies Update

  1. (Optional) Start by upgrading your local tools: cargo install-update -a
  2. Run cargo update to pull in any minor releases first
  3. Run cargo upgrade --dry-run and see which packages have major upgrades
  4. You can perform major upgrades crate by crate or all at once - it's up to you
  5. The tricky part is usually the arrow and datafusion family of crates; to upgrade them you need to:
    1. See what is the latest version of datafusion
    2. Go to datafusion repo, switch to corresponding tag, and check its Cargo.toml to see which version of arrow it depends on
    3. Update to those major versions. For example datafusion v32 depends on arrow v47, so the command is:
      cargo upgrade -p arrow@47 -p arrow-digest@47 -p arrow-flight@47 -p datafusion@32
    4. Note that arrow-digest is our repo versioned in lockstep with arrow, so if the right version of it is missing - you should update it as well
  6. If some updates prove to be difficult - ticket them up and leave a # TODO: comment in Cargo.toml
  7. Run cargo update again to pull in any minor releases that were affected by your upgrades
  8. Run cargo deny check to audit new dependencies for duplicates and security advisories
  9. (Optional) Periodically run cargo clean to prevent your target dir from growing too big

Building Multi-platform Images

We release multi-platform images to provide our users with native performance without emulation. Most of our images are built automatically by CI pipelines, so you may not have to worry about building them. Some images, however, are still built manually.

To build multi-platform images on a local machine we use docker buildx, which can create virtual builders that run QEMU for emulation.

This command is usually enough to get started:

docker buildx create --use --name multi-arch-builder

If you ever need to run an image built for a different architecture on Linux under emulation, use this command to bootstrap QEMU (source):

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

Tips

IDE Configuration

When using VSCode we recommend the following extensions:

  • rust-analyzer - Rust language server
    • Setting up clippy:
      // settings.json
      {
          // other settings
          "rust-analyzer.check.overrideCommand": "cargo clippy --workspace --all-targets"
      }  
  • Error Lens - to display errors inline with code
  • Even Better TOML - for editing TOML files
  • crates - displays dependency version status in Cargo.toml
    • Note: It's better to use cargo upgrade --dry-run when upgrading, to bump deps across the entire workspace

Debugging

Logs

When you run kamu, it automatically logs to .kamu/run/kamu.log. Note that the run directory is cleaned up between every command.

You can control the log level using standard RUST_LOG environment variable, e.g.:

RUST_LOG=debug kamu ...
RUST_LOG="trace,mio::poll=info" kamu ...

The log file is in Bunyan format, with one JSON object per line. It is intended to be machine-readable. When analyzing logs yourself, you can pipe them through the bunyan tool (see installation instructions above):

cat .kamu/run/kamu.log | bunyan

You can also run kamu with verbosity flags as kamu -vv ... for it to log straight to STDERR in a human-readable format.

Tracing

The kamu --trace flag records the execution of the program and opens Perfetto UI in a browser, making it easy to analyze async code execution and task performance.

Note: If you are using Brave or a similar high-security browser and get an error from Perfetto when loading the trace - try disabling the security features to allow the UI to fetch data from http://localhost:9001.

Perfetto UI displaying a trace