Oxidize Nodestream #288

Open
7 tasks
zprobst opened this issue Apr 10, 2024 · 0 comments

zprobst commented Apr 10, 2024

Justification

For nodestream, Rust offers a combination of advantages that no other language quite matches.

  1. Preposterously Fast: Rust's syntax and semantics give LLVM an enormous set of opportunities to optimize the crap out of Rust code through its notion of zero-cost abstractions. Speed is obviously of paramount importance on ingest.
  2. Strict Typing: As nodestream's code has grown, we have encountered scalability issues in terms of type-correctness in the code. While that can be solved in Python with tools like mypy, we get it out of the box in Rust. Rust also forces you to think about typing, which is going to be a critical aspect of redesigning the types as we port them.

With these two advantages, Rust can do all the heavy lifting. That means the code that users write is really the only code that remains in Python. Writing this code in Python is still important, as it is central to nodestream's value proposition of supporting any use case and allowing people to get up and running quickly. However, everything that the user doesn't write can be executed very fast across the full breadth of the machine's resources, maximizing efficiency.

Additionally, using rust as a platform, we can develop changes and new features that increase nodestream's reliability.

Technical Dependencies

By leveraging core crates in the Rust ecosystem, we can improve nodestream.

pyo3

PyO3 provides Rust bindings for the Python interpreter. It allows for control flow both from and to Python. In other words, the entry point for the application can be Rust, which calls Python, or Python, which calls Rust. In most of the packages outlined earlier, Python calling Rust is what has happened. In the short term, nodestream could do the same. In the medium term, Rust can be the entry point. Doing so brings the advantages of Rust's far more advanced and fast async ecosystem, with crates like tokio. This frees us from the constraints of the GIL, allowing for true parallelism, reduced memory usage, and faster builtin components.
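
To make the "true parallelism" point concrete, here is a minimal sketch using only the Rust standard library (no PyO3 or tokio). The `score` function and the record shape are hypothetical stand-ins for a CPU-bound builtin component; nothing here is nodestream's actual API.

```rust
use std::thread;

/// Hypothetical CPU-bound step: a toy checksum over one record.
fn score(record: &str) -> u64 {
    record.bytes().map(u64::from).sum()
}

/// Splits `records` into at most `workers` chunks and scores each chunk
/// on its own OS thread. No interpreter lock serializes the threads.
fn score_in_parallel(records: &[String], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Round up so every record lands in a chunk; `chunks` panics on size 0.
    let chunk_size = ((records.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = records
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(|r| score(r)).sum::<u64>()))
            .collect();

        // Wait for every worker and combine the partial results.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

Calling `score_in_parallel(&records, 4)` gives the same total as scoring the records one by one, but the chunks are processed on separate threads at the same time.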

maturin

From the README

Build and publish crates with pyo3, rust-cpython, cffi and uniffi bindings as well as rust binaries as python packages with minimal configuration. It supports building wheels for python 3.8+ on windows, linux, mac and freebsd, can upload them to pypi and has basic pypy and graalpy support.

This will take over build-system duties from poetry (and possibly make poetry completely obsolete) for the way that we build and publish nodestream to PyPI.

Project Approach

nodestream has a lot of different modules and plugins. This issue only accounts for the core library, nodestream. Eventually, a similar approach may be taken for some of the other plugins as well.

Some of the sub-packages of nodestream are more coupled to other parts of the code than others. Ideally, we minimize the number of times we incur FFI overhead when converting from Rust to Python and back, and balance that against a logical sequencing that delivers more benefit to users in the meantime. Based on some research of the code we have in nodestream today, I've identified a roughly outside-in approach that should make the porting as easy as possible.

There are three phases. Each phase would likely need to correspond with a new 0.x release.

Phase 1

Phase one is largely about laying the foundation. It consists of porting two sections of the code to Rust objects.

  • The nodestream.model package that contains core data types such as Node, Relationship, and others that are used a lot at runtime.
  • The nodestream.schema package that contains other primitives such as Migration Operations.

These sections of the code are relatively decoupled from the rest of the code and make few "outbound" calls from their code to some other part of nodestream. This makes them easiest to port and a good "testing ground" for getting things going with rust in nodestream.
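
As a rough illustration of what porting these data types could look like, here is a sketch in plain Rust. The field names, the `PropertyValue` variants, and the builder methods are all assumptions made for illustration; they are not the current nodestream.model API. In a real port, types like these would presumably be exposed to Python through PyO3.

```rust
use std::collections::BTreeMap;

/// Hypothetical closed set of property value types. The compiler forces
/// every consumer to handle each variant explicitly.
#[derive(Debug, Clone, PartialEq)]
enum PropertyValue {
    Text(String),
    Integer(i64),
    Float(f64),
    Boolean(bool),
}

/// Hypothetical node shape: a type name, identifying keys, and other properties.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    node_type: String,
    key_values: BTreeMap<String, PropertyValue>,
    properties: BTreeMap<String, PropertyValue>,
}

impl Node {
    fn new(node_type: &str) -> Self {
        Node {
            node_type: node_type.to_string(),
            key_values: BTreeMap::new(),
            properties: BTreeMap::new(),
        }
    }

    /// Adds a value that participates in identifying the node.
    fn with_key(mut self, name: &str, value: PropertyValue) -> Self {
        self.key_values.insert(name.to_string(), value);
        self
    }

    /// Adds a non-identifying property.
    fn with_property(mut self, name: &str, value: PropertyValue) -> Self {
        self.properties.insert(name.to_string(), value);
        self
    }

    /// Assumed rule for this sketch: a node needs at least one key value.
    fn is_identifiable(&self) -> bool {
        !self.key_values.is_empty()
    }
}

/// Hypothetical relationship shape, mirroring `Node` without the keys.
#[derive(Debug, Clone, PartialEq)]
struct Relationship {
    relationship_type: String,
    properties: BTreeMap<String, PropertyValue>,
}
```

A node built as `Node::new("Person").with_key("name", PropertyValue::Text("Ada".to_string()))` is identifiable under the rule assumed here, while a bare `Node::new("Person")` is not.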

Obviously, this phase would also include creating the initial wiring of PyO3 and changing the packaging to include the Rust libraries. For this phase, the packaging is the likely trigger for a breaking release, as it will require different build system packages with different system requirements. The actual porting of the code in the nodestream.model and nodestream.schema packages can likely be done with no breaking changes, with the bound types in Python being identical in API to the current ones.

Phase 2

Phase two is about reversing the directionality of the FFI. As mentioned before, if we were to change Rust to call Python instead of the other way around, it would allow us to take better advantage of Rust's benefits. This release is intended to focus on porting:

  • The nodestream.project package which contains some logic as well as the definition of how to parse and use the nodestream.yaml file.
  • The nodestream.cli package which contains the CLI bindings that generally call methods on the project interface.

This phase will probably introduce breaking changes in both nodestream.yaml and the interface to the CLI as we port from one model to the other. This may also be an ideal time to redesign nodestream.yaml in a way that allows for #240.

Phase 3

Phase 3 is about truly taking advantage of Rust. The remaining major parts of the code that need to be ported are:

  • The nodestream.interpreting package which defines how data gets mapped from records in the pipeline to nodes and relationships.
  • The nodestream.database package which contains a lot of interface definitions for interacting with the database as well as common behavior that is shared amongst all database interactions.
  • The nodestream.pipeline package which contains all of the runtime aspects of executing the pipeline including builtin extractors, filters, transformers and the pipeline itself.

These three packages are relatively intertwined with one another. That is a big part of why they need to be handled together. If we did not, calls would be constantly pinging back and forth from python to rust far more than is ideal. Unfortunately, this is where the vast majority of the benefit of moving to rust would be realized. By doing this phase, we can leverage true parallelism as well as build correctness validations that we simply cannot in python without incurring huge runtime performance costs.
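
One way the pipeline could use that parallelism is to run each stage on its own thread and connect the stages with channels. The sketch below uses only the standard library. The stage logic (dropping empty records, uppercasing the rest) and the `Record` alias are hypothetical placeholders, not a description of how nodestream.pipeline works today.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical record type; a real pipeline would carry structured data.
type Record = String;

/// Runs an extractor, a filter/transformer, and a collecting writer as
/// concurrent stages connected by channels.
fn run_pipeline(input: Vec<Record>) -> Vec<Record> {
    let (extracted_tx, extracted_rx) = mpsc::channel::<Record>();
    let (transformed_tx, transformed_rx) = mpsc::channel::<Record>();

    // Stage 1: the extractor emits records one at a time.
    let extractor = thread::spawn(move || {
        for record in input {
            if extracted_tx.send(record).is_err() {
                break; // downstream stage has gone away
            }
        }
        // `extracted_tx` is dropped here, which closes the channel.
    });

    // Stage 2: filter out empty records, transform the rest.
    let transformer = thread::spawn(move || {
        for record in extracted_rx {
            if record.is_empty() {
                continue;
            }
            if transformed_tx.send(record.to_uppercase()).is_err() {
                break;
            }
        }
    });

    // Stage 3: the "writer" collects results on the calling thread until
    // the transformer drops its sender.
    let output: Vec<Record> = transformed_rx.into_iter().collect();

    extractor.join().expect("extractor thread panicked");
    transformer.join().expect("transformer thread panicked");
    output
}
```

Because each channel has a single producer, record order is preserved from input to output, while the stages overlap in time instead of running one after another.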

This work effectively forces us to do #187

@zprobst zprobst added the enhancement New feature or request label Apr 10, 2024
@zprobst zprobst self-assigned this Apr 10, 2024