
docs: dataflows: Improve docs #1279

Closed
johnandersen777 opened this issue Dec 10, 2021 · 26 comments


johnandersen777 commented Dec 10, 2021

These are notes and scratch work around the purpose and future of the project.

Mission: Provide a clear, meticulously validated, ubiquitously adopted reference architecture for an egalitarian Artificial General Intelligence (AGI) which respects the first law of robotics.

To do so we must enable the AGI to act in response to the current system context, to predict possible future system contexts, and to understand which of the future system contexts it wishes to pursue are acceptable according to guiding strategic plans (such as do no harm). We must also ensure that human and machine can interact via a shared language, the universal blueprint.

AI has the potential to do many great things. However, it also has the potential to do terrible things. Recently there was an example of scientists who used a model that was good at generating life-saving drugs, in reverse, to generate deadly poisons. GPU manufacturers recently implemented anti-crypto-mining features. Since the ubiquitous unit of parallel compute is a GPU, this stops people from buying up GPUs for what we as a community at large have deemed undesirable behavior (hogging all the GPUs). There is nothing stopping those people from building their own ASICs to mine crypto. However, the market for that is a subset of the larger GPU market: cost per unit goes up, multi-use capabilities go down. GPU manufacturers are effectively able to ensure that the greater good is looked after because GPUs are the ubiquitous facilitator of parallel compute. If we prove out an architecture for an AGI that is robust, easy to adopt, and integrates with the existing open source ecosystem, we can bake this looking after the greater good into it.

As we democratize AI, we must be careful not to democratize AI that will do harm. We must think secure by default in terms of an architecture which has facilities for guard rails, baking safety into AI.

Failure to achieve ubiquitous adoption of an open architecture with meticulously audited safety controls will result in further consolidation of wealth and widening inequality.

johnandersen777 added the documentation label Dec 10, 2021
johnandersen777 self-assigned this Dec 10, 2021
@johnandersen777

By convention, when an operation has a single output, we usually name that output result.
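
For illustration, a minimal sketch of that convention using the dffml op decorator and Definition class (the operation and definition names here are made up):

from dffml import Definition, op

# Hypothetical definitions, for illustration only
NUMBERS = Definition(name="numbers", primitive="List[int]")
TOTAL = Definition(name="total", primitive="int")

# Single output, so by convention the output key is "result"
@op(inputs={"numbers": NUMBERS}, outputs={"result": TOTAL})
async def add_numbers(numbers: list) -> dict:
    return {"result": sum(numbers)}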


johnandersen777 commented Dec 10, 2021

  • DataFlows are all effectively typed streams
  • Complete state transition of system


johnandersen777 commented Dec 14, 2021

  • Manifest is a domain specific way to describe system state
  • Manifests to describe pipelines, BOMs, tests
    • We need this so that we can do our polyrepo setup
  • Manifest contains enough information to reproduce the run
    • Decouples intent from implementation
      • Reduces lock in
      • In the absence of the tool (e.g. it has gone EOL) one could implement the operations to reproduce the execution described by the manifest
  • Human and machine editable text files
  • Will be checked in to version control
    • Added to PR comments to allow for version overrides

@johnandersen777

  • You could have a manifest for anything
    • Any CLI tool could have a manifest which could be converted into its CLI args (see the sketch after this list)
      • See subprocess orchestrator branch
  • Convert from manifest into data flow description
    • Ideally well defined, machine readable (auditable), and writable
    • consoletest even plays with documentation as manifest / dataflow
  • Manifests are a problem space specific way of defining a dataflow
    • We convert to data flow so people have a common way to understand implementations for different problem spaces
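
As a rough sketch of the CLI args idea in plain Python (the manifest fields and the curl invocation are hypothetical, not a defined manifest format):

import shlex

# Hypothetical manifest describing one invocation of a CLI tool
manifest = {
    "cmd": "curl",
    "args": {"--output": "page.html", "--max-time": "30"},
    "positional": ["https://example.com"],
}

def manifest_to_cli(manifest):
    # Flatten the manifest into an argv style list
    argv = [manifest["cmd"]]
    for flag, value in manifest["args"].items():
        argv.extend([flag, value])
    argv.extend(manifest["positional"])
    return argv

print(shlex.join(manifest_to_cli(manifest)))
# curl --output page.html --max-time 30 https://example.com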


johnandersen777 commented Dec 14, 2021

  • Data Flow is a generic representation of program flow
    • Implement executors or synthesizers based off of the dataflow (Orchestrators)
      • Instead of working directly from the manifest, these can be reused across manifest formats
      • Allows us to template all of our plugins in our polyrepo setup
    • Data Flow maps well to concepts like a Jenkins Pipeline, GitHub Actions Workflow, etc.
      • Operations within flows map to a Jenkins Step, GitHub Actions Action, etc.
  • Security
    • Allows for automated auditing

@johnandersen777

  • Orchestrator allows for
    • Switching execution method easily
      • Run on my local machine to test
      • Run in kubernetes cluster
      • Run in Intel DevCloud for access to machines with ML hardware
    • Local development
      • Rapid iteration on CI jobs
      • No need to push repo to validate CI is working
      • Finally we can support running all tests locally
  • Operation abstraction layer allows for
    • Overrides
      • Use implementation X when running in k8s, do Y when running in DevCloud, do Z when running locally (see the sketch below)
    • Overlays
      • Add to / extend flows when in different environments, for different purposes, or for different deployment models
    • https://intel.github.io/dffml/examples/dataflows.html
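
A conceptual sketch of the override idea in plain Python (not the actual orchestrator API; the environment variable and operation names are made up):

import os

# Hypothetical implementations of the same operation per environment
def clone_repo_local(url):
    ...

def clone_repo_k8s(url):
    ...

def clone_repo_devcloud(url):
    ...

OVERRIDES = {
    "local": clone_repo_local,
    "k8s": clone_repo_k8s,
    "devcloud": clone_repo_devcloud,
}

# The orchestrator would pick the implementation for the current environment
clone_repo = OVERRIDES[os.environ.get("EXEC_ENV", "local")]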


johnandersen777 commented Dec 14, 2021

  • Zero-th phase parser
    • Every format has a parser
    • Formats with multiple versions need parser per version
      • You might maintain the version 1 parser in the 1.x branch, version 2 in the 2.x branch, etc.
  • Write "next phase" parsers specific to task and format name and format version
    • Generate data flow description of format (in case of a manifest)
  • First line of decision making
    • The shim layer acts as a reverse proxy (see the sketch after this list)
    • It validates, parses, and directs to the appropriate next phase
    • The next phase only ever needs to worry about its job
      • Verification that we should be executing is done (security)
      • Validation that the format conforms to the schema is done (complete and correct)
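
A sketch of the shim idea (the envelope keys $format_name and $format_version and the parser registry are assumptions for illustration):

import pathlib
import sys

import yaml  # pip install pyyaml

def parse_my_format_v1(document):
    # Next phase parser: only worries about its own job
    return document

# (format name, format version) -> next phase parser
NEXT_PHASE_PARSERS = {
    ("my-format-name", "1.0.0"): parse_my_format_v1,
}

def shim(path):
    # Zero-th phase: parse the envelope, validate, direct to next phase
    document = yaml.safe_load(pathlib.Path(path).read_text())
    key = (document["$format_name"], document["$format_version"])
    if key not in NEXT_PHASE_PARSERS:
        raise ValueError(f"unknown format/version: {key}")
    return NEXT_PHASE_PARSERS[key](document)

if __name__ == "__main__":
    print(shim(sys.argv[-1]))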


johnandersen777 commented Dec 14, 2021

  • What does this allow us to do?
    • downstream validation of all DFFML plugins
    • throw all data from every execution into a data lake

@johnandersen777

Generic flow (data, work, program) executor


johnandersen777 commented Jan 5, 2022

Why: unikernels
Can build the smallest possible attack surface
Could even build silicon / RTL to optimize for a specific data flow

@johnandersen777

  • Show how we convert from arbitrary manifests into the dataflow format
    • Will likely need to flesh out config / object loading.
    • Probably need to support full object paths and then validate that they are BaseConfigurables
  • Need to support secret "unlock" stuff, this is probably similar to dataflow for config
    • This also amounts to: my input goes through this flow to become a different value


johnandersen777 commented Jan 14, 2022

Manifest Schema

Manifests allow us to focus less on code and more on data. By focusing on the
data going into and out of systems, we can achieve standard documentation of
processes via a standard interface (manifests).

Our manifests can be thought of as ways to provide a config class with its
parameters, or ways to provide an operation with its inputs.


Validating

Install the jsonschema and pyyaml Python modules:

pip install pyyaml jsonschema

This is how you convert from YAML to JSON:

$ python -c "import sys, pathlib, json, yaml; pathlib.Path(sys.argv[-1]).write_text(json.dumps(yaml.safe_load(pathlib.Path(sys.argv[-2]).read_text()), indent=4) + '\n')" manifest.yaml manifest.json

The example below validates; checking the status code, we see exit code 0, which means
success: the document conforms to the schema.

$ jsonschema --instance manifest.json manifest-format-name.0.0.2.schema.json
$ echo $?
0
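
The same check can be done programmatically; a small sketch using the modules installed above (file names match the shell example):

import json
import pathlib

import jsonschema
import yaml

manifest = yaml.safe_load(pathlib.Path("manifest.yaml").read_text())
schema = json.loads(
    pathlib.Path("manifest-format-name.0.0.2.schema.json").read_text()
)

# Raises jsonschema.exceptions.ValidationError if the document does not conform
jsonschema.validate(instance=manifest, schema=schema)
print("manifest.yaml conforms to the schema")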

Writing

Suggested process (in flux)

  • Make sure you can run the jsonschema validator

    • TODO Validation micro service
  • Look at existing problem space

    • What data is needed? This will likely become the inputs of a dataflow,
      or an operation, or a config.

    • Write first draft of what a valid manifest would be

  • Write schema based off initial manifest

    • Do not include fields for future use. Only include what you currently intend
      to use for each version

      • Instead, create a new format name and new schema. If we stick to the rule
        that if you have the data you have to act on it, there are never any "if A
        then B" situations. If you want a different outcome, you create a different
        manifest. This helps keep architectures loosely coupled
        https://medium.com/@marciosete/loosely-coupled-architecture-6a2b06082316

      • We also decided that we could potentially combine manifests. This allows
        you to use the data you wanted, but keep it separate; the decision to
        combine (the equivalent of adding variables) is made purely as conditional
        on the use of data. This way, if the data is present, it is always used!

      • By ensuring that data present is always used, we can begin to map manifests to
        dataflows, in this way, we can check the validity of a dataflow simply by ensuring
        all manifest data is used as an input or config.

        • As such, a passing validity check ensures we have a complete description of a
          problem. We know all the inputs and system constraints (manifests), and we are
          sure that they will be taken into account on execution (dataflow run).
    • Each field with a type MUST have a description

  • Write ADR describing context around creation and usage of manifest

    • The ADR should describe how the author intends the manifest to be used

    • Treat the ADR + manifest like a contract. If something
      accepts the manifest (valid format and version, see shim)
      it is obligated to fulfil the intent of the ADR. The consumer
      MUST return an error response when given a manifest if it
      cannot use each piece of data in the manifest as directed by
      the ADR and descriptions of fields within the manifest schema.

    • The Intent section of the ADR should describe how you want manifest
      consumers to use each field.

ADR Template

my-format-name
##############

Version: 0.0.1
Date: 2022-01-22

Status
******

Proposed|Evolving|Final

Description
***********

ADR for a declaration of assets (manifest) involved in the process
of greeting an entity.

Context
*******

- We need a way to describe the data involved in a greeting

Intent
******

- Ensure valid communication path to ``entity``

- Send ``entity`` message containing ``greeting``


johnandersen777 commented Jan 21, 2022

State transition, issue filing, and estimating time to close an issue all have to do with having the complete mapping of inputs to the problem (data flow). If we have an accurate mapping then we have a valid flow, and we can create an estimate where we understand how we created the estimate, because we have a complete description of the problem. See also: estimation of GSoC project time, estimation of time to complete best practices badging program activities, and time to complete any issue. This helps with prioritization of who in an org should work on what, and when, to unblock others in the org. Related to the builtree discussion.


johnandersen777 commented Jan 28, 2022

We use dataflows because they are a declarative approach which allows you to define different implementations based on different execution environments, or even swap out pieces of a flow or do overlays to add new pieces.

They help solve the fork and pull from upstream issue. When you fork code and change it, you need to pull in changes from the upstream (the place you forked it from). This is difficult to manage with the changes you have already made, using a dataflow makes this easy, as we focus on how the pieces of data should connect, rather than implementations of their connections.

This declarative approach is important because the sources of inputs change depending on your environment. For example, in CI you might grab from an environment variable populated from secrets. In your local setup, you might grab from the keyring.

@johnandersen777

Notes from work in progress tutorial:

We need to come up with several metrics to track and plot throughout.
We also need to plot in relation to other metrics for tradeoff analysis.

We could also make this like a choose your own adventure style tutorial,
if you want to do it with threads, here's your output metrics. We can
later show that we're getting these metrics by putting all the steps
into a dataflow and getting the metrics out by running them. We could then
show how we can ask the orchestrator to optimize for speed, memory, etc.
Then add in how you can have the orchestrator take those optimization
constraints from dynamic conditions such as how much memory is on the
machine you are running on, or whether you have access to a k8s cluster. Also
talked about the power consumption vs. speed trade off for server vs. desktop.
Could add in edge constraints like network latency.

Will need to add in metrics API and use in various places in
orchestrators and expose to operations to report out. This will be the
same APIs we'll use for stub operations to estimate time to completion,
etc.

  • Make sure to measure speed and memory usage with ProcessPoolExecutor and
    ThreadPoolExecutor. Make sure we take into account memory from all
    processes (see the measurement sketch after this list).

  • Start to finish speed

    • Plot with number of requests made
  • Memory consumed

    • Plot with number of requests made
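
A sketch of how the speed / memory measurement might work (psutil is an assumed extra dependency, used here to sum memory across all worker processes; the workload is a stand-in for a download):

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import psutil  # pip install psutil

def work(n):
    # Stand-in for a download or other unit of work
    return sum(i * i for i in range(n))

def measure(executor_cls, requests):
    parent = psutil.Process()
    peak_rss = 0
    start = time.perf_counter()
    with executor_cls() as executor:
        for future in [executor.submit(work, 1_000_000) for _ in range(requests)]:
            future.result()
            # Account for memory of the parent and all child processes
            rss = parent.memory_info().rss
            for child in parent.children(recursive=True):
                try:
                    rss += child.memory_info().rss
                except psutil.NoSuchProcess:
                    pass
            peak_rss = max(peak_rss, rss)
    return time.perf_counter() - start, peak_rss

if __name__ == "__main__":
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        elapsed, peak = measure(executor_cls, requests=8)
        print(f"{executor_cls.__name__}: {elapsed:.2f}s, peak ~{peak / 2**20:.0f} MiB")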

This could be done as an IPython notebook.

  • Show basic downloader code (the first two steps are sketched after this list)

    • Observe speed bottleneck due to download in series
  • Parallelize download code

    • Observe increase in speed

    • Observe error handling issues

  • Add in need to call out via subprocess

    • Observe subprocess issues
  • Move to event loop

    • Observe increase in speed (? Not sure on this yet)

    • Observe successful error handling

    • Observe need to track fine grained details

  • Move to event based implementation with director (orchestrator, this file
    minus prev pointers in Base Event)

    • Observe visibility into each event state of each request

    • Observe lack of visibility into the chain of events

  • Add prev pointers

    • OpenLineage
  • Move to data flow based implementation

  • Demo full DFFML data flow using execution on k8s

    • Use k8s playground as target environment
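
A sketch of the first two steps (serial downloads, then parallelized with a thread pool), using only the standard library; the URL list is hypothetical:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of URLs to download
URLS = ["https://example.com/"] * 10

def download(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

def in_series():
    return [download(url) for url in URLS]

def in_parallel():
    # Parallelize the downloads, observe the increase in speed
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(download, URLS))

for name, func in (("series", in_series), ("parallel", in_parallel)):
    start = time.perf_counter()
    func()
    print(f"{name}: {time.perf_counter() - start:.2f}s")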

@johnandersen777

InputNetwork: any UI is just a query off of the network for data linkages. Any action is just a re-trigger of a flow. On flow execution end, combine caching with a central database so that alternate output queries can be run later, enabling a data lake.

@johnandersen777

Classes become systems of events (dataflows) where the interface they fit into is defined by contracts (manifests)

@johnandersen777

To implement an interface one must satisfy system usage constraints, i.e. be ready to accept certain events (manifest) and fulfill the contract. One might also need to give certain events (inputs as manifest).

@johnandersen777

  • Run a dataflow and collect usage statistics (CPU, memory, etc.) when running locally or on k8s. Build a model to predict how much CPU or memory is needed, and check whether the cluster has enough before running; warn if the orchestrator, using the built model, predicts that the number of contexts executing will exceed resource constraints based on historical estimated usage.
  • How would we write a decorator to cache operations which make rate limited API calls? (See the sketch below.)
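
One possible answer, sketched without the DFFML operation machinery: a decorator that caches results keyed on arguments and spaces out cache misses to respect a rate limit (the interval, cache policy, and API call are assumptions):

import asyncio
import functools
import time

def cached_ratelimited(min_interval=1.0):
    # Cache results by arguments; on a cache miss, wait so that real API
    # calls are spaced at least min_interval seconds apart.
    def decorator(func):
        cache = {}
        last_call = 0.0

        @functools.wraps(func)
        async def wrapper(*args):
            nonlocal last_call
            if args in cache:
                return cache[args]
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                await asyncio.sleep(wait)
            last_call = time.monotonic()
            cache[args] = await func(*args)
            return cache[args]

        return wrapper
    return decorator

# Hypothetical rate limited API call
@cached_ratelimited(min_interval=1.0)
async def get_user(username):
    return {"login": username}

print(asyncio.run(get_user("alice")))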

@johnandersen777

Run whatever you want, wherever you want, however you want, with whatever you want, for whoever you want.

@johnandersen777

Hitting Critical Velocity. The fully connected dev model.

@johnandersen777

johnandersen777 commented Apr 20, 2022

City planning as dataflows plus CI

Imagine you're playing a city simulator. Each building has an architecture and purpose within the architecture of your overall city. Imagine that there are certain guiding overall strategies which the entities within the city understand must be taken into account to perform any actions they're directed to do. For example, one strategic goal or piece of a strategic plan might be that the city should always collect garbage and there should never be a day where garbage is not collected from more than 75% of the residents. The garbage crews as agents need to know that their course of action, in terms of actions they should take or next steps sent by the city, should have been vetted by the strategic plan which involves the assurance of residents' garbage being picked up at the expected percentage.

Entities also make decisions based on data used to train their models in an active learning situation. Data used to train agent action / strategic plans should come only from flows validated by a set of strategic plans, or strategic plans with certain credentials (verified to ensure "kill no humans" is applicable for this subset of data). This will allow us to add controls on training models, to ensure that their data does not come from sources which would cause malicious behavior, or behavior unaligned with any other active strategic plans. We must also be able to add new strategic plans on the fly and modify the top level strategic decision maker.

This example maps to the provenance information we will be collecting about the plans given to agents or inputs given to opimps. This provenance information must include an attestation or valid claim that certain sets of strategic plans were taken into consideration by the top level strategic decision maker when the orders come down to that agent or opimp.


Optimize for efficiency in a post-capitalism society.
Map people and what makes them happy and feel good health wise, things they jive with conceptually. This is like how to find the optimal agent to run the job to execute any active strategic plans (model optimization targets), because certain agents or opimps have attributes like attestation abilities, which is why we might pick them if one of our active strategic plans is to optimize for hardening against threats within a threat model (effectively alternate mitigations, which show the relationship between intent to secure assets or maintain security properties of the system).
Usage of something is a metric that might be accounted for by the strategic optimization models: one hour, one cycle, one instance.
How do you know what the modifications to the strategy (system context) should be? Predict the outputs plus structured logged metrics based on models of historical data. Run optimization collector flows across all the predicted flow + system context permutations. Use AutoML feature engineering to generate new possibilities for system context (alternative input values). Create alternate flows using a threat model alternative mitigation implementor which understands optimizing for strategic security goals with regards to asset protection by understanding intent via an intuitive shared human machine language: dataflows.
The models we build on top of the data from the optimization collector flows are the strategic plans. These models are effectively encoder/decoder language translation models with high accuracy as assessed by an arbitrary accuracy scorer (we may need a way to assess the aggregate accuracy as the percentage of good thoughts). Their individual scores tell us within their scorer's description of meaning (human scorer in the case of allowlist forms). For example, with optimizing for security, one could output (or raise an exception) a value that says this is an absolute veto power moment from that strategy, signifying we should not act on the result of the prediction (similar to an allowlist set to conditional if crypto is detected). There is a top level strategic model which makes the final decisions as to what system contexts will be explored (what thoughts are acted on and what are thought through further).
It's almost like we tie the agents in and we are thinking many thoughts (dataflows); some thoughts we want to act on and some we want to continue to think about to see how they play out with more theoretical paths, and maybe even variations on the strategic plans used to create those system input contexts. We add real data back into the training sets as we play out the real paths; strategic plans get weighted by the accuracy of their model by another encoder/decoder running in the top level strategic model (this one is different for each person, effort, deployment, engagement).
Predict the future with me. Open source AI for a post-capitalism society.
You can satisfy the "kill no humans" thing by using DICE for device to device attestation, where devices only execute orders that can be validated via provenance information metadata of system context inputs applicable to this agent/opimp, tied back via a DID provenance chain to a plan which had to be thought of by the strategic top level plan, which did run with the attestation provenance model and accepted its veto with ultimate veto authority. Therefore we know that any plan that would (Bender mode) kill all humans would have been stopped.
An architecture for generic artificial intelligence
We collectively can figure things out how to organize to achieve goals (business, climate change, etc.) using this architecture as a mechanism for communication


johnandersen777 commented Apr 20, 2022

Alice's Adventures in Wonderland

Blog series

Together we'll build Alice, an Artificial General Intelligence. We'll be successful when Alice successfully maintains a DFFML plugin as the only maintainer for a year. Debugging issues, writing fixes, reviewing code, accepting pull requests, refactoring the code base post PR merge, dealing with vulnerabilities, cutting releases, maintaining release branches, and completing development work in alignment with the plugin's universal blueprint. She will modify, submit pull requests to, and track upstreaming of patches to her dependencies to achieve the cleanest architecture possible. We'll interact with her as we would any other remote developer.

We'll need to build the foundations of Alice's thought processes. Throughout this series, we'll rely heavily on a mental model based on how humans think and problem solve. By the end of this series we'll have ensured Alice has all the primitive operations she requires to carry out the scientific process.

Terminology

  • Universal Blueprint
    • Standard architecture we use to describe anything. Provides the ability to use / reference domain specific architectures as needed to define architecture of whole.
  • Think
    • Come up with new data flows and system context input
  • Thoughts
    • Data Flows and system context input pairs (these two plus orchestration config we get the whole system context)

Expectations

Alice is going to be held to very high standards. We should expect this list to grow for a long time (years). This list of expectations may at times contain fragments which need to be worked out more and are only fragments so the ideas don't get forgotten.

  • Alice will maintain a system which allows her to respond to asynchronous messages
    • Likely a datastore with the ability to listen for changes
    • Changes would be additions of messages from different sources (email, chat, etc.)
  • Alice should be able to accept a meeting, join it, and talk to you
    • You should be able to have a conversation about a universal blueprint and she should be able to go act on it.

Alice's Understanding of Software Engineering

We'll teach Alice what she needs to know about software engineering through our InnerSource series. She'll follow the best practices outlined there. She'll understand a codebase's health in part using InnerSource metric collectors.


johnandersen777 commented Apr 20, 2022

What we end up with is a general purpose reinforcement learning architecture. This architecture can be fed any data and make sense of how the data relates to its universal blueprint. The trained models and custom logic that form its understanding of how the data relates to its universal blueprint are its identity. As such, our entity named Alice will be trained on data making her an open source maintainer.

We'll show in a later series of blog posts how to create custom entities with custom universal blueprints (strategic goals, assets at their disposal, etc.). Entities have jobs; Alice's first job is to be a maintainer. Her job is reflected in her universal blueprint, which will contain all the dataflows, orchestration configs, dataflows used to collect data for and train models used in her strategic plans, as well as any static input data or other static system context.

We can save a "version" of Alice by leveraging caching. We specify dataflows used to train models which are then used in strategic plans. Perhaps there is something here: on dataflow instantiation, query the inodes from shared config; sometimes a config will be defined by the running of a dataflow which will itself consume inputs or configs from other inodes within the shared config. So on dataflow instantiation, find leaf nodes in terms of purely static plugins to instantiate within the shared configs region. This shared config linker needs to have access to the system context. For example, if the flow is the top level flow triggered from the CLI, then the system context should contain all the command line arguments somewhere within its input network (after kick off, or when looking at a cached copy from after kick off). Defining a plugin can be done by declaring it will be an instance where the config provided is from the output of a dataflow. That dataflow can be run as a subflow with a copy-on-write version of the parent system context (for accessing things like the CLI flags given). There could be an operation which runs an output operation dataflow on the CoW parent system context. That operation's output can then be formed into its appropriate place in the config of the plugin it will be used to instantiate.

We will of course need to create a dependency graph between inodes. We should support requesting re-instantiation of instances within shared configs via event based communication to the strategic decision maker. Configuration and implementation of the strategic decision maker (SDM) determine what active strategic plans are taken into account. The SDM must provide attested claims for each decision it makes with any data sent over potentially tamperable communication channels (this needs an understanding of the properties of use of all instances of plugins; for example, in-memory everything is different than an operation implementation network connected over the internet).

@johnandersen777

Serializable graph data structure with linkage, can be used for "shared config"; just add another property, like an inode, to the plugin config BaseConfigurable code in dffml.base. Then populate configs based off of instantiated plugins with inodes in the shared_configs section.

intel locked and limited conversation to collaborators Apr 20, 2022
johnandersen777 converted this issue into discussion #1369 Apr 20, 2022
