improving introduction
gfursin committed Sep 17, 2020
1 parent 89401cd commit 96dcfcf
Showing 2 changed files with 147 additions and 37 deletions.
108 changes: 71 additions & 37 deletions docs/src/introduction.md
@@ -43,66 +43,100 @@ published in this [Nature article](https://www.nature.com/articles/sdata201618).
## What is CK?

We have developed the [Collective Knowledge framework (CK)](https://github.com/ctuning/ck)
as a small Python library with a unified command line interface (CLI).
CK provides commands to organize new software projects or rearrange existing ones
as a ***CK repository*** - a human-readable database
of reusable ***CK components***.
Such components wrap user artifacts and provide an extensible JSON meta description
and [common automation actions](https://cKnowledge.io/actions) with a unified API, CLI and JSON input/output.
You can find a partial list of shared CK repositories at [cKnowledge.io/repos]( https://cKnowledge.io/repos ).

Since we want CK to be non-intrusive and technology neutral, we use a simple 2-level directory structure
to wrap user artifacts into CK components. We also use ``.cm`` directories (similar to ``.git``)
to store meta information of all components including automatically assigned Unique IDs (***CK UID***):
as a small Python library with minimal dependencies, so that it stays portable
and can be reimplemented in other languages such as C, C++, Java, and Go.
The CK framework has a unified command line interface (CLI), a Python API,
and a JSON-based web service to manage ***CK repositories***
and add, find, update, delete, rename, and move ***CK components***
(sometimes called ***CK entries*** or ***CK data***).

CK repositories are human-readable databases of reusable CK components that can be created
in any local directory and inside containers, pulled from GitHub and similar services,
and shared as standard archive files.
CK components simply wrap user artifacts and provide an extensible JSON meta description
with [***common automation actions***](https://cKnowledge.io/actions) for related artifacts.

***Automation actions*** are implemented using [***CK modules***]( https://cKnowledge.io/modules ) - Python modules
with functions exposed in a unified way via CK API and CLI
and using extensible dictionaries for input/output (I/O).
The use of dictionaries makes it easier to support continuous integration tools
and web services and extend the functionality while keeping backward compatibility.
The unified I/O also makes it possible to reuse such actions across projects
and chain them together into unified pipelines and workflows.
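
The dictionary-based I/O described above can be illustrated with a minimal sketch. The two actions below are hypothetical and not part of CK; the `return` and `error` keys are an assumption of this sketch (0 meaning success), and every other key is made up for illustration.

```python
# Hypothetical actions, not CK code: each takes a dict and returns a dict.
# Assumption: 'return' == 0 means success, otherwise 'error' holds a message.

def compile_program(i):
    """Pretend to compile the program named in the input dictionary."""
    name = i.get('data_uoa', '')
    if not name:
        return {'return': 1, 'error': 'data_uoa is not defined'}
    return {'return': 0, 'binary': name + '.out'}


def run_program(i):
    """Pretend to run the binary produced by a previous action."""
    if 'binary' not in i:
        return {'return': 1, 'error': 'binary is not defined'}
    return {'return': 0, 'command': './' + i['binary']}


def chain(actions, i):
    """Feed the merged state into each action; stop at the first error."""
    state = dict(i)
    for action in actions:
        r = action(state)
        if r['return'] > 0:
            return r
        # New keys extend the state, so later actions can reuse earlier outputs.
        state.update(r)
    return state


result = chain([compile_program, run_program], {'data_uoa': 'demo'})
print(result)
```

Because every action accepts and returns a dictionary, new keys can be added without changing function signatures, which is the backward-compatibility property the paragraph refers to.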

Since we wanted CK to be non-intrusive and technology neutral, we decided to use
a simple 2-level directory structure to wrap user artifacts into CK components:

![wrappers](../static/wrappers.gif)

The root directory of the CK repository contains ``.ckr.json`` to describe this repository
and specify dependencies on other CK repositories to reuse their components and automation actions.
The root also contains directories to group related components together such as a ``dataset`` or ``program``
(***CK component type***).
Finally, the root directory contains a ``.cm`` directory that keeps the Unique IDs of all components
to keep track of all created components even if their names have changed (***CK alias***).

The second directory level (``CK entries`` or ``CK data``) is used to store artifacts (any files and sub-directories)
for components such as ``dataset/text1234-for-nlp`` or ``dataset/some-images-from-imagenet``.
Each such sub-directory also contains a ``.cm`` directory with a ``meta.json`` file to describe a given component
and an ``info.json`` file to keep the provenance of a given component
including copyrights, licenses, creation date, contributors and so on.

Each ***CK component type*** must have a related ***CK module***
with the same name (for example ``dataset`` or ``program``)
that contains a simple Python module with common automation actions for a given component type
such as ``compile program`` or ``run program``.
These modules are also stored as CK entries in CK repositories
such as ``module/dataset/module.py`` or ``module/program/module.py``.
The root directory of the CK repository contains the ``.ckr.json`` file to describe this repository
and specify dependencies on other CK repositories to explicitly reuse their components and automation actions.

CK uses ``.cm`` directories similar to ``.git`` to store meta information of all components
as well as Unique IDs of all components to be able to find them even
if their user-friendly names have changed over time (***CK alias***).

***CK modules*** are always stored in ***module / < CK module name >***
directories in the CK repository.
For example, ``module/dataset`` or ``module/program``.
They have a ``module.py`` with associated automation actions (for example, ``module/dataset/module.py`` or ``module/program/module.py``).
This approach allows multiple users to add, improve, and reuse
common automation actions for related components rather than
reimplementing them from scratch for each new project.

Note that the CK framework has an internal [***default CK repository***]( https://github.com/ctuning/ck/tree/master/ck/repo )
with stable [***CK modules***](https://github.com/ctuning/ck/tree/master/ck/repo/module) and most commonly used automation actions.
When CK is used for the first time, it also creates a ***local CK repository***
in the user space to be used as a working repository or a scratch-pad.

CK components are stored in ***< CK module name > / < CK data name >*** directories.
For example, ``dataset/text1234-for-nlp`` or ``dataset/some-images-from-imagenet``.

Each CK component has a ``.cm`` directory with the ``meta.json`` file
describing a given artifact and ``info.json`` file to keep the provenance
of a given artifact including copyrights, licenses, creation date,
names of all contributors, and so on.

The CK framework has an internal [***default CK repository***]( https://github.com/ctuning/ck/tree/master/ck/repo )
with [***stable CK modules***](https://github.com/ctuning/ck/tree/master/ck/repo/module) and the most commonly used automation actions
across many research projects.
When the CK framework is used for the first time, it also creates a ***local CK repository***
in the user space to be used as a scratch pad.


CK provides a simple command line interface, similar to natural language, to manage CK repositories, entries, and actions:
```bash
ck <action> <CK module name> (flags) (@input.json) (@input.yaml)
ck <action> <CK module name>:<CK entry name> (flags) (@input.json) (@input.yaml)
ck <action> <CK module name>:<CK entry name> (flags) (@input.json or @input.yaml)
ck <action> <CK repository name>:<CK module name>:<CK entry name>
```
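
To make the addressing scheme above concrete, here is a minimal sketch that splits such an address into its parts. It is not CK's own parser; the function name `parse_address` and the returned keys are hypothetical.

```python
def parse_address(address):
    """Split '<repo>:<module>:<entry>', '<module>:<entry>' or '<module>'.

    Hypothetical helper for illustration only; parts that are not given
    are returned as None.
    """
    parts = address.split(':')
    if len(parts) > 3 or any(p == '' for p in parts):
        raise ValueError('unexpected address: ' + repr(address))

    if len(parts) == 1:
        repo, module, entry = None, parts[0], None
    elif len(parts) == 2:
        repo, module, entry = None, parts[0], parts[1]
    else:
        repo, module, entry = parts

    return {'repo': repo, 'module': module, 'entry': entry}


print(parse_address('program'))
print(parse_address('program:cbench-automotive-susan'))
print(parse_address('local:program:new-program-workflow'))
```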

For example:
The next example demonstrates how to compile and run the shared automotive benchmark
on any platform, and then create a copy of the ***CK program component***:

```bash
pip install ck

ck pull repo --url=https://github.com/ctuning/ck-crowdtuning

ck search dataset --tags=jpeg

ck search program:cbench-automotive-*

ck find program:cbench-automotive-susan

ck load program:cbench-automotive-susan

ck help program

ck compile program:cbench-automotive-susan --speed
ck run program:cbench-automotive-susan --env.OMP_NUM_THREADS=4

ck cp program:cbench-automotive-susan program:cbench-automotive-susan-copy
ck run program --help

ck cp program:cbench-automotive-susan local:program:new-program-workflow

ck find program:new-program-workflow

ck benchmark program:new-program-workflow --record --record_uoa=my-test

ck replay experiment:my-test

```

76 changes: 76 additions & 0 deletions docs/src/specs.md
@@ -0,0 +1,76 @@
# CK specs

## CK repository

Here we describe the structure of a CK repository.
You may want to look at any CK repository such as [ck-env](https://github.com/ctuning/ck-env)
to better understand this structure.

Note that CK creates this structure automatically when you use
[CK CLI or Python API](commands.md).

### Root files

* *.ckr.json* : JSON meta description of this repository including UIDs and dependencies on other repositories.
Most of this information is automatically generated when a new CK repository is created.

```Json
{
"data_alias": # repository name (alias) such as "ck-env"
"data_name": # user-friendly repository name such as "CK environment"
"data_uid": # CK UID for this repository (automatically generated)
"data_uoa": # repository alias or repository UID if alias is empty
"dict": {
"desc": # user-friendly description of this repository
"repo_deps": [
{
"repo_uoa": # repository name
"url": # Git URL of this repo
}
...
],
"shared": # =="git" if repository is shared
"url": # Git URL of this repository
}
}
```
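
As an illustration of this layout, the following sketch reads a *.ckr.json* file with the Python standard library and lists the repositories it depends on. The helper `read_repo_deps` and the sample values (such as the `my-repo` alias) are made up for the example; only the key names come from the description above.

```python
import json
import os
import tempfile


def read_repo_deps(repo_path):
    """Return (repo_uoa, url) pairs listed under dict.repo_deps in .ckr.json."""
    with open(os.path.join(repo_path, '.ckr.json'), encoding='utf-8') as f:
        desc = json.load(f)
    deps = desc.get('dict', {}).get('repo_deps', [])
    return [(d.get('repo_uoa'), d.get('url')) for d in deps]


# Build a throwaway repository description with illustrative values.
sample = {
    'data_alias': 'my-repo',
    'data_uoa': 'my-repo',
    'dict': {
        'desc': 'example repository',
        'repo_deps': [
            {'repo_uoa': 'ck-env', 'url': 'https://github.com/ctuning/ck-env'}
        ],
    },
}

with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, '.ckr.json'), 'w', encoding='utf-8') as f:
        json.dump(sample, f, indent=2)
    deps = read_repo_deps(repo)

print(deps)
```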

### Root directories (CK modules)

The root of a CK repository can contain any sub-directories to let users gradually convert their ad-hoc projects into the CK format.
However, if a directory is related to a CK entry, it should have the same name as the associated CK module and two files in the *.cm* directory:

* *CK module name*
* *.cm/alias-a-{CK module name}* : contains UID of the CK module
* *.cm/alias-u-{UID}* : contains CK module name (alias)

These two files in *.cm* tell CK that a given directory inside a CK repository is associated with a CK entry.
They also support fast search for a given CK entry by UID or alias.

However, in the future, we may want to remove such files and perform automatic indexing when CK pulls repositories (similar to Git). See [this ticket](https://github.com/ctuning/ck/issues/118).

### Sub-directories for CK entries

If a directory in the CK repository has a valid CK module name, it can contain CK entries associated with this CK module.

If a CK entry does not have a name (an alias), it will be stored as a CK UID (16 lowercase hexadecimal characters):

* *UID* : holder for some artifacts

If a CK entry has a name (an alias), there will be two more files in the *.cm* directory:

* *CK entry name* : holder for some artifacts
* *.cm/alias-a-{CK entry name}* : contains UID of the CK entry
* *.cm/alias-u-{UID}* : contains CK entry name (alias)

Once again, these .cm files allow CK to quickly find CK entries by UID and aliases in all CK repositories without the need for any indexing.
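
The lookup described above can be sketched in a few lines of Python. This is not CK's implementation: the helper names are hypothetical, the UID in the demo is made up, and the sketch assumes that each alias file contains nothing but the UID or the alias as plain text.

```python
import os
import re
import tempfile

# A CK UID is described above as 16 lowercase hexadecimal characters.
UID_RE = re.compile(r'[0-9a-f]{16}')


def is_uid(text):
    return UID_RE.fullmatch(text) is not None


def read_text(path):
    with open(path, encoding='utf-8') as f:
        return f.read().strip()


def resolve_entry(module_dir, uoa):
    """Return (alias, uid) for an entry given either its alias or its UID.

    Simplification: a string that looks like a UID is always treated as one.
    """
    cm = os.path.join(module_dir, '.cm')
    if is_uid(uoa):
        path = os.path.join(cm, 'alias-u-' + uoa)
        alias = read_text(path) if os.path.isfile(path) else None
        return alias, uoa
    path = os.path.join(cm, 'alias-a-' + uoa)
    if not os.path.isfile(path):
        raise FileNotFoundError('no alias file for ' + repr(uoa))
    return uoa, read_text(path)


# Demo on a temporary directory with a made-up UID.
demo_uid = '0123456789abcdef'
with tempfile.TemporaryDirectory() as repo:
    cm_dir = os.path.join(repo, 'dataset', '.cm')
    os.makedirs(cm_dir)
    with open(os.path.join(cm_dir, 'alias-a-text1234-for-nlp'), 'w') as f:
        f.write(demo_uid + '\n')
    with open(os.path.join(cm_dir, 'alias-u-' + demo_uid), 'w') as f:
        f.write('text1234-for-nlp\n')

    by_alias = resolve_entry(os.path.join(repo, 'dataset'), 'text1234-for-nlp')
    by_uid = resolve_entry(os.path.join(repo, 'dataset'), demo_uid)

print(by_alias, by_uid)
```

Both directions need only a single file read, which is why no separate index is required.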

### CK entry

Each valid CK entry has at least 3 files in the *.cm* directory:

* *.cm/meta.json* : JSON meta description of a given CK entry
* *.cm/info.json* : provenance for a given CK entry (date of creation, author, copyright, license, CK used, etc)
* *.cm/desc.json* : meta description SPECs (under development)

This entry can also contain any other files and directories (for example models, data set files, algorithms, scripts, papers and any other artifacts).
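
A minimal sketch of how a tool might check and load these three files is shown below. The function `load_entry` and its return format are assumptions made for illustration, and the file contents in the demo are placeholders rather than real CK meta data.

```python
import json
import os
import tempfile

# File names taken from the list above.
ENTRY_FILES = ('meta.json', 'info.json', 'desc.json')


def load_entry(entry_dir):
    """Load the .cm JSON files of an entry, or report which ones are missing."""
    cm = os.path.join(entry_dir, '.cm')
    missing = [n for n in ENTRY_FILES if not os.path.isfile(os.path.join(cm, n))]
    if missing:
        return {'return': 1, 'error': 'missing files: ' + ', '.join(missing)}

    loaded = {}
    for name in ENTRY_FILES:
        with open(os.path.join(cm, name), encoding='utf-8') as f:
            # Key the result by file name without the '.json' suffix.
            loaded[name[:-len('.json')]] = json.load(f)
    return {'return': 0, 'entry': loaded}


# Demo with placeholder contents.
with tempfile.TemporaryDirectory() as root:
    entry = os.path.join(root, 'dataset', 'some-images-from-imagenet')
    os.makedirs(os.path.join(entry, '.cm'))

    incomplete = load_entry(entry)  # no files written yet

    for name in ENTRY_FILES:
        with open(os.path.join(entry, '.cm', name), 'w', encoding='utf-8') as f:
            json.dump({'placeholder': name}, f)

    complete = load_entry(entry)

print(incomplete)
print(sorted(complete['entry']))
```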
