diff --git a/aboutcode-data/README.rst b/aboutcode-data/README.rst new file mode 100644 index 0000000..9beb9cd --- /dev/null +++ b/aboutcode-data/README.rst @@ -0,0 +1,1202 @@ +ABCD: Using the AboutCode Data structures to describe data ... about code. +========================================================================== + +Summary +------- + +ABCD is an abbreviation for ABout Code Data. The AboutCode Data goal is +to provide a simple, standardized and extensible way to document data +about software code such that: + +- It is a common way to exchange data about code between any nexB tools + by import and export. + +- It becomes the preferred way to exchange data between nexB tools and + other tools. + +- It could become a valuable structure to exchange data between any + tools concerned with data about software. + +ABCD is technology and programming language neutral, preferring JSON or +YAML document formats. + +ABC Data is structured around a few basic objects: Products, Components, +Packages, Files, Parties and Licenses. It is extensible to other +specific or future object types. + +Objects have "attributes" that are simple name/value pairs. A value can +be either a plain value or another object or a list of objects and +attributes. + +ABC Data is minimally specified by design: only a few basic objects and +attributes are documented with conventions to name and structure data +and how to define relationships between objects. There is only a small +reference dictionary for some well known attributes documented here. + +The planned benefit for tools using ABC Data is simplified data exchange +and integration between multiple best-of-breed tools. + +Context +------- + +There is currently no easy way to describe information about code and +software in a simple and standardized way. 
There have been many efforts +to provide this data in a more structured way such as: + +- SPDX (focused on packages and licenses), +- DOAP (focused on projects), +- The original ABOUT metafile format, and +- The many different software package metadata formats (Maven, NPM, + RPM, Deb, etc). + +These data structures are fragmented and generally too purpose- or +technology-specific. + +Recently there have been efforts to collect and expose more data such +as: + +- libraries.io (a catalog of packages, AGPL-licensed) + and dependencyci.com its companion commercial service, +- versioneye.com (a catalog of package version updates, now + MIT-licensed), +- softwarearchive.org (an effort to build an all-encompassing software + source code archive), +- sources.debian.net (a Debian-focused code and metadata search + facility), +- searchcode.com (an ad-supported source code search engine exposing + some metadata), +- appstream (a cross-distro effort to normalize desktop package + metadata access to Linux desktops + https://www.freedesktop.org/software/appstream/docs/). + +These efforts are all useful, but they do not address how you can +consistently exchange data about code in a user-centric and +technology-neutral, normalized way. + +Why does this matter? Software and code are everywhere. FLOSS code is +exploding with millions of components and packages. The data about this +code is out there somewhere but getting it is harder than needed. This is a +problem of data normalization, aggregation and exchange. + +Whether you consume or produce software, accessing and creating +normalized data about your code and the code you use should be made +easier such that: + +- You can efficiently make code selection and re-use decisions, + +- You can discover what is in your code, and + +- You can continuously track updates, bugs, licenses, liveliness, + quality and security attributes of the code you use or consider + using.
+ +With the notable exceptions of SPDX and the earlier versions of the +ABOUT format, available data formats about software have been designed +first for a specific technology (e.g. Linux distros) or programming +language (e.g. maven, npm, etc.) and documentation of code provenance +and related attributes has been secondary and minimal. In most cases, +the primary focus has been comprehensive support for +package installation, dependency resolution or package building, while +provenance and licensing information is often treated in less +detail. + + +ABCD: the AboutCode Data structure +---------------------------------- + +ABCD is an abbreviation for ABout Code Data. The goal is to provide a +simple, standardized and extensible way to document things about +software code. + +In contrast with other approaches, the AboutCode Data structure is +focused on providing data that is useful to users first and is not +limited to software package data only. AboutCode Data does not need to +be as strictly specified as traditional package manager data formats +because its purpose is not to drive a software build, package creation +or software installation nor is it to compute the resolution of +dependencies. It only provides information (metadata) about the code. + +The vision for the ABC Data structure is to provide a common way to +exchange data about code between all nexB tools, such that these tools +can all import and export data about code seamlessly (TraceCode, +ScanCode, AboutCode Manager, AttributeCode, upcoming MineCode, etc.). +The ABCD structure should also be the preferred way to exchange data +about code between nexB tools and other tools. We may create small +adapters to convert other data formats in and out of the ABCD structure +and encourage other tool authors to natively support ABC Data though +the main focus is on our tools.
+ +The ABCD structure is technology and programming language neutral and +designed so that the parties exchanging data about code can do so +reliably with some minimal conventions; and that the data is easily +processed by machines and not hard to read by humans. + +ABC Data is structured around "objects". Objects have "attributes" that +are simple name/value pairs. A value can be either a plain value or +another object or a list of nested objects and attributes. + +ABC Data is organized around: + +- a few known object types, + +- simple conventions to create lists of objects and describe object + relationships, + +- simple conventions to create attributes as name/value pairs, and + +- a small dictionary or vocabulary of well-known attribute names that + have a common definition across all tools. + +ABC Data is "under-specified" by design: only a few essential objects +and attributes are documented here with the conventions on how to +structure the ABC Data. + + +Basic objects describing data about code +---------------------------------------- + +At the top level we have these main object types: + +- Product(s): a software product, application or system such as a Cost + Accounting application. + +- Component(s): a software component, such as the PostgreSQL 9 database + system, usually a significant or major version + +- Package(s): a set of files that comprise a Component as used, such as + a postgresql-9.4.5-linux-x86.zip archive. The version is exact and + specific. + +- File(s): any file and directory identified by a path, such as a + binary package or a source code directory or file. + + +And these secondary, important but less prominent object types: + +- Party(ies): a person or an organization. An organization can be a + project, formally or informally organized, a company, a department + within a company, etc. A Party typically has contact information + (such as an email or physical address or home url). 
A Party may have + defaults that apply to much of its software (for an org that creates + software) such as a default Apache license for Apache Foundation + projects. Parties often relate to other objects through a + role relationship such as owner, author, maintainer, etc. + +- License(s): information about the license of code. A License + typically has a name and text and additional categories, tags or + attributes. + + +Each of these objects has a few identifying attributes and possibly +many tool- or application-specific data attributes. Each tool defines +and documents the attributes it can handle and cares about. When some +agreement is reached on the definition of new attributes or objects, the +ABCD dictionary may be updated accordingly with new object types such +as for software security, quality or other interesting aspects. + +Objects are interrelated with other objects. Objects can relate to each +other via a reference using identifiers pointing to other objects or via +an embedded list of objects. The nature of the relationship between two +objects can also be specified with additional attributes as needed. + +Here are typical relationships between objects: + +|image1| + +Here is an example of relationships for a simple Widget product: + +|image2| + +Tools can define any custom objects and some used more commonly may be +promoted to be documented here over time. + + +Attribute Names and Values +-------------------------- + +By convention, a tool receiving ABC Data should process only the data it +knows and should ignore unknown attributes or objects. This is important +to allow the data structure to evolve and provide some forward and +backward compatibility. When an ABCD payload contains data elements that +a receiver does not know about, the receiver should still be able to +process the known objects and attributes. + +- Attributes are name/value pairs. + +- Attribute names are always strings, not numbers, not booleans, not any + other data format.
In these strings, leading and trailing white spaces + (spaces, tabs, line returns, etc) are not significant and can be safely + ignored or removed. + +- Attribute values are one of the standard JSON types: string, number, + boolean or null. In strings, leading and trailing white spaces (spaces, + tabs, line returns, etc) are not significant and can be safely ignored + or removed. + +- Self-explicit names should be used rather than obscure names or + abbreviations: names should be self-explicit and self-evident. + +Except for the data organization conventions described here and the use +of the well-known object and attribute names, nothing is mandatory in +the ABCD format. This means that even partial, incomplete or sketchy +data about code can be transferred in this format. + +The meaning of well known object names such as Product, Component, +Package, File, Party and License is defined in this document. + + +Name conventions +---------------- + +- Names are strings composed ONLY of ASCII letters, numbers or + underscores. Names cannot start with a number. Names cannot contain + spaces nor other punctuation, not even a dot or period. + +- Names are NOT case sensitive: upper or lowercase does not matter and + the standard is to use lowercase. It is a mistake to use upper or + mixed case but this is something a parser receiving ABC Data should + recover from nicely by converting the names to lowercase. + +- Names are made of English words: there is no provision currently for + non-English names. Tools that deal with multilingual content may + define their own conventions to provide content in other languages. + ABCD may add one of these conventions in the future. + +- Parser implementation can be smarter and gentler: For names, anything + that is not ASCII or number or underscore can be accepted by a parser + and could be replaced by an underscore, including a starting digit if + any. 
Or a parser may provide a warning if there is an unknown name + that is very close to a well known name. Or a parser may accept + CamelCase and transform names to underscore_case and perform another + transformation to conventional ABC Data. + +- Names are singular or plural: When a name refers to more than one + item, the name of the field is plural and the value is a list of + values. For instance "url" and "urls". + +- Top level known objects are ALWAYS plural and stored in lists: + "parties" or "files" or "products" or "components". This makes it + easier to write tools because the top level types are always lists, + even when there is a single object in that list. + +- A value must not be used as a name: in an attribute name/value pair, + the name is always a name, not a value and every value must have a + name. + +- For instance, this JSON snippet would not be correct where a URL is + used as a name:: + + {"http://someurl.com": "this is the home URL"} + +- Rather use this form to specify a name for the URL attribute:: + + {"url": "http://someurl.com", "note": "this is the home URL"} + +- But this would be correct when using a list of plain values where + "urls" is plural:: + + {"urls": ["http://someurl.com", "http://someurl2.com"]} + +- An attribute name without a value is not needed. Only names with + values are needed, and attributes without values can be omitted: each + tool may do what it wants for these cases. For instance it may be handy + to provide all attributes even if not defined in an API payload. But + when serializing data as YAML meant for human editing, including all + empty values may not help with reading and processing the YAML text. + An undefined attribute without a set value should be assigned + the null JSON value: this has the same meaning as if the attribute + was not specified and absent from the payload.
If you want to specify + that an attribute has an empty value and does not have a value (as + opposed to having an unknown value) use an empty string instead. + +- Avoid abbreviated names, with some exceptions: names should always be + fully spelled out except for: + + - url: uniform resource locator + - uri: uniform resource identifier + - urn: uniform resource name + - vcs: version control system + - uuid: universally unique identifier, used for uuid4 strings + (https://tools.ietf.org/html/rfc4122.html) + - id: identifier + - info: information + - os: operating system + - arch: architecture + +- For some common names we use the common compound form such as: + + - codebase: and not code_base + - filename: and not file_name + - homepage: and not home_page + +Well known attribute names include: + +- name: the name of a product, component, license or package. +- version: the version of a product, component, package. +- description: description text. +- type: some type information about an object. For instance, a File + type could be: directory, file or link. +- keywords: a list of keywords about an object. For example, the + keywords of a component used to "tag" a component. +- path: the value is the path to a file or directory, either absolute + or relative and using the POSIX convention (a forward slash as + separator). For Windows paths, replace backslash with forward + slashes. Directories should end with a slash in a canonical form. +- key: the value is some key string, slug-like, case-insensitive and + composed only of ASCII letters and digits, dash, dot and underscore. + No white spaces. For example: org.apache.maven-parent +- role: the value describes the role of a Party in a relationship with + other objects. For instance a Party may be the + "owner" or "author" of a Component or Package.
+- uuid: a uuid4 string + (https://tools.ietf.org/html/rfc4122.html) +- algorithms for checksums: to store checksums we use name/value + pairs where the name is an algorithm such as sha1 and the value is a + checksum in hexadecimal such as "sha1": "asasa231212". The value is + the standard/default string created by command line tools such as + sha1sum. Supported algorithms may evolve over time. Common checksums + include md5, sha1, sha256, sha512. +- notes: some text notes. This is an exception to the singular/plural + rule for names: notes is a single text field and not a list. + +As the usage of the ABCD structure matures, more well known names will +be documented in a vocabulary. + + +Value conventions +----------------- + +- Attribute values are one of the standard JSON types: string, number, + boolean or null. In strings, leading and trailing white spaces + (spaces, tabs, line returns, etc) are not significant and can be + safely ignored or removed. + +- To represent a date/time use the ISO format such as 2016-08-15 + defaulting to UTC time zone if the time zone is not specified in the + date/time stamp. + +- All string values are UTF-8 encoded. + + +Well known name prefixes or suffixes can be used to provide a type hint +for the value type or meaning: + +- xxx_count, xxx_number, xxx_level: the value is an integer number. + Example: results_count or curation_level + +- date_xxx or xxx_date: the value is a date/time stamp in ISO format + such as 2016-08-16 (See https://www.ietf.org/rfc/rfc3339.txt ). + Examples: last_modified_date, date_created + +- xxx_url: the value is a web URL (http(s) or ftp) that points + to a web resource (which may no longer + exist on the web). Example: homepage_url or api_url + +- xxx_uri: the value is a URI typically used as an identifier that may + not point to an existing web resource. Example: + git://github.com/nexb/scancode-toolkit + +- xxx_file or xxx_path: the value is a file path.
This can come in handy + for external files such as a license file. Example: notice_file + +- xxx_filename: the value is a file name. Example: notice_filename + +- xxx_text: the value is a long text. This is only a hint that it may + be large and may span multiple lines. Example: notice_text + +- xxx_line: such as start_line and end_line: the value is a line + number. The first line number is 1. + +- xxx_status: such as configuration_status. Indicates that the value + is about some status. + +- xxx_name: such as short_name. Indicates that the value is a name. + Commonly used for long_name, short_name. The bare name should be + preferred for the obvious and most common way an object is named. + +- xxx_flag, is_xxx, has_xxx: such as is_license_notice. Indicates + that the string value is a boolean. + + +Object identifiers +------------------ + +We like objects to be identifiable: there is a natural way to identify +and name most objects: for instance the full name of a person or +organization or the name and version of a Component or Package or the +path to a File are all natural identifiers to an object. + +However, natural names are not always enough to fully identify an object +and may need extra context to reference an object unambiguously. There +could be several persons or organizations with the same name at a +different address. Or the foo-1.4 Package could be available as a +public RubyGem and also as an NPM; or a private Python package foo-1.4 +has been created by a company and is also available on PyPI. Or the +"foo" Package is the name of a Linux Package, an NPM and a Ruby Package +but these three packages are for unrelated components. + +Hence each object may need several attributes to be fully identified. + +For example, public package managers ensure that a name is unique within +the confines of a source. "logging" is the unique name of a single +Sourceforge project at +https://sourceforge.net/projects/logging/ .
+"logging" is the unique name of an Apache project at the Apache +Foundation http://logging.apache.org/ . + +Yet, these two names point to completely different software. In most +cases, providing information about the "source" where an identifier is +guaranteed to be unique is enough to ensure proper identification. This +"source" is easily identified by its internet source name, and an +internet source name is guaranteed to be unique globally. The "source" +of identifiers is not mandatory but it is strongly encouraged to use as +an attribute to provide good unique identifiers: still, tools exchanging +ABC Data must be able to exchange under-specified and partially +identified data and may sometimes rely on comparing many attributes of +two objects to decide if they are the same. + +The minimal way to identify top level objects is the combination of a +"source" and a unique identifier within this source. The source can be +implicit when two parties are exchanging data privately or explicit +using the "source" attribute. + +Within a source, we use the most obvious and natural identifiers for an +object. For example: + +- For Products, Components and Packages we can use their name and + version. + +- For Files we use a path of a file or directory, possibly relative to + a package or a product codebase; or a checksum of a file or archive + such as a sha1. + +- For Parties, we use a name possibly supplemented with a URL or email. + +- For all object types we can use a "universally unique id" or UUID-4 + (https://tools.ietf.org/html/rfc4122.html) + +- For all object types, we can use a key, which is a slug-like string + identifier such as a license key.
+ +- For all object types, we can use a URN + (https://en.wikipedia.org/wiki/Uniform_resource_name). Tools may + also define their own URN namespaces and names such as the DejaCode + URN urn:dje:component:16fusb:1.0 + + + +Beyond direct identification, an object may have several alternative +identifiers aka "external references". For instance a Package may have +different names and slightly different versions in the Linux Debian or +Fedora distros and a PyPI Package with yet another name where all these +Packages are for the same Component and the same code. Or a Party such +as the Eclipse Foundation may be named differently in DejaCode and the +NVD CPEs. + +To support these cases, the "external_reference(s)" attribute can be +used where needed in any object to reference one or more external +identifiers and the source for each identifier (note: "external" +is really a matter of point of view of who owns or produces the ABC +Data.) An attribute with name suffix of "xxx_reference" may also be +used to provide a simpler external reference such as "approval_reference". + + +For example this ABC Data could describe the external id of a Party to a +CPE and to TechnoPedia (here in a YAML format):: + + parties: + - name: Apache Foundation + homepage_url: http://apache.org + type: organization + external_references: + - source: nvd.nist.gov + identifier: apache + - source: technopedia.com + identifier: Apache Foundation (The) + - source: googlecode.com + identifier: apache-foundation + +Other identifiers may be used as needed by some tools, such as +in hyperlinked APIs. + + +Organizing data and relationships +--------------------------------- + +Describing relationships between objects is essential in AboutCode Data. +There are two ways to describe these relationships: by referencing or by +embedding objects.
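As a rough illustration of these two layouts, the following Python sketch rewrites a document that embeds packages in components into one that references them. The helper name `to_referenced` and the choice of `sha1` as the reference identifier are assumptions made for this example only; ABC Data does not prescribe either.

```python
import copy


def to_referenced(document, id_attribute="sha1"):
    """Return a copy of `document` where embedded packages become references.

    Hypothetical helper: it assumes every embedded package carries the
    `id_attribute` (here a sha1 checksum) used as its identifier.
    """
    result = copy.deepcopy(document)
    packages_by_id = {}
    for component in result.get("components", []):
        if "packages" not in component:
            continue
        references = []
        for package in component["packages"]:
            identifier = package[id_attribute]
            # Keep the full package details once, in a separate top level list.
            packages_by_id.setdefault(identifier, package)
            # Leave only the identifier behind in the component.
            references.append({id_attribute: identifier})
        component["packages"] = references
    result["packages"] = list(packages_by_id.values())
    return result


embedded = {"components": [{
    "source": "http://apache.org",
    "name": "Apache httpd",
    "version": "2.3",
    "packages": [
        {"name": "httpd", "version": "2.3.4", "sha1": "acbf23256361abcdf"},
        {"name": "httpd", "version": "2.3.5", "sha1": "ac8823256361adfcdf"},
    ],
}]}

referenced = to_referenced(embedded)
```

Both `embedded` and `referenced` carry the same information; only the organization differs, which is the point of the two layouts.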
+ +When using a reference, you relate objects by providing identifiers to +these objects and may provide additional object details in separate +lists. When embedding, you include not only the reference but also the +related object details in another object data. This could include all +data about an object or a subset as needed. + +For example this components list embeds a list of two packages. Note +that components is always a list even when it has a single component:: + + {"components": [{ + "source": "http://apache.org", + "name": "Apache httpd", + "version": "2.3", + "packages": [ + {"name": "httpd", + "version": "2.3.4", + "download_url": "http://apache.org/dist/httpd/httpd-2.3.4.zip", + "sha1": "acbf23256361abcdf", + "size": 23267, + "filename": "httpd-2.3.4.zip" + }, + + {"name": "httpd", + "version": "2.3.5", + "download_url": "http://apache.org/dist/httpd/httpd-2.3.5.tar.gz", + "sha1": "ac8823256361adfcdf", + "size": 33267, + "filename": "httpd-2.3.5.tar.gz" + } + ] + }]} + + +In this example, the component list references two packages that are +listed separately and uses the checksum as package identifiers for the +reference. This data is strictly equivalent to the previous example but +using a different layout.
When all the data is provided, the effect of +embedding or referencing objects results in the same data, just +organized differently:: + + {"components": [{ + "source": "http://apache.org", + "name": "Apache httpd", + "version": "2.3", + "packages": [ + {"sha1": "acbf23256361abcdf"}, + {"sha1": "ac8823256361adfcdf"} + ] + }], + + "packages": [ + {"name": "httpd", "version": "2.3.4", + "download_url": + "http://apache.org/dist/httpd/httpd-2.3.4.zip", + "sha1": "acbf23256361abcdf", "size": 23267, "filename": "httpd-2.3.4.zip"}, + + {"name": "httpd", "version": "2.3.5", + "download_url": "http://apache.org/dist/httpd/httpd-2.3.5.tar.gz", + "sha1": "ac8823256361adfcdf", "size": 33267, "filename": "httpd-2.3.5.tar.gz"} + ]} + +In this third example the packages are referencing one component +instead. That component is always wrapped in a components list. The +component detail data is not provided. The details may be available +elsewhere in a tool that tracks components:: + + "packages": [ + {"name": "httpd", "version": "2.3.4", + "download_url": "http://apache.org/dist/httpd/httpd-2.3.4.zip", + "sha1": "acbf23256361abcdf", "size": 23267, "filename": "httpd-2.3.4.zip", + "components": [ + {"source": "http://apache.org", "name": "Apache httpd", "version": "2.3"} + ] + }, + + {"name": "httpd", "version": "2.3.5", + "download_url":"http://apache.org/dist/httpd/httpd-2.3.5.tar.gz", + "sha1": "ac8823256361adfcdf", "size": 33267, "filename": "httpd-2.3.5.tar.gz", + "components": [ + {"source": "http://apache.org", "name": "Apache httpd", "version": "2.3"} + ] + } + ] + + +Relationships can be documented with this approach in different ways. +Typically when the primary concern is about a Product, then the Product +object may embed data about its Components. When the primary concern is +Packages, they may embed or reference Products or Components or files.
+For example: + +- A tool may prefer to provide data with products or components as top level + objects. The components used in a Product are naturally embedded in the products. + +- A tool that is more concerned with files will provide files as top + level objects and may embed packages details when they are found for + a file or directory path. + +- Another tool may focus on packages and provide packages first with + components references and possibly embedded files. A matching tool + may provide packages first and reference matched files. The file + paths of a package are naturally embedded in the package, though + using references may help keep the data simpler when there is a large + volume of files. + +- A tool that generates attribution documentation may be interested + first in components and second in license or package references. + +- A tool dealing with security vulnerabilities may define a + Vulnerability object and reference Packages and Files that are + affected by a Vulnerability. + +To better understand the embedding or referencing relationships: + +- using references is similar to a tabular data layout, akin to a + relational database table structure + +- using embedding is similar to a tree data layout such as in a + file/directory tree or nested data such as XML. + +Another way to think about these relationships is a "GROUP BY" statement +in SQL. The data can be grouped by Component, then Packages or +grouped by Files then Components. + +Both referencing and embedding layouts can be combined freely and are +not mutually exclusive. When using both at the same time, some care is +needed to avoid creating documents with conflicting or duplicated data +that is referenced and embedded at the same time. + +Using references is often useful when there is an agreement on how to +reference objects between two tools or parties.
For instance, when using +nexB tools, a unique and well defined license key is used to reference a +license rather than embedding the full license details. A concise +reference to the name and version of a public package from a well known +package repository such as RPM or Maven can be used to the same effect. +Or an SPDX license identifier can be used to reference an SPDX-listed +license without having to embed its full license text. + +The nature of the relationship between two objects can be specified when +it is not obvious and requires some extra specification. Each tool can +define additional attributes to document these. For instance a common +relationship between a party and a product or component is a role such +as owner. For packages a role can be maintainer, author, etc. Or the +license of a file or package may be the "asserted" license by the +project authors. It may differ from the "detected" license from a scan +or code inspection and may further differ from a "concluded" license or +a "selected" license when there is a license choice. At the package and +license level the types of relationships documented in the SPDX +specification are a good source for more details. For example this +component references parties documented using a role attribute, where +one is the author and the other is both the maintainer and the owner:: + + "components": [{ + "source": "http://apache.org", + "name": "Apache httpd", + "version": "2.3", + "parties": [ + {"name": "John Doe", "type": "person", "role": "author"}, + {"name": "Jane Smith", "type": "person", "role": "maintainer"}, + {"name": "Jane Smith", "type": "person", "role": "owner"} + ] + }] + + +Document format conventions +--------------------------- + +The default ABC Data format is JSON (though it can be serialized to +anything else that would preserve its structure).
YAML is also supported +and preferred for storage of simple documents that document one or a few +top level objects and that need to be edited by a human. + +The data structure of nested name/value pair attributes and lists of +values maps naturally to the corresponding JSON and YAML constructs. In +JSON-speak these are arrays (lists) and objects (name/value pairs). + +ABC Data can be provided as simple files or embedded in some API +payload. As files, their content can be either JSON or YAML and should +have either a .json or .yml extension by convention. For backwards +compatibility with previous AboutCode conventions, the .ABOUT extension +can be used for YAML documents. For instance this is used in the legacy +about_code_tool and its successors. The DocumentCode tool can store +individual attribution data in a .ABOUT yml file. + +The top level structure of an ABC Data block is always a JSON object or +YAML dictionary. Depending on the context this top level structure may +be wrapped in another data structure (for instance when exchanging +AboutCode Data in some web api, the API may provide ABC Data as a +payload in a "results" or "body" or "data" block and also have some +"headers" or "meta" block). + +The top level elements must contain at least one of the object names and +a list of objects such as here with a list of files:: + + files: + - path: this/foo/bar + size: 123 + sha1: aaf35463472abcd + - path: that/baz + +Optionally an "aboutcode_version" attribute can be added at the top +level to document which version of the AboutCode Data structure is used +for a document. For example: aboutcode_version: 4.0 + +Order of attributes matters to help reading documents: tools that write +ABC Data should attempt to use a consistent order for objects and +attribute names rather than a random ordering. However, some tools may +not be able to set a specific order so this is only a recommendation.
The +preferred order is to start with identifiers and keys and from the most +important to the least important attributes, followed by attributes +grouped logically together, followed by related objects. + + +References between documents and payload, embedding other files +--------------------------------------------------------------- + +ABC Data may reference other data. For instance in a hyperlinked REST +API a list of URLs to further invoke the API and get licenses details +may be provided with an api_url attribute to identify which API calls +to invoke. The ways to reference data and the semantics and mechanics of +each of these embeddings or references needed to get the actual data are +not specified here. Each tool may offer its own mechanism. A convention +for a hyperlinked REST API JSON payload could be to use +api_url(s) identifier to specify additional "GET"able endpoints. The +AttributeCode tool uses \*_file attributes in YAML or JSON documents +to reference external license and notice text files to load with the +text content. + +Another convention is used in ScanCode to reference license texts and +license detection rules by key: +An ABC Data YAML file contains the ABC Data. And side by side there is a +file with the same base name and a LICENSE, SPDX, NOTICE or RULE +extension that contains the actual text corresponding to the license, +the SPDX text or the rule text. The convention here is to use an +implicit reference between files because they have the same base name +and different extensions. + +In the future, we may specify how to embed an external ABC Data file in +another ABC Data file; this would only apply to file-based ABC Data +payload though and could not apply to hyperlinked REST APIs. + + +Document-as-files naming, exchange and storage +---------------------------------------------- + +Each tool handling ABC Data may name an ABC Data file in any manner and +store the data in any way that is appropriate.
The structure is a set of +data exchange conventions and may be used for storage but nothing is +specified on how to do this. + +For consistency, tools consuming AboutCode Data are encouraged to use +the same data structure internally and in their user interface to +organize and name the data, but this is only a recommendation. + +For instance, the AttributeCode tool uses a convention to store ABC Data +as YAML in a file with an .ABOUT extension and uses the ABC Data structures +internally and externally. + +When exchanging data (for instance over an API) the API provider of ABC +Data should support a request to return either embedded data or data by +reference and ideally allow the caller to specify which objects and +attributes it is interested in (possibly in the future using something +like GraphQL). + +When interacting with tools through an API, the conversation could +start by sending an ABC Data payload with some extra request data and +receiving an ABC Data payload in return. For instance, when requesting +matching packages from a matching tool, you could start by passing scan +data with checksums for several files at once and receive detailed data +for each of the matched files or packages. + + +Documenting and validating attributes +------------------------------------- + +Each tool handling ABC Data may only be interested in processing certain +objects and attributes when accepting data in, or when providing data +out. Attributes that are unknown should be ignored. To document which +objects and which attributes a tool can handle, a tool should provide +some documentation. The documentation format is not specified here, but +it could use a JSON schema in the future. This should include +documentation regarding if and how data is validated, and when and how +errors or warnings are triggered and reported when there is a validation +error. For example, a validation could be to check that an SPDX license +id exists at SPDX or that a URL is valid.
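
As an illustration only, the handling described above (ignore unknown attributes, validate the known ones, report problems as warnings) could be sketched in Python as follows. The set of known attributes, the function names and the rule that any attribute ending in ``_url`` must be an absolute http(s) URL are assumptions made for this sketch; they are not defined by ABC Data.

```python
from urllib.parse import urlsplit

# Hypothetical list of attributes that this imaginary tool understands.
KNOWN_ATTRIBUTES = {"name", "version", "homepage_url", "license_expression"}


def is_valid_url(value):
    """Return True if value looks like an absolute http(s) URL."""
    if not isinstance(value, str):
        return False
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def process_component(component):
    """Return (kept, warnings) for one component mapping.

    Unknown attributes are silently ignored. Known attributes are kept,
    and a warning is recorded when a *_url attribute is not a valid URL.
    """
    kept = {}
    warnings = []
    for name, value in component.items():
        if name not in KNOWN_ATTRIBUTES:
            # Unknown attributes are ignored, not treated as errors.
            continue
        if name.endswith("_url") and not is_valid_url(value):
            warnings.append(f"{name}: not a valid URL: {value!r}")
        kept[name] = value
    return kept, warnings


component = {
    "name": "bitarray",
    "version": "0.8.1",
    "homepage_url": "not a url",
    "favorite_color": "blue",  # unknown attribute, dropped
}
kept, warnings = process_component(component)
```

Here the invalid value is kept and only flagged with a warning; a stricter tool could choose to reject it instead, and should document that choice.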
+ + +Notes on YAML format +-------------------- + +YAML is the preferred file format for ABC Data destined for reading or +writing primarily by humans. + +- Block-style is better. + +- When you write AboutCode Data as YAML, you should prefer block-style + and avoid flow-style YAML which is less readable for humans. + +- Avoid Multi-document YAML. + +- Multi-document YAML documents (those using the --- separators) + should be avoided. + +- Beware of parser shenanigans: Most YAML parsers recognize and convert + automatically certain data types such as numbers, booleans or dates. + You should be aware of this because the ABC Data strings may contain + date stamps. You may want to configure a YAML parser to deactivate some + of these automated format conversions to avoid unwanted conversions. + + +Notes on JSON Format +-------------------- + +JSON is the preferred file format for ABC Data destined for reading and +writing primarily by machines. + +- "Streamable" JSON with JSON-lines. + +A large JSON document may benefit from being readable line-by-line +rather than loaded all at once in memory. For this purpose, the +convention is to use JSON lines where each line in the document is a +valid JSON document itself: this enables reading the document in +line-by-line increments. The preferred way to do so is to provide one +ABCD top level object per line where the first line contains meta +information about the stream such as a notice, a tool version or the +aboutcode version. + +- Avoid escaped slash. + +The JSON specification says you CAN escape forward slash, but this is +optional. It is best to avoid escaping slash when not needed for better +readability.
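
As a quick check, here is a small sketch using Python's standard ``json`` module (chosen only for illustration). It shows that the escaped and unescaped forms decode to the same string, and that ``json.dumps`` writes the unescaped form by default. The variable names are made up for this example.

```python
import json

# The same URL written as a JSON string with and without escaped slashes.
escaped = r'"https:\/\/enterprise.dejacode.com\/component_catalog\/nexB\/16fusb\/1.0\/"'
plain = '"https://enterprise.dejacode.com/component_catalog/nexB/16fusb/1.0/"'

url_from_escaped = json.loads(escaped)
url_from_plain = json.loads(plain)

# Both forms carry exactly the same value once decoded.
same_value = url_from_escaped == url_from_plain

# Serializing again does not reintroduce the backslashes.
reserialized = json.dumps(url_from_plain)
```

Since both forms are equivalent for a parser, the only difference is readability for humans.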
+ +For instance for URLs this form:: + + "https://enterprise.dejacode.com/component_catalog/nexB/16fusb/1.0/" + +should be preferred over this escaped form when escaping is not +needed:: + + "https:\/\/enterprise.dejacode.com\/component_catalog\/nexB\/16fusb\/1.0\/" + + +Notes on embedding ABC Data in source code files. +------------------------------------------------- + +It could be useful to include ABC Data directly in a source code file, +such as to provide structured license and provenance data for a single +file. This of course requires modifying the file. While this is not a +preferred use case, it can be handy to document your own code one file +at a time. Using an external ABC Data file should be preferred but here +are conventions for this use case: + +- The ABC Data should be embedded in a top level block of comments. +- Inside that block of comments the preferred format is YAML. +- How a tool collects that ABC Data when embedded in code is to be + determined. +- Tools offering such support should document and possibly enforce + their own conventions. + + +Notes on spreadsheet and CSV files +---------------------------------- + +ABC Data does not support or endorse using CSV or spreadsheets for data +exchange. + +CSV and other spreadsheet file formats are NOT recommended to store ABC +Data. In most cases you cannot store a correct data set in a spreadsheet. +However, these tools are also widely used and convenient. +Here are some recommendations when you need to communicate ABC Data in +a CSV or spreadsheet format: even though ABC Data is naturally nested +and tree-like, it should be possible to serialize certain ABCD objects +as flat, tabular data. + +- Naming columns + +The table column names may need to be adjusted to correctly reference +the multiple levels of object and attribute nesting using a dot as a +separator. The dot or period is otherwise not allowed in attribute +names.
For example, you could use files.path for files or +components.name to reference a component name. Some tools may prefer to +create tabular files with their own column names and layout, and provide +mappings to ABC Data attribute and object names. + +- Example for an inventory: + +Since ABC Data can be related by reference, the preferred (and +cumbersome) way to store ABC Data in a spreadsheet is to use one tab for +each object type and use identifying attributes to relate objects +to each other across tabs. For instance, in a Bill of Materials +(BOM) spreadsheet for a Product, you could use a tab to describe the +Product attributes and another tab to describe the Components used in +this Product and possibly additional tabs to describe the related +packages and files corresponding to these Components. + +- Care is needed for Packages, Components and other names and for dates, + versions, unicode and UTF-8 to avoid damaging content (aka mojibake) + +Spreadsheet tools such as Excel or LibreOffice automatically recognize +and convert data to their own format: a date of 2016-08-17 may be +converted to a date number when a CSV is loaded and be difficult to recover +as a correct original date stamp string afterwards. Or a version 1.0 may +be irreversibly converted to 1 or 1.90 to 1.9 losing important version +information. + +Spreadsheet tools may not recognize and handle properly UTF-8 texts and +damage descriptions and texts. These tools may also treat strings +starting with the equal sign as a formula. When special accented +characters are incorrectly recognized, this may damage texts, creating what is +called "mojibake" (See https://en.wikipedia.org/wiki/Mojibake) + +Always use these tools with caution and be prepared for damage to your +data if you use these tools to save or create ABC Data. + + +Impact on AttributeCode +~~~~~~~~~~~~~~~~~~~~~~~ + +As an integration tool, AttributeCode itself may specify only very few elements.
+ +The new structure will need to be implemented. Here is a possible example +in YAML:: + + aboutcode_version: 4.0 + components: +   - source: dejacode.com +     name: bitarray +     version: 0.8.1 +     homepage_url: https://github.com/ilanschnell/bitarray +     copyright: Copyright (c) Ilan Schnell and others +     files: +       - path: some/directory/ +         type: dir +       - path: bitarray-0.8.1-cp27-cp27m-macosx_10_9_intel.whl +       - path: someotherdir/bitarray-0.8.1-cp27-cp27m-manylinux1_i686.whl +       - path: bitarray-0.8.1-cp27-cp27m-manylinux1_x86_64.whl +       - path: bitarray-0.8.1-cp27-cp27m-win_amd64.whl +       - path: bitarray-0.8.1-cp27-cp27m-win32.whl +       - path: bitarray-0.8.1-cp27-cp27mu-manylinux1_i686.whl +       - path: bitarray-0.8.1-cp27-cp27mu-manylinux1_x86_64.whl +       - path: bitarray-0.8.1-cp27-none-macosx_10_6_intel.whl +       - path: bitarray-0.8.1.tar.gz + +     parties: +       - role: owner +         name: Ilan Schnell + +     packages: +       - download_url: http://pypi.python.org/packages/source/b/bitarray/bitarray-0.8.1.tar.gz +         sha1: 468456384529abcdef342 + +     license_expression: psf + +     licenses: +       - source: scancode.com +         key: psf +         text_file: PSF.LICENSE + + +And here would be similar data in JSON:: + + {"components": [{ +                "name": "bitarray", +                "version": "0.8.1", +                "homepage_url": "https://github.com/ilanschnell/bitarray", +                "copyright": "Copyright (c) Ilan Schnell and others", +                "license_expression": "psf", +                "licenses": [{"key": "psf", "text_file": "PSF.LICENSE", "source": "scancode.com"}], +                "packages": [{"download_url": "http://pypi.python.org/packages/source/b/bitarray/bitarray-0.8.1.tar.gz", +                             "sha1": "468456384529abcdef342" +                 }], +                "parties": [{"name": "Ilan Schnell", "role":
"owner"}], + +                "files": [{"path": "some/directory/", "type": "dir"}, +                          {"path": "bitarray-0.8.1-cp27-cp27m-macosx_10_9_intel.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27m-manylinux1_i686.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27m-manylinux1_x86_64.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27m-win_amd64.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27m-win32.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27mu-manylinux1_i686.whl"}, +                          {"path": "bitarray-0.8.1-cp27-cp27mu-manylinux1_x86_64.whl"}, +                          {"path": "bitarray-0.8.1-cp27-none-macosx_10_6_intel.whl"}, +                          {"path": "bitarray-0.8.1.tar.gz"}], +                }], + +  aboutcode_version: "4.0"} + + +Impact on ScanCode Toolkit +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The new format will need to be implemented for scan results in general +and for packages in particular. + +ScanCode will specify Package and several attributes related to scanning +and referencing clues for files, directories and packages. + +Alternatively Packages could be extracted to an independent PackagedCode library. + +The changes will minimal impact onthe layout of the scan results. 
Here is an +example of a scan payload in ABCD format: this is essentially the standard scan +format:: + + { +   "scancode_notice": "Generated with ScanCode and provided .......", +   "scancode_version": "2.0.0.dev0", +   "files_count": 7, +   "files": [ +     { +       "path": "samples/JGroups/src/", +       "type": "directory", +       "files_count": 29, +       "licenses" : [ +           { "key":"apache-2.0", +             "concluded": true} +       ] +     }, +     { +       "path": "samples/JGroups/src/GuardedBy.java", +       "date": "2015-12-10", +       "programming_language": "Java", +       "sha1": "981d67087e65e9a44957c026d4b10817cf77d966", +       "name": "GuardedBy.java", +       "extension": ".java", +       "file_type": "ASCII text", +       "is_text": true, +       "is_source": true, +       "md5": "c5064400f759d3e81771005051d17dc1", +       "type": "file", +       "is_archive": null, +       "mime_type": "text/plain", +       "size": 813, +       "copyrights": [ +         { +           "end_line": 12, +           "start_line": 9, +           "holder": "Brian Goetz and Tim Peierls", +           "statement": "Copyright (c) 2005 Brian Goetz and Tim Peierls" +         } +       ], +       "licenses": [ +         { "detected": true, +           "key": "cc-by-2.5", +           "short_name": "CC-BY-2.5", +           "homepage_url": "http://creativecommons.org/licenses/by/2.5/", +           "dejacode_url": "https://enterprise.dejacode.com/license_library/Demo/cc-by-2.5/", +           "text_url": "http://creativecommons.org/licenses/by/2.5/legalcode", +           "owner": { +             "name": "Creative Commons" +           }, +           "detection_score": 100.0, +           "start_line": 11, +           "end_line": 11, +           "category": "Attribution", +           "external_reference": { +             "source": "spdx.org", +             "key": "CC-BY-2.5", +             "url": "http://spdx.org/licenses/CC-BY-2.5" +           } +         } +       ]
+     }, +     { +       "path": "samples/JGroups/src/ImmutableReference.java", +       "date": "2015-12-10", +       "md5": "48ca3c72fb9a65c771a321222f118b88", +       "type": "file", +       "mime_type": "text/plain", +       "size": 1838, +       "programming_language": "Java", +       "sha1": "30f56b876d5576d9869e2c5c509b08db57110592", +       "name": "ImmutableReference.java", +       "extension": ".java", +       "file_type": "ASCII text", +       "is_text": true, +       "license_expression": "lgpl-2.1-plus and lgpl-2.0-plus", +       "is_source": true, +       "copyrights": [{ +         "end_line": 5, +         "start_line": 2, +         "holder": "Red Hat, Inc.", +         "statement": "Copyright 2010, Red Hat, Inc." +       }], +       "licenses": [ +         { "detected": true, +           "key": "lgpl-2.1-plus", +           "category": "Copyleft Limited", +           "homepage_url": "http://www.gnu.org/licenses/old-licenses/lgpl-2.1-standalone.html", +           "start_line": 7, +           "end_line": 10, +           "short_name": "LGPL 2.1 or later", +           "owner": "Free Software Foundation (FSF)", +           "dejacode_url": "https://enterprise.dejacode.com/license_library/Demo/lgpl-2.1-plus/", +           "detection_score": 100.0, +           "external_reference": { +             "url": "http://spdx.org/licenses/LGPL-2.1+", +             "source": "spdx.org", +             "key": "LGPL-2.1+" +           } +         }, +         { "concluded": true, +           "key": "lgpl-2.0-plus", +           "short_name": "LGPL 2.0 or later", +           "homepage_url": "http://www.gnu.org/licenses/old-licenses/lgpl-2.0.html", +           "end_line": 20, +           "dejacode_url": "https://enterprise.dejacode.com/license_library/Demo/lgpl-2.0-plus/", +           "text_url": "http://www.gnu.org/licenses/old-licenses/lgpl-2.0-standalone.html", +           "owner": "Free Software Foundation (FSF)", +           "start_line": 12, +
"detection_score": 47.46, +           "category": "Copyleft Limited", +           "external_reference": { +             "url": "http://spdx.org/licenses/LGPL-2.0+", +             "source": "spdx.org", +             "key": "LGPL-2.0+" +           } +         } +       ], +     }, +     { +       "path": "samples/arch/zlib.tar.gz", +       "file_type": "gzip compressed data, last modified: Wed Jul 15 11:08:19 2015, from Unix", +       "date": "2015-12-10", +       "is_binary": true, +       "md5": "20b2370751abfc08bb3556c1d8114b5a", +       "sha1": "576f0ccfe534d7f5ff5d6400078d3c6586de3abd", +       "name": "zlib.tar.gz", +       "extension": ".gz", +       "size": 28103, +       "type": "file", +       "is_archive": true, +       "mime_type": "application/x-gzip", +       "packages": [ +         { +           "type": "plain tarball" +         } +       ], +     } +   ] + } + + +AboutCode Manager +~~~~~~~~~~~~~~~~~ + +As a primary GUI for data review and integration, AboutCode Manager +will need to be fluent in ABC Data to read/write ABC Data locally and +remotely through API from several sources.  + +The short term changes would include: + +- Support reading ABC Data from ScanCode +- Writing ABC Data, adding conclusions as related objects in the proper + lists + + +New and Future tools +~~~~~~~~~~~~~~~~~~~~ + +- TraceCode: would likely specify low level attributes for files (such + as debug symbols, etc) and how files are related from devel to deploy + and back. +- VulnerableCode: would likely specify a new Vulnerability object and + the related attributes and may track several identifiers to the NIST + NVD CPE and CVE. +- DeltaCode: would likely specify attributes to describe the changes + between codebases, files, packages. + +Copyright (c) 2016 nexB Inc. + +.. |image1| image:: image00.png +.. 
|image2| image:: image02.png diff --git a/aboutcode-data/image00.png b/aboutcode-data/image00.png new file mode 100644 index 0000000..6982e41 Binary files /dev/null and b/aboutcode-data/image00.png differ diff --git a/aboutcode-data/image01.png b/aboutcode-data/image01.png new file mode 100644 index 0000000..5ada6ef Binary files /dev/null and b/aboutcode-data/image01.png differ diff --git a/aboutcode-data/image02.png b/aboutcode-data/image02.png new file mode 100644 index 0000000..8f7f906 Binary files /dev/null and b/aboutcode-data/image02.png differ