bstefanescu edited this page May 7, 2011 · 23 revisions

Now that we've seen how the ECR runtime works, how bundles are wired and how components can be declared, we can start talking about the core part and the most important one - the content repository.

Motivation

Why do we need a content repository? Can't we just use a database to store our data? Yes, but ... in almost all content-based applications you need not only to store raw structured data - you also need semantics for access control on your data, versioning and other features not implemented by databases. Also, you don't want to spend your time re-inventing the wheel and re-implementing things like versioning, access control, database abstraction and optimization. Many programmers are tempted to start from zero, write their own logic and re-implement everything a content repository already provides, but this is wasted effort - and now that standards like CMIS are around, try to use them!

The content repository allows you to store, version, protect and search your data. The data you store is structured - even if you only want to store a binary file you must specify some common properties like the name (or title), an optional description, the content type of the file, an optional ACL for protecting your data etc. Thus, in ECR you can store anything - from binary files to structured data containing simple or complex properties, text or any other stuff you need.

Note that ECR provides a CMIS bridge to access ECR repositories using CMIS semantics.

ECR Documents

The data is stored in ECR as a unit called a document. A document always has a type (the document type) and a set of properties. Properties can be scalar (like strings, dates, numbers) or complex (like maps, lists). You can also attach binary files, which are stored in the document as special properties called blob properties. Access to a document can be protected by adding an ACL.

Document hierarchy

Documents are stored in the repository in a hierarchical way - thus every document has a parent document. The root document is the only document that doesn't have a parent. This is a special document that is created the first time the repository is initialized. You cannot remove it.

Also, each document has a unique identifier and a name. The name is a sort of local ID and it is used to identify a document inside its parent (like file names in a file system). So, the name is always unique inside the document's parent.

Note that documents cannot have multiple parents - but ECR provides a way to create document links so you can put a reference to a document in another parent.
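The hierarchy rules above can be sketched with a small toy model. This is only an illustration and not the ECR API: the `Document` class, its `add_child` and `path` methods, and the use of a random UUID as the identifier are all assumptions made for the example.

```python
import uuid


class Document:
    """Toy model of a document in a hierarchy (hypothetical, not the ECR API)."""

    def __init__(self, name, parent=None):
        self.id = str(uuid.uuid4())  # repository-wide unique identifier (assumed to be a UUID)
        self.name = name             # local ID, unique only inside the parent
        self.parent = parent         # None only for the root document
        self.children = {}           # child name -> Document

    def add_child(self, name):
        # A name may be used only once inside a given parent.
        if name in self.children:
            raise ValueError(f"name already used in this parent: {name!r}")
        child = Document(name, parent=self)
        self.children[name] = child
        return child

    def path(self):
        # The root has no parent and is addressed as "/".
        if self.parent is None:
            return "/"
        prefix = self.parent.path().rstrip("/")
        return f"{prefix}/{self.name}"


root = Document("")
photos = root.add_child("photos")
holiday = photos.add_child("holiday.jpg")
print(holiday.path())
```

As with file names in a file system, the same name can appear under two different parents, but adding it twice under the same parent is rejected.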

We will now discuss each feature related to an ECR document.

Document type

A document type defines how a document is structured and what its capabilities are. Document types can be extended to create new types that inherit the parent type's structure and capabilities.

The document type structure is defined using document schemas. A document type may have multiple schemas. This approach lets you reuse schema definitions between document types. Instead of redefining each time the same properties that belong to the same use case, you can group these properties in logical units - schemas - and then reuse them in your document types.

Example

I will take a simple example to illustrate how document schemas can be reused. Let's say you want to store two types of documents in the same repository: photos and books.

For the photo document type you want to provide the following information:

  1. a title
  2. a description
  3. the author
  4. the place where the photo was taken
  5. the format of the attached image
  6. the attached image itself
  7. and some other photo related properties.

For the book document type you want to have:

  1. a title
  2. a description
  3. the author
  4. the place where the book was written
  5. the format of the attached book file (PDF etc.)
  6. and some other book related properties

You can see that the first 4 properties are present in both the photo and the book type. So, to avoid wasting your time redefining these properties, you can simply create 4 different schemas: a common schema that groups the first 4 properties, a file schema that contains the properties for the attached file, a schema for photo-specific properties, and another one for book-specific properties.
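The photo and book example can be sketched as plain data. In ECR, schemas are defined in XSD files, so the Python sets below only illustrate the grouping idea; the property names `exposure_time` and `page_count` are hypothetical stand-ins for the "other" photo and book properties.

```python
# Each schema is a named group of property names (value types omitted for brevity).
COMMON = {"title", "description", "author", "place"}
FILE = {"format", "content"}
PHOTO = {"exposure_time"}  # hypothetical photo-specific property
BOOK = {"page_count"}      # hypothetical book-specific property

# A document type is assembled from several schemas.
DOCUMENT_TYPES = {
    "Photo": [COMMON, FILE, PHOTO],
    "Book": [COMMON, FILE, BOOK],
}


def properties_of(type_name):
    """Return every property a document type gets from its schemas."""
    props = set()
    for schema in DOCUMENT_TYPES[type_name]:
        props |= schema
    return props


# Properties shared by both types come from the reused schemas.
shared = properties_of("Photo") & properties_of("Book")
print(sorted(shared))
```

The shared properties are defined once, in the common and file schemas, and each type only adds its own specific schema.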

Built-in schemas

Because many types of documents make use of the same properties (like title, description, author etc. in our example), ECR already provides some common schemas that you can reuse when defining your document types.

Here is a list of some of these schemas:

  • dublincore schema - see http://dublincore.org
  • file schema - for attaching a blob property
  • files schema - for attaching a list of blobs
  • note schema - for creating online content like blogs etc.

The dublincore schema is one of the most important schemas since almost all document types may use it.

Note that the dublincore schema provided by ECR only contains a subset of the standard dublincore schema.

Document schemas

So, document schemas are logical units that define document properties. A schema has a name and a namespace. The namespace serves as a unique identifier for the schema, while the name is a human-readable label for the schema. The namespace provides a prefix that can be used to refer to properties in that schema, using XPath-like expressions.

For example, the name of the ECR dublincore schema is dublincore, its namespace is http://www.nuxeo.org/ecm/schemas/dublincore/ and its prefix is dc. Having both a name and a prefix may be redundant, but there are some historical reasons for this. A recommended approach when defining your schema is to use the same short string for the name and the prefix. To refer to the title property in the dublincore schema you write dc:title.

A document schema is defined using an XSD file. However, note that not all XSD semantics are recognized - only a subset of XSD is used to define schemas in ECR.

The properties defined in a schema can be scalars (primitive values like strings, numbers, dates), complex properties (like maps) or list properties. Both complex and list properties may contain other complex properties. We will see this in more detail in the Document properties section.
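To make the prefix notation concrete, here is a minimal sketch of how a prefixed name such as dc:title could be split and mapped back to its schema. The `SCHEMAS` table and the `resolve` function are hypothetical helpers written for this example; only the dublincore name, namespace and prefix come from the text.

```python
# prefix -> (schema name, namespace)
SCHEMAS = {
    "dc": ("dublincore", "http://www.nuxeo.org/ecm/schemas/dublincore/"),
}


def resolve(prefixed_name):
    """Split 'prefix:property' and look up the schema the prefix belongs to."""
    prefix, sep, local_name = prefixed_name.partition(":")
    if not sep or not prefix or not local_name:
        raise ValueError(f"expected 'prefix:name', got {prefixed_name!r}")
    if prefix not in SCHEMAS:
        raise KeyError(f"unknown schema prefix: {prefix!r}")
    schema_name, namespace = SCHEMAS[prefix]
    return schema_name, namespace, local_name


print(resolve("dc:title"))
```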

Document facets

Facets are used to express document capabilities. When defining a document type you can attach any number of facets to that type. Examples of possible facets are:

  • Versionable - document is versionable
  • Folderish - document may have children
  • HiddenInNavigation - document should be hidden when navigating through a user interface.

etc.
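The following sketch shows one way to think about facets together with type inheritance: a type carries its own schemas and facets and also inherits those of its parent type. The `DocumentType` class and the type names `Document`, `Folder` and `Album` are assumptions for illustration, not ECR definitions.

```python
class DocumentType:
    """Toy document type with inherited schemas and facets (hypothetical)."""

    def __init__(self, name, parent=None, schemas=(), facets=()):
        self.name = name
        self.parent = parent
        self._schemas = set(schemas)
        self._facets = set(facets)

    def schemas(self):
        # Own schemas plus everything inherited from the parent type chain.
        inherited = self.parent.schemas() if self.parent else set()
        return inherited | self._schemas

    def facets(self):
        inherited = self.parent.facets() if self.parent else set()
        return inherited | self._facets

    def has_facet(self, facet):
        return facet in self.facets()


document = DocumentType("Document", schemas={"dublincore"})
folder = DocumentType("Folder", parent=document, facets={"Folderish"})
album = DocumentType("Album", parent=folder, schemas={"photo"}, facets={"Versionable"})

print(sorted(album.schemas()), sorted(album.facets()))
```

Code that needs to know whether a document may have children would then test for the Folderish facet rather than for a specific type name.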

Document properties

As we've seen above, properties are defined in schemas. A property is either a scalar, complex or list property.

A property has a name and a type of value it accepts.

Scalar properties

These are the most commonly used properties. You can express any "primitive" type using scalar properties, like:

  • string
  • integer
  • double
  • date
  • boolean
  • arrays of other scalar properties

Complex properties

Complex properties are of two kinds: map-like properties and list properties. Both map and list properties are composite properties - they may contain other complex properties.
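As an illustration of nesting, the sketch below models map properties as Python dicts and list properties as Python lists, and walks into them with a '/'-separated path. The property names and the path syntax are assumptions for this example and may differ from what ECR actually uses.

```python
book = {
    "title": "An example",              # scalar property
    "keywords": ["ecr", "repository"],  # list of scalars
    "chapters": [                       # list of maps
        {"title": "Motivation", "pages": 4},
        {"title": "Documents", "pages": 12},
    ],
}


def get_value(value, path):
    """Walk a '/'-separated path; numeric segments index into lists."""
    for segment in path.split("/"):
        if isinstance(value, list):
            value = value[int(segment)]
        else:
            value = value[segment]
    return value


print(get_value(book, "chapters/1/title"))
```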

Blob properties

This is a special type of a complex property. You can use it to define a blob (an attached file). This property contains the following sub-properties:

  • name - the file name
  • mime-type - the content type of the file
  • encoding - the encoding of the data in the file
  • length - the length in bytes of the file
  • digest - the MD5 file digest
  • data - the binary content of the file
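The sub-properties listed above map naturally onto a small record. The sketch below is a hypothetical `Blob` dataclass, not the ECR blob class; it shows how the length and the MD5 digest can be derived from the binary data with Python's standard `hashlib` module.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    name: str       # the file name
    mime_type: str  # the content type of the file
    encoding: str   # the encoding of the data in the file
    length: int     # the length of the file in bytes
    digest: str     # the MD5 digest, as a hexadecimal string
    data: bytes     # the binary content of the file


def make_blob(name, mime_type, data, encoding="utf-8"):
    """Build a Blob, computing length and digest from the data itself."""
    return Blob(
        name=name,
        mime_type=mime_type,
        encoding=encoding,
        length=len(data),
        digest=hashlib.md5(data).hexdigest(),
        data=data,
    )


blob = make_blob("hello.txt", "text/plain", b"hello")
print(blob.length, blob.digest)
```

Deriving the length and digest from the data, instead of accepting them as arguments, keeps the three values consistent with each other.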

Access control - protecting your documents

Every document can be protected using a set of permissions. The object containing this information is attached to the document and is called ACP.

ACP - Access Control Policy

The ACP is an object that can be attached to a document to control permissions for a given identity that is trying to access the document. The identity is usually a user or a group of users.

Each document in the repository may have its own ACP. When performing a permission check, the ACP of the document is checked to test if the permission is granted to the given identity. This mechanism is repeated for each of the parents of the document. If none of the documents in the parent chain contains a GRANT or a DENY for that permission for the given identity, then the identity will be DENIED. If some control rule (i.e. an ACE) matches, then the outcome of that match is returned.

Conclusion: permission check is hierarchical.

Having a hierarchical permission system is very important. You can thus refine your permissions on documents by creating container documents that add more and more permission rules.

Also, ECR is able to block permission inheritance. That means that if you don't want to inherit parent permissions in a sub-tree of the document hierarchy, you can block it. You simply add a special permission rule to the ACP of the document that says: do not look in my parents for permission checks.
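The hierarchical check and the blocking of inheritance can be sketched as a walk up the parent chain. This is a simplified model with assumed names (`Doc`, `check`); in particular, the text describes blocking as a special rule inside the ACP, while the sketch represents it as a plain boolean flag.

```python
GRANT, DENY, UNKNOWN = "GRANT", "DENY", "UNKNOWN"


class Doc:
    """Toy document carrying its own permission rules (hypothetical)."""

    def __init__(self, name, parent=None, rules=(), block_inheritance=False):
        self.name = name
        self.parent = parent
        self.rules = list(rules)  # (identity, permission, GRANT or DENY)
        self.block_inheritance = block_inheritance  # simplification of the blocking rule

    def local_access(self, identity, permission):
        # The first matching rule on this document decides the outcome.
        for who, perm, outcome in self.rules:
            if who == identity and perm == permission:
                return outcome
        return UNKNOWN


def check(doc, identity, permission):
    """Walk up the hierarchy until a rule matches; deny when none does."""
    while doc is not None:
        outcome = doc.local_access(identity, permission)
        if outcome != UNKNOWN:
            return outcome == GRANT
        if doc.block_inheritance:
            break  # do not look at the parents
        doc = doc.parent
    return False


root = Doc("root", rules=[("alice", "Read", GRANT)])
workspace = Doc("workspace", parent=root)
private = Doc("private", parent=workspace, block_inheritance=True)

print(check(workspace, "alice", "Read"), check(private, "alice", "Read"))
```

Here alice inherits Read on the workspace from the root, but not on the private document, because that document stops the lookup before it reaches its parents.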

ACL - Access Control List

An ACL is defined by a name and a list of ACE objects.

When scanning an ACL to control access, each ACE is examined in turn until one rule matches the subject/permission pair being checked. If no rule in the ACL matches the given subject/permission pair, then an UNKNOWN state is returned, which triggers the check on the remaining ACLs of the document.

An ACP may contain several ACLs. Why is that? Why not attach the ACL directly to a document? Why do we need an ACP object (i.e. a list of ACLs)? At first sight, having multiple ACLs is not needed. But in fact there is an important use case:
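A minimal sketch of scanning a single ACL: entries are examined in order, the first entry matching the subject/permission pair decides the outcome, and UNKNOWN is returned when no entry matches. The `ACE` dataclass, the `Access` enum and `scan_acl` are assumed names for this illustration, not ECR classes.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    GRANT = "GRANT"
    DENY = "DENY"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class ACE:
    key: str         # the subject: a user name or group name
    permission: str  # the permission name
    granted: bool    # True for GRANT, False for DENY


def scan_acl(entries, subject, permission):
    """Return the outcome of the first matching ACE, or UNKNOWN if none matches."""
    for ace in entries:
        if ace.key == subject and ace.permission == permission:
            return Access.GRANT if ace.granted else Access.DENY
    return Access.UNKNOWN


acl = [
    ACE("alice", "Write", False),
    ACE("alice", "Write", True),  # never reached for alice/Write: the entry above matches first
    ACE("staff", "Read", True),
]

print(scan_acl(acl, "alice", "Write"), scan_acl(acl, "bob", "Read"))
```

Because the scan stops at the first match, the order of the entries in an ACL matters.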

  • Administrators may manage permissions - they add new rules (ACE entries in an ACL) to prevent people from performing actions on a document. They do this by modifying the default ACL of the document's ACP. The name of this ACL is local.
  • But some repository services may need to run in unrestricted mode on documents, even if the current user is not allowed to fully access the document. If a service needs special permissions to operate correctly, then it will add an ACL specific to that service on the document (usually the name of this ACL is the name of the service). This way, when the administrator modifies the local ACL, the internal ACLs set by repository services are not overwritten.
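The use case above can be sketched as an ACP that stores its ACLs by name, so that replacing the local ACL leaves a service ACL untouched. The `ACP` class, the `indexer` ACL name and the `indexer-service` subject are hypothetical; the check below also looks only at this one ACP and ignores the parent chain.

```python
class ACP:
    """Toy ACP holding several named ACLs (hypothetical, not the ECR class)."""

    def __init__(self):
        # ACL name -> list of (key, permission, granted) entries, kept in insertion order.
        self.acls = {}

    def set_acl(self, name, entries):
        # Replaces only the ACL with this name; other ACLs are left as they are.
        self.acls[name] = list(entries)

    def is_granted(self, subject, permission):
        # Scan each ACL in turn; the first matching entry decides.
        for entries in self.acls.values():
            for key, perm, granted in entries:
                if key == subject and perm == permission:
                    return granted
        return False  # nothing matched in this ACP


acp = ACP()
acp.set_acl("indexer", [("indexer-service", "Read", True)])  # set by a service
acp.set_acl("local", [("alice", "Read", True)])              # set by an administrator

# The administrator later rewrites the local ACL; the service ACL survives.
acp.set_acl("local", [("alice", "Read", False)])

print(acp.is_granted("indexer-service", "Read"), acp.is_granted("alice", "Read"))
```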

ACE - Access Control Entry

An ACE is composed of three values:

  1. a key - a string value used to store the subject to which the permission applies. Usually this is a user name or group name.
  2. a permission - a string value that identifies the permission
  3. GRANT or DENY - a boolean value used to allow or disallow the permission to the ACE subject.

ACP Inheritance

Document versioning

Document search - NXQL Queries

Document links - Publishing