sfermigier edited this page May 11, 2011 · 23 revisions

Now that we've seen how the ECR runtime works, how bundles are wired and how components can be declared, we can turn to the core and most important part - the content repository.

Motivation

Why do we need a content repository? Can't we just store our data in a database?

Yes, but ... in almost all content-based applications you need more than storage for raw structured data - you also need access control, versioning and other features that databases do not implement. And you don't want to spend your time reinventing the wheel by re-implementing things like versioning, access control, database abstraction and optimization.

Many programmers are tempted to start from zero and re-implement everything a content repository already provides, but this is wasted effort - and now that standards like CMIS are around, try to use them!

The content repository allows you to store, version, protect and search your data.

The data you store is structured - even if you only want to store a binary file you must specify some common properties like the name (or title), an optional description, the content type of the file, an optional ACL for protecting your data, etc.

Thus, in ECR you can store anything - from binary files to structured data containing simple or complex properties, text or any other stuff you need.

Note: ECR provides a CMIS bridge to access ECR repositories using CMIS semantics.

ECR Documents

The data is stored in ECR as a unit called a document. A document always has a type (the document type) and a set of properties, which can be scalar properties (like strings, dates, numbers) or complex properties (like maps, lists).

You can also attach binary files - that are stored in the document as special properties called blob properties.

Also, access to a document can be protected by adding permission rules to the document. See the ACP section below.

Document hierarchy

Documents are stored in the repository in a hierarchical way - thus every document has a parent document.

The root document is the only document that doesn't have a parent. This is a special document that is created the first time the repository is initialized. You cannot remove it.

Also, each document has a unique identifier and a name. The name is a sort of local ID used to identify a document inside its parent (like file names in a file system). So, the name is always unique inside the parent document.

Note: documents cannot have multiple parents - but ECR provides a way to create document links so you can put a reference to a document in another parent.

We will now discuss each feature of an ECR document.

Document type

A document type defines how a document is structured and what its capabilities are. Document types can be extended to create new types that inherit the parent type structure and capabilities.

Important: Because document types can be extended, you can create a new document type by extending an existing one and thus inherit all the parent document type definitions, like its structure (i.e. schemas) and facets. If your document type does not explicitly inherit from another document type, it automatically inherits from the document type Document, which is the root of all document types and doesn't define any structure or facet.

See the Document retrieval and search section for how this root type can be used in queries.

The document type structure is defined using document schemas. A document type may have multiple schemas. This approach lets you reuse schema definitions between document types. Instead of redefining each time the same properties that belong to the same use case, you can group these properties in logical units - schemas - and then reuse them in your document types.

Example

I will take a simple example to illustrate how document schemas can be reused. Let's say you want to store two types of documents in the same repository: photos and books.

For the photo document type you want to provide the following information:

  1. a title
  2. a description
  3. the author
  4. the place where the photo was taken
  5. the format of the attached image.
  6. the attached image itself
  7. and some other photo related properties.

For the book document type you want to have:

  1. a title
  2. a description
  3. the author
  4. the place where the book was written
  5. the format of the attached book file (PDF etc.)
  6. and some other book related properties

You can see that the first 4 properties are present in both the photo and the book type.

So, to avoid wasting time redefining these properties, you can simply create 4 different schemas: a common schema that groups the first 4 properties, a file schema that contains the property for the attached file, a schema for photo-specific properties, and another one for book-specific properties.
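The schema split described above can be modelled in a few lines. This is only an illustrative Python sketch, not the ECR API: the Schema and DocumentType classes, the schema names and the camera and isbn properties are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    """A named group of property definitions (property name -> type name)."""
    name: str
    properties: dict


@dataclass
class DocumentType:
    """A document type assembled from reusable schemas."""
    name: str
    schemas: list

    def property_names(self):
        # Qualify every property with the name of the schema that defines it.
        return [f"{s.name}:{p}" for s in self.schemas for p in s.properties]


# Hypothetical schemas for the photo/book example.
common = Schema("common", {"title": "string", "description": "string",
                           "author": "string", "place": "string"})
file_schema = Schema("file", {"format": "string", "content": "blob"})
photo_schema = Schema("photo", {"camera": "string"})  # made-up photo property
book_schema = Schema("book", {"isbn": "string"})      # made-up book property

PHOTO = DocumentType("Photo", [common, file_schema, photo_schema])
BOOK = DocumentType("Book", [common, file_schema, book_schema])

# Properties defined once and shared by both types.
shared = [p for p in PHOTO.property_names() if p in BOOK.property_names()]
print(shared)
```

Both types get the common and file properties without redefining them; only the last schema in each list is type specific.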

Built-in schemas

Because many types of documents use the same properties (like title, description, author etc. in our example), ECR already provides some common schemas that you can reuse when defining your document types.

Here is a list of some of these schemas:

  • dublincore schema - see http://dublincore.org
  • file schema - for attaching a blob property
  • files schema - for attaching a list of blobs
  • note schema - for creating online content like blogs etc.

The dublincore schema is one of the most important schemas since almost all document types may use it.

Note that the dublincore schema provided by ECR only contains a subset of the standard dublincore schema.

Document schemas

So, document schemas are logical units that define document properties.

A schema has a name and a namespace. The namespace serves as a unique identifier for the schema, while the name is a human-readable label for it. The namespace also provides a prefix that can be used to refer to properties in that schema using XPath-like expressions.

For example, the dublincore ECR schema name is dublincore, the namespace is http://www.nuxeo.org/ecm/schemas/dublincore/ and the prefix is dc. Having both a name and a prefix may be redundant, but there are historical reasons for this. A recommended approach when defining your schema is to use the same short string for the name and the prefix.

To refer to the title property in the dublincore schema you write dc:title.
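To make the name, namespace and prefix relationship concrete, here is a minimal Python sketch that splits a prefixed property name and looks up the schema it belongs to. The SCHEMAS_BY_PREFIX table and the resolve function are hypothetical helpers written for this page; ECR performs this lookup internally.

```python
# prefix -> (schema name, namespace); only the dublincore entry comes from the text.
SCHEMAS_BY_PREFIX = {
    "dc": ("dublincore", "http://www.nuxeo.org/ecm/schemas/dublincore/"),
}


def resolve(prefixed_name):
    """Split 'prefix:property' and return (schema name, namespace, property)."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or not local:
        raise ValueError(f"not a prefixed property name: {prefixed_name!r}")
    if prefix not in SCHEMAS_BY_PREFIX:
        raise ValueError(f"unknown schema prefix: {prefix!r}")
    schema, namespace = SCHEMAS_BY_PREFIX[prefix]
    return schema, namespace, local


print(resolve("dc:title"))
```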

A document schema is defined using an XSD file. However, not all XSD semantics are recognized - only a subset of XSD can be used to define ECR schemas.

The properties defined in a schema can be scalars (primitive values like strings, numbers, dates), complex properties (like maps) or list properties. Both complex and list properties may contain other complex properties. We will see this in more detail in the Document properties section.

Document facets

Facets are used to express document capabilities. When defining a document type you can attach any number of facets to that type. Examples of possible facets are:

  • Versionable - document is versionable
  • Folderish - document may have children
  • HiddenInNavigation - document should be hidden when navigating through a user interface.

etc.

Built-in document types

There are several built-in document types in ECR; the most important are "Folder", "File", "Note", "Workspace" and of course the base document type "Document".

You can find the definitions of these document types in the org.eclipse.ecr.core bundle.

Document properties

As we've seen above, properties are defined in schemas. A property is either a scalar, complex or list property.

A property has a name and a type of value it accepts.

Scalar properties

These are the most used type of properties. You can express any "primitive" type using scalar properties, like:

  • string
  • integer
  • double
  • date
  • boolean
  • arrays of other scalar properties

Complex properties

Complex properties are of two kinds: map-like properties and list properties. Both are composite properties - they may contain other complex properties.

Blob properties

This is a special type of complex property. You can use it to define a blob (an attached file).

This property contains the following sub-properties:

  • name - the file name
  • mime-type - the content type of the file
  • encoding - the encoding of the data in the file
  • length - the length in bytes of the file
  • digest - the MD5 file digest
  • data - the binary content of the file
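As an illustration of how these sub-properties relate to one another, the Python sketch below derives length and digest from the binary data. The make_blob helper is hypothetical, and storing the MD5 digest as a hexadecimal string is an assumption; the page does not say how ECR encodes it.

```python
import hashlib


def make_blob(name, data, mime_type="application/octet-stream", encoding=None):
    """Build a dict with the blob sub-properties listed above."""
    return {
        "name": name,
        "mime-type": mime_type,
        "encoding": encoding,
        "length": len(data),                      # length in bytes
        "digest": hashlib.md5(data).hexdigest(),  # assumed hex encoding
        "data": data,
    }


blob = make_blob("hello.txt", b"hello", mime_type="text/plain", encoding="UTF-8")
print(blob["length"], blob["digest"])
```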

Access control - protecting your documents

Every document can be protected using a set of permissions. The object containing this information is attached to the document and is called ACP.

ACP - Access Control Policy

The ACP is an object that can be attached to a document to control the permissions of a given identity trying to access the document. The identity is usually a user or a group of users, and the type of access is expressed by a permission.

Each document in the repository may have its own ACP. When performing a permission check, the ACP of the document is checked to test if the permission is granted to the given identity. If the document's ACP gives no answer, the check is repeated on each parent of the document in turn.

If none of the documents in the parent chain contains a GRANT or a DENY for that permission and identity, access is denied.

Conclusion: the permission check is hierarchical.

Having a hierarchical permission system is very important. You can refine permissions on documents by creating container documents that add more and more permission rules.

Also, ECR is able to block permission inheritance. If you don't want a document to inherit parent permissions, you can block inheritance by adding a special permission rule to the ACP of the document.

ACL - Access Control List

The ACP object stores permission rules in one or more ACL objects.

An ACL has a name and contains a list of permission rules (i.e. ACE objects).

When scanning an ACL for a permission check, each ACE is examined in turn until one rule matches the subject/permission pair being checked.

If no rule in the ACL matches the given subject/permission pair, an UNKNOWN state is returned, which triggers a permission check on the rest of the ACLs defined by the document ACP.

When an administrator modifies permission rules on a document, they always modify the ACL named local. The administrator should not modify the other ACLs (if any) on the document. Additional ACLs are reserved for the system, which generates special ACLs that some services may need to operate correctly. This is why a document may have multiple ACLs and not just one.

ACE - Access Control Entry

The ACE object describes a permission rule. It is composed of three values:

  1. a key - a string value that stores the subject to which the permission applies. Usually this is a user name or group name.
  2. a permission - a string value that identifies the permission
  3. GRANT or DENY - a boolean value used to allow or disallow the permission for the ACE subject.

Permissions and subjects

In ECR the subjects of a permission check are users or user groups. A user may be part of one or more groups, and groups may be part of other groups.

A permission can also be contained in another, composite permission. This is useful to refine permission checks. You can for example define a "Write" permission which contains the "SetProperty" permission and the "SetACL" permission. In that case a "Write" permission will automatically imply a "SetACL" permission.

Users, groups and permissions are all represented using a string ID in ACL objects.
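Because groups can nest and permissions can be contained in composite permissions, finding everything that implies a given permission or identity is a transitive walk. The Python sketch below shows one way to compute it. The closure function, both lookup tables, the user alice and the groups developers and members are assumptions made for the example; only the Write, SetProperty and SetACL relationship comes from the text.

```python
def closure(start, parents_of):
    """Return start plus everything reachable by following parents_of transitively."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# permission -> composite permissions that contain it
CONTAINED_BY = {"SetProperty": ["Write"], "SetACL": ["Write"]}

# user or group -> groups it is a direct member of (hypothetical names)
MEMBER_OF = {"alice": ["developers"], "developers": ["members"]}

pset = closure("SetACL", CONTAINED_BY)  # permissions that imply SetACL
iset = closure("alice", MEMBER_OF)      # identities that imply alice

print(sorted(pset), sorted(iset))
```

A rule that grants Write to members would therefore answer a SetACL check for alice.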

Checking permissions

So, to recapitulate, when checking a permission P for an identity I on a document D the following steps are taken:

  1. Resolve all permissions that imply the permission P. Result is stored in a permission set PSET.
  2. Resolve all identities that imply the identity I. Result is stored in an identity set ISET.
  3. Check in turn all ACLs present in the document D as follows:
    1. If an ACE matches one of the identities from ISET and one of the permissions from PSET then return the stored privilege for that rule (GRANT or DENY).
    2. If a blocking inheritance ACE is found return UNKNOWN.
  4. If UNKNOWN was returned get the document parent if any and repeat the procedure on the parent (from step 3).
  5. If UNKNOWN was returned and no more parents exist return DENY.

Note: when checking the ACLs present on a document (i.e. on an ACP object), the system ACLs are checked first and the local ACL is checked last.
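The procedure above can be sketched in Python as follows. This is a simplified model, not the ECR implementation: the ACE and Doc classes, the BLOCK_INHERITANCE marker and the user and group names are hypothetical, and PSET and ISET are passed in already resolved. The listed steps do not spell out how a blocking rule stops the walk up the parent chain, so the sketch assumes that reaching one ends the check with a denial.

```python
from dataclasses import dataclass, field

GRANT, DENY, UNKNOWN = "GRANT", "DENY", "UNKNOWN"


@dataclass
class ACE:
    subject: str
    permission: str
    granted: bool
    blocks_inheritance: bool = False


@dataclass
class Doc:
    name: str
    acls: dict = field(default_factory=dict)  # ACL name -> list of ACE
    parent: "Doc | None" = None


# Hypothetical marker rule; its subject and permission are never compared.
BLOCK_INHERITANCE = ACE("*", "*", False, blocks_inheritance=True)


def check_acp(doc, pset, iset):
    """Scan one document's ACLs: system ACLs first, the 'local' ACL last."""
    names = [n for n in doc.acls if n != "local"]
    if "local" in doc.acls:
        names.append("local")
    for name in names:
        for ace in doc.acls[name]:
            if ace.blocks_inheritance:
                return UNKNOWN, True
            if ace.subject in iset and ace.permission in pset:
                return (GRANT if ace.granted else DENY), False
    return UNKNOWN, False


def has_permission(doc, pset, iset):
    """Walk up the parent chain until a rule answers; deny by default."""
    current = doc
    while current is not None:
        result, blocked = check_acp(current, pset, iset)
        if result != UNKNOWN:
            return result == GRANT
        if blocked:
            break  # assumption: blocked inheritance ends the check
        current = current.parent
    return False


root = Doc("", {"local": [ACE("members", "Read", True)]})
workspace = Doc("workspaces", parent=root)
private = Doc("private",
              {"local": [ACE("bob", "Read", True), BLOCK_INHERITANCE]},
              parent=workspace)

alice = {"alice", "members"}
bob = {"bob", "members"}
print(has_permission(workspace, {"Read"}, alice),
      has_permission(private, {"Read"}, alice),
      has_permission(private, {"Read"}, bob))
```

In this model alice can read workspaces through the rule inherited from the root, but not private, where inheritance is blocked and only bob has an explicit grant.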

Document retrieval and search

You can refer to any document in the repository either by its UID (the document unique identifier, which is unique in the repository and generated at document creation time) or by its path (a UNIX-like path composed of all the document names in the hierarchy - example: /workspaces/developers/tasks/task_74).

To search the documents stored in ECR you have two options: the native ECR query language, named NXQL, or the CMIS query language.
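The two ways of addressing a document can be illustrated with a small in-memory tree. This is a Python sketch under stated assumptions: the Doc class, the BY_UID index and resolve_path are hypothetical, and using a random UUID as the UID is only a stand-in for whatever identifier ECR generates.

```python
import uuid

BY_UID = {}  # hypothetical repository-wide index: UID -> document


class Doc:
    def __init__(self, name, parent=None):
        if parent is not None and name in parent.children:
            raise ValueError(f"name {name!r} already used in this parent")
        self.uid = str(uuid.uuid4())  # generated at creation time
        self.name = name
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self
        BY_UID[self.uid] = self

    @property
    def path(self):
        """Join the names from the root down to this document."""
        names = []
        node = self
        while node.parent is not None:
            names.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(names))


def resolve_path(root, path):
    """Follow each path segment down from the root document."""
    node = root
    for name in filter(None, path.split("/")):
        node = node.children[name]
    return node


root = Doc("")
task = Doc("task_74", Doc("tasks", Doc("developers", Doc("workspaces", root))))
print(task.path)
```

Looking up BY_UID[task.uid] and calling resolve_path(root, task.path) both return the same document object.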

We will discuss here only the native ECR query language (i.e. NXQL). For the CMIS query language, refer to the CMIS specifications.

NXQL

NXQL is inspired by SQL but was adapted to query a repository tree made of typed documents, so it introduces some specific expressions and some limitations compared to what you can do in SQL.

Also, note that only selecting documents is implemented in NXQL. You cannot create, update or perform any action other than 'select' on documents.

The biggest difference from SQL in how you construct the query is the select part. As NXQL only exposes documents (and not document properties), the select part was modified to fit the document model.

So after the SELECT you must specify '*' - you cannot specify properties.

Also, you cannot specify a table name for your select since tables don't exist from the point of view of the document repository. The data can be stored anywhere and in any way - in a file system, in an RDBMS or in an object database.

Instead of specifying a table name you specify a document type.

Example

  • SELECT * FROM Document - will return all the documents in the repository.
  • SELECT * FROM File - will return all the documents of type "File" (or of a type extending File) in the repository.

The WHERE clause is similar to SQL. To select documents by putting constraints on properties, use the full property name (XPath-like) and the constraint on the value:

  • SELECT * FROM Document WHERE dc:creator='John' - will return all the documents created by John.
  • SELECT * FROM File WHERE dc:title LIKE '%june%' - will return all the documents of type File that have a title containing the 'june' string.
  • SELECT * FROM Person WHERE person:age > 18 - will return all the documents of type Person that have the property 'age' greater than 18.
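To show what these queries select, here is a small in-memory Python model of the FROM and WHERE semantics described above. It does not parse or run NXQL: the select function, the TYPE_PARENT table, the sample documents and the Picture subtype are all made up for illustration, and LIKE is approximated with a plain substring test.

```python
# type -> parent type; "Document" is the root type (Picture is hypothetical)
TYPE_PARENT = {"File": "Document", "Person": "Document", "Picture": "File"}


def is_a(doc_type, wanted):
    """True if doc_type is wanted or extends it, directly or indirectly."""
    while doc_type is not None:
        if doc_type == wanted:
            return True
        doc_type = TYPE_PARENT.get(doc_type)
    return False


def select(docs, from_type, where=lambda props: True):
    """Model of SELECT * FROM from_type WHERE where(properties)."""
    return [d for d in docs if is_a(d["type"], from_type) and where(d["props"])]


DOCS = [
    {"type": "File", "props": {"dc:title": "Report june 2010", "dc:creator": "John"}},
    {"type": "Picture", "props": {"dc:title": "Holiday", "dc:creator": "Mary"}},
    {"type": "Person", "props": {"dc:creator": "John", "person:age": 42}},
]

all_docs = select(DOCS, "Document")
files = select(DOCS, "File")
by_john = select(DOCS, "Document", lambda p: p.get("dc:creator") == "John")
june_files = select(DOCS, "File", lambda p: "june" in p.get("dc:title", ""))
adults = select(DOCS, "Person", lambda p: p.get("person:age", 0) > 18)
print(len(all_docs), len(files), len(by_john), len(june_files), len(adults))
```

Note how selecting from File also returns the Picture document, because Picture extends File in this model.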

For more information on what you can do with NXQL queries, see http://doc.nuxeo.com/display/NXDOC/Querying+and+Searching.

Document life cycle

TODO

Document versioning

TODO

Document links - Publishing

TODO

Repository sessions and transactions

Any time you use the repository you should open a new session. The session is the main entry point to the repository API.

Repository sessions are exposed to the application through JCA. This means connection pooling and transactions are available on repository sessions.

However, you do not need any knowledge of JCA to open and work with a repository session.

Transaction management is the responsibility of the caller (the one executing operations in the repository). When accessing the repository you are usually either in a web request, a listener notification context, or a background job. Each such task (web request, listener execution or job execution) may open a transaction and commit or roll back the transaction when the task ends.

If you are using default ECR entry points like web requests or listener execution, ECR will manage the transaction for you. If you create your own execution context, you must manage the transaction yourself.

For web requests, the simplest way to manage transactions is to install the TransactionFilter provided by ECR on your servlet. This filter automatically starts and ends a transaction around each request dispatched to the servlet.

For listener execution, the event service provided by ECR manages the transaction - so you do not need to worry about it.
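When you manage the transaction yourself, the usual shape is: begin, run the repository operations, commit on success and roll back on failure. The Python sketch below shows that pattern with a stand-in transaction object. FakeTransaction and the transaction helper are hypothetical and say nothing about the actual ECR or JCA API.

```python
from contextlib import contextmanager


class FakeTransaction:
    """Stand-in for a real transaction; it only records what happened to it."""

    def __init__(self):
        self.state = "active"

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


@contextmanager
def transaction(begin=FakeTransaction):
    """Commit when the block succeeds, roll back and re-raise when it fails."""
    tx = begin()
    try:
        yield tx
    except Exception:
        tx.rollback()
        raise
    else:
        tx.commit()


with transaction() as tx_ok:
    pass  # repository operations would go here

try:
    with transaction() as tx_failed:
        raise RuntimeError("operation failed")
except RuntimeError:
    pass

print(tx_ok.state, tx_failed.state)
```

Wrapping the work in a single block keeps the commit and rollback paths in one place, which is the same job a servlet filter does for web requests.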

Repository listeners

ECR provides a notification mechanism that fires every time a document is created, modified or deleted.

You can register a listener to be notified of any of these document changes. Listeners can be notified either synchronously (in the same thread as the document change operation) or asynchronously.

Synchronous listeners are executed in the same transaction as the document change operation while asynchronous listeners are executed in a separate transaction.

Here is a list of the main available events:

  • aboutToCreate - a new document is about to be created
  • documentCreated - a document was created
  • aboutToRemove - a document is about to be removed
  • documentRemoved - a document was removed
  • aboutToRemoveVersion - a document version is about to be removed
  • versionRemoved - a document version was removed.
  • beforeDocumentModification - a document is about to be modified
  • documentModified - a document was modified
  • beforeDocumentSecurityModification - a document ACP is about to be modified
  • documentSecurityUpdated - document ACP was modified
  • documentLocked - a document was locked
  • documentUnlocked - a document was unlocked
  • aboutToCopy - about to copy a document
  • documentCreatedByCopy - a new document was created by a copy operation
  • aboutToMove - about to move a document
  • documentMoved - document was moved (i.e. its parent changed)
  • documentPublished - a document was published
  • etc.

As a general rule, events whose names start with the about or before prefix are fired before document changes are saved. So, this type of event should not be handled by asynchronous listeners. You usually handle these events in a synchronous listener to make additional changes to the document (for example, generating a value for an automatic document property).

Listeners that do heavy processing should usually be declared as asynchronous listeners. For example, this is the case for a listener that automatically converts documents to formats like PDF every time a document is created or modified.
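The difference between the two kinds of listeners can be modelled with a tiny event service. This Python sketch is illustrative only: the EventService class and its methods are hypothetical, and asynchronous delivery is simulated with a queue that is drained later rather than with a separate thread and transaction.

```python
from collections import defaultdict, deque


class EventService:
    def __init__(self):
        self._sync = defaultdict(list)
        self._async = defaultdict(list)
        self._pending = deque()

    def add_listener(self, event_name, listener, asynchronous=False):
        target = self._async if asynchronous else self._sync
        target[event_name].append(listener)

    def fire(self, event_name, doc):
        # Synchronous listeners run immediately, inside the caller's operation.
        for listener in self._sync[event_name]:
            listener(event_name, doc)
        # Asynchronous listeners are only queued here.
        for listener in self._async[event_name]:
            self._pending.append((listener, event_name, doc))

    def run_pending(self):
        """Deliver queued notifications (simulates the background worker)."""
        while self._pending:
            listener, event_name, doc = self._pending.popleft()
            listener(event_name, doc)


def fill_title(event_name, doc):
    # Generate a value for a property before the document is saved.
    doc.setdefault("dc:title", doc["name"])


converted = []


def convert_to_pdf(event_name, doc):
    # Placeholder for heavy processing.
    converted.append(doc["name"] + ".pdf")


events = EventService()
events.add_listener("aboutToCreate", fill_title)
events.add_listener("documentCreated", convert_to_pdf, asynchronous=True)

doc = {"name": "report"}
events.fire("aboutToCreate", doc)
events.fire("documentCreated", doc)
converted_before = list(converted)  # still empty: async work has not run yet
events.run_pending()
print(doc["dc:title"], converted_before, converted)
```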

Audit service

ECR does not yet provide an implementation for document audit. However, it provides an API for implementing custom audit services (an audit service implementation will be available soon).

In any case, it is easy to create your own audit service. To bootstrap your service you can use repository listeners to be notified each time something happens in the repository.
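As a starting point, an audit service can be little more than a listener that appends one entry per event. The Python sketch below is a hypothetical model: AuditEntry, AuditLog and the on_event signature are invented for the example, and skipping the about and before events is a design choice of the sketch, made because those events fire before the change is saved.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    event: str
    doc_uid: str
    principal: str
    timestamp: datetime


class AuditLog:
    def __init__(self):
        self.entries = []

    def on_event(self, event, doc_uid, principal):
        """Listener callback: record events that describe a completed change."""
        if event.startswith(("aboutTo", "before")):
            return  # the change has not been saved yet
        self.entries.append(
            AuditEntry(event, doc_uid, principal, datetime.now(timezone.utc))
        )


log = AuditLog()
log.on_event("aboutToCreate", "uid-1", "alice")
log.on_event("documentCreated", "uid-1", "alice")
log.on_event("documentModified", "uid-1", "bob")
print([(e.event, e.principal) for e in log.entries])
```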

Next: Go to the Authentication section to learn how to authenticate users on your application.