
Development Guide

Petar Petrov edited this page Jun 9, 2013 · 17 revisions

Developers Guide

for version 0.4.0


This is a short development guide that offers some insights into the structure and architecture of C3PO. If you want to contribute or develop your own features, you are at the right place and you are awesome. If you have any questions after reading this guide, please don't hesitate to contact me at my email address or via twitter @peshkira.

Want to contribute

Well, thank You! You are awesome!

  1. Fork this repository and clone your own fork (also star it :P).
  2. Finish reading this guide :).
  3. The master branch contains the last stable release. Usually there is an integration branch that has the bleeding edge of the code. Decide which of the two you want to base your new feature on and create a branch for your pull request (PR) from that state.
  4. #DEV HACK
  5. Once your feature is ready, please create a PR from your branch to the integration branch of this repository. (If for some reason no integration branch exists, you can create a PR to master.)

I will then review your code and merge it, or give you feedback. If you have any questions please contact me. Also, if you find any issues, please report them here in this repository (issues are enabled).

License

C3PO is released under the Apache 2.0 License from version 0.4.0.

System Architecture

Ok, let's dive in. In order to get an overview of the system architecture, take a look at the following stack diagram.

[Figure: C3PO Architecture]

Logically C3PO is divided into three (vertical) modules: the data model module, the core module and the applications module.

Data Model

At the bottom there is a simple data model that encapsulates digital objects (we call them Elements), their meta data properties and values, and some simple provenance information.

Core

The core module lies on top of that. It offers many different interfaces for processing and analysing the meta data.

Gatherer: On the left is the Gatherer. It is an interface that knows how to connect to a source and read in some raw meta data. Currently there is a single implementation of this interface, the LocalFileGatherer, which traverses the file system and reads in meta data files and archive files (containing meta data).

Adaptors: The next component in the core module is the set of adaptors. The adaptors are responsible for translating the raw meta data (that was gathered before) into the internal data model of C3PO. Every adaptor has to extend the AbstractAdaptor class and implement its abstract methods. In version 0.4.0 C3PO has two adaptors: one for FITS files and one for raw Apache TIKA files. Read on to see how you can implement an adaptor for any other type of meta data that you want to support.

Analysis: The analysis component offers a set of interfaces that can do some interesting things with the data. For example, it can generate an aggregated profile of the meta data (in a specified XML format), or it can select sample objects from your meta data collections that are representative of the whole collection. Some other features are implemented here as well, such as the CSV export and some of the aggregations.

Persistence This component offers an interface for reading and writing data (from the data model) to and from a back-end data store. C3PO currently has a default persistence implementation for a MongoDB data store. If you want to extend it, you will have to implement the PersistenceLayer interface. Read on to see how to do that.

Controller On top of these components, there is a Controller that exposes them to other parts of the system and acts as a facade to this functionality.

Applications

The last (vertical) part of C3PO is the applications layer. C3PO has two (arguably three) applications: a command line interface (for near-data processing), a web application (for data visualisation) and a REST interface (for integration with other services, e.g. Plato and Scout).

Project Structure

C3PO is built and organised with Maven. Well, most of it. Here is the current structure.

c3po 
|-- c3po-api     - has all interfaces and some abstract implementations
|-- c3po-core    - has all the business logic and all implementations
|-- c3po-cmd     - has a few commands and a CLI parser
|-- c3po-webapi  - the web application and REST api 

The first three subfolders are maven modules (api, core and cmd). The last one is not, because it is a play project and I didn't find any good maven play plugins at the time of creation. If you want to mavenize it, feel free to contribute.

Java Docs

If you need the java docs online, look no more: Java Docs

How to build

Well, import the project in your favourite IDE (I use Eclipse). Then you can use a Maven plugin or Maven from the command line.

In order to build the app use mvn clean install. If you are on the master branch, then this should work without any additional effort. If you are on any other branch, there might be some failing tests. If that is the case (consider fixing them ;) ), you can use mvn clean install -Dmaven.test.skip

In order to package the command line app into an executable jar, navigate to the cmd project and run mvn assembly:assembly. This will create a jar with dependencies in the target folder. Just run it with java -jar c3po-0.4.0-jar-with-dependencies.jar and you will get some usage info.

In order to run the play app in development mode, you will need the play framework (currently we use 2.0.4). Once you install it, just navigate to the webapi folder and execute play run. This will build the app and run it in development mode. All changes you make to the webapi will be immediately visible upon refresh.

Note that the web app is not yet in version 0.4.0 and you have to use the old version 0.3.0 as libraries.

Extension Points

Ok, there are a few extension points where you might want to contribute. You can add a new Adaptor (you get 1 hug), or you can add a new Gatherer (more hugs), or you can add a new persistence layer, e.g. HBase (you get 3 hugs and a couple of beers when you visit Vienna, Austria). Of course you can tinker with the implementation and add any new feature you desire. In the latter case consider looking at the ROADMAP for inspiration.

How do I create a new Adaptor?

Well, for starters you have to understand the data model of C3PO. Every digital object in C3PO is represented by the com.petpet.c3po.api.model.Element class. Every element has a collection to which it belongs, a name (it does not have to be unique, but that is not a bad idea), a unique identifier and a List of com.petpet.c3po.api.model.MetadataRecord. Every MetadataRecord represents a single value for a given property. MetadataRecords have a Property, a value (in String form), a status and a list of sources.
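The relationships described above can be pictured with a minimal, self-contained sketch. ElementSketch and RecordSketch are simplified stand-ins invented for illustration. They are not the real com.petpet.c3po.api.model classes, whose constructors and fields may differ, and the Property and source objects are reduced to plain strings here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for MetadataRecord: a single value for a single property.
class RecordSketch {
  final String property;      // the real model uses a Property object
  final String value;         // values are kept in String form
  final String status;        // status label; the concrete labels are an assumption
  final List<String> sources; // names of the tools that reported the value

  RecordSketch(String property, String value, String status, List<String> sources) {
    this.property = property;
    this.value = value;
    this.status = status;
    this.sources = sources;
  }
}

// Hypothetical stand-in for Element: one digital object with its records.
class ElementSketch {
  final String collection; // collection the object belongs to
  final String uid;        // unique identifier
  final String name;       // display name, not necessarily unique
  final List<RecordSketch> records = new ArrayList<RecordSketch>();

  ElementSketch(String collection, String uid, String name) {
    this.collection = collection;
    this.uid = uid;
    this.name = name;
  }

  // Returns every value recorded for the given property, in insertion order.
  List<String> valuesOf(String property) {
    List<String> values = new ArrayList<String>();
    for (RecordSketch r : records) {
      if (r.property.equals(property)) {
        values.add(r.value);
      }
    }
    return values;
  }
}
```

Because each record carries exactly one value, a property that was reported several times simply shows up as several records on the same element.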

If you want to create an adaptor, you will have to think about a way to map the raw meta data that you have to this model. Once you have done this, your job is simple. You just have to extend the AbstractAdaptor class and implement its abstract methods.

public class MyAdaptor extends AbstractAdaptor {

  public void configure() {
    // called once when the adaptor is initialised.
    // use it to read the configurations and setup your adaptor
  }

  public String getAdaptorPrefix() {
    // a prefix to denote this adaptor.
    return "myadaptor";
  }

  public Element parseElement( String name, String data ) {
    // parse the data here and create a new Element and add it to it...
    Element e = new Element("some-uid", "some name");
    return e;
  }
}

The name in the parseElement method is the file/object name, which often has some important information (the name itself, maybe a uid, maybe a collection name, or a timestamp, etc.). The data string is the raw content of your meta data file that you have to parse. At the end you return the Element and that is it!
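To make the two parameters more concrete, here is a small sketch of the kind of work a parseElement implementation might do. The "property=value per line" input format and the ParseSketch class are made up for this example. A real adaptor would parse whatever format its meta data files use and would put the results into MetadataRecords rather than a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: assumes a made-up "property=value per line" raw format.
class ParseSketch {

  // Strips any directory part, e.g. "collA/scan-001.txt" becomes "scan-001.txt".
  static String baseName(String name) {
    int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
    return name.substring(slash + 1);
  }

  // Turns "property=value" lines into an ordered property-to-value map.
  static Map<String, String> parseRecords(String data) {
    Map<String, String> records = new LinkedHashMap<String, String>();
    for (String line : data.split("\\r?\\n")) {
      int eq = line.indexOf('=');
      if (eq <= 0) {
        continue; // skip blank lines and lines without a property name
      }
      records.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
    }
    return records;
  }
}
```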

In order to make sure C3PO uses your adaptor, you will have to do one last (one-line) change to the com.petpet.c3po.Controller, as it currently does not load Adaptors dynamically from the class path. Just add the following line within the Controller constructor at the appropriate place to register your adaptor.

...
this.knownAdaptors.put( "MyAdaptor", MyAdaptor.class );
...
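The registration line above boils down to a name-to-class lookup. The following is a minimal, self-contained sketch of that idea. AdaptorStub, MyAdaptorStub and AdaptorRegistrySketch are invented stand-ins, and the way the real Controller instantiates adaptors may well differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AbstractAdaptor, reduced to one method.
abstract class AdaptorStub {
  abstract String getAdaptorPrefix();
}

// Hypothetical adaptor used only to have something to register.
class MyAdaptorStub extends AdaptorStub {
  String getAdaptorPrefix() {
    return "myadaptor";
  }
}

// Sketch of a static registry: every adaptor is listed by hand.
class AdaptorRegistrySketch {
  private final Map<String, Class<? extends AdaptorStub>> knownAdaptors =
      new HashMap<String, Class<? extends AdaptorStub>>();

  AdaptorRegistrySketch() {
    this.knownAdaptors.put("MyAdaptor", MyAdaptorStub.class);
  }

  // Creates a fresh adaptor for the given name, or returns null if it is unknown.
  AdaptorStub create(String name) {
    Class<? extends AdaptorStub> clazz = this.knownAdaptors.get(name);
    if (clazz == null) {
      return null;
    }
    try {
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not instantiate " + clazz.getName(), e);
    }
  }
}
```

Storing the class rather than an instance means a new adaptor object can be created for every run, which requires each adaptor to have a no-argument constructor.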

Hints:

  1. What are the other methods for? The getAdaptorPrefix() should return a simple string denoting this adaptor. Why is it important? Because you can use it to pass configuration to your adaptor. The second abstract method configure will be called exactly once before the adaptor is initialised and submitted to run. In there you can use the protected methods of the AbstractAdaptor to obtain the values for your adaptor config and do some stuff with them. The adaptor will have access to all configs starting with c3po.adaptor and c3po.adaptor.[prefix] where the prefix is the thing you return in the getAdaptorPrefix() method.

  2. The power of C3PO (and FITS) is that it offers a unified view over the data, meaning that properties that denote the same concept are always merged into one property before persisting. That is why it is important that you use the properties C3PO already knows. How the hell do I know which properties C3PO already supports? Take a look at the fits_property_mapping.properties file and the tika_property_mapping.properties file. It is a good start. If you have new properties that are not listed there, no problem, just use them in your MetadataRecords.

  3. How do I create a new Property? Use the protected getCache method from the adaptor class. It offers you methods for obtaining Source and Property objects. If these are not found in the cache, they will be retrieved from the DB and if not in the DB - a new property will be generated for you.

  4. PreProcessingRules, what are they and should I use them in my adaptor? Well, yes, it is advisable that you use them, but it is not critical. The AbstractAdaptor will give you a set of preprocessing rules (sorted descending by priority). Every ProcessingRule has a simple method that gives you a hint whether you should skip the given value or not.

  5. As adaptors are not dynamically pluggable, you will have to add the Adaptor in the Controller - just take a look there and see how the other adaptors are bound.
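The preprocessing rules from hint 4 can be pictured as follows. This is only a sketch: SkipRuleSketch, its method names and the emptyValueRule helper are assumptions made for the example and do not claim to match the real rule interface in c3po-api.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical rule shape, invented for this sketch.
interface SkipRuleSketch {
  int getPriority();
  boolean shouldSkip(String property, String value);
}

class RuleCheckSketch {

  // Returns a copy of the rules ordered from highest to lowest priority.
  static List<SkipRuleSketch> sortedByPriority(List<SkipRuleSketch> rules) {
    List<SkipRuleSketch> sorted = new ArrayList<SkipRuleSketch>(rules);
    Collections.sort(sorted, new Comparator<SkipRuleSketch>() {
      public int compare(SkipRuleSketch a, SkipRuleSketch b) {
        return Integer.compare(b.getPriority(), a.getPriority());
      }
    });
    return sorted;
  }

  // A value is dropped as soon as one rule asks for it to be skipped.
  static boolean shouldSkip(List<SkipRuleSketch> rules, String property, String value) {
    for (SkipRuleSketch rule : sortedByPriority(rules)) {
      if (rule.shouldSkip(property, value)) {
        return true;
      }
    }
    return false;
  }

  // Example rule (hypothetical): skip values that are null or blank.
  static SkipRuleSketch emptyValueRule(final int priority) {
    return new SkipRuleSketch() {
      public int getPriority() {
        return priority;
      }

      public boolean shouldSkip(String property, String value) {
        return value == null || value.trim().isEmpty();
      }
    };
  }
}
```

In an adaptor, a check like shouldSkip would typically run right before a value is turned into a MetadataRecord, so that skipped values never reach the Element.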

How do I add a new Gatherer?

Just implement the com.petpet.c3po.api.gatherer.MetaDataGatherer interface and you are ready to go. Note that you will have to change the Controller a bit if you want to use your new gatherer, as it is not loaded dynamically. Consider adding a CLI option for selecting the correct gatherer.

public interface MetaDataGatherer extends Runnable {

  void setConfig( Map<String, String> config );

  MetadataStream getNext();

  boolean hasNext();

  boolean isReady();
}

Notice the com.petpet.c3po.api.model.helper.MetadataStream interface that is returned by the getNext() method. If you have some special gatherer, you might want to subclass it and do the reading once its getData() method is called. However, if you can access the data locally, you can use the default FileMetadataStream implementation.
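As an illustration of the contract, here is a self-contained sketch of a gatherer that serves records from memory. GathererSketch and StreamSketch are local stand-ins that mirror the shapes shown above, not the real API types. The getName() accessor is an assumption, and so is the reading of isReady() as "gathering has finished".

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Local stand-in for MetadataStream; getName() is an assumed accessor.
interface StreamSketch {
  String getName();
  String getData();
}

// Local stand-in mirroring the MetaDataGatherer methods listed above.
interface GathererSketch extends Runnable {
  void setConfig(Map<String, String> config);
  StreamSketch getNext();
  boolean hasNext();
  boolean isReady();
}

// Serves name-to-raw-data pairs from a map instead of a file system.
class InMemoryGathererSketch implements GathererSketch {
  private final Map<String, String> source;
  private final Queue<StreamSketch> queue = new ConcurrentLinkedQueue<StreamSketch>();
  private volatile boolean ready = false;
  private Map<String, String> config; // stored but unused in this sketch

  InMemoryGathererSketch(Map<String, String> source) {
    this.source = source;
  }

  public void setConfig(Map<String, String> config) {
    this.config = config;
  }

  // Queues one stream per entry; the data itself is only looked up in getData().
  public void run() {
    for (final String name : source.keySet()) {
      queue.add(new StreamSketch() {
        public String getName() {
          return name;
        }

        public String getData() {
          return source.get(name);
        }
      });
    }
    ready = true; // assumption: "ready" means gathering has finished
  }

  public StreamSketch getNext() {
    return queue.poll(); // null when nothing is queued
  }

  public boolean hasNext() {
    return !queue.isEmpty();
  }

  public boolean isReady() {
    return ready;
  }
}
```

The concurrent queue and the volatile flag are there because the gatherer is a Runnable and may be filled by one thread while another one consumes it.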

How do I implement a new PersistenceLayer?

This is a bigger task. Basically, you have to implement one interface. However, you will also have to think of a way to map the Model classes to the data store and implement the serializers and deserializers for all of the Model classes.

Here is a skeleton of a class that implements the interface:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import com.petpet.c3po.api.model.Model;
import com.petpet.c3po.api.model.Property;
import com.petpet.c3po.api.model.helper.Filter;
import com.petpet.c3po.api.model.helper.NumericStatistics;
import com.petpet.c3po.utils.exceptions.C3POPersistenceException;

public class MyPersistenceLayer implements PersistenceLayer {

  public void clearCache() {
    // clears the cache and all internally managed cached results
  }

  public void close() throws C3POPersistenceException {
    // called when the connection to the datastore should be closed.
  }

  public <T extends Model> long count( Class<T> clazz, Filter filter ) {
    // counts all occurrences of the given T (subclass of Model) that are matched by the given filter
    return 0;
  }

  public void establishConnection( Map<String, String> config ) throws C3POPersistenceException {
    // called when a connection to the data store has to be established.
  }

  public <T extends Model> Iterator<T> find( Class<T> clazz, Filter filter ) {
    // finds all occurrences of the given T (subclass of Model) that match the filter
    // and retrieves an iterator over them.
    return null;
  }

  public <T extends Model> List<String> distinct( Class<T> clazz, String f, Filter filter ) {
    // finds a list of distinct strings for the given field of the given Model class matching
    // the given filter.
    return null;
  }

  public Cache getCache() {
    // obtains the cache
    return null;
  }

  public NumericStatistics getNumericStatistics( Property p, Filter filter ) throws UnsupportedOperationException, IllegalArgumentException {
    // gets the numerical statistics for the given property of the Element class
    // according to the given filter
    return null;
  }

  public <T extends Model> Map<String, Long> getValueHistogramFor( Property p, Filter filter )
      throws UnsupportedOperationException {
    // gets the value histogram for the given property of the Element class
    // according to the given filter
    return null;
  }

  public <T extends Model> void insert( T object ) {
    // inserts T into the data store
  }

  public boolean isConnected() {
    // checks if the connection to the data store is open
    return false;
  }

  public <T extends Model> void remove( Class<T> clazz, Filter filter ) {
    // removes all Model objects of the given class matching the filter
  }

  public <T extends Model> void remove( T object ) {
    // removes the given object
  }

  public void setCache( Cache c ) {
    // sets the current cache
  }

  @Override
  public <T extends Model> void update( T object, Filter f ) {
    // updates all objects matching the filter with the given object
  }
}

You might have noticed that some of these methods accept a com.petpet.c3po.api.model.helper.Filter object. The filter is a simple wrapper around a query. You will have to write a FilterSerializer that knows how to translate the filter into a data store query. The convention of what a Filter means is as follows:

The persistence layer provider should interpret the conditions by logically concatenating all conditions with an AND. If two or more filter conditions apply to the same property, then they should be applied with a logical OR. Consider the following example where we want to filter all objects from a collection 'A' that have either text/html or text/xml mimetype. In such a case the filter that will be passed to the PersistenceLayer will have three com.petpet.c3po.api.model.helper.FilterCondition objects:

first - property is 'collection' and value is 'A'
second - property is 'mimetype' and value is 'text/html'
third - property is 'mimetype' and value is 'text/xml'

Consider now that the same filter has to be applied, but it has to cover both collection 'A' and collection 'B'. The caller has to include an additional FilterCondition where the property is 'collection' and the value is 'B', and the interpreter of this filter should use a logical OR for the collection property as well.

Additionally, if a condition of the filter has a null value for a given field, then this has to be interpreted as: any value is accepted, as long as one exists for this property. For example, if a condition has the property 'mimetype' and the value null, then the filter should match all objects that have a mimetype field.

Note that a filter might also contain a com.petpet.c3po.api.model.helper.BetweenFilterCondition object for numeric properties and this should be interpreted accordingly.
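The convention can be made concrete with a small in-memory sketch that evaluates conditions against a plain property map. ConditionSketch and FilterSketch are invented for this example and only model equality and exists conditions. The BetweenFilterCondition is left out, and a real FilterSerializer would produce a data store query instead of evaluating objects itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a FilterCondition: a property and an expected value.
class ConditionSketch {
  final String property;
  final String value; // null means "any value, as long as the property exists"

  ConditionSketch(String property, String value) {
    this.property = property;
    this.value = value;
  }
}

class FilterSketch {

  // OR within one property, AND across different properties.
  static boolean matches(Map<String, String> object, List<ConditionSketch> conditions) {
    // Group the accepted values per property.
    Map<String, List<String>> byProperty = new LinkedHashMap<String, List<String>>();
    for (ConditionSketch c : conditions) {
      List<String> values = byProperty.get(c.property);
      if (values == null) {
        values = new ArrayList<String>();
        byProperty.put(c.property, values);
      }
      values.add(c.value);
    }

    for (Map.Entry<String, List<String>> group : byProperty.entrySet()) {
      String actual = object.get(group.getKey());
      if (actual == null) {
        return false; // property missing, so no condition on it can hold
      }
      boolean anyMatch = false;
      for (String expected : group.getValue()) {
        if (expected == null || expected.equals(actual)) {
          anyMatch = true; // one hit is enough within a property (OR)
          break;
        }
      }
      if (!anyMatch) {
        return false; // every property group has to match (AND)
      }
    }
    return true;
  }
}
```

With the three conditions from the example, an object in collection 'A' with mimetype text/xml matches, while an object in collection 'B' with mimetype text/html only matches once the additional condition for collection 'B' is added.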

Tips

  • For the tests, mongo needs to be running. If it is not running, the tests will be skipped (no failures will be reported).
  • If you hack on the web app and for some reason you need to change something in the other projects (api or core), you will have to regenerate the jars and copy them to the lib folder of the webapi project. (This has to be done because there is no maven repository for the c3po libraries.) If you do this, you will unfortunately have to restart the app. This will no longer be necessary from version 0.4.0 of the web app.