Aurum: Data Discovery at Scale

Aurum consists of three layers that can be run independently. L1 is the data discovery profiler. Its purpose is to read raw data from CSV files or databases and create profiles, concise representations of the information, that are stored in a configured store (Elasticsearch for the time being). L2 reads the profiles from the store and creates a model that represents the relationships between them. This model is then used by L3, the discovery API, to answer queries posed by users.
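
To make the division of labor between the three layers concrete, here is a toy sketch in Python. It is not Aurum's code and does not use Aurum's algorithms: the function names, the profile contents and the Jaccard-overlap rule are all assumptions chosen only to illustrate the profile, model, query flow described above.

```python
from itertools import combinations


def profile_column(values):
    """L1 stand-in: reduce a raw column to a small profile."""
    return {"count": len(values), "distinct": set(values)}


def build_model(profiles, threshold=0.5):
    """L2 stand-in: link columns whose distinct values overlap enough.

    The overlap measure (Jaccard similarity) is an assumption made for
    this illustration, not the relationship Aurum actually computes.
    """
    edges = {name: set() for name in profiles}
    for a, b in combinations(sorted(profiles), 2):
        da, db = profiles[a]["distinct"], profiles[b]["distinct"]
        union = da | db
        similarity = len(da & db) / len(union) if union else 0.0
        if similarity >= threshold:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def similar_columns(model, column):
    """L3 stand-in: answer a user query against the model."""
    return sorted(model.get(column, ()))


# Hypothetical input data, keyed by "table.column".
columns = {
    "employees.dept": ["sales", "hr", "it", "sales"],
    "budgets.department": ["sales", "hr", "it"],
    "employees.age": [31, 45, 28, 39],
}

profiles = {name: profile_column(vals) for name, vals in columns.items()}
model = build_model(profiles)
print(similar_columns(model, "employees.dept"))  # ['budgets.department']
```

Each stage only consumes the output of the previous one, which is why the layers can be run independently.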

What follows is a brief, work-in-progress description of how to build, configure and deploy the three layers.

Deploying L1

The profiler is built in Java (you can find it under /ddprofiler) and is meant to be deployed standalone. The input to L1 is the set of data sources to analyze; the output is written to a store. Elasticsearch is the only store supported at the moment. Below you can find instructions to build and deploy L1, as well as to install and configure Elasticsearch.

Building L1

Go to the 'ddprofiler' directory (visible from the project root) and run:

$> ./gradlew clean fatJar

Note that the Gradle wrapper (gradlew) does not require you to install Gradle itself; it handles the entire build process on its own.

After that command, L1 is built into a single jar file that you can find in ddprofiler/build/libs/ddprofiler.jar

Deploying Elasticsearch (tested with 2.3.3)

Download the software from:

Uncompress it and then simply run from the root directory:

$> ./bin/elasticsearch

This starts the server at localhost:9200 by default, which is the address you should use to configure L1.
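
Before configuring L1, it can be useful to confirm that the server is actually answering on that address. The following is a minimal sketch using only the Python standard library; the helper names store_url and store_is_reachable are made up for this example and are not part of Aurum or Elasticsearch.

```python
import urllib.error
import urllib.request


def store_url(host="localhost", port=9200):
    """Build the HTTP address of the store (defaults match the text above)."""
    return "http://{}:{}/".format(host, port)


def store_is_reachable(host="localhost", port=9200, timeout=2.0):
    """Return True if an HTTP GET on the store's root URL answers with 200."""
    try:
        with urllib.request.urlopen(store_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


if __name__ == "__main__":
    print(store_url(), "reachable:", store_is_reachable())
```

If your installation listens on a different host or port, pass them explicitly, for example store_is_reachable("localhost", 9201).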

Configuration of L1

L1 can run in online mode, in which it receives commands and data sources to analyze through a REST API, or offline mode, in which you can indicate the folder with data sources to analyze.

For offline mode, this is a typical configuration:

$> java -jar --execution.mode 1 --sources.folder.path is used internally to identify the folder of data

--execution.mode indicates whether L1 runs online (0), offline reading from files (1), or offline reading from a DB (2).

--sources.folder.path indicates, when the execution mode is 1, the folder with the files to process (CSV files only for now).

You can consult all configuration parameters by appending --help or <?> as a parameter. In particular, you may want to change the default Elasticsearch ports (see --store.http.port and --store.port) if your installation does not use the default ones.
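
If you launch the profiler from a script, the options described above can be assembled programmatically. The sketch below is only an illustration: the ExecutionMode names, the build_profiler_command helper and the /data/csv folder are hypothetical, and it assumes that --store.http.port takes its value as the next argument, the same way --execution.mode does. The profiler may need further options; check --help for the full list.

```python
from enum import IntEnum


class ExecutionMode(IntEnum):
    """Values of --execution.mode as described above (names are made up)."""
    ONLINE = 0
    OFFLINE_FILES = 1
    OFFLINE_DB = 2


def build_profiler_command(jar_path, sources_folder=None,
                           mode=ExecutionMode.OFFLINE_FILES,
                           store_http_port=None):
    """Return the argument list for launching L1, suitable for subprocess.run."""
    command = ["java", "-jar", jar_path, "--execution.mode", str(int(mode))]
    if mode == ExecutionMode.OFFLINE_FILES:
        if sources_folder is None:
            raise ValueError("offline file mode needs a sources folder")
        command += ["--sources.folder.path", sources_folder]
    if store_http_port is not None:
        # Assumption: the port is passed as a separate argument.
        command += ["--store.http.port", str(store_http_port)]
    return command


# Jar location taken from the build step; the data folder is hypothetical.
command = build_profiler_command("ddprofiler/build/libs/ddprofiler.jar", "/data/csv")
print(" ".join(command))
```

Returning a list rather than a single string avoids shell quoting problems when the folder path contains spaces.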

Running L2


Requires Python 3 (tested with 3.4.2, 3.5.0 and 3.5.1). Use requirements.txt to install all the dependencies:

$> pip install -r requirements.txt

On a vanilla Debian-based Linux system, the following packages need to be installed system-wide:

  • sudo apt-get install pkg-config libpng-dev libfreetype6-dev (requirement of matplotlib)
  • sudo apt-get install libblas-dev liblapack-dev (speeding up linear algebra operations)
  • sudo apt-get install lib32ncurses5-dev

Some notes for Mac users:

There have been some problems with uWSGI. One quick workaround is to remove the version constraint specified in the requirements.txt file.

There have been problems when using any Elasticsearch version other than 2.3.3.


The core implementation of L2 lives in a single file. It accepts one parameter, --opath, which expects the path to the folder where you want to store the built model. For example:

$> python --opath test/testmodel/

Once the model is built, it will be serialized into the provided path.
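
As a rough illustration of this pattern (a script that takes --opath and writes the built model into that folder), consider the sketch below. It is not Aurum's implementation: the use of pickle, the model.pickle file name and the toy model are assumptions, since this README does not say how the model is serialized.

```python
import argparse
import os
import pickle
import tempfile


def parse_args(argv=None):
    """Parse the single --opath option described above."""
    parser = argparse.ArgumentParser(description="Build a model and store it in a folder.")
    parser.add_argument("--opath", required=True,
                        help="folder where the built model is stored")
    return parser.parse_args(argv)


def save_model(model, opath, filename="model.pickle"):
    """Serialize the model into the output folder (pickle is an assumption)."""
    os.makedirs(opath, exist_ok=True)
    target = os.path.join(opath, filename)
    with open(target, "wb") as handle:
        pickle.dump(model, handle)
    return target


def load_model(opath, filename="model.pickle"):
    """Read back a model previously written by save_model."""
    with open(os.path.join(opath, filename), "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Demo in a scratch directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as scratch:
        args = parse_args(["--opath", os.path.join(scratch, "testmodel")])
        toy_model = {"employees.dept": ["budgets.department"]}
        save_model(toy_model, args.opath)
        print(load_model(args.opath) == toy_model)
```

Creating the folder with exist_ok=True means the same output path can be reused across runs.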

Running L3

The core implementation of Aurum's API lives in a single file. One easy way to access it is to deserialize the desired model and construct an API object with that model. The easiest way to do so is to import the init_system() function from main. Something like:

$> from main import init_system
$> api, reporting = init_system(, reporting=False)

The last parameter of init_system, reporting, controls whether you want to create a reporting API that gives you access to statistics about the model. Feel free to enable it, but beware that it may take a long time when the model is big.
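
Because enabling reporting can be slow, it may help to measure how long initialization takes before deciding whether to keep it on. The sketch below wraps any init_system-like function and times it. Everything here is an assumption made for illustration: load_api and fake_init_system are hypothetical helpers, and model_path stands in for the first argument of init_system, which is not shown above. To try it against the real function, pass the init_system imported from main instead of the stub.

```python
import time


def load_api(init_fn, model_path, reporting=False):
    """Call an init_system-like function and report how long it took.

    Assumes init_fn takes the model location as its first argument and
    returns an (api, reporting) pair, as in the example above.
    """
    start = time.perf_counter()
    api, reporting_api = init_fn(model_path, reporting=reporting)
    elapsed = time.perf_counter() - start
    return api, reporting_api, elapsed


def fake_init_system(model_path, reporting=False):
    """Stub used only so this sketch runs without Aurum installed."""
    api = {"model_path": model_path}
    reporting_api = {"stats": "placeholder"} if reporting else None
    return api, reporting_api


api, reporting_api, seconds = load_api(fake_init_system, "test/testmodel/")
print("initialized in {:.3f}s, reporting enabled: {}".format(seconds, reporting_api is not None))
```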