moret/brgs

This is my master's project implementation. The purpose of this project is to parse a large RDF dataset, build a graph of its data, and enable keyword searches over it.

Note that this is still a work in progress.

First, rvm is your friend: if you were not prompted to use it yet, install rvm, then leave and re-enter the project folder. Second, once you are in the project folder, run:

$ bundle install

To work on development and run tests, you will need redis-server up and running. Once it is up, you can run the tests with:

$ rspec spec

And see things running integrated on your development machine with:

$ foreman start

If you want to run more workers, you can set that with foreman's concurrency parameter - but keep in mind you should have only one resque-web instance running:

$ foreman start -c web=1,worker=3
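The web and worker roles above come from the project's Procfile, which is not shown here; it would resemble the following sketch (the exact commands are guesses based on the resque setup described in this README, not taken from the repo):

```
web: resque-web -F
worker: bundle exec rake resque:work QUEUE=*
```

With a Procfile like this, the `-c web=1,worker=3` flag tells foreman how many instances of each named process to launch.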

Now you can use the API to parse N-Triples files. Try the example by running each of the following commands, waiting for each to finish by checking the log and the web interface:

$ rake brgs:admission
$ rake brgs:spider

You can point to where your web server was deployed and to a different .nt file by using environment variables. For instance:

$ NTFILE=s3.amazonaws.com/example-rdfs/stw.nt REMOTE=ec2.my.server.com rake brgs:admission
$ REMOTE=ec2.my.server.com rake brgs:spider
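Inside the rake tasks, picking up these variables might look like the following sketch (the default values and the URL shape are assumptions for illustration; check the actual tasks for the real fallbacks):

```ruby
# Hypothetical sketch: read NTFILE and REMOTE with assumed defaults.
ntfile = ENV.fetch('NTFILE', 'data/sample.nt')
remote = ENV.fetch('REMOTE', 'localhost')

# Illustrative only - the real admission endpoint may differ.
admission_url = "http://#{remote}/admission?ntfile=#{ntfile}"
puts admission_url
```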

The deploy process is based on capistrano and foreman. It needs a servers.rb file in the project root, as in this example:

set :redis_server, 'ip-10-123-32-210.ec2.internal'
server 'ec2-12-23-13-123.compute-1.amazonaws.com', :redis, :foreman, :concurrency => 'web=1,worker=2'
server 'ec2-12-23-13-124.compute-1.amazonaws.com', :foreman, :concurrency => 'worker=4'

You can also generate this file using a rake task that lists your servers tagged with brgs_roles accordingly:

$ AWS_ACCESS_KEY_ID=your-access-key AWS_SECRET_ACCESS_KEY=your-secret-access-key rake brgs:servers_rb > servers.rb
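The core of that task is turning a server list into capistrano `server` lines. The real brgs:servers_rb task reads the brgs_roles tags from EC2 (see lib/network_builder.rb); this sketch hardcodes the list purely to illustrate the output format:

```ruby
# Hypothetical sketch: generate servers.rb lines from a server list.
# The real task discovers servers via EC2 tags; these entries are examples.
servers = [
  { host: 'ec2-12-23-13-123.compute-1.amazonaws.com',
    roles: [:redis, :foreman], concurrency: 'web=1,worker=2' },
  { host: 'ec2-12-23-13-124.compute-1.amazonaws.com',
    roles: [:foreman], concurrency: 'worker=4' },
]

lines = servers.map do |s|
  roles = s[:roles].map { |r| ":#{r}" }.join(', ')
  "server '#{s[:host]}', #{roles}, :concurrency => '#{s[:concurrency]}'"
end
puts lines
```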

You can set your access keys in env.rb. Check lib/network_builder.rb for further details.

As of now, the deploy process expects all servers to be Ubuntu Server machines, accessible via SSH as the user 'ubuntu' using an SSH key. The first time you run it, and every time the requirements change, run:

$ cap deploy:setup

The first time you deploy, run the following to avoid errors when foreman starts taking care of the processes:

$ cap deploy:cold

Finally, for every later deploy, just run:

$ cap deploy

And the app code will be updated and all foreman roles restarted. You can also start and stop things individually; check cap -vT for more information.

These are the types of jobs that can be created:

- Input: an RDF file. Action: enqueue an RDF parsing job for every 1M lines.
- Input: an RDF file piece of 1M lines. Action: index nodes and relations, feed the predicate-object collection.
- Input: a graph name. Action: enqueue a graph crawler job for each source.
- Input: a graph name and a source node index. Action: run BFS, index paths, create a path processor job for each path.
- Input: a graph name and a path index. Action: index the template, feed the sparse matrix.
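As a rough illustration of the fan-out in the first job, this sketch splits an input file into fixed-size pieces, each of which could then be enqueued as its own parsing job (the helper name and enqueue step are assumptions; only the "every 1M lines" split comes from the list above):

```ruby
require 'stringio'

CHUNK_LINES = 1_000_000  # the "every 1M lines" split described above

# Hypothetical helper: yield successive chunks of at most chunk_lines lines,
# so each chunk can be enqueued as a separate RDF parsing job.
def each_chunk(io, chunk_lines = CHUNK_LINES)
  chunk = []
  io.each_line do |line|
    chunk << line
    if chunk.size == chunk_lines
      yield chunk
      chunk = []
    end
  end
  yield chunk unless chunk.empty?
end

# Tiny demo with a 5-line "file" and a chunk size of 2.
demo = StringIO.new("a\nb\nc\nd\ne\n")
chunks = []
each_chunk(demo, 2) { |c| chunks << c.size }
p chunks  # => [2, 2, 1]
```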

These are the index structures the jobs use:

- ElasticSearch: from node index to node value, and from node value to node index.
- ElasticSearch: from relation index to relation value, and from relation value to relation index.
- Redis: from node index to a list of tuples of predicate relation index and object node index.
- Redis: from path index to a list of node-relation-node-...-relation-node indexes.
- Redis: from template index to a list of relation-relation-...-relation indexes.
- Redis: from a pair of node and path indexes to a tuple of node_pos, path_len and template index.
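The Redis-backed mappings above might be laid out along these lines (the key names, encodings, and index values are assumptions for illustration; the actual scheme is in the code):

```ruby
# Hypothetical key scheme mirroring the Redis collections described above.
store = {
  'node:42:edges'   => [[7, 99], [8, 100]],  # [predicate index, object node index]
  'path:7:nodes'    => [42, 7, 99, 8, 100],  # node-relation-...-node indexes
  'template:3:rels' => [7, 8],               # relation-...-relation indexes
  'cell:42:7'       => { node_pos: 0, path_len: 2, template: 3 },
}

# A path's template is its relation sequence with the node indexes dropped:
path = store['path:7:nodes']
template = path.each_slice(2).map { |_node, rel| rel }.compact
p template  # => [7, 8], i.e. store['template:3:rels']
```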

Go on and develop.

Danilo Moret, PUC-Rio, Globo.com

About

Large RDF Graph storage, indexing, keyword searching and sub-graph extraction
