This is my master’s project implementation. The purpose of this project is to parse a large RDF dataset, create a graph of its data and enable keyword searches on it.
Note that this is still a work in progress.
First, rvm is your friend. If you have not been prompted yet, install rvm, then leave and re-enter the project folder. Second, once you are in the project folder, run:
$ bundle install
To work on development and run tests, you will need redis-server up and running. After it’s up, you can run the tests with:
$ rspec spec
And see things integrated in your development machine with:
$ foreman start
If you want to run more workers, you can set foreman’s concurrency parameter, but keep in mind that only one resque-web should be running:
$ foreman start -c web=1,worker=3
Now you can use the API to parse NTriples files. See the example by running each command and waiting for them to finish by checking the log and the web interface:
$ rake brgs:admission
$ rake brgs:spider
You can point to where your web server was deployed and to a different nt file by using environment variables. For instance:
$ NTFILE=s3.amazonaws.com/example-rdfs/stw.nt REMOTE=ec2.my.server.com rake brgs:admission
$ REMOTE=ec2.my.server.com rake brgs:spider
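As a rough illustration of how a task could pick up these two variables, here is a minimal sketch. The helper name `brgs_settings` and the fallback values are hypothetical and are not taken from the project; check the Rakefile for the real behaviour.

```ruby
# Hypothetical helper: collect the settings a rake task would need.
# The fallback values are placeholders for illustration only.
def brgs_settings(env = ENV)
  {
    ntfile: env.fetch('NTFILE', 'example.nt'),
    remote: env.fetch('REMOTE', 'localhost')
  }
end

settings = brgs_settings({ 'NTFILE' => 's3.amazonaws.com/example-rdfs/stw.nt' })
puts settings[:ntfile] # the value that was passed in
puts settings[:remote] # falls back to the placeholder
```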
The deploy process is based on capistrano and foreman. It needs a servers.rb file on the project root, as in this example:
set :redis_server, 'ip-10-123-32-210.ec2.internal'
server 'ec2-12-23-13-123.compute-1.amazonaws.com', :redis, :foreman, :concurrency => 'web=1,worker=2'
server 'ec2-12-23-13-124.compute-1.amazonaws.com', :foreman, :concurrency => 'worker=4'
You can also generate this file with a rake task that lists your servers according to their brgs_roles tag:
$ AWS_ACCESS_KEY_ID=your-access-key AWS_SECRET_ACCESS_KEY=your-secret-access-key rake brgs:servers_rb > servers.rb
You can set your access keys in env.rb. Check lib/network_builder.rb for further details.
As of now, the deploy process expects all servers to be Ubuntu Servers, accessible via SSH by the user ‘ubuntu’ using an SSH key. The first time you run it, and every time the requirements change, run:
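To make the servers.rb format concrete, the sketch below renders the same kind of file from data that is already in memory. It does not talk to AWS, and `render_servers_rb` and the hash keys are hypothetical names; the real task in lib/network_builder.rb may work quite differently.

```ruby
# Hypothetical helper: render servers.rb content from already-known server data.
# servers is an array of hashes with :host, :roles and :concurrency keys
# (these key names are assumptions made for this sketch).
def render_servers_rb(redis_server, servers)
  lines = ["set :redis_server, '#{redis_server}'"]
  servers.each do |s|
    roles = s[:roles].map { |r| ":#{r}" }.join(', ')
    lines << "server '#{s[:host]}', #{roles}, :concurrency => '#{s[:concurrency]}'"
  end
  lines.join("\n") + "\n"
end

puts render_servers_rb(
  'ip-10-123-32-210.ec2.internal',
  [
    { host: 'ec2-12-23-13-123.compute-1.amazonaws.com',
      roles: %w[redis foreman], concurrency: 'web=1,worker=2' },
    { host: 'ec2-12-23-13-124.compute-1.amazonaws.com',
      roles: %w[foreman], concurrency: 'worker=4' }
  ]
)
```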
$ cap deploy:setup
The first time you deploy, run the following to avoid errors when foreman starts managing the processes:
$ cap deploy:cold
Finally, every time you wish to deploy after that, just run:
$ cap deploy
And the app code will be updated and all foreman roles restarted. You can also start and stop things individually: check cap -vT for more information.
These are the types of jobs that can be created:
Input: RDF File Action: Enqueue an RDF Parsing job for every 1M lines
Input: RDF file piece with 1M lines Action: Index nodes and relations, feed predicate-object collection
Input: Graph name Action: Enqueue a Graph Crawler job for each source
Input: Graph name and source node index Action: Run BFS, index paths, create Path Processer job for each path
Input: Graph name and path index Action: Index template, feed sparse matrix
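The first four job types above can be sketched with plain Ruby collections standing in for Resque, ElasticSearch and Redis. Everything named here (`each_chunk`, `index_chunk`, `bfs_paths`, the `max_len` cap, and the choice to keep every simple path) is an assumption made for illustration, not the project's actual code. The triple regex is deliberately naive and is not a full N-Triples parser.

```ruby
# Admission: hand out one parsing job per chunk of lines (1M in the project).
def each_chunk(lines, chunk_size = 1_000_000)
  return enum_for(:each_chunk, lines, chunk_size) unless block_given?

  lines.each_slice(chunk_size) { |chunk| yield chunk }
end

# Naive "<s> <p> <o> ." matcher; real N-Triples literals need a proper parser.
TRIPLE = /\A(\S+)\s+(\S+)\s+(.+?)\s*\.\s*\z/

# RDF parsing: assign indexes to nodes and relations, and feed the
# predicate-object collection (node index => [[relation index, node index], ...]).
def index_chunk(chunk, nodes, relations, adjacency)
  chunk.each do |line|
    match = TRIPLE.match(line.strip)
    next unless match

    subject, predicate, object = match.captures
    adjacency[nodes[subject]] << [relations[predicate], nodes[object]]
  end
end

# Graph crawling: breadth-first enumeration of cycle-free paths from a source.
# A path alternates node, relation, node, ..., relation, node indexes.
# max_len caps the number of relations per path (an assumption of this sketch).
def bfs_paths(adjacency, source, max_len = 3)
  paths = []
  queue = [[source]]
  until queue.empty?
    path = queue.shift
    adjacency.fetch(path.last, []).each do |relation, object|
      # Nodes sit at even positions; skip objects already on this path.
      next if path.each_slice(2).any? { |node, _| node == object }

      extended = path + [relation, object]
      paths << extended
      queue << extended if (extended.length - 1) / 2 < max_len
    end
  end
  paths
end

# Usage: indexes are handed out in order of first appearance.
nodes     = Hash.new { |h, k| h[k] = h.size }
relations = Hash.new { |h, k| h[k] = h.size }
adjacency = Hash.new { |h, k| h[k] = [] }
lines = ['<a> <knows> <b> .', '<b> <knows> <c> .', '<a> <likes> <c> .']

each_chunk(lines, 2) { |chunk| index_chunk(chunk, nodes, relations, adjacency) }
bfs_paths(adjacency, nodes['<a>'], 2).each { |path| p path }
```

In the real system each chunk and each resulting path would be handed to a separate queued job rather than processed inline.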
Uses: ElasticSearch From node index to node value and from node value to node index
Uses: ElasticSearch From relation index to relation value and from relation value to index
Uses: Redis From node index to list of tuples with predicate relation index and object node index
Uses: Redis From path index to list of node-relation-node-…-relation-node indexes
Uses: Redis From template index to list of relation-relation-…-relation indexes
Uses: Redis From pair of node-path indexes to tuple of node_pos, path_len, template index
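The mappings above can be pictured with in-memory structures. The sketch below is only an illustration: `TwoWayIndex` and `matrix_entries` are hypothetical names, plain Ruby hashes replace ElasticSearch and Redis, and `path_len` is assumed to count the relations in a path (the README does not say whether it counts relations or nodes).

```ruby
# Stand-in for the two-way value/index lookups kept in ElasticSearch.
class TwoWayIndex
  def initialize
    @index_of = {}
    @value_of = []
  end

  # Return the existing index for value, or assign the next free one.
  def index(value)
    @index_of.fetch(value) do
      @value_of << value
      @index_of[value] = @value_of.length - 1
    end
  end

  # Reverse lookup: from index back to the stored value.
  def value(idx)
    @value_of[idx]
  end
end

# Path processing: derive the template of a path and the sparse matrix
# entries keyed by [node index, path index].
# A path alternates node, relation, node, ..., relation, node indexes.
def matrix_entries(path, path_index, templates)
  nodes, template = path.partition.with_index { |_, i| i.even? }
  template_index = templates.index(template)

  entries = {}
  nodes.each_with_index do |node, pos|
    # value tuple: node_pos, path_len (relations, by assumption), template index
    entries[[node, path_index]] = [pos, template.length, template_index]
  end
  entries
end

# Usage: path 7 goes node 0 -(10)-> node 1 -(12)-> node 2.
templates = TwoWayIndex.new
matrix_entries([0, 10, 1, 12, 2], 7, templates).each do |key, value|
  puts "#{key.inspect} => #{value.inspect}"
end
```

Two paths that use the same sequence of relations share one template index, which is what lets the sparse matrix refer to templates compactly.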
Go on and develop.
Danilo Moret PUC-Rio Globo.com