
Process does not start with huge database #61

Closed
SergeC opened this issue Jul 9, 2014 · 10 comments

Comments

SergeC commented Jul 9, 2014

I've downloaded the Freebase data dump (27 GB, freebase-rdf-2014-07-06-00-00.gz) and uncompressed it (`gzip -cd freebase-rdf-2014-07-06-00-00.gz > freebase.nt`, giving a 330 GB freebase.nt). When I start processing it, it takes a lot of time and then the process gets killed. Log:
root@ns501558:/home# time ./cayley_0.3.0-pre_linux_amd64/cayley http --dbpath=freebase.nt
Killed

real 16m34.086s
user 14m44.880s
sys 0m13.356s

Is there any solution for this?

Maybe it's better to use compressed databases, like I suggested before in #57?

miku commented Jul 9, 2014

The standard backend is mem, so unless you are on a machine with a terabyte of RAM (just estimating), the process will get killed at some point when it tries to allocate memory. You could try another backend, e.g. leveldb via -db=leveldb. I just loaded 106M triples (about 1/17th of Freebase) into a leveldb backend in about 13 hours.
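For example, instead of serving freebase.nt from memory as above, you would point the server at a prepared leveldb store (a minimal sketch reusing the flags from this thread; /tmp/testdb is a placeholder path, and the init/load steps are shown a few comments below):

# Sketch only: serve a previously initialized and loaded leveldb store
# instead of the default in-memory backend.
$ cayley http -db="leveldb" -dbpath="/tmp/testdb"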

SergeC (Author) commented Jul 9, 2014

Can you please provide more detailed instructions on how to run it? Step-by-step instructions would be great.
Do I need to install leveldb?
How do I load the Freebase dump into leveldb?

miku commented Jul 9, 2014

What OS are you targeting?

SergeC (Author) commented Jul 9, 2014

My dev machine is OS X 10.9 with 16 GB of RAM, but it has very limited HDD space; I'd like to use OS X.
My server is Ubuntu 14.04 64-bit with 2 GB of RAM, and it hosts the uncompressed database.

I can connect to the server via NFS and use its HDD space.

miku commented Jul 9, 2014

Setting up leveldb was quite straightforward[1]. Replace /tmp/testdb with the path where the DB should be created.

$ cayley init -db="leveldb" -dbpath="/tmp/testdb"
$ ls /tmp/testdb
000001.log  CURRENT  LOCK  LOG  MANIFEST-000000

Then load the triples into the database. Note that this process will take hours or days if you want to import the complete 330 GB from Freebase (extrapolating from my single data point of 106M triples in 13h, the full Freebase import would take over 200 hours[2]).

$ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt"
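For reference, the 200-hour figure is just that single data point scaled up (back-of-the-envelope only):

# Back-of-the-envelope check: the 106M-triple sample was roughly 1/17th of
# the dump and took about 13 hours to load.
$ echo $((106 * 17)) million triples
1802 million triples
$ echo $((13 * 17)) hours
221 hours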

[1] You do not need to install leveldb beforehand, since the library Cayley uses is pure Go, too.

[2] I am looking at methods to speed up the import. I think there are two ways: a) make the input smaller, e.g. by applying namespace or vocabulary shortcuts (I wrote a prototype for that); b) use a distributed backend.
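To illustrate what such a namespace shortcut can look like (this is not the prototype, just a rough sed sketch; it assumes the dump uses the usual <http://rdf.freebase.com/ns/...> prefix and picks fb: as an arbitrary short token):

# Rough sketch only: rewrite the shared namespace prefix to a short token so
# every triple carries fewer bytes before loading.
$ sed 's|<http://rdf\.freebase\.com/ns/|<fb:|g' freebase.nt > freebase-short.nt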

SergeC (Author) commented Jul 9, 2014

How do I query the database after the data load ($ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt")?

What do you think about using grep for simple queries? It's also possible to use zegrep on the compressed dump file.

Maybe it's possible to cut out only some domains? For example, I need only the /music domain and don't care about the rest of the data. So cut /music, load it, use it...

What do you think about asking the developers to add the possibility to load only certain domains ($ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt" -domain="/music")?

Can your prototype export only the /music domain from freebase.nt to a JSON file? Then I'll load it into MongoDB and that would be fine for me.

miku commented Jul 9, 2014

a) I think the cayley docs are quite nice,
b) grep can be a viable option for ad hoc searches, depending on your use case,
c) I would argue against something like -domain, since files can be easily preprocessed with command line tools (see the sketch after this list),
d) No, there is no domain filter in my prototype; I'd just grep out the /music n-triples and then convert those to JSON.
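For (c)/(d), a rough sketch of that kind of preprocessing (it assumes the /music domain shows up as <http://rdf.freebase.com/ns/music....> terms; adjust the pattern to whatever actually appears in the dump):

# Sketch only: pull out the lines mentioning the /music domain, reading
# straight from the compressed dump with zegrep as suggested above.
$ zegrep '<http://rdf\.freebase\.com/ns/music\.' freebase-rdf-2014-07-06-00-00.gz > music.nt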

SergeC (Author) commented Jul 9, 2014

If you add to your tool nttoldj the possibility to read the compressed FB data dump (described in #57 (comment)) and to process only certain domains, the data load process should become faster. And I assume it would then be possible to run multiple instances.
Even better, you could use a modern CPU with an onboard memory controller and 32 GB of RAM, upload the Freebase dump (28 GB) to a RAM disk, and operations would be faster still.
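If you do go the RAM-disk route on such a box, the setup itself is simple (a sketch for Linux; the size and mount point are placeholders, and the machine needs that much free RAM):

# Sketch only: mount a tmpfs-backed RAM disk and copy the dump onto it.
$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
$ cp freebase-rdf-2014-07-06-00-00.gz /mnt/ramdisk/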

> since files can be easily preprocessed with command line tools

Can you provide me with some examples of commands? Currently I need to do this https://www.freebase.com/user/sergec/views/artists_by_record_label?mql= but I get the whole list.

barakmich (Member) commented:

@miku is spot-on. I'll also point out that by tweaking some of the database flags (i.e., the ones in the configuration docs) you can load even more triples into leveldb in even less time :)
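As a sketch of the kind of tweak meant here (the flag and key names below are assumptions, not verified against the configuration docs, so treat them as placeholders and look up the real ones there):

# Hypothetical sketch: a config file with a larger load batch size; the
# -config flag and the key names are assumptions, check the configuration docs.
$ cat > cayley.cfg <<'EOF'
{
    "database": "leveldb",
    "db_path": "/tmp/testdb",
    "load_size": 10000
}
EOF
$ cayley load -config="cayley.cfg" -triples="freebase.nt"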

barakmich (Member) commented:

Closing, as this is out of scope and has a good answer.
