
Process does not start with huge database #61

Closed
SergeC opened this issue Jul 9, 2014 · 10 comments

Comments

SergeC commented Jul 9, 2014

I've downloaded the Freebase data dump (27 GB, freebase-rdf-2014-07-06-00-00.gz) and uncompressed it (`gzip -cd freebase-rdf-2014-07-06-00-00.gz > freebase.nt`, giving a 330 GB freebase.nt). When I start processing it, it takes a lot of time and then the process gets killed. Log:
root@ns501558:/home# time ./cayley_0.3.0-pre_linux_amd64/cayley http --dbpath=freebase.nt
Killed

real 16m34.086s
user 14m44.880s
sys 0m13.356s

Is there any solution for this?

Maybe it's better to use compressed databases, like I suggested before in #57?

miku commented Jul 9, 2014

The standard backend is mem, so unless you are on a machine with a terabyte of RAM (just estimating), the process will get killed at some point when it tries to allocate memory. You could try another backend, e.g. leveldb via -db=leveldb. I just loaded 106M triples (about 1/17th of Freebase) into a leveldb backend in about 13 hours.
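For example, instead of serving freebase.nt from memory as above, you would point the server at a prepared leveldb store (a minimal sketch reusing the flags from this thread; /tmp/testdb is a placeholder path, and the init/load steps are shown a few comments below):

# Sketch only: serve a previously initialized and loaded leveldb store
# instead of the default in-memory backend.
$ cayley http -db="leveldb" -dbpath="/tmp/testdb"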

SergeC (Author) commented Jul 9, 2014

Can you please provide more detailed instructions on how to run it? Step-by-step instructions would be great.
Do I need to install leveldb?
How do I load the Freebase dump into leveldb?

miku commented Jul 9, 2014

What OS are you targeting?

SergeC (Author) commented Jul 9, 2014

My dev machine is OS X 10.9 with 16 GB of RAM, but it has very limited HDD space; I'd like to use OS X.
My server is Ubuntu 14.04 64-bit with 2 GB of RAM, and it hosts the uncompressed database.

I can connect to the server via NFS and use its HDD space.

miku commented Jul 9, 2014

Setting up leveldb was quite straightforward[1]. Replace /tmp/testdb with the path where the DB should be created.

$ cayley init -db="leveldb" -dbpath="/tmp/testdb"
$ ls /tmp/testdb
000001.log  CURRENT  LOCK  LOG  MANIFEST-000000

Then load the triples into the database. Note that this process will take hours or days if you want to import the complete 330 GB from Freebase (extrapolating from my single data point of 106M triples in 13h, the full Freebase import would take over 200 hours[2]).

$ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt"
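For reference, the 200-hour figure is just that single data point scaled up (back-of-the-envelope only):

# Back-of-the-envelope check: the 106M-triple sample was roughly 1/17th of
# the dump and took about 13 hours to load.
$ echo $((106 * 17)) million triples
1802 million triples
$ echo $((13 * 17)) hours
221 hours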

[1] You do not need to install leveldb beforehand, since the library Cayley uses is pure Go, too.

[2] I am looking at methods to speed up the import. I think there are two ways: a) make the input smaller, e.g. by applying namespace or vocabulary shortcuts (I wrote a prototype for that); b) use a distributed backend.
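To illustrate what such a namespace shortcut can look like (this is not the prototype, just a rough sed sketch; it assumes the dump uses the usual <http://rdf.freebase.com/ns/...> prefix and picks fb: as an arbitrary short token):

# Rough sketch only: rewrite the shared namespace prefix to a short token so
# every triple carries fewer bytes before loading.
$ sed 's|<http://rdf\.freebase\.com/ns/|<fb:|g' freebase.nt > freebase-short.nt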

SergeC (Author) commented Jul 9, 2014

How do I query the database after the data load ($ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt")?

What do you think about using grep for simple queries? It's also possible to use zegrep on the compressed dump file.

Maybe it's possible to cut out only some domains? For example, I need only the /music domain and don't care about the rest of the data. So cut /music, load it, use it...

What do you think about asking the developers to add the possibility to load only certain domains ($ cayley load -db="leveldb" -dbpath="/tmp/testdb" -triples="freebase.nt" -domain="/music")?

Can your prototype export only the /music domain from freebase.nt to a JSON file? Then I'll load it into MongoDB and that would be fine for me.

miku commented Jul 9, 2014

a) I think the cayley docs are quite nice,
b) grep can be a viable option for ad hoc searches, depending on your use case,
c) I would argue against something like -domain, since files can be easily preprocessed with command line tools (see the sketch after this list),
d) No, there is no domain filter in my prototype; I'd just grep out the /music n-triples and then convert those to JSON.
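For (c)/(d), a rough sketch of that kind of preprocessing (it assumes the /music domain shows up as <http://rdf.freebase.com/ns/music....> terms; adjust the pattern to whatever actually appears in the dump):

# Sketch only: pull out the lines mentioning the /music domain, reading
# straight from the compressed dump with zegrep as suggested above.
$ zegrep '<http://rdf\.freebase\.com/ns/music\.' freebase-rdf-2014-07-06-00-00.gz > music.nt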

SergeC (Author) commented Jul 9, 2014

If you add to your tool nttoldj the possibility to read the compressed FB data dump (described in #57 (comment)) and to process only certain domains, the data load process should become faster. And I assume it would then be possible to run multiple instances.
Even better, you could use a modern CPU with an onboard memory controller and 32 GB of RAM, upload the Freebase dump (28 GB) to a RAM disk, and operations would be faster still.
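If you do go the RAM-disk route on such a box, the setup itself is simple (a sketch for Linux; the size and mount point are placeholders, and the machine needs that much free RAM):

# Sketch only: mount a tmpfs-backed RAM disk and copy the dump onto it.
$ sudo mkdir -p /mnt/ramdisk
$ sudo mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
$ cp freebase-rdf-2014-07-06-00-00.gz /mnt/ramdisk/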

> since files can be easily preprocessed with command line tools

Can you provide me with some examples of commands? Currently I need to do this https://www.freebase.com/user/sergec/views/artists_by_record_label?mql= but I get the whole list.

barakmich (Member) commented:

@miku is spot-on. I'll also point out that by tweaking some of the database flags (i.e., the ones in the configuration docs) you can load even more triples into leveldb in even less time :)
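As a sketch of the kind of tweak meant here (the flag and key names below are assumptions, not verified against the configuration docs, so treat them as placeholders and look up the real ones there):

# Hypothetical sketch: a config file with a larger load batch size; the
# -config flag and the key names are assumptions, check the configuration docs.
$ cat > cayley.cfg <<'EOF'
{
    "database": "leveldb",
    "db_path": "/tmp/testdb",
    "load_size": 10000
}
EOF
$ cayley load -config="cayley.cfg" -triples="freebase.nt"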

barakmich (Member) commented:

Closing, as this is out of scope and has a good answer.
