Add flag to remove some parts of string #1

Closed
SergeC opened this issue Jul 16, 2014 · 17 comments

Comments

@SergeC

SergeC commented Jul 16, 2014

I used this tool to convert grep output from the Freebase data dump into a JSON file, then imported the JSON into MongoDB. In MongoDB it looks like the screenshot below.
[screenshot 2014-07-16 at 13 49 27]

Can you add a flag that removes http://rdf.freebase.com/ns/ during conversion to JSON? I need it to make the DB more compact and faster. The data in MongoDB would then look like this:
[screenshot 2014-07-16 at 14 14 37]

Q2: I also need an example for the -l flag. I tried -l="en" and -l="@en" to get only English text, but the process stopped with an error.
Thanks.

@miku
Owner

miku commented Jul 16, 2014

I think the tool (version 1.0.13) meets your requirements already. If I run

 $ nttoldj -i -a -l "en" freebasedump.nt

the resulting JSON looks like this:

{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0zdhp5r"}
{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0z02xn1"}
{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0n66rtj"}
...

The prefix abbreviations are applied with -a. I think there are some errors in the dump, so pass the -i flag to keep processing when an error occurs.

Sidenote: I am reworking this tool soon, and it will get - among other things - better documentation.

@SergeC
Author

SergeC commented Jul 16, 2014

But I don't want fbkey: and fb: in the output. How do I change the rules? I ask because I'm not familiar with Go.

@miku
Owner

miku commented Jul 16, 2014

I'll see if it is sensible to include such a flag in the new version. Until then, please consider something ad-hoc like this:

$ nttoldj -i -a -l "en" freebasedump.nt | sed -e 's/fb://g' | sed -e 's/fbkey://g'
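
Both substitutions can also run inside a single sed process, which saves the second pipe stage. A minimal sketch, assuming `fb:` and `fbkey:` only ever occur as prefix abbreviations (a plain text substitution also rewrites any literal that happens to contain them); `strip_prefixes` is a hypothetical helper name, not part of nttoldj:

```sh
# Hypothetical helper: strip the fb: and fbkey: abbreviations in one sed process.
# Caveat: this is a plain text substitution, so it also rewrites any literal
# value that happens to contain "fb:" or "fbkey:".
strip_prefixes() {
    LC_ALL=C sed -e 's/fbkey://g' -e 's/fb://g'
}

# Example with one line of the JSON output shown above:
printf '%s\n' '{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0zdhp5r"}' \
    | strip_prefixes
```

For the sample line this prints the same object with the three `fb:` prefixes removed.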

@SergeC
Author

SergeC commented Jul 16, 2014

That will slow down the export process a lot. Maybe it's possible, as a temporary workaround, to change the rule from
# generic freebase
fb http://rdf.freebase.com/ns/
to something like
null http://rdf.freebase.com/ns/
or
nil http://rdf.freebase.com/ns/
or
void http://rdf.freebase.com/ns/
or
'' http://rdf.freebase.com/ns/

Please also consider adding support for reading the Freebase data dump from a gz file without unpacking it. cayleygraph/cayley#57 (comment)
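
Independent of the converter, a gzip-compressed dump can be streamed through a filter without unpacking it to disk. A minimal sketch using `gzip -dc` and sed. Whether nttoldj itself reads from standard input is not stated in this thread, so the sketch only strips the namespace from the raw N-Triples; `strip_ns_gz` is a hypothetical helper name.

```sh
# Hypothetical helper: decompress to stdout and strip the Freebase namespace
# on the fly; the unpacked dump is never written to disk.
strip_ns_gz() {
    gzip -dc "$1" | LC_ALL=C sed -e 's|http://rdf.freebase.com/ns/||g'
}

# Usage (dump file name as mentioned in this thread):
#   strip_ns_gz freebase-rdf-2014-07-06-00-00.gz > stripped.nt
```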

@miku
Owner

miku commented Jul 16, 2014

Have you measured it? In my experience these older tools are usually super fast (plus they will run in a separate process).

@SergeC
Author

SergeC commented Jul 16, 2014

The process has not finished yet, but sed slows it down:
[screenshot 2014-07-16 at 19 22 55]

I also get strange output in the terminal:
[screenshot 2014-07-16 at 19 20 34]

@SergeC
Author

SergeC commented Jul 23, 2014

The process in the previous screenshot took 5 days to complete. Are there any ways to speed it up?

@miku
Owner

miku commented Jul 24, 2014

Thanks for this data point. I am benchmarking various solutions myself and will report the results here later. I think there is a chance to reduce the running time by one, if not two, orders of magnitude.

@SergeC
Author

SergeC commented Jul 24, 2014

I don't think it's possible to get much more performance out of the CPU. Have you tried OpenCL or CUDA? Have a look at https://github.com/bkase/CUDA-grep and https://bitbucket.org/genbattle/go-opencl

@miku
Owner

miku commented Jul 24, 2014

An ugly approach like this long-winded sed pipeline will perform the same task in about half the time (140M lines: nttoldj 80 min, sed 45 min). It could probably be made more efficient by matching the number of processes to the number of cores.

@SergeC
Author

SergeC commented Jul 24, 2014

The unpacked freebase-rdf-2014-07-06-00-00.gz has 2623380169 lines.
Can we discuss speed-up methods on Skype or some other messenger?

@SergeC
Author

SergeC commented Jul 29, 2014

Have you tried http://sphinxsearch.com/ or http://gearman.org/ ?

@miku
Owner

miku commented Jul 29, 2014

With careful sed work balancing, I was able to convert/shrink about 40k lines of n-triples per second. The resulting program can be found here: https://github.com/miku/ntto (eventually I will merge nttoldj and ntto). Just wanted to let you know that, on a single machine, this is probably the simplest and fastest way to do it. Also, ntto supports empty replacements, if you want to get rid of namespaces completely.

You could always use something like Hadoop or Gearman to distribute work - and in the case of Freebase, this is likely the way to go.
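
The idea of spreading sed work over several cores can be sketched with standard tools. This is not ntto's implementation, only an illustration under stated assumptions: the input is split by line count, each chunk gets its own sed process via `xargs -P` (available in GNU and BSD xargs, not required by POSIX), and the chunks are concatenated in their original order. `parallel_strip` and its parameters are hypothetical.

```sh
# Hypothetical helper: parallel_strip INPUT OUTPUT [JOBS] [LINES_PER_CHUNK]
parallel_strip() {
    input=$1
    output=$2
    njobs=${3:-4}
    chunk_lines=${4:-1000000}
    workdir=$(mktemp -d) || return 1

    # Cut the input into chunks named chunk.aaaa, chunk.aaab, ...
    split -a 4 -l "$chunk_lines" "$input" "$workdir/chunk."

    # Run up to $njobs sed processes at once, one chunk per process.
    printf '%s\n' "$workdir"/chunk.* \
        | xargs -P "$njobs" -I{} sh -c \
            'LC_ALL=C sed -e "s|http://rdf.freebase.com/ns/||g" "$1" > "$1.out"' _ {}

    # Glob matches are sorted by the shell, so chunk order is preserved.
    cat "$workdir"/chunk.*.out > "$output"
    rm -rf "$workdir"
}

# Usage (file names are placeholders):
#   parallel_strip freebasedump.nt stripped.nt 8
```

The sketch needs temporary disk space of roughly twice the input size and assumes a non-empty input file; the chunk size and job count would have to be tuned per machine.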

@SergeC
Author

SergeC commented Jul 29, 2014

Is it possible to use sed for the n-triples to JSON conversion?
Is the core idea of ntto to run multiple sed processes?

@miku
Owner

miku commented Jul 29, 2014

Just another data point: by replacing sed -e with LANG=C perl -lnpe we were able to cut the runtime for a file with 1M lines from 32s to 6.8s. We thought sed was optimized for this task, until we stumbled over this blog post.

So ntto is basically just a little wrapper to utilize multiple cores. I think converting NT (n-triples) to JSON via sed or perl is possible, but probably tiresome. I hope I can merge the different NT tools soon.
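
To illustrate why a sed-only NT-to-JSON conversion gets tiresome: the sketch below handles only triples whose subject, predicate and object are all URIs, and silently drops every other line. Literals, language tags, blank nodes and JSON escaping would each need further rules. The `s`/`p`/`o` keys follow the JSON output shown earlier in this thread; `nt_uri_to_json` is a hypothetical name.

```sh
# Hypothetical helper: convert URI-only triples to line-delimited JSON.
# -n plus the trailing p flag prints only the lines that matched.
# Caveat: a URI containing a double quote or backslash would yield invalid JSON.
nt_uri_to_json() {
    LC_ALL=C sed -n \
        -e 's|^<\([^>]*\)>[[:space:]]*<\([^>]*\)>[[:space:]]*<\([^>]*\)>[[:space:]]*\.$|{"s":"\1","p":"\2","o":"\3"}|p'
}

# Usage (file names are placeholders):
#   nt_uri_to_json < file.nt > file.ldj
```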

@miku
Owner

miku commented Jul 29, 2014

Well, this seems to be escalating into an optimization spree. Actually, there is a small utility called replace that ships with mysql-server. It's special-purpose C code that will mass-replace 114M triples in about 40s, which amounts to almost 3M triples per second.
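
A back-of-the-envelope estimate of what these rates would mean for the full dump, using only figures quoted in this thread. It assumes one triple per line, throughput that scales linearly with input size, and no I/O bottleneck, none of which has been measured here. Shell integer arithmetic truncates, so the results are rounded down.

```sh
total_lines=2623380169               # line count of the unpacked dump, as reported above
sed_rate=40000                       # lines per second, balanced sed pipeline
replace_rate=$((114000000 / 40))     # lines per second, mysql replace utility

echo "replace rate:     $replace_rate lines/s"
echo "sed estimate:     $((total_lines / sed_rate / 3600)) hours"
echo "replace estimate: $((total_lines / replace_rate / 60)) minutes"
```

Under these assumptions the sed pipeline would need about 18 hours for the whole dump, and the replace utility about 15 minutes.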

@miku
Owner

miku commented Jul 31, 2014

I'm going to close this, since the original issue has been resolved:

$ echo "<NULL> http://www.w3.org/1999/02/22-rdf-syntax-ns#" > RULES
$ ntto -a -r RULES -o output.nt file.nt

output.nt will contain abbreviated ntriples with the prefix http://www.w3.org/1999/02/22-rdf-syntax-ns# completely removed.
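
Applied to the namespace from the original request, the same recipe would presumably look like the sketch below. It assumes the rules file takes one `<NULL> URI` pair per line, as in the example above, and that ntto is installed; the input and output file names are placeholders.

```sh
# Write a rules file that maps the Freebase namespace to an empty replacement.
printf '%s\n' '<NULL> http://rdf.freebase.com/ns/' > RULES

# Then run ntto with the same flags as in the example above
# (left commented out because it needs ntto and a real dump):
#   ntto -a -r RULES -o output.nt freebasedump.nt
```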

Furthermore, the performance issues have been addressed. It might be possible to shrink the Freebase dump in a few hours.

@miku miku closed this as completed Jul 31, 2014