Add flag to remove some parts of string #1

SergeC · 2014-07-16T10:56:39Z

I converted a grep result from the Freebase data dump to a JSON file with this tool, then imported the JSON into MongoDB. In MongoDB it looks like the screenshot.

Can you add a flag that removes http://rdf.freebase.com/ns/ during the conversion to JSON? I need it to make the DB more compact and faster. The data in MongoDB would then look like:

Q2: I also need an example for the -l flag. I tried -l="en" and -l="@en" to get only English text, but the process stopped with an error.
Thanks.


miku · 2014-07-16T12:59:08Z

I think the tool (version 1.0.13) meets your requirements already. If I run

 $ nttoldj -i -a -l "en" freebasedump.nt

the resulting JSON looks like this:

{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0zdhp5r"}
{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0z02xn1"}
{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0n66rtj"}
...

The prefix abbreviations are applied with -a. I think the dump contains some errors, so pass the -i flag to keep processing when one is encountered.

Sidenote: I am reworking this tool soon, and it will get - among other things - better documentation.

SergeC · 2014-07-16T13:12:30Z

But I don't want fbkey: and fb: in the output. How do I change the rules? I ask because I'm not familiar with Go.

miku · 2014-07-16T13:40:05Z

I'll see if it is sensible to include such a flag in the new version. Until then, please consider something ad-hoc like this:

$ nttoldj -i -a -l "en" freebasedump.nt | sed -e 's/fb://g' | sed -e 's/fbkey://g'
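The two sed calls can also be folded into a single sed process with two expressions. The following is a minimal sketch, not part of nttoldj; the function name `strip_fb_prefixes` is made up for illustration, and it assumes the strings `fb:` and `fbkey:` only ever appear as prefix abbreviations.

```shell
# Hypothetical helper, not part of nttoldj: strip the "fb:" and "fbkey:"
# abbreviations from line-delimited JSON read on stdin, using one sed process.
# Caveat: the substitution is purely textual, so it would also remove these
# strings if they occurred inside a literal value.
strip_fb_prefixes() {
    sed -e 's/fbkey://g' -e 's/fb://g'
}

# Try it on one line of the output shown earlier in the thread.
printf '%s\n' '{"s":"fb:award.award_winner","p":"fb:type.type.instance","o":"fb:m.0zdhp5r"}' \
    | strip_fb_prefixes
```

With the function defined, the pipeline above would become `nttoldj -i -a -l "en" freebasedump.nt | strip_fb_prefixes`.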

SergeC · 2014-07-16T13:54:06Z

That will slow down the export a lot. Maybe it's possible to temporarily change the rule to something like:
# generic freebase
fb http://rdf.freebase.com/ns/ -->
null http://rdf.freebase.com/ns/
or
nil http://rdf.freebase.com/ns/
or
void http://rdf.freebase.com/ns/
or
'' http://rdf.freebase.com/ns/

Please also consider adding support for reading the Freebase data dump from a gz file without unpacking it first. See cayleygraph/cayley#57 (comment).

miku · 2014-07-16T14:16:28Z

Have you measured it? In my experience these older tools are usually super fast (plus they will run in a separate process).

SergeC · 2014-07-16T16:24:29Z

The process has not finished yet, but sed slows it down:

I get strange output in the terminal.

SergeC · 2014-07-23T15:22:27Z

The process in the previous screenshot took 5 days to complete. Are there any ways to speed it up?

miku · 2014-07-24T09:32:49Z

Thanks for this data point. I am benchmarking various solutions myself and will report the results here later. I think there is a chance to reduce the running time by one, if not two, orders of magnitude.

SergeC · 2014-07-24T10:06:33Z

I don't think it's possible to get much more performance out of the CPU. Have you tried OpenCL or CUDA? Have a look at https://github.com/bkase/CUDA-grep and https://bitbucket.org/genbattle/go-opencl

miku · 2014-07-24T13:16:29Z

An ugly approach like this long-winded sed pipeline will perform the same task in about half the time (140M lines – nttoldj: 80min, sed: 45min). It could probably be made more efficient by matching the number of processes to the number of cores.

SergeC · 2014-07-24T17:07:01Z

The unpacked freebase-rdf-2014-07-06-00-00.gz has 2623380169 lines.
Can we discuss ways to speed this up on Skype or some other messenger?

SergeC · 2014-07-29T09:11:27Z

Have you tried http://sphinxsearch.com/ or http://gearman.org/ ?

miku · 2014-07-29T12:25:23Z

With careful balancing of the sed work, I was able to convert/shrink about 40k lines of n-triples per second. The resulting program can be found here: https://github.com/miku/ntto – eventually I will merge nttoldj and ntto. Just wanted to let you know that, on a single machine, this is probably the simplest and fastest way to do it. ntto also supports empty replacements, if you want to get rid of namespaces completely.

You could always use something like Hadoop or Gearman to distribute work - and in the case of Freebase, this is likely the way to go.
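For a sense of scale, the 40k lines per second figure can be combined with the line count reported earlier in the thread. This is only a back-of-the-envelope sketch: it assumes throughput stays constant over the whole dump, and the helper name `estimate_hours` is made up for illustration. Under that assumption the full dump would take roughly 18 hours on a single machine.

```shell
# Hypothetical helper: hours needed to process a given number of lines at a
# given rate (lines per second), assuming the rate stays constant.
estimate_hours() {
    awk -v lines="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", lines / rate / 3600 }'
}

# Line count of the unpacked dump and throughput, both as reported in this thread.
estimate_hours 2623380169 40000
```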

SergeC · 2014-07-29T14:25:12Z

Is it possible to use sed for the N-Triples to JSON conversion?
Is the core idea of ntto to run multiple sed processes?

miku · 2014-07-29T15:56:18Z

Just another data point. By replacing sed -e with LANG=C perl -lnpe we were able to cut the runtime for a file with 1M lines from 32s to 6.8s. We thought sed was optimized for this task, until we stumbled over this blog post.

So ntto is basically just a little wrapper to utilize multiple cores. I think converting NT (ntriples) to JSON via sed or perl is possible, but probably tiresome. I hope I can merge the different NT tools soon.
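The general idea of spreading a line-oriented replacement over several cores can be sketched as follows. This is not ntto's implementation, only an illustration of the technique: split the input into chunks, run one sed process per chunk in the background, then concatenate the results in order. The function name `parallel_strip` and the file names in the usage comment are made up, and the fb:/fbkey: substitutions again stand in for the real rules.

```shell
# Hypothetical sketch: run one sed process per chunk of the input file.
# $1: input file, $2: output file, $3: number of chunks/processes (default 4).
# The body runs in a subshell, so its variables do not leak to the caller.
parallel_strip() (
    input=$1
    output=$2
    workers=${3:-4}

    # Nothing to do for an empty input file.
    if [ ! -s "$input" ]; then
        : > "$output"
        exit 0
    fi

    tmpdir=$(mktemp -d) || exit 1

    # Lines per chunk, rounded up so that at most $workers chunks are created.
    total=$(( $(wc -l < "$input") ))
    per_chunk=$(( (total + workers - 1) / workers ))
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    split -l "$per_chunk" "$input" "$tmpdir/chunk."

    # Start one background sed per chunk, then wait for all of them.
    # Note: failures of individual sed processes are not detected here.
    for chunk in "$tmpdir"/chunk.*; do
        LC_ALL=C sed -e 's/fbkey://g' -e 's/fb://g' "$chunk" > "$chunk.out" &
    done
    wait

    # The chunk suffixes sort lexicographically, which preserves line order.
    cat "$tmpdir"/chunk.*.out > "$output"
    rm -rf "$tmpdir"
)

# Usage (file names are placeholders):
# parallel_strip freebasedump.ldj stripped.ldj 8
```

Setting the third argument to the number of available cores corresponds to the "matching processes and cores" idea mentioned earlier in the thread.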

miku · 2014-07-29T17:57:15Z

Well, this seems to be escalating into an optimization spree. Actually, there is a small utility called replace that ships with mysql-server. It's special-purpose C code that will mass replace 114M triples in about 40s, which amounts to almost 3M triples per second.

miku · 2014-07-31T16:09:26Z

I'm going to close this, since the original issue has been resolved:

$ echo "<NULL> http://www.w3.org/1999/02/22-rdf-syntax-ns#" > RULES
$ ntto -a -r RULES -o output.nt file.nt

output.nt will contain abbreviated ntriples with the prefix http://www.w3.org/1999/02/22-rdf-syntax-ns# completely removed.

Furthermore, the performance issues have been addressed. It might be possible to shrink the freebase dump in a few hours.

miku closed this as completed Jul 31, 2014
