Converting a Freebase Quad Dump to BaseKB Pro
Freebase has discontinued the quad dump and is currently developing an official RDF dump. I hope this information is useful for the development of a correct official quad dump and for its interpretation.
There are four major stages in the conversion of the Freebase Quad Dump to correct RDF. The essential problem is grounding the symbols used inside Freebase to RDF resources and literals in a way that respects the unique name assumption; this can be done efficiently in the following order:
- Turtle 0 -- construction of a directed acyclic graph (DAG) of names and namespaces from /type/object/key quads. This DAG can be used to resolve Freebase identifiers to unique Freebase mids. This DAG is represented in a JDBM database. This is the one major processing step which is not trivially parallelizable.
- Turtle 1 -- resolution of all Freebase identifiers to mids by joining the Freebase quad dump with Turtle 0. For instance, /type/object/key is resolved to /m/0dd. The product of Turtle 1 is in the same format as the Freebase quad dump except that key integrity is restored. Literal types are not correctly resolved in Turtle 1.
- Turtle 2 -- extraction of schema material in RDF format from Turtle 1
- Turtle 3 -- conversion of literal types (integers, dates, etc.) to RDF standard types by joining Turtle 1 with Turtle 2
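
The first two stages above can be sketched roughly as follows. This is a minimal illustration, not the actual BaseKB code: it assumes each quad is a `(subject, predicate, destination, value)` tuple in which a `/type/object/key` quad carries the namespace in the destination field and the key in the value field, and it keeps the graph in a plain dict rather than a JDBM database. The function names, the `root_mid` parameter, and any mids other than /m/0dd are hypothetical.

```python
KEY_PREDICATE = "/type/object/key"
ROOT_PATH = "/"


def build_key_graph(quads, root_mid):
    """Turtle 0 (sketch): map (namespace_mid, key) -> mid of the named object.

    Namespaces in the key quads may themselves be written as identifiers,
    so the quads are swept repeatedly until no further namespace resolves.
    """
    pending = [(s, ns, key) for (s, p, ns, key) in quads if p == KEY_PREDICATE]
    path_to_mid = {ROOT_PATH: root_mid}
    edges = {}
    progress = True
    while pending and progress:
        progress = False
        still_pending = []
        for subject, ns_path, key in pending:
            # A namespace already given as a mid needs no lookup.
            ns_mid = ns_path if ns_path.startswith("/m/") else path_to_mid.get(ns_path)
            if ns_mid is None:
                still_pending.append((subject, ns_path, key))
                continue
            edges[(ns_mid, key)] = subject
            child_path = ns_path.rstrip("/") + "/" + key
            path_to_mid.setdefault(child_path, subject)
            progress = True
        pending = still_pending
    return edges


def resolve(identifier, edges, root_mid):
    """Walk the key graph from the root; return a mid, or None if unknown."""
    if identifier.startswith("/m/"):
        return identifier
    mid = root_mid
    for key in (k for k in identifier.split("/") if k):
        mid = edges.get((mid, key))
        if mid is None:
            return None
    return mid


def rewrite_quad(quad, edges, root_mid):
    """Turtle 1 (sketch): replace identifiers with mids, leaving the value alone."""
    subject, predicate, destination, value = quad

    def fix(field):
        if not field:
            return field
        resolved = resolve(field, edges, root_mid)
        # Fall back to the original text when the identifier is unknown.
        return resolved if resolved is not None else field

    return (fix(subject), fix(predicate), fix(destination), value)
```

Given key quads that name `type` under the root, `object` under `/type`, and `key` under `/type/object`, calling `resolve("/type/object/key", edges, root_mid)` walks three edges and returns the mid of the final object. Because several names can point at the same mid, the structure is a DAG rather than a tree.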
There are also preprocessing and postprocessing stages.
For preprocessing, the hydroxide system partitions the quad dump on a hash of the subject and sorts the results so that all quads with a given subject sort together. This enables a "Reduce" processing phase where it is possible to examine all statements sharing a given subject.
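
As a rough illustration of this preprocessing, the sketch below partitions quads on a hash of the subject, sorts each partition, and then hands every subject's statements to a reducer. It is not the hydroxide implementation: the function names, the in-memory lists, and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition_by_subject(quads, num_partitions):
    """Bucket quads on a hash of the subject, then sort each bucket by subject."""
    partitions = [[] for _ in range(num_partitions)]
    for quad in quads:
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        index = zlib.crc32(quad[0].encode("utf-8")) % num_partitions
        partitions[index].append(quad)
    for part in partitions:
        part.sort(key=itemgetter(0))
    return partitions


def reduce_partition(partition, reducer):
    """Call reducer(subject, statements) once per subject in a sorted partition."""
    for subject, group in groupby(partition, key=itemgetter(0)):
        yield reducer(subject, list(group))
```

Since every quad for a given subject lands in the same partition and sorts together, each partition can be reduced independently of the others, and the reducer only ever sees the statements for a single subject at a time.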
In a reduce step, we can load these statements into an in-memory triple store, like that provided by the Jena framework. Since these triple stores never get very big, we reduce scalability demands and significantly reduce memory requirements. In particular, "shared nothing" parallelism