A set of experiments aimed at putting in place a service to replace the current SPARQL endpoint powered by the WDQS Blazegraph cluster. Context: https://phabricator.wikimedia.org/T292404
Prototypes of graph databases able to fulfill the needs of the categories SPARQL endpoint:
- Breadth first search over the category tree
- Simple point queries to get the object properties
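To make the two access patterns concrete, here is a minimal in-memory sketch in Python. It assumes documents shaped like the category documents described in this README (an `id` plus a `parentCategories` list of ids); the function names and the short sample ids are hypothetical and do not come from any of the backends.

```python
from collections import defaultdict, deque


def build_children_index(docs):
    """Invert parentCategories so each parent id maps to its direct subcategories."""
    children = defaultdict(list)
    for doc in docs:
        for parent in doc["parentCategories"]:
            children[parent].append(doc["id"])
    return children


def bfs_subcategories(children, root, max_depth=None):
    """Return (category id, depth) pairs reachable from root, in breadth-first order."""
    seen = {root}  # guards against cycles, which can occur in category graphs
    queue = deque([(root, 0)])
    order = []
    while queue:
        node, depth = queue.popleft()
        order.append((node, depth))
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand nodes sitting at the depth limit
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return order


def get_category(docs_by_id, category_id):
    """Point query: return the stored document for one category, or None."""
    return docs_by_id.get(category_id)


# Hypothetical sample data using short ids instead of full page URLs.
docs = [
    {"id": "Plants", "parentCategories": []},
    {"id": "Woody_plants", "parentCategories": ["Plants"]},
    {"id": "Trees", "parentCategories": ["Woody_plants", "Plants"]},
    {"id": "Oaks", "parentCategories": ["Trees"]},
]
docs_by_id = {doc["id"]: doc for doc in docs}
children = build_children_index(docs)
```

A real backend would express the traversal in its own query language; the sketch only pins down the expected behaviour (each category visited once, optional depth limit).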
Prototype expectations:
- load the data from the json dumps
- able to mutate the graph from a json event (for real-time updates)
- ideally have a WMCS service with the backend running
Please add any property graph (PG) engine you would like to experiment with.
In folder dgraph-backend
In folder orientdb-backend
In folder rdf-fuzeki-hdt-backend
Using spark in spark-batch-loader.
Goals:
- Get the RDF dumps into HDFS possibly using bz2 to ease splitting (dumps source: /mnt/data/xmldatadumps/public/other/categoriesrdf/)
- Create a hive table with context, subject, predicate, object and store similar to the wikibase approach
- Convert the dumps to an HDT file; this might require https://github.com/rdfhdt/hdt-mr
Daily process (nice to have):
- process daily SPARQL dumps and update the data fetched above
- Create a new HDT file
Using flink see: TODO add link to project
https://dumps.wikimedia.your.org/other/categoriesrdf/
If you want to experiment with Jena Fuseki & HDT without having to build your HDT file:
Please ask if you prefer a bigger/smaller example or convert one yourself using tools/convert_rdf_to_json.py.
See tools
- convert_rdf_to_json.py: A tool to convert RDF dumps into json dumps (ndjson: one doc per line)
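For reference, a minimal sketch of how a prototype might consume the ndjson output (one JSON document per line). This is not part of `convert_rdf_to_json.py`; the helper names `read_ndjson` and `load_categories` are hypothetical, and the sample lines are abbreviated documents made up for illustration.

```python
import io
import json


def read_ndjson(stream):
    """Yield one parsed document per non-empty line of an ndjson stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between documents
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON: {exc.msg}") from exc


def load_categories(stream):
    """Index all documents of a dump by their id."""
    return {doc["id"]: doc for doc in read_ndjson(stream)}


# Abbreviated sample input; a real dump line carries the full document.
sample = io.StringIO(
    '{"id": "https://commons.wikimedia.org/wiki/Category:Trees", "name": "Trees"}\n'
    '\n'
    '{"id": "https://commons.wikimedia.org/wiki/Category:Woody_plants", "name": "Woody plants"}\n'
)
categories = load_categories(sample)
```

Reading line by line keeps memory use bounded by the index rather than by the raw file, which matters if the loader is pointed at a full dump.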
Category document structure:
type Category {
""" ID for the category page. (The RDF model conflates the page_url and the ID; should we do the same here?) """
id: ID!
""" Name of the page """
name: String!
""" URL of the category page """
pageUrl: String!
""" Category visibility (hidden categories are generally not displayed at the end of the page) """
hidden: Boolean!
""" Categories this category belongs to """
parentCategories: [Category!]!
""" Number of pages belonging to this category (direct relationships) """
numberOfPages: Int!
""" Number of categories belonging to this category (direct relationships) """
numberOfCategories: Int!
}
Example json doc for dumps:
{
"id": "https://commons.wikimedia.org/wiki/Category:Trees",
"name": "Trees",
"pageUrl": "https://commons.wikimedia.org/wiki/Category:Trees",
"hidden": false,
"parentCategories": [
"https://commons.wikimedia.org/wiki/Category:Plants_by_common_named_groups",
"https://commons.wikimedia.org/wiki/Category:Woody_plants",
"https://commons.wikimedia.org/wiki/Category:Plant_life-form"
],
"numberOfPages": 13,
"numberOfCategories": 83
}
(Coordinate with linkstable hackathon project).
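As a rough illustration of the document structure above, the following Python sketch checks a dump document for the expected fields and types. It assumes that in the json dumps `parentCategories` holds category ids as strings, as in the example document; the names `EXPECTED_TYPES` and `validate_category` are hypothetical.

```python
# Field names and types mirror the Category structure described above.
EXPECTED_TYPES = {
    "id": str,
    "name": str,
    "pageUrl": str,
    "hidden": bool,
    "parentCategories": list,
    "numberOfPages": int,
    "numberOfCategories": int,
}


def validate_category(doc):
    """Return a list of problems found in doc; an empty list means it conforms."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
            continue
        value = doc[field]
        # bool is a subclass of int in Python, so reject it explicitly for counters.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    parents = doc.get("parentCategories")
    if isinstance(parents, list) and not all(isinstance(p, str) for p in parents):
        problems.append("parentCategories must contain only category ids (strings)")
    return problems


trees = {
    "id": "https://commons.wikimedia.org/wiki/Category:Trees",
    "name": "Trees",
    "pageUrl": "https://commons.wikimedia.org/wiki/Category:Trees",
    "hidden": False,
    "parentCategories": [
        "https://commons.wikimedia.org/wiki/Category:Plants_by_common_named_groups",
        "https://commons.wikimedia.org/wiki/Category:Woody_plants",
        "https://commons.wikimedia.org/wiki/Category:Plant_life-form",
    ],
    "numberOfPages": 13,
    "numberOfCategories": 83,
}
```

Calling `validate_category(trees)` on the example document returns an empty list; dropping or mistyping a field yields one message per problem.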
Example mutation events (for testing; the flink pipeline might perform the updates itself using a backend-specific update DSL):
Removing a parent category (e.g. Woody_plants is no longer parent of Trees):
- must remove Woody_plants from the parentCategories array of Trees
- must decrement the numberOfCategories of Woody_plants by 1
{
"id": "https://commons.wikimedia.org/wiki/Category:Trees",
"removedParentCategories": [
"https://commons.wikimedia.org/wiki/Category:Woody_plants"
]
}
{
"id": "https://commons.wikimedia.org/wiki/Category:Woody_plants",
"numberOfCategories": 4
}
Adding a parent category (e.g. Woody_plants is re-added as a parent of Trees):
{
"id": "https://commons.wikimedia.org/wiki/Category:Trees",
"addedParentCategories": [
"https://commons.wikimedia.org/wiki/Category:Woody_plants"
]
}
{
"id": "https://commons.wikimedia.org/wiki/Category:Woody_plants",
"numberOfCategories": 5
}
An article is added to/removed from Trees:
{
"id": "https://commons.wikimedia.org/wiki/Category:Trees",
"numberOfPages": 14
}
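The events above could be applied to an in-memory store roughly as follows. This is a sketch under assumptions: the counters in an event are read as absolute new values (which is how the examples appear to use them), an event for an unknown category is treated as an error, and the `apply_event` helper and the starting values in `store` are hypothetical.

```python
def apply_event(store, event):
    """Apply one mutation event to a dict of category documents keyed by id."""
    doc = store[event["id"]]  # assumption: unknown ids raise KeyError
    parents = doc["parentCategories"]
    for removed in event.get("removedParentCategories", []):
        if removed in parents:
            parents.remove(removed)
    for added in event.get("addedParentCategories", []):
        if added not in parents:  # keep the update idempotent
            parents.append(added)
    # Assumption: counters carry the new absolute value, not a delta.
    for counter in ("numberOfPages", "numberOfCategories"):
        if counter in event:
            doc[counter] = event[counter]
    return doc


TREES = "https://commons.wikimedia.org/wiki/Category:Trees"
WOODY = "https://commons.wikimedia.org/wiki/Category:Woody_plants"

# Made-up starting state, reduced to the fields the events touch.
store = {
    TREES: {"id": TREES, "parentCategories": [WOODY], "numberOfPages": 13, "numberOfCategories": 83},
    WOODY: {"id": WOODY, "parentCategories": [], "numberOfPages": 0, "numberOfCategories": 5},
}

# Removing Woody_plants as a parent of Trees takes two events.
apply_event(store, {"id": TREES, "removedParentCategories": [WOODY]})
apply_event(store, {"id": WOODY, "numberOfCategories": 4})
```

Making removals and additions idempotent means replaying the same event twice leaves the store unchanged, which is convenient when testing a backend against a recorded event stream.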