Categories in Wikipedia pages
Categories are organised in multiple overlapping hierarchies.
The MediaWiki API
Prepare the DB and create the root category:
CREATE INDEX ON :Category(catId)
CREATE INDEX ON :Category(catName)
CREATE INDEX ON :Page(pageTitle)
CREATE (c:Category:RootCategory {catId: 0, catName: 'Databases', subcatsFetched: false, pagesFetched: false, level: 0})
We can load an arbitrarily deep hierarchy of Wikipedia categories by iteratively generating requests to the MediaWiki API starting from the root category.
UNWIND range(0,3) as level
CALL apoc.cypher.doit("
MATCH (c:Category { subcatsFetched: false, level: $level})
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=subcat&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS subcat
MERGE (sc:Category {catId: subcat.pageid})
ON CREATE SET sc.catName = substring(subcat.title,9),
sc.subcatsFetched = false,
sc.pagesFetched = false,
sc.level = $level + 1
WITH sc,c
CALL apoc.create.addLabels(sc,['Level' + ($level + 1) + 'Category']) YIELD node
MERGE (sc)-[:SUBCAT_OF]->(c)
WITH DISTINCT c
SET c.subcatsFetched = true", { level: level }) YIELD value
RETURN value
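The request URL that the Cypher above assembles with string concatenation and apoc.text.urlencode can be sketched in plain Python. The `build_categorymembers_url` helper below is hypothetical (it is not part of the original workflow), but the query parameters are exactly the ones used in the calls above:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def build_categorymembers_url(cat_name: str, cmtype: str = "subcat") -> str:
    """Build the same categorymembers request the Cypher above sends.

    cmtype is 'subcat' for subcategories or 'page' for pages;
    cmlimit=500 is the largest page size available to anonymous clients.
    """
    params = {
        "format": "json",
        "action": "query",
        "list": "categorymembers",
        "cmtype": cmtype,
        "cmtitle": "Category:" + cat_name,
        "cmprop": "ids|title",
        "cmlimit": 500,
    }
    return API + "?" + urlencode(params)

url = build_categorymembers_url("Databases")
```

Note that `urlencode` percent-encodes the `:` in `Category:Databases` and the `|` in `ids|title`, just as apoc.text.urlencode does in the Cypher string.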
Similarly, we can load the Wikipedia pages for each of the categories in the graph. Notice that at this point we are only loading the page id, its URL and its title.
UNWIND range(0,4) as level
CALL apoc.cypher.doit("
MATCH (c:Category { pagesFetched: false, level: $level })
CALL apoc.load.json('https://en.wikipedia.org/w/api.php?format=json&action=query&list=categorymembers&cmtype=page&cmtitle=Category:' + apoc.text.urlencode(c.catName) + '&cmprop=ids%7Ctitle&cmlimit=500')
YIELD value as results
UNWIND results.query.categorymembers AS page
MERGE (p:Page {pageId: page.pageid})
ON CREATE SET p.pageTitle = page.title, p.pageUrl = 'http://en.wikipedia.org/wiki/' + apoc.text.urlencode(replace(page.title, ' ', '_'))
WITH p,c
MERGE (p)-[:IN_CATEGORY]->(c)
WITH DISTINCT c
SET c.pagesFetched = true", { level: level }) YIELD value
RETURN value
The MediaWiki API does not expose the content of the pages, but DBpedia does to a certain extent. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.
SELECT ?label
WHERE {
  ?x <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <...wiki page url...> ;
     <http://dbpedia.org/ontology/abstract> ?label .
  FILTER(LANGMATCHES(LANG(?label), 'en'))
} LIMIT 1
We’ll submit the query to the endpoint and get the results as JSON, again using APOC’s load.json procedure. The results will enrich our knowledge graph by adding the abstract as a property of the Page nodes.
WITH "SELECT ?label
WHERE {
?x <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <@wikiurl@> ;
<http://dbpedia.org/ontology/abstract> ?label .
FILTER(LANG(?label) = '' || LANGMATCHES(LANG(?label), 'en')) } LIMIT 1
" AS sparqlPattern
UNWIND range(0,3) as level
CALL apoc.cypher.doit("
MATCH (c:Category { level: $level })<-[:IN_CATEGORY]-(p:Page)
WHERE NOT exists(p.abstract)
WITH DISTINCT p, apoc.text.replace($sparqlPattern,'@wikiurl@',p.pageUrl) AS runnableSparql LIMIT 100
CALL apoc.load.json('http://dbpedia.org/sparql/?query=' + apoc.text.urlencode(runnableSparql) + '&format=application%2Fsparql-results%2Bjson') YIELD value
SET p.abstract = value.results.bindings[0].label.value
", { level: level, sparqlPattern: sparqlPattern }) YIELD value
RETURN value
Notice that the number of pages is limited to 100 per level because we are generating an HTTP request to the DBpedia endpoint for each Page node in our graph. Feel free to remove this limit but keep in mind that this can take a while.
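The substitution and encoding that apoc.text.replace and apoc.text.urlencode perform in the Cypher above can be mimicked in plain Python. The `dbpedia_query_url` helper is hypothetical, but the pattern and the endpoint are the ones used above:

```python
from urllib.parse import quote

# The same SPARQL pattern defined in the Cypher WITH clause above
SPARQL_PATTERN = """SELECT ?label
WHERE {
  ?x <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <@wikiurl@> ;
     <http://dbpedia.org/ontology/abstract> ?label .
  FILTER(LANG(?label) = '' || LANGMATCHES(LANG(?label), 'en')) } LIMIT 1"""

def dbpedia_query_url(page_url: str) -> str:
    """Substitute the page URL into the pattern and build the request URL,
    mirroring apoc.text.replace + apoc.text.urlencode in the Cypher above."""
    sparql = SPARQL_PATTERN.replace("@wikiurl@", page_url)
    return ("http://dbpedia.org/sparql/?query=" + quote(sparql, safe="")
            + "&format=" + quote("application/sparql-results+json", safe=""))

url = dbpedia_query_url("http://en.wikipedia.org/wiki/Graph_database")
```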
We can list categories by number of sub/super categories or by number of pages, or even create custom metrics like ABS(toFloat(superCatCount - subCatCount)/(superCatCount + subCatCount))
that tell us how 'balanced' a category is (the ratio between supercategories and subcategories). Values closer to zero indicate more balanced categories; values closer to one indicate more unbalanced ones.
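The balance index is simple enough to sanity-check outside Cypher; a minimal Python equivalent:

```python
def balance_index(super_cat_count: int, sub_cat_count: int) -> float:
    """ABS((super - sub) / (super + sub)):
    0.0 means equal numbers of supercategories and subcategories,
    values near 1.0 mean the category is heavily skewed one way."""
    return abs((super_cat_count - sub_cat_count)
               / (super_cat_count + sub_cat_count))

balance_index(3, 3)  # perfectly balanced -> 0.0
balance_index(9, 1)  # heavily unbalanced -> 0.8
```

As in the Cypher query, categories with zero supercategories or zero subcategories should be filtered out first to avoid division edge cases.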
MATCH (c:Category)
WITH c.catName AS category,
size((c)<-[:SUBCAT_OF]-()) AS subCatCount,
size((c)-[:SUBCAT_OF]->()) AS superCatCount,
size((c)<-[:IN_CATEGORY]-()) AS pageCount
WHERE subCatCount > 0 AND superCatCount > 0
RETURN category,
pageCount,
subCatCount,
superCatCount,
ABS(toFloat(superCatCount - subCatCount)/(superCatCount + subCatCount)) as balanceIndex
ORDER BY subCatCount DESC
LIMIT 500
We can also aggregate these values to produce stats.
MATCH (c:Category)
WITH c.catName AS category,
size((c)<-[:SUBCAT_OF]-()) AS subCatCount,
size((c)-[:SUBCAT_OF]->()) AS superCatCount,
size((c)<-[:IN_CATEGORY]-()) AS pageCount
RETURN AVG(subCatCount) AS `AVG #subcats`,
MIN(subCatCount) AS `MIN #subcats`,
MAX(subCatCount) AS `MAX #subcats`,
percentileCont(subCatCount,0.9) AS `.9p #subcats`,
AVG(pageCount) AS `AVG #pages`,
MIN(pageCount) AS `MIN #pages`,
MAX(pageCount) AS `MAX #pages`,
percentileCont(pageCount,0.95) AS `.95p #pages`,
AVG(superCatCount) AS `AVG #supercats`,
MIN(superCatCount) AS `MIN #supercats`,
MAX(superCatCount) AS `MAX #supercats`,
percentileCont(superCatCount,0.95) AS `.95p #supercats`
How are Quantone and Monster.com connected in the Wikipedia hierarchy?
MATCH shortestHierarchyConnection = shortestPath((q:Page { pageTitle : 'Quantone'})-[:IN_CATEGORY|SUBCAT_OF*]-(m:Page { pageTitle : 'Monster.com'}))
RETURN shortestHierarchyConnection
Some unexpectedly(?) long hierarchies. How can a path of length 7 exist if we’ve only loaded a 4 level deep hierarchy? Here’s how:
MATCH path =()-[r:SUBCAT_OF*7..]->() WITH path LIMIT 1
RETURN path
Loops!
MATCH loop = (cat)-[r:SUBCAT_OF*]->(cat)
RETURN loop LIMIT 1
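The effect is easy to reproduce on a toy edge list: once the SUBCAT_OF relationships contain a cycle, unbounded variable-length paths can be arbitrarily longer than the loaded depth. A minimal sketch of the detection idea in Python (the edge lists are made up):

```python
def has_cycle(edges):
    """Detect a directed cycle in a SUBCAT_OF-style edge list via DFS.

    Nodes are coloured WHITE (unvisited), GREY (on the current DFS path)
    or BLACK (fully explored); hitting a GREY node means a back edge,
    i.e. a loop like the one the Cypher query above finds.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GREY:  # back edge onto the current path: cycle found
                return True
            if c == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)

# Hypothetical hierarchy where A -> B -> C -> A closes a loop
looped = has_cycle([("A", "B"), ("B", "C"), ("C", "A")])
acyclic = has_cycle([("A", "B"), ("B", "C")])
```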
There is a snapshot of the Wikipedia categories and their hierarchical relationships (as of mid April 2017) here. It contains 1.4 million categories and 4 million hierarchical relationships. Both can be loaded into Neo4j using LOAD CSV
. You can run the queries as they are, or download the files to your Neo4j instance’s import
directory and use LOAD CSV FROM "file:///…"
instead.
First the categories. Notice that we are loading a couple of extra properties on the Category nodes: the pageCount and the subcatCount. These numbers are precomputed in the data dump and are not always accurate.
USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/cats.csv?raw=true" AS row
CREATE (c:Category { catId: row[0]})
SET c.catName = row[2], c.pageCount = toInt(row[3]), c.subcatCount = toInt(row[4])
And then the subcategory relationships
USING PERIODIC COMMIT 10000
LOAD CSV FROM "https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
MATCH (from:Category { catId: row[0]})
MATCH (to:Category { catId: row[1]})
CREATE (from)-[:SUBCAT_OF]->(to)
If you’re interested in regenerating fresh CSV files, here’s how:
- Start by downloading the latest DB dumps from the Wikipedia downloads page. For the category hierarchy, you’ll only need the following tables: category, categorylinks and page.
- Load them in your DBMS.
- Generate the categories CSV file by running the following SQL:
SELECT p.page_id AS PAGE_ID,
       c.cat_id AS CAT_ID,
       CAST(c.cat_title AS nCHAR) AS CAT_TITLE,
       c.cat_pages AS CAT_PAGES_COUNT,
       c.cat_subcats AS CAT_SUBCAT_COUNT
INTO OUTFILE '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/cats.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM test.category c, test.page p
WHERE c.cat_title = p.page_title
  AND p.page_namespace = 14
- Generate the relationships file by running the following SQL:
SELECT p.page_id AS FROM_PAGE_ID,
       p2.page_id AS TO_PAGE_ID
INTO OUTFILE '/Users/jbarrasa/Applications/neo4j-enterprise-3.1.2/import/wiki/rels.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
FROM test.category c, test.page p, test.categorylinks l, test.category c2, test.page p2
WHERE l.cl_type = 'subcat'
  AND c.cat_title = p.page_title
  AND p.page_namespace = 14
  AND l.cl_from = p.page_id
  AND l.cl_to = c2.cat_title
  AND c2.cat_title = p2.page_title
  AND p2.page_namespace = 14
The same analysis we ran in the first approach can be applied to the full hierarchy. However, there are a couple of interesting queries that only make sense on the full hierarchy. The first one produces stats on the number of pages per category, based on the precomputed counts available in the Wikipedia dump.
MATCH (c:Category)
RETURN SUM(c.pageCount) AS `#pages categorised (with duplicates)`,
AVG(c.pageCount) AS `average #pages per cat`,
percentileCont(c.pageCount, 0.75) AS `.75p #pages in a cat`,
MIN(c.pageCount) AS `min #pages in a cat`,
MAX(c.pageCount) AS `max #pages in a cat`
Also, the full export contains some orphan nodes. Probably unmaintained/migrated categories? Here is how to get the number of orphan nodes:
MATCH (c:Category)
WHERE NOT (c)-[:SUBCAT_OF]-()
RETURN COUNT(c)