Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
143 lines (106 sloc) 5.15 KB

This wikipage is for issue-280's working group subsidy and reference. See also Wikidata's Guidelines for external relationships.


Introduction

The issue 280 started with the title's suggestion, "Schema.org should have mappings to Wikidata terms where possible", but the simplest and collaborative way is to feed Wikidata: these are the basic @thadguidry rules to accomplish the task at Wikidata editions,


Now, to get back the information, we need "figure out the SPARQL for query.wikidata.org that would extract these mappings", as @danbri suggested.

Quering and exporting

The "wanted universe" is provided by a simple query, and perhaps works fine for a local Wikidata user (at the query.wikidata.org's server.
See also [[Issue 280, quering Wikidata, OLD QUERIES]].

Standard sparql query

All periodic reports was generated by a "standard query", to get relationship information (equivclass, equivprop, sub or super) and export result to other algorithms, databases or spreadsheets.

SELECT ?wd ?wdLabel ?corrName ?schema
{
  values (?corr ?corrName)
    { (wdt:P2235 "superProp") (wdt:P2236 "subProp") (wdt:P1628 "equivProp")
      (wdt:P1709 "equivClass") (wdt:P2888 "exactMatch")
    }
  ?wd ?corr ?schema
  filter(regex(str(?schema), "schema.org"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?corrName ?schema

Standard SQL database

After download as CSV (eg. file wd2schema-raw-2016-06.csv), produce a standard SQL table to manipulate data,

DROP TABLE IF EXISTS wd2schema_std;

CREATE TABLE  IF NOT EXISTS wd2schema_std_temp (
	pLabel text, p text,
	equivclass text, equivprop text, sub text, super text, exact text
);

COPY wd2schema_std_temp FROM '/tmp/wd2schema-raw-2016-06.csv' CSV HEADER;

-- adapat to better structure:
CREATE TABLE wd2schema_std AS
  SELECT
    COALESCE(equivclass,equivprop,sub,super) as sch_id,
    CASE
	    WHEN equivclass IS NOT NULL THEN 'equivclass'
	    WHEN equivprop IS NOT NULL THEN 'equivprop'
	    WHEN sub IS NOT NULL THEN 'sub'
	    WHEN super IS NOT NULL THEN 'super'
	    WHEN exact IS NOT NULL THEN 'exact'
    END as reltype,
    p As wd_id,
    plabel as wd_label
  FROM wd2schema_std_temp
;
CREATE VIEW vw_wd2schema_std AS
  SELECT *, reltype='equivprop' as rel_isprop,
        ascii(substring(sch_id,1,1)::char)>87 as sch_isprop,
        substring(wd_id,1,1)='P' wd_isprop
  FROM wd2schema_std ORDER BY sch_id,reltype
;
DROP TABLE wd2schema_std_temp;

-- clean:
UPDATE wd2schema_std SET
   wd_id  = regexp_replace(wd_id,  '^https?://[a-z\.]+/?', ''),
   sch_id = regexp_replace(sch_id, '^https?://[a-z\.]+/?', '')
;
-- back as new file:
COPY (SELECT * FROM wd2schema_std ORDER BY sch_id,reltype)
TO '/tmp/wd2schema-std-2017-05.csv' WITH CSV HEADER;

So, as input we have wd2schema-raw-2016-06.csv and as output wd2schema-std-2016-06.csv, and the SQL database with the same data to perform queries.

Summarizations

SELECT count(*) as n_tot FROM wd2schema_std;

-- summarize reltypes
SELECT reltype, count(*) as n  FROM wd2schema_std
GROUP BY 1 ORDER BY 1;

-- summarize reltypes and pure_prop
WITH t AS (
  SELECT *, (rel_isprop AND sch_isprop AND wd_isprop) as pure_prop
  FROM vw_wd2schema_std
) SELECT reltype, pure_prop, count(*) as n
  FROM t GROUP BY 1,2 ORDER BY 1,2;

-- summarize schemaOrg name that repeat
SELECT sch_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize plabel that repeat
SELECT wd_label, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

-- summarize wd_id that repeat
SELECT wd_id, count(*) as n FROM wd2schema_std
GROUP BY 1  HAVING count(*)>1
ORDER BY 2 DESC, 1;

Results

Quering and reporting results. See reports.md, for specific dates:

You can’t perform that action at this time.