# Demo: Reading Linked Data
* For this exercise you will need to install SPARQLWrapper:
  * __`~/anaconda3/bin/conda install -c conda-forge sparqlwrapper`__
  * Windows: use the Start Menu to open the Anaconda Shell, then run __`conda install -c conda-forge sparqlwrapper`__

In [8]:
import pandas as pd
import json
from SPARQLWrapper import SPARQLWrapper, JSON

In [9]:
def get_sparql_dataframe(service, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.

    `service` is the URL of a SPARQL endpoint and `query` is the query text.
    """
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    # Parse the JSON response; 'head.vars' lists the selected variables.
    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        # A variable left unbound (e.g. by an OPTIONAL clause) is absent
        # from the row, so fall back to None for that cell.
        out.append([row.get(c, {}).get('value') for c in cols])

    return pd.DataFrame(out, columns=cols)
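
To see what the helper above does to each row, here is a minimal offline sketch of the same flattening step. It runs against a hand-written dictionary in the SPARQL JSON results layout instead of a live endpoint. The name `flatten_bindings` and the second sample row (with placeholder identifiers) are hypothetical and used only for illustration.

```python
# Hand-written sample in the SPARQL JSON results layout. The first row reuses
# values shown later in this demo; the second row is a made-up placeholder.
sample_results = {
    "head": {"vars": ["item", "orcid", "description"]},
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q91785"},
                "orcid": {"type": "literal", "value": "0000-0002-2139-7145"},
                "description": {"type": "literal", "xml:lang": "en", "value": "astronomer"},
            },
            {
                # No "description" key: the OPTIONAL pattern did not match.
                "item": {"type": "uri", "value": "http://example.org/entity/placeholder"},
                "orcid": {"type": "literal", "value": "0000-0000-0000-0000"},
            },
        ]
    },
}


def flatten_bindings(results):
    """Return (column names, list of rows) from a SPARQL JSON results dict."""
    cols = results["head"]["vars"]
    rows = [
        [binding.get(col, {}).get("value") for col in cols]
        for binding in results["results"]["bindings"]
    ]
    return cols, rows


cols, rows = flatten_bindings(sample_results)
print(cols)
print(rows)
```

The unbound `description` in the second row comes back as `None`, which is why the resulting DataFrame can hold missing values in that column.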

In [10]:
wds = "https://query.wikidata.org/sparql"

In [13]:
# This is the SPARQL query we send to the endpoint defined in the
# previous step. It draws on several vocabularies, each with its own
# definitions, but it is ultimately a selection from the Wikidata graph.
#
# We're looking for distinct rows of individuals who have an ORCID iD
# (https://orcid.org), along with any English descriptions and labels
# Wikidata holds for them. The query matches a pattern in the graph:
# any node connected to other nodes through these relationships.
#
# Note that the relationships themselves are often resolvable. To
# understand what wdt:P496 means, expand it into its full URL by
# applying the prefix for wdt, then issue an HTTP request to
# http://www.wikidata.org/prop/direct/P496
#
# For more information on SPARQL, please consult "Learning SPARQL (2nd
# Edition)" by Bob DuCharme.

rq = """
PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT
   ?item
   ?itemLabel
   ?orcid
   ?description
WHERE {
  ?item wdt:P496 ?orcid .
  OPTIONAL { ?item schema:description ?description FILTER (lang(?description) = "en") }
  SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en" .
  }
} LIMIT 10000
"""
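
The comment in the cell above describes expanding `wdt:P496` into a full URL by applying its prefix. A minimal sketch of that expansion follows, using some of the prefixes declared in the query. The `PREFIXES` dictionary and the `expand` function are hypothetical names introduced here for illustration, and the sketch does not perform the HTTP request.

```python
# Prefix declarations copied from the query above.
PREFIXES = {
    "bd": "http://www.bigdata.com/rdf#",
    "wikibase": "http://wikiba.se/ontology#",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'wdt:P496' into its full URL."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name


print(expand("wdt:P496"))  # http://www.wikidata.org/prop/direct/P496
```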

In [14]:
wikidf = get_sparql_dataframe(wds, rq)

In [15]:
# Let's inspect our DataFrame
wikidf.head()

| | item | itemLabel | orcid | description |
|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q879571 | Björn Brembs | 0000-0001-7824-7650 | German university teacher |
| 1 | http://www.wikidata.org/entity/Q91785 | Anna Frebel | 0000-0002-2139-7145 | astronomer |
| 2 | http://www.wikidata.org/entity/Q63514 | Reinhardt Kristensen | 0000-0001-9549-1188 | Danish Scientist |
| 3 | http://www.wikidata.org/entity/Q532387 | Takashi Gojobori | 0000-0001-7850-1743 | Japanese biologist |
| 4 | http://www.wikidata.org/entity/Q92756 | Johan Håstad | 0000-0002-5379-345X | Swedish computer scientist |


In [16]:
# Some stats...
wikidf.describe()

| | item | itemLabel | orcid | description |
|---|---|---|---|---|
| count | 10000 | 10000 | 10000 | 1018 |
| unique | 9999 | 9989 | 9993 | 466 |
| top | http://www.wikidata.org/entity/Q42326259 | Tao Liu | 0000-0002-9231-9996 | researcher |
| freq | 2 | 2 | 2 | 430 |


In [18]:
# Top 10 most frequently occurring descriptions (many read like job titles)
wikidf['description'].value_counts()[:10]

researcher                     430
chemist                         10
scientist                       10
botanist                         8
British computer scientist       6
American computer scientist      5
American biologist               5
German computer scientist        5
American scientist               5
Brazilian entomologist           4
Name: description, dtype: int64
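
The `describe()` output earlier reports only 1018 non-null descriptions out of 10000 rows, so these counts cover roughly a tenth of the items. The sketch below shows, on a tiny made-up DataFrame, how `value_counts()` treats missing values and how one might measure the missing share. The `sample` frame and its contents are hypothetical stand-ins, not data from Wikidata.

```python
import pandas as pd

# Hypothetical miniature stand-in for wikidf; the values are made up.
sample = pd.DataFrame({
    "itemLabel": ["A", "B", "C", "D", "E"],
    "description": ["researcher", None, "chemist", "researcher", None],
})

# value_counts() skips missing values by default...
counts = sample["description"].value_counts()

# ...while dropna=False adds an entry for them.
counts_with_missing = sample["description"].value_counts(dropna=False)

# Fraction of rows that have no description at all.
missing_share = sample["description"].isna().mean()

print(counts)
print(counts_with_missing)
print(missing_share)
```

Running the same `isna().mean()` check on `wikidf['description']` would show how much of the result set the top-10 list leaves out.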

In [19]:
# A Jupyter trick to store this DataFrame so we can use it in our exercise
%store wikidf

Stored 'wikidf' (DataFrame)
