# Working with Wikidata and File Formats

This notebook is a DRAFT. It outlines how to interact with Wikidata from a notebook and
uses SPARQL queries to ask about entities in Wikidata, with the goal of processing information
about the file formats that are listed in the wikibase.

If it is not already installed, install the `qwikidata` library. See https://qwikidata.readthedocs.io/en/stable/readme.html for more information.

In [1]:
!pip install qwikidata

Collecting qwikidata
  Downloading qwikidata-0.4.0-py3-none-any.whl (20 kB)
Installing collected packages: qwikidata
Successfully installed qwikidata-0.4.0


In [2]:
from qwikidata.entity import WikidataItem, WikidataLexeme, WikidataProperty
from qwikidata.linked_data_interface import get_entity_dict_from_api

# create an item representing "Jerome Kern"
Q_Jerry_Kern = "Q313270"
q313270_dict = get_entity_dict_from_api(Q_Jerry_Kern)
q313270 = WikidataItem(q313270_dict)

q313270

WikidataItem(label=Jerome Kern, id=Q313270, description=American composer of musical theater and popular music (1885-1945), aliases=['Jerome David Kern'], enwiki_title=Jerome Kern)

In [14]:
# look through the dictionary
for item in q313270_dict:
    print(item)
    #print(item,':',q313270_dict[item]) #<-- print keys and values

pageid
ns
title
lastrevid
modified
type
id
labels
descriptions
aliases
claims
sitelinks


In [21]:
# display contents of an element
q313270_dict['descriptions']

{'it': {'language': 'it', 'value': 'compositore statunitense'},
 'fr': {'language': 'fr', 'value': 'compositeur américain'},
 'de': {'language': 'de', 'value': 'US-amerikanischer Komponist'},
 'fa': {'language': 'fa', 'value': 'آهنگساز آمریکایی'},
 'nb': {'language': 'nb', 'value': 'amerikansk komponist'},
 'nn': {'language': 'nn', 'value': 'amerikansk komponist'},
 'da': {'language': 'da', 'value': 'amerikansk komponist'},
 'sv': {'language': 'sv', 'value': 'amerikansk kompositör'},
 'en': {'language': 'en',
  'value': 'American composer of musical theater and popular music (1885-1945)'},
 'nl': {'language': 'nl', 'value': 'Amerikaans componist (1885-1945)'},
 'uk': {'language': 'uk', 'value': 'американський композитор'},
 'he': {'language': 'he', 'value': 'מלחין יהודי אמריקני'},
 'cs': {'language': 'cs', 'value': 'americký skladatel'},
 'ca': {'language': 'ca', 'value': 'compositor estatunidenc'},
 'id': {'language': 'id', 'value': 'Komposer Amerika'}}

In [44]:
en_descrip = q313270_dict['descriptions']['en']['value']

print(en_descrip)

American composer of musical theater and popular music (1885-1945)
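
Not every entity has a description in every language, so indexing `['descriptions']['en']` directly can raise a `KeyError`. The following is a minimal sketch of a lookup with a fallback language. `get_description` is a hypothetical helper, not part of `qwikidata`, and the sample dictionary reuses two entries from the output above so that the cell runs offline.

```python
# Hypothetical helper (not part of qwikidata): return the description in the
# requested language, or in the fallback language when it is missing.
def get_description(entity_dict, lang="en", fallback="en"):
    descriptions = entity_dict.get("descriptions", {})
    entry = descriptions.get(lang) or descriptions.get(fallback)
    return entry["value"] if entry else None


# Offline sample data: two entries copied from the 'descriptions' output above.
sample_entity = {
    "descriptions": {
        "en": {
            "language": "en",
            "value": "American composer of musical theater and popular music (1885-1945)",
        },
        "de": {"language": "de", "value": "US-amerikanischer Komponist"},
    }
}

print(get_description(sample_entity, "de"))
print(get_description(sample_entity, "ja"))  # no 'ja' entry, so English is used
```

With the live data, `get_description(q313270_dict, "fr")` would be called the same way.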


In [23]:
from qwikidata.sparql import (get_subclasses_of_item,
                              return_sparql_query_results)

# send any sparql query to the wikidata query service and get full result back
# here we use an example that counts the number of humans
sparql_query = """
SELECT (COUNT(?item) AS ?count)
WHERE {
        ?item wdt:P31/wdt:P279* wd:Q5 .
}
"""
res = return_sparql_query_results(sparql_query)

print(res)

{'head': {'vars': ['count']}, 'results': {'bindings': [{'count': {'datatype': 'http://www.w3.org/2001/XMLSchema#integer', 'type': 'literal', 'value': '9390415'}}]}}


In [33]:
res['results']['bindings'][0]['count']['value']

'9390415'
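
The count comes back as a string even though the binding declares an integer datatype. A small sketch of converting it, assuming the response shape printed above: `binding_to_python` is a hypothetical helper, and the `res` dictionary is copied from that output so the cell does not need network access.

```python
# Response copied from the printed output above, so this runs offline.
res = {
    "head": {"vars": ["count"]},
    "results": {
        "bindings": [
            {
                "count": {
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "type": "literal",
                    "value": "9390415",
                }
            }
        ]
    },
}

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def binding_to_python(cell):
    """Hypothetical helper: return an int for xsd:integer literals, else the raw string."""
    if cell.get("datatype") == XSD_INTEGER:
        return int(cell["value"])
    return cell["value"]


human_count = binding_to_python(res["results"]["bindings"][0]["count"])
print(human_count, type(human_count).__name__)
```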

In [41]:
# use convenience function to get subclasses of an item as a list of item ids
Q_id = "Q235557"
subclasses = get_subclasses_of_item(Q_id)

print(len(subclasses), subclasses)

268 ['Q235557', 'Q86920', 'Q167772', 'Q223535', 'Q229762', 'Q243303', 'Q278934', 'Q285972', 'Q287067', 'Q290741', 'Q336705', 'Q379545', 'Q467454', 'Q497118', 'Q507860', 'Q527723', 'Q594447', 'Q682626', 'Q691652', 'Q863883', 'Q1056408', 'Q1135858', 'Q1224822', 'Q1343033', 'Q1351368', 'Q1363415', 'Q1485661', 'Q1572121', 'Q1727359', 'Q1840684', 'Q1931564', 'Q1955133', 'Q2141493', 'Q2206173', 'Q2427787', 'Q2720536', 'Q3077335', 'Q3498805', 'Q3502441', 'Q3930596', 'Q4781113', 'Q4836790', 'Q5008632', 'Q5090461', 'Q5090500', 'Q5156830', 'Q5227180', 'Q5248648', 'Q5359789', 'Q5426535', 'Q6046575', 'Q7079133', 'Q7203483', 'Q7508366', 'Q16361936', 'Q16545707', 'Q17042621', 'Q17074854', 'Q17087630', 'Q17560478', 'Q17636230', 'Q18011768', 'Q18359031', 'Q20155966', 'Q26697935', 'Q27198004', 'Q27823178', 'Q27824058', 'Q27826463', 'Q27915156', 'Q27915171', 'Q27915172', 'Q27915173', 'Q27915174', 'Q27967078', 'Q27978793', 'Q28009469', 'Q28049484', 'Q28049572', 'Q28344234', 'Q28846068', 'Q28846076', 'Q28

In [None]:
# create a property representing "subclass of"
P_SUBCLASS_OF = "P279"
p279_dict = get_entity_dict_from_api(P_SUBCLASS_OF)
p279 = WikidataProperty(p279_dict)

# create a lexeme representing "bank"
L_BANK = "L3354"
l3354_dict = get_entity_dict_from_api(L_BANK)
l3354 = WikidataLexeme(l3354_dict)

# Query the File Format Entities

The next blocks use `qwikidata` to send a SPARQL query to Wikidata.
The query looks for all of the items associated with a file format or
file format family.

As of October 2021, the following count query suggests that there should be 13,699 such
items:

```
SELECT (COUNT(?item) AS ?count)
WHERE {
        ?item wdt:P31/wdt:P279* wd:Q235557 .
}
```

This query was derived from the query that generates the table at https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics/Structures/File_formats/List
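
The count query and the item query differ only in their `SELECT` clause, so both can be generated from one template. The sketch below is an illustration only: `build_instance_query` is a hypothetical helper, and it assumes the `wdt:P31/wdt:P279*` path shown in the count query above.

```python
import re


def build_instance_query(class_qid, count_only=False):
    """Hypothetical helper: build the instance-of / subclass-of query for one class."""
    # Reject anything that is not a plain item id such as "Q235557".
    if not re.fullmatch(r"Q[1-9][0-9]*", class_qid):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    select = "(COUNT(?item) AS ?count)" if count_only else "?item"
    return (
        f"SELECT {select}\n"
        "WHERE {\n"
        f"        ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}\n"
    )


print(build_instance_query("Q235557", count_only=True))
```

The returned string can be passed to `return_sparql_query_results` in the same way as the hand-written queries.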

In [45]:
from qwikidata.sparql import (get_subclasses_of_item,
                              return_sparql_query_results)

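# send a sparql query to the wikidata query service and get the full result back
# this query selects every item linked to wd:Q235557 through wdt:P31 / wdt:P279 paths
# note: the path here uses wdt:P31*, while the count query above uses wdt:P31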
sparql_query = """
SELECT ?item 

WHERE { 
    ?item wdt:P31*/wdt:P279* wd:Q235557 
}
"""
file_format_results = return_sparql_query_results(sparql_query)

print(len(file_format_results))

2


In [53]:
for key in file_format_results:
    print(key)

head
results


In [96]:
# each entity URL is stored under 'item' -> 'value' in the entries of the
# 'bindings' list inside the 'results' dictionary
# to get the entity ID, split off the last element of the URL:

test_string = 'http://www.wikidata.org/entity/Q452197'

entity_id = test_string.split('/')[-1]  # 'entity_id' avoids shadowing the built-in id()

print(entity_id)

Q452197
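
`split('/')[-1]` returns the last path segment of any string, whether or not it is an entity URL. A slightly stricter sketch is shown here, assuming entity URIs always use the `http://www.wikidata.org/entity/` prefix seen in the test string. `qid_from_uri` is a hypothetical helper name.

```python
import re

# Matches only item URIs of the form seen in the query results above.
ENTITY_URI = re.compile(r"^http://www\.wikidata\.org/entity/(Q[1-9][0-9]*)$")


def qid_from_uri(uri):
    """Hypothetical helper: return the Q-id from an entity URI, or None if it does not match."""
    match = ENTITY_URI.match(uri)
    return match.group(1) if match else None


print(qid_from_uri("http://www.wikidata.org/entity/Q452197"))
print(qid_from_uri("http://example.org/not/an/entity"))
```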


In [97]:
format_related_entities_list = []

# 'binding' is used as the loop variable because 'format' would shadow the built-in format()
for binding in file_format_results['results']['bindings']:
    if binding['item']['type'] == 'uri':
        format_related_entities_list.append(binding['item']['value'].split('/')[-1])

len(format_related_entities_list)

13699
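
The item query has no `DISTINCT`, so it is possible (not verified here) that an item reachable by more than one path appears more than once in the list. The sketch below removes duplicates while keeping the original order. `sample_ids` is made-up sample data standing in for `format_related_entities_list`.

```python
# Made-up sample standing in for format_related_entities_list; one id is repeated.
sample_ids = ["Q452197", "Q1388170", "Q452197", "Q2623363"]

# dict keys are unique and keep insertion order, so this drops repeats in place.
unique_ids = list(dict.fromkeys(sample_ids))

print(len(sample_ids), len(unique_ids))
print(unique_ids)
```

If the two lengths match on the real list, the query returned no duplicates and this step can be dropped.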

In [100]:
# print the first 11 entity IDs
for entity in format_related_entities_list[:11]:
    print(entity)

Q452197
Q1388170
Q2623363
Q3063023
Q16530692
Q61047486
Q62391975
Q64952115
Q98381664
Q98381938
Q29946121
