# Wikidata items

In [1]:
!pip install sparqlwrapper

import sys
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import re


Defaulting to user installation because normal site-packages is not writeable
Collecting sparqlwrapper
  Using cached SPARQLWrapper-2.0.0-py3-none-any.whl (28 kB)
Collecting rdflib>=6.1.1 (from sparqlwrapper)
  Using cached rdflib-6.3.2-py3-none-any.whl (528 kB)
Collecting isodate<0.7.0,>=0.6.0 (from rdflib>=6.1.1->sparqlwrapper)
  Using cached isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: isodate, rdflib, sparqlwrapper
Successfully installed isodate-0.6.1 rdflib-6.3.2 sparqlwrapper-2.0.0


## Query Wikidata function

The function takes a SPARQL query string as its argument, runs the query and returns the result as a data frame.

In [2]:
def query_wikidata(query):
    endpoint_url = "https://query.wikidata.org/sparql"

    def get_results(endpoint_url, query):
        user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
        # TODO adjust user agent; see https://w.wiki/CX6
        sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()


    results = get_results(endpoint_url, query)

    raw = pd.json_normalize(results["results"]["bindings"])

    # keep only the "<variable>.value" columns and strip the ".value" suffix
    df = raw.filter(regex=r"\.value$")
    df = df.rename(columns=lambda x: re.sub(r'\.value$', '', x))

    # Order the columns so that they are always in the same order.
    # reindex() also adds any column that is missing from the result
    # (for example 'orcid' or 'wye' when no row has a value), filled with NaN,
    # so the selection cannot fail with a KeyError.
    cols = ['item', 'itemLabel', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
            'yob', 'yod', 'wyb', 'wye']
    df = df.reindex(columns=cols)
    
    return df
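
To make the flattening step concrete, here is a minimal sketch in plain Python of what happens to the SPARQL JSON result before it becomes a data frame. The `sample` payload is hand-written for illustration (its values are copied from two rows of the output shown further down), and the helper name `flatten_bindings` is hypothetical and not part of the notebook. It only mimics the effect of `pd.json_normalize` plus the `.value` filter; it is not how pandas does it internally.

```python
# Hand-written fragment in the SPARQL JSON results format:
# every bound variable is an object with a "type" and a "value".
sample = {
    "head": {"vars": ["item", "itemLabel", "orcid", "viaf"]},
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q66934"},
                "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Hans Hermann Behr"},
                "viaf": {"type": "literal", "value": "20328622"},
            },
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q66322"},
                "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Franz Anton Menge"},
                "viaf": {"type": "literal", "value": "59847236"},
            },
        ]
    },
}


def flatten_bindings(results, columns):
    """Keep only each variable's "value"; unbound variables become None."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Variables from unmatched OPTIONAL clauses are absent from the binding
        rows.append({col: binding.get(col, {}).get("value") for col in columns})
    return rows


rows = flatten_bindings(sample, ["item", "itemLabel", "orcid", "viaf"])
for row in rows:
    print(row)
```

Neither sample row has an ORCID, so the `orcid` key is missing from both bindings. This is the situation in which a fixed column list is needed to keep the shape of the data frame stable across queries.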

## SPARQL queries

Because wildcard searches against item labels in Wikidata are very slow and generally time out, we instead run a number of searches on the presence of relevant identifiers.

The SPARQL queries are courtesy of Mathias Dillen, Botanic Garden Meise: https://github.com/matdillen/STSM-wikidata-people/blob/master/collectormatching.Rmd. I have changed them slightly: I removed from the SELECT clause the identifier whose presence is being queried, and added the Harvard Index of Botanists ID (P6264), IPNI ID (P586) and IPNI Standard Form (P428). This way all queries return the same terms and no data will be lost when duplicates are removed. The added terms will be valuable for verifying matches later on.

…

Update 2023-04-24: refactored from Bloodhound to Bionomia

### “Biologists” in general

For instance: Walter G. Berendsohn (https://www.wikidata.org/wiki/Q54499411) is described as:

- occupation: researcher, botanist

… and we could try to query him, or biologists in general, by using the relation that botanist is a subclass of an occupation in the field of biology. However, this also retrieves "horse breeder", "physiologist" and so on, which are not of primary interest and would add a lot of noise to the data. In theory, to get all biologists we would ask for:

- occupation (P106)/subclass of (P279) = biologist (Q864503) or
- occupation (P106)/subclass of (P279) = biology (Q420) or
- field of work (P101)/subclass of (P279) = biologist (Q864503) or
- field of work (P101)/subclass of (P279) = biology (Q420)

… and the query to get occupation ~ biologist:

    ?item wdt:P31 wd:Q5 .
    ?item p:P106 ?statement_occupation_biologist.
    ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503.
    # times out; could perhaps be split into parts with LIMIT and OFFSET

- Occupation in biology gets 315 hits (17.5.2023), many of them "horse breeder" or "racehorse owner", which is not helpful.
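
The LIMIT/OFFSET idea from the query comment above could be sketched as follows. This is an untested assumption, not something the notebook does: the helper name `paged_queries` and the page size are hypothetical, and it is not certain that paging avoids the timeout, since the `wdt:P279*` path still has to be evaluated for every page. `ORDER BY` is added because paging without a defined order does not guarantee stable pages, but sorting a large result can itself be expensive.

```python
def paged_queries(base_query, page_size, max_pages, order_by="?item"):
    """Yield copies of base_query with ORDER BY / LIMIT / OFFSET appended."""
    for page in range(max_pages):
        yield (
            f"{base_query.rstrip()}\n"
            f"ORDER BY {order_by}\n"
            f"LIMIT {page_size}\n"
            f"OFFSET {page * page_size}"
        )


# Same graph pattern as the biologist query above, wrapped in a SELECT
base = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item p:P106 ?statement_occupation_biologist .
  ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503 .
}
"""

pages = list(paged_queries(base, page_size=5000, max_pages=3))
print(pages[1])
```

In a real run, each page would be passed to `query_wikidata`, and the loop would stop as soon as a page returns fewer rows than `page_size`.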

The queries `occupation_biologist` and `fieldofwork_biology` are perhaps too broad (and they also time out), so we try to narrow the search to botanists.

In [3]:
queries = {}

In [4]:
queries['occupation_botanist'] = """
SELECT DISTINCT ?item ?itemLabel ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye ?harv ?ipni ?abbr ?bionomia_id 
  WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,fr". }
  {
    SELECT DISTINCT ?item 
      ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye 
      ?harv ?ipni ?abbr ?bionomia_id WHERE {
        ?item wdt:P31 wd:Q5 . # Q5 human
        ?item p:P106 ?statement_occupation_botanist.
        ?statement_occupation_botanist (ps:P106) wd:Q2374149.
        OPTIONAL { ?item wdt:P496  ?orcid . }
        OPTIONAL { ?item wdt:P214  ?viaf . }
        OPTIONAL { ?item wdt:P213  ?isni . }
        OPTIONAL { ?item wdt:P6264 ?harv . }
        OPTIONAL { ?item wdt:P586  ?ipni . }
        OPTIONAL { ?item wdt:P428  ?abbr . }
        OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
        OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
        OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
        OPTIONAL { ?item wdt:P1317 ?fl .  BIND(YEAR(?fl)  as ?fly) }
        OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) } # work period (start)
        OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) } # work period (end)
      }
    # LIMIT 10000 # 61026 records 17.5.2023
  }
}
"""

### Bionomia ID (P6944)

In [5]:
queries['bionomia_id'] = """
SELECT DISTINCT ?item ?itemLabel ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye 
  ?harv ?ipni ?abbr ?bionomia_id
WHERE {
  ?item wdt:P31 wd:Q5 . # Q5 human
  ?item wdt:P6944 ?id . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } 
  OPTIONAL { ?item wdt:P496 ?orcid .}
  OPTIONAL { ?item wdt:P214 ?viaf .}
  OPTIONAL { ?item wdt:P213 ?isni .}
  OPTIONAL { ?item wdt:P6264 ?harv . }
  OPTIONAL { ?item wdt:P586 ?ipni . }
  OPTIONAL { ?item wdt:P428 ?abbr . }
  OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
  OPTIONAL { ?item wdt:P569 ?dob . BIND(YEAR(?dob) as ?yob) }
  OPTIONAL { ?item wdt:P570 ?dod . BIND(YEAR(?dod) as ?yod) }
  OPTIONAL { ?item wdt:P1317 ?fl . BIND(YEAR(?fl) as ?fly) }
  OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) }
  OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) }
}
"""

## Create the data frame

Run all the SPARQL requests, concatenate the results and drop duplicates.

In [7]:
# Run the queries and create a list of data frames
frames = []
for key, query in queries.items():
    print(key + ': get data …')
    dfi = query_wikidata(query)
    print(key + ': ' + str(len(dfi.index)) + ' records')
    frames.append(dfi)

# Concatenate the dataframes from each SPARQL request
df = pd.concat(frames)

# Drop duplicates
df = df.drop_duplicates(subset=['item'])

df.head()

occupation_botanist: get data …
occupation_botanist: 61785 records
bionomia_id: get data …
bionomia_id: 14044 records


Unnamed: 0,item,itemLabel,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,6129-1,M.Bieb.,Q66612,1768,1826,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,619-1,Behr,Q66934,1818,1904,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,12818-1,Schaeff.,,1718,1790,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,4855-1,Klotzsch,Q67003,1805,1860,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,23266-1,Menge,,1808,1880,,
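
A note on the de-duplication step: `drop_duplicates(subset=['item'])` keeps the first row for each item by default (`keep='first'`). A SPARQL query returns one row per combination of values, so an item with, say, two VIAF IDs arrives as two rows, of which only the first survives. The plain-Python sketch below mirrors that behaviour. The helper name `keep_first_per_item` and the rows are placeholders for illustration, not real Wikidata content.

```python
def keep_first_per_item(rows):
    """Return rows with only the first occurrence of each 'item' value."""
    seen = set()
    unique = []
    for row in rows:
        if row["item"] not in seen:
            seen.add(row["item"])
            unique.append(row)
    return unique


# Placeholder rows: one item that has two values for the same property
rows = [
    {"item": "Q_EXAMPLE_1", "viaf": "viaf-a"},
    {"item": "Q_EXAMPLE_1", "viaf": "viaf-b"},
    {"item": "Q_EXAMPLE_2", "viaf": "viaf-c"},
]

unique = keep_first_per_item(rows)
print(unique)
```

Because `occupation_botanist` is run first, its rows take precedence over those from `bionomia_id` for items returned by both queries. Since both queries select the same terms, this should lose little, but second values of multi-valued properties are dropped.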


Add surname, initials and canonical string (`surname` + ', ' + `initials`) columns.

TODO: optimize name splitting (perhaps use https://libraries.io/rubygems/dwc_agent here in the loop). Labels such as the following are hard to split correctly:

- itemLabel: `William J. Bell (entomologist)`
- itemLabel: `Cecil Stevenson Garnett; d.1950`
- itemLabel: `William Vernon (c. 1666-1711)`
- itemLabel: `Hildur von Rettig (Lindberg)`
- itemLabel: `(Johan) Fredrik(Friedrich) (Eberhard) Svanlund`
- itemLabel: `[M.] O.K. Poon`
- itemLabel: `(J.A.A.)M.(H.) Goossens-Fontana`

… perhaps these are Wikidata entries that should be cleaned up. In any case, `dwcagent` could provide assistance if needed:

```bash
# https://www.wikidata.org/wiki/Q21610079
dwcagent '(Johan) Fredrik(Friedrich) (Eberhard) Svanlund' | jq '.'
```
gives
```json
[
  {
    "family": "Svanlund",
    "given": "Johan Fredrik",
    "suffix": null,
    "particle": null,
    "dropping_particle": null,
    "nick": null,
    "appellation": null,
    "title": null
  }
]
```
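
As a lighter alternative for some of the labels listed above, a trailing qualifier could be stripped before the label is split. The sketch below is an assumption, not part of the notebook: the helper name `strip_trailing_qualifier` is hypothetical, and it only handles a bracketed group at the very end of the label and anything after a semicolon. Labels with brackets in the middle are left unchanged and would still need `dwcagent` or manual clean-up.

```python
import re

# A "(...)" or "[...]" group at the very end, e.g. "(entomologist)"
TRAILING_BRACKETS = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]\s*$")
# Everything from the first semicolon onwards, e.g. "; d.1950"
AFTER_SEMICOLON = re.compile(r"\s*;.*$")


def strip_trailing_qualifier(label):
    """Remove a trailing bracketed group or a '; ...' suffix from a label."""
    cleaned = AFTER_SEMICOLON.sub("", label)
    cleaned = TRAILING_BRACKETS.sub("", cleaned)
    return cleaned.strip()


labels = [
    "William J. Bell (entomologist)",
    "Cecil Stevenson Garnett; d.1950",
    "William Vernon (c. 1666-1711)",
    "(Johan) Fredrik(Friedrich) (Eberhard) Svanlund",
]
for label in labels:
    print(repr(label), "->", repr(strip_trailing_qualifier(label)))
```

One caveat: for a label such as `Hildur von Rettig (Lindberg)` the bracketed part may be an alternative surname, so stripping it discards information that could matter for matching.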

In [8]:
surname = []
initials = []
canonical = []

# TODO: write a function to use dwcagent only when the last word is NOT a name (also for initials)
for i, item in df.iterrows():
    words = re.split('[ .]', item['itemLabel'])
    words = [string for string in words if string != ""]
    surname.append(words[-1]) # TODO: optimize splitting of surname when the last word is not a name
    if len(words) == 1:
        initials.append(".".join(words[-1][0]) + '.')
        canonical.append(words[-1])
    else:
        initials.append(".".join([x[0] for x in words[0:-1]]) + '.')
        canonical.append(words[-1] + ', ' + ".".join([x[0] for x in words[0:-1] if len(x) > 0]) + '.')
    
df['surname'] = surname
df['initials'] = initials
df['canonical_string'] = canonical
    
df.head()

Unnamed: 0,item,itemLabel,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye,surname,initials,canonical_string
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,6129-1,M.Bieb.,Q66612,1768,1826,,,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v."
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,619-1,Behr,Q66934,1818,1904,,,Behr,H.H.,"Behr, H.H."
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,12818-1,Schaeff.,,1718,1790,,,Schäffer,J.C.,"Schäffer, J.C."
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,4855-1,Klotzsch,Q67003,1805,1860,,,Klotzsch,J.F.,"Klotzsch, J.F."
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,23266-1,Menge,,1808,1880,,,Menge,F.A.,"Menge, F.A."


In [9]:
df = df[['item', 'itemLabel', 
        'surname', 'initials', 'canonical_string', 
        'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
        'yob', 'yod', 'wyb', 'wye']]
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.",,43340073,0000 0001 1630 5464,1373,6129-1,M.Bieb.,Q66612,1768,1826,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.",,20328622,0000 0001 1604 8680,42741,619-1,Behr,Q66934,1818,1904,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.",,47016953,0000 0000 8343 3899,1101,12818-1,Schaeff.,,1718,1790,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.",,20426762,0000 0001 1749 2732,135,4855-1,Klotzsch,Q67003,1805,1860,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.",,59847236,0000 0001 1653 0899,73782,23266-1,Menge,,1808,1880,,


In [16]:
wikidata_link = []
orcid_link = []
harv_link = []
ipni_link = []
bionomia_link = []
# enc_au_sc_link = []
# au_dict_bio_link = []
for i, row in df.iterrows():
    wikidata_link.append(row['item'].replace('entity', 'wiki'))
    orcid_link.append('https://orcid.org/' + str(row['orcid']) if pd.notnull(row['orcid']) else None)
    harv_link.append('https://kiki.huh.harvard.edu/databases/botanist_search.php?mode=details&id=' + str(row['harv']) if pd.notnull(row['harv']) else None)
    ipni_link.append('https://www.ipni.org/a/' + row['ipni'] if pd.notnull(row['ipni']) else None)
    bionomia_link.append('https://bionomia.net/' + row['bionomia_id'] if pd.notnull(row['bionomia_id']) else None)
    # enc_au_sc_link.append('http://www.eoas.info/biogs/' + row['enc_au_sc_id'] if pd.notnull(row['enc_au_sc_id']) else None)
    # au_dict_bio_link.append('http://adb.anu.edu.au/biography/' + row['au_dict_bio'] if pd.notnull(row['au_dict_bio']) else None)
    
df['wikidata_link'] = wikidata_link
df['orcid_link'] = orcid_link
df['harv_link'] = harv_link
df['ipni_link'] = ipni_link
df['bionomia_link'] = bionomia_link
# df['enc_au_sc_link'] = enc_au_sc_link
# df['au_dict_bio_link'] = au_dict_bio_link

# df[df['au_dict_bio_link'].notnull()]
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,orcid,viaf,isni,harv,ipni,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q178412,Giovanni Battista Amici,Amici,G.B.,"Amici, G.B.",,15020565,0000 0001 1598 9892,72663,32247-1,...,,1786,1863,,,http://www.wikidata.org/wiki/Q178412,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/32247-1,
1,http://www.wikidata.org/entity/Q125762,Alfred Heilbronn,Heilbronn,A.,"Heilbronn, A.",,200321915,0000 0004 1745 4316,80826,3784-1,...,,1885,1961,,,http://www.wikidata.org/wiki/Q125762,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/3784-1,
2,http://www.wikidata.org/entity/Q169306,Thomas Hanbury,Hanbury,T.,"Hanbury, T.",,2043468,0000 0001 1224 4895,61186,12529-1,...,Q169306,1832,1907,,,http://www.wikidata.org/wiki/Q169306,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12529-1,https://bionomia.net/Q169306
3,http://www.wikidata.org/entity/Q160362,Theophrastus,Theophrastus,T.,Theophrastus,,100212289,0000 0003 9865 0994,66197,35326-1,...,,-370,-286,,,http://www.wikidata.org/wiki/Q160362,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/35326-1,
4,http://www.wikidata.org/entity/Q161420,Carlo Allioni,Allioni,C.,"Allioni, C.",,39365716,0000 0001 1567 4786,1775,20034648-1,...,Q161420,1728,1804,,,http://www.wikidata.org/wiki/Q161420,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/20034648-1,https://bionomia.net/Q161420


In [10]:
import os
from datetime import datetime

# write the data frame as CSV with a timestamp in the file name
os.makedirs('data', exist_ok=True)

this_output_file=os.path.join(
    "data", 'wikidata_persons_botanists_%s.csv' % (datetime.today().strftime('%Y%m%d_%H%M'))
)

df.to_csv(this_output_file)

print("Wrote data frame into", this_output_file)

Wrote data frame into data/wikidata_persons_botanists_20230703_1352.csv
