# Get Wikidata items (=resource data)

See also:
- [comment in issue #1](https://github.com/infinite-dao/collector-matching/issues/1#issuecomment-1819337177) on how to also get the `skos:altLabel` of names, i.e. “also known as …”.

In [1]:
!pip install sparqlwrapper

import sys
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import re


error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
    python-xyz', where xyz is the package you are trying to
    install.

    If you wish to install a non-Arch-packaged Python package,
    create a virtual environment using 'python -m venv path/to/venv'.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip.

    If you wish to install a non-Arch packaged Python application,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. Make sure you have python-pipx
    installed via pacman.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-s

## Query Wikidata function

The function takes a SPARQL query string as its argument, runs the query against the Wikidata endpoint and returns the result as a data frame.

In [2]:
def query_wikidata(query):
    endpoint_url = "https://query.wikidata.org/sparql"

    def get_results(endpoint_url, query):
        user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
        # TODO adjust user agent; see https://w.wiki/CX6
        sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()


    results = get_results(endpoint_url, query)

    raw = pd.json_normalize(results["results"]["bindings"])

    # raw strings avoid invalid escape sequence warnings for "\."
    df = raw.filter(regex=r"\.value$")
    df = df.rename(columns=lambda x: re.sub(r'\.value$', '', x))
    
    if 'orcid' not in df.columns:
        df['orcid'] = None
    if 'wye' not in df.columns:
        df['wye'] = None
    # if 'au_dict_bio' not in df.columns:
    #     df['au_dict_bio'] = None
    
    # order columns so that they are always in the same order
    cols = ['item', 'itemLabel', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
            'yob', 'yod', 'wyb', 'wye']
    df = df[cols]
    
    return df
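
The flattening step inside `query_wikidata` can be hard to picture without seeing the shape of a SPARQL JSON result. The following is a minimal sketch in plain Python (no pandas) of the same idea: keep only the `value` of each binding and fill unbound OPTIONAL variables with `None`. The `sample_results` dictionary is hand-written for illustration, with values copied from the output rows shown further down in this notebook, and `flatten_bindings` is a hypothetical helper, not part of the notebook.

```python
# Hand-written miniature of a SPARQL JSON result (illustration only, not a real response)
sample_results = {
    "head": {"vars": ["item", "itemLabel", "viaf", "orcid"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q66934"},
            "itemLabel": {"xml:lang": "en", "type": "literal", "value": "Hans Hermann Behr"},
            "viaf": {"type": "literal", "value": "20328622"},
            # no "orcid" key: unbound OPTIONAL variables are simply absent
        },
        {
            "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q66661"},
            "itemLabel": {"xml:lang": "en", "type": "literal", "value": "Jacob Christian Schäffer"},
            "viaf": {"type": "literal", "value": "47016953"},
        },
    ]},
}


def flatten_bindings(results, columns):
    """Return one plain dict per result row, using None for unbound variables."""
    return [
        {column: binding.get(column, {}).get("value") for column in columns}
        for binding in results["results"]["bindings"]
    ]


rows = flatten_bindings(sample_results, ["item", "itemLabel", "viaf", "orcid"])
for row in rows:
    print(row)
```

Because no row binds `orcid` here, the key would be missing entirely after a plain normalisation, which is presumably why the function above adds `orcid` and `wye` by hand when they are absent.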

## SPARQL queries

Because wildcard searches against item labels in Wikidata are very slow and generally time out, we instead run a number of queries that test for the presence of relevant identifiers.

SPARQL queries courtesy of Mathias Dillen, Botanic Garden Meise: https://github.com/matdillen/STSM-wikidata-people/blob/master/collectormatching.Rmd. I have changed them slightly: the identifier whose presence a query tests for is no longer treated specially in the SELECT clause, and the Harvard Index of Botanists ID (P6264), IPNI ID (P586) and IPNI Standard Form (P428) have been added. This way all queries return the same terms and no data is lost when duplicates are removed. The added terms will be valuable for verifying matches later on.

…

Technical note: refactored from Bloodhound to Bionomia.

### “Biologists” in general

For instance: Walter G. Berendsohn (https://www.wikidata.org/wiki/Q54499411) is described as:

- occupation: researcher, botanist

… and we could try to query him, or biologists in general, by using the relation that botanist is a subclass of an occupation in the field of biology. However, this also retrieves "horse breeder", "physiologist" and so on, which are not of primary interest and would add a lot of noise to the data. In theory, to get all biologists we would ask for:

- occupation (P106)/subclass of (P279) = is biologist (Q864503) or
- occupation (P106)/subclass of (P279) = is biology (Q420) or
- field of work (P101)/subclass of (P279) = is biologist (Q864503) or
- field of work (P101)/subclass of (P279) = is biology (Q420)

… and the query to get occupation ~ biologist:

    ?item wdt:P31 wd:Q5 .
    ?item p:P106 ?statement_occupation_biologist.
    ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503.
    # gets time out, could be cut into parts with LIMIT and OFFSET perhaps

- Occupation in biology returns 315 hits (17.5.2023), many of them "horse breeder" or "racehorse owner", which is not helpful.

The queries `occupation_biologist` and `fieldofwork_biology` are probably too broad (and also time out), so we narrow the search to botanists.
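
The comment in the query fragment above suggests cutting the request into parts with LIMIT and OFFSET. A minimal sketch of that idea follows, assuming the base query is a complete `SELECT … WHERE { … }` string without its own solution modifiers. `paged_queries` is a hypothetical helper, the added `ORDER BY ?item` is an assumption to keep pages stable, and whether paging actually avoids the timeout has not been tested here.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield copies of base_query with ORDER BY / LIMIT / OFFSET appended."""
    base = base_query.strip()
    for page in range(max_pages):
        # a fixed ORDER BY keeps the pages disjoint between requests
        yield f"{base}\nORDER BY ?item\nLIMIT {page_size}\nOFFSET {page * page_size}"


base_query = """
SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item p:P106 ?statement_occupation_biologist .
  ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503 .
}
"""

pages = list(paged_queries(base_query, page_size=1000, max_pages=3))
print(pages[2])
```

Each generated string could then be passed to `query_wikidata` one at a time, stopping when a page comes back empty.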

### Get Botanists

Querying biologists returns too many names to be useful for comparing botanists, so the query is limited to the narrowest suitable occupational branch, i.e. the occupational title botanist. The query requests something like:

- occupation (P106)/subclass of (P279) = is botanist (Q2374149) or
- field of work (P101)/subclass of (P279) = is botanist (Q2374149)
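
To illustrate what the property path `ps:P106/(wdt:P279*)` does, here is a small sketch that walks a subclass hierarchy in plain Python. The `SUBCLASS_OF` graph is toy data invented for this example and does not reflect the real Wikidata class tree. It only shows why a broad target such as biologist pulls in unrelated occupations while a narrow target such as botanist does not.

```python
# Toy hierarchy, NOT real Wikidata content: occupation -> direct superclasses
SUBCLASS_OF = {
    "bryologist": ["botanist"],
    "botanist": ["biologist"],
    "horse breeder": ["animal breeder"],
    "animal breeder": ["biologist"],  # hypothetical edge for illustration
}


def reaches(occupation, target, subclass_of):
    """True if target is reachable via zero or more subclass-of steps (like P279*)."""
    seen = set()
    stack = [occupation]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(subclass_of.get(current, []))
    return False


for occupation in ["bryologist", "botanist", "horse breeder"]:
    print(occupation,
          "| botanist:", reaches(occupation, "botanist", SUBCLASS_OF),
          "| biologist:", reaches(occupation, "biologist", SUBCLASS_OF))
```

The `*` in the path means zero or more steps, so an item whose occupation is botanist itself also matches.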


In [3]:
queries = {}

In [10]:
queries['occupation_botanist'] = """
SELECT DISTINCT ?item ?itemLabel ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye ?harv ?ipni ?abbr ?bionomia_id 
  WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,fr,ru,be,uz". }
  {
    SELECT DISTINCT ?item 
      ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye 
      ?harv ?ipni ?abbr ?bionomia_id WHERE {
        ?item wdt:P31 wd:Q5 . # Q5 human
        ?item p:P106 ?statement_occupation_botanist.
        # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
        ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
        OPTIONAL { ?item wdt:P496  ?orcid . }
        OPTIONAL { ?item wdt:P214  ?viaf . }
        OPTIONAL { ?item wdt:P213  ?isni . }
        OPTIONAL { ?item wdt:P6264 ?harv . }
        OPTIONAL { ?item wdt:P586  ?ipni . }
        OPTIONAL { ?item wdt:P428  ?abbr . }
        OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
        OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
        OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
        OPTIONAL { ?item wdt:P1317 ?fl .  BIND(YEAR(?fl)  as ?fly) }
        OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) } # work period start
        OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) } # work period end
      }
    # LIMIT 10000 # 71019 records 14.11.2023
  }
}
"""

### Bionomia ID (P6944)

In [11]:
queries['bionomia_id'] = """
SELECT DISTINCT ?item ?itemLabel ?orcid ?viaf ?isni ?yob ?yod ?fly ?wyb ?wye 
  ?harv ?ipni ?abbr ?bionomia_id
WHERE {
  ?item wdt:P31 wd:Q5 . # Q5 human
  ?item wdt:P6944 ?id . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,fr,ru,be,uz" } 
  OPTIONAL { ?item wdt:P496 ?orcid .}
  OPTIONAL { ?item wdt:P214 ?viaf .}
  OPTIONAL { ?item wdt:P213 ?isni .}
  OPTIONAL { ?item wdt:P6264 ?harv . }
  OPTIONAL { ?item wdt:P586 ?ipni . }
  OPTIONAL { ?item wdt:P428 ?abbr . }
  OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
  OPTIONAL { ?item wdt:P569 ?dob . BIND(YEAR(?dob) as ?yob) }
  OPTIONAL { ?item wdt:P570 ?dod . BIND(YEAR(?dod) as ?yod) }
  OPTIONAL { ?item wdt:P1317 ?fl . BIND(YEAR(?fl) as ?fly) }
  OPTIONAL { ?item wdt:P2031 ?wpb . BIND(YEAR(?wpb) as ?wyb) }
  OPTIONAL { ?item wdt:P2032 ?wpe . BIND(YEAR(?wpe) as ?wye) }
}
"""

## Create the data frame

Run all the SPARQL requests, concatenate the results and drop duplicates (one row is kept per `item`, the first one encountered).

In [12]:
# Run the queries and create a list of data frames
frames = []
for key, query in queries.items():
    print(key + ': get data …')
    dfi = query_wikidata(query)
    print(key + ': ' + str(len(dfi.index)) + ' records')
    frames.append(dfi)

# Concatenate the dataframes from each SPARQL request
df = pd.concat(frames)

# Drop duplicates
df = df.drop_duplicates(subset=['item'])

df.head()

occupation_botanist: get data …
occupation_botanist: 71019 records
bionomia_id: get data …
bionomia_id: 14883 records


Unnamed: 0,item,itemLabel,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,6129-1,M.Bieb.,Q66612,1768,1826,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,619-1,Behr,Q66934,1818,1904,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,12818-1,Schaeff.,,1718,1790,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,4855-1,Klotzsch,Q67003,1805,1860,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,23266-1,Menge,,1808,1880,,


Add surname, initials and canonical string (`surname` + ', ' + `initials`) columns.

TODO: optimize name splitting (perhaps use https://libraries.io/rubygems/dwc_agent here in the loop?). Examples of item labels that are hard to split:

- itemLabel: `William J. Bell (entomologist)`
- itemLabel: `Cecil Stevenson Garnett; d.1950`
- itemLabel: `William Vernon (c. 1666-1711)`
- itemLabel: `Hildur von Rettig (Lindberg)`
- itemLabel: `(Johan) Fredrik(Friedrich) (Eberhard) Svanlund`
- itemLabel: `[M.] O.K. Poon`
- itemLabel: `(J.A.A.)M.(H.) Goossens-Fontana`
- itemLabel: `Thomas Platter the Younger`

… perhaps these are Wikidata entries that should be cleaned up. In any case, `dwcagent` could provide assistance if needed:

```bash
# https://www.wikidata.org/wiki/Q21610079
dwcagent '(Johan) Fredrik(Friedrich) (Eberhard) Svanlund' | jq '.'
```
gives
```json
[
  {
    "family": "Svanlund",
    "given": "Johan Fredrik",
    "suffix": null,
    "particle": null,
    "dropping_particle": null,
    "nick": null,
    "appellation": null,
    "title": null
  }
]
```

In [19]:
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Friedrich August Marschall von,F. A. M. v.,F. A. M. v. Bieberstein,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,...,Q66612,1768,1826,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Hans Hermann,H. H.,H. H. Behr,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,...,Q66934,1818,1904,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Jacob Christian,J. C.,J. C. Schäffer,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,...,,1718,1790,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Johann Friedrich,J. F.,J. F. Klotzsch,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,...,Q67003,1805,1860,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Franz Anton,F. A.,F. A. Menge,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,...,,1808,1880,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [24]:
# add some methods to abbreviate a name word and to clean messy name input
#
# The idea is to get the family name at the end of the string, and to swap reversed name spellings
# ("family, given name" and so on), so that the call name comes first and the family or clan name
# comes near the end (as most names are given like this in Wikidata)

def abbr_word(word):
    """
    Return the abbreviation of a word, e.g: Anton → A. or (Lisa) → (L.)

    @param word: a word without spaces
    @return: str an abbreviated word
    """
    if len(word) > 0:
        word = word.strip()
        names_regex_substitution = {
            # generic simple names: Antonio OR (Antonius)
            r"^([^\w\s]*)(\w)\w*([^\w\s\.]*)$": r"\1\2.\3",
            # names having minus: Charles-Jeunet OR (Carl-Jeanet) → CJ. OR (CJ.)
            r"^([^\w\s]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)$": r"\1\2\3.\4",
            # names having apostrophe: (Ch'An)
            r"^([^\w\s]*)(\w)\w*['’’´`](\w)\w*([^\w\s\.]*)$": r"\1\2.\4",
            # names having comma: (Eugen,Eugène)
            r"^([^\w\s]*)(\w)\w*[,](\w)\w*([^\w\s\.]*)$": r"\1\2.,\3.\4",
            # names having immediate parentheses: Wilhelmus(Wim)
            r"^(\w)\w*(\()(\w)\w*(\))$": r"\1.\2\3.\4",
            # names having immediate parentheses + comma: Ion(Ioan,Joan)
            r"^(\w)\w*(\()(\w)\w*,(\w)\w*(\))$": r"\1.\2\3.,\4.\5",
            # names having immediate parentheses + an apostrophe: "Chan(Ch'An)"
            r"^(\w)\w*(\()(\w)\w*['’’´`](\w)\w*(\))$": r"\1.\2\3.\5",
            # names having immediate parentheses + minus in the name: 'Ken-Ichiro(Ken-Itirô)'
            r"^([^\w\s]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)$": r"\1\2\3.\4\5\6.\7",
        }

        for k_search_pattern, v_replace_pattern in names_regex_substitution.items():
            if re.match(k_search_pattern, word, re.IGNORECASE):
                return re.sub(k_search_pattern, v_replace_pattern, word)
        return word
    else:
        return word

def clean_person_name(this_full_name):
    """
    Clean person name from WikiData to get `given + … + family` and not in reverse order `family, given + … `
    
    TODO how to deal with comma and: …, Jr., II. aso. ? (should be considered when splitting on family name)
    
    @requires re
    @param this_full_name: str the name to be cleaned
    @return: str the processed name
    """
    parentheses_words_at_last = [
        'botaniker', 'botanist', 'botanist-1', 'bot.',
        'diatomist',
        'entomologist',
        'instruisto',
        'lehrer',
        'mycologist',
        'taxonomist', 'teacher',
        'zoologist'
    ]
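    # re.escape keeps entries such as 'bot.' literal (an unescaped dot would match any character)
    regex_paranthesis_words = re.compile(
        r' +\((' + '|'.join(re.escape(word) for word in parentheses_words_at_last) + r')\) *$',
        flags=re.IGNORECASE
    )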

    # delete life time e.g. "… (c. 1534)" or "… (1748-1801)"
    this_full_name = re.sub(r" +\([c. ]*\d+[-–—]*\d+\) *", r"", this_full_name)
    # delete occupations in parentheses
    this_full_name = regex_paranthesis_words.sub(r"", this_full_name)
    # delete noble designations, e.g. “Sir James Nasmyth, 2nd Baronet” → “Sir James Nasmyth”
    this_full_name = re.sub(r" *, +[2]*(1st|2nd|3rd|[4-9]th|[1][0-9]th)[^,]+$", r"", this_full_name)
    # reverse cyrillic simple names: Штейп, Владимир Владимирович → Владимир Владимирович Штейп
    this_full_name = re.sub(r"^([\u0400-\u04FF-]+),\s+([\u0400-\u04FF- ]+)$", r"\2 \1", this_full_name)    
        
    return this_full_name

def simple_namereverse2family_last(this_name_reversed):
    """
    Get the family name or clan name to the last position from names that seem obviously reversed
    
    Draft: it should be applied, after family name splitting was unsuccessful
    @param this_name_reversed:
    @return:
    """
    if len(this_name_reversed) > 0:
        this_name_reversed = this_name_reversed.strip()
        names_regex_substitution = {
            # Baudoin-Bodin, Jacqueline
            # Brennecke, Dorothea
            # Chi, Chün-tao
            # Cormack, R.G.H.
            # Gajón Sánchez, Carlos
            r"^(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*),\s(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*)$": r"\2 \1",
            # TODO William E., III, Fox → William E. Fox, III
            r"^(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*),\s([IVX]+),\s(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*)$": r"\1 \3, \2",
        }

        for k_search_pattern, v_replace_pattern in names_regex_substitution.items():
            if re.match(k_search_pattern, this_name_reversed, re.IGNORECASE):
                return re.sub(k_search_pattern, v_replace_pattern, this_name_reversed)
        return this_name_reversed
    else:
        return this_name_reversed

regex_split_on_family_name = re.compile(
    r"""
    (?<!,)
    \s+ (
     \w+, \s+ [Bb]aron[in]*   \s+ [Vv]on \s+ \w+[-]?\w+
    |\w+, \s+ [Gg]r[aä]f[in]* \s+ [Vv]on \s+ \w+[-]?\w+
    |\w+, \s+ [Dd]uke         \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+, \s+ [Dd]uchess      \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+, \s+ [Cc]ountess     \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+, \s+ [Mm]archioness  \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+, \s+ [Dd]uchesse     \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+, \s+ [Cc]omte        \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+, \s+ [Cc]omtesse     \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+, \s+ later           \s+ \w+[-]?\w+
    |\w+, \s+ [XVI]+[.]?
    |\w+, \s+ Junior[.]?
    |\w+, \s+ [JjS]r[.]?
    |\w['’‘]\w+
    |\w+
    |\w+\s?\([^()]+\)[.]?
    |\w\w*[.]?[-–—]\w+
    )$
    """,
    re.VERBOSE | re.MULTILINE
)
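
To see what the clean-up step does to some of the difficult labels listed earlier, here is a small stand-alone sketch. It copies the life-date pattern from `clean_person_name` and uses a shortened occupation list, so it runs without the rest of the notebook. `strip_label_noise` is a hypothetical name used only in this example.

```python
import re

# Life-date pattern as in clean_person_name(); occupation list shortened for the example
LIFE_DATES = re.compile(r" +\([c. ]*\d+[-–—]*\d+\) *")
OCCUPATION = re.compile(r" +\((botanist|entomologist|zoologist)\) *$", re.IGNORECASE)


def strip_label_noise(label):
    """Remove a parenthesised life span and a trailing parenthesised occupation."""
    label = LIFE_DATES.sub("", label)
    return OCCUPATION.sub("", label)


labels = [
    "William J. Bell (entomologist)",
    "Cecil Stevenson Garnett; d.1950",
    "William Vernon (c. 1666-1711)",
    "Thomas Platter the Younger",
]
cleaned = [strip_label_noise(label) for label in labels]
for before, after in zip(labels, cleaned):
    print(f"{before!r} -> {after!r}")
```

Labels such as `Cecil Stevenson Garnett; d.1950` pass through unchanged, which shows where additional rules (or a tool like `dwcagent`) would still be needed.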

In [25]:
surname = []
initials = []
canonical = []
canonical_fullname = []

# TODO: use dwcagent on names? Unfortunately it removes all parentheses content

for i, item in df.iterrows():
    thisItemLabelModified = item['itemLabel'].strip()
    thisItemLabelModified = clean_person_name(thisItemLabelModified)
    # split at last name part, optionally having parentheses TODO how to deal with Jr., II. aso. 
    # print(thisItemLabelModified)
    nameparts = regex_split_on_family_name.split(thisItemLabelModified)
    if len(nameparts) == 1 and re.findall(r",", thisItemLabelModified):
        # re-iterate through probably reversed names
        thisItemLabelModified = simple_namereverse2family_last(thisItemLabelModified)
        nameparts = regex_split_on_family_name.split(thisItemLabelModified)
    # remove empty/None: ['Friedrich August Marschall von', 'Bieberstein', None, '']
    nameparts = [string for string in nameparts if string]
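    if len(nameparts)==1:
        # label could not be split: no initials and no `surname` value, the whole label is kept as is
        initials.append("")
        surname.append("")
        canonical.append(" ".join(nameparts))
        canonical_fullname.append(" ".join(nameparts))
    else:
        # NOTE nameparts[0] holds the leading (given) name parts and nameparts[1] the family name,
        # so the `surname` column currently receives the leading name parts
        # TODO check whether `surname` should hold nameparts[1] instead
        surname.append(nameparts[0])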
        first_nameparts = re.split('[ ]', nameparts[0])
        first_nameparts = [string for string in first_nameparts if string]
        # print(first_nameparts)
        initials.append(" ".join([abbr_word(w) for w in first_nameparts if len(w) > 0]) )
        canonical.append(" ".join([abbr_word(w) for w in first_nameparts if len(w) > 0]) + " " + nameparts[1])
        # canonical.append(words[-1] + ', ' + ".".join([w[0] for w in words[0:-1] if len(w) > 0]) + '.')
        canonical_fullname.append(" ".join(nameparts))
    
df['surname'] = surname
df['initials'] = initials
df['canonical_string'] = canonical
df['canonical_string_fullname'] = canonical_fullname
    
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Friedrich August Marschall von,F. A. M. v.,F. A. M. v. Bieberstein,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,...,Q66612,1768,1826,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Hans Hermann,H. H.,H. H. Behr,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,...,Q66934,1818,1904,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Jacob Christian,J. C.,J. C. Schäffer,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,...,,1718,1790,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Johann Friedrich,J. F.,J. F. Klotzsch,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,...,Q67003,1805,1860,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Franz Anton,F. A.,F. A. Menge,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,...,,1808,1880,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,
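
The canonical string shown in the output above combines abbreviated given names with the family name. The following stand-alone sketch reproduces that idea in simplified form, covering only plain and hyphenated name words (the full `abbr_word` handles many more cases). `initial` and `canonical` are hypothetical helper names for this example.

```python
import re


def initial(word):
    """Abbreviate one name word: Anton -> A., Charles-Jeunet -> CJ.; other forms are kept."""
    hyphenated = re.match(r"^(\w)\w*[-–—](\w)\w*$", word)
    if hyphenated:
        return hyphenated.group(1) + hyphenated.group(2) + "."
    simple = re.match(r"^(\w)\w*$", word)
    if simple:
        return simple.group(1) + "."
    return word  # e.g. already abbreviated "J."


def canonical(given_names, family_name):
    """Join the initials of all given-name words with the family name."""
    return " ".join(initial(word) for word in given_names.split()) + " " + family_name


print(canonical("Friedrich August Marschall von", "Bieberstein"))
print(canonical("Hans Hermann", "Behr"))
```

Particles such as `von` are abbreviated like any other word here, which matches the `F. A. M. v. Bieberstein` value in the output table.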


In [26]:
df = df[['item', 'itemLabel', 
        'surname', 'initials', 'canonical_string', 'canonical_string_fullname',
        'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
        'yob', 'yod', 'wyb', 'wye']]
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Friedrich August Marschall von,F. A. M. v.,F. A. M. v. Bieberstein,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,6129-1,M.Bieb.,Q66612,1768,1826,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Hans Hermann,H. H.,H. H. Behr,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,619-1,Behr,Q66934,1818,1904,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Jacob Christian,J. C.,J. C. Schäffer,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,12818-1,Schaeff.,,1718,1790,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Johann Friedrich,J. F.,J. F. Klotzsch,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,4855-1,Klotzsch,Q67003,1805,1860,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Franz Anton,F. A.,F. A. Menge,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,23266-1,Menge,,1808,1880,,


In [27]:
# add custom columns
df['wikidata_link'] = df.apply(lambda this_df: this_df['item'].replace('entity', 'wiki'), axis="columns") # needed?
df['orcid_link'] = df[pd.notnull(df["orcid"])].apply(lambda this_df: 'https://orcid.org/' + str(this_df['orcid']), axis="columns")
df['harv_link'] = df[pd.notnull(df["harv"])].apply(lambda this_df: 'https://kiki.huh.harvard.edu/databases/botanist_search.php?mode=details&id=' + str(this_df['harv']), axis="columns")
df['ipni_link'] = df[pd.notnull(df["ipni"])].apply(lambda this_df: 'https://www.ipni.org/a/' + str(this_df['ipni']), axis="columns")
df['bionomia_link'] = df[pd.notnull(df["bionomia_id"])].apply(lambda this_df: 'https://bionomia.net/' + str(this_df['bionomia_id']), axis="columns")
df.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Friedrich August Marschall von,F. A. M. v.,F. A. M. v. Bieberstein,Friedrich August Marschall von Bieberstein,,43340073,0000 0001 1630 5464,1373,...,Q66612,1768,1826,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Hans Hermann,H. H.,H. H. Behr,Hans Hermann Behr,,20328622,0000 0001 1604 8680,42741,...,Q66934,1818,1904,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Jacob Christian,J. C.,J. C. Schäffer,Jacob Christian Schäffer,,47016953,0000 0000 8343 3899,1101,...,,1718,1790,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Johann Friedrich,J. F.,J. F. Klotzsch,Johann Friedrich Klotzsch,,20426762,0000 0001 1749 2732,135,...,Q67003,1805,1860,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Franz Anton,F. A.,F. A. Menge,Franz Anton Menge,,59847236,0000 0001 1653 0899,73782,...,,1808,1880,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,
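
The link columns all follow the same pattern: a fixed URL prefix plus the identifier, left empty when the identifier is missing. A minimal sketch of a null-safe helper for this is shown below. `make_link` and `is_missing` are hypothetical names, and the check covers `None` and float NaN, which is how missing values typically appear in a pandas object column.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def make_link(prefix, value):
    """Return prefix + value, or None when the identifier is missing."""
    return None if is_missing(value) else prefix + str(value)


print(make_link("https://www.ipni.org/a/", "6129-1"))
print(make_link("https://bionomia.net/", "Q66612"))
print(make_link("https://orcid.org/", None))
```

With such a helper, a column could be filled with something like `df['ipni'].map(lambda v: make_link('https://www.ipni.org/a/', v))`, which avoids filtering the data frame first.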


In [28]:
from datetime import datetime
# write data frame as CSV with a date time

import os
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file=os.path.join(
    "data", 'wikidata_persons_botanists_%s.csv' % 
    # "20230703_1352"
    # '%Y%m%d_%H%M'
    (datetime.today().strftime('%Y%m%d'))
)

df.to_csv(this_output_file)

print("Wrote data frame into", this_output_file)

Wrote data frame into data/wikidata_persons_botanists_20231116.csv
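
The commented-out lines in the cell above hint at an alternative file name that also carries the time of day. A small sketch of a helper that supports both variants follows. `output_path` is a hypothetical function, and it only builds the path without creating directories or writing anything.

```python
from datetime import datetime
from pathlib import Path


def output_path(directory, when, with_time=False):
    """Build the CSV path with a date stamp, optionally including hours and minutes."""
    stamp = when.strftime("%Y%m%d_%H%M" if with_time else "%Y%m%d")
    return Path(directory) / f"wikidata_persons_botanists_{stamp}.csv"


example_time = datetime(2023, 11, 16, 13, 52)
print(output_path("data", example_time).as_posix())
print(output_path("data", example_time, with_time=True).as_posix())
```

Including the time avoids overwriting an earlier export when the notebook is run more than once on the same day.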


Explanation of columns:

Column | Description
-|-
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
canonical_string_fullname | Canonical name string including full (given) name; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Year the work period began (derived from [P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Year the work period ended (derived from [P2032](https://www.wikidata.org/wiki/Property:P2032))