# Get Wikidata items (=resource data)

We query the `itemLabel` (`rdfs:label`) of botanist person names and also the `altLabel` (`skos:altLabel`), i.e. the “also known as …” name aliases. The work period start year (`wyb`) yielded hardly any useful data and was therefore removed for the time being. If you also want to query labels in other languages, please adjust the SPARQL queries.

In [1]:
!pip install sparqlwrapper

import sys
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import re

explain_and_show_the_data = True
this_timestamp_for_data=20240312 # (datetime.today().strftime('%Y%m%d')) # '%Y%m%d_%H%M'

## Query Wikidata function

The function takes a SPARQL query string as its argument, runs the query against the Wikidata endpoint and returns the result as a data frame.

In [2]:
def query_wikidata(query):
    endpoint_url = "https://query.wikidata.org/sparql"

    def get_results(endpoint_url, query):
        user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
        # TODO adjust user agent; see https://w.wiki/CX6
        sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        return sparql.query().convert()


    results = get_results(endpoint_url, query)

    raw = pd.json_normalize(results["results"]["bindings"])

    # raw strings avoid the invalid escape sequence "\." in a normal string literal
    thisdf = raw.filter(regex=r"\.value$")
    thisdf = thisdf.rename(columns=lambda x: re.sub(r'\.value$', '', x))
    
    if 'orcid' not in thisdf.columns:
        thisdf['orcid'] = None
    # if 'wye' not in thisdf.columns:
    #     thisdf['wye'] = None
    # if 'au_dict_bio' not in thisdf.columns:
    #     thisdf['au_dict_bio'] = None
    
    # order columns so that they are always in the same order
    # sparql_cols = ['item', 'itemLabel', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye']
    sparql_cols = ['item', 'itemLabel', 'altLabel', 'altLabel_lang', 'abbr', 'yob', 'yod', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'bionomia_id']
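    # reindex adds any column missing from the result as NaN instead of raising a KeyError
    thisdf = thisdf.reindex(columns=sparql_cols)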
    
    # for col in list(filter(lambda x: x != 'yob' and x != 'yod', sparql_cols)):
    #     thisdf[col].astype('string')
    
    return thisdf
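
The flattening step inside `query_wikidata` can be tried without network access. The following is a minimal sketch that assumes the endpoint answers in the usual SPARQL JSON layout (`results` → `bindings`); the `results` dictionary and its `Q_EXAMPLE_…` values are hand-written placeholders, not real Wikidata records.

```python
import re
import pandas as pd

# Hand-written stand-in for the JSON returned by the endpoint (placeholder values)
results = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_1"},
                "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Example Person"},
            },
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_2"},
                "itemLabel": {"type": "literal", "xml:lang": "en", "value": "Another Person"},
                "orcid": {"type": "literal", "value": "0000-0000-0000-0000"},
            },
        ]
    }
}

# json_normalize flattens the nested dicts into columns such as "item.value", "item.type"
raw = pd.json_normalize(results["results"]["bindings"])

# keep only the "*.value" columns and strip that suffix from the column names
flat = raw.filter(regex=r"\.value$")
flat = flat.rename(columns=lambda name: re.sub(r"\.value$", "", name))

# fix the column order; columns absent from the result (here: viaf) are added as NaN
flat = flat.reindex(columns=["item", "itemLabel", "orcid", "viaf"])
print(flat)
```

Optional variables that no row binds simply do not appear in the JSON, which is why the column list has to be enforced after flattening.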

## SPARQL queries

Because wildcard searches against item labels in Wikidata are very slow and generally time out, we run a number of searches on the presence of relevant identifiers.

SPARQL queries courtesy of Mathias Dillen, Botanic Garden Meise: https://github.com/matdillen/STSM-wikidata-people/blob/master/collectormatching.Rmd. I have changed them slightly: the identifier whose presence is queried was removed from the SELECT clause, and the Harvard Index of Botanists ID (P6264), IPNI ID (P586) and IPNI Standard Form (P428) were added. This way all queries return the same terms and no data will be lost when duplicates are removed. The added terms will be valuable for verifying matches later on.

…

Technical notes: refactor from Bloodhound to Bionomia

### “Biologists” in general

For instance: Walter G. Berendsohn (https://www.wikidata.org/wiki/Q54499411) is described as:

- occupation: researcher, botanist

… and we could try to query him, or biologists in general, by using the relation that a botanist should be a subclass of someone working in the field of biology. However, this also retrieves "horse breeder", "physiologist" and so on, which are not of primary interest and would add a lot of noise to the data. In theory, to get all biologists we would ask like this:

- occupation (P106)/subclass of (P279) = biologist (Q864503) or
- occupation (P106)/subclass of (P279) = biology (Q420) or
- field of work (P101)/subclass of (P279) = biologist (Q864503) or
- field of work (P101)/subclass of (P279) = biology (Q420)

… and the query to get occupation ~ biologist:

    ?item wdt:P31 wd:Q5 .
    ?item p:P106 ?statement_occupation_biologist.
    ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503.
    # gets time out, could be cut into parts with LIMIT and OFFSET perhaps

- Occupation in biology gets 315 hits (17.5.2023), many of them "horse breeder" or "racehorse owner", which is not helpful.

The queries `occupation_biologist` and `fieldofwork_biology` are perhaps too broad (and also time out), so we try to narrow them down to botanists.
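
The idea noted in the query comment, cutting a slow query into parts with LIMIT and OFFSET, could look roughly like the sketch below. `paged_queries` is a hypothetical helper, not part of this notebook. Whether paging actually avoids the timeout is untested here. The `ORDER BY` clause is an assumption added to keep the pages stable between requests, and it may itself cost query time.

```python
def paged_queries(base_query, page_size, n_pages):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page."""
    for page in range(n_pages):
        yield "{}\nLIMIT {} OFFSET {}".format(
            base_query.rstrip(), page_size, page * page_size
        )

# the biologist pattern from above, wrapped in a complete SELECT (ORDER BY is an assumption)
base_query = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item p:P106 ?statement_occupation_biologist .
  ?statement_occupation_biologist (ps:P106/(wdt:P279*)) wd:Q864503 .
}
ORDER BY ?item
"""

pages = list(paged_queries(base_query, page_size=50000, n_pages=3))
for page_query in pages:
    # show only the appended clause of each page
    print(page_query.splitlines()[-1])
```

Each generated string could then be passed to `query_wikidata` in turn, stopping as soon as a page returns fewer rows than `page_size`.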

### Get Botanists

Querying biologists returns too many names, which makes the result unsuitable for comparing botanists. We therefore limit the query to the narrowest suitable occupation, i.e. the occupational title botanist. The query requests something like:

- occupation (P106)/subclass of (P279) = botanist (Q2374149) or
- field of work (P101)/subclass of (P279) = botanist (Q2374149)


In [3]:
queries = {}

queries['occupation_botanist'] = """
SELECT DISTINCT 
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr 
  ?yob ?yod
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id 
  WHERE {
    ?item wdt:P31 wd:Q5 ;
        p:P106 ?statement_occupation_botanist.
    # ?statement_occupation_botanist (ps:P106) wd:Q2374149.
    ?statement_occupation_botanist (ps:P106/(wdt:P279*)) wd:Q2374149.
    OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de", "ru") ) }
    OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de", "ru" ) ) 
              BIND( lang(?altLabel)  as ?altLabel_lang )
    }
    OPTIONAL { ?item wdt:P496  ?orcid . }
    OPTIONAL { ?item wdt:P214  ?viaf . }
    OPTIONAL { ?item wdt:P213  ?isni . }
    OPTIONAL { ?item wdt:P6264 ?harv . }
    OPTIONAL { ?item wdt:P586  ?ipni . }
    OPTIONAL { ?item wdt:P428  ?abbr . }
    OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
    OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
    OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
  }    
  LIMIT 500000 # it seems faster to limit it just above the real number of total results
"""

### Bionomia ID (P6944)

queries['bionomia_id'] = """
SELECT DISTINCT 
  ?item ?itemLabel ?altLabel ?altLabel_lang ?abbr 
  ?yob ?yod
  ?orcid ?viaf ?isni ?harv ?ipni ?bionomia_id 
  WHERE {
    ?item wdt:P31 wd:Q5 . # Q5 human
    ?item wdt:P6944 ?id_of_bionomia_id.
    OPTIONAL { ?item rdfs:label ?itemLabel . FILTER (lang(?itemLabel) IN("en", "de", "ru") ) }
    OPTIONAL { ?item skos:altLabel ?altLabel . FILTER (lang(?altLabel) IN("en", "de", "ru") ) 
              BIND( lang(?altLabel)  as ?altLabel_lang )
    }
    OPTIONAL { ?item wdt:P496  ?orcid . }
    OPTIONAL { ?item wdt:P214  ?viaf . }
    OPTIONAL { ?item wdt:P213  ?isni . }
    OPTIONAL { ?item wdt:P6264 ?harv . }
    OPTIONAL { ?item wdt:P586  ?ipni . }
    OPTIONAL { ?item wdt:P428  ?abbr . }
    OPTIONAL { ?item wdt:P6944 ?bionomia_id . }
    OPTIONAL { ?item wdt:P569  ?dob . BIND(YEAR(?dob) as ?yob) }
    OPTIONAL { ?item wdt:P570  ?dod . BIND(YEAR(?dod) as ?yod) }
  }    
  LIMIT 200000 # it seems faster to limit it just above the real number of total results
"""

## Query Data and Create the Data Frame

Run all the SPARQL requests, concatenate the results and drop duplicates.
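
On toy data, the concatenate-and-deduplicate step, including the fallback from `altLabel` to `itemLabel` used further down, can be sketched as follows. The two frames and their `Q_…` values are made up for illustration. `ignore_index=True` is an assumption of this sketch to keep the row index unique.

```python
import pandas as pd

# Two toy result frames standing in for the per-query results (made-up values)
frame_a = pd.DataFrame({
    "item": ["Q_A", "Q_A", "Q_B"],
    "itemLabel": ["Anna Example", "Anna Example", "Ben Example"],
    "altLabel": ["A. Example", "Example", None],
})
frame_b = pd.DataFrame({
    "item": ["Q_A", "Q_C"],
    "itemLabel": ["Anna Example", "Cara Example"],
    "altLabel": ["A. Example", None],
})

combined = pd.concat([frame_a, frame_b], ignore_index=True)

# use the alias where there is one, otherwise fall back to the item label
combined["itemMatchingLabel"] = combined["altLabel"].fillna(combined["itemLabel"])

# the same person/label pair returned by both queries is kept only once
deduplicated = combined.drop_duplicates(subset=["item", "itemMatchingLabel"])
print(deduplicated[["item", "itemMatchingLabel"]])
```

Because every query returns the same set of columns, a row found by both queries is identical in both frames and collapses into one.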

In [4]:
# Run the queries and create a list of data frames
frames = []
for key, query in queries.items():
    print(key + ': get data …')
    dfi = query_wikidata(query)
    print(key + ': ' + str(len(dfi.index)) + ' records')
    frames.append(dfi)

# Concatenate the dataframes from each SPARQL request
df_temp = pd.concat(frames)

occupation_botanist: get data …
occupation_botanist: 487295 records
bionomia_id: get data …
bionomia_id: 160786 records


In [5]:
if explain_and_show_the_data:
    print("compose itemMatchingLabel = altLabel + itemLabel as name source to match against (itemLabel could serve as skos:prefLabel)")
df_temp['itemMatchingLabel'] = df_temp["altLabel"]
df_temp['itemMatchingLabel'] = df_temp['itemMatchingLabel'].fillna(df_temp["itemLabel"])

col = df_temp.pop("itemMatchingLabel") # place it after itemLabel
df_temp.insert(df_temp.columns.get_loc('itemLabel') + 1, col.name, col)
df_temp.sort_values(by=['item', 'itemLabel', 'itemMatchingLabel'], inplace=True)


cols_drop_duplicates = ['item', 'itemMatchingLabel']
print("drop duplicates for: {}".format(", ".join(cols_drop_duplicates)))
df_temp = df_temp.drop_duplicates(subset=cols_drop_duplicates)

if explain_and_show_the_data:
    print("show data where altLabel was empty (NaN)")
    display(df_temp[df_temp['altLabel'].isnull()].head(10))
    print("show data where altLabel has values")
    display(df_temp[df_temp['altLabel'].notnull()].head(10))

compose itemMatchingLabel = altLabel + itemLabel as name source to match against (itemLabel could serve as skos:prefLabel)
drop duplicates for: item, itemMatchingLabel
show data where altLabel was empty (NaN)


Unnamed: 0,item,itemLabel,itemMatchingLabel,altLabel,altLabel_lang,abbr,yob,yod,orcid,viaf,isni,harv,ipni,bionomia_id
146941,http://www.wikidata.org/entity/Q100149196,Russell Cox,Russell Cox,,,,,,0000-0001-5149-1709,,,,,0000-0001-5149-1709
476562,http://www.wikidata.org/entity/Q100152199,Zhiyong Li,Zhiyong Li,,,,,,,,,,,
208674,http://www.wikidata.org/entity/Q100152296,Alda Pereira da Fonseca,Alda Pereira da Fonseca,,,,1882.0,,,,,,,
487212,http://www.wikidata.org/entity/Q100154933,Thomas R. Sinclair,Thomas R. Sinclair,,,,1944.0,,,48124632.0,120371034.0,,,
476544,http://www.wikidata.org/entity/Q100250695,Takashi Nakada,Takashi Nakada,,,Nakada,1980.0,,,,,,20049410-1,
484950,http://www.wikidata.org/entity/Q100250861,Kiplagat Kotut,Kiplagat Kotut,,,,,,,,,,,
483369,http://www.wikidata.org/entity/Q100250912,Huiyin Song,Huiyin Song,,,,,,,,,,,
483365,http://www.wikidata.org/entity/Q100250945,Yuxin Hu,Yuxin Hu,,,,,,,,,,,
481388,http://www.wikidata.org/entity/Q100277010,Qinghua Wang,Qinghua Wang,,,,,,,,,,,
476687,http://www.wikidata.org/entity/Q100277011,Jijian Long,Jijian Long,,,,,,,,,,,


show data where altLabel has values


Unnamed: 0,item,itemLabel,itemMatchingLabel,altLabel,altLabel_lang,abbr,yob,yod,orcid,viaf,isni,harv,ipni,bionomia_id
446170,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Eggens,Eggens,de,Eggens,,,,,,,20045232-1,
58323,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs A. H.,Mrs A. H.,en,,1792.0,1834.0,,,,,,Q100146795
58324,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold Harrison,Mrs Arnold Harrison,en,,1792.0,1834.0,,,,,,Q100146795
113915,http://www.wikidata.org/entity/Q100156193,Laurence Henry Millener,L. H. Millener,L. H. Millener,en,,1914.0,2000.0,,,,,,Q100156193
113916,http://www.wikidata.org/entity/Q100156193,Laurence Henry Millener,Laurie Henry Millener,Laurie Henry Millener,en,,1914.0,2000.0,,,,,,Q100156193
113917,http://www.wikidata.org/entity/Q100156193,Laurence Henry Millener,Laurie Millener,Laurie Millener,en,,1914.0,2000.0,,,,,,Q100156193
114007,http://www.wikidata.org/entity/Q100156252,Thomas Leonard Lancaster,T. L. Lancaster,T. L. Lancaster,en,,1888.0,1945.0,,,,,,Q100156252
429836,http://www.wikidata.org/entity/Q100156269,Mike Quinn,Michael A. Quinn,Michael A. Quinn,en,,1962.0,,,,,,,
429875,http://www.wikidata.org/entity/Q100156269,Mike Quinn,Michael Andrew Quinn,Michael Andrew Quinn,de,,1962.0,,,,,,,
429890,http://www.wikidata.org/entity/Q100156269,Mike Quinn,Quinn,Quinn,de,,1962.0,,,,,,,


In [6]:
if explain_and_show_the_data: print("concatenate itemLabel also to itemMatchingLabel")
    
df_itemLabels=df_temp.drop_duplicates(subset=["itemLabel"]).copy()
df_itemLabels['itemMatchingLabel'] = df_itemLabels['itemLabel']

df_matching = pd.concat([df_temp, df_itemLabels], ignore_index=True).drop_duplicates(subset=["item", "itemMatchingLabel"]).sort_values(by=['item', 'itemLabel', 'itemMatchingLabel']).reset_index(drop=True)

del df_itemLabels # the temporary frame is no longer needed

if explain_and_show_the_data: display(df_matching)

concatenate itemLabel also to itemMatchingLabel


Unnamed: 0,item,itemLabel,itemMatchingLabel,altLabel,altLabel_lang,abbr,yob,yod,orcid,viaf,isni,harv,ipni,bionomia_id
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Eggens,Eggens,de,Eggens,,,,,,,20045232-1,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida Eggens,Eggens,de,Eggens,,,,,,,20045232-1,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth Harrison,Mrs A. H.,en,,1792,1834,,,,,,Q100146795
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs A. H.,Mrs A. H.,en,,1792,1834,,,,,,Q100146795
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold Harrison,Mrs Arnold Harrison,en,,1792,1834,,,,,,Q100146795
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187328,http://www.wikidata.org/entity/Q99982783,Dorothy F. Chappell,Dorothy F. Chappell,,,,,,,,,,,
187329,http://www.wikidata.org/entity/Q99982881,Alfredo Faz,Alfredo Faz,,,,1863,1931,,,,,,Q99982881
187330,http://www.wikidata.org/entity/Q99983112,Atsumi Himizu,Atsumi Himizu,,,,,,,,,107222,,
187331,http://www.wikidata.org/entity/Q99983228,Wendy K. Bellows,Wendy K. Bellows,,,,,,,,,,,


Add columns `surname`, `initials` and `canonical_string`, `canonical_string_fullname` …

Name splitting with https://libraries.io/rubygems/dwc_agent removes all content in parentheses by design, so we cannot apply it to standardize (clean or parse) names such as the following:

- itemLabel: `William J. Bell (entomologist)`
- itemLabel: `Cecil Stevenson Garnett; d.1950`
- itemLabel: `William Vernon (c. 1666-1711)`
- itemLabel: `Hildur von Rettig (Lindberg)`
- itemLabel: `(Johan) Fredrik(Friedrich) (Eberhard) Svanlund`
- itemLabel: `[M.] O.K. Poon`
- itemLabel: `(J.A.A.)M.(H.) Goossens-Fontana`
- itemLabel: `Thomas Platter the Younger`

… `dwcagent` would parse it like this (see David Shorthouse's comment in [issue #18 (github.com/bionomia/dwc_agent/)](https://github.com/bionomia/dwc_agent/issues/18#issuecomment-1810976221)):

```bash
# https://www.wikidata.org/wiki/Q21610079
dwcagent '(Johan) Fredrik(Friedrich) (Eberhard) Svanlund' | jq '.'
```
gives
```json
[
  {
    "family": "Svanlund",
    "given": "Johan Fredrik",
    "suffix": null,
    "particle": null,
    "dropping_particle": null,
    "nick": null,
    "appellation": null,
    "title": null
  }
]
```

In [7]:
# add some methods to abbreviate a name word and to clean messy name input
#
# The idea is to get the family name at the end of the string, and to swap a reversed name spelling
# ("family, given name" and so on), so that the given name comes first and the family or clan name
# comes near the end (as most names are given like this in Wikidata)
# TODO deal with “Mrs”, e.g. Mrs Arnold Harrison

def abbr_word(word):
    """
    Return the abbreviation of a word, e.g: Anton → A. or (Lisa) → (L.)

    @param word: a word without spaces
    @return: str an abbreviated word
    """
    if len(word) > 0:
        word = word.strip()
        names_regex_substitution = {
            # generic simple names: Antonio OR (Antonius)
            r"^([^\w\s]*)(\w)\w*([^\w\s\.]*)$": r"\1\2.\3",
            # names having minus: Charles-Jeunet OR (Carl-Jeanet) → CJ. OR (CJ.)
            r"^([^\w\s]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)$": r"\1\2\3.\4",
            # names having apostrophe: (Ch'An)
            r"^([^\w\s]*)(\w)\w*['’’´`](\w)\w*([^\w\s\.]*)$": r"\1\2.\4",
            # names having comma: (Eugen,Eugène)
            r"^([^\w\s]*)(\w)\w*[,](\w)\w*([^\w\s\.]*)$": r"\1\2.,\3.\4",
            # names having immediate parentheses: Wilhelmus(Wim)
            r"^(\w)\w*(\()(\w)\w*(\))$": r"\1.\2\3.\4",
            # names having immediate parentheses + comma: Ion(Ioan,Joan)
            r"^(\w)\w*(\()(\w)\w*,(\w)\w*(\))$": r"\1.\2\3.,\4.\5",
            # names having immediate parentheses + an apostrophe: "Chan(Ch'An)"
            r"^(\w)\w*(\()(\w)\w*['’’´`](\w)\w*(\))$": r"\1.\2\3.\5",
            # names having immediate parentheses + minus in the name: 'Ken-Ichiro(Ken-Itirô)'
            r"^([^\w\s]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)(\w)\w*[-–—](\w)\w*([^\w\s\.]*)$": r"\1\2\3.\4\5\6.\7",
        }

        for k_search_pattern, v_replace_pattern in names_regex_substitution.items():
            if re.match(k_search_pattern, word, re.IGNORECASE):
                return re.sub(k_search_pattern, v_replace_pattern, word)
        return word
    else:
        return word

def clean_person_name(this_full_name):
    """
    Clean person name from WikiData to get `given + … + family` and not in reverse order `family, given + … `
    
    TODO how to deal with comma and: …, Jr., II., the younger aso. ? (should be considered when splitting on family name)
    
    @requires re
    @param this_full_name: str the name to be cleaned
    @return: str the processed name
    """
    parentheses_words_at_last = [
        'botaniker', 'botanist', 'botanist-1', 'bot.',
        'diatomist',
        'entomologist',
        'instruisto',
        'lehrer',
        'mycologist',
        'taxonomist', 'teacher',
        'zoologist'
    ]
    regex_paranthesis_words = re.compile(r' +\((' + '|'.join(parentheses_words_at_last) + r')\) *$', flags=re.IGNORECASE)

    # delete life time e.g. "… (c. 1534)" or "… (1748-1801)"
    this_full_name = re.sub(r" +\([c. ]*\d+[-–—]*\d+\) *", r"", this_full_name)
    # delete occupations in parentheses
    this_full_name = regex_paranthesis_words.sub(r"", this_full_name)
    # delete noble designations, e.g. “Sir James Nasmyth, 2nd Baronet” → “Sir James Nasmyth”
    this_full_name = re.sub(r" *, +[2]*(1st|2nd|3rd|[4-9]th|[1][0-9]th)[^,]+$", r"", this_full_name)
    # reverse cyrillic simple names: Штейп, Владимир Владимирович → Владимир Владимирович Штейп
    this_full_name = re.sub(r"^([\u0400-\u04FF-]+),\s+([\u0400-\u04FF- ]+)$", r"\2 \1", this_full_name)
    # offset when comma for “, the younger \w+” aso., e.g.: “Peter Joseph, the younger Lenné” → “Peter Joseph Lenné, the younger”
    this_full_name = re.sub(
        r"""
        , \s+ (
            [Tt]he \s+ [Yy]ounger 
            | [Dd]ie \s+ [Jj]üngere 
            | [Dd]er \s+ [Jj]üngere
            | [Dd]ie \s+ [Ää]ltere
            | [Dd]er \s+ [Ää]ltere
            ) (\s+ \w+[-]?\w+)
        $
        """, 
        r"\2, \1", this_full_name, flags=re.VERBOSE
    )
        
    return this_full_name

def simple_namereverse2family_last(this_name_reversed):
    """
    Get the family name or clan name to the last position from names that seem obviously reversed
    
    Draft: it should be applied, after family name splitting was unsuccessful
    @param this_name_reversed:
    @return:
    """
    if len(this_name_reversed) > 0:
        this_name_reversed = this_name_reversed.strip()
        names_regex_substitution = {
            # Baudoin-Bodin, Jacqueline
            # Brennecke, Dorothea
            # Chi, Chün-tao
            # Cormack, R.G.H.
            # Gajón Sánchez, Carlos
            r"^(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*),\s(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*)$": r"\2 \1",
            # TODO William E., III, Fox → William E. Fox, III
            r"^(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*),\s([IVX]+),\s(\w+[-.\s]*\w*[-.\s]*\w*[-.\s]*)$": r"\1 \3, \2",
        }

        for k_search_pattern, v_replace_pattern in names_regex_substitution.items():
            if re.match(k_search_pattern, this_name_reversed, re.IGNORECASE):
                return re.sub(k_search_pattern, v_replace_pattern, this_name_reversed)
        return this_name_reversed
    else:
        return this_name_reversed

regex_split_on_family_name = re.compile(
    r"""
    (?<!,)
    \s+ (
     \w+[-]?\w+, \s+ [Bb]aron[in]*   \s+ [Vv]on \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Gg]r[aä]f[in]* \s+ [Vv]on \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Dd]uke         \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Dd]uchess      \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Cc]ountess     \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Mm]archioness  \s+ [Oo]f \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Dd]uchesse     \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Cc]omte        \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [Cc]omtesse     \s+ [Dd]e \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ later           \s+ \w+[-]?\w+
    |\w+[-]?\w+, \s+ [XVI]+[.]?
    |\w+[-]?\w+, \s+ Junior[.]?
    |\w+[-]?\w+, \s+ [JjS]r[.]?
    |\w+[-]?\w+,? \s+ [Tt]he \s+ [Yy]ounger
    |\w+[-]?\w+,? \s+ [Dd]ie \s+ [Jj]üngere
    |\w+[-]?\w+,? \s+ [Dd]er \s+ [Jj]üngere
    |\w+[-]?\w+,? \s+ [Dd]ie \s+ [Ää]ltere
    |\w+[-]?\w+,? \s+ [Dd]er \s+ [Ää]ltere
    |\w['’‘]\w+
    |\w+[-]?\w+
    |\w+\s?\([^()]+\)[.]?
    |\w\w*[.]?[-–—]\w+
    )
    $
    """,
    re.VERBOSE | re.MULTILINE
)
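
A much reduced sketch of the idea behind the patterns above: clean the name first, then split off the last word as the family name and abbreviate the rest. `LIFE_DATES`, `FAMILY_NAME_AT_END`, `split_given_and_family` and `initials_of` are simplified stand-ins introduced here for illustration. They cover only plain names and none of the titles, suffixes or parentheses handled by the full rule set.

```python
import re

# Simplified stand-ins for the notebook's patterns (assumptions, not the full rule set)
LIFE_DATES = re.compile(r" +\([c. ]*\d+[-–]*\d+\) *")   # e.g. " (c. 1666-1711)"
FAMILY_NAME_AT_END = re.compile(r"\s+(\w+[-]?\w+)$")    # last word, optionally hyphenated

def split_given_and_family(full_name):
    """Return (given names, family name); given names are '' for single-word names."""
    cleaned = LIFE_DATES.sub("", full_name).strip()
    match = FAMILY_NAME_AT_END.search(cleaned)
    if match is None:
        return "", cleaned
    return cleaned[:match.start()], match.group(1)

def initials_of(given_names):
    """Abbreviate each plain given-name word to its first letter plus a period."""
    return " ".join(word[0] + "." for word in given_names.split())

for name in ["William Vernon (c. 1666-1711)", "Frida Eggens", "Eggens"]:
    given, family = split_given_and_family(name)
    # canonical form: initials followed by the family name
    print((initials_of(given) + " " + family).strip())
```

For a single-word label such as `Eggens` there is nothing to abbreviate, so the canonical form is the word itself.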

In [8]:
surname = []
initials = []
canonical = []
canonical_fullname = []

# TODO: use dwcagent on names? Unfortunately it removes all parentheses content

for i, item in df_matching.iterrows():
    thisItemLabelModified = '{}'.format(item['itemMatchingLabel']).strip()
    thisItemLabelModified = clean_person_name(thisItemLabelModified)
    # split at last name part, optionally having parentheses TODO how to deal with Jr., II. aso. 
    # print(thisItemLabelModified)
    nameparts = regex_split_on_family_name.split(thisItemLabelModified)
    if len(nameparts) == 1 and re.findall(r",", thisItemLabelModified):
        # re-iterate through probably reversed names
        thisItemLabelModified = simple_namereverse2family_last(thisItemLabelModified)
        nameparts = regex_split_on_family_name.split(thisItemLabelModified)
    # remove empty/None: ['Friedrich August Marschall von', 'Bieberstein', None, '']
    nameparts = [string for string in nameparts if string]
    if len(nameparts)==1:
        initials.append("") # no beginning name initials per se, only family name
        surname.append("")
        canonical.append(" ".join(nameparts))
        canonical_fullname.append(" ".join(nameparts))
    else:
        surname.append(nameparts[0])
        first_nameparts = re.split('[ ]', nameparts[0])
        first_nameparts = [string for string in first_nameparts if string]
        # print(first_nameparts)
        initials.append(" ".join([abbr_word(w) for w in first_nameparts if len(w) > 0]) )
        canonical.append(" ".join([abbr_word(w) for w in first_nameparts if len(w) > 0]) + " " + nameparts[1])
        # canonical.append(words[-1] + ', ' + ".".join([w[0] for w in words[0:-1] if len(w) > 0]) + '.')
        canonical_fullname.append(" ".join(nameparts))
    
df_matching['surname'] = surname
df_matching['initials'] = initials
df_matching['canonical_string'] = canonical
df_matching['canonical_string_fullname'] = canonical_fullname # should equal itemMatchingLabel

if explain_and_show_the_data: display(df_matching.head())

Unnamed: 0,item,itemLabel,itemMatchingLabel,altLabel,altLabel_lang,abbr,yob,yod,orcid,viaf,isni,harv,ipni,bionomia_id,surname,initials,canonical_string,canonical_string_fullname
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Eggens,Eggens,de,Eggens,,,,,,,20045232-1,,,,Eggens,Eggens
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida Eggens,Eggens,de,Eggens,,,,,,,20045232-1,,Frida,F.,F. Eggens,Frida Eggens
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth Harrison,Mrs A. H.,en,,1792.0,1834.0,,,,,,Q100146795,Elizabeth,E.,E. Harrison,Elizabeth Harrison
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs A. H.,Mrs A. H.,en,,1792.0,1834.0,,,,,,Q100146795,,,Mrs A. H.,Mrs A. H.
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold Harrison,Mrs Arnold Harrison,en,,1792.0,1834.0,,,,,,Q100146795,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison


In [9]:
# narrow the needed columns down
df_matching = df_matching[[
    'item', 'itemLabel', 
    'surname', 'initials', 'canonical_string', 'canonical_string_fullname',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id',
    'yob', 'yod'
    #, 'wyb', 'wye'
]]
df_matching.reset_index(drop=True, inplace=True)

if explain_and_show_the_data: display(df_matching.head())

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,,,Eggens,Eggens,,,,,20045232-1,Eggens,,,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida,F.,F. Eggens,Frida Eggens,,,,,20045232-1,Eggens,,,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth,E.,E. Harrison,Elizabeth Harrison,,,,,,,Q100146795,1792.0,1834.0
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,,,Mrs A. H.,Mrs A. H.,,,,,,,Q100146795,1792.0,1834.0
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison,,,,,,,Q100146795,1792.0,1834.0


In [10]:
# add custom columns
# df.loc[(df['column_of_interest'] … condition), 'fill_to_column'] = value 
# display(df_matching)

In [11]:

df_matching["wikidata_link"] = df_matching['item'].apply(lambda thiscol: thiscol.replace('entity', 'wiki')) # needed?
df_matching["orcid_link"] = 'https://orcid.org/' + df_matching['orcid']
    # df_matching.assign(orcid_link=lambda thisdf: 'https://orcid.org/' + str(thisdf['orcid']))
df_matching["harv_link"] = 'https://kiki.huh.harvard.edu/databases/botanist_search.php?mode=details&id=' + df_matching['harv']
    # df_matching.assign(harv_link = lambda thisdf: 'https://kiki.huh.harvard.edu/databases/botanist_search.php?mode=details&id=' + str(thisdf['harv']))
df_matching["ipni_link"] = 'https://www.ipni.org/a/' + df_matching['ipni']
    # df_matching.assign(ipni_link = lambda thisdf: 'https://www.ipni.org/a/' + str(thisdf['ipni']))
df_matching["bionomia_link"] = 'https://bionomia.net/' + df_matching['bionomia_id']
    # df_matching.assign(bionomia_link = lambda thisdf: 'https://bionomia.net/' + str(thisdf['bionomia_id']))

if explain_and_show_the_data: display(df_matching.head())

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q100142069,Frida Eggens,,,Eggens,Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
1,http://www.wikidata.org/entity/Q100142069,Frida Eggens,Frida,F.,F. Eggens,Frida Eggens,,,,,20045232-1,Eggens,,,,http://www.wikidata.org/wiki/Q100142069,,,https://www.ipni.org/a/20045232-1,
2,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Elizabeth,E.,E. Harrison,Elizabeth Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
3,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,,,Mrs A. H.,Mrs A. H.,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
4,http://www.wikidata.org/entity/Q100146795,Elizabeth Harrison,Mrs Arnold,M. A.,M. A. Harrison,Mrs Arnold Harrison,,,,,,,Q100146795,1792.0,1834.0,http://www.wikidata.org/wiki/Q100146795,,,,https://bionomia.net/Q100146795
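
The empty link cells in the output above follow from how pandas adds a string to a column: rows with a missing identifier stay missing and do not turn into a bare URL prefix. A minimal sketch with a two-row toy frame (the second row has no IPNI ID):

```python
import pandas as pd

# one row with an IPNI ID, one row without
ids = pd.DataFrame({"ipni": ["20045232-1", None]})

# prefix + column: the missing value propagates, so no dangling link is created
ids["ipni_link"] = "https://www.ipni.org/a/" + ids["ipni"]
print(ids)
```

This is why the link columns can be built without an explicit check for missing identifiers.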


## Save Data

In [12]:
import os
from datetime import datetime # only needed if the timestamp is derived from today's date

# write the data frame as CSV; the file name carries the date stamp
os.makedirs('data', exist_ok=True)

this_output_file=os.path.join(
    "data", 'wikidata_persons_botanists_{timestamp}.csv'.format(
    timestamp=this_timestamp_for_data
    )
)

df_matching.to_csv(this_output_file)

print("Wrote data frame into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # size >> 10 divides by 2**10 = 1024, i.e. bytes to kB
)

Wrote data frame into data/wikidata_persons_botanists_20240312.csv


## Documentation

Explanation of columns:

Column | Description
-|-
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
itemMatchingLabel | the source label from which surname, initials and so on are built and which is matched against later on; it combines itemLabel and altLabel
altLabel | the name aliases, i.e. skos:altLabel in Wikidata
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
canonical_string_fullname | Canonical name string including full (given) name; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work period start year ([P2031](https://www.wikidata.org/wiki/Property:P2031)); currently not queried
wye | Work period end year ([P2032](https://www.wikidata.org/wiki/Property:P2032)); currently not queried
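
When the CSV is read back, numeric-looking identifiers and years come out as floats (compare values such as `48124632.0` or `1792.0` in the outputs above) as soon as a column contains empty cells. A minimal sketch of one way to avoid this, assuming explicit column types are acceptable for the later matching step; the CSV text is a made-up two-row sample and `Q_EXAMPLE_…` are placeholders:

```python
import io
import pandas as pd

# made-up sample in the shape of the saved file (only three columns)
csv_text = "item,viaf,yob\nQ_EXAMPLE_1,48124632,1944\nQ_EXAMPLE_2,,\n"

# default type inference: empty cells force float64 for viaf and yob
default_types = pd.read_csv(io.StringIO(csv_text))

# explicit types: identifiers stay text, years become nullable integers
explicit_types = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"viaf": "string", "yob": "Int64"},
)

print(default_types.dtypes)
print(explicit_types)
```

The same `dtype` mapping would be extended to the other identifier columns (`isni`, `harv`, and so on) for the real file.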