# Match BGBM Collectors to Wikidata Items

We match the `canonical_string` of each Wikidata item against the `canonical_string` of each collector. The collector names were parsed into single names beforehand using <https://libraries.io/rubygems/dwc_agent>.

TODO:

- evaluate cases where multiple names (Wikidata or collector data) are found
- also match the work period (Wikidata) against the creation date of the herbarium sheet (if no other lifetime data are available)
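The second TODO item could be approached with a simple plausibility check. The following is only a sketch under assumptions: the function name, the `collection_year` argument and the `min_age` parameter are hypothetical, and missing values are passed as `None` (NaN values from pandas would need converting first). It prefers the work period (`wyb`, `wye`) and falls back to the lifetime (`yob`, `yod`).

```python
from typing import Optional


def is_plausible_period(collection_year: int,
                        wyb: Optional[float], wye: Optional[float],
                        yob: Optional[float], yod: Optional[float],
                        min_age: int = 10) -> Optional[bool]:
    """Return True/False if the year can be checked, None if no dates exist."""
    if wyb is not None or wye is not None:
        # prefer the documented work period
        start, end = wyb, wye
    elif yob is not None or yod is not None:
        # fall back to the lifetime; assumption: nobody collects before `min_age`
        start = yob + min_age if yob is not None else None
        end = yod
    else:
        return None
    if start is not None and collection_year < start:
        return False
    if end is not None and collection_year > end:
        return False
    return True


# Hans Hermann Behr (yob 1818, yod 1904, no work period in the Wikidata sample)
print(is_plausible_period(1850, None, None, 1818.0, 1904.0))  # True
print(is_plausible_period(1950, None, None, 1818.0, 1904.0))  # False
print(is_plausible_period(1850, None, None, None, None))      # None
```

A `None` result leaves the decision open, so that records without any dates are not rejected by accident.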

### Load Wikidata Data Set

[Jupyter Notebook for creating the botanist Wikidata data set](./create_wikidata_datasets_botanists.ipynb) (TODO: improve query properties) 

Out of the Wikidata items data set we create a data frame with unique canonical name strings and their counts.

In [1]:
import pandas as pd
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.",,43340073,0000 0001 1630 5464,1373.0,6129-1,M.Bieb.,Q66612,1768.0,1826.0,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.",,20328622,0000 0001 1604 8680,42741.0,619-1,Behr,Q66934,1818.0,1904.0,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.",,47016953,0000 0000 8343 3899,1101.0,12818-1,Schaeff.,,1718.0,1790.0,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.",,20426762,0000 0001 1749 2732,135.0,4855-1,Klotzsch,Q67003,1805.0,1860.0,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.",,59847236,0000 0001 1653 0899,73782.0,23266-1,Menge,,1808.0,1880.0,,


In [2]:
# create the test data set from the Wikidata data
# group by canonical name/string, count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
# TODO AP: meaning of wd_matchtest + count for merge later on? 

wd_matchtest.tail()

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
61296,"Șerbanescu, I.",1
61297,"Ștefureac, T.",1
61298,"Țopa, E.",1
61299,"Ḥalwaǧī, R.",1
61300,"Ḳushnir, Ṭ.",1
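The two header rows in the output above come from `agg({'item': ['count']})`, which produces two-level column names. A possible alternative with flat column names is sketched here on toy data; the column name `item_count` and the second Q-number are made up for illustration. Note that `size()` counts rows per group, whereas `count` only counts non-missing `item` values.

```python
import pandas as pd

# toy stand-in for the Wikidata table (Q00000000 is a made-up placeholder)
toy = pd.DataFrame({
    "item": ["Q15631729", "Q00000000", "Q1618426"],
    "canonical_string": ["Tamura, M.", "Tamura, M.", "Scholz, H."],
})

# size() counts the rows per group; reset_index(name=...) gives flat column names
counts = toy.groupby("canonical_string").size().reset_index(name="item_count")
print(counts)
```

With flat column names, a later merge on `canonical_string` does not have to deal with a column MultiIndex.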


### Load Collectors Data Set

Data sources:

- option 1: Jupyter Notebook for `create_bgbm_botanypilot_collectors_dataset.ipynb` from SPARQL (not in this official documentation yet)
- option 2: Jupyter Notebook for [`create_bgbm_gbif-occurrence_collectors_dataset.ipynb`](./create_bgbm_gbif-occurrence_collectors_dataset.ipynb)

Then parse the collector names into single, separate names with `dwc_agent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- use ruby script `./bin/agent_parse4tsv.rb` for parsing text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`

TODO:
- check parsed fields `particle` and other fields, e.g. «`Abbas al Ani, H.`»

    ```bash
    cd data/VHde_0195853-230224095556074_BGBM/
    head occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv \
      | column --table --separator $'\t' \
      | sed 's@^@  # @;'
      # family     given  suffix  particle  dropping_particle  nick  appellation  title  occurrenceID_count  occurrenceID_first
      #            No                                                                    1                   http://id.snsb.info/snsb/collection/108286/167064/109352
      # Azofeifa   A.                                                                    2                   https://herbarium.bgbm.org/object/B200211416
      # A. Cano    E.                                                                    1                   https://herbarium.bgbm.org/object/B100699397
      # Henry      A.                                                                    1                   https://herbarium.bgbm.org/object/B200098813
      # Selmons    Ad                                                                    1                   https://herbarium.bgbm.org/object/B100379213
      # Aaronsohn  A.                                                                    3                   https://je.jacq.org/JE00010154
    ```


In [26]:
# unique names parsed already by ruby gem package: dwcagent

# collectors = pd.read_csv("data/bgbm_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv", sep="\t")

collectors.dropna(subset=['family'], inplace=True) # remove where family was NA, e.g. from originally «??» aso.
collectors



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
1,Azofeifa,A.,,,,,,,2,https://herbarium.bgbm.org/object/B200211416
2,A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397
3,Henry,A.,,,,,,,1,https://herbarium.bgbm.org/object/B200098813
4,Selmons,Ad,,,,,,,1,https://herbarium.bgbm.org/object/B100379213
5,Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154
...,...,...,...,...,...,...,...,...,...,...
66574,Żelazny,J.,,,,,,,4,https://herbarium.bgbm.org/object/B100344466
66575,Ždanova,O.,,,,,,,5,https://herbarium.bgbm.org/object/B100263330
66576,Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714


In [27]:
# select names with a particle, i.e. non-NA values (perhaps particle is the most important)
test_collectors = collectors.loc[collectors.particle.notna()]
print("names with particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with particle (534 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
6,Ani,H.,,Abbas al,,,,,1,http://id.snsb.info/snsb/collection/462713/563...
57,Khalek,Abel,,el,,,,,1,https://herbarium.bgbm.org/object/B100763849
367,Newton,F.X.O.,,Aguiar de,,,,,1,https://herbarium.bgbm.org/object/B100154587
410,Zanten,B.,,van,,,,,1,https://herbarium.bgbm.org/object/B300259443
476,Aichenhayn,Aichinger,,von,,,,,1,https://dr.jacq.org/DR073481


In [28]:
# select names with a suffix, i.e. non-NA values
test_collectors = collectors.loc[collectors.suffix.notna()]
print("names with suffix (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with suffix (15 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
801,Grear,J.W.,Jr.,,,,,,1,https://herbarium.bgbm.org/object/B100525791
4168,Pineda,J.F.,Jr.,,,,,,12,https://herbarium.bgbm.org/object/B100759134
4180,Pineda,J.F.,Jr.,,,,,,6,https://herbarium.bgbm.org/object/B100042115
6568,Toledo,F.,jr.,Tamandaré de,,,,,2,https://herbarium.bgbm.org/object/B200049849
17017,Forsyth,W.,jr.,,,,,,1,http://id.snsb.info/snsb/collection/504525/625...


In [29]:
# select names with a dropping_particle, i.e. non-NA values
test_collectors = collectors.loc[collectors.dropping_particle.notna()]
print("names with dropping_particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with dropping_particle (0 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first


Add the column `canonical_string_collector_parsed`, which is matched against the Wikidata names later on:

In [30]:
# "family, given", e.g. "Zanten, B." (NaN where the given name is missing)
family_given = collectors.family + ", " + collectors.given

collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
      # otherwise decide row by row: prepend the particle where one was parsed, e.g. "van Zanten, B."
      # TODO improve the combined name if any of the other dwc_parsed fields is not NaN
      other=(collectors.particle + " " + family_given).where(collectors.particle.notna(), family_given)
  )
)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first,canonical_string_collector_parsed
66574,Żelazny,J.,,,,,,,4,https://herbarium.bgbm.org/object/B100344466,"Żelazny, J."
66575,Ždanova,O.,,,,,,,5,https://herbarium.bgbm.org/object/B100263330,"Ždanova, O."
66576,Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590,"Žíla, V."
66577,Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714,"Волкова, Е."
66578,Жирова,O.,,,,,,,1,https://herbarium.bgbm.org/object/B100630811,"Жирова, O."


In [None]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Search

See https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

The ngrams function is used as an analyzer in the text search later.

In [51]:
wd_matchtest['canonical_string'].at[0]

'(-Walraevens), O.H.'

In [54]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

def ngrams(string, n=3):
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


print("Example from name:", ngrams('Klazenga, N.'))
print("Example from collectors:", ngrams(collectors["canonical_string_collector_parsed"].at[1])) 
print("Example from match-test:", ngrams(wd_matchtest['canonical_string'].at[1]))


Defaulting to user installation because normal site-packages is not writeable
Example from name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
Example from collectors: [' Az', 'Azo', 'zof', 'ofe', 'fei', 'eif', 'ifa', 'fa ', 'a A', ' A ']
Example from match-test: [' 18', '183', '835', '35 ', '5 1', ' 19', '190', '906', '06 ', '6 G', ' Ga', 'Gaf', 'afe', 'fe ']
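One side effect of the ASCII step in `ngrams` is worth keeping in mind: every non-ASCII character is dropped, so accented letters disappear and names in other scripts lose all their letters. The sketch below reproduces only that step with the standard library (it does not call `ftfy`) and assumes the name arrives in composed (NFC) form. `ascii_folded` is a hypothetical alternative, not part of this notebook.

```python
import unicodedata


def ascii_only(name: str) -> str:
    """Drop every non-ASCII character, like the encode/decode step in ngrams."""
    composed = unicodedata.normalize("NFC", name)  # assumption: composed input
    return composed.encode("ascii", errors="ignore").decode()


def ascii_folded(name: str) -> str:
    """Hypothetical alternative: split off accents first, then drop the rest."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", errors="ignore").decode()


for name in ["Žíla, V.", "Волкова, Е."]:
    print(repr(ascii_only(name)), repr(ascii_folded(name)))
# 'la, V.' 'Zila, V.'
# ', .' ', .'
```

Folding accents keeps more of a Latin-script name, but Cyrillic names still end up without letters, so they would need transliteration or a separate treatment.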


Vectorize the Wikidata names. Background: we use an information retrieval technique, Term Frequency-Inverse Document Frequency (TF-IDF; see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with Wikidata names. The distance measure calculated for each name match helps to match similar names and to tell apart names that are probably no match. See also https://scikit-learn.org and https://pypi.org/project/scikit-learn/.
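For orientation, the weighting that scikit-learn's `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`, which the call `TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)` in this notebook leaves unchanged) can be summarised as follows. Here a term $t$ is one n-gram, a document $d$ is one name, $\mathrm{tf}(t,d)$ is the number of times $t$ occurs in $d$, $n$ is the number of names and $\mathrm{df}(t)$ is the number of names that contain $t$. This is a summary of the library's documented defaults, not something specific to this data set.

$$
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\left(\ln\frac{1+n}{1+\mathrm{df}(t)} + 1\right),
\qquad
v_d = \frac{\mathrm{tfidf}(\cdot,d)}{\lVert \mathrm{tfidf}(\cdot,d) \rVert_2}
$$

Rare n-grams therefore get a higher weight than common ones, and every name that contains at least one known n-gram is represented by a vector $v_d$ of unit length.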

Convert a collection of raw documents to a matrix of TF-IDF features:

In [55]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

wikidata_names = wd_matchtest['canonical_string']

# vectorize wikidata names
print('Vectorizing data. This may take a while...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf_vector_data = vectorizer.fit_transform(wikidata_names)
print('Vectorizing completed: Created a matrix of TF-IDF features')


Vectorizing data. This may take a while...
Vectorizing completed: Created a matrix of TF-IDF features


Set up the function that performs the nearest neighbour matches...

In [56]:
from sklearn.neighbors import NearestNeighbors

nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step

# matching query
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices
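To get a feeling for the distances this setup returns, here is a small self-contained sketch on toy data. It assumes scikit-learn's defaults (unit-length TF-IDF vectors and Euclidean distance in `NearestNeighbors`). The `trigrams` function is a simplified stand-in for `ngrams`, and all variable names are chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def trigrams(name):
    # simplified stand-in for the ngrams analyzer of this notebook
    padded = " " + name.replace(",", " ").replace(".", "") + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


reference = ["Scholz, H.", "Tamura, M.", "Azofeifa-Bolaños, J.B."]
toy_vectorizer = TfidfVectorizer(analyzer=trigrams)
reference_tfidf = toy_vectorizer.fit_transform(reference)
toy_nbrs = NearestNeighbors(n_neighbors=1).fit(reference_tfidf)

# "Xyz" shares no trigram with the reference names
queries = ["Scholz, H.", "Azofeifa, A.", "Xyz"]
query_tfidf = toy_vectorizer.transform(queries)
distances, indices = toy_nbrs.kneighbors(query_tfidf)

cosines = []
for k, query in enumerate(queries):
    j = indices[k, 0]
    # dot product of the two sparse rows = cosine similarity for unit vectors
    cosine = float(query_tfidf[k].multiply(reference_tfidf[j]).sum())
    cosines.append(cosine)
    # for unit-length vectors: distance**2 == 2 - 2 * cosine
    # (this does not apply to the all-zero vector of "Xyz")
    print(f"{query!r} -> {reference[j]!r}  distance={distances[k, 0]:.2f}  cosine={cosine:.2f}")
```

An identical name gives a distance of 0. A query without any known n-gram becomes an all-zero vector and has distance 1.0 to every reference name, so a distance of about 1 or more carries no evidence of a match.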


### Perform the Matching

Perform the nearest neighbour (NN) matches on the (BGBM) collector names and create a data frame with matches... (can take 5 to 10 minutes)

In [57]:
# deduplicate the names; keep them in a list so that row i of the
# results corresponds to collectors_names[i]
collectors_names = list(set(collectors['canonical_string_collector_parsed'].values))

import time
start = time.time()
print('Getting nearest neighbours...')
distances, indices = getNearestN(collectors_names)
duration = time.time() - start
print('Completed in:', duration, 's')

print('Finding matches...')
matches = []
for i,j in enumerate(indices):
    temp = [collectors_names[i], wd_matchtest.values[j][0][0], round(distances[i][0],2)]
    matches.append(temp)

duration = time.time() - start
print('Building matches data frame:', duration, 's')  
matches = pd.DataFrame(
    matches, 
    columns=['namematch_collector','namematch_wikidata','namematch_distance']
)

duration = time.time() - start
print('Done:', duration, 's') 

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index()

matches.head()

Getting nearest neighbours...
Completed in: 123.63351941108704 s
Finding matches...
Building matches data frame: 195.70471167564392 s
Done: 195.71673774719238 s


Unnamed: 0,index,namematch_collector,namematch_wikidata,namematch_distance
0,16281,"Erdner, E.","Erdner, E.",0.0
1,4351,"Smith, A.C.","Smith, A.C.",0.0
2,17468,"Hicken, C.M.","Hicken, C.M.",0.0
3,4348,"Zhang, J.W.","Zhang, J.W.",0.0
4,8782,"Boeckeler, J.O.","Boeckeler, J.O.",0.0


### Create Output Results

Combine the matches data frame back to the (BGBM) collectors and Wikidata items ...

In [58]:
# join matches data frame back to source collectors  dataframe 
collectors_matches = pd.merge(
    collectors, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_collector'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first,canonical_string_collector_parsed,index,namematch_collector,namematch_wikidata,namematch_distance
0,Azofeifa,A.,,,,,,,2,https://herbarium.bgbm.org/object/B200211416,"Azofeifa, A.",7693,"Azofeifa, A.","Azofeifa-Bolaños, J.B.",0.74
1,Azofeifa,A.,,,,,,,3,https://herbarium.bgbm.org/object/B200211671,"Azofeifa, A.",7693,"Azofeifa, A.","Azofeifa-Bolaños, J.B.",0.74
2,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101143091,"Azofeifa, A.",7693,"Azofeifa, A.","Azofeifa-Bolaños, J.B.",0.74
3,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147462,"Azofeifa, A.",7693,"Azofeifa, A.","Azofeifa-Bolaños, J.B.",0.74
4,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147459,"Azofeifa, A.",7693,"Azofeifa, A.","Azofeifa-Bolaños, J.B.",0.74
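The same collector name occurs in several rows of `collectors`, while `matches` should hold every name only once, so this is a many-to-one join. The sketch below shows, on toy data, how `pd.merge` can check that expectation with its `validate` and `indicator` parameters. The toy frames and the counts in them are illustrative.

```python
import pandas as pd

collectors_toy = pd.DataFrame({
    "canonical_string_collector_parsed": ["Azofeifa, A.", "Azofeifa, A.", "Scholz, H."],
    "occurrenceID_count": [2, 3, 1],  # illustrative values
})
matches_toy = pd.DataFrame({
    "namematch_collector": ["Azofeifa, A.", "Scholz, H."],
    "namematch_wikidata": ["Azofeifa-Bolaños, J.B.", "Scholz, H."],
})

merged = pd.merge(
    collectors_toy, matches_toy,
    left_on="canonical_string_collector_parsed", right_on="namematch_collector",
    how="left",              # keep collectors even if no match row exists
    validate="many_to_one",  # raises MergeError if a name occurs twice in matches_toy
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)
print(merged[["canonical_string_collector_parsed", "namematch_wikidata", "_merge"]])
```

Rows marked `left_only` would be collectors that did not receive a match row, which makes silent row loss visible.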


Save the results...

In [59]:
from datetime import datetime
import os
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file_name='data/bgbm_collectors_matches_wikidata-botanists_%s.csv' % (
    # "20230531_1156"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

print("Write matches of collector names into", this_output_file_name)

collectors_matches.to_csv(this_output_file_name)

Write matches of collector names into data/bgbm_collectors_matches_wikidata-botanists_20230705.csv


### Aggregate Matched Data

Now aggregate the data for the cases where multiple Wikidata items or names are found for one canonical name string.

In [60]:
# link the matched names with the full Wikidata records (one row per Wikidata item)
# collectors_matches_g1 = pd.merge(collectors_matches, wd_matchtest, 
#                                  left_on='matched_name', right_on='canonical_string')
collectors_matches_g1 = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_wikidata', right_on='canonical_string'
)
# TODO check: after merging with `wikidata` the last column is `wye`, so this rename
# labels `wye` as `item_count`; the column does not hold a count of items
collectors_matches_g1.rename(columns = {list(collectors_matches_g1)[-1]: 'item_count'}, inplace=True)

# link wikidata items with canonical name string (pipe separated if more than one)
print('Aggregate WD item (multiple data items found) ...')
wikidata_uniq_items = wikidata.groupby(['canonical_string'])['item'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g2 = pd.merge(# now merge the unique Wikidata names into the collectors
    collectors_matches_g1, wikidata_uniq_items, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_itemcount', '__grp_by_item') 
      # append to left-data, right-data only when identical column names occur
)
collectors_matches_g2.rename(columns = {list(collectors_matches_g2)[-1]: 'items'}, inplace=True)

# link Wikidata item labels with canonical name string (pipe separated if more than one)
print('Aggregate WD itemLabel (multiple names found) ...')
wikidata_uniq_itemlabels = wikidata.groupby(['canonical_string'])['itemLabel'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g3 = pd.merge(
    collectors_matches_g2, wikidata_uniq_itemlabels, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_items', '__grp_by_itemlabel') 
      # append to left-data, right-data only when identical column names occur
)

collectors_matches_g3.rename(columns = {list(collectors_matches_g3)[-1]: 'item_labels'}, inplace=True)


Aggregate WD item (multiple data items found) ...
Done.
Aggregate WD itemLabel (multiple names found) ...
Done.
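The pipe-separated aggregation can be tried out on toy data. One caveat is that `'|'.join` raises a `TypeError` as soon as a group contains a missing value, so the sketch below drops missing values first. The helper `pipe_join` is a hypothetical name, and the missing label for "Scholz, H." is made up to show the effect.

```python
import pandas as pd

wikidata_toy = pd.DataFrame({
    "canonical_string": ["Tamura, M.", "Tamura, M.", "Scholz, H."],
    "itemLabel": ["Michio Tamura", "Miki Tamura", None],  # None is made up
})


def pipe_join(values: pd.Series) -> str:
    # drop missing values first; '|'.join fails on NaN/None
    return "|".join(values.dropna().astype(str))


labels = (
    wikidata_toy.groupby("canonical_string")["itemLabel"]
    .agg(pipe_join)
    .reset_index(name="item_labels")
)
print(labels)
```

Passing `name="item_labels"` to `reset_index` names the aggregated column directly, so no rename by column position is needed afterwards.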


Prepare data to save later on …

In [74]:
collectors_matches_group = collectors_matches_g3

print(list(collectors_matches_group.columns))
# from merge: _x means from left column, _y means from right column

# in BASH fold text long lines; echo "${text}" | fold --spaces | sed 's@^@# @'
# ['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 
# 'appellation', 'title', 'occurrenceID_count', 'occurrenceID_first', 
# 'canonical_string_collector_parsed', 'index', 'namematch_collector', 
# 'namematch_wikidata', 'namematch_distance', 'item__grp_by_itemcount', 
# 'itemLabel__grp_by_items', 'surname', 'initials', 
# 'canonical_string__grp_by_itemcount', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 
# 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'item_count', 
# 'canonical_string__grp_by_item', 'items', 'canonical_string', 'item_labels']

['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 'appellation', 'title', 'occurrenceID_count', 'occurrenceID_first', 'canonical_string_collector_parsed', 'index', 'namematch_collector', 'namematch_wikidata', 'namematch_distance', 'item__grp_by_itemcount', 'itemLabel__grp_by_items', 'surname', 'initials', 'canonical_string__grp_by_itemcount', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'item_count', 'canonical_string__grp_by_item', 'items', 'canonical_string', 'item_labels']


In [75]:
collectors_matches_group.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first,...,abbr,bionomia_id,yob,yod,wyb,item_count,canonical_string__grp_by_item,items,canonical_string,item_labels
0,Azofeifa,A.,,,,,,,2,https://herbarium.bgbm.org/object/B200211416,...,Azof.-Bolaños,,,,,,"Azofeifa-Bolaños, J.B.",http://www.wikidata.org/entity/Q36586259,"Azofeifa-Bolaños, J.B.",José B. Azofeifa-Bolaños
1,Azofeifa,A.,,,,,,,3,https://herbarium.bgbm.org/object/B200211671,...,Azof.-Bolaños,,,,,,"Azofeifa-Bolaños, J.B.",http://www.wikidata.org/entity/Q36586259,"Azofeifa-Bolaños, J.B.",José B. Azofeifa-Bolaños
2,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101143091,...,Azof.-Bolaños,,,,,,"Azofeifa-Bolaños, J.B.",http://www.wikidata.org/entity/Q36586259,"Azofeifa-Bolaños, J.B.",José B. Azofeifa-Bolaños
3,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147462,...,Azof.-Bolaños,,,,,,"Azofeifa-Bolaños, J.B.",http://www.wikidata.org/entity/Q36586259,"Azofeifa-Bolaños, J.B.",José B. Azofeifa-Bolaños
4,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147459,...,Azof.-Bolaños,,,,,,"Azofeifa-Bolaños, J.B.",http://www.wikidata.org/entity/Q36586259,"Azofeifa-Bolaños, J.B.",José B. Azofeifa-Bolaños


In [76]:
# Remove superfluous columns; .copy() creates an independent data frame, which avoids
# the pandas SettingWithCopyWarning for the in-place operations below
# TODO check duplicates
collectors_matches_group = collectors_matches_g3[
    ['family', 'given', 'canonical_string_collector_parsed', 
    'namematch_collector', 'namematch_wikidata', 'namematch_distance', 
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb',
    'items', 'canonical_string', 'item_labels']
].copy()
# collectors_matches_group = collectors_matches_g3
collectors_matches_group.sort_values(by=['namematch_distance'], inplace=True)
collectors_matches_group.drop_duplicates(inplace=True)
collectors_matches_group.head()


Unnamed: 0,family,given,canonical_string_collector_parsed,namematch_collector,namematch_wikidata,namematch_distance,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,items,canonical_string,item_labels
29411,Scholz,H.,"Scholz, H.","Scholz, H.","Scholz, H.",0.0,,69223857.0,0000 0001 0783 3193,38901.0,9229-1,H.Scholz,Q1618426,1928.0,2012.0,,http://www.wikidata.org/entity/Q1618426,"Scholz, H.",Hildemar Scholz
65931,Tamura,M.,"Tamura, M.","Tamura, M.","Tamura, M.",0.0,,6268997.0,0000 0000 8420 9944,31773.0,10411-1,Tamura,,1927.0,2007.0,,http://www.wikidata.org/entity/Q15631729|http:...,"Tamura, M.",Michio Tamura|Miki Tamura
33351,Gerard,J.,"Gerard, J.","Gerard, J.","Gerard, J.",0.0,,67535040.0,0000 0000 8147 6342,75610.0,13091-1,J.Gerard,,1545.0,1612.0,1560.0,http://www.wikidata.org/entity/Q1333338,"Gerard, J.",John Gerard
65930,Tamura,M.,"Tamura, M.","Tamura, M.","Tamura, M.",0.0,,,,,20010488-1,M.Tamura,,,,,http://www.wikidata.org/entity/Q15631729|http:...,"Tamura, M.",Michio Tamura|Miki Tamura
65928,Tammaro,F.,"Tammaro, F.","Tammaro, F.","Tammaro, F.",0.0,,,,11257.0,14533-1,Tammaro,,1942.0,,,http://www.wikidata.org/entity/Q21610167,"Tammaro, F.",Fernando Tammaro


In [77]:
this_output_file_name='data/bgbm_collectors_matches_wikidata_items_group_concat_%s.csv' % (
    # "20230531_1156"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

print("Wrote groups of collectors matches into", this_output_file_name)

# collectors_matches_group.to_csv(this_output_file_name)
collectors_matches_group.to_csv(this_output_file_name)

Wrote groups of collectors matches into data/bgbm_collectors_matches_wikidata_items_group_concat_20230705.csv


Get individual WikiData items (TODO review code): 
- associate collector name match + individual WikiData items (remember: we matched the `canonical_string`)

In [78]:
# TODO get  list of atomized collectors matches down to single wikidata items
collectors_matches_t1 = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_wikidata', right_on='canonical_string'
)
# collectors_matches_t1.drop(columns=['canonical_string'])

# TODO review: this merges `wikidata` a second time on the same key; the suffixes
# keep the two copies of each Wikidata column apart
collectors_matches_t2 = pd.merge(
    collectors_matches_t1, wikidata, 
    left_on="namematch_wikidata", right_on="canonical_string"
    , suffixes=('__collmatches', '__wdata-isolated') # append to left-data, right-data only when identical column names occur
)

# TODO AP: add count of duplicates?

print(list(collectors_matches_t2.columns))
# in BASH fold text long lines; echo "${text}" | fold --spaces | sed 's@^@# @'
# ['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 
# 'appellation', 'title', 'occurrenceID_count', 'occurrenceID_first', 
# 'canonical_string_collector_parsed', 'index', 'namematch_collector', 
# 'namematch_wikidata', 'namematch_distance', 'item__collmatches', 
# 'itemLabel__collmatches', 'surname__collmatches', 'initials__collmatches', 
# 'canonical_string__collmatches', 'orcid__collmatches', 'viaf__collmatches', 
# 'isni__collmatches', 'harv__collmatches', 'ipni__collmatches', 
# 'abbr__collmatches', 'bionomia_id__collmatches', 'yob__collmatches', 
# 'yod__collmatches', 'wyb__collmatches', 'wye__collmatches', 
# 'item__wdata-isolated', 'itemLabel__wdata-isolated', 'surname__wdata-isolated', 
# 'initials__wdata-isolated', 'canonical_string__wdata-isolated', 
# 'orcid__wdata-isolated', 'viaf__wdata-isolated', 'isni__wdata-isolated', 
# 'harv__wdata-isolated', 'ipni__wdata-isolated', 'abbr__wdata-isolated', 
# 'bionomia_id__wdata-isolated', 'yob__wdata-isolated', 'yod__wdata-isolated', 
# 'wyb__wdata-isolated', 'wye__wdata-isolated']


['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 'appellation', 'title', 'occurrenceID_count', 'occurrenceID_first', 'canonical_string_collector_parsed', 'index', 'namematch_collector', 'namematch_wikidata', 'namematch_distance', 'item__collmatches', 'itemLabel__collmatches', 'surname__collmatches', 'initials__collmatches', 'canonical_string__collmatches', 'orcid__collmatches', 'viaf__collmatches', 'isni__collmatches', 'harv__collmatches', 'ipni__collmatches', 'abbr__collmatches', 'bionomia_id__collmatches', 'yob__collmatches', 'yod__collmatches', 'wyb__collmatches', 'wye__collmatches', 'item__wdata-isolated', 'itemLabel__wdata-isolated', 'surname__wdata-isolated', 'initials__wdata-isolated', 'canonical_string__wdata-isolated', 'orcid__wdata-isolated', 'viaf__wdata-isolated', 'isni__wdata-isolated', 'harv__wdata-isolated', 'ipni__wdata-isolated', 'abbr__wdata-isolated', 'bionomia_id__wdata-isolated', 'yob__wdata-isolated', 'yod__wdata-isolated', 'wyb__wdata-isolat

In [79]:
collectors_matches_t2.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first,...,viaf__wdata-isolated,isni__wdata-isolated,harv__wdata-isolated,ipni__wdata-isolated,abbr__wdata-isolated,bionomia_id__wdata-isolated,yob__wdata-isolated,yod__wdata-isolated,wyb__wdata-isolated,wye__wdata-isolated
0,Azofeifa,A.,,,,,,,2,https://herbarium.bgbm.org/object/B200211416,...,,,,20031244-1,Azof.-Bolaños,,,,,
1,Azofeifa,A.,,,,,,,3,https://herbarium.bgbm.org/object/B200211671,...,,,,20031244-1,Azof.-Bolaños,,,,,
2,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101143091,...,,,,20031244-1,Azof.-Bolaños,,,,,
3,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147462,...,,,,20031244-1,Azof.-Bolaños,,,,,
4,Azofeifa,A.,,,,,,,1,https://herbarium.bgbm.org/object/B101147459,...,,,,20031244-1,Azof.-Bolaños,,,,,


In [None]:
# TODO remove columns we do not need for analysis

In [80]:
this_output_file_name='data/bgbm_collectors_matches_wikidata-botanists_all-columns_%s.csv' % (
    # "20230531"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

print("Write isolated WikiData items of collector matches into", this_output_file_name)

collectors_matches_t2.to_csv(this_output_file_name)

Write isolated WikiData items of collector matches into data/bgbm_collectors_matches_wikidata-botanists_all-columns_20230705.csv


In [85]:
# TODO further evaluation or filtering, counting, clean up aso.
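One possible first step for this evaluation is to count how many Wikidata items share the matched canonical name string and to flag matches that are both exact and unique. This is a sketch on toy data. The columns `n_candidates` and `unambiguous` are hypothetical, and the second item URL for "Tamura, M." is a made-up placeholder because the output above shows it truncated.

```python
import pandas as pd

group_toy = pd.DataFrame({
    "namematch_collector": ["Tamura, M.", "Scholz, H."],
    "namematch_distance": [0.0, 0.0],
    "items": [
        # Q00000000 is a placeholder, not a real item
        "http://www.wikidata.org/entity/Q15631729|http://www.wikidata.org/entity/Q00000000",
        "http://www.wikidata.org/entity/Q1618426",
    ],
})

# number of pipe-separated Wikidata items per matched name
group_toy["n_candidates"] = group_toy["items"].str.count(r"\|") + 1
# exact name match and exactly one Wikidata candidate
group_toy["unambiguous"] = (
    (group_toy["namematch_distance"] == 0) & (group_toy["n_candidates"] == 1)
)
print(group_toy[["namematch_collector", "n_candidates", "unambiguous"]])
```

Names with more than one candidate would then need further evidence, for example dates, before one item is chosen.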


TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_collector | collector name (`canonical_string_collector_parsed`) that was used for matching
namematch_wikidata | Wikidata name (`canonical_string`) that is the nearest match to the collector name
namematch_distance | Nearest-neighbour distance between the collector name and the matched name; the lower the value, the better the match
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb	| Start year of work period ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | End year of work period ([P2032](https://www.wikidata.org/wiki/Property:P2032))