# Match BGBM Collectors to Wikidata Items

We match the `canonical_string` of each Wikidata item against the `canonical_string` of each collector. The collector names were parsed into single names beforehand using <https://libraries.io/rubygems/dwc_agent>.

TODO:

- evaluate cases where multiple names (Wikidata or collector data) are found
- also match the work period (Wikidata) ⇌ creation date of the herbarium sheet (if no other lifetime data are available)
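
The second TODO item could be approached with a simple plausibility check. The following is only a sketch under assumptions, not part of the notebook: the helper `is_plausible_collector`, its `max_age` margin and the `sheet_year` argument (the year taken from the herbarium sheet) are hypothetical, while `yob`, `yod`, `wyb` and `wye` are the Wikidata columns described at the end of this document. It prefers the work period and falls back to the life time.

```python
import math

def _known(year):
    """True if the year is present (neither None nor NaN)."""
    return year is not None and not (isinstance(year, float) and math.isnan(year))

def is_plausible_collector(sheet_year, yob=None, yod=None, wyb=None, wye=None, max_age=100):
    """Hypothetical check: does the sheet year fit the work period or life time?

    Returns True/False, or None if no date is known for the person.
    """
    if _known(wyb) or _known(wye):
        # work period known (at least partly): open ends are unbounded
        start = wyb if _known(wyb) else -math.inf
        end = wye if _known(wye) else math.inf
        return start <= sheet_year <= end
    if _known(yob) or _known(yod):
        # fall back to the life time; a missing end is estimated with max_age
        start = yob if _known(yob) else yod - max_age
        end = yod if _known(yod) else yob + max_age
        return start <= sheet_year <= end
    return None

# Example values: yob=1876 and yod=1919 as listed for "Aaronsohn, A." further below
print(is_plausible_collector(1905, yob=1876.0, yod=1919.0))
print(is_plausible_collector(1950, yob=1876.0, yod=1919.0))
print(is_plausible_collector(1905, yob=float("nan"), yod=float("nan")))
```

Returning `None` for unknown dates keeps "not checkable" apart from "implausible", so such rows can be reviewed separately.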

### Load Wikidata Data Set

[Jupyter Notebook for creating the botanist Wikidata data set](./create_wikidata_datasets_botanists.ipynb) (TODO: improve query properties) 

Out of the Wikidata items data set we create a data frame with unique canonical name strings and their counts.

In [1]:
import pandas as pd
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.",,43340073,0000 0001 1630 5464,1373.0,6129-1,M.Bieb.,Q66612,1768.0,1826.0,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.",,20328622,0000 0001 1604 8680,42741.0,619-1,Behr,Q66934,1818.0,1904.0,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.",,47016953,0000 0000 8343 3899,1101.0,12818-1,Schaeff.,,1718.0,1790.0,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.",,20426762,0000 0001 1749 2732,135.0,4855-1,Klotzsch,Q67003,1805.0,1860.0,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.",,59847236,0000 0001 1653 0899,73782.0,23266-1,Menge,,1808.0,1880.0,,


In [2]:
# Create data frame with unique canonical strings 
# group by canonical name/string, count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()

wd_matchtest

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O.H.",1
1,"(1835-1906), G.A.F.E.",1
2,"(1873-1926), S.S.",1
3,"(1888–1973), G.A.",1
4,"(1904-1990), J.J.",1
...,...,...
61296,"Șerbanescu, I.",1
61297,"Ștefureac, T.",1
61298,"Țopa, E.",1
61299,"Ḥalwaǧī, R.",1
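
The two header rows in the output above come from the nested `agg({'item': ['count']})`, which creates two-level column names. As a possible alternative, pandas' named aggregation yields flat column names. A minimal sketch on made-up rows (the frame `toy` and the column name `item_count` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the Wikidata data set (item values are made up)
toy = pd.DataFrame({
    "item": ["item-1", "item-2", "item-3"],
    "canonical_string": ["Abbott, J.R.", "Abbott, J.R.", "Behr, H.H."],
})

# Named aggregation: one flat column "item_count" instead of ("item", "count")
toy_counts = (
    toy.groupby("canonical_string")
       .agg(item_count=("item", "count"))
       .reset_index()
)
print(toy_counts.columns.tolist())
print(toy_counts["item_count"].tolist())
```

With flat columns the name can be read as `toy_counts["canonical_string"]` without indexing into a second header level.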


### Load Collectors Data Set

Data sources:

- option 1: Jupyter Notebook for `create_bgbm_botanypilot_collectors_dataset.ipynb` from SPARQL (not in this official documentation yet)
- option 2: Jupyter Notebook for [`create_bgbm_gbif-occurrence_collectors_dataset.ipynb`](./create_bgbm_gbif-occurrence_collectors_dataset.ipynb)

Then parse the collector names into single, separate names with `dwcagent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- use ruby script `./bin/agent_parse4tsv.rb` for parsing text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`

TODO:
- check parsed fields `particle` and other fields, e.g. «`Abbas al Ani, H.`»

    ```bash
    cd data/VHde_0195853-230224095556074_BGBM/
    head occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv \
      | column --table --separator $'\t' \
      | sed 's@^@  # @;'
      # family     given  suffix  particle  dropping_particle  nick  appellation  title  occurrenceID_count  occurrenceID_first
      #            No                                                                    1                   http://id.snsb.info/snsb/collection/108286/167064/109352
      # Azofeifa   A.                                                                    2                   https://herbarium.bgbm.org/object/B200211416
      # A. Cano    E.                                                                    1                   https://herbarium.bgbm.org/object/B100699397
      # Henry      A.                                                                    1                   https://herbarium.bgbm.org/object/B200098813
      # Selmons    Ad                                                                    1                   https://herbarium.bgbm.org/object/B100379213
      # Aaronsohn  A.                                                                    3                   https://je.jacq.org/JE00010154
    ```


In [3]:
# unique names parsed already by ruby gem package: dwcagent

# collectors = pd.read_csv("data/bgbm_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv", sep="\t")

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» and the like
collectors.sort_values(by=['family', 'given','occurrenceID_first'], inplace=True)
collectors


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
2,A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397
39762,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305
5,Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154
26985,Abaouz,A.,,,,,,,3,https://herbarium.bgbm.org/object/B100217620
26989,Abaouz,A.,,,,,,,2,https://herbarium.bgbm.org/object/B100326682
...,...,...,...,...,...,...,...,...,...,...
66575,Ždanova,O.,,,,,,,5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714


#### Check Composition of Parsed Collector Data

In [4]:
# test particle for NA values (perhaps particle is the most important)
test_collectors = collectors.loc[collectors.particle.notna()]
print("names with particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with particle (534 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
21037,Abreu,Guilherme,,de,,,,,1,http://id.snsb.info/snsb/collection/22086/3086...
4096,Aguilar,M.L.,,Reyna de,,,,,4,https://herbarium.bgbm.org/object/B100031063
60867,Aguilar,M.L.,,Reyna de,,,,,26,https://herbarium.bgbm.org/object/B100031454
16765,Aguilar,M.L.,,Reyna de,,,,,2,https://herbarium.bgbm.org/object/B100031644
46755,Aguilar,M.L.,,Reyna de,,,,,3,https://herbarium.bgbm.org/object/B100031648


In [5]:
# test suffix for NA values
test_collectors = collectors.loc[collectors.suffix.notna()]
print("names with suffix (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with suffix (15 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
17288,August,Friedrich,II.,,,,,,21,https://dr.jacq.org/DR014960
58907,Dogma,I.J.,Jr.,,,,,,1,https://je.jacq.org/JE04008848
17017,Forsyth,W.,jr.,,,,,,1,http://id.snsb.info/snsb/collection/504525/625...
801,Grear,J.W.,Jr.,,,,,,1,https://herbarium.bgbm.org/object/B100525791
26194,Grear,J.W.,Jr.,,,,,,2,https://herbarium.bgbm.org/object/B100525792


In [6]:
# test dropping_particle for NA values
test_collectors = collectors.loc[collectors.dropping_particle.notna()]
print("names with dropping_particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with dropping_particle (0 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first


In [7]:
# test appellation for NA values
test_collectors = collectors.loc[collectors.appellation.notna()]
print("names with appellation (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with appellation (1 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
17120,Sennen,,,,,,Fr,,2,https://herbarium.bgbm.org/object/B100127256
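
The per-field checks above can also be condensed into one count of filled values per column. A minimal sketch on made-up rows (the frame `toy_collectors` and the list `optional_fields` are assumptions; the column names are those of the parsed TSV):

```python
import pandas as pd

# Toy stand-in for the parsed collectors table
toy_collectors = pd.DataFrame({
    "family": ["Abreu", "Grear", "Sennen"],
    "given": ["Guilherme", "J.W.", None],
    "suffix": [None, "Jr.", None],
    "particle": ["de", None, None],
    "appellation": [None, None, "Fr"],
})

optional_fields = ["suffix", "particle", "appellation"]
# notna() marks filled cells, sum() counts them per column
filled = toy_collectors[optional_fields].notna().sum()
filled_counts = {name: int(count) for name, count in filled.items()}
print(filled_counts)
```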


Compile the `canonical_string…` column for the collector data, which the Wikidata names are matched against later:

In [8]:
# family name only where the given name is NA, otherwise family name + given name
family_given = collectors.family.where(
    # condition
    collectors.given.isna(),
    other=collectors.family + ", " + collectors.given
)
# prepend the particle row by row where one was parsed, e.g. «de Abreu, Guilherme»
# TODO improve the combined name for canonical_string_collector_parsed if any of the other dwc_parsed fields is not NaN
collectors['canonical_string_collector_parsed'] = family_given.where(
    collectors.particle.isna(),
    other=collectors.particle + " " + family_given
)
# move canonical_string_collector_parsed after column title
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_first
66575,Ždanova,O.,,,,,,,"Ždanova, O.",5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,"Ždanova, O.",1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,"Žíla, V.",3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,"Волкова, Е.",1,https://herbarium.bgbm.org/object/B100530714
66578,Жирова,O.,,,,,,,"Жирова, O.",1,https://herbarium.bgbm.org/object/B100630811
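
To see how a row-wise condition treats names with and without a given name or particle, here is a minimal sketch on made-up rows. The frame `toy` is an assumption for illustration; `Series.where` keeps the value where the condition is true and takes `other` elsewhere.

```python
import pandas as pd

toy = pd.DataFrame({
    "family": ["Abreu", "Aaronsohn", "Aaiki"],
    "given": ["Guilherme", "A.", None],
    "particle": ["de", None, None],
})

# family alone where the given name is missing, otherwise "family, given"
name = toy.family.where(toy.given.isna(), other=toy.family + ", " + toy.given)
# prepend the particle only in rows that have one
name = name.where(toy.particle.isna(), other=toy.particle + " " + name)
print(name.tolist())  # ['de Abreu, Guilherme', 'Aaronsohn, A.', 'Aaiki']
```

Note that a scalar test such as `any(column.isna())` decides once for the whole column, whereas the boolean Series passed to `where` decides per row.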


In [None]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Search

See https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

The ngrams function is used as an analyzer in the text search later.

In [9]:
wd_matchtest['canonical_string'].at[0]

'(-Walraevens), O.H.'

In [10]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

def ngrams(string, n=3):
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


print("Example from name:", ngrams('Klazenga, N.'))
print("Example from collectors:", ngrams(collectors["canonical_string_collector_parsed"].at[1])) 
print("Example from match-test:", ngrams(wd_matchtest['canonical_string'].at[1]))


Defaulting to user installation because normal site-packages is not writeable
Example from name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
Example from collectors: [' Az', 'Azo', 'zof', 'ofe', 'fei', 'eif', 'ifa', 'fa ', 'a A', ' A ']
Example from match-test: [' 18', '183', '835', '35 ', '5 1', ' 19', '190', '906', '06 ', '6 G', ' Ga', 'Gaf', 'afe', 'fe ']
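
One consequence of the `encode("ascii", errors="ignore")` step in `ngrams` is that letters outside ASCII are dropped entirely, which affects names such as «Ždanova» or «Волкова» in the collector data. The sketch below uses only the standard library (it leaves out `ftfy`) to show the effect. `fold_to_ascii` is a hypothetical alternative that is not used in this notebook: it decomposes accented letters first so that the base letter survives. Names in non-Latin scripts are still lost and would need transliteration.

```python
import unicodedata

def drop_non_ascii(name):
    """Same step as in ngrams(): characters outside ASCII are removed."""
    return name.encode("ascii", errors="ignore").decode()

def fold_to_ascii(name):
    """Hypothetical alternative: decompose accented letters (NFKD) first,
    so the base letter is kept when non-ASCII characters are removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", errors="ignore").decode()

for name in ["Ždanova, O.", "Șerbanescu, I.", "Волкова"]:
    print(repr(drop_non_ascii(name)), repr(fold_to_ascii(name)))
# 'danova, O.' 'Zdanova, O.'
# 'erbanescu, I.' 'Serbanescu, I.'
# '' ''
```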


Vectorize the Wikidata names. Background: we use an information retrieval technique, Term Frequency-Inverse Document Frequency (TF-IDF, see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with the Wikidata names. The distance measure calculated for each name match helps to match similar names and to tell apart names that are unlikely to match. See also https://scikit-learn.org and https://pypi.org/project/scikit-learn/.
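
For reference, the weighting and the distance can be written out as follows. This assumes scikit-learn's defaults, which the code in this notebook does not override: `smooth_idf=True` and `norm='l2'` for `TfidfVectorizer`, and the Euclidean metric for `NearestNeighbors`. Here $N$ is the number of Wikidata names, $\mathrm{df}(t)$ the number of names that contain the trigram $t$, and $\mathrm{tf}(t,d)$ the count of $t$ in name $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}

\lVert u - v \rVert_2 = \sqrt{2 - 2\cos(u, v)}
\quad\text{for}\quad \lVert u \rVert_2 = \lVert v \rVert_2 = 1
```

Under these assumptions a distance of 0 means identical trigram profiles, and because the weights are non-negative the distance between two names cannot exceed $\sqrt{2} \approx 1.41$ (no trigram in common).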

Convert a collection of raw documents to a matrix of TF-IDF features:

In [11]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

wikidata_names = wd_matchtest['canonical_string']

# vectorize wikidata names
print('Vectorizing data. This may take a while...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf_vector_data = vectorizer.fit_transform(wikidata_names)
print('Vectorizing completed: Created a matrix of TF-IDF features')


Vectorizing data. This may take a while...
Vectorizing completed: Created a matrix of TF-IDF features


Set up the function that performs the nearest neighbour matches...

In [12]:
from sklearn.neighbors import NearestNeighbors

nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step

# matching query
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices


### Perform the Matching

Perform the nearest neighbour (NN) matches on the (BGBM) collector names and create a data frame with matches... (can take 5 to 10 minutes)

In [13]:
collectors_names = set(collectors['canonical_string_collector_parsed'].values) 
  # a set drops duplicate names, so each distinct name is queried only once

import time
start = time.time()
print('Getting nearest neighbours...')
distances, indices = getNearestN(collectors_names)
duration = time.time() - start
print('Completed in:', duration, 's')

collectors_names = list(collectors_names) # convert back to list

print('Finding matches...')
matches = []
for i,j in enumerate(indices):
    temp = [collectors_names[i], wd_matchtest.values[j][0][0], round(distances[i][0],2)]
    matches.append(temp)

duration = time.time() - start
print('Building matches data frame:', duration, 's')  
matches = pd.DataFrame(
    matches, 
    columns=['namematch_collector','namematch_wikidata','namematch_distance']
)

duration = time.time() - start
print('Done:', duration, 's') 

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index()

matches.head()

Getting nearest neighbours...
Completed in: 121.09354186058044 s
Finding matches...
Building matches data frame: 199.0564329624176 s
Done: 199.06938552856445 s


Unnamed: 0,index,namematch_collector,namematch_wikidata,namematch_distance
0,10424,"Rehm, S.E.A.","Rehm, S.E.A.",0.0
1,4038,"Wiefel, C.","Wiefel, C.",0.0
2,4037,"Collins, G.N.","Collins, G.N.",0.0
3,11547,"Roux, H.","Roux, H.",0.0
4,11556,"Bleij, B.","Bleij, B.",0.0
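
The `namematch_distance` values can be used to triage the matches. The following is a minimal sketch, not part of the notebook: `classify_match` and its cut-off `max_distance=0.5` are hypothetical and would have to be tuned on reviewed matches. The conversion to a cosine similarity assumes L2-normalised TF-IDF vectors and Euclidean distance, which are the scikit-learn defaults.

```python
def cosine_from_distance(distance):
    """Cosine similarity of two L2-normalised vectors, derived from their Euclidean distance."""
    return 1.0 - distance ** 2 / 2.0

def classify_match(distance, max_distance=0.5):
    """Hypothetical triage; max_distance is an arbitrary cut-off to be tuned."""
    if distance == 0.0:
        return "exact"
    if distance <= max_distance:
        return "candidate"
    return "no match"

# Print distance, derived similarity and triage class for a few example distances
for d in (0.0, 0.4, 0.64, 1.12):
    print(d, round(cosine_from_distance(d), 2), classify_match(d))
```

Since the distances are rounded to two decimals, "exact" here means a distance below 0.005 rather than strict string equality.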


### Create Output Results

Combine the matches data frame back to the (BGBM) collectors and Wikidata items …

In [14]:
# join matches data frame back to source collectors  dataframe 
collectors_matches = pd.merge(
    collectors, matches, 
    left_on='canonical_string_collector_parsed'
    , right_on='namematch_collector'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_first,index,namematch_collector,namematch_wikidata,namematch_distance
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,https://herbarium.bgbm.org/object/B100699397,15817,"A. Cano, E.","Cano, Á.",0.64
1,Aaiki,,,,,,,,Aaiki,1,https://herbarium.bgbm.org/object/B101149305,4034,Aaiki,"Naiki, A.",0.84
2,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://je.jacq.org/JE00010154,7903,"Aaronsohn, A.","Aaronsohn, A.",0.0
3,Abaouz,A.,,,,,,,"Abaouz, A.",3,https://herbarium.bgbm.org/object/B100217620,12817,"Abaouz, A.","Arbaoui, S.",1.12
4,Abaouz,A.,,,,,,,"Abaouz, A.",2,https://herbarium.bgbm.org/object/B100326682,12817,"Abaouz, A.","Arbaoui, S.",1.12


Save the results...

In [16]:
from datetime import datetime
import os
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file='data/bgbm_collectors_matches_wikidata-botanists_%s.csv' % (
    # "20230705"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_matches.to_csv(this_output_file)

print("Wrote matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # ">> 10" shifts right by 10 bits: integer division by 1024 (bytes to kB)
)

Wrote matches of collector names into data/bgbm_collectors_matches_wikidata-botanists_20230705.csv (8450 kB)


### Aggregate Matched Data

Now aggregate the data for cases where multiple names are found, and so on.

In [18]:
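# join the matches with the full Wikidata records via the canonical name string
# collectors_matches_g1 = pd.merge(collectors_matches, wd_matchtest, 
#                                  left_on='matched_name', right_on='canonical_string')
collectors_matches_g1 = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_wikidata', right_on='canonical_string'
)
# TODO the last column of this merge is `wye` (work period end), so 'item_count' holds the `wye` values, not item counts;
# merging with wd_matchtest (commented out above) would provide the counts
collectors_matches_g1.rename(columns = {list(collectors_matches_g1)[-1]: 'item_count'}, inplace=True)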

# link wikidata items with canonical name string (pipe separated if more than one)
print('Aggregate WD item, i.e. WD IDs (multiple records joined by “…|…”) ...')
wikidata_uniq_items = wikidata.groupby(['canonical_string'])['item'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g2 = pd.merge(# now merging unique WikiData names to collectors
    collectors_matches_g1, wikidata_uniq_items, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_itemcount', '__grp_by_item') 
      # append to left-data, right-data only when identical column names occur
)
collectors_matches_g2.rename(columns = {list(collectors_matches_g2)[-1]: 'items'}, inplace=True)

# link wikidata items with canonical name string (pipe separated if more than one)
print('Aggregate WD itemLabel, i.e. names (multiple records joined by “…|…”) ...')
wikidata_uniq_itemlabels = wikidata.groupby(['canonical_string'])['itemLabel'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g3 = pd.merge(
    collectors_matches_g2, wikidata_uniq_itemlabels, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_items', '__grp_by_itemlabel') 
      # append to left-data, right-data only when identical column names occur
)

collectors_matches_g3.rename(columns = {list(collectors_matches_g3)[-1]: 'item_labels'}, inplace=True)


Aggregate WD item, i.e. WD IDs (multiple records joined by “…|…”) ...
Done.
Aggregate WD itemLabel, i.e. names (multiple records joined by “…|…”) ...
Done.
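
The three aggregates (count, joined item IDs, joined labels) could also come from a single `groupby` with named aggregation, followed by one merge. This is a sketch on made-up rows rather than the notebook's method: the frames `toy_wikidata` and `toy_matches` are assumptions, and as with the `'|'.join` above it assumes that no item or label is missing.

```python
import pandas as pd

# Toy stand-in for the Wikidata data set (IDs and labels are made up)
toy_wikidata = pd.DataFrame({
    "item": ["item-1", "item-2", "item-3"],
    "itemLabel": ["Label One", "Label Two", "Label Three"],
    "canonical_string": ["Abbott, J.R.", "Abbott, J.R.", "Behr, H.H."],
})

# One row per canonical name with count, joined IDs and joined labels
wikidata_grouped = (
    toy_wikidata.groupby("canonical_string")
    .agg(
        item_count=("item", "count"),
        items=("item", "|".join),
        item_labels=("itemLabel", "|".join),
    )
    .reset_index()
)

# Toy stand-in for the matches; a left merge keeps every match row
toy_matches = pd.DataFrame({"namematch_wikidata": ["Abbott, J.R.", "Behr, H.H."]})
merged = toy_matches.merge(
    wikidata_grouped,
    left_on="namematch_wikidata", right_on="canonical_string", how="left",
)
print(merged[["namematch_wikidata", "item_count", "items"]].to_dict("records"))
```

Merging against one row per canonical name keeps one row per match, and the columns already have their final names, so no rename by position is needed.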


Prepare data to save later on …

In [19]:
collectors_matches_group = collectors_matches_g3

print(list(collectors_matches_group.columns))
# from merge: columns that occur in both frames carry the suffixes passed to pd.merge above (the pandas defaults would be _x and _y)

# in BASH fold text long lines; echo "${text}" | fold --spaces | sed 's@^@# @'
# ['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 
# 'appellation', 'title', 'canonical_string_collector_parsed', 
# 'occurrenceID_count', 'occurrenceID_first', 'index', 'namematch_collector', 
# 'namematch_wikidata', 'namematch_distance', 'item__grp_by_itemcount', 
# 'itemLabel__grp_by_items', 'surname', 'initials', 
# 'canonical_string__grp_by_itemcount', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 
# 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'item_count', 
# 'canonical_string__grp_by_item', 'items', 'canonical_string', 'item_labels']

['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 'appellation', 'title', 'canonical_string_collector_parsed', 'occurrenceID_count', 'occurrenceID_first', 'index', 'namematch_collector', 'namematch_wikidata', 'namematch_distance', 'item__grp_by_itemcount', 'itemLabel__grp_by_items', 'surname', 'initials', 'canonical_string__grp_by_itemcount', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'item_count', 'canonical_string__grp_by_item', 'items', 'canonical_string', 'item_labels']


In [20]:
collectors_matches_group.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,...,abbr,bionomia_id,yob,yod,wyb,item_count,canonical_string__grp_by_item,items,canonical_string,item_labels
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,...,An.Cano,,,,,,"Cano, Á.",http://www.wikidata.org/entity/Q47115003,"Cano, Á.",Ángela Cano
1,Cano-E,A.A.,,,,,,,"Cano-E, A.A.",2,...,An.Cano,,,,,,"Cano, Á.",http://www.wikidata.org/entity/Q47115003,"Cano, Á.",Ángela Cano
2,Cantillano,E.,,,,,,,"Cantillano, E.",3,...,An.Cano,,,,,,"Cano, Á.",http://www.wikidata.org/entity/Q47115003,"Cano, Á.",Ángela Cano
3,Aaiki,,,,,,,,Aaiki,1,...,Naiki,,,,,,"Naiki, A.",http://www.wikidata.org/entity/Q33686006,"Naiki, A.",Akiyo Naiki
4,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,...,Aarons.,Q2086130,1876.0,1919.0,,,"Aaronsohn, A.",http://www.wikidata.org/entity/Q2086130,"Aaronsohn, A.",Aaron Aaronsohn


In [21]:
# Remove superfluous columns
# TODO check duplicates
collectors_matches_group = collectors_matches_g3[
    ['family', 'given', 'canonical_string_collector_parsed', 
    'namematch_collector', 'namematch_wikidata', 'namematch_distance', 
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb',
    'items', 'canonical_string', 'item_labels']
].copy() # work on a copy, so the in-place operations below do not act on a slice of collectors_matches_g3
# collectors_matches_group = collectors_matches_g3
collectors_matches_group.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)
collectors_matches_group.drop_duplicates(inplace=True)
collectors_matches_group.head()

Unnamed: 0,family,given,canonical_string_collector_parsed,namematch_collector,namematch_wikidata,namematch_distance,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,items,canonical_string,item_labels
4,Aaronsohn,A.,"Aaronsohn, A.","Aaronsohn, A.","Aaronsohn, A.",0.0,,2795076,0000 0001 0948 8581,30592.0,23-1,Aarons.,Q2086130,1876.0,1919.0,,http://www.wikidata.org/entity/Q2086130,"Aaronsohn, A.",Aaron Aaronsohn
11,Abbe,E.C.,"Abbe, E.C.","Abbe, E.C.","Abbe, E.C.",0.0,,101473381,0000 0000 7237 8505,30066.0,26-1,Abbe,Q10274118,1905.0,2000.0,,http://www.wikidata.org/entity/Q10274118,"Abbe, E.C.",Ernst Cleveland Abbe
16,Abbott,J.R.,"Abbott, J.R.","Abbott, J.R.","Abbott, J.R.",0.0,,,,,20015671-1,J.R.Abbott,,1968.0,,,http://www.wikidata.org/entity/Q18982386,"Abbott, J.R.",J. Richard Abbott
30,Abbott,W.L.,"Abbott, W.L.","Abbott, W.L.","Abbott, W.L.",0.0,,1545420,0000 0000 3712 5377,27518.0,,,Q635604,1860.0,1936.0,,http://www.wikidata.org/entity/Q635604,"Abbott, W.L.",William Louis Abbott
47,Abedin,S.,"Abedin, S.","Abedin, S.","Abedin, S.",0.0,,5859151837993620520007,,69097.0,35239-1,Abedin,,1952.0,,,http://www.wikidata.org/entity/Q16142861,"Abedin, S.",Sultanul Abedin


In [22]:
this_output_file='data/bgbm_collectors_matches_wikidata_items_group_concat_%s.csv' % (
    # "20230705"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_matches_group.to_csv(this_output_file)

print("Wrote groups of collectors matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # ">> 10" shifts right by 10 bits: integer division by 1024 (bytes to kB)
)

Wrote groups of collectors matches into data/bgbm_collectors_matches_wikidata_items_group_concat_20230705.csv (4311 kB)


### Get Individual WikiData Items

(TODO review code): 
- associate collector name match + individual WikiData items (remember: we matched the `canonical_string`)

In [27]:
# TODO get  list of atomized collectors matches down to single wikidata items
collectors_matches_t1 = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_wikidata', right_on='canonical_string'
)
# collectors_matches_t1.drop(columns=['canonical_string'])

# join the Wikidata records a second time via the canonical name string (TODO review: this repeats the merge above)
collectors_matches_t2 = pd.merge(
    collectors_matches_t1, wikidata, 
    left_on="namematch_wikidata", right_on="canonical_string"
    , suffixes=('__collmatches', '__wdata-isolated') # append to left-data, right-data only when identical column names occur
)

# TODO AP: add count of duplicates?

print(list(collectors_matches_t2.columns))
# in BASH fold text long lines; echo "${text}" | fold --spaces | sed 's@^@# @'
# ['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 
# 'appellation', 'title', 'canonical_string_collector_parsed', 
# 'occurrenceID_count', 'occurrenceID_first', 'index', 'namematch_collector', 
# 'namematch_wikidata', 'namematch_distance', 'item__collmatches', 
# 'itemLabel__collmatches', 'surname__collmatches', 'initials__collmatches', 
# 'canonical_string__collmatches', 'orcid__collmatches', 'viaf__collmatches', 
# 'isni__collmatches', 'harv__collmatches', 'ipni__collmatches', 
# 'abbr__collmatches', 'bionomia_id__collmatches', 'yob__collmatches', 
# 'yod__collmatches', 'wyb__collmatches', 'wye__collmatches', 
# 'item__wdata-isolated', 'itemLabel__wdata-isolated', 'surname__wdata-isolated', 
# 'initials__wdata-isolated', 'canonical_string__wdata-isolated', 
# 'orcid__wdata-isolated', 'viaf__wdata-isolated', 'isni__wdata-isolated', 
# 'harv__wdata-isolated', 'ipni__wdata-isolated', 'abbr__wdata-isolated', 
# 'bionomia_id__wdata-isolated', 'yob__wdata-isolated', 'yod__wdata-isolated', 
# 'wyb__wdata-isolated', 'wye__wdata-isolated']


['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 'appellation', 'title', 'canonical_string_collector_parsed', 'occurrenceID_count', 'occurrenceID_first', 'index', 'namematch_collector', 'namematch_wikidata', 'namematch_distance', 'item__collmatches', 'itemLabel__collmatches', 'surname__collmatches', 'initials__collmatches', 'canonical_string__collmatches', 'orcid__collmatches', 'viaf__collmatches', 'isni__collmatches', 'harv__collmatches', 'ipni__collmatches', 'abbr__collmatches', 'bionomia_id__collmatches', 'yob__collmatches', 'yod__collmatches', 'wyb__collmatches', 'wye__collmatches', 'item__wdata-isolated', 'itemLabel__wdata-isolated', 'surname__wdata-isolated', 'initials__wdata-isolated', 'canonical_string__wdata-isolated', 'orcid__wdata-isolated', 'viaf__wdata-isolated', 'isni__wdata-isolated', 'harv__wdata-isolated', 'ipni__wdata-isolated', 'abbr__wdata-isolated', 'bionomia_id__wdata-isolated', 'yob__wdata-isolated', 'yod__wdata-isolated', 'wyb__wdata-isolat

In [24]:
collectors_matches_t2.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,...,viaf__wdata-isolated,isni__wdata-isolated,harv__wdata-isolated,ipni__wdata-isolated,abbr__wdata-isolated,bionomia_id__wdata-isolated,yob__wdata-isolated,yod__wdata-isolated,wyb__wdata-isolated,wye__wdata-isolated
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,...,6501155286653387180000,,,20023992-1,An.Cano,,,,,
1,Cano-E,A.A.,,,,,,,"Cano-E, A.A.",2,...,6501155286653387180000,,,20023992-1,An.Cano,,,,,
2,Cantillano,E.,,,,,,,"Cantillano, E.",3,...,6501155286653387180000,,,20023992-1,An.Cano,,,,,
3,Aaiki,,,,,,,,Aaiki,1,...,,,,20029813-1,Naiki,,,,,
4,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,...,2795076,0000 0001 0948 8581,30592.0,23-1,Aarons.,Q2086130,1876.0,1919.0,,
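
For the open question of counting duplicates, one option is to count per collector name how many distinct Wikidata items share the matched canonical string. A minimal sketch on made-up rows (the frame `toy_matches` and the column name `item_candidates` are assumptions):

```python
import pandas as pd

# Toy stand-in: one row per collector name and candidate Wikidata item
toy_matches = pd.DataFrame({
    "namematch_collector": ["Abbott, J.R.", "Abbott, J.R.", "Behr, H.H."],
    "item": ["item-1", "item-2", "item-3"],
})

# transform() returns one value per row, so the count can be stored as a new column
toy_matches["item_candidates"] = (
    toy_matches.groupby("namematch_collector")["item"].transform("nunique")
)
print(toy_matches["item_candidates"].tolist())  # [2, 2, 1]
```

Rows with `item_candidates` greater than 1 are the ambiguous matches that need further evidence, such as dates or identifiers.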


Save all columns for further analysis

In [32]:
this_output_file='data/bgbm_collectors_matches_wikidata-botanists_all-columns_%s.csv' % (
    # "20230705"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_matches_t2.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)
collectors_matches_t2.to_csv(
    this_output_file, index=False # drop index column
)

print("Wrote isolated WikiData items of collector matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # ">> 10" shifts right by 10 bits: integer division by 1024 (bytes to kB)
)

Wrote isolated WikiData items of collector matches into data/bgbm_collectors_matches_wikidata-botanists_all-columns_20230705.csv (33210 kB)


In [None]:
# TODO remove columns we do not need for analysis
# remove duplicate columns by transposing it (https://www.statology.org/pandas-drop-duplicate-columns/)
compact_df_tmp=collectors_matches_t2.transpose().drop_duplicates().transpose()
compact_df_tmp.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)
this_output_file='data/bgbm_collectors_matches_wikidata-botanists_all-columns-made-unique_%s.csv' % (
    "20230705"
    # datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

compact_df_tmp.to_csv(
    this_output_file, index=False # drop index column
)

print("Wrote isolated WikiData items (unique columns) of collector matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # ">> 10" shifts right by 10 bits: integer division by 1024 (bytes to kB)
)

In [None]:
# TODO further evaluation or filtering, counting, clean-up and so on

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_collector | collector name (`canonical_string_collector_parsed`) that was matched
namematch_wikidata | Wikidata name (`canonical_string`) the collector name was matched to
namematch_distance | Nearest Neighbour distance between the collector name and the matched name; the lower the value, the better the match
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))