# Match BGBM Collectors to Wikidata Items Using *Cosine Similarity*

In short, we:
- match the `canonical_string` of the Wikidata items against the `canonical_string` of the source collectors (abbreviated names and, where given, full names),
- parse the collector names beforehand to split name lists in the source data into individual names, using the Ruby gem <https://libraries.io/rubygems/dwc_agent>, and
- in general follow the example of Niels Klazenga: <https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb>

Technical notes (the code may need review):
- TODO: refactor some data files to results….csv
- Done: run the matching on `canonical_string_fullname` (full names) as well as on `canonical_string` (abbreviated names)
- (NN ⇌ Cosine) refactor relation: wd_matchtest ⇌ wikidata_unique (replaced wd_matchtest → wikidata_unique)

## Load Wikidata Data Set

The data set is constructed with the Jupyter notebook [create_wikidata_datasets_botanists.ipynb](./create_wikidata_datasets_botanists.ipynb).

Out of the Wikidata items data set we create a data frame with unique canonical name strings and their counts.

In [1]:
import pandas as pd
import pprint, time, os

wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.","Bieberstein, Friedrich August Marschall von",,43340073,0000 0001 1630 5464,1373,...,Q66612,1768.0,1826.0,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.","Behr, Hans Hermann",,20328622,0000 0001 1604 8680,42741,...,Q66934,1818.0,1904.0,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.","Schäffer, Jacob Christian",,47016953,0000 0000 8343 3899,1101,...,,1718.0,1790.0,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.","Klotzsch, Johann Friedrich",,20426762,0000 0001 1749 2732,135,...,Q67003,1805.0,1860.0,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.","Menge, Franz Anton",,59847236,0000 0001 1653 0899,73782,...,,1808.0,1880.0,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [2]:
# compile data having only unique canonical strings
# group by canonical name/string, count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()

wd_matchtest
# cols = wd_matchtest.columns.tolist()

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O.H.",1
1,"(1835-1906), G.A.F.E.",1
2,"(1873-1926), S.S.",1
3,"(1888–1973), G.A.",1
4,"(1904-1990), J.J.",1
...,...,...
61479,"Șerbanescu, I.",1
61480,"Ștefureac, T.",1
61481,"Țopa, E.",1
61482,"Ḥalwaǧī, R.",1
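The `.agg({'item': ['count']})` call yields two-level column labels, which is why the output carries two header rows. As a minimal sketch on made-up toy rows, and assuming only the count per name is needed, named aggregation gives the same counts with flat column names. `toy_wikidata` and `item_count` are illustrative names, not part of this notebook.

```python
import pandas as pd

# toy rows standing in for the Wikidata export (hypothetical item ids)
toy_wikidata = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3"],
    "canonical_string": ["Behr, H.H.", "Menge, F.A.", "Behr, H.H."],
})

# named aggregation: one flat column per aggregate instead of a MultiIndex
toy_unique = (
    toy_wikidata
    .groupby("canonical_string")
    .agg(item_count=("item", "count"))
    .reset_index()
)
print(toy_unique)
```

A name that occurs more than once (here `Behr, H.H.`) ends up as a single row with a count above 1, which is the point of the unique data frame.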


In [3]:
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

wd_matchtest_fullnames


Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O Heylen",1
1,"(1835-1906), Gustav Adolf Ferdinand Eichler",1
2,"(1873-1926), Søren Sørensen",1
3,"(1888–1973), Georges André",1
4,"(1904-1990), Johannes Johannessen",1
...,...,...
63605,"Șerbanescu, Ioan",1
63606,"Ștefureac, Traian",1
63607,"Țopa, Emilian",1
63608,"Ḥalwaǧī, Riyāḍ",1


## Load Collectors Data Set

**Data sources:**

- option 1: Jupyter Notebook for [create_bgbm_gbif-occurrence_collectors_dataset.ipynb](./create_bgbm_gbif-occurrence_collectors_dataset.ipynb)
- option 2: Jupyter Notebook for `create_bgbm_botanypilot_collectors_dataset.ipynb` from SPARQL (not in this official documentation yet)

Then parse the collector names into single, separate names using `dwc_agent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- see [./bin/README.md](bin/README.md) on how to use the Ruby script `./bin/agent_parse4tsv.rb` for parsing text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`
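
To illustrate what the parsing step produces, here is a deliberately naive Python sketch that splits the example line into single names. It is not the `dwc_agent` algorithm, which handles far more cases (particles, titles, suffixes). The function name `split_name_list` and the splitting rules are assumptions made for this illustration only.

```python
import re

def split_name_list(line):
    """Split a collector name list into (family, given) tuples (naive sketch)."""
    # separators assumed here: '&' or a comma that is followed by whitespace
    parts = re.split(r"\s*&\s*|,\s+", line.strip())
    names = []
    for part in parts:
        # a comma without following whitespace separates family name and initials
        family, _, given = part.partition(",")
        names.append((family.strip(), given.strip()))
    return names

parsed = split_name_list("Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B.")
print(parsed)
```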

Technical notes:

- the corresponding objects and variable names in Niels’ Python code were:
```
refactor df_avh = → = collectors
refactor df_avh['label'] = → = collectors['canonical_string_collector_parsed']
…
```

In [4]:
# unique names parsed already by ruby gem package: dwcagent

# collectors = pd.read_csv("data/bgbm_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» and the like
collectors.sort_values(by=['family', 'given','occurrenceID_first'], inplace=True)
collectors

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
2,A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397
39762,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305
5,Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154
26985,Abaouz,A.,,,,,,,3,https://herbarium.bgbm.org/object/B100217620
26989,Abaouz,A.,,,,,,,2,https://herbarium.bgbm.org/object/B100326682
...,...,...,...,...,...,...,...,...,...,...
66575,Ždanova,O.,,,,,,,5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714


### Check Composition of Parsed Collector Data

In [5]:
# TODO review code of abbreviated names and full name matching
# a given name that starts with 3 or more word characters is (probably) a full name
criterion_fullnames = collectors.given.str.contains(r'^\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (1395 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
2213,Abdallah,Raffael,,,,,,,1,https://herbarium.bgbm.org/object/B200125981
27076,Abdul,Kadir Bin,,,,,,,1,https://herbarium.bgbm.org/object/B100184021
21037,Abreu,Guilherme,,de,,,,,1,http://id.snsb.info/snsb/collection/22086/3086...
18465,Adá,García,,,,,,,1,https://herbarium.bgbm.org/object/B100296455
361,Aghababyan,Mvon,,,,,,,96,https://herbarium.bgbm.org/object/B100576238
...,...,...,...,...,...,...,...,...,...,...
65787,Zickendrath,Ernst,,,,,,,3,https://je.jacq.org/JE04006629
65788,Zickendrath,Ernst,,,,,,,1,https://je.jacq.org/JE04007139
65874,Ziz,Johann Baptist,,,,,,,1,https://je.jacq.org/JE00017744
65936,Zollinger,Heinrich,,,,,,,2,https://herbarium.bgbm.org/object/B101097046
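
The criterion `^\w{3,}` treats a given name as a full name when it starts with at least three word characters. A small sketch with the standard `re` module, using given names that appear in the outputs above, shows what it accepts and where it errs. The helper name `looks_like_fullname` is illustrative.

```python
import re

FULLNAME_PATTERN = re.compile(r"^\w{3,}")

def looks_like_fullname(given):
    """True if the given name starts with three or more word characters."""
    return bool(FULLNAME_PATTERN.search(given))

for given in ["Raffael", "Kadir Bin", "M.L.", "E.", "Mvon"]:
    print(given, looks_like_fullname(given))
```

`Mvon` passes the test although it is probably an initial fused with a particle, so the criterion is a heuristic rather than a reliable classifier.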


In [6]:
# check whether the name-parsed columns are empty or need to be considered as data for matching
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    pprint.pprint(test_collectors.head())


----------------------------------------
show names with **particle** found 534 records:

        family      given suffix  particle  dropping_particle  nick  \
21037    Abreu  Guilherme    NaN        de                NaN   NaN   
4096   Aguilar       M.L.    NaN  Reyna de                NaN   NaN   
60867  Aguilar       M.L.    NaN  Reyna de                NaN   NaN   
16765  Aguilar       M.L.    NaN  Reyna de                NaN   NaN   
46755  Aguilar       M.L.    NaN  Reyna de                NaN   NaN   

      appellation title  occurrenceID_count  \
21037         NaN   NaN                   1   
4096          NaN   NaN                   4   
60867         NaN   NaN                  26   
16765         NaN   NaN                   2   
46755         NaN   NaN                   3   

                                      occurrenceID_first  
21037  http://id.snsb.info/snsb/collection/22086/3086...  
4096        https://herbarium.bgbm.org/object/B100031063  
60867       https://he

Compile the `canonical_string…` column of the collector data, which is later matched against the Wikidata names:

In [7]:
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
      # any other
      # TODO improve the combined name for canonical_string_collector_parsed if any of the other dwc_parsed fields is not NaN
      # NOTE: any(...) is evaluated once for the whole column, not per row; as soon as one
      # particle is NaN, every name is built as "family, given" and the particle is left out
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family + ", " + collectors.given
  )
)
# move canonical_string_collector_parsed after column title (title was the last of the parsing columns)
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_first
66575,Ždanova,O.,,,,,,,"Ždanova, O.",5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,"Ždanova, O.",1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,"Žíla, V.",3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,"Волкова, Е.",1,https://herbarium.bgbm.org/object/B100530714
66578,Жирова,O.,,,,,,,"Жирова, O.",1,https://herbarium.bgbm.org/object/B100630811
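
The TODO in the cell above concerns the particle: the `any(...)` test is a single Python condition for the whole column, so the particle does not reach the canonical string (see `Abreu, Guilherme` with particle `de` in the outputs of this notebook). A possible row-wise variant could look like the following sketch. The toy rows and the helper name `canonical_with_particle` are assumptions, and switching to it would change the canonical strings and therefore the matching results shown here.

```python
import pandas as pd

# toy rows modelled on the parsed collector columns (illustrative only)
toy_collectors = pd.DataFrame({
    "family": ["Abreu", "Aaiki", "Abbe"],
    "given": ["Guilherme", None, "E.C."],
    "particle": ["de", None, None],
})

def canonical_with_particle(df):
    """Build the canonical string row by row, falling back when parts are missing."""
    family_given = df["family"] + ", " + df["given"]    # NaN where given is missing
    with_particle = df["particle"] + " " + family_given  # NaN where particle is missing
    # prefer "particle family, given", then "family, given", then family alone
    return with_particle.fillna(family_given).fillna(df["family"])

toy_canonical = canonical_with_particle(toy_collectors)
print(toy_canonical.tolist())
```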


In [8]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum the occurrence counts of all grouped rows
    occurrenceID_collectors_firstsample=('occurrenceID_first', lambda x: list(x)[0]) # custom function, to get the first entry
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

collectors_unique

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,https://herbarium.bgbm.org/object/B100699397
1,Aaiki,,,,,,,,Aaiki,1,https://herbarium.bgbm.org/object/B101149305
2,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://je.jacq.org/JE00010154
3,Abaouz,A.,,,,,,,"Abaouz, A.",5,https://herbarium.bgbm.org/object/B100217620
4,Abarca,R.,,,,,,,"Abarca, R.",1,https://herbarium.bgbm.org/object/B101153811
...,...,...,...,...,...,...,...,...,...,...,...
20844,Żelazny,J.,,,,,,,"Żelazny, J.",4,https://herbarium.bgbm.org/object/B100344466
20845,Ždanova,O.,,,,,,,"Ždanova, O.",6,https://herbarium.bgbm.org/object/B100263330
20846,Žíla,V.,,,,,,,"Žíla, V.",3,https://herbarium.bgbm.org/object/B100009590
20847,Волкова,Е.,,,,,,,"Волкова, Е.",1,https://herbarium.bgbm.org/object/B100530714


In [9]:
# TODO continue 2023-08-21 10:28:54
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

## Set Up the Cosine Similarity and Text Search

See
- for the application code: <https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb>
- for background reading: Taylor, Josh. 2019. ‘Fuzzy Matching at Scale’. Towards Data Science (blog). 2 July 2019. <https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536>

The `ngrams` function is used as the analyzer in the text search later.

In [10]:
import pandas as pd, numpy as np, re
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn # pip install sparse-dot-topn

def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similarity[index] = round(sparse_matrix.data[index], 3)

    return pd.DataFrame({'namematch_source_data': left_side,
                         'namematch_resource_data': right_side,
                         'namematch_similarity': similarity})

# requires the ftfy package (pip install ftfy)
from ftfy import fix_text

def ngrams(string, n=3):
    """
    Construct character n-grams of a given text

    @param string: the text string to split into n-grams
    @param n: number of characters of each n-gram (default 3)
    @return: list of n-gram strings
    """
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.replace('.', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip() # get rid of multiple spaces and replace with a single
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    string = string.strip()
    this_ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in this_ngrams]

In [12]:
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[3]))


Show ngram examples:
- simple name: ['Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N']
- data from collectors: ['Aai', 'aik', 'iki']
- data from match-test: ['Wal', 'alr', 'lra', 'rae', 'aev', 'eve', 'ven', 'ens', 'ns ', 's O', ' O ', 'O H']
- data from match-test (full name): ['188', '888', '881', '819', '197', '973', '73 ', '3 G', ' Ge', 'Geo', 'eor', 'org', 'rge', 'ges', 'es ', 's A', ' An', 'And', 'ndr']
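
To make the similarity measure concrete, the following plain-Python sketch computes the cosine similarity of two names from raw trigram counts. It is a simplification: it omits the TF-IDF weighting and most of the text cleaning done in `ngrams()`, so its values differ from the `namematch_similarity` values in this notebook. `char_trigrams` and `trigram_cosine` are illustrative names.

```python
import math
from collections import Counter

def char_trigrams(text, n=3):
    """Lower-case the text, replace commas and dots by spaces, return character n-grams."""
    text = text.lower().replace(",", " ").replace(".", " ")
    text = " ".join(text.split())  # collapse runs of whitespace
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def trigram_cosine(name_a, name_b):
    """Cosine similarity of the two trigram count vectors (no TF-IDF weights)."""
    counts_a = Counter(char_trigrams(name_a))
    counts_b = Counter(char_trigrams(name_b))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # names too short to produce any trigram
    return dot / (norm_a * norm_b)

print(trigram_cosine("Kotschy, T.", "Kotschy, T."))
print(trigram_cosine("Kotschy, C.G.T.", "Kotschy, C.F."))
print(trigram_cosine("Kotschy, T.", "Menge, F.A."))
```

Identical names score 1, names that share most trigrams score high, and names without a common trigram score 0.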


In [15]:
# some example data
for i, row in enumerate(range(5)):
    if (i == 0):
        print('(WikiData’s) canonical_string = (constructed) canonical_string_fullname') 
    pprint.pprint("%s = %s" % (
        wd_matchtest['canonical_string'].at[row],
        wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))


(WikiData’s) canonical_string = (constructed) canonical_string_fullname
'(-Walraevens), O.H. = (-Walraevens), O Heylen'
'(1835-1906), G.A.F.E. = (1835-1906), Gustav Adolf Ferdinand Eichler'
'(1873-1926), S.S. = (1873-1926), Søren Sørensen'
'(1888–1973), G.A. = (1888–1973), Georges André'
'(1904-1990), J.J. = (1904-1990), Johannes Johannessen'


In [16]:
def calculateTFIDFmatchingOfData(query_data, match_data, cossim_ntop=1, cossim_lower_bound=0.5):
    """
    Calculate a TF-IDF (Term Frequency, Inverse Document Frequency) matching with awesome_cossim_topn() and return matched data

    @param query_data: array-like of name strings to query, usually the values of a pandas data column
    @param match_data: array-like of name strings to match against, e.g. a pandas Series
    @param cossim_ntop: how many matches to keep per query name (default 1, i.e. only the highest similarity); increase it to get more
        alternative matches with lower similarity
    @param cossim_lower_bound: lower similarity cut-off below which names are not regarded as similar (default 0.5)

    @requires get_matches_df()
    @requires ngrams()
    @requires awesome_cossim_topn()
    @requires TfidfVectorizer()

    @return: a data frame with the columns namematch_source_data, namematch_resource_data, namematch_similarity (@see get_matches_df())
    @rtype pd.DataFrame
    """

    import time
    time_start = time.time()

    # Vectorize Wikidata name (use fit_transform())
    print('Vectorizing data. This may take a while...')
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
    tf_idf_matrix_clean = vectorizer.fit_transform(match_data)
    # Vectorize collectors’ names (use transform())
    tf_idf_matrix_dirty = vectorizer.transform(query_data)

    duration = time.time() - time_start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    # Calculate Cosine Similarity; keep only the best match (ntop=1) and only if the similarity is greater than 0.5 (lower_bound=0.5)
    # (lower_bound: a threshold that the element of A*B must be greater than
    #  https://github.com/ing-bank/sparse_dot_topn/blob/3f40611b0553b50c27f23c7dcffc3ca9a9e8f5b5/sparse_dot_topn/awesome_cossim_topn.py#L26C9-L26C78)
    cossim_matches = awesome_cossim_topn(
        tf_idf_matrix_dirty,
        tf_idf_matrix_clean.transpose(),
        ntop=cossim_ntop,
        lower_bound=cossim_lower_bound
    )
    print("Cossim matches calculated after %s s" % (time.time() - time_start))

    print("Get all matches together ...")
    # construct the matching data frame
    matches_df = get_matches_df(
        cossim_matches,
        query_data,
        match_data,
        top=0
    )
    print("Done. Matches calculated after %s s" % (time.time() - time_start))

    return matches_df
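
The two parameters `ntop` and `lower_bound` can be pictured on a small dense similarity matrix. The sketch below is plain Python and only mimics the selection rule described in the comments above (keep the `ntop` best candidates per query row, and only those above the bound). It is not how `sparse_dot_topn` is implemented, and `select_top_matches` is a hypothetical helper.

```python
def select_top_matches(similarity_rows, ntop=1, lower_bound=0.5):
    """Return (query_index, match_index, similarity) tuples per query row."""
    selected = []
    for query_index, row in enumerate(similarity_rows):
        # rank the candidates of this query by similarity, best first
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        for match_index, similarity in ranked[:ntop]:
            if similarity > lower_bound:
                selected.append((query_index, match_index, similarity))
    return selected

toy_similarities = [
    [0.9, 0.6, 0.1],  # query 0: two candidates above the bound
    [0.2, 0.4, 0.3],  # query 1: nothing above the bound
]
print(select_top_matches(toy_similarities, ntop=1))
print(select_top_matches(toy_similarities, ntop=2))
```

With `ntop=1` each query name contributes at most one row to the result, and query names without any candidate above the bound drop out entirely.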

In [24]:
criterion_fullnames = collectors_unique.given.str.contains(r'^\w{3,}', na=False)
# abbreviated names only: invert the full-name criterion
collectors_names = collectors_unique['canonical_string_collector_parsed'][~criterion_fullnames].values

matches = calculateTFIDFmatchingOfData(
    collectors_names,
    wd_matchtest['canonical_string'],
    cossim_ntop=1 # e.g. cossim_ntop=3 would also return alternative matches with lower similarity; the result could grow up to 3 times
)
matches = matches.sort_values(by=['namematch_similarity'], ascending=[False])
matches = matches.reset_index(names=['old_index'])
matches

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 3.6831042766571045 s
Cossim matches calculated after 4.273163318634033 s
Get all matches together ...
Done. Matches calculated after 4.432769536972046 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,8776,"Lehmann, A.","Lehmann, A.",1.0
1,8743,"Lebrun, J.-P.A.","Lebrun, J.P.A.",1.0
2,8840,"Lengyel, G.","Lengyel, G.",1.0
3,8838,"Lendemer, J.C.","Lendemer, J.C.",1.0
4,8831,"Lemmon, J.G.","Lemmon, J.G.",1.0
...,...,...,...,...
17547,8045,"Kokeil, F.","Keil, F.",0.5
17548,8656,"Latelo, M.G.","Melo, M.A.",0.5
17549,391,"Ani, H.","Anşin, R.",0.5
17550,10383,"Molero, C.","Poveda-Molero, J.C.",0.5


In [25]:
# criterion_fullnames = collectors_unique.given.str.contains('^\w{3,}', na=False)
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(
    collectors_fullnames, 
    wd_matchtest_fullnames['canonical_string_fullname'], 
    cossim_ntop=1 # e.g. 10 would also return alternative matches with lower similarity
)

matches_fullnames = matches_fullnames.sort_values(by=['namematch_similarity'], ascending=[False])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

matches_fullnames

Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 3.4578964710235596 s
Cossim matches calculated after 3.5317859649658203 s
Get all matches together ...
Done. Matches calculated after 3.536301612854004 s


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,214,"Kupffer, Karl Reinhold","Kupffer, Karl Reinhold",1.000
1,217,"Lackström, Emil Frithiof","Lackström, Emil Frithiof",1.000
2,221,"Lange, Michael","Lange, Michael",1.000
3,222,"Lechler, Wilibald","Lechler, Wilibald",1.000
4,226,"Lickleder, Max","Lickleder, Max",1.000
...,...,...,...,...
424,53,"Calasenz, Joseph","Klekovski, Joseph Calasenz Schlosser von",0.506
425,357,"Sharpe, Guymer","Guymer, Gordon P",0.505
426,223,"Leonis, Christos","Kapellos, Christos",0.502
427,276,"Nicomed, Rastern","Rastern, Nikomed",0.502


### Create Output Results

Merge the matches data frame back with the (BGBM) collectors and the Wikidata items …

Note: merging 18,770,000 collector matches with Wikidata in an earlier attempt was too much to compute. Hence the decision to make the data unique by `canonical_string_collector_parsed`.

In [27]:
# join (only) abbreviated name matches with collector source data
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data', 
    how='left'
)

collectors_matches.dropna(subset=['namematch_similarity'], inplace=True)
collectors_matches # 17552 rows × 15 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,https://herbarium.bgbm.org/object/B100699397,0.0,"A. Cano, E.","Cano, E.B.",0.664
1,Aaiki,,,,,,,,Aaiki,1,https://herbarium.bgbm.org/object/B101149305,1.0,Aaiki,"Naiki, A.",0.707
2,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://je.jacq.org/JE00010154,2.0,"Aaronsohn, A.","Aaronsohn, A.",1.000
4,Abarca,R.,,,,,,,"Abarca, R.",1,https://herbarium.bgbm.org/object/B101153811,3.0,"Abarca, R.","Abarca, L.",0.879
5,Abarca,R.J.,,,,,,,"Abarca, R.J.",15,https://herbarium.bgbm.org/object/B101139201,4.0,"Abarca, R.J.","Abarca, L.",0.800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20841,Ţopa,E.,,,,,,,"Ţopa, E.",4,https://herbarium.bgbm.org/object/B100124910,17547.0,"Ţopa, E.","Țopa, E.",1.000
20842,Żarnowiec,J.,,,,,,,"Żarnowiec, J.",7,https://je.jacq.org/JE04006443,17548.0,"Żarnowiec, J.","Żarnowiec, J.T.",0.943
20843,Żelany,J.,,,,,,,"Żelany, J.",1,https://herbarium.bgbm.org/object/B100220196,17549.0,"Żelany, J.","Ważny, J.",0.670
20845,Ždanova,O.,,,,,,,"Ždanova, O.",6,https://herbarium.bgbm.org/object/B100263330,17550.0,"Ždanova, O.","Baranova, O.G.",0.599
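
The cell above uses a left merge followed by `dropna`, whereas the full-name cell uses the default inner merge. On toy data (values borrowed from the output above, frame names hypothetical) the following sketch suggests that both keep the same rows, and that the left merge is what turns `old_index` into floats such as `2.0`.

```python
import pandas as pd

toy_collectors = pd.DataFrame({"name": ["Aaronsohn, A.", "Abaouz, A.", "Abarca, R."]})
toy_matches = pd.DataFrame({
    "namematch_source_data": ["Aaronsohn, A.", "Abarca, R."],
    "old_index": [2, 3],
    "namematch_similarity": [1.0, 0.879],
})

# variant 1: left merge, then drop collectors without a match
left_then_drop = pd.merge(
    toy_collectors, toy_matches,
    left_on="name", right_on="namematch_source_data", how="left"
).dropna(subset=["namematch_similarity"])

# variant 2: inner merge keeps only collectors that have a match
inner = pd.merge(
    toy_collectors, toy_matches,
    left_on="name", right_on="namematch_source_data", how="inner"
)

print(left_then_drop["old_index"].dtype, inner["old_index"].dtype)
```

The unmatched row introduces NaN during the left merge, so the integer column becomes a float column and stays that way after `dropna`.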


In [28]:
# join (only) full name matches with collector source data
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed' , right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches_fullname # 429 rows × 15 columns

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
0,Abreu,Guilherme,,de,,,,,"Abreu, Guilherme",1,http://id.snsb.info/snsb/collection/22086/3086...,0,"Abreu, Guilherme","Rau, Guilherme",0.678
1,Adá,García,,,,,,,"Adá, García",1,https://herbarium.bgbm.org/object/B100296455,1,"Adá, García","Adá, Ramón García",0.651
2,Aghababyan,Mvon,,,,,,,"Aghababyan, Mvon",96,https://herbarium.bgbm.org/object/B100576238,2,"Aghababyan, Mvon","Aghababyan, Vladislav",0.729
3,Aichenhayn,Aichinger,,von,,,,,"Aichenhayn, Aichinger",1,https://dr.jacq.org/DR073481,3,"Aichenhayn, Aichinger","Aichinger, Erwin",0.521
4,Allorge,Valia Selitsky,,,,,,,"Allorge, Valia Selitsky",1,https://je.jacq.org/JE04004935,4,"Allorge, Valia Selitsky","Allorge, Valentine",0.514
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
424,Zhuo,Zhou,,,,,,,"Zhuo, Zhou",63,https://herbarium.bgbm.org/object/B100517202,424,"Zhuo, Zhou","Zhou, Zhuo",0.725
425,Zickendrath,Ernst,,,,,,,"Zickendrath, Ernst",4,https://je.jacq.org/JE04006629,425,"Zickendrath, Ernst","Zickendrath, Ernst",1.000
426,Ziz,Johann Baptist,,,,,,,"Ziz, Johann Baptist",1,https://je.jacq.org/JE00017744,426,"Ziz, Johann Baptist","Ziz, Johann Baptist",1.000
427,Zollinger,Heinrich,,,,,,,"Zollinger, Heinrich",2,https://herbarium.bgbm.org/object/B101097046,427,"Zollinger, Heinrich","Zollinger, Heinrich",1.000


In [29]:
# join all name matches together
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_similarity', 'family'], ascending=[False, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,old_index,namematch_source_data,namematch_resource_data,namematch_similarity
2,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,https://je.jacq.org/JE00010154,2.0,"Aaronsohn, A.","Aaronsohn, A.",1.0
7,Abbe,E.C.,,,,,,,"Abbe, E.C.",2,https://herbarium.bgbm.org/object/B100241637,6.0,"Abbe, E.C.","Abbe, E.C.",1.0
11,Abbott,J.R.,,,,,,,"Abbott, J.R.",80,https://herbarium.bgbm.org/object/B100181131,9.0,"Abbott, J.R.","Abbott, J.R.",1.0
13,Abbott,W.L.,,,,,,,"Abbott, W.L.",4,http://id.snsb.info/snsb/collection/504820/626...,11.0,"Abbott, W.L.","Abbott, W.L.",1.0
21,Abedin,S.,,,,,,,"Abedin, S.",14,https://herbarium.bgbm.org/object/B100046632,14.0,"Abedin, S.","Abedin, S.",1.0


Save the plain name matching results only ...

In [30]:
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file='data/results_bgbm_collectors_vs_wikidata-botanists_cossim-similarity_plain-names_%s.csv' % (
    # "20230821"
    time.strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_all_matches.to_csv(this_output_file)

print("Wrote plain name matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # bit shift: x >> 10 divides by 2**10 = 1024, i.e. converts bytes to kilobytes
)

Wrote plain name matches of collector names into data/results_bgbm_collectors_vs_wikidata-botanists_cossim-similarity_plain-names_20230821.csv (2188 kB)
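
The kilobyte figure in the message comes from an integer bit shift. As a quick check with a hypothetical file size (the real byte size is not shown in this notebook), shifting right by 10 bits gives the same result as integer division by 1024:

```python
size_in_bytes = 2_240_512  # hypothetical byte size, chosen for illustration

size_in_kb_shift = size_in_bytes >> 10       # drop the lowest 10 bits
size_in_kb_division = size_in_bytes // 1024  # integer division by 2**10

print(size_in_kb_shift, size_in_kb_division)
```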


In [31]:
# old code # Join Wikidata items
# df_avh_matches_wikidata = pd.merge(df_avh_matches, df_wikidata                , left_on='namematch_resource_data', right_on='canonical_string', how='left')
# df_avh_matches_wikidata = pd.merge(df_avh_matches_wikidata, df_wikidata_unique, left_on='namematch_resource_data', right_on='canonical_string', how='left')
# df_avh_matches_wikidata.rename(columns={df_avh_matches_wikidata.columns.tolist()[-1]: 'dup_count'}, inplace=True)


In [32]:
# now merge the matching data with the Wikidata data on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

In [33]:
print("Show example data of «Kotschy…» with Cosine Similarity we had 0.6 … 1.0 (Nearest Neighbour distances were from 0.0 to almost 1.0)")
print("There was a match of Kotschyi, C.G.T. → Kotschy, T.   → 0.614 → http://www.wikidata.org/wiki/Q113299  with lower similarity and a correct match to Carl Georg Theodor … :-)")
print("There was a match of Kotschy, C.G.T.  → Kotschy, C.F. → 0.824 → http://www.wikidata.org/entity/Q86842 with higher similarity but it is probably a wrong match of Carl Georg Theodor → to Carl Friedrich … :-/")

criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].map(lambda x: x.startswith('Kotschy'))
collectors_matches_g1_merged_wikidata[criterion].get([
    # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
    'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
    'namematch_source_data', 'namematch_resource_data', 'namematch_similarity', 
    # 'canonical_string_fullname', 
    'itemLabel', 'wikidata_link'
])

Show example data of «Kotschy…» with Cosine Similarity we had 0.6 … 1.0 (Nearest Neighbour distances were from 0.0 to almost 1.0)
There was a match of Kotschyi, C.G.T. → Kotschy, T.   → 0.614 → http://www.wikidata.org/wiki/Q113299  with lower similarity and a correct match to Carl Georg Theodor … :-)
There was a match of Kotschy, C.G.T.  → Kotschy, C.F. → 0.824 → http://www.wikidata.org/entity/Q86842 with higher similarity but it is probably a wrong match of Carl Georg Theodor → to Carl Friedrich … :-/


Unnamed: 0,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,itemLabel,wikidata_link
9162,2,https://dr.jacq.org/DR049432,Kotschy,"Kotschy, T.",0.849,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
9163,37,http://id.snsb.info/snsb/collection/16719/2549...,"Kotschy, K.G.T.","Kotschy, T.",0.723,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
9164,310,http://id.snsb.info/snsb/collection/117808/176...,"Kotschy, T.","Kotschy, T.",1.0,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
9165,5,https://herbarium.bgbm.org/object/B100526350,"Kotschy, Th","Kotschy, T.",0.888,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
9166,1,https://herbarium.bgbm.org/object/B100160086,"Kotschyi, C.G.T.","Kotschy, T.",0.614,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
9167,1,http://id.snsb.info/snsb/collection/22980/3175...,"Kotschy, C.G.","Kotschy, C.F.",0.895,Carl Friedrich Kotschy,http://www.wikidata.org/wiki/Q86842
9168,2494,http://id.snsb.info/snsb/collection/108230/167...,"Kotschy, C.G.T.","Kotschy, C.F.",0.824,Carl Friedrich Kotschy,http://www.wikidata.org/wiki/Q86842
19118,2,https://je.jacq.org/JE00022436,"Kotschy, Carl Georg Theodor","Kotschy, Theodor",0.746,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
19119,1,https://herbarium.bgbm.org/object/B101113772,"Kotschy, Karl Georg Th","Kotschy, Theodor",0.539,Theodor Kotschy,http://www.wikidata.org/wiki/Q113299
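
The Kotschy rows show that a higher cosine similarity does not guarantee the right person. To make the mechanism visible, here is a minimal, self-contained sketch that compares raw character trigram counts. It is an illustration only and not the pipeline used in this notebook: the helper names `char_ngrams` and `cosine_similarity`, the choice of trigrams, and the lower-casing are assumptions, and the scores it prints will not equal the `namematch_similarity` values above.

```python
from collections import Counter
from math import sqrt


def char_ngrams(name: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lower-cased name string."""
    text = name.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity of the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the n-grams of a; a Counter returns 0 for missing keys
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    # Strings shorter than n have no n-grams; treat them as dissimilar
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


source = "Kotschy, C.G.T."
for candidate in ("Kotschy, C.F.", "Kotschy, T."):
    print(f"{source} vs {candidate}: {cosine_similarity(source, candidate):.3f}")
```

Even in this simplified form, «Kotschy, C.F.» scores higher than «Kotschy, T.» against «Kotschy, C.G.T.», because the shared `, C.` part contributes more common trigrams than the single matching initial `T`. String similarity alone therefore cannot tell Carl Friedrich from Carl Georg Theodor.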


In [34]:
pprint.pprint(collectors_matches_g1_merged_wikidata.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
       'old_index', 'namematch_source_data', 'namematch_resource_data',
       'namematch_similarity', 'item', 'itemLabel', 'surname', 'initials',
       'canonical_string', 'canonical_string_fullname', 'orcid', 'viaf',
       'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb',
       'wye', 'wikidata_link', 'orcid_link', 'harv_link', 'ipni_link',
       'bionomia_link'],
      dtype='object')


In [35]:
collectors_matches_g1_merged_wikidata.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_collectors_count,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,A. Cano,E.,,,,,,,"A. Cano, E.",1,...,Q42335752,1964.0,2021.0,,,http://www.wikidata.org/wiki/Q42335752,https://orcid.org/0000-0003-3529-9439,,,https://bionomia.net/Q42335752
1,Cano-E,A.A.,,,,,,,"Cano-E, A.A.",2,...,Q42335752,1964.0,2021.0,,,http://www.wikidata.org/wiki/Q42335752,https://orcid.org/0000-0003-3529-9439,,,https://bionomia.net/Q42335752
2,Aaiki,,,,,,,,Aaiki,1,...,,,,,,http://www.wikidata.org/wiki/Q33686006,,,https://www.ipni.org/a/20029813-1,
3,Aaronsohn,A.,,,,,,,"Aaronsohn, A.",3,...,Q2086130,1876.0,1919.0,,,http://www.wikidata.org/wiki/Q2086130,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23-1,https://bionomia.net/Q2086130
4,Abarca,R.,,,,,,,"Abarca, R.",1,...,,,,,,http://www.wikidata.org/wiki/Q36610614,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/34769-1,


In [36]:
# Select useful columns for data results; .copy() creates an independent data frame,
# which avoids the pandas SettingWithCopyWarning when the result is modified later
collectors_wikidata_cossim = collectors_matches_g1_merged_wikidata[
    ['canonical_string_collector_parsed', 'family', 'given',
     'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
     'namematch_source_data', 'namematch_resource_data', 'namematch_similarity',
     'item', 'canonical_string', 'itemLabel',
     'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb']
].copy()

# Order by similarity (descending), then by family name (ascending)
collectors_wikidata_cossim = collectors_wikidata_cossim.sort_values(
    by=['namematch_similarity', 'family'], ascending=[False, True]
)

collectors_wikidata_cossim # the match «Kotschy, Karl Georg Th» (collector data) → «Kotschy, Theodor» (Wikidata) has a similarity of only 0.539 but is the correct person


Unnamed: 0,canonical_string_collector_parsed,family,given,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,namematch_source_data,namematch_resource_data,namematch_similarity,item,canonical_string,...,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb
3,"Aaronsohn, A.",Aaronsohn,A.,3,https://je.jacq.org/JE00010154,"Aaronsohn, A.","Aaronsohn, A.",1.0,http://www.wikidata.org/entity/Q2086130,"Aaronsohn, A.",...,,2795076,0000 0001 0948 8581,30592,23-1,Aarons.,Q2086130,1876.0,1919.0,
8,"Abbe, E.C.",Abbe,E.C.,2,https://herbarium.bgbm.org/object/B100241637,"Abbe, E.C.","Abbe, E.C.",1.0,http://www.wikidata.org/entity/Q10274118,"Abbe, E.C.",...,,101473381,0000 0000 7237 8505,30066,26-1,Abbe,Q10274118,1905.0,2000.0,
11,"Abbott, J.R.",Abbott,J.R.,80,https://herbarium.bgbm.org/object/B100181131,"Abbott, J.R.","Abbott, J.R.",1.0,http://www.wikidata.org/entity/Q18982386,"Abbott, J.R.",...,,,,,20015671-1,J.R.Abbott,,1968.0,,
13,"Abbott, W.L.",Abbott,W.L.,4,http://id.snsb.info/snsb/collection/504820/626...,"Abbott, W.L.","Abbott, W.L.",1.0,http://www.wikidata.org/entity/Q635604,"Abbott, W.L.",...,,1545420,0000 0000 3712 5377,27518,,,Q635604,1860.0,1936.0,
17,"Abedin, S.",Abedin,S.,14,https://herbarium.bgbm.org/object/B100046632,"Abedin, S.","Abedin, S.",1.0,http://www.wikidata.org/entity/Q16142861,"Abedin, S.",...,,5859151837993620520007,,69097,35239-1,Abedin,,1952.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8695,"Kokeil, F.",Kokeil,F.,1,https://herbarium.bgbm.org/object/B100244933,"Kokeil, F.","Keil, F.",0.5,http://www.wikidata.org/entity/Q1447763,"Keil, F.",...,,45045812,,,21283-1,Keil,,1822.0,1876.0,
9724,"Latelo, M.G.",Latelo,M.G.,1,https://herbarium.bgbm.org/object/B100005981,"Latelo, M.G.","Melo, M.A.",0.5,http://www.wikidata.org/entity/Q88839898,"Melo, M.A.",...,,,,,20037542-1,M.A.Melo,,,,
11610,"Molero, C.",Molero,C.,1,https://herbarium.bgbm.org/object/B100720720,"Molero, C.","Poveda-Molero, J.C.",0.5,http://www.wikidata.org/entity/Q88845286,"Poveda-Molero, J.C.",...,,,,,20039862-1,Poveda-Molero,,,,
12616,Ohnesorge,Ohnesorge,,1,https://herbarium.bgbm.org/object/B101142476,Ohnesorge,"Desor, É.",0.5,http://www.wikidata.org/entity/Q84445,"Desor, É.",...,,106994079,0000 0001 1696 4208,,,,Q84445,1811.0,1882.0,


In [37]:
# Kotschy example again with all merged columns
# pd.set_option("display.max_columns", None) # default ?20
#
# criterion = collectors_wikidata_cossim['canonical_string_collector_parsed'].map(lambda x: x.startswith('Kotschy'))
# print("Show example of «Kotschy…» with similarities of 0.5 … 1.0")
# print("There was a match of Kotschy, C.G.T. → Kotschy, C.F. → 0.824 → http://www.wikidata.org/entity/Q86842 and it is probably a wrong match of Carl Georg Theodor → to Carl Friedrich … :-/")
# collectors_wikidata_cossim[criterion]

In [38]:
# TODO further evaluation or filtering, counting, clean-up, etc.
if not os.path.exists('data'):
    os.makedirs('data')

# bgbm_collectors_cosine-similarity_wikidata-botanists_%s.csv
this_output_file='data/results_bgbm_collectors_vs_wikidata-botanists_cossim-similarity_merged-data_%s.csv' % (
    # "20230821"
    time.strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_wikidata_cossim.to_csv(this_output_file)

print("Wrote matches of collector names into %s (%d kB)" %
    (this_output_file, os.path.getsize(this_output_file) >> 10) # right shift by 10 bits = integer division by 1024 (bytes → kilobytes)
)

Wrote matches of collector names into data/results_bgbm_collectors_vs_wikidata-botanists_cossim-similarity_merged-data_20230821.csv (4376 kB)
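
For the open TODO on further evaluation, one conceivable plausibility check is to compare the initials of the collector name with those of the matched Wikidata name. The sketch below is only a suggestion under stated assumptions and is not part of this notebook: the helpers `split_name` and `initials_compatible` are hypothetical, names are assumed to follow the `Family, Given` pattern of `canonical_string`, and the rule (same family name, and the shorter list of initials must occur in order within the longer one) is deliberately simple.

```python
import re


def split_name(canonical: str) -> tuple[str, list[str]]:
    """Split 'Family, G.N.' into a lower-cased family name and upper-case initials."""
    family, _, given = canonical.partition(",")
    # Every run of letters in the given part contributes its first letter
    initials = [token[0].upper() for token in re.findall(r"[^\W\d_]+", given)]
    return family.strip().lower(), initials


def initials_compatible(source: str, candidate: str) -> bool:
    """True if family names agree and the shorter initials list is an
    ordered subsequence of the longer one."""
    src_family, src_initials = split_name(source)
    cand_family, cand_initials = split_name(candidate)
    if src_family != cand_family:
        return False
    shorter, longer = sorted((src_initials, cand_initials), key=len)
    remaining = iter(longer)
    # 'in' on an iterator consumes it, so this checks order as well
    return all(initial in remaining for initial in shorter)


pairs = [
    ("Kotschy, C.G.T.", "Kotschy, C.F."),
    ("Kotschy, C.G.T.", "Kotschy, T."),
    ("Kotschy, Karl Georg Th", "Kotschy, Theodor"),
]
for source, candidate in pairs:
    print(source, "→", candidate, ":", initials_compatible(source, candidate))
```

With this rule the pair «Kotschy, C.G.T.» → «Kotschy, C.F.» is flagged as incompatible, while the lower-scoring matches to Theodor Kotschy pass. The check is strict about the family name, so a spelling variant such as «Kotschyi» would be rejected and would still need manual review.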


TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
occurrenceID_collectors_count | number of occurrenceIDs recorded for one particular collector name
occurrenceID_collectors_firstsample | one sample occurrenceID for this collector name
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_source_data | matched name of the collector data set
namematch_resource_data | name from Wikidata that the collector name was matched to
namematch_similarity | calculated cosine-similarity
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))