# Match BGBM Collectors to Wikidata Items Using Cosine Similarity

In short, we
- match the `canonical_string` of the Wikidata items against the `canonical_string` of the collectors (the collector names were parsed beforehand into single names using <https://libraries.io/rubygems/dwc_agent>)
- follow the example of Niels Klazenga: <https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb>
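
To make the idea concrete, here is a minimal pure-Python sketch of cosine similarity on character trigrams. It is an illustration only: it uses plain trigram counts instead of the TF-IDF weights used further down, and the helper names `trigram_counts` and `cosine_similarity` are hypothetical, not part of the notebook.

```python
from collections import Counter
from math import sqrt

def trigram_counts(name, n=3):
    """Count the overlapping character n-grams of a lower-cased name."""
    text = name.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty name is similar to nothing
    return dot / (norm_a * norm_b)

same = cosine_similarity(trigram_counts("Kotschy, T."), trigram_counts("Kotschy, T."))
close = cosine_similarity(trigram_counts("Kotschy, Th"), trigram_counts("Kotschy, T."))
far = cosine_similarity(trigram_counts("Kotschy, T."), trigram_counts("Abarca, L."))
print(same, close, far)
```

Names that share most trigrams score close to 1, names without a common trigram score 0.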

## Load Wikidata Data Set

[Jupyter Notebook for creating the botanist Wikidata data set](./create_wikidata_datasets_botanists.ipynb) (TODO: improve query properties) 

Out of the Wikidata items data set we create a data frame with unique canonical name strings and their counts.

In [1]:
import pandas as pd
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,wye
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.",,43340073,0000 0001 1630 5464,1373.0,6129-1,M.Bieb.,Q66612,1768.0,1826.0,,
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.",,20328622,0000 0001 1604 8680,42741.0,619-1,Behr,Q66934,1818.0,1904.0,,
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.",,47016953,0000 0000 8343 3899,1101.0,12818-1,Schaeff.,,1718.0,1790.0,,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.",,20426762,0000 0001 1749 2732,135.0,4855-1,Klotzsch,Q67003,1805.0,1860.0,,
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.",,59847236,0000 0001 1653 0899,73782.0,23266-1,Menge,,1808.0,1880.0,,


In [2]:
# compile data having only unique canonical strings:
# group by canonical name/string and count duplicated names
wikidata_unique = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
cols = wikidata_unique.columns.tolist()

wikidata_unique

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O.H.",1
1,"(1835-1906), G.A.F.E.",1
2,"(1873-1926), S.S.",1
3,"(1888–1973), G.A.",1
4,"(1904-1990), J.J.",1
...,...,...
61296,"Șerbanescu, I.",1
61297,"Ștefureac, T.",1
61298,"Țopa, E.",1
61299,"Ḥalwaǧī, R.",1
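
The two-level column header in the output above results from passing a list of functions to `agg`. As a possible alternative (a sketch on made-up miniature data, with the hypothetical column name `item_count`), named aggregation yields flat column names:

```python
import pandas as pd

# hypothetical miniature of the Wikidata data frame
wikidata_demo = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3"],
    "canonical_string": ["Abarca, L.", "Abarca, L.", "Kotschy, T."],
})

# named aggregation: one flat output column per keyword argument
unique_demo = (
    wikidata_demo.groupby("canonical_string")
    .agg(item_count=("item", "count"))
    .reset_index()
)
print(unique_demo.columns.tolist())
print(unique_demo)
```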


## Load Collectors Data Set

Data sources:

- option 1: Jupyter Notebook for `create_bgbm_botanypilot_collectors_dataset.ipynb` from SPARQL (not in this official documentation yet)
- option 2: Jupyter Notebook for [`create_bgbm_gbif-occurrence_collectors_dataset.ipynb`](./create_bgbm_gbif-occurrence_collectors_dataset.ipynb)

Then parse the collector names into single, separate names using `dwcagent`, a Ruby gem available at <https://rubygems.org/gems/dwc_agent>:

- use ruby script `./bin/agent_parse4tsv.rb` for parsing text lines like `"Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B."`
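
To illustrate what "single, separate names" means for the example line, here is a deliberately naive Python sketch. It is not how `dwc_agent` works; it only relies on the assumption that, as in this example, names are written `Family,Given` without a space and team members are separated by `, ` or `&`. The function name `split_team` is hypothetical.

```python
import re

def split_team(recorded_by):
    """Naively split a collector team string into single 'Family,Given' names."""
    # separators assumed here: '&' or a comma followed by whitespace
    parts = re.split(r"\s*&\s*|,\s+", recorded_by)
    return [part.strip() for part in parts if part.strip()]

team = split_team("Abbe,L.B., Abbe,E.C., Smitinand,T. & Rollet,B.")
print(team)
```

Real collector strings are far less regular, which is why the parsing is left to the gem.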

TODO
```
refactor df_avh = → = collectors
refactor df_avh['label'] = → = collectors['canonical_string_collector_parsed']
```

In [3]:
# unique names parsed already by ruby gem package: dwcagent

# collectors = pd.read_csv("data/bgbm_collectors_20230510_1429_single-line_parsed_unique_names.tab", sep="\t")
collectors = pd.read_csv("data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv", sep="\t")

collectors.dropna(subset=['family'], inplace=True) # remove rows where family is NA, e.g. originally «??» and the like
collectors.sort_values(by=['family', 'given','occurrenceID_first'], inplace=True)
collectors


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
2,A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397
39762,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305
5,Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154
26985,Abaouz,A.,,,,,,,3,https://herbarium.bgbm.org/object/B100217620
26989,Abaouz,A.,,,,,,,2,https://herbarium.bgbm.org/object/B100326682
...,...,...,...,...,...,...,...,...,...,...
66575,Ždanova,O.,,,,,,,5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714


### Check Composition of Parsed Collector Data

In [6]:
# test particle for NA values (perhaps particle is the most important)
test_collectors = collectors.loc[collectors.particle.notna()]
print("names with name particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with name particle (534 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
21037,Abreu,Guilherme,,de,,,,,1,http://id.snsb.info/snsb/collection/22086/3086...
4096,Aguilar,M.L.,,Reyna de,,,,,4,https://herbarium.bgbm.org/object/B100031063
60867,Aguilar,M.L.,,Reyna de,,,,,26,https://herbarium.bgbm.org/object/B100031454
16765,Aguilar,M.L.,,Reyna de,,,,,2,https://herbarium.bgbm.org/object/B100031644
46755,Aguilar,M.L.,,Reyna de,,,,,3,https://herbarium.bgbm.org/object/B100031648


In [7]:
# test suffix for NA values
test_collectors = collectors.loc[collectors.suffix.notna()]
print("names with name suffix (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with name suffix (15 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
17288,August,Friedrich,II.,,,,,,21,https://dr.jacq.org/DR014960
58907,Dogma,I.J.,Jr.,,,,,,1,https://je.jacq.org/JE04008848
17017,Forsyth,W.,jr.,,,,,,1,http://id.snsb.info/snsb/collection/504525/625...
801,Grear,J.W.,Jr.,,,,,,1,https://herbarium.bgbm.org/object/B100525791
26194,Grear,J.W.,Jr.,,,,,,2,https://herbarium.bgbm.org/object/B100525792


In [8]:
# test dropping_particle for NA values
test_collectors = collectors.loc[collectors.dropping_particle.notna()]
print("names with name dropping_particle (%s records)…" % len(test_collectors.index))
test_collectors.head()

names with name dropping_particle (0 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first


In [9]:
test_collectors = collectors.loc[collectors.appellation.notna()]
print("names with name appellation (%s records)…" % len(test_collectors.index))
test_collectors.head()
# Remark: “Fr Sennen” in https://herbarium.bgbm.org/object/B100127256 is Frère Sennen (i.e. Brother Sennen), so the appellation “Fr” is parsed as expected

names with name appellation (1 records)…


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_count,occurrenceID_first
17120,Sennen,,,,,,Fr,,2,https://herbarium.bgbm.org/object/B100127256
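
The four checks above can also be summarised in one call. The following is a sketch on made-up rows; `count_filled` is a hypothetical helper, and the field names are those of the parsed collector data.

```python
import pandas as pd

def count_filled(df, fields):
    """Number of non-NA values per parsed name field."""
    return df[fields].notna().sum()

# made-up rows that mimic the parsed collector columns
demo = pd.DataFrame({
    "family": ["Abreu", "Grear", "Sennen"],
    "particle": ["de", None, None],
    "suffix": [None, "Jr.", None],
    "appellation": [None, None, "Fr"],
})
filled = count_filled(demo, ["particle", "suffix", "appellation"])
print(filled)
```

Applied to `collectors`, the same call would give the record counts printed by the individual cells.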


Compile the `canonical_string…` column of the collector data, against which the Wikidata names are matched later:

In [10]:
# TODO improve (perhaps) the composition of the canonical string out of parsed name fragments
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
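      # TODO also use the other dwc_parsed fields in canonical_string_collector_parsed where they are not NaN
      # NOTE: any(collectors.particle.isna()) is a single boolean for the whole column, not a per-row test;
      # it is True as soon as one particle is NA, so "family, given" is used for every row
      # and the particle branch below is never taken for this data set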
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family + ", " + collectors.given
  )
)
# move column canonical_string_collector_parsed after title
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,occurrenceID_count,occurrenceID_first
66575,Ždanova,O.,,,,,,,"Ždanova, O.",5,https://herbarium.bgbm.org/object/B100263330
32851,Ždanova,O.,,,,,,,"Ždanova, O.",1,https://herbarium.bgbm.org/object/B100263331
66576,Žíla,V.,,,,,,,"Žíla, V.",3,https://herbarium.bgbm.org/object/B100009590
66577,Волкова,Е.,,,,,,,"Волкова, Е.",1,https://herbarium.bgbm.org/object/B100530714
66578,Жирова,O.,,,,,,,"Жирова, O.",1,https://herbarium.bgbm.org/object/B100630811
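
If the particle is meant to become part of the canonical string, the test has to be made per row. The following is one possible sketch, not the notebook's current behaviour: `compose_canonical` is a hypothetical helper, and the target format `particle family, given` is taken from the `else` branch of the cell above.

```python
import pandas as pd

def compose_canonical(df):
    """Row-wise 'particle family, given', leaving out missing parts."""
    family = df["family"]
    # prepend the particle only in rows that have one
    with_particle = (df["particle"] + " " + family).where(df["particle"].notna(), family)
    # append ", given" only in rows that have a given name
    return (with_particle + ", " + df["given"]).where(df["given"].notna(), with_particle)

# made-up rows that mimic the parsed collector columns
demo = pd.DataFrame({
    "family": ["Abreu", "Aaiki", "Abarca"],
    "given": ["Guilherme", None, "R."],
    "particle": ["de", None, None],
})
demo["canonical"] = compose_canonical(demo)
print(demo["canonical"].tolist())
```

Whether a leading particle helps or hurts the later matching depends on how the Wikidata `canonical_string` treats particles, so this would need checking before use.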


In [11]:
# group and aggregate data: 
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    occurrenceID_collectors_count= ('occurrenceID_count', 'sum'), # sum up the occurrence counts of all rows with this name
    occurrenceID_collectors_firstsample=('occurrenceID_first', lambda x: list(x)[0]) # custom function, to get the first entry
).reset_index()

collectors_unique

Unnamed: 0,canonical_string_collector_parsed,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_collectors_count,occurrenceID_collectors_firstsample
0,"A. Cano, E.",A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397
1,Aaiki,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305
2,"Aaronsohn, A.",Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154
3,"Abaouz, A.",Abaouz,A.,,,,,,,5,https://herbarium.bgbm.org/object/B100217620
4,"Abarca, R.",Abarca,R.,,,,,,,1,https://herbarium.bgbm.org/object/B101153811
...,...,...,...,...,...,...,...,...,...,...,...
20844,"Żelazny, J.",Żelazny,J.,,,,,,,4,https://herbarium.bgbm.org/object/B100344466
20845,"Ždanova, O.",Ždanova,O.,,,,,,,6,https://herbarium.bgbm.org/object/B100263330
20846,"Žíla, V.",Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590
20847,"Волкова, Е.",Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714
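
A remark on the aggregation above: `lambda x: list(x)[0]` returns the value of the first row of each group even if it is NA, whereas the built-in `'first'` aggregation of pandas skips NA values by default. The sketch below contrasts the two on made-up rows; the output column names are hypothetical.

```python
import pandas as pd

# made-up rows: the first "Grear" row has no suffix, the second has one
demo = pd.DataFrame({
    "name": ["Grear, J.W.", "Grear, J.W.", "Dogma, I.J."],
    "suffix": [None, "Jr.", "Jr."],
    "occurrenceID_count": [1, 2, 1],
})

summary = demo.groupby("name").agg(
    suffix_first_row=("suffix", lambda x: x.iloc[0]),  # first row, even if it is NA
    suffix_first_non_na=("suffix", "first"),           # first non-NA value
    occurrenceID_count=("occurrenceID_count", "sum"),  # total over all rows of the name
)
print(summary)
```

Which variant is preferable depends on whether a fragment found in any row of a name should be kept for the unique name.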


In [12]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

## Set Up the Cosine Similarity and Text Search

See 
- for the application code https://github.com/nielsklazenga/avh-collectors/blob/master/cosine_similarity.ipynb
- for reading on the topic: Taylor, Josh. 2019. ‘Fuzzy Matching at Scale’. Towards Data Science (blog). 2 July 2019. https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536.

The `ngrams` function is used later as the analyzer of the TF-IDF vectorizer.

In [13]:
import pandas as pd, numpy as np, re
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn # pip install sparse-dot-topn

def get_matches_df(sparse_matrix, A, B, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = A[sparserows[index]]
        right_side[index] = B[sparsecols[index]]
        similarity[index] = round(sparse_matrix.data[index], 3)

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})

!pip install ftfy
from ftfy import fix_text

def ngrams(string, n=3):
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() # remove non-ASCII chars (accented and non-Latin letters are dropped, not transliterated)
    string = string.lower()
    chars_to_remove = [")","(","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.replace('.', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    string = string.strip() # note: this removes the padding added above again
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


In [14]:
# some example data

[wikidata_unique['canonical_string'].at[row] for row in range(3)]

['(-Walraevens), O.H.', '(1835-1906), G.A.F.E.', '(1873-1926), S.S.']

In [15]:
print("Example from name:", ngrams('Klazenga, N.'))
print("Example from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("Example from match-test:", ngrams(wikidata_unique['canonical_string'].at[0]))

Example from name: ['Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N']
Example from collectors: ['Aai', 'aik', 'iki']
Example from match-test: ['Wal', 'alr', 'lra', 'rae', 'aev', 'eve', 'ven', 'ens', 'ns ', 's O', ' O ', 'O H']
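
Because `ngrams` drops every non-ASCII character, a name such as `Žíla` is reduced to `la` before the trigrams are built, and Cyrillic names yield no trigrams at all. A possible alternative, not used in this notebook, is to decompose accented letters and remove only the combining marks. The sketch below uses the standard library; `strip_diacritics` is a hypothetical helper.

```python
import unicodedata

def strip_diacritics(text):
    """Replace accented letters by their base letters (Ž -> Z, í -> i)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # keep everything except the combining marks split off by NFKD
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_diacritics("Žíla, V.")
cedilla = strip_diacritics("Ţopa, E.")  # T with cedilla
comma = strip_diacritics("Țopa, E.")    # T with comma below
print(plain, cedilla, comma)
```

Letters without a canonical decomposition and non-Latin scripts remain unchanged by this helper, so they would still be removed by the ASCII step unless that step is changed as well.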


In [24]:
import time
time_start = time.time()

# Vectorize Wikidata name (use fit_transform())
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix_clean = vectorizer.fit_transform(wikidata_unique['canonical_string'])

# Vectorize collectors’ names (use transform())
tf_idf_matrix_dirty = vectorizer.transform(collectors_unique['canonical_string_collector_parsed'])
print("data vectorized in %s s" % (time.time() - time_start))

# Calculate Cosine Similarity; keep only the best match (ntop=1) and only if the similarity is greater than 0.5 (lower_bound=0.5)
# (lower_bound: a threshold that the element of A*B must be greater than 
#  https://github.com/ing-bank/sparse_dot_topn/blob/3f40611b0553b50c27f23c7dcffc3ca9a9e8f5b5/sparse_dot_topn/awesome_cossim_topn.py#L26C9-L26C78)
time_start = time.time()
matches = awesome_cossim_topn( 
    tf_idf_matrix_dirty, 
    tf_idf_matrix_clean.transpose(), 
    ntop=1, 
    lower_bound=0.5 
)
print("matches calculated in %s s" % (time.time() - time_start))

data vectorized in 3.8578572273254395 s
matches calculated in 0.6409316062927246 s
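
What the cell above computes can be sketched with small dense NumPy arrays. `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), so the dot product of two rows is their cosine similarity. The function `best_match` is a hypothetical illustration of "top 1 above a lower bound"; it does not reproduce the sparse implementation of `awesome_cossim_topn`.

```python
import numpy as np

def best_match(dirty, clean, lower_bound=0.5):
    """For each row of `dirty`, index and similarity of the most similar row of `clean`.

    The index is -1 where the best similarity does not exceed `lower_bound`.
    """
    def l2_normalise(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows as they are
        return m / norms

    sims = l2_normalise(dirty) @ l2_normalise(clean).T  # cosine similarities
    best = sims.argmax(axis=1)
    score = sims[np.arange(sims.shape[0]), best]
    best = np.where(score > lower_bound, best, -1)
    return best, score

# made-up feature vectors: 3 "dirty" names, 2 "clean" names
dirty = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
clean = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
best, score = best_match(dirty, clean)
print(best, score)
```

A dense matrix of all pairwise similarities would be far too large for the real data, which is the reason for using the sparse top-n routine.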


In [25]:
# construct the matching data frame
matches_df = get_matches_df(
    matches, 
    collectors_unique['canonical_string_collector_parsed'].reset_index()['canonical_string_collector_parsed'], 
    wikidata_unique['canonical_string'], 
    top=0
)

matches_df

Unnamed: 0,left_side,right_side,similarity
0,"A. Cano, E.","Cano, E.B.",0.664
1,Aaiki,"Naiki, A.",0.707
2,"Aaronsohn, A.","Aaronsohn, A.",1.000
3,"Abarca, R.","Abarca, L.",0.879
4,"Abarca, R.J.","Abarca, L.",0.800
...,...,...,...
17960,"Ţopa, E.","Țopa, E.",1.000
17961,"Żarnowiec, J.","Żarnowiec, J.T.",0.943
17962,"Żelany, J.","Ważny, J.",0.667
17963,"Ždanova, O.","Baranova, O.G.",0.599


Note: merging roughly 18.77 million collector matches with Wikidata was too expensive to compute. Hence the decision to make the data unique by `canonical_string_collector_parsed` first.

In [26]:
# Join matches data frame to collectors data frame
# TODO CONTINUE (20230719)
collectors_matches = pd.merge(
    collectors_unique, matches_df, 
    left_on='canonical_string_collector_parsed', right_on='left_side', 
    how='left'
)

collectors_matches # 18775907 rows × 14 columns

Unnamed: 0,canonical_string_collector_parsed,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,left_side,right_side,similarity
0,"A. Cano, E.",A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397,"A. Cano, E.","Cano, E.B.",0.664
1,Aaiki,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305,Aaiki,"Naiki, A.",0.707
2,"Aaronsohn, A.",Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154,"Aaronsohn, A.","Aaronsohn, A.",1.000
3,"Abaouz, A.",Abaouz,A.,,,,,,,5,https://herbarium.bgbm.org/object/B100217620,,,
4,"Abarca, R.",Abarca,R.,,,,,,,1,https://herbarium.bgbm.org/object/B101153811,"Abarca, R.","Abarca, L.",0.879
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20844,"Żelazny, J.",Żelazny,J.,,,,,,,4,https://herbarium.bgbm.org/object/B100344466,,,
20845,"Ždanova, O.",Ždanova,O.,,,,,,,6,https://herbarium.bgbm.org/object/B100263330,"Ždanova, O.","Baranova, O.G.",0.599
20846,"Žíla, V.",Žíla,V.,,,,,,,3,https://herbarium.bgbm.org/object/B100009590,"Žíla, V.","Žíla, V.",1.000
20847,"Волкова, Е.",Волкова,Е.,,,,,,,1,https://herbarium.bgbm.org/object/B100530714,,,


In [27]:
collectors_matches.dropna(subset=['similarity'], inplace=True)

collectors_matches # 18775907 → 18771627 rows × 14 columns

Unnamed: 0,canonical_string_collector_parsed,family,given,suffix,particle,dropping_particle,nick,appellation,title,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,left_side,right_side,similarity
0,"A. Cano, E.",A. Cano,E.,,,,,,,1,https://herbarium.bgbm.org/object/B100699397,"A. Cano, E.","Cano, E.B.",0.664
1,Aaiki,Aaiki,,,,,,,,1,https://herbarium.bgbm.org/object/B101149305,Aaiki,"Naiki, A.",0.707
2,"Aaronsohn, A.",Aaronsohn,A.,,,,,,,3,https://je.jacq.org/JE00010154,"Aaronsohn, A.","Aaronsohn, A.",1.000
4,"Abarca, R.",Abarca,R.,,,,,,,1,https://herbarium.bgbm.org/object/B101153811,"Abarca, R.","Abarca, L.",0.879
5,"Abarca, R.J.",Abarca,R.J.,,,,,,,15,https://herbarium.bgbm.org/object/B101139201,"Abarca, R.J.","Abarca, L.",0.800
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20841,"Ţopa, E.",Ţopa,E.,,,,,,,4,https://herbarium.bgbm.org/object/B100124910,"Ţopa, E.","Țopa, E.",1.000
20842,"Żarnowiec, J.",Żarnowiec,J.,,,,,,,7,https://je.jacq.org/JE04006443,"Żarnowiec, J.","Żarnowiec, J.T.",0.943
20843,"Żelany, J.",Żelany,J.,,,,,,,1,https://herbarium.bgbm.org/object/B100220196,"Żelany, J.","Ważny, J.",0.667
20845,"Ždanova, O.",Ždanova,O.,,,,,,,6,https://herbarium.bgbm.org/object/B100263330,"Ždanova, O.","Baranova, O.G.",0.599


In [28]:
# # Join Wikidata items
# df_avh_matches_wikidata = pd.merge(df_avh_matches, df_wikidata                , left_on='right_side', right_on='canonical_string', how='left')
# df_avh_matches_wikidata = pd.merge(df_avh_matches_wikidata, df_wikidata_unique, left_on='right_side', right_on='canonical_string', how='left')
# df_avh_matches_wikidata.rename(columns={df_avh_matches_wikidata.columns.tolist()[-1]: 'dup_count'}, inplace=True)


In [29]:
# Join Wikidata items
time_start = time.time()

collectors_matches_wikidata = pd.merge(collectors_matches, wikidata, left_on='right_side', right_on='canonical_string', how='left')
# collectors_matches_wikidata = pd.merge(collectors_matches_wikidata, wikidata_unique, left_on='right_side', right_on='canonical_string', how='left')
print("merge of collectors matches and wikidata in %s s" % (time.time() - time_start))

print(list(collectors_matches_wikidata.columns))

# collectors_matches_wikidata.rename(columns={collectors_matches_wikidata.columns.tolist()[-1]: 'dup_count'}, inplace=True)

merge of collectors matches and wikidata in 0.1557629108428955 s
['canonical_string_collector_parsed', 'family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick', 'appellation', 'title', 'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample', 'left_side', 'right_side', 'similarity', 'item', 'itemLabel', 'surname', 'initials', 'canonical_string', 'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye']
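
Since several Wikidata items can share one canonical string, the join above can return more than one row per collector name. The following sketch on made-up data attaches the number of such items as a `dup_count` column (the name used in the commented-out code above), which may help when reviewing ambiguous matches.

```python
import pandas as pd

# made-up miniature data: two Wikidata items share one canonical string
wikidata_demo = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3"],
    "canonical_string": ["Abarca, L.", "Abarca, L.", "Kotschy, T."],
})
matches_demo = pd.DataFrame({
    "left_side": ["Abarca, R.", "Kotschy, Th"],
    "right_side": ["Abarca, L.", "Kotschy, T."],
})

# number of Wikidata items per canonical string
dup_count = (
    wikidata_demo.groupby("canonical_string")["item"]
    .count()
    .rename("dup_count")
    .reset_index()
)

joined = (
    matches_demo
    .merge(wikidata_demo, left_on="right_side", right_on="canonical_string", how="left")
    .merge(dup_count, on="canonical_string", how="left")
)
print(joined)
```

Rows with `dup_count` greater than 1 point to names that cannot be resolved by the string alone.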


In [30]:
# Remove unwanted columns (.copy() makes the selection an independent data frame, so it can be sorted in place without a warning)
collectors_wikidata_cossim = collectors_matches_wikidata[
    ['canonical_string_collector_parsed', 'family', 'given', 
     'occurrenceID_collectors_count', 'occurrenceID_collectors_firstsample',
    'left_side', 'right_side', 'similarity', 
    'item', 'canonical_string', 'itemLabel',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb']
].copy()

# Order by similarity (descending), then by family name (ascending)
collectors_wikidata_cossim.sort_values(by=['similarity', 'family'], ascending=[False, True], inplace=True)

collectors_wikidata_cossim # comparison-match of «Kotschy, Karl Georg Th» (collector data) →← «Kotschy, T» (Wikidata) has only 0.5 similarity but corresponds to the correct person name we need


Unnamed: 0,canonical_string_collector_parsed,family,given,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,left_side,right_side,similarity,item,canonical_string,...,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb
2,"Aaronsohn, A.",Aaronsohn,A.,3,https://je.jacq.org/JE00010154,"Aaronsohn, A.","Aaronsohn, A.",1.0,http://www.wikidata.org/entity/Q2086130,"Aaronsohn, A.",...,,2795076,0000 0001 0948 8581,30592.0,23-1,Aarons.,Q2086130,1876.0,1919.0,
7,"Abbe, E.C.",Abbe,E.C.,2,https://herbarium.bgbm.org/object/B100241637,"Abbe, E.C.","Abbe, E.C.",1.0,http://www.wikidata.org/entity/Q10274118,"Abbe, E.C.",...,,101473381,0000 0000 7237 8505,30066.0,26-1,Abbe,Q10274118,1905.0,2000.0,
10,"Abbott, J.R.",Abbott,J.R.,80,https://herbarium.bgbm.org/object/B100181131,"Abbott, J.R.","Abbott, J.R.",1.0,http://www.wikidata.org/entity/Q18982386,"Abbott, J.R.",...,,,,,20015671-1,J.R.Abbott,,1968.0,,
12,"Abbott, W.L.",Abbott,W.L.,4,http://id.snsb.info/snsb/collection/504820/626...,"Abbott, W.L.","Abbott, W.L.",1.0,http://www.wikidata.org/entity/Q635604,"Abbott, W.L.",...,,1545420,0000 0000 3712 5377,27518.0,,,Q635604,1860.0,1936.0,
17,"Abedin, S.",Abedin,S.,14,https://herbarium.bgbm.org/object/B100046632,"Abedin, S.","Abedin, S.",1.0,http://www.wikidata.org/entity/Q16142861,"Abedin, S.",...,,5859151837993620520007,,69097.0,35239-1,Abedin,,1952.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8878,"Kotschy, Karl Georg Th",Kotschy,Karl Georg Th,1,https://herbarium.bgbm.org/object/B101113772,"Kotschy, Karl Georg Th","Kotschy, T.",0.5,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",...,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
11370,"Molero, C.",Molero,C.,1,https://herbarium.bgbm.org/object/B100720720,"Molero, C.","Poveda-Molero, J.C.",0.5,http://www.wikidata.org/entity/Q88845286,"Poveda-Molero, J.C.",...,,,,,20039862-1,Poveda-Molero,,,,
12448,Ohnesorge,Ohnesorge,,1,https://herbarium.bgbm.org/object/B101142476,Ohnesorge,"Desor, É.",0.5,http://www.wikidata.org/entity/Q84445,"Desor, É.",...,,106994079,0000 0001 1696 4208,,,,Q84445,1811.0,1882.0,
15816,Sebesta,Sebesta,,1,https://herbarium.bgbm.org/object/B100002535,Sebesta,"Šebesta, F.",0.5,http://www.wikidata.org/entity/Q53091029,"Šebesta, F.",...,,83917646,0000 0000 5653 8783,50363.0,,,Q53091029,1844.0,1896.0,


In [36]:
pd.set_option("display.max_columns", None)

criterion = collectors_wikidata_cossim['canonical_string_collector_parsed'].map(lambda x: x.startswith('Kotschy'))
print("Show example of «Kotschy…» with similarities of 0.5 … 1.0")
collectors_wikidata_cossim[criterion]

Show example of «Kotschy…» with similarities of 0.5 … 1.0


Unnamed: 0,canonical_string_collector_parsed,family,given,occurrenceID_collectors_count,occurrenceID_collectors_firstsample,left_side,right_side,similarity,item,canonical_string,itemLabel,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb
8879,"Kotschy, T.",Kotschy,T.,310,http://id.snsb.info/snsb/collection/117808/176...,"Kotschy, T.","Kotschy, T.",1.0,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
8875,"Kotschy, C.G.",Kotschy,C.G.,1,http://id.snsb.info/snsb/collection/22980/3175...,"Kotschy, C.G.","Kotschy, C.F.",0.895,http://www.wikidata.org/entity/Q86842,"Kotschy, C.F.",Carl Friedrich Kotschy,,317065809,,,,,Q86842,1789.0,1856.0,
8880,"Kotschy, Th",Kotschy,Th,5,https://herbarium.bgbm.org/object/B100526350,"Kotschy, Th","Kotschy, T.",0.888,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
8874,Kotschy,Kotschy,,2,https://dr.jacq.org/DR049432,Kotschy,"Kotschy, T.",0.849,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
8876,"Kotschy, C.G.T.",Kotschy,C.G.T.,2494,http://id.snsb.info/snsb/collection/108230/167...,"Kotschy, C.G.T.","Kotschy, C.F.",0.824,http://www.wikidata.org/entity/Q86842,"Kotschy, C.F.",Carl Friedrich Kotschy,,317065809,,,,,Q86842,1789.0,1856.0,
8877,"Kotschy, K.G.T.",Kotschy,K.G.T.,37,http://id.snsb.info/snsb/collection/16719/2549...,"Kotschy, K.G.T.","Kotschy, T.",0.722,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
8881,"Kotschyi, C.G.T.",Kotschyi,C.G.T.,1,https://herbarium.bgbm.org/object/B100160086,"Kotschyi, C.G.T.","Kotschy, T.",0.614,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,
8878,"Kotschy, Karl Georg Th",Kotschy,Karl Georg Th,1,https://herbarium.bgbm.org/object/B101113772,"Kotschy, Karl Georg Th","Kotschy, T.",0.5,http://www.wikidata.org/entity/Q113299,"Kotschy, T.",Theodor Kotschy,,5113711,0000 0000 8084 6890,23120.0,4989-1,Kotschy,Q113299,1813.0,1866.0,


In [33]:
# TODO further evaluation or filtering, counting, clean up and so on
from datetime import datetime
import os
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file='data/bgbm_collectors_cosine-similarity_wikidata-botanists_%s.csv' % (
    # "20230705"
    datetime.today().strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_wikidata_cossim.to_csv(this_output_file)

print("Wrote matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # right shift by 10 bits = integer division by 1024, i.e. the size in kB
)

Wrote matches of collector names into data/bgbm_collectors_cosine-similarity_wikidata-botanists_20230726.csv (4410 kB)


TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nick name from name parsing
appellation | appellation from name parsing
title | title from name parsing
occurrenceID_collectors_count | count of all occurrenceID of one particular collector name
occurrenceID_collectors_firstsample | a data sample of an occurrenceID 
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
left_side | name from the collector data set that was matched
right_side | Wikidata name the collector name was matched to
similarity | calculated cosine-similarity
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname	| Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))