# Create Plazi Collectors Data Set and Match Names to WikiData

Create a data set of collectors recorded by Plazi:

- open <https://tb.plazi.org/GgServer/srsStats>, section “Materials Citation Data”
- select the fields (columns) of interest, then adjust the output in the section **Fields to Use in Statistics** below:
    - choose **Operation** “show individual values”
    - filter values at **Filter on Values**
    - set the limit to a small number, e.g. 5, to preview the data you would get
    - copy the download link for the desired output format, which is offered further down the page

## Example Data

| Field Name | Filter on Values |
|-|-|
| Collector Name          | >0 |
| GBIF Occurrence ID      | !0 |
| Collecting Month        |    |
| Collecting Year         |    |
| Collecting Decade       |    |
| Collecting Date         |    |
| Materials Citation UUID |    |

```bash
# added filter: gbifOccurrenceId → !0
# added filter: collector → >0 (seems to give the non empty collector names)
filename="plazi-stats_numberOfTreatments_gbifOccurrenceId-not0_date_decade_year_month_collector-gt0_$(date '+%Y%m%d').tsv"
wget --output-document="${filename}" \
'https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&FP-matCit.gbifOccurrenceId=!0&FP-matCit.collector=%3E0&format=TSV'

wc -l < "${filename}"
# 417402 lines, i.e. 417401 records plus 1 column header line

{ head -n 5 "${filename}"; echo "..."; tail -n 5 "${filename}"; } | column --table --separator $'\t' | sed 's@^@  # @;'
  # DocCount  MatCitId                          MatCitGbifOccurrenceId  MatCitDate  MatCitDecade  MatCitYear  MatCitMonth  MatCitCollector
  # 1         78F03CF8FFE2FFE5C0C4F883FE73F8B4  3419301320                          0             0           0            1888 - 1890 & Morong, T.
  # 1         78F03CF8FFE5FFE2C187FB83FD0AFB94  3419301397                          0             0           0            1914 & Chodat, R.
  # 1         1FFD3CFF806D3D11C410027311B3FEAC  4012799597              1980-09-19  1980          1980        9            1980 - Sino- American Botanical Expedition
  # 1         AFA17A73FFA8F2414DA6F9AB94DCF942  3466701331                          0             0           0            20. 8.201 3 & Delage, A.
  # ...                                                                                                                    
  # 1         3B7F3CD7FFEDFFF5FB68FCBD4061FCB8  3072658352              2017-07-05  2010          2017        7            Z. Z. Xia
  # 1         3B5C3CD3FF9FFFACFCCB2B09BAD0FE79  1699618906              2002-06-25  2000          2002        6            Z. Z. Yang
  # 1         B5B23CA2C006FF87FB6FF9CBFA17F94A  2028140173              2009-08-18  2000          2009        8            Z. Z. Yang
  # 1         3B063C92F16FFF93DA9FFC4DFEDB1D0B  3866542316              2015-06-08  2010          2015        6            ZZ Zhang
  # 1         3B7C3CAD6B18FFBCADDEFA01FE543FE5  3034555558              1956-06-20  1950          1956        6            А. Schnitnikov
```
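The download link above follows a regular pattern, so it can also be assembled in code. The following is a minimal sketch, assuming the endpoint accepts exactly the query parameters seen in the URL above (`outputFields`, `groupingFields`, `FP-<field>` filters, `format`); the helper name `build_stats_url` is hypothetical and not part of any Plazi tooling.

```python
from urllib.parse import urlencode

STATS_URL = "https://tb.plazi.org/GgServer/srsStats/stats"

def build_stats_url(fields, filters, output_format="TSV"):
    """Assemble a statistics download URL (hypothetical helper).

    fields  -- list of field names, e.g. "matCit.collector"
    filters -- dict mapping a field name to its filter value, e.g. ">0"
    """
    query = [
        ("outputFields", " ".join(fields)),
        ("groupingFields", " ".join(fields)),
    ]
    # each filter becomes an "FP-<field>" parameter, as in the URL above
    query += [("FP-" + field, value) for field, value in filters.items()]
    query.append(("format", output_format))
    # spaces become "+", ">" becomes "%3E"; "!" is kept literal via safe="!"
    return STATS_URL + "?" + urlencode(query, safe="!")

url = build_stats_url(
    ["matCit.id", "matCit.gbifOccurrenceId", "matCit.collector"],
    {"matCit.gbifOccurrenceId": "!0", "matCit.collector": ">0"},
)
print(url)
```

Keeping the field list in one place avoids the two field parameters drifting apart, since the URL above repeats the same fields for output and grouping.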



In [2]:
import json
import requests
import pandas as pd
import time
import pprint

# https://tb.plazi.org/GgServer/srsStats/stats?
#   outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   FP-matCit.gbifOccurrenceId=!0
#   &
#   FP-matCit.collector=%3E0
#   &
#   format=TSV
url = 'https://tb.plazi.org/GgServer/srsStats/stats'
params = [
    ('outputFields',   'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('groupingFields', 'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('FP-matCit.gbifOccurrenceId', '!0'),
    ('FP-matCit.collector', '>0'),
    ('format', 'JSON')
]

start_time = time.time()
print("Send data request to" , url)

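response = requests.get(url, params=params)
response.raise_for_status()  # stop on an HTTP error instead of failing later while parsing JSON
payload = response.json()  # not named `dict`, which would shadow the built-in type
collectors = payload['data']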

print("Response of %s came in %s seconds (HTTP-code: %s)" % (
    url, 
    (time.time() - start_time), 
    response.status_code)
)

start_time = time.time()
print("Normalize JSON data with pandas …")

df = pd.json_normalize(collectors)

print("Normalization took %s seconds" % (time.time() - start_time) )

print("Print data sample …")
df



Send data request to https://tb.plazi.org/GgServer/srsStats/stats
Response of https://tb.plazi.org/GgServer/srsStats/stats came in 12.322981357574463 seconds (HTTP-code: 200)
Normalize JSON data with pandas …
Normalization took 2.3167567253112793 seconds
Print data sample …


Unnamed: 0,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth,MatCitCollector
0,1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0,"1888 - 1890 & Morong, T."
1,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0,"1914 & Chodat, R."
2,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9,1980 - Sino- American Botanical Expedition
3,1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0,"20. 8.201 3 & Delage, A."
4,1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0,"20. IX. 1957 & fr., Service Forestier"
...,...,...,...,...,...,...,...,...
423968,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7,Z. Z. Xia
423969,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6,Z. Z. Yang
423970,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8,Z. Z. Yang
423971,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6,ZZ Zhang
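To illustrate what the normalization step does, here is a small sketch on a hand-made payload. It assumes, as the code above does, that the records arrive as a list of flat objects under the key `data`; the two records are copied from the output above, reduced to five fields, and the value types (strings versus numbers) are an assumption.

```python
import pandas as pd

# a tiny stand-in for response.json(); the real payload is assumed to have the same shape
payload = {
    "data": [
        {"DocCount": 1, "MatCitId": "3B7F3CD7FFEDFFF5FB68FCBD4061FCB8",
         "MatCitGbifOccurrenceId": "3072658352", "MatCitDate": "2017-07-05",
         "MatCitCollector": "Z. Z. Xia"},
        {"DocCount": 1, "MatCitId": "3B5C3CD3FF9FFFACFCCB2B09BAD0FE79",
         "MatCitGbifOccurrenceId": "1699618906", "MatCitDate": "2002-06-25",
         "MatCitCollector": "Z. Z. Yang"},
    ]
}

# each record becomes one row, each key one column
sample = pd.json_normalize(payload["data"])
print(sample.shape)
print(list(sample.columns))
```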


In [3]:
list(df.columns)

['DocCount',
 'MatCitId',
 'MatCitGbifOccurrenceId',
 'MatCitDate',
 'MatCitDecade',
 'MatCitYear',
 'MatCitMonth',
 'MatCitCollector']

In [4]:
# move 'MatCitCollector' to be the first column (prepare parsing names for bin/agent_parse4tsv.rb: collectors in the 1st column)
col = df.pop("MatCitCollector")
df.insert(0, col.name, col)
df

Unnamed: 0,MatCitCollector,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,"1888 - 1890 & Morong, T.",1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0
1,"1914 & Chodat, R.",1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0
2,1980 - Sino- American Botanical Expedition,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9
3,"20. 8.201 3 & Delage, A.",1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0
4,"20. IX. 1957 & fr., Service Forestier",1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0
...,...,...,...,...,...,...,...,...
423968,Z. Z. Xia,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7
423969,Z. Z. Yang,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6
423970,Z. Z. Yang,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8
423971,ZZ Zhang,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6


## Write the Output Data



In [32]:
import os
import time

if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

this_output_file=os.path.join(
    "data", ("plazi_GbifOccurrenceId_CitCollector_%s.tsv" % time.strftime('%Y%m%d'))
)

df.to_csv(
    this_output_file,
    sep='\t',
    index=False,  # skip the index
    # header=["custom_colname_1", "custom_colname_2", "…"]  # could rewrite header labels
)

print("Wrote data results into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # shifting right by 10 bits divides by 1024 (bytes to kB)
)

Wrote data results into data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv (34790 kB)


## Parse Collector Names

With the collector names in the first column, you can now parse the names with dwcagent:

```bash
cd bin
ruby agent_parse4tsv.rb \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv

# alternatively, measure the running time of the parsing script with `time`, and add --logfile to record skipped names

time ruby agent_parse4tsv.rb --logfile \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
# -------------------------
# Done.
# We have 30203 empty parsing results detected.
#   You can also use --develop to get a full result table including the used source data of each parsed line
# Wrote log file of skipped names to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv_dwcagent_3.0.8.0.log
# Wrote data to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
# -------------------------
# 
# real    5m15,474s
# user    2m51,371s
# sys     2m2,877s
```

## Load WikiData Names and Parsed Collector Data

This procedure follows Niels Klazenga’s `match_names_to_wikidata_items.ipynb` (<https://github.com/nielsklazenga/avh-collectors/blob/47c3374f02bea4064b1c6708d79bcd9ba55a08a0/match_names_to_wikidata_items.ipynb>).

First use [`create_wikidata_datasets_botanists.ipynb`](create_wikidata_datasets_botanists.ipynb) to generate the data set of botanists from WikiData, then load it to prepare the matching against your data:

In [6]:
import pandas as pd
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.","Bieberstein, Friedrich August Marschall von",,43340073,0000 0001 1630 5464,1373,...,Q66612,1768.0,1826.0,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.","Behr, Hans Hermann",,20328622,0000 0001 1604 8680,42741,...,Q66934,1818.0,1904.0,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.","Schäffer, Jacob Christian",,47016953,0000 0000 8343 3899,1101,...,,1718.0,1790.0,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.","Klotzsch, Johann Friedrich",,20426762,0000 0001 1749 2732,135,...,Q67003,1805.0,1860.0,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.","Menge, Franz Anton",,59847236,0000 0001 1653 0899,73782,...,,1808.0,1880.0,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [7]:
# create the test data set from the WikiData data
# group by canonical name string and count duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
# TODO AP: meaning of wd_matchtest + count for merge later on? 

wd_matchtest.tail()

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
61479,"Șerbanescu, I.",1
61480,"Ștefureac, T.",1
61481,"Țopa, E.",1
61482,"Ḥalwaǧī, R.",1
61483,"Ḳushnir, Ṭ.",1
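The two header rows in the output above come from passing a dict of lists to `agg`, which creates two-level column labels. As a sketch of an alternative (not the notebook's method), pandas named aggregation produces a flat, self-describing count column. The item identifiers below are made up; the column name `item_count` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "item": ["Q-A", "Q-B", "Q-C"],  # made-up identifiers
    "canonical_string": ["Aarvik, L.", "Aarvik, L.", "Chodat, R.H."],
})

# one row per canonical name, with the number of WikiData items sharing that name
name_counts = (
    toy.groupby("canonical_string")
       .agg(item_count=("item", "count"))
       .reset_index()
)
print(name_counts)
```

A count above 1 marks a canonical name that is shared by several WikiData items, which is the ambiguity the later aggregation has to deal with.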


In [8]:
# names already parsed by the Ruby gem dwcagent

collectors = pd.read_csv("data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv", sep="\t", low_memory=False)

collectors.dropna(subset=['family'], inplace=True) # remove rows without a family name, e.g. where the source was «??» or similar
collectors

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,Chodat,R.,,,,,,,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0
1,Mayo,,,de,,,,,1,31ADD85BA138FFE3FF45A111FB90F6CB,3421410670,2001-01-18,2000,2001,1
2,Garcete,B.,,,,,,,1,31ADD85BA138FFE3FF45A111FB90F6CB,3421410670,2001-01-18,2000,2001,1
5,Virginia,A,,,,,,,1,3B553CEFFFD5FFC6FF34F4ABFD46FEDA,3333037406,2019-08-01,2010,2019,8
6,Virginia,A,,,,,,,1,3B553CEFFFD6FFC6FBC8F2A7FAA2FDF8,3333037676,2019-08-02,2010,2019,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
587870,Xia,Z.Z.,,,,,,,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7
587871,Yang,Z.Z.,,,,,,,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6
587872,Yang,Z.Z.,,,,,,,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8
587873,Zhang,Z.Z.,,,,,,,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6


#### Check Composition of Parsed Collector Data

In [9]:
# check whether the parsed name-part columns are empty or need to be considered as data for matching
import pprint
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    pprint.pprint(test_collectors.head())


----------------------------------------
show names with **particle** found 19674 records:

             family given suffix particle  dropping_particle  nick  \
1              Mayo   NaN    NaN       de                NaN   NaN   
268   A. A. Girault    G.    NaN       as                NaN   NaN   
269   A. A. Girault    G.    NaN       as                NaN   NaN   
1059          Grave    S.    NaN       De                NaN   NaN   
1063          Grave    S.    NaN       De                NaN   NaN   

     appellation title  DocCount                          MatCitId  \
1            NaN   NaN         1  31ADD85BA138FFE3FF45A111FB90F6CB   
268          NaN   NaN         1  E4E73CEFE566FFFE6C4A0CFC1E2D5BF7   
269          NaN   NaN         1  E4E73CEFE566FFFE6C040CD1191C5BD2   
1059         NaN   NaN         1  84B80478FFF8FFBFFE9DFC0A2796D4CB   
1063         NaN   NaN         1  84B80478FFF8FFBFFF2FFC2E278FD337   

      MatCitGbifOccurrenceId  MatCitDate  MatCitDecade  MatCitYea

Compile a canonical name string (`canonical_string_collector_parsed`) for the collector data, which is later matched against the WikiData names:

In [10]:
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where the given name is NA, otherwise family name + ", " + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
      # TODO improve the combined name if any of the other dwcagent-parsed fields is not NaN
      # NOTE: `any(collectors.particle.isna())` is one True/False for the whole column, not a per-row test;
      #       as soon as a single particle is NA, the first branch applies to every row
      #       and the particle is never prepended
      # other= collectors.family + ", " + collectors.given 
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family \
         + ", " + collectors.given
  )
)

# move 'canonical_string_collector_parsed' directly after the column 'title'
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
587870,Xia,Z.Z.,,,,,,,"Xia, Z.Z.",1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7
587871,Yang,Z.Z.,,,,,,,"Yang, Z.Z.",1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6
587872,Yang,Z.Z.,,,,,,,"Yang, Z.Z.",1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8
587873,Zhang,Z.Z.,,,,,,,"Zhang, Z.Z.",1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6
587874,Schnitnikov,А.,,,,,,,"Schnitnikov, А.",1,3B7C3CAD6B18FFBCADDEFA01FE543FE5,3034555558,1956-06-20,1950,1956,6
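The TODO in the cell above asks for a name string that also uses other parsed parts such as the particle. A minimal sketch of a per-row variant follows, built from two chained `Series.where` calls on toy rows taken from the outputs above. Whether “particle family, given” is the form that matches the WikiData canonical strings best is an assumption that would need checking.

```python
import pandas as pd

toy = pd.DataFrame({
    "family":   ["Chodat", "Mayo", "Grave", "Ellenberger"],
    "given":    ["R.",     None,   "S.",    None],
    "particle": [None,     "de",   "De",    None],
})

# keep the family name where there is no particle, otherwise prepend the particle
surname = toy.family.where(toy.particle.isna(), toy.particle + " " + toy.family)
# keep the surname where there is no given name, otherwise append ", given"
canonical = surname.where(toy.given.isna(), surname + ", " + toy.given)
print(canonical.tolist())
```

`Series.where(cond, other)` keeps the original value where `cond` is true and takes `other` elsewhere, so both decisions are made row by row.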


In [None]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

### Set Up the Text Search

See <https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536>

The ngrams function is used as an analyzer in the text search later.
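To make the idea concrete, here is a simplified, dependency-free stand-in for such an analyzer. It is only a sketch: the name `char_ngrams` is hypothetical, and it skips the `ftfy` repair and the ASCII filtering of the notebook's own function.

```python
import re

def char_ngrams(name, n=3):
    """Simplified sketch of a character n-gram analyzer: clean, pad, slice."""
    name = name.lower()
    name = re.sub(r"[.()\[\]{}|']", "", name)        # drop punctuation
    name = name.replace(",", " ").replace("-", " ")   # separators become spaces
    name = re.sub(r" +", " ", name).strip().title()   # single spaces, capitalised words
    name = " " + name + " "                           # padding marks word starts and ends
    return [name[i:i + n] for i in range(len(name) - n + 1)]

short_form = char_ngrams("Chodat, R.")
long_form = char_ngrams("Chodat, R.H.")
print(short_form)
print(long_form)
print("shared trigrams:", len(set(short_form) & set(long_form)))
```

Two spellings of the same person share most of their trigrams, which is what allows a near match even when initials differ.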

In [11]:
wd_matchtest['canonical_string'].at[0]

'(-Walraevens), O.H.'

In [12]:
import re
!pip install ftfy  # ftfy repairs mis-decoded text (mojibake)
from ftfy import fix_text

def ngrams(string, n=3):
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


print("Example from name:", ngrams('Klazenga, N.'))
print("Example from collectors:", ngrams(collectors["canonical_string_collector_parsed"].at[1])) 
print("Example from match-test:", ngrams(wd_matchtest['canonical_string'].at[1]))

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
    python-xyz', where xyz is the package you are trying to
    install.

    If you wish to install a non-Arch-packaged Python package,
    create a virtual environment using 'python -m venv path/to/venv'.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip.

    If you wish to install a non-Arch packaged Python application,
    it may be easiest to use 'pipx install xyz', which will manage a
    virtual environment for you. Make sure you have python-pipx
    installed via pacman.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-s

Vectorize the WikiData names. Background: we use an information retrieval technique, TF-IDF (Term Frequency, Inverse Document Frequency; see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with the WikiData names. The distance calculated between two vectorized names helps to match similar names and to tell apart names that are probably no match. For general information see also <https://scikit-learn.org> and <https://pypi.org/project/scikit-learn/>.

Convert a collection of raw documents to a matrix of TF-IDF features:

In [13]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

wikidata_names = wd_matchtest['canonical_string']

# vectorize wikidata names
print('Vectorizing data. This may take a while...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf_vector_data = vectorizer.fit_transform(wikidata_names)
print('Vectorizing completed: Created a matrix of TF-IDF features')


Vectorizing data. This may take a while...
Vectorizing completed: Created a matrix of TF-IDF features


Set up the function that performs the nearest neighbour matches...

In [14]:
from sklearn.neighbors import NearestNeighbors

nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step

# matching query
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices
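The distances returned here can be read as similarities. `TfidfVectorizer` scales every vector to unit length by default (`norm='l2'`), and `NearestNeighbors` uses the Euclidean distance by default. For unit vectors, the squared Euclidean distance equals `2 * (1 - cosine similarity)`. TF-IDF weights are never negative, so the distance lies between 0 (identical n-gram profile) and about 1.41 (no n-gram in common). The helper name below is hypothetical; the sketch only checks this identity.

```python
import math

def cosine_from_distance(distance):
    """Cosine similarity of two unit-length vectors, given their Euclidean distance."""
    return 1.0 - distance ** 2 / 2.0

# check the identity on two small unit vectors that are 60 degrees apart (cosine 0.5)
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
distance = math.dist(a, b)
print(round(cosine_from_distance(distance), 6))

# a rounded distance of 0.53, for example, corresponds to a cosine similarity of about 0.86
print(round(cosine_from_distance(0.53), 2))
```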


### Perform the Matching

Perform the nearest neighbour (NN) matches on the (Plazi) collector names and create a data frame with matches... (can take 10 to 30 minutes)

In [15]:
collectors_names = list(set(collectors['canonical_string_collector_parsed'].values)) 
  # de-duplicate the names via a set; keep them as a list so that result row i refers to collectors_names[i]

import time
start = time.time()
print('Getting nearest neighbours...')
distances, indices = getNearestN(collectors_names)
duration = time.time() - start
print('Completed in:', duration, 's')

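print('Finding matches...')
wikidata_names_array = wikidata_names.to_numpy()  # positional lookup; avoids rebuilding wd_matchtest.values in every iteration
matches = []
for i, j in enumerate(indices):
    temp = [collectors_names[i], wikidata_names_array[j[0]], round(distances[i][0], 2)]
    matches.append(temp)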

duration = time.time() - start
print('Building matches data frame:', duration, 's')  
matches = pd.DataFrame(
    matches, 
    columns=['namematch_collector','namematch_wikidata','namematch_distance']
)

duration = time.time() - start
print('Done:', duration, 's') 

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index()

matches.head()

Getting nearest neighbours...
Completed in: 790.5521192550659 s
Finding matches...
Building matches data frame: 1298.823076248169 s
Done: 1298.9033777713776 s


Unnamed: 0,index,namematch_collector,namematch_wikidata,namematch_distance
0,110835,Ellenberger,Ellenberger,0.0
1,9146,"Rosa, N.A.","Rosa, N.A.",0.0
2,46045,"Palmer, M.G.","Palmer, M.G.",0.0
3,88206,"Takahashi, T.","Takahashi, T.",0.0
4,28752,"Jurado, M.D.","Jurado, M.D.",0.0
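Every collector name receives a nearest neighbour, however poor the match is, so the distance has to be evaluated before a match is trusted. Below is a minimal sketch of such a filter on toy rows. The cut-off `MAX_DISTANCE = 0.6` is an arbitrary assumption for illustration (a suitable value has to be found by inspecting the real data), and the third row is made up.

```python
import pandas as pd

toy_matches = pd.DataFrame({
    "namematch_collector": ["Rosa, N.A.", "Chodat, R.", "Xia, Z.Z."],
    "namematch_wikidata":  ["Rosa, N.A.", "Chodat, R.H.", "Xiao, Z."],  # last pair is invented
    "namematch_distance":  [0.0, 0.53, 0.9],
})

MAX_DISTANCE = 0.6  # hypothetical cut-off, not taken from the notebook

exact = toy_matches[toy_matches["namematch_distance"] == 0]
accepted = toy_matches[toy_matches["namematch_distance"] <= MAX_DISTANCE]
rejected = toy_matches[toy_matches["namematch_distance"] > MAX_DISTANCE]
print(len(exact), "exact,", len(accepted), "accepted,", len(rejected), "rejected")
```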


### Create Output Results

Combine the matches data frame back to the (Plazi) collectors and Wikidata items …

In [16]:
# join the matches data frame back to the source collectors data frame
collectors_matches = pd.merge(
    collectors, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_collector'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth,index,namematch_collector,namematch_wikidata,namematch_distance
0,Chodat,R.,,,,,,,"Chodat, R.",1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0,109530,"Chodat, R.","Chodat, R.H.",0.53
1,Chodat,R.,,,,,,,"Chodat, R.",1,88A73C93FFA7664B5BDE97A93F63FADE,3416963306,,0,0,0,109530,"Chodat, R.","Chodat, R.H.",0.53
2,Chodat,R.,,,,,,,"Chodat, R.",1,88A73C93FFA7664B592397123F00FB55,3416963307,1914.0,1910,1914,0,109530,"Chodat, R.","Chodat, R.H.",0.53
3,Chodat,R.,,,,,,,"Chodat, R.",1,78F03CF8FFE5FFE2C045FC03FE70FBD4,3419301305,1914.0,1910,1914,0,109530,"Chodat, R.","Chodat, R.H.",0.53
4,Chodat,R.,,,,,,,"Chodat, R.",1,78F03CF8FFEBFFECC068FE83FE71FE94,3419301322,1914.0,1910,1914,0,109530,"Chodat, R.","Chodat, R.H.",0.53
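The merge above relies on each collector name occurring only once in `matches`. As a sketch of how that assumption could be verified, `pd.merge` accepts `validate="many_to_one"`, which raises an error if the right-hand key is duplicated, and `indicator=True`, which reports rows without a partner. The left join and the toy frames are choices made for this illustration; the notebook itself uses the default inner join.

```python
import pandas as pd

collectors_toy = pd.DataFrame({
    "canonical_string_collector_parsed": ["Chodat, R.", "Chodat, R.", "Xia, Z.Z."],
})
matches_toy = pd.DataFrame({
    "namematch_collector": ["Chodat, R."],
    "namematch_wikidata": ["Chodat, R.H."],
    "namematch_distance": [0.53],
})

merged = pd.merge(
    collectors_toy, matches_toy,
    how="left",                      # keep collector rows even without a match
    left_on="canonical_string_collector_parsed",
    right_on="namematch_collector",
    validate="many_to_one",          # fail if a name occurs twice in matches_toy
    indicator=True,                  # adds the column "_merge"
)
print(merged["_merge"].value_counts())
```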


Save the results...

In [17]:
import time
import os
if not os.path.exists('data'):
    os.makedirs('data')

this_output_file='data/plazi_collectors_matches_wikidata-botanists_%s.csv' % (
    # "20230719"
    time.strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_matches.to_csv(this_output_file)

print(
    "Wrote matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # shifting right by 10 bits divides by 1024 (bytes to kB)
) 

Wrote matches of collector names into data/plazi_collectors_matches_wikidata-botanists_20230719.csv (79877 kB)


### Aggregate Matched Data

Now aggregate the data: where a matched name corresponds to several WikiData items, join the multiple values with “|”.

In [18]:
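# attach the WikiData records to the matches via the canonical name string
# (an earlier variant merged the per-name counts from wd_matchtest instead:)
# collectors_matches_g1 = pd.merge(collectors_matches, wd_matchtest, 
#                                  left_on='matched_name', right_on='canonical_string')
collectors_matches_g1 = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_wikidata', right_on='canonical_string'
)
# NOTE: the last column after this merge is `bionomia_link` (not a count); this rename relabels it as 'item_count'
collectors_matches_g1.rename(columns = {list(collectors_matches_g1)[-1]: 'item_count'}, inplace=True)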

# link wikidata items with canonical name string (pipe separated if more than one)
print('Aggregate WD item, i.e. WD IDs (multiple records joined by “…|…”) ...')
wikidata_uniq_items = wikidata.groupby(['canonical_string'])['item'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g2 = pd.merge(  # now merge the unique WikiData names to the collectors
    collectors_matches_g1, wikidata_uniq_items, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_itemcount', '__grp_by_item') 
      # suffixes are appended to left/right column names only where both frames share a column name
)
collectors_matches_g2.rename(columns = {list(collectors_matches_g2)[-1]: 'items'}, inplace=True)

# link wikidata item labels with canonical name string (pipe separated if more than one)
print('Aggregate WD itemLabel, i.e. names (multiple records joined by “…|…”) ...')
wikidata_uniq_itemlabels = wikidata.groupby(['canonical_string'])['itemLabel'].apply('|'.join).reset_index()
print('Done.')

collectors_matches_g3 = pd.merge(
    collectors_matches_g2, wikidata_uniq_itemlabels, 
    left_on='namematch_wikidata', right_on='canonical_string'
    , suffixes=('__grp_by_items', '__grp_by_itemlabel') 
      # append to left-data, right-data only when identical column names occur
)

collectors_matches_g3.rename(columns = {list(collectors_matches_g3)[-1]: 'item_labels'}, inplace=True)


Aggregate WD item, i.e. WD IDs (multiple records joined by “…|…”) ...
Done.
Aggregate WD itemLabel, i.e. names (multiple records joined by “…|…”) ...
Done.
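The pipe-joined items, labels and a count per name can also be computed in one `groupby` pass. The following is a sketch on toy data, not the notebook's method. The item identifiers are made up; the helper `pipe_join` and the output column names are assumptions. Dropping missing values before joining avoids a `TypeError` when a label is NaN.

```python
import pandas as pd

wikidata_toy = pd.DataFrame({
    "canonical_string": ["Aarvik, L.", "Aarvik, L.", "Chodat, R.H."],
    "item": ["Q-A", "Q-B", "Q-C"],  # made-up identifiers
    "itemLabel": ["Leif Aarvik", "Lars Aarvik", "Robert Hippolyte Chodat"],
})

def pipe_join(values):
    """Join the non-missing values of a group with '|'."""
    return "|".join(values.dropna().astype(str))

aggregated = (
    wikidata_toy.groupby("canonical_string")
    .agg(
        items=("item", pipe_join),
        item_labels=("itemLabel", pipe_join),
        item_count=("item", "count"),
    )
    .reset_index()
)
print(aggregated)
```

A single merge of this frame onto the matches would then add all three columns at once, without the need to rename the last column after each step.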


Prepare data to save later on …

In [19]:
collectors_matches_group = collectors_matches_g3

pprint.pprint(collectors_matches_group.columns)
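# suffixes such as '__grp_by_items' come from the `suffixes=` arguments of the merges above
# (pandas' defaults would be _x for columns from the left and _y for columns from the right frame)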

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed', 'DocCount',
       'MatCitId', 'MatCitGbifOccurrenceId', 'MatCitDate', 'MatCitDecade',
       'MatCitYear', 'MatCitMonth', 'index', 'namematch_collector',
       'namematch_wikidata', 'namematch_distance', 'item__grp_by_itemcount',
       'itemLabel__grp_by_items', 'surname', 'initials',
       'canonical_string__grp_by_itemcount', 'canonical_string_fullname',
       'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob',
       'yod', 'wyb', 'wye', 'wikidata_link', 'orcid_link', 'harv_link',
       'ipni_link', 'item_count', 'canonical_string__grp_by_item', 'items',
       'canonical_string', 'item_labels'],
      dtype='object')


In [20]:
collectors_matches_group.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount,...,wye,wikidata_link,orcid_link,harv_link,ipni_link,item_count,canonical_string__grp_by_item,items,canonical_string,item_labels
0,Chodat,R.,,,,,,,"Chodat, R.",1,...,,http://www.wikidata.org/wiki/Q2613173,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/1611-1,,"Chodat, R.H.",http://www.wikidata.org/entity/Q2613173,"Chodat, R.H.",Robert Hippolyte Chodat
1,Chodat,R.,,,,,,,"Chodat, R.",1,...,,http://www.wikidata.org/wiki/Q2613173,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/1611-1,,"Chodat, R.H.",http://www.wikidata.org/entity/Q2613173,"Chodat, R.H.",Robert Hippolyte Chodat
2,Chodat,R.,,,,,,,"Chodat, R.",1,...,,http://www.wikidata.org/wiki/Q2613173,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/1611-1,,"Chodat, R.H.",http://www.wikidata.org/entity/Q2613173,"Chodat, R.H.",Robert Hippolyte Chodat
3,Chodat,R.,,,,,,,"Chodat, R.",1,...,,http://www.wikidata.org/wiki/Q2613173,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/1611-1,,"Chodat, R.H.",http://www.wikidata.org/entity/Q2613173,"Chodat, R.H.",Robert Hippolyte Chodat
4,Chodat,R.,,,,,,,"Chodat, R.",1,...,,http://www.wikidata.org/wiki/Q2613173,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/1611-1,,"Chodat, R.H.",http://www.wikidata.org/entity/Q2613173,"Chodat, R.H.",Robert Hippolyte Chodat


In [21]:
# Remove superfluous columns 
# TODO check duplicates
collectors_matches_group = collectors_matches_g3[
    ['family', 'given', 'canonical_string_collector_parsed', 
    'namematch_collector', 'namematch_wikidata', 'namematch_distance','MatCitDate',
    'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob', 'yod', 'wyb',
    'items', 'canonical_string', 'item_labels']
].copy()  # explicit copy, so the in-place operations below do not raise SettingWithCopyWarning
# collectors_matches_group = collectors_matches_g3
collectors_matches_group.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed','MatCitDate']
    , inplace=True
)
collectors_matches_group.drop_duplicates(inplace=True)
collectors_matches_group.head()


Unnamed: 0,family,given,canonical_string_collector_parsed,namematch_collector,namematch_wikidata,namematch_distance,MatCitDate,orcid,viaf,isni,harv,ipni,abbr,bionomia_id,yob,yod,wyb,items,canonical_string,item_labels
609620,Aagaard,K.,"Aagaard, K.","Aagaard, K.","Aagaard, K.",0.0,18.VII.198,,273390791.0,,,,,Q55216516,1947.0,,,http://www.wikidata.org/entity/Q55216516,"Aagaard, K.",Kaare Aagaard
609621,Aagaard,K.,"Aagaard, K.","Aagaard, K.","Aagaard, K.",0.0,1986-07-01,,273390791.0,,,,,Q55216516,1947.0,,,http://www.wikidata.org/entity/Q55216516,"Aagaard, K.",Kaare Aagaard
19691,Aarvik',L.,"Aarvik', L.","Aarvik', L.","Aarvik, L.",0.0,1992-03-25,0000-0002-0112-8837,14511016.0,0000 0000 4817 3374,,,,0000-0002-0112-8837,1954.0,,,http://www.wikidata.org/entity/Q17114254|http:...,"Aarvik, L.",Leif Aarvik|Lars Aarvik
19692,Aarvik',L.,"Aarvik', L.","Aarvik', L.","Aarvik, L.",0.0,1992-03-25,,,,,,,Q106823278,1892.0,1981.0,,http://www.wikidata.org/entity/Q17114254|http:...,"Aarvik, L.",Leif Aarvik|Lars Aarvik
19683,Aarvik,L.,"Aarvik, L.","Aarvik, L.","Aarvik, L.",0.0,1955-01,0000-0002-0112-8837,14511016.0,0000 0000 4817 3374,,,,0000-0002-0112-8837,1954.0,,,http://www.wikidata.org/entity/Q17114254|http:...,"Aarvik, L.",Leif Aarvik|Lars Aarvik


In [22]:
this_output_file='data/plazi_collectors_matches_wikidata_items_group_concat_%s.csv' % (
    # "20230719"
    time.strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

# collectors_matches_group.to_csv(this_output_file_name)
collectors_matches_group.to_csv(this_output_file)

print("Wrote groups of collectors matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # shifting right by 10 bits divides by 1024 (bytes to kB)
)

Wrote groups of collectors matches into data/plazi_collectors_matches_wikidata_items_group_concat_20230719.csv (84964 kB)


Get individual WikiData items (TODO: review code):
- associate each collector name match with the individual WikiData items that share the matched name (remember: we matched on `canonical_string`, which several items can have in common)
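
To illustrate what this association does, here is a minimal sketch with toy data. The frames `collectors_demo` and `wikidata_demo` and the column `n_items_same_name` are hypothetical stand-ins, not the notebook's real objects; only the names and labels are taken from the output above. Because `canonical_string` is not unique, one collector record fans out into one row per candidate WikiData item, and a per-name count is one possible way to keep track of that ambiguity:

```python
import pandas as pd

# Hypothetical stand-in for `collectors_matches`
collectors_demo = pd.DataFrame({
    "namematch_collector": ["Aarvik, L.", "Aagaard, K."],
    "namematch_wikidata": ["Aarvik, L.", "Aagaard, K."],
})

# Hypothetical stand-in for `wikidata`: two items share one canonical string
wikidata_demo = pd.DataFrame({
    "canonical_string": ["Aarvik, L.", "Aarvik, L.", "Aagaard, K."],
    "itemLabel": ["Leif Aarvik", "Lars Aarvik", "Kaare Aagaard"],
})

# Number of WikiData items that share each canonical name string
wikidata_demo["n_items_same_name"] = (
    wikidata_demo.groupby("canonical_string")["itemLabel"].transform("size")
)

# One output row per (collector record, candidate WikiData item) pair
merged_demo = pd.merge(
    collectors_demo, wikidata_demo,
    left_on="namematch_wikidata", right_on="canonical_string",
    how="inner",
)
print(merged_demo)
```

Rows with `n_items_same_name` greater than 1 are ambiguous matches that still need to be resolved by other evidence, for example dates.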

In [23]:
# TODO: get the list of collector matches atomized down to single WikiData items
collectors_matches_t1 = pd.merge(
    collectors_matches, wikidata,
    left_on='namematch_wikidata', right_on='canonical_string'
)
# collectors_matches_t1.drop(columns=['canonical_string'])

# link counts of wikidata items with same canonical name string
# NOTE (review): wikidata is joined a second time on the same key, so a name
# shared by k items yields k*k rows per collector record
collectors_matches_t2 = pd.merge(
    collectors_matches_t1, wikidata,
    left_on="namematch_wikidata", right_on="canonical_string",
    # suffixes are appended to left/right column names only where both frames share a column name
    suffixes=('__collmatches', '__wdata-isolated')
)

# TODO AP: add count of duplicates?

pprint.pprint(collectors_matches_t2.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed', 'DocCount',
       'MatCitId', 'MatCitGbifOccurrenceId', 'MatCitDate', 'MatCitDecade',
       'MatCitYear', 'MatCitMonth', 'index', 'namematch_collector',
       'namematch_wikidata', 'namematch_distance', 'item__collmatches',
       'itemLabel__collmatches', 'surname__collmatches',
       'initials__collmatches', 'canonical_string__collmatches',
       'canonical_string_fullname__collmatches', 'orcid__collmatches',
       'viaf__collmatches', 'isni__collmatches', 'harv__collmatches',
       'ipni__collmatches', 'abbr__collmatches', 'bionomia_id__collmatches',
       'yob__collmatches', 'yod__collmatches', 'wyb__collmatches',
       'wye__collmatches', 'wikidata_link__collmatches',
       'orcid_link__collmatches', 'harv_link__collmatches',
       'ipni_link__collmatches', 'bionomia_link__collmatches',
       'item__wdata-isolated', 'itemLabel

In [24]:
collectors_matches_t2.sort_values(by=['namematch_distance', 'canonical_string_collector_parsed','MatCitDate'], inplace=True)
collectors_matches_t2.drop_duplicates(inplace=True)
collectors_matches_t2.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount,...,bionomia_id__wdata-isolated,yob__wdata-isolated,yod__wdata-isolated,wyb__wdata-isolated,wye__wdata-isolated,wikidata_link__wdata-isolated,orcid_link__wdata-isolated,harv_link__wdata-isolated,ipni_link__wdata-isolated,bionomia_link__wdata-isolated
844352,Aagaard,K.,,,,,,,"Aagaard, K.",1,...,Q55216516,1947.0,,,,http://www.wikidata.org/wiki/Q55216516,,,,https://bionomia.net/Q55216516
844353,Aagaard,K.,,,,,,,"Aagaard, K.",1,...,Q55216516,1947.0,,,,http://www.wikidata.org/wiki/Q55216516,,,,https://bionomia.net/Q55216516
24113,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,0000-0002-0112-8837,1954.0,,,,http://www.wikidata.org/wiki/Q17114254,https://orcid.org/0000-0002-0112-8837,,,https://bionomia.net/0000-0002-0112-8837
24114,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,Q106823278,1892.0,1981.0,,,http://www.wikidata.org/wiki/Q106823278,,,,https://bionomia.net/Q106823278
24115,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,0000-0002-0112-8837,1954.0,,,,http://www.wikidata.org/wiki/Q17114254,https://orcid.org/0000-0002-0112-8837,,,https://bionomia.net/0000-0002-0112-8837


In [25]:
this_output_file='data/plazi_collectors_matches_wikidata-botanists_all-columns_%s.csv' % (
    # "20230719"
    time.strftime('%Y%m%d') # '%Y%m%d_%H%M'
)

collectors_matches_t2.to_csv(this_output_file)

print("Wrote isolated WikiData items of collector matches into %s (%d kB)" %
    (this_output_file, os.path.getsize(this_output_file) >> 10)  # >> 10 shifts right by 10 bits = integer division by 1024 (bytes to kB)
)

Wrote isolated WikiData items of collector matches into data/plazi_collectors_matches_wikidata-botanists_all-columns_20230719.csv (530446 kB)


In [None]:
# TODO further evaluation or filtering, counting, clean up aso.
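
One possible direction for this filtering step, sketched under assumptions: keep only matches whose name distance is small and whose collecting year does not contradict the life dates of the WikiData item. The threshold `MAX_DISTANCE`, the helper `plausible_matches` and the toy frame `matches_demo` are hypothetical; the toy values only mirror the kind of rows shown in the output above, and a `MatCitYear` of 0 is treated as "year unknown".

```python
import pandas as pd

MAX_DISTANCE = 0.1  # hypothetical threshold, to be tuned on real data

# Hypothetical toy rows: two candidate items for "Aarvik, L.", one for "Aagaard, K."
matches_demo = pd.DataFrame({
    "namematch_collector": ["Aarvik, L.", "Aarvik, L.", "Aagaard, K."],
    "namematch_distance": [0.0, 0.0, 0.0],
    "MatCitYear": [1992, 1992, 1986],
    "yob": [1954.0, 1892.0, 1947.0],
    "yod": [float("nan"), 1981.0, float("nan")],
})


def plausible_matches(df, max_distance=MAX_DISTANCE):
    """Keep rows with a close name match and no conflict with life dates."""
    year = df["MatCitYear"]
    year_unknown = year == 0
    close_enough = df["namematch_distance"] <= max_distance
    # Missing birth or death years never exclude a row
    born_before = year_unknown | df["yob"].isna() | (df["yob"] < year)
    alive = year_unknown | df["yod"].isna() | (df["yod"] >= year)
    return df[close_enough & born_before & alive]


filtered_demo = plausible_matches(matches_demo)
print(filtered_demo)
```

In this toy data the candidate who died in 1981 is dropped for the 1992 collection, which leaves a single candidate for "Aarvik, L.".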

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nickname from name parsing
appellation | appellation from name parsing
title | title from name parsing
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_collector | matched collector name from the data set
namematch_wikidata | matched Wikidata name, i.e. the canonical string (derived from the item label) that the collector name was matched to
namematch_distance | Nearest Neighbour distance between the name and matched name; the lower the value, the better the match
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb	| Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))