# Create Plazi Collectors Data Set and Match Names to WikiData

Create a data set of the collectors recorded by Plazi:

- open <https://tb.plazi.org/GgServer/srsStats> and go to the section “Materials Citation Data”
- select the data fields (columns) of interest, then adjust the output in the section **Fields to Use in Statistics** below:
    - set **Operation** to “show individual values”
    - restrict the values with **Filter on Values**
    - set the limit to a small number, e.g. 5, to preview the data you would get
    - copy the download link for the desired data format, which is offered further below
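
The selections made in the form end up as query parameters of the statistics URL. A minimal sketch of how such a query string can be assembled, assuming the parameter names seen in the download link used in the example below (`outputFields`, `groupingFields`, `FP-…` filters, `format`); the field list is shortened for illustration:

```python
from urllib.parse import urlencode

base_url = "https://tb.plazi.org/GgServer/srsStats/stats"
fields = ["matCit.id", "matCit.gbifOccurrenceId", "matCit.collector"]  # shortened example list

params = [
    ("outputFields", " ".join(fields)),    # columns to return
    ("groupingFields", " ".join(fields)),  # columns to group by
    ("FP-matCit.gbifOccurrenceId", "!0"),  # filter value as entered in the form
    ("FP-matCit.collector", ">0"),         # filter value as entered in the form
    ("format", "TSV"),
]

# urlencode() joins the pairs with "&", turns spaces into "+" and ">" into "%3E"
query_url = base_url + "?" + urlencode(params)
print(query_url)
```

Passing the same list as `params` to `requests.get` lets the library do this encoding itself.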

## Example Data

| Field Name | Filter on Values |
|-|-|
| Collector Name          | >0 |
| GBIF Occurrence ID      | !0 |
| Collecting Month        |    |
| Collecting Year         |    |
| Collecting Decade       |    |
| Collecting Date         |    |
| Materials Citation UUID |    |

```bash
# added filter: gbifOccurrenceId → !0
# added filter: collector → >0 (seems to give the non empty collector names)
filename="plazi-stats_numberOfTreatments_gbifOccurrenceId-not0_date_decade_year_month_collector-gt0_$(date '+%Y%m%d').tsv"
wget --output-document="${filename}" \
'https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&FP-matCit.gbifOccurrenceId=!0&FP-matCit.collector=%3E0&format=TSV'

cat "${filename}" | wc -l
# 417402 minus 1 record (=column header)

{ head -n 5 "${filename}"; echo "..."; tail -n 5 "${filename}"; } | column --table --separator $'\t' | sed 's@^@  # @;'
  # DocCount  MatCitId                          MatCitGbifOccurrenceId  MatCitDate  MatCitDecade  MatCitYear  MatCitMonth  MatCitCollector
  # 1         78F03CF8FFE2FFE5C0C4F883FE73F8B4  3419301320                          0             0           0            1888 - 1890 & Morong, T.
  # 1         78F03CF8FFE5FFE2C187FB83FD0AFB94  3419301397                          0             0           0            1914 & Chodat, R.
  # 1         1FFD3CFF806D3D11C410027311B3FEAC  4012799597              1980-09-19  1980          1980        9            1980 - Sino- American Botanical Expedition
  # 1         AFA17A73FFA8F2414DA6F9AB94DCF942  3466701331                          0             0           0            20. 8.201 3 & Delage, A.
  # ...                                                                                                                    
  # 1         3B7F3CD7FFEDFFF5FB68FCBD4061FCB8  3072658352              2017-07-05  2010          2017        7            Z. Z. Xia
  # 1         3B5C3CD3FF9FFFACFCCB2B09BAD0FE79  1699618906              2002-06-25  2000          2002        6            Z. Z. Yang
  # 1         B5B23CA2C006FF87FB6FF9CBFA17F94A  2028140173              2009-08-18  2000          2009        8            Z. Z. Yang
  # 1         3B063C92F16FFF93DA9FFC4DFEDB1D0B  3866542316              2015-06-08  2010          2015        6            ZZ Zhang
  # 1         3B7C3CAD6B18FFBCADDEFA01FE543FE5  3034555558              1956-06-20  1950          1956        6            А. Schnitnikov
```



In [2]:
import json
import requests
import pandas as pd
import time
import pprint

# https://tb.plazi.org/GgServer/srsStats/stats?
#   outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   FP-matCit.gbifOccurrenceId=!0
#   &
#   FP-matCit.collector=%3E0
#   &
#   format=TSV
url = 'https://tb.plazi.org/GgServer/srsStats/stats'
params = [
    ('outputFields',   'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('groupingFields', 'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('FP-matCit.gbifOccurrenceId', '!0'),
    ('FP-matCit.collector', '>0'),
    ('format', 'JSON')
]

start_time = time.time()
print("Send data request to" , url)

response = requests.get(url, params=params)
result = response.json()  # named `result` to avoid shadowing the built-in `dict`
collectors = result['data']

print("Response of %s came in %s seconds (HTTP-code: %s)" % (
    url, 
    (time.time() - start_time), 
    response.status_code)
)

start_time = time.time()
print("Normalize JSON data with pandas …")

df = pd.json_normalize(collectors)

print("Normalization took %s seconds" % (time.time() - start_time) )

print("Print data sample …")
df



Send data request to https://tb.plazi.org/GgServer/srsStats/stats
Response of https://tb.plazi.org/GgServer/srsStats/stats came in 12.322981357574463 seconds (HTTP-code: 200)
Normalize JSON data with pandas …
Normalization took 2.3167567253112793 seconds
Print data sample …


Unnamed: 0,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth,MatCitCollector
0,1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0,"1888 - 1890 & Morong, T."
1,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0,"1914 & Chodat, R."
2,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9,1980 - Sino- American Botanical Expedition
3,1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0,"20. 8.201 3 & Delage, A."
4,1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0,"20. IX. 1957 & fr., Service Forestier"
...,...,...,...,...,...,...,...,...
423968,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7,Z. Z. Xia
423969,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6,Z. Z. Yang
423970,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8,Z. Z. Yang
423971,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6,ZZ Zhang


In [3]:
list(df.columns)

['DocCount',
 'MatCitId',
 'MatCitGbifOccurrenceId',
 'MatCitDate',
 'MatCitDecade',
 'MatCitYear',
 'MatCitMonth',
 'MatCitCollector']

In [4]:
# move 'MatCitCollector' to be the first column (prepare parsing names for bin/agent_parse4tsv.rb: collectors in the 1st column)
col = df.pop("MatCitCollector")
df.insert(0, col.name, col)
df

Unnamed: 0,MatCitCollector,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,"1888 - 1890 & Morong, T.",1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0
1,"1914 & Chodat, R.",1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0
2,1980 - Sino- American Botanical Expedition,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9
3,"20. 8.201 3 & Delage, A.",1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0
4,"20. IX. 1957 & fr., Service Forestier",1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0
...,...,...,...,...,...,...,...,...
423968,Z. Z. Xia,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7
423969,Z. Z. Yang,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6
423970,Z. Z. Yang,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8
423971,ZZ Zhang,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6


## Write the Output Data

Write the source data and set some global script variables.


In [1]:
import os
import time

if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

# Set some global variables
# this_timestamp_for_data=time.strftime('%Y%m%d') # 20230719
this_timestamp_for_data=20230719

this_name_source_file=\
  os.path.join("data", ("plazi_GbifOccurrenceId_CitCollector_%s.tsv" % this_timestamp_for_data))
this_name_source_file_parsed=\
  os.path.join("data", ("plazi_GbifOccurrenceId_CitCollector_%s_parsed.tsv" % this_timestamp_for_data))

if 'df' in locals():
    df.to_csv(this_name_source_file, sep='\t', index=False # skip the index
        # , header=["custom_colname_1", "custom_colname_2", "…"] # could rewrite header labels
    )
    print("Wrote data results into %s (%d kB)" % (
        this_name_source_file
        , os.path.getsize(this_name_source_file) >> 10
          # right shift by 10 bits = integer division by 1024 (bytes to kB)
        )
    )
else:
    if os.path.exists(this_name_source_file):
        print("Recent data from a Plazi data query was not found, but a data result file exists\nand can be used from %s (%d kB).\nIn this script we use:\n- %s\n- %s\n- timestamp: %s" % 
            (this_name_source_file
             , os.path.getsize(this_name_source_file) >> 10 # right shift by 10 bits = integer division by 1024 (bytes to kB)
             , this_name_source_file
             , this_name_source_file_parsed
             , this_timestamp_for_data
            )
        )
    else:
        print("No source data found that can be analysed (%s)"
        "\nRun a new data request on Plazi again or set a different name source file." % this_name_source_file)



Recent data from a Plazi data query was not found, but a data result file exists
and can be used from data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv (34790 kB).
In this script we use:
- data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv
- data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
- timestamp: 20230719
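
The `>> 10` in the size report is a right shift by 10 bits, which equals an integer division by 2^10 = 1024 and so converts bytes to (binary) kilobytes. A small check, using a made-up byte count:

```python
size_in_bytes = 35_625_000  # hypothetical file size, for illustration only

kilobytes_by_shift = size_in_bytes >> 10       # drop the lowest 10 bits
kilobytes_by_division = size_in_bytes // 1024  # integer division by 2**10

print(kilobytes_by_shift, kilobytes_by_division)
```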


## Parse Collector Names

Now you can parse the names with dwcagent, if the collector names are in the first column:

```bash
cd bin
ruby agent_parse4tsv.rb \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv

# alternatively, measure the running time of the parsing script with `time`;
# prefix the command with `nice` («nice ruby …») if the process loads the system too much;
# add --logfile to get a log of the skipped names

time ruby agent_parse4tsv.rb --logfile \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
# -------------------------
# Done.
# We have 24838 empty parsing cleaned results detected.
#   You can also use --develop to get a full result table including the used source data of each parsed line
# Wrote log file of skipped names to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv_dwcagent_3.0.11.0.log
# Wrote data to
#   ../data/plazi_GbifOccurrenceId_CitCollector_20230719_parsed.tsv
# -------------------------
# 
# real    6m23,077s
# user    3m40,778s
# sys     2m4,969s
```
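
Since the parser reads the collector names from the first column, it can be worth checking the header of the TSV file before starting a run of several minutes. A minimal sketch with the standard `csv` module; the helper name `first_column_is` is hypothetical, and the in-memory sample stands in for the real file:

```python
import csv
import io

def first_column_is(tsv_file, expected="MatCitCollector"):
    """Return True if the first header field of a tab-separated file is `expected`."""
    header = next(csv.reader(tsv_file, delimiter="\t"))
    return header[0] == expected

# in-memory sample instead of data/plazi_GbifOccurrenceId_CitCollector_20230719.tsv
sample = io.StringIO(
    "MatCitCollector\tDocCount\tMatCitId\n"
    "Z. Z. Xia\t1\t3B7F3CD7FFEDFFF5FB68FCBD4061FCB8\n"
)
print(first_column_is(sample))
```

For the real file, pass `open(this_name_source_file, newline='', encoding='utf-8')` instead of the sample.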

## Load WikiData Names and Parsed Collector Data

This procedure follows Niels Klazenga’s `match_names_to_wikidata_items.ipynb` (<https://github.com/nielsklazenga/avh-collectors/blob/47c3374f02bea4064b1c6708d79bcd9ba55a08a0/match_names_to_wikidata_items.ipynb>).

First use [`create_wikidata_datasets_botanists.ipynb`](create_wikidata_datasets_botanists.ipynb) to generate the WikiData botanist data, then load these data to prepare the matching with your own data:

In [2]:
import pandas as pd
wikidata = pd.read_csv("data/wikidata_persons_botanists_20230703_1352.csv", index_col=0, low_memory=False)

wikidata.head()

Unnamed: 0,item,itemLabel,surname,initials,canonical_string,canonical_string_fullname,orcid,viaf,isni,harv,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,http://www.wikidata.org/entity/Q66612,Friedrich August Marschall von Bieberstein,Bieberstein,F.A.M.v.,"Bieberstein, F.A.M.v.","Bieberstein, Friedrich August Marschall von",,43340073,0000 0001 1630 5464,1373,...,Q66612,1768.0,1826.0,,,http://www.wikidata.org/wiki/Q66612,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/6129-1,https://bionomia.net/Q66612
1,http://www.wikidata.org/entity/Q66934,Hans Hermann Behr,Behr,H.H.,"Behr, H.H.","Behr, Hans Hermann",,20328622,0000 0001 1604 8680,42741,...,Q66934,1818.0,1904.0,,,http://www.wikidata.org/wiki/Q66934,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/619-1,https://bionomia.net/Q66934
2,http://www.wikidata.org/entity/Q66661,Jacob Christian Schäffer,Schäffer,J.C.,"Schäffer, J.C.","Schäffer, Jacob Christian",,47016953,0000 0000 8343 3899,1101,...,,1718.0,1790.0,,,http://www.wikidata.org/wiki/Q66661,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/12818-1,
3,http://www.wikidata.org/entity/Q67003,Johann Friedrich Klotzsch,Klotzsch,J.F.,"Klotzsch, J.F.","Klotzsch, Johann Friedrich",,20426762,0000 0001 1749 2732,135,...,Q67003,1805.0,1860.0,,,http://www.wikidata.org/wiki/Q67003,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4855-1,https://bionomia.net/Q67003
4,http://www.wikidata.org/entity/Q66322,Franz Anton Menge,Menge,F.A.,"Menge, F.A.","Menge, Franz Anton",,59847236,0000 0001 1653 0899,73782,...,,1808.0,1880.0,,,http://www.wikidata.org/wiki/Q66322,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/23266-1,


In [5]:
# create the test data set from the WikiData data:
# group by canonical name string and count the duplicated names
wd_matchtest = wikidata.groupby('canonical_string').agg({'item': ['count']}).reset_index()
# TODO AP: meaning of wd_matchtest + count for merge later on? 

wd_matchtest.tail()

# colls = list(wikidata.columns)

Unnamed: 0_level_0,canonical_string,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
61479,"Șerbanescu, I.",1
61480,"Ștefureac, T.",1
61481,"Țopa, E.",1
61482,"Ḥalwaǧī, R.",1
61483,"Ḳushnir, Ṭ.",1


In [6]:
wd_matchtest_fullnames = wikidata.groupby('canonical_string_fullname').agg({'item': ['count']}).reset_index()

wd_matchtest_fullnames

Unnamed: 0_level_0,canonical_string_fullname,item
Unnamed: 0_level_1,Unnamed: 1_level_1,count
0,"(-Walraevens), O Heylen",1
1,"(1835-1906), Gustav Adolf Ferdinand Eichler",1
2,"(1873-1926), Søren Sørensen",1
3,"(1888–1973), Georges André",1
4,"(1904-1990), Johannes Johannessen",1
...,...,...
63605,"Șerbanescu, Ioan",1
63606,"Ștefureac, Traian",1
63607,"Țopa, Emilian",1
63608,"Ḥalwaǧī, Riyāḍ",1


In [11]:
# atomized names parsed already by ruby gem package: dwcagent

collectors = pd.read_csv(this_name_source_file_parsed, sep="\t", low_memory=False)

def convert_to_time_periode(x, freq='ms'):
    """Convert a date string to a pd.Period; return pd.NaT if it cannot be converted."""
    try:
        return pd.Period(x, freq=freq)
    except Exception:
        # TODO check and curate date string values
        return pd.NaT

print("modify MatCitDate to periode and remove some 0 time values...")

for col in ['MatCitDate']:
    print("- convert", col, "to pd.Period(...) in collectors ...")
    collectors[col] = collectors[col].apply(lambda x: convert_to_time_periode(x, freq='ms'))
    
for col in ['MatCitMonth', 'MatCitDecade', 'MatCitYear']:
    print("- replace in col", col,"0 by NA ...")
    collectors[col] = collectors[col].replace(0, pd.NA)
print("Done modifying.")    

collectors.dropna(subset=['family'], inplace=True) # remove where family was NA, e.g. from originally «??» aso.
collectors

modify MatCitDate to periode and remove some 0 time values...
- convert MatCitDate to pd.Period(...) in collectors ...
- replace in col MatCitMonth 0 by NA ...
- replace in col MatCitDecade 0 by NA ...
- replace in col MatCitYear 0 by NA ...
Done modifying.


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,Chodat,R.,,,,,,,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,NaT,,,
1,Delage,A.,,,,,,,1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,NaT,,,
2,Crouzet,N.,,,,,,,1,D1FB3E5ECB2F3138FF35F270690D00CF,3426268604,NaT,,,
3,Mayo,,,de,,,,,1,31ADD85BA138FFE3FF45A111FB90F6CB,3421410670,2001-01-18 00:00:00.000,2000,2001,1
4,Garcete,B.,,,,,,,1,31ADD85BA138FFE3FF45A111FB90F6CB,3421410670,2001-01-18 00:00:00.000,2000,2001,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
596665,Xia,Z.Z.,,,,,,,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05 00:00:00.000,2010,2017,7
596666,Yang,Z.Z.,,,,,,,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25 00:00:00.000,2000,2002,6
596667,Yang,Z.Z.,,,,,,,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18 00:00:00.000,2000,2009,8
596668,Zhang,Z.Z.,,,,,,,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08 00:00:00.000,2010,2015,6


#### Check Composition of Parsed Collector Data

In [12]:
# TODO review code of abbreviated names and full name matching
criterion_fullnames = collectors.given.str.contains(r'^\w{3,}', na=False)
print("Show collectors whose given name is (probably) a full name (%s records) …" % len(collectors[criterion_fullnames].index))
collectors[criterion_fullnames]

Show collectors whose given name is (probably) a full name (60118 records) …


Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
813,State,Santa Catarina,,,,,,,1,D3EB3CAEFFC7FFCB1C193D2BFC69FBE2,3400646364,1996-05-03 00:00:00.000,1990,1996,5
1242,Hornuni,Bajo,,,,,,,1,3B0DED47FF80AB2EA965308E8A530E37,2332229252,NaT,,,
1453,Smith,Aaron D.,,,,,,,1,EFE9AF18FF89FFD7DD26FE7E9AF2FDC3,2625368377,NaT,,,
1454,Smith,Aaron D.,,,,,,,1,EFE9AF18FF8CFFD2DE6FFEEE9C3FFE08,2625368379,NaT,,,
1455,Smith,Aaron D.,,,,,,,1,EFE9AF18FF89FFD7DE85FAE99BBBFA4A,2625368398,NaT,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
596253,Jin,Zuyin,,,,,,,1,C6823CA8FFABFFF8FF4DFED6FBD6FE97,3311613465,NaT,,,
596276,Distr,Zvenihorodka,,,,,,,1,3B04D01167DEB5A6C4754EE23A4E249F,2608713023,NaT,,,
596278,Distr,Zvenyhorodka,,,,,,,1,3B04D0116739B541C0CE4F9E3C112457,2608712784,NaT,,,
596297,Zweifel,Guaymas,,,,,,,1,A7D63277BF31D63C4EF6323CA4989878,3067212370,1960-08-20 00:00:00.000,1960,1960,8
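
The pattern `^\w{3,}` accepts a given name only if it starts with at least three word characters in a row. Initials such as `Z.Z.` are therefore rejected, while place names that the parser put into the given-name field (e.g. `Santa Catarina` above) still pass. A small sketch with the standard `re` module:

```python
import re

starts_with_full_word = re.compile(r"^\w{3,}")  # 3 or more word characters at the start

for given in ["Aaron D.", "Z.Z.", "R.", "Santa Catarina"]:
    print(given, "->", bool(starts_with_full_word.search(given)))
```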


In [13]:
# check whether the parsed name-part columns are empty or need to be considered as data for matching
import pprint
for parsed_name_part in ["particle", "suffix", "dropping_particle", "appellation"]:
    test_collectors = collectors.loc[collectors[parsed_name_part].notna()]
    print("\n----------------------------------------\nshow names with **%s** found %s records:\n" % (parsed_name_part, len(test_collectors.index)))
    display(test_collectors.head())


----------------------------------------
show names with **particle** found 19827 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
3,Mayo,,,de,,,,,1,31ADD85BA138FFE3FF45A111FB90F6CB,3421410670,2001-01-18 00:00:00.000,2000,2001,1
270,A. A. Girault,G.,,as,,,,,1,E4E73CEFE566FFFE6C4A0CFC1E2D5BF7,3743912342,1909-08-25 00:00:00.000,1900,1909,8
271,A. A. Girault,G.,,as,,,,,1,E4E73CEFE566FFFE6C040CD1191C5BD2,3743912408,1910-07-01 00:00:00.000,1910,1910,7
1070,Grave,S.,,De,,,,,1,84B80478FFF8FFBFFE9DFC0A2796D4CB,3026647302,2019-09-25 00:00:00.000,2010,2019,9
1074,Grave,S.,,De,,,,,1,84B80478FFF8FFBFFF2FFC2E278FD337,3026647304,2019-09-25 00:00:00.000,2010,2019,9



----------------------------------------
show names with **suffix** found 769 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
728,Braul,A.,Jr.,,,,,,1,3B1AA47D24368929DD80AAE03C1925BF,1977784106,NaT,,,
730,Braul,A.,Jr.,,,,,,1,3B1AA47D24368929DE4DAA283A792583,1977784095,1992-08-04 00:00:00.000,1990.0,1992.0,8.0
732,Braul,A.,Jr.,,,,,,1,3B1AA47D24368929D9F1AAE33CC725DB,1977783800,1993-08-02 00:00:00.000,1990.0,1993.0,8.0
3369,Creek,Abingdon,SR,,,,,,1,3B7E8260FFAB730BC762FF05FECFECA1,4089505452,2011-04-18 00:00:00.000,2010.0,2011.0,4.0
3533,Kingman,Abner,Jr.,,,,,,1,FC407A7EAD115D2C3F4EA0A25F0B203C,1058480287,1992-07-21 00:00:00.000,1990.0,1992.0,7.0



----------------------------------------
show names with **dropping_particle** found 0 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth



----------------------------------------
show names with **appellation** found 278 records:



Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
22091,Longfield,C.,,,,,Miss,,1,15897E5611E2679F420F33806B931C6B,4127603396,1927-05-30 00:00:00.000,1920.0,1927.0,5.0
28309,Farren,,,,,,Mrs.,,1,3B2D3C96831E9D43FF4D4018FB442853,1585880133,1809-01-01 00:00:00.000,1800.0,1849.0,9.0
32536,Carr,,,,,,Mrs,,1,3B693415875AFFDF80BC40A8FC707060,3399886349,1957-03-01 00:00:00.000,1950.0,1957.0,3.0
81694,Barrett,F.,,,,,Miss,,1,3B77D118FFF6E35AE4DBFD1BFC96FD26,2610429310,NaT,,,
88072,Cooper,,,,,,Mr,,1,3B61A0705C0EFFC86382FBA0FE73FA38,3328594336,1802-12-08 00:00:00.000,1800.0,1802.0,12.0


Compile the `canonical_string...` column of the collector data, against which we will later match the WikiData names:

In [14]:
collectors['canonical_string_collector_parsed'] = (
  # use collectors.family only where given name has NA values, otherwise use family name + given name
  collectors.family.where(
      # condition
      collectors.given.isna(),
      # TODO improve the combined name for canonical_string_collector_parsed if any of the other dwc_parsed fields is not NaN
      # NOTE: any(...) is evaluated once for the whole column, not per row; as long as
      # at least one row has no particle, every combined name is "family, given"
      other= (collectors.family + ", " + collectors.given) \
        if any(collectors.particle.isna()) \
        else collectors.particle + " " + collectors.family \
         + ", " + collectors.given
  )
)

# move 'canonical_string_collector_parsed' after the column 'title'
col = collectors.pop("canonical_string_collector_parsed")
collectors.insert(collectors.columns.get_loc('title') + 1, col.name, col)
collectors.tail()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
596665,Xia,Z.Z.,,,,,,,"Xia, Z.Z.",1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05 00:00:00.000,2010,2017,7
596666,Yang,Z.Z.,,,,,,,"Yang, Z.Z.",1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25 00:00:00.000,2000,2002,6
596667,Yang,Z.Z.,,,,,,,"Yang, Z.Z.",1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18 00:00:00.000,2000,2009,8
596668,Zhang,Z.Z.,,,,,,,"Zhang, Z.Z.",1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08 00:00:00.000,2010,2015,6
596669,Schnitnikov,А.,,,,,,,"Schnitnikov, А.",1,3B7C3CAD6B18FFBCADDEFA01FE543FE5,3034555558,1956-06-20 00:00:00.000,1950,1956,6
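
In the cell above, the `if any(…) else …` expression is evaluated once for the whole column, so it does not choose per row. If the particle should be prefixed only in those rows that have one, a row-wise variant could look like the following sketch. It uses invented sample rows and a hypothetical column name `canonical`, and it also keeps the particle when the given name is missing, which is an assumption about the desired result:

```python
import pandas as pd

# invented sample rows in the shape of the parsed collector data
names = pd.DataFrame({
    "family":   ["Mayo", "Grave", "Xia", "Šumpich"],
    "given":    [None, "S.", "Z.Z.", None],
    "particle": ["de", "De", None, None],
})

# prefix the family name with the particle only in rows that have a particle
family_full = names["family"].where(
    names["particle"].isna(),
    names["particle"] + " " + names["family"],
)
# append the given name only in rows that have a given name
names["canonical"] = family_full.where(
    names["given"].isna(),
    family_full + ", " + names["given"],
)
print(names["canonical"].tolist())
```

Applying such a change to the real data would alter the grouping that follows, so the counts shown in this notebook would no longer match.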


In [10]:
# collectors=collectors.add_suffix('_parsed') \
#  if not any(col.endswith("_parsed") for col in list(collectors.columns))

In [15]:
collectors.dtypes

family                                  object
given                                   object
suffix                                  object
particle                                object
dropping_particle                      float64
nick                                   float64
appellation                             object
title                                   object
canonical_string_collector_parsed       object
DocCount                                 int64
MatCitId                                object
MatCitGbifOccurrenceId                   int64
MatCitDate                           period[L]
MatCitDecade                            object
MatCitYear                              object
MatCitMonth                             object
dtype: object

In [16]:
# group and aggregate data to have unique name rows only for the matching of names later on
collectors_unique=collectors.groupby(['canonical_string_collector_parsed']).agg(
    family=('family', lambda x: list(x)[0]),
    given=('given', lambda x: list(x)[0]),
    suffix=('suffix', lambda x: list(x)[0]),
    particle=('particle', lambda x: list(x)[0]),
    dropping_particle=('dropping_particle', lambda x: list(x)[0]),
    nick=('nick', lambda x: list(x)[0]),
    appellation=('appellation', lambda x: list(x)[0]),
    title=('title', lambda x: list(x)[0]),
    DocCount_count= ('DocCount', 'sum'), # sum of DocCount over all records of the name
    MatCitId_firstsample=('MatCitId', lambda x: list(x)[0]), # custom function, to get the first entry
    MatCitGbifOccurrenceId_firstsample=('MatCitGbifOccurrenceId', lambda x: list(x)[0]), # custom function, to get the first entry
    MatCitDate_mean=('MatCitDate', 'mean'),
    MatCitDate_min=('MatCitDate', 'min'),
    MatCitDate_max=('MatCitDate', 'max'),
    # MatCitDecade_mean=('MatCitDecade', 'mean'),
    # MatCitDecade_min=('MatCitDecade', 'min'),
    # MatCitDecade_max=('MatCitDecade', 'max'),
    MatCitYear_mean=('MatCitYear', 'mean'),
    MatCitYear_min=('MatCitYear', 'min'),
    MatCitYear_max=('MatCitYear', 'max')
    # MatCitMonth_mean=('MatCitMonth', 'mean'),
    # MatCitMonth_min=('MatCitMonth', 'min'),
    # MatCitMonth_max=('MatCitMonth', 'max')
).reset_index()

# move canonical_string_collector_parsed after column title
col = collectors_unique.pop("canonical_string_collector_parsed")
collectors_unique.insert(collectors_unique.columns.get_loc('title') + 1, col.name, col)

display(collectors_unique)

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,MatCitId_firstsample,MatCitGbifOccurrenceId_firstsample,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max
0,A,,,,,,,,A,49,3B303CFDB311FF86FA9D8D16FCEB09EB,1914265692,1999-07-25 12:00:00.000,1893-07-01 00:00:00.000,2021-01-29 00:00:00.000,1999.113636,1893.0,2021.0
1,A Jorio,W.,,,,,,,"A Jorio, W.",1,CCDB3C93FFE0F42725467900AA4DD270,1585189534,2010-03-28 00:00:00.000,2010-03-28 00:00:00.000,2010-03-28 00:00:00.000,2010.0,2010.0,2010.0
2,A,Acuna E.E.,,,,,,,"A, Acuna E.E.",2,3B183CD2A46DFFABFF58FC35FE82FB92,3464288392,1960-07-17 00:00:00.000,1960-07-17 00:00:00.000,1960-07-17 00:00:00.000,1960.0,1960.0,1960.0
3,A,Ae,,,,,,,"A, Ae",3,B9AF7B1CFFACE27585C0FBAF12D3FAB7,1438449014,NaT,NaT,NaT,,,
4,A,Agrobosques S.,,,,,,,"A, Agrobosques S.",1,3B083C841A434F0C70EFF861ECEEFF3E,1701220194,1991-01-23 00:00:00.000,1991-01-23 00:00:00.000,1991-01-23 00:00:00.000,1991.0,1991.0,1991.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121558,Štěpánek,J.,,,,,,,"Štěpánek, J.",3,7D832E77FFD3FF81FE6635BA52C0F848,4068767324,1985-10-25 00:00:00.000,1985-10-25 00:00:00.000,1985-10-25 00:00:00.000,1985.0,1985.0,1985.0
121559,Šumpich,,,,,,,,Šumpich,4,3B413CDBCE21492DA278FADDFEC9FA4B,3987425391,2016-12-30 12:00:00.000,2015-06-21 00:00:00.000,2019-06-27 00:00:00.000,2016.5,2015.0,2019.0
121560,Τanikawa,A.,,,,,,,"Τanikawa, A.",10,3B1A470EFFB4463AE13EFCB8ED7A63E3,1229615812,1986-06-09 18:40:00.000,1984-05-03 00:00:00.000,1988-08-19 00:00:00.000,1985.888889,1984.0,1988.0
121561,ҫa,F. A. Mendon,,,,,,,"ҫa, F. A. Mendon",5,3B373C827748543A9E5EFA08EF80DAC0,4037809325,1937-04-25 04:48:00.000,1937-04-16 00:00:00.000,1937-05-06 00:00:00.000,1937.0,1937.0,1937.0


### Set Up the Text Search

See <https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536>

The ngrams function is used as an analyzer in the text search later.

In [17]:
# some example data
print('(WikiData’s) canonical_string = (constructed) canonical_string_fullname')
for row in range(5):
    pprint.pprint("%s = %s" % (
        wd_matchtest['canonical_string'].at[row],
        wd_matchtest_fullnames['canonical_string_fullname'].at[row]
    ))

(WikiData’s) canonical_string = (constructed) canonical_string_fullname
'(-Walraevens), O.H. = (-Walraevens), O Heylen'
'(1835-1906), G.A.F.E. = (1835-1906), Gustav Adolf Ferdinand Eichler'
'(1873-1926), S.S. = (1873-1926), Søren Sørensen'
'(1888–1973), G.A. = (1888–1973), Georges André'
'(1904-1990), J.J. = (1904-1990), Johannes Johannessen'


In [18]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text

def ngrams(string, n=3):
    """Clean a name string and split it into overlapping character n-grams."""
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() # remove non-ASCII chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # replace multiple spaces with a single one
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD',r'', string) # remove , - . / and whitespace followed by "BD"
    grams = zip(*[string[i:] for i in range(n)]) # sliding window of n characters
    return [''.join(gram) for gram in grams]

In [19]:
print("Show ngram examples:")
print("- simple name:", ngrams('Klazenga, N.'))
print("- data from collectors:", ngrams(collectors_unique["canonical_string_collector_parsed"].at[1])) 
print("- data from match-test:", ngrams(wd_matchtest['canonical_string'].at[0]))
print("- data from match-test (full name):", ngrams(wd_matchtest_fullnames['canonical_string_fullname'].at[3]))

Show ngram examples:
- simple name: [' Kl', 'Kla', 'laz', 'aze', 'zen', 'eng', 'nga', 'ga ', 'a N', ' N ']
- data from collectors: [' A ', 'A J', ' Jo', 'Jor', 'ori', 'rio', 'io ', 'o W', ' W ']
- data from match-test: [' Wa', 'Wal', 'alr', 'lra', 'rae', 'aev', 'eve', 'ven', 'ens', 'ns ', 's O', ' Oh', 'Oh ']
- data from match-test (full name): [' 18', '188', '888', '881', '819', '197', '973', '73 ', '3 G', ' Ge', 'Geo', 'eor', 'org', 'rge', 'ges', 'es ', 's A', ' An', 'And', 'ndr', 'dr ']
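
Apart from the cleaning, the core of `ngrams` is a sliding window over the padded string. A stripped-down sketch of just that step (the helper name `char_ngrams` is hypothetical, and it does none of the cleaning):

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, padded with one space on each side."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("Klazenga N"))
```

For the already cleaned string `Klazenga N` this gives the same trigrams as the “simple name” example above.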


Vectorize the WikiData names. Background: we use an information retrieval technique, TF-IDF (term frequency, inverse document frequency; see the blog post [towardsdatascience.com/tf-idf-explained…](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275)), to match the source names with the WikiData names. The distance measure calculated for each name match helps to match similar names and to tell apart names that are rather no match. In general, see also <https://scikit-learn.org> and <https://pypi.org/project/scikit-learn/>.
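
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting and a nearest-neighbour search on character trigrams. It is an illustration only, not the code used in this notebook: the helper names and the sample names are invented, and the weighting assumes scikit-learn's default settings (smoothed IDF, L2-normalised vectors, Euclidean distance).

```python
import math
from collections import Counter

def trigrams(name):
    padded = " " + name + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def normalise(vec):
    """Scale a sparse {trigram: weight} vector to length 1 (L2 norm)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec

def tfidf_vectors(documents):
    """Return one normalised TF-IDF vector per document, plus the IDF table."""
    counts = [Counter(trigrams(doc)) for doc in documents]
    doc_freq = Counter(gram for c in counts for gram in c)  # documents per trigram
    n_docs = len(documents)
    # smoothed IDF: rare trigrams get a higher weight than common ones
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [normalise({g: tf * idf[g] for g, tf in c.items()}) for c in counts], idf

def vectorise_query(name, idf):
    # trigrams that never occur in the reference names are dropped
    c = Counter(g for g in trigrams(name) if g in idf)
    return normalise({g: tf * idf[g] for g, tf in c.items()})

def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

reference = ["Klazenga N", "Klotzsch J F", "Behr H H"]  # invented reference names
vectors, idf = tfidf_vectors(reference)

query = vectorise_query("Klazenga Niels", idf)
distances = [euclidean(query, v) for v in vectors]
best = min(range(len(reference)), key=distances.__getitem__)
print(reference[best], round(distances[best], 3))
```

A smaller distance means a closer match; names that share no trigram at all end up at the maximum distance of √2 for such normalised vectors.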


### Perform the Matching

Perform the nearest neighbour (NN) matching on the (Plazi) collector names and create a data frame with the matches. We try to distinguish abbreviated and full names in the source to better match the source data with WikiData. (This can take 10 to 30 minutes.)

Now convert a collection of raw documents to a matrix of TF-IDF features and set up the function that performs the matches...

In [20]:
# pip install --upgrade scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# nbrs_data = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf_vector_data) 
# tfidf_vector_data contains the vectorized wikidata names from the previous step


def getNearestNeighbour(query, this_vectorizer, this_nbrs_data):
    """Calculate the k-nearest distance for query data using package scikit-learn


    @param query: DataFrame the query data to vectorize and transform
    @param this_vectorizer: the vectorizer of TfidfVectorizer
    @param this_nbrs_data: the data of NearestNeighbors calculations
    @return: (distances, indices) distances and indices
    @rtype (int, int)
    """
    queryTFIDF_ = this_vectorizer.transform(query)
    distances, indices = this_nbrs_data.kneighbors(queryTFIDF_)
    return distances, indices


def calculateTFIDFmatchingOfData(query_data, match_data, n_neighbors=1):
    """
    Calculate a TF-IDF (Term Frequency-Inverse Document Frequency) matching with getNearestNeighbour()

    @param query_data: usually a pandas data column (or array) of names or strings to query for
    @param match_data: pandas Series of names or strings to match against
    @param n_neighbors: Number of neighbors required for each sample by default for :meth:`kneighbors` queries (originally 5); only the nearest neighbour is kept in the result.

    @requires getNearestNeighbour()
    @requires ngrams()
    @requires TfidfVectorizer()
    @requires NearestNeighbors()

    @return: DataFrame a data frame of matches with columns 'namematch_source_data', 'namematch_resource_data', 'namematch_distance'
    """

    import time
    start = time.time()
    query_data = set(query_data)
    # convert list to set: removes duplicate names, so fewer queries have to be matched

    print('Vectorizing data. This may take a while...')
    # vectorize wikidata names
    vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
    tfidf_vector_data = vectorizer.fit_transform(match_data
        # wd_matchtest['canonical_string']
    )
    nbrs_data = NearestNeighbors(n_neighbors=n_neighbors, n_jobs=-1).fit(tfidf_vector_data)
    duration = time.time() - start
    print('Vectorizing completed: created a matrix of TF-IDF features after %s s' % duration)

    print('Getting nearest neighbours of %s data with %s neighbor sample(s)...' % (len(query_data), n_neighbors))
    distances, indices = getNearestNeighbour(query_data, vectorizer, nbrs_data)
    duration = time.time() - start
    print('Completed after %s s' % duration)

    query_data = list(query_data)  # convert back to list

    print('Finding matches build new data frame ...')
    matches = []
    for i, j in enumerate(indices):
        # keep only the nearest neighbour (position 0), even if n_neighbors > 1
        temp = [query_data[i], match_data.values[j][0], round(distances[i][0], 2)]
        matches.append(temp)

    duration = time.time() - start
    print('Building matches done after %s s' % duration)
    matches = pd.DataFrame(
        matches,
        columns=['namematch_source_data', 'namematch_resource_data', 'namematch_distance']
    )

    print('Done')
    return matches
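
To see what `n_neighbors` changes, the following sketch inspects the arrays returned by `kneighbors()`. It uses made-up 2-D points instead of TF-IDF vectors, so the numbers themselves are illustrative only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy reference points (stand-ins for the TF-IDF vectors of the Wikidata names)
reference_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query_points = np.array([[0.1, 0.0], [2.9, 3.0]])

nn_index = NearestNeighbors(n_neighbors=3).fit(reference_points)
distances, indices = nn_index.kneighbors(query_points)

# one row per query, one column per neighbour
print(distances.shape, indices.shape)

# position 0 of each row is the nearest neighbour
nearest_only = indices[:, 0]
print(nearest_only)
```

Each row is sorted by increasing distance. Because `calculateTFIDFmatchingOfData()` reads only position 0, a larger `n_neighbors` adds computation but does not change the resulting data frame.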

In [21]:
print("Calculate matching for **abbrevated** names separately …")

criterion_fullnames = collectors_unique.given.str.contains('^\w{3,}', na=False)
collectors_names = collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values
# collectors_names = set(collectors_unique['canonical_string_collector_parsed'][[not fullname for fullname in criterion_fullnames]].values)
matches = calculateTFIDFmatchingOfData(collectors_names, wd_matchtest['canonical_string'], 5) # TODO what effect has n_neighbors ? originally in the very source code it is set to 5, not 1

matches = matches.sort_values(['namematch_distance'])
matches = matches.reset_index(names=['old_index'])

matches

Calculate matching for **abbrevated** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 2.883341073989868 s
Getting nearest neighbours of 100488 data with 5 neighbor sample(s)...
Completed after 592.6211576461792 s
Finding matches build new data frame ...
Building matches done after 593.5970041751862 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,0,"Calatayud, G.","Calatayud, G.",0.00
1,21092,"Forest, F.","Forest, F.",0.00
2,63597,"Staudt, A.","Staudt, A.",0.00
3,5396,"Fang, Z.D.","Fang, Z.D.",0.00
4,91138,"Dima, B.","Dima, B.",0.00
...,...,...,...,...
100483,59193,Verkhnyodniprosk,"Vattiprolu, P.K.",1.26
100484,71046,"Swamp, N. D. Te Paki Coastal Reserve Wirihi","Serve, L.",1.26
100485,37301,Itaipuacu,"Capua, E.D.",1.27
100486,42260,Rasoavimbahoaka,"Timbal, J.",1.27
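
A note on reading the `namematch_distance` values: `TfidfVectorizer` L2-normalizes each vector by default, and `NearestNeighbors` uses the Euclidean metric by default. Assuming these defaults (the code above does not override them), the distance d between two non-empty vectors relates to their cosine similarity as d = sqrt(2 - 2·cos). The sketch below only evaluates this relation for a few values.

```python
import math


def distance_from_cosine(cosine_similarity):
    """Euclidean distance between two unit-length vectors with the given cosine similarity."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity))


for cosine in (1.0, 0.9, 0.5, 0.0):
    print('cosine similarity %.1f -> distance %.2f' % (cosine, distance_from_cosine(cosine)))
```

TF-IDF weights are non-negative, so the cosine similarity lies between 0 and 1 and the distances lie between 0 and about 1.41. Values above 1 therefore indicate very little n-gram overlap.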


In [22]:
# criterion_fullnames = collectors_unique.given.str.contains('^\w{3,}', na=False)
print("Calculate matching for **full** names separately …")
collectors_fullnames = collectors_unique['canonical_string_collector_parsed'][criterion_fullnames].values
matches_fullnames = calculateTFIDFmatchingOfData(collectors_fullnames, wd_matchtest_fullnames['canonical_string_fullname'], 5) # TODO n_neighbors: originally 5 in the source code, not 1; only the nearest match is kept either way

matches_fullnames = matches_fullnames.sort_values(['namematch_distance'])
matches_fullnames = matches_fullnames.reset_index(names=['old_index'])

matches_fullnames

Calculate matching for **full** names separately …
Vectorizing data. This may take a while...
Vectorizing completed: created a matrix of TF-IDF features after 3.454129695892334 s
Getting nearest neighbours of 21075 data with 5 neighbor sample(s)...
Completed after 301.0935504436493 s
Finding matches build new data frame ...
Building matches done after 301.29733443260193 s
Done


Unnamed: 0,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,10147,"Dong, Wei","Dong, Wei",0.00
1,18573,"Chilton, Charles","Chilton, Charles",0.00
2,18556,"Li, Sai-Fei","Li, Sai Fei",0.00
3,4235,"Wang, Tao","Wang, Tao",0.00
4,19770,"Wang, Yong","Wang, Yong",0.00
...,...,...,...,...
21070,15570,"Bacuyo, Bofedal","Boffelli, Alessandro",1.28
21071,13288,"Prefecture, Jingpo Autonomous","Conomos, T John",1.28
21072,20989,"Reservoir, Jirau Hydroelectric","Iwata, Jirô",1.29
21073,4952,"Biosfera El Triunfo, Reserva","Trivelli, Piera",1.29


### Create Output Results

Combine the matches data frame back to the (Plazi) collectors and Wikidata items …
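
`pd.merge()` performs an inner join by default, so only names present in both frames survive the join. A minimal sketch with made-up rows (the column names follow the notebook, the values are illustrative):

```python
import pandas as pd

collectors = pd.DataFrame({
    'canonical_string_collector_parsed': ['Aarvik, L.', 'Abbott, W.', 'Unmatched, X.'],
    'DocCount_count': [62, 2, 1],
})
name_matches = pd.DataFrame({
    'namematch_source_data': ['Aarvik, L.', 'Abbott, W.'],
    'namematch_resource_data': ['Aarvik, L.', 'Abbott, W.L.'],
    'namematch_distance': [0.0, 0.57],
})

# default inner join: the collector without a match row is dropped
joined = pd.merge(
    collectors, name_matches,
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
)
print(joined)

# a left join with indicator=True keeps every collector and flags the unmatched ones
checked = pd.merge(
    collectors, name_matches, how='left', indicator=True,
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
)
unmatched = checked.loc[
    checked['_merge'] == 'left_only', 'canonical_string_collector_parsed'
].tolist()
print(unmatched)
```

The second merge is only a way to check whether any collector names were lost; it is not part of the notebook.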

In [23]:
# join matches data frame back to source collectors  dataframe 
collectors_matches = pd.merge(
    collectors_unique, matches, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,,,,,,,,A,49,...,1999-07-25 12:00:00.000,1893-07-01 00:00:00.000,2021-01-29 00:00:00.000,1999.113636,1893.0,2021.0,6785,A,"Mas, A.",1.16
1,A Jorio,W.,,,,,,,"A Jorio, W.",1,...,2010-03-28 00:00:00.000,2010-03-28 00:00:00.000,2010-03-28 00:00:00.000,2010.0,2010.0,2010.0,27811,"A Jorio, W.","Jorissenne, G.",1.12
2,A,Ae,,,,,,,"A, Ae",3,...,NaT,NaT,NaT,,,,79342,"A, Ae","Lee, A.E.",1.11
3,A,No,,,,,,,"A, No",8,...,1982-03-07 10:17:08.571,1953-01-13 00:00:00.000,1992-12-08 00:00:00.000,1981.857143,1953.0,1992.0,18947,"A, No","Noé, A.C.",0.82
4,A,Nr,,,,,,,"A, Nr",1,...,1979-04-01 00:00:00.000,1979-04-01 00:00:00.000,1979-04-01 00:00:00.000,1979.0,1979.0,1979.0,45617,"A, Nr","Rao, N.R.",0.84


In [24]:
# append full name matches
collectors_matches_fullname = pd.merge(
    collectors_unique, matches_fullnames, 
    left_on='canonical_string_collector_parsed', right_on='namematch_source_data'
    #, suffixes=(None, '_namematch') # append to left-data, right-data only when identical column names occur
)

collectors_matches_fullname.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
0,A,Acuna E.E.,,,,,,,"A, Acuna E.E.",2,...,1960-07-17 00:00:00.000,1960-07-17 00:00:00.000,1960-07-17 00:00:00.000,1960.0,1960.0,1960.0,7748,"A, Acuna E.E.","Lee, Gaik Ee",1.16
1,A,Agrobosques S.,,,,,,,"A, Agrobosques S.",1,...,1991-01-23 00:00:00.000,1991-01-23 00:00:00.000,1991-01-23 00:00:00.000,1991.0,1991.0,1991.0,296,"A, Agrobosques S.","Mosquera, Juan",1.17
2,A,Berkov,,,,,,,"A, Berkov",1,...,2013-12-29 00:00:00.000,2013-12-29 00:00:00.000,2013-12-29 00:00:00.000,2013.0,2013.0,2013.0,19331,"A, Berkov","Feráková, Viera",1.03
3,A,Boothia,,,,,,,"A, Boothia",1,...,NaT,NaT,NaT,,,,7682,"A, Boothia","Booth, R",0.94
4,A,Buchan-Hepburn B.C.,,,,,,,"A, Buchan-Hepburn B.C.",1,...,NaT,NaT,NaT,,,,3059,"A, Buchan-Hepburn B.C.","Hepburn, Ian",0.85


In [25]:
collectors_all_matches=pd.concat([collectors_matches, collectors_matches_fullname])
collectors_all_matches.sort_values(by=['namematch_distance', 'family'], ascending=[True, True], inplace=True)
collectors_all_matches.head()

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,MatCitDate_mean,MatCitDate_min,MatCitDate_max,MatCitYear_mean,MatCitYear_min,MatCitYear_max,old_index,namematch_source_data,namematch_resource_data,namematch_distance
51,Aagaard,K.,,,,,,,"Aagaard, K.",2,...,1986-07-01 00:00:00.000,1986-07-01 00:00:00.000,1986-07-01 00:00:00.000,1986.0,1986.0,1986.0,30011,"Aagaard, K.","Aagaard, K.",0.0
50,Aagaard,Kaare,,,,,,,"Aagaard, Kaare",6,...,NaT,NaT,NaT,,,,8148,"Aagaard, Kaare","Aagaard, Kaare",0.0
65,Aarvik,L.,,,,,,,"Aarvik, L.",62,...,2001-01-26 06:37:14.482,1955-01-01 00:00:00.000,2016-11-29 00:00:00.000,2000.37931,1955.0,2016.0,35398,"Aarvik, L.","Aarvik, L.",0.0
52,Aarvik,Leif,,,,,,,"Aarvik, Leif",10,...,2010-06-11 07:12:00.000,1993-01-01 00:00:00.000,2014-10-19 00:00:00.000,2009.8,1993.0,2014.0,1244,"Aarvik, Leif","Aarvik, Leif",0.0
64,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,1992-03-25 00:00:00.000,1992-03-25 00:00:00.000,1992-03-25 00:00:00.000,1992.0,1992.0,1992.0,47087,"Aarvik', L.","Aarvik, L.",0.0


Save the results...

In [26]:
import os

# create the output directory if it does not exist yet
os.makedirs('data', exist_ok=True)

this_output_file='data/plazi_collectors_matches_wikidata-botanists_%s.csv' % (this_timestamp_for_data)

collectors_all_matches.to_csv(this_output_file)

print(
    "Wrote matches of collector names into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" is a bit shift by 10 bits = integer division by 1024, to get kilobytes
) 

Wrote matches of collector names into data/plazi_collectors_matches_wikidata-botanists_20230719.csv (24055 kB)
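
The `>> 10` in the size report can be checked with a small stand-alone sketch (the temporary file and its size are made up for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    example_file = os.path.join(tmp_dir, 'example.csv')
    with open(example_file, 'wb') as handle:
        handle.write(b'x' * 5000)  # 5000 bytes of dummy content

    size_bytes = os.path.getsize(example_file)
    size_kb_shift = size_bytes >> 10    # bit shift by 10 bits
    size_kb_floor = size_bytes // 1024  # integer division by 1024 gives the same value

print(size_bytes, size_kb_shift, size_kb_floor)
```

Both expressions round down, so the reported size is in whole units of 1024 bytes.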


### Aggregate Matched Data

Now aggregate the data: if multiple names are found for a match, join the multiple results (items, labels, and so on) by “…|…”.

In [27]:
# now merge the matching data with the Wikidata data on the canonical name string
collectors_matches_tmp_names_abbreviated = pd.merge(
    collectors_matches, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string'
)
collectors_matches_tmp_fullnames = pd.merge(
    collectors_matches_fullname, wikidata, 
    left_on='namematch_resource_data', right_on='canonical_string_fullname'
)
collectors_matches_g1_merged_wikidata = pd.concat(
    [collectors_matches_tmp_names_abbreviated, collectors_matches_tmp_fullnames]
    , ignore_index=True
)

In [28]:
print("Show some name match examples (e.g. «Louis…» matching various names) …")
for testname in ['Aarvik', 'Louis', 'Abbot']:
    criterion = collectors_matches_g1_merged_wikidata['canonical_string_collector_parsed'].map(lambda x: x.startswith(testname))    
    this_table=collectors_matches_g1_merged_wikidata[criterion].get([
        # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
        'DocCount_count', 'MatCitGbifOccurrenceId_firstsample',
        'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
        # 'canonical_string_fullname', 
        'itemLabel', 'wikidata_link',
        'MatCitYear_min', 'MatCitYear_max',
        'yob', 'yod', 'wyb', 'wye'
    ]).sort_values(by=['namematch_distance'])
    print("# ---------------------------------------------\n# «%s…» as test name, %d collector names begin with:" % (testname, criterion.sum()))    
    display(this_table)

Show some name match examples (e.g. «Louis…» matching various names) …
# ---------------------------------------------
# «Aarvik…» as test name, 9 collector names begin with:


Unnamed: 0,DocCount_count,MatCitGbifOccurrenceId_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod,wyb,wye
570,1,3464736542,"Aarvik', L.","Aarvik, L.",0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1992.0,1992.0,1954.0,,,
571,1,3464736542,"Aarvik', L.","Aarvik, L.",0.0,Lars Aarvik,http://www.wikidata.org/wiki/Q106823278,1992.0,1992.0,1892.0,1981.0,,
572,62,3712345314,"Aarvik, L.","Aarvik, L.",0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1955.0,2016.0,1954.0,,,
573,62,3712345314,"Aarvik, L.","Aarvik, L.",0.0,Lars Aarvik,http://www.wikidata.org/wiki/Q106823278,1955.0,2016.0,1892.0,1981.0,,
107793,10,3407622304,"Aarvik, Leif","Aarvik, Leif",0.0,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1993.0,2014.0,1954.0,,,
568,7,3425371579,Aarvik,"Aarvik, L.",0.43,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1991.0,2016.0,1954.0,,,
569,7,3425371579,Aarvik,"Aarvik, L.",0.43,Lars Aarvik,http://www.wikidata.org/wiki/Q106823278,1991.0,2016.0,1892.0,1981.0,,
574,2,2848996716,"Aarvik, L.A.","Aarvik, L.",0.46,Leif Aarvik,http://www.wikidata.org/wiki/Q17114254,1992.0,1992.0,1954.0,,,
575,2,2848996716,"Aarvik, L.A.","Aarvik, L.",0.46,Lars Aarvik,http://www.wikidata.org/wiki/Q106823278,1992.0,1992.0,1892.0,1981.0,,


# ---------------------------------------------
# «Louis…» as test name, 9 collector names begin with:


Unnamed: 0,DocCount_count,MatCitGbifOccurrenceId_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod,wyb,wye
77894,10,3698751601,"Louis, A.M.","Louis, A.M.",0.0,Adriaan M. Louis,http://www.wikidata.org/wiki/Q21338327,1983.0,2011.0,1944.0,,,
77884,13,3695183626,Louis,"Louis, A.",0.41,A. Louis,http://www.wikidata.org/wiki/Q33682458,1937.0,1991.0,,,,
77885,2,4037809378,"Louis, J.","Louis, A.",0.56,A. Louis,http://www.wikidata.org/wiki/Q33682458,1937.0,1938.0,,,,
77889,1,3014903368,"Louis Philippe, I.","Philippe, M.",0.8,Mathieu-Yves Philippe,http://www.wikidata.org/wiki/Q19001498,,,1810.0,1869.0,,
122777,1,2556158301,"Louis, Pic","Picarda, Louis",0.86,Louis Picarda,http://www.wikidata.org/wiki/Q3262897,2001.0,2001.0,1848.0,1901.0,,
77886,1,3042685358,Louis-Alphonse,"Louis, A.",0.89,A. Louis,http://www.wikidata.org/wiki/Q33682458,1955.0,1955.0,,,,
77887,1,2564277833,Louisiana,"Louis, A.",0.89,A. Louis,http://www.wikidata.org/wiki/Q33682458,1984.0,1984.0,,,,
77888,1,3766716404,"Louisville, M.","Louis, A.",0.98,A. Louis,http://www.wikidata.org/wiki/Q33682458,1975.0,1975.0,,,,
122776,1,4058829344,"Louis Ci, Saint","Saint-Victor, Louis Morin de",1.03,Louis Morin de Saint-Victor,http://www.wikidata.org/wiki/Q29613412,1897.0,1897.0,1635.0,1715.0,,


# ---------------------------------------------
# «Abbot…» as test name, 12 collector names begin with:


Unnamed: 0,DocCount_count,MatCitGbifOccurrenceId_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,itemLabel,wikidata_link,MatCitYear_min,MatCitYear_max,yob,yod,wyb,wye
757,9,2605311615,"Abbott, W.L.","Abbott, W.L.",0.0,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,41.0,2003.0,1860.0,1936.0,,
747,16,2428525684,Abbott,"Abbott, G.",0.43,George Abbott,http://www.wikidata.org/wiki/Q47112598,1896.0,2006.0,,,,
746,2,3336039328,"Abbot, J.C.","Abbot, J.",0.51,John Abbot,http://www.wikidata.org/wiki/Q303380,2003.0,2005.0,1751.0,1840.0,1766.0,1840.0
755,1,3080394386,"Abbott, S.","Abbott, S.R.",0.53,Sarah Rideout Abbott,http://www.wikidata.org/wiki/Q67079678,1971.0,1971.0,1871.0,1926.0,,
107813,2,3326496318,"Abbott, Edith","Abbott, Edith Mae",0.53,Edith Mae Abbott,http://www.wikidata.org/wiki/Q99342591,1984.0,1984.0,1909.0,2006.0,,
748,1,4092099304,"Abbott, A.","Abbott, G.",0.57,George Abbott,http://www.wikidata.org/wiki/Q47112598,1987.0,1987.0,,,,
751,1,4109171325,"Abbott, I.A.","Abbott, I.",0.57,Isabella Abbott,http://www.wikidata.org/wiki/Q6077932,1990.0,1990.0,1919.0,2010.0,,
756,2,3407812353,"Abbott, W.","Abbott, W.L.",0.57,William Louis Abbott,http://www.wikidata.org/wiki/Q635604,1922.0,1922.0,1860.0,1936.0,,
754,1,3765279074,"Abbott, L.","Abbott, L.K.",0.6,Lynette K. Abbott,http://www.wikidata.org/wiki/Q36610629,,,,,,
749,3,3034608415,"Abbott, K.","Abbott, G.",0.61,George Abbott,http://www.wikidata.org/wiki/Q47112598,1997.0,1997.0,,,,


In [29]:
print('Group data by canonical names (abbreviated and full name):'
      ' multiple related WD items (e.g. Q1232456), item labels, year of birth, year of death')
for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):
    print('Run %s:   Group by wiki data’s %s, and aggregate/join item(s), labels, yob, yod '
          'by “…|…”, add new columns “…_joined” ...' % (i + 1, wd_matching_column))
    wdata_joined_items_and_others = wikidata.groupby([wd_matching_column]).agg(
        items_joined = ('item', lambda x: '|'.join(x)),
        item_labels_joined = ('itemLabel', lambda x: '|'.join(x)),
        yob_joined = ('yob', lambda x: '|'.join([str(s) for s in list(x)]) ),
        yod_joined = ('yod', lambda x: '|'.join([str(s) for s in list(x)]) )
    ).reset_index()

    # print("Done. Show examples of items having multiple matching data «|» … ")
    # criterion = wdata_joined_items['items'].map(lambda x: '|' in x)
    # wdata_joined_items[criterion].head()

    print('Run %s:   Merge all based on namematch_resource_data, add item(s) data ...' % (i + 1))
    collectors_matches_g2 = pd.merge(
        collectors_matches_g1_merged_wikidata, wdata_joined_items_and_others,
        left_on='namematch_resource_data', right_on=wd_matching_column
        , suffixes=('__wikidata_merge', '__grp_by_items')
        # append to left-data, right-data only when identical column names occur
    )

    print('Run %s:   Build data frame “collectors_matches_group” ...' % (i + 1))
    collectors_matches_group = collectors_matches_g2 \
        if i == 0 \
        else pd.concat([collectors_matches_group, collectors_matches_g2], ignore_index = True)
    
print('Done')

Group data by canonical names (abbreviated and full name): multiple related WD items (e.g. Q1232456), item labels, year of birth, year of death
Run 1:   Group by wiki data’s canonical_string, and aggregate/join item(s), labels, yob, yod by “…|…”, add new columns “…_joined” ...
Run 1:   Merge all based on namematch_resource_data, add item(s) data ...
Run 1:   Build data frame “collectors_matches_group” ...
Run 2:   Group by wiki data’s canonical_string_fullname, and aggregate/join item(s), labels, yob, yod by “…|…”, add new columns “…_joined” ...
Run 2:   Merge all based on namematch_resource_data, add item(s) data ...
Run 2:   Build data frame “collectors_matches_group” ...
Done
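
The grouping step can be sketched on three rows taken from the example tables above. The notebook's lambdas use `str()`, which turns missing years into the text `nan`. The helper `join_years` below is a hypothetical variant that writes integer years and leaves missing values empty.

```python
import pandas as pd

wikidata_sample = pd.DataFrame({
    'canonical_string': ['Aarvik, L.', 'Aarvik, L.', 'Abbott, W.L.'],
    'item': [
        'http://www.wikidata.org/entity/Q17114254',
        'http://www.wikidata.org/entity/Q106823278',
        'http://www.wikidata.org/entity/Q635604',
    ],
    'itemLabel': ['Leif Aarvik', 'Lars Aarvik', 'William Louis Abbott'],
    'yob': [1954.0, 1892.0, 1860.0],
    'yod': [float('nan'), 1981.0, 1936.0],
})


def join_years(values):
    """Hypothetical helper: join years by '|' as integers, empty string for missing values."""
    return '|'.join('' if pd.isna(v) else str(int(v)) for v in values)


joined = wikidata_sample.groupby('canonical_string').agg(
    items_joined=('item', lambda x: '|'.join(x)),
    item_labels_joined=('itemLabel', lambda x: '|'.join(x)),
    yob_joined=('yob', join_years),
    yod_joined=('yod', join_years),
).reset_index()

print(joined[['canonical_string', 'item_labels_joined', 'yob_joined', 'yod_joined']])
```

Whether empty strings or `nan` are preferable depends on how the joined columns are evaluated later; the sketch only shows that the choice is made inside the aggregation function.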


In [30]:
print("Show examples of item_labels_joined having multiple matching data «|» … ")
criterion = collectors_matches_group['item_labels_joined'].map(lambda x: '|' in x)

collectors_matches_group[criterion].get([ # empty 
    # 'canonical_string_collector_parsed', # canonical_string_collector_parsed = namematch_source_data
    'DocCount_count', 'MatCitGbifOccurrenceId_firstsample',
    'namematch_source_data', 'namematch_resource_data', 'namematch_distance', 
    # 'canonical_string_fullname', 
    'item_labels_joined', 'items_joined', 'yob_joined', 'yod_joined'
], default="...get: Are data empty or it has probably a wrong named column?")

Show examples of item_labels_joined having multiple matching data «|» … 


Unnamed: 0,DocCount_count,MatCitGbifOccurrenceId_firstsample,namematch_source_data,namematch_resource_data,namematch_distance,item_labels_joined,items_joined,yob_joined,yod_joined
112,1,2608709057,"A, Yu","Yu, J.",0.73,Ji Yu|Jin Yu,http://www.wikidata.org/entity/Q88832701|http:...,nan|nan,nan|nan
113,1,2608709057,"A, Yu","Yu, J.",0.73,Ji Yu|Jin Yu,http://www.wikidata.org/entity/Q88832701|http:...,nan|nan,nan|nan
114,1,2592348498,"Abasheev, R. Yu","Yu, J.",1.16,Ji Yu|Jin Yu,http://www.wikidata.org/entity/Q88832701|http:...,nan|nan,nan|nan
115,1,2592348498,"Abasheev, R. Yu","Yu, J.",1.16,Ji Yu|Jin Yu,http://www.wikidata.org/entity/Q88832701|http:...,nan|nan,nan|nan
116,1,3400158341,"Arzanow, Yu","Yu, J.",1.15,Ji Yu|Jin Yu,http://www.wikidata.org/entity/Q88832701|http:...,nan|nan,nan|nan
...,...,...,...,...,...,...,...,...,...
130783,22,2609011493,"Zhang, Yalin","Zhang, Yan Min",0.92,Yan Min Zhang|Yan Min Zhang,http://www.wikidata.org/entity/Q19588973|http:...,1957.0|nan,nan|nan
130784,1,1438476436,"Zhang, Yan-Long","Zhang, Yan Min",0.80,Yan Min Zhang|Yan Min Zhang,http://www.wikidata.org/entity/Q19588973|http:...,1957.0|nan,nan|nan
130785,1,1438476436,"Zhang, Yan-Long","Zhang, Yan Min",0.80,Yan Min Zhang|Yan Min Zhang,http://www.wikidata.org/entity/Q19588973|http:...,1957.0|nan,nan|nan
130786,4,3395956304,"Zhang, Yanhua","Zhang, Yan Min",0.93,Yan Min Zhang|Yan Min Zhang,http://www.wikidata.org/entity/Q19588973|http:...,1957.0|nan,nan|nan


In [31]:
# check what columns we have and what we would keep for further analysis and what to drop
pprint.pprint(collectors_matches_group.columns)

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'DocCount_count', 'MatCitId_firstsample',
       'MatCitGbifOccurrenceId_firstsample', 'MatCitDate_mean',
       'MatCitDate_min', 'MatCitDate_max', 'MatCitYear_mean', 'MatCitYear_min',
       'MatCitYear_max', 'old_index', 'namematch_source_data',
       'namematch_resource_data', 'namematch_distance', 'item', 'itemLabel',
       'surname', 'initials', 'canonical_string__wikidata_merge',
       'canonical_string_fullname', 'orcid', 'viaf', 'isni', 'harv', 'ipni',
       'abbr', 'bionomia_id', 'yob', 'yod', 'wyb', 'wye', 'wikidata_link',
       'orcid_link', 'harv_link', 'ipni_link', 'bionomia_link',
       'canonical_string__grp_by_items', 'items_joined', 'item_labels_joined',
       'yob_joined', 'yod_joined', 'canonical_string',
       'canonical_string_fullname__wikidata_merge',
       'canonical_string_fullname__grp_by_items'],
      dtype='object')

In [32]:
# Remove superfluous columns and duplicate rows
# (no inplace operations on the column selection, to avoid the pandas SettingWithCopyWarning)
# TODO check duplicates
collectors_matches_group_simplified = collectors_matches_group.get(
    ['family', 'given', 'canonical_string_collector_parsed', 
    'namematch_source_data', 'namematch_resource_data', 'namematch_distance',
     'MatCitYear_mean', 'MatCitYear_min', 'MatCitYear_max',
     'MatCitDate_min', 'MatCitDate_max',
      'yob_joined', 'yod_joined', # WikiData dates
      'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 
      'items_joined', 'canonical_string', 'canonical_string_fullname', 'surname', 'initials', 'item_labels_joined'
    ], default="...get: Are data empty or it has probably a wrong named column?"
)
# collectors_matches_group = collectors_matches_g3
collectors_matches_group_simplified = collectors_matches_group_simplified.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
).drop_duplicates()
collectors_matches_group_simplified.head()

Unnamed: 0,family,given,canonical_string_collector_parsed,namematch_source_data,namematch_resource_data,namematch_distance,MatCitYear_mean,MatCitYear_min,MatCitYear_max,MatCitDate_min,...,harv,ipni,abbr,bionomia_id,items_joined,canonical_string,canonical_string_fullname,surname,initials,item_labels_joined
546,Aagaard,K.,"Aagaard, K.","Aagaard, K.","Aagaard, K.",0.0,1986.0,1986.0,1986.0,1986-07-01 00:00:00.000,...,,,,Q55216516,http://www.wikidata.org/entity/Q55216516,,"Aagaard, Kaare",Aagaard,K.,Kaare Aagaard
110163,Aagaard,Kaare,"Aagaard, Kaare","Aagaard, Kaare","Aagaard, Kaare",0.0,,,,NaT,...,,,,Q55216516,http://www.wikidata.org/entity/Q55216516,"Aagaard, K.",,Aagaard,K.,Kaare Aagaard
570,Aarvik',L.,"Aarvik', L.","Aarvik', L.","Aarvik, L.",0.0,1992.0,1992.0,1992.0,1992-03-25 00:00:00.000,...,,,,0000-0002-0112-8837,http://www.wikidata.org/entity/Q17114254|http:...,,"Aarvik, Leif",Aarvik,L.,Leif Aarvik|Lars Aarvik
571,Aarvik',L.,"Aarvik', L.","Aarvik', L.","Aarvik, L.",0.0,1992.0,1992.0,1992.0,1992-03-25 00:00:00.000,...,,,,Q106823278,http://www.wikidata.org/entity/Q17114254|http:...,,"Aarvik, Lars",Aarvik,L.,Leif Aarvik|Lars Aarvik
572,Aarvik,L.,"Aarvik, L.","Aarvik, L.","Aarvik, L.",0.0,2000.37931,1955.0,2016.0,1955-01-01 00:00:00.000,...,,,,0000-0002-0112-8837,http://www.wikidata.org/entity/Q17114254|http:...,,"Aarvik, Leif",Aarvik,L.,Leif Aarvik|Lars Aarvik


In [33]:
this_output_file='data/plazi_collectors_matches_wikidata_items_group_concat_%s.csv' % (this_timestamp_for_data)

# collectors_matches_group.to_csv(this_output_file_name)
collectors_matches_group.to_csv(this_output_file)

print("Wrote groups of collectors matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" is a bit shift by 10 bits = integer division by 1024, to get kilobytes
)

Wrote groups of collectors matches into data/plazi_collectors_matches_wikidata_items_group_concat_20230719.csv (69490 kB)


### Merge Data to Individual WikiData Items

For this, merge on `namematch_resource_data` and focus on getting individual WikiData items.

Get individual WikiData items (TODO review code): 
- associate each collector name match with the individual WikiData items (remember: we matched on the `canonical_string`)
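
Because several WikiData items can share one canonical name string, this merge is one-to-many: a single name match can expand into several rows. A minimal sketch with rows taken from the examples above (the column `candidate_count` is a hypothetical addition, not part of the notebook):

```python
import pandas as pd

name_matches = pd.DataFrame({
    'namematch_source_data': ['Aarvik, L.', 'Abbott, W.'],
    'namematch_resource_data': ['Aarvik, L.', 'Abbott, W.L.'],
})
wikidata_sample = pd.DataFrame({
    'canonical_string': ['Aarvik, L.', 'Aarvik, L.', 'Abbott, W.L.'],
    'itemLabel': ['Leif Aarvik', 'Lars Aarvik', 'William Louis Abbott'],
})

# one row per (collector name, WikiData item) pair
per_item = pd.merge(
    name_matches, wikidata_sample,
    left_on='namematch_resource_data', right_on='canonical_string'
)

# hypothetical helper column: how many WikiData items compete for the same collector name
per_item['candidate_count'] = (
    per_item.groupby('namematch_source_data')['itemLabel'].transform('count')
)
print(per_item[['namematch_source_data', 'itemLabel', 'candidate_count']])
```

A count above 1 marks names that remain ambiguous even at distance 0 and need further evidence, such as years of activity, to be resolved.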

In [34]:
print('Merge simply namematch_resource_data to Wiki data for abbreviated and full names... ')
for i, wd_matching_column in enumerate(['canonical_string', 'canonical_string_fullname']):

    # join wikidata items to avh collectors matches
    #   avh_matches = pd.merge(avh, matches, left_on='label', right_on='name')
    #   avh_matches_t1 = pd.merge(avh_matches, wikidata, left_on='matched_name', right_on='canonical_string')
    # link counts of wikidata items with same canonical name string
    #   avh_matches_t2 = pd.merge(avh_matches_t1, wd_test, left_on="matched_name", right_on="canonical_string")
    #   avh_matches_t2.rename(columns = {list(avh_matches_t2.columns)[-1]: 'dup_count'}, inplace=True)
    
    print('Run %s:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...' % (i + 1))
    collectors_matches_wd1 = pd.merge(
        collectors_all_matches, wikidata,
        left_on='namematch_resource_data', right_on=wd_matching_column,
        suffixes=('__coll_all_matches', '__wd')
        # append to left-data, right-data only when identical column names occur
    )

    print('Run %s:   Build data frame “collectors_matches_with_wdata” ...' % (i + 1))
    collectors_matches_with_wdata = collectors_matches_wd1 \
        if i == 0 \
        else pd.concat([collectors_matches_with_wdata, collectors_matches_wd1], ignore_index=True)

print('Done')

Merge simply namematch_resource_data to Wiki data for abbreviated and full names... 
Run 1:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...
Run 1:   Build data frame “collectors_matches_with_wdata” ...
Run 2:   Merge all (collectors matches) using namematch_resource_data, add wikidata ...
Run 2:   Build data frame “collectors_matches_with_wdata” ...
Done


In [35]:
pprint.pprint(collectors_matches_with_wdata.columns)
# echo "${text}" | fold --spaces | sed 's@^@#  @'

Index(['family', 'given', 'suffix', 'particle', 'dropping_particle', 'nick',
       'appellation', 'title', 'canonical_string_collector_parsed',
       'DocCount_count', 'MatCitId_firstsample',
       'MatCitGbifOccurrenceId_firstsample', 'MatCitDate_mean',
       'MatCitDate_min', 'MatCitDate_max', 'MatCitYear_mean', 'MatCitYear_min',
       'MatCitYear_max', 'old_index', 'namematch_source_data',
       'namematch_resource_data', 'namematch_distance', 'item', 'itemLabel',
       'surname', 'initials', 'canonical_string', 'canonical_string_fullname',
       'orcid', 'viaf', 'isni', 'harv', 'ipni', 'abbr', 'bionomia_id', 'yob',
       'yod', 'wyb', 'wye', 'wikidata_link', 'orcid_link', 'harv_link',
       'ipni_link', 'bionomia_link'],
      dtype='object')


In [36]:
collectors_matches_with_wdata.drop_duplicates(inplace=True)
collectors_matches_with_wdata

Unnamed: 0,family,given,suffix,particle,dropping_particle,nick,appellation,title,canonical_string_collector_parsed,DocCount_count,...,bionomia_id,yob,yod,wyb,wye,wikidata_link,orcid_link,harv_link,ipni_link,bionomia_link
0,Aagaard,K.,,,,,,,"Aagaard, K.",2,...,Q55216516,1947.0,,,,http://www.wikidata.org/wiki/Q55216516,,,,https://bionomia.net/Q55216516
1,Aarvik,L.,,,,,,,"Aarvik, L.",62,...,0000-0002-0112-8837,1954.0,,,,http://www.wikidata.org/wiki/Q17114254,https://orcid.org/0000-0002-0112-8837,,,https://bionomia.net/0000-0002-0112-8837
2,Aarvik,L.,,,,,,,"Aarvik, L.",62,...,Q106823278,1892.0,1981.0,,,http://www.wikidata.org/wiki/Q106823278,,,,https://bionomia.net/Q106823278
3,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,0000-0002-0112-8837,1954.0,,,,http://www.wikidata.org/wiki/Q17114254,https://orcid.org/0000-0002-0112-8837,,,https://bionomia.net/0000-0002-0112-8837
4,Aarvik',L.,,,,,,,"Aarvik', L.",1,...,Q106823278,1892.0,1981.0,,,http://www.wikidata.org/wiki/Q106823278,,,,https://bionomia.net/Q106823278
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130927,Sangha-Mabaere,Prefecture,,,,,,,"Sangha-Mabaere, Prefecture",1,...,,,,,,http://www.wikidata.org/wiki/Q88848954,,,https://www.ipni.org/a/20039887-1,
130928,Bacuyo,Bofedal,,de,,,,,"Bacuyo, Bofedal",1,...,,,,,,http://www.wikidata.org/wiki/Q88811519,,,https://www.ipni.org/a/20034982-1,
130929,Biosfera El Triunfo,Reserva,,de la,,,,,"Biosfera El Triunfo, Reserva",5,...,,1933.0,,,,http://www.wikidata.org/wiki/Q21610963,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/27922-1,
130930,Reservoir,Jirau Hydroelectric,,,,,,,"Reservoir, Jirau Hydroelectric",1,...,,1909.0,,,,http://www.wikidata.org/wiki/Q21516836,,https://kiki.huh.harvard.edu/databases/botanis...,https://www.ipni.org/a/4347-1,


In [37]:
this_output_file='data/plazi_collectors_matches_wikidata-botanists_all-columns_%s.csv' % (this_timestamp_for_data)

collectors_matches_with_wdata.sort_values(
    by=['namematch_distance', 'canonical_string_collector_parsed']
    , inplace=True
)
collectors_matches_with_wdata.to_csv(
    this_output_file, index=False # drop index column
)

print("Wrote isolated WikiData items of collector matches into %s (%d kB)" % 
    (this_output_file, os.path.getsize(this_output_file) >> 10 ) # ">> 10" is a bit shift by 10 bits = integer division by 1024, to get kilobytes
)

Wrote isolated WikiData items of collector matches into data/plazi_collectors_matches_wikidata-botanists_all-columns_20230719.csv (55807 kB)


In [None]:
# TODO further evaluation or filtering, counting, clean up aso.
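
One possible direction for such filtering, as a sketch only: keep matches below a distance threshold and drop candidates whose lifetime cannot overlap the collecting years. The rows are taken from the example tables above. The threshold `MAX_DISTANCE` and the rule of not holding missing years against a match are assumptions, not part of the notebook.

```python
import pandas as pd

matched = pd.DataFrame({
    'namematch_source_data': ['Aarvik, L.', 'Aarvik, L.', 'Louis Ci, Saint', 'Louis, A.M.'],
    'itemLabel': ['Leif Aarvik', 'Lars Aarvik', 'Louis Morin de Saint-Victor', 'Adriaan M. Louis'],
    'namematch_distance': [0.0, 0.0, 1.03, 0.0],
    'MatCitYear_min': [1955.0, 1955.0, 1897.0, 1983.0],
    'MatCitYear_max': [2016.0, 2016.0, 1897.0, 2011.0],
    'yob': [1954.0, 1892.0, 1635.0, 1944.0],
    'yod': [float('nan'), 1981.0, 1715.0, float('nan')],
})

MAX_DISTANCE = 0.5  # assumed threshold, needs tuning against reviewed matches

close_enough = matched['namematch_distance'] <= MAX_DISTANCE
# born no later than the last collecting year (missing year of birth is not held against a match)
born_in_time = matched['yob'].isna() | (matched['yob'] <= matched['MatCitYear_max'])
# still alive at the first collecting year (missing year of death is not held against a match)
alive_in_time = matched['yod'].isna() | (matched['yod'] >= matched['MatCitYear_min'])

plausible = matched[close_enough & born_in_time & alive_in_time]
print(plausible[['namematch_source_data', 'itemLabel', 'namematch_distance']])
```

Both Aarvik candidates pass this filter, so a lifetime check narrows the candidates but does not resolve every ambiguity.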

TODO document columns

Explanation of columns:

Column | Description
-|-
**Botanical collectors** |
family | parsed family name
given | parsed given name
suffix | suffix from name parsing
particle | particle from name parsing
dropping_particle | dropping_particle from name parsing
nick | nickname from name parsing
appellation | appellation from name parsing
title | title from name parsing
TODO … | Year of first collection
TODO end_date | Year of last collection
TODO activity_span | Number of years between first and last collection
**Name matching** |
namematch_collector | matched collector name from the Plazi data set
namematch_wikidata | matched Wikidata name, i.e. the item label the collector name was matched to
namematch_distance | nearest-neighbour distance between the collector name and the matched name; the lower the value, the better the match
**Wikidata** |
item | Wikidata Item ID (URL)
itemLabel | Wikidata Item label
surname | Surname; derived from item label
initials | Initials; derived from item label
canonical_string | Canonical name string; derived from item label, used for matching
orcid | ORCID ([P496](https://www.wikidata.org/wiki/Property:P496))
viaf | VIAF ID ([P214](https://www.wikidata.org/wiki/Property:P214))
isni | ISNI ID ([P213](https://www.wikidata.org/wiki/Property:P213))
harv | Harvard Index of Botanists ID ([P6264](https://www.wikidata.org/wiki/Property:P6264))
ipni | IPNI author ID ([P586](https://www.wikidata.org/wiki/Property:P586))
abbr | botanist author abbreviation (standard form) ([P428](https://www.wikidata.org/wiki/Property:P428))
bionomia_id | identifier for a collector and/or determiner of natural history specimens, in the Bionomia database ([P6944](https://www.wikidata.org/wiki/Property:P6944))
yob | Year of birth (derived from [P569](https://www.wikidata.org/wiki/Property:P569))
yod | Year of death (derived from [P570](https://www.wikidata.org/wiki/Property:P570))
wyb | Work year period begin ([P2031](https://www.wikidata.org/wiki/Property:P2031))
wye | Work year period end ([P2032](https://www.wikidata.org/wiki/Property:P2032))
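
The `yob` and `yod` columns can help to separate candidates that share a name. The following is a hedged sketch, not part of the notebook: it flags matches whose lifespan cannot contain a given collecting year. The column `collecting_year` is hypothetical (the merged table may carry the year under another name or not at all), and the rows are made-up placeholders.

```python
import pandas as pd

# Toy rows; yob/yod are floats with gaps, as in the exported table.
rows = pd.DataFrame({
    "itemLabel": ["person A", "person B", "person C"],  # placeholders
    "yob": [1892.0, 1954.0, None],
    "yod": [1981.0, None, None],
    "collecting_year": [2005, 2005, 2005],  # hypothetical column
})

# A missing bound counts as "unknown", so it never rules a match out.
born_before = rows["yob"].isna() | (rows["yob"] <= rows["collecting_year"])
alive_then = rows["yod"].isna() | (rows["collecting_year"] <= rows["yod"])
rows["lifespan_plausible"] = born_before & alive_then

print(rows[["itemLabel", "yob", "yod", "lifespan_plausible"]])
```

A check like this only excludes clearly impossible matches. It does not confirm the remaining ones, and posthumously published material may need a more lenient rule.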