# Create Plazi collectors data set

Create a data set of the collectors recorded by Plazi:

- open <https://tb.plazi.org/GgServer/srsStats> and go to the section “Materials Citation Data”
- select the data fields (columns) of interest; further down, in the section **Fields to Use in Statistics**, you can adjust the output:
    - choose **Operation** “show individual values”
    - filter values with **Filter on Values**
    - set the limit to a small number, e.g. 5, to preview the data you would get
    - below that, copy the download link for one of the offered data formats
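
The download link produced by these steps is an ordinary query string, so it can also be assembled in code. The following is a minimal sketch, assuming the parameter names seen in the download link of this document (`outputFields`, `groupingFields`, `FP-<field>`, `format`); `build_stats_url` is a hypothetical helper, not part of any Plazi tooling. Note that `urlencode` percent-encodes `!` as `%21`, which is assumed to be accepted by the server just like the literal character.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the download link used in this document
BASE_URL = "https://tb.plazi.org/GgServer/srsStats/stats"


def build_stats_url(fields, filters=None, fmt="TSV"):
    """Build a statistics download URL (hypothetical helper, not part of Plazi)."""
    joined = " ".join(fields)  # urlencode turns the spaces into '+'
    params = [("outputFields", joined), ("groupingFields", joined)]
    for field, predicate in (filters or {}).items():
        params.append(("FP-" + field, predicate))  # one FP-<field> entry per filter
    params.append(("format", fmt))
    return BASE_URL + "?" + urlencode(params)


url = build_stats_url(
    ["matCit.id", "matCit.gbifOccurrenceId", "matCit.collector"],
    filters={"matCit.gbifOccurrenceId": "!0", "matCit.collector": ">0"},
)
print(url)

# Decoding the query string gives back the values that were put in
query = parse_qs(urlsplit(url).query)
print(query["FP-matCit.collector"])
```

Keeping the field list in one place avoids the two long `outputFields` and `groupingFields` values drifting apart when a column is added or removed.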

# Example Data

| Field Name | Filter on Values |
|-|-|
| Collector Name          | >0 |
| GBIF Occurrence ID      | !0 |
| Collecting Month        |    |
| Collecting Year         |    |
| Collecting Decade       |    |
| Collecting Date         |    |
| Materials Citation UUID |    |

```bash
# added filter: gbifOccurrenceId → !0
# added filter: collector → >0 (seems to return only the non-empty collector names)
filename="plazi-stats_numberOfTreatments_gbifOccurrenceId-not0_date_decade_year_month_collector-gt0_$(date '+%Y%m%d').tsv"
wget --output-document="${filename}" \
'https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector&FP-matCit.gbifOccurrenceId=!0&FP-matCit.collector=%3E0&format=TSV'

wc -l < "${filename}"
# 417402 lines, i.e. 417401 records plus 1 column-header line

{ head -n 5 "${filename}"; echo "..."; tail -n 5 "${filename}"; } | column --table --separator $'\t' | sed 's@^@  # @;'
  # DocCount  MatCitId                          MatCitGbifOccurrenceId  MatCitDate  MatCitDecade  MatCitYear  MatCitMonth  MatCitCollector
  # 1         78F03CF8FFE2FFE5C0C4F883FE73F8B4  3419301320                          0             0           0            1888 - 1890 & Morong, T.
  # 1         78F03CF8FFE5FFE2C187FB83FD0AFB94  3419301397                          0             0           0            1914 & Chodat, R.
  # 1         1FFD3CFF806D3D11C410027311B3FEAC  4012799597              1980-09-19  1980          1980        9            1980 - Sino- American Botanical Expedition
  # 1         AFA17A73FFA8F2414DA6F9AB94DCF942  3466701331                          0             0           0            20. 8.201 3 & Delage, A.
  # ...                                                                                                                    
  # 1         3B7F3CD7FFEDFFF5FB68FCBD4061FCB8  3072658352              2017-07-05  2010          2017        7            Z. Z. Xia
  # 1         3B5C3CD3FF9FFFACFCCB2B09BAD0FE79  1699618906              2002-06-25  2000          2002        6            Z. Z. Yang
  # 1         B5B23CA2C006FF87FB6FF9CBFA17F94A  2028140173              2009-08-18  2000          2009        8            Z. Z. Yang
  # 1         3B063C92F16FFF93DA9FFC4DFEDB1D0B  3866542316              2015-06-08  2010          2015        6            ZZ Zhang
  # 1         3B7C3CAD6B18FFBCADDEFA01FE543FE5  3034555558              1956-06-20  1950          1956        6            А. Schnitnikov
```
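
The record count can also be checked from Python. The sketch below mirrors the `wc -l` approach, counting lines and subtracting the header; it assumes that no field contains an embedded line break. `count_records` is a hypothetical helper, and the temporary file only stands in for the real download.

```python
import os
import tempfile


def count_records(tsv_path):
    """Count data rows in a TSV file, i.e. all lines except the header line."""
    with open(tsv_path, encoding="utf-8") as handle:
        line_count = sum(1 for _ in handle)
    return max(line_count - 1, 0)  # subtract the column-header line


# Small stand-in file so the sketch runs without the real download
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("DocCount\tMatCitCollector\n1\tZ. Z. Yang\n1\tZZ Zhang\n")

record_count = count_records(tmp.name)
print(record_count)
os.remove(tmp.name)
```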



In [1]:
import json
import requests
import pandas as pd
import time

# https://tb.plazi.org/GgServer/srsStats/stats?
#   outputFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   groupingFields=matCit.id+matCit.gbifOccurrenceId+matCit.date+matCit.decade+matCit.year+matCit.month+matCit.collector
#   &
#   FP-matCit.gbifOccurrenceId=!0
#   &
#   FP-matCit.collector=%3E0
#   &
#   format=TSV
url = 'https://tb.plazi.org/GgServer/srsStats/stats'
params = [
    ('outputFields',   'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('groupingFields', 'matCit.id matCit.gbifOccurrenceId matCit.date matCit.decade matCit.year matCit.month matCit.collector'),
    ('FP-matCit.gbifOccurrenceId', '!0'),
    ('FP-matCit.collector', '>0'),
    ('format', 'JSON')
]

start_time = time.time()
print("Send data request to" , url)

response = requests.get(url, params=params)
payload = response.json()  # named `payload` to avoid shadowing the built-in `dict`
collectors = payload['data']

print("Response of %s came in %s seconds (HTTP-code: %s)" % (
    url, 
    (time.time() - start_time), 
    response.status_code)
)

start_time = time.time()
print("Normalize JSON data with pandas …")

df = pd.json_normalize(collectors)

print("Normalization took %s seconds" % (time.time() - start_time) )

print("Print data sample …")
df



Send data request to https://tb.plazi.org/GgServer/srsStats/stats
Response of https://tb.plazi.org/GgServer/srsStats/stats came in 11.317992448806763 seconds (HTTP-code: 200)
Normalize JSON data with pandas …
Normalization took 2.426215410232544 seconds
Print data sample …


Unnamed: 0,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth,MatCitCollector
0,1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0,"1888 - 1890 & Morong, T."
1,1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0,"1914 & Chodat, R."
2,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9,1980 - Sino- American Botanical Expedition
3,1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0,"20. 8.201 3 & Delage, A."
4,1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0,"20. IX. 1957 & fr., Service Forestier"
...,...,...,...,...,...,...,...,...
421015,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7,Z. Z. Xia
421016,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6,Z. Z. Yang
421017,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8,Z. Z. Yang
421018,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6,ZZ Zhang
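
In the sample above, rows without a `MatCitDate` carry `0` for decade, year and month. The following sketch shows how such records could be normalized and the zeros turned into missing values. It rests on two assumptions: that `0` means "not recorded", and that the JSON payload delivers these fields as integers. The two records are copied from the output above.

```python
import pandas as pd

# Two records copied from the sample output; the value types (int vs. str)
# in the real JSON payload are an assumption
sample_records = [
    {"DocCount": 1, "MatCitId": "78F03CF8FFE2FFE5C0C4F883FE73F8B4",
     "MatCitGbifOccurrenceId": 3419301320, "MatCitDate": "",
     "MatCitDecade": 0, "MatCitYear": 0, "MatCitMonth": 0,
     "MatCitCollector": "1888 - 1890 & Morong, T."},
    {"DocCount": 1, "MatCitId": "1FFD3CFF806D3D11C410027311B3FEAC",
     "MatCitGbifOccurrenceId": 4012799597, "MatCitDate": "1980-09-19",
     "MatCitDecade": 1980, "MatCitYear": 1980, "MatCitMonth": 9,
     "MatCitCollector": "1980 - Sino- American Botanical Expedition"},
]

sample_df = pd.json_normalize(sample_records)

# Assumption: 0 means "not recorded", so replace it with a missing value
date_parts = ["MatCitDecade", "MatCitYear", "MatCitMonth"]
for name in date_parts:
    as_nullable_int = sample_df[name].astype("Int64")  # nullable integer dtype
    sample_df[name] = as_nullable_int.mask(as_nullable_int == 0)

print(sample_df[["MatCitCollector"] + date_parts])
```

Using the nullable `Int64` dtype keeps real years as integers instead of converting them to floats when missing values are introduced.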


In [2]:
list(df.columns)

['DocCount',
 'MatCitId',
 'MatCitGbifOccurrenceId',
 'MatCitDate',
 'MatCitDecade',
 'MatCitYear',
 'MatCitMonth',
 'MatCitCollector']

In [3]:
# move 'MatCitCollector' to be the first column (perhaps suitable for bin/agent_parse4tsv.rb)
col = df.pop("MatCitCollector")
df.insert(0, col.name, col)
df

Unnamed: 0,MatCitCollector,DocCount,MatCitId,MatCitGbifOccurrenceId,MatCitDate,MatCitDecade,MatCitYear,MatCitMonth
0,"1888 - 1890 & Morong, T.",1,78F03CF8FFE2FFE5C0C4F883FE73F8B4,3419301320,,0,0,0
1,"1914 & Chodat, R.",1,78F03CF8FFE5FFE2C187FB83FD0AFB94,3419301397,,0,0,0
2,1980 - Sino- American Botanical Expedition,1,1FFD3CFF806D3D11C410027311B3FEAC,4012799597,1980-09-19,1980,1980,9
3,"20. 8.201 3 & Delage, A.",1,AFA17A73FFA8F2414DA6F9AB94DCF942,3466701331,,0,0,0
4,"20. IX. 1957 & fr., Service Forestier",1,87ADD56BFF8DFF9BFBA0164C25E5FA86,3467693310,,0,0,0
...,...,...,...,...,...,...,...,...
421015,Z. Z. Xia,1,3B7F3CD7FFEDFFF5FB68FCBD4061FCB8,3072658352,2017-07-05,2010,2017,7
421016,Z. Z. Yang,1,3B5C3CD3FF9FFFACFCCB2B09BAD0FE79,1699618906,2002-06-25,2000,2002,6
421017,Z. Z. Yang,1,B5B23CA2C006FF87FB6FF9CBFA17F94A,2028140173,2009-08-18,2000,2009,8
421018,ZZ Zhang,1,3B063C92F16FFF93DA9FFC4DFEDB1D0B,3866542316,2015-06-08,2010,2015,6
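
As an alternative to `pop` and `insert`, which modify `df` in place, the column order can be rebuilt without touching the original frame. This is only a sketch on a small stand-in frame; `move_column_first` is a hypothetical helper name.

```python
import pandas as pd


def move_column_first(frame, column):
    """Return a new DataFrame with `column` first and the other columns in order."""
    ordered = [column] + [name for name in frame.columns if name != column]
    return frame[ordered]


# Stand-in frame with two rows taken from the sample output
demo = pd.DataFrame({
    "DocCount": [1, 1],
    "MatCitId": ["3B5C3CD3FF9FFFACFCCB2B09BAD0FE79", "3B063C92F16FFF93DA9FFC4DFEDB1D0B"],
    "MatCitCollector": ["Z. Z. Yang", "ZZ Zhang"],
})
reordered = move_column_first(demo, "MatCitCollector")
print(list(reordered.columns))
```

Because the original frame is left unchanged, re-running the cell does not alter the result, which is convenient in a notebook.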


## Write the Output Data



In [5]:
import os
from datetime import datetime

if not os.path.exists('data'):
    print("Make data directory for saving …")
    os.makedirs('data')

this_output_file = os.path.join(
    "data",
    "plazi_GbifOccurrenceId_CitCollector_%s.tsv" % datetime.today().strftime('%Y%m%d'),
)

print("Write data results into %s" % this_output_file)

df.to_csv(
    this_output_file,
    sep='\t',
    index=False,  # skip the index
    # header=["custom_colname_1", "custom_colname_2", "…"],  # could rewrite header labels
)


Write data results into data/plazi_GbifOccurrenceId_CitCollector_20230705.tsv
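
The same write step can be sketched with `pathlib`, which creates the directory and builds the dated file name in one place. `dated_tsv_path` is a hypothetical helper; the temporary directory only stands in for the real `data` directory, and the fixed date reproduces the file name shown above.

```python
import tempfile
from datetime import date
from pathlib import Path

import pandas as pd


def dated_tsv_path(directory, stem, day=None):
    """Create `directory` if needed and return <directory>/<stem>_<YYYYMMDD>.tsv."""
    day = day or date.today()
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{stem}_{day:%Y%m%d}.tsv"


demo = pd.DataFrame({"MatCitCollector": ["Z. Z. Yang"], "DocCount": [1]})

with tempfile.TemporaryDirectory() as tmp_dir:  # stand-in for the real 'data' directory
    out_path = dated_tsv_path(
        Path(tmp_dir) / "data",
        "plazi_GbifOccurrenceId_CitCollector",
        date(2023, 7, 5),
    )
    demo.to_csv(out_path, sep="\t", index=False)  # tab-separated, without the index
    written = pd.read_csv(out_path, sep="\t")     # read back to check the result

print(out_path.name)
```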


## Parse Collector Names

Now you can parse the names with dwcagent, provided the collector names are in the first column:

```bash
cd bin
ruby agent_parse4tsv.rb \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230705.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230705_parsed.tsv

# or also measure the running time of the parsing script with `time`

time ruby agent_parse4tsv.rb \
  --input ../data/plazi_GbifOccurrenceId_CitCollector_20230705.tsv \
  --output ../data/plazi_GbifOccurrenceId_CitCollector_20230705_parsed.tsv
# real    5m0,880s
# user    2m41,501s
# sys     1m56,399s
```
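
If the parsing step is started from the notebook instead of the shell, the running time can be measured in Python as well. This is a minimal sketch: `timed_run` is a hypothetical helper, it reports wall-clock time only (the `real` value of `time`, not `user` or `sys`), and a trivial stand-in command is used so the example runs without Ruby being installed.

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command, raise if it fails, and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


# Stand-in command so the sketch runs anywhere; for the real run (from bin/) pass
# ["ruby", "agent_parse4tsv.rb", "--input", "<input.tsv>", "--output", "<output.tsv>"]
elapsed = timed_run([sys.executable, "-c", "pass"])
print("Command took %.3f seconds" % elapsed)
```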