## Create Occurrence Data Set including eventDate

### Store GBIF Occurrence Data Set locally

For the Meise Botanic Garden Herbarium (BR), see the example data set <https://doi.org/10.15468/dl.ax9zkh>. We saved all of it in the local data directory `data/Meise_doi-10.15468-dl.ax9zkh/`:
- The CSV data file is larger than 1 GB. Download it first (**it will not be in the official GitHub documentation**) or change the code here to read your own input data.

In [8]:
import os, time, pprint

gbif_dataset_path="data/Meise_doi-10.15468-dl.ax9zkh"
gbif_occurrence_source_file="0165208-230224095556074.csv" # tab-separated despite the .csv extension (one would expect commas, not tabs!)

# join file name dynamically for saving results
this_output_tabdata_file=os.path.join(
    gbif_dataset_path, (
        "occurrence_recordedBy_eventDate_occurrenceIDs_%s.tsv" % 
        # '20230726'
        time.strftime('%Y%m%d')
    )
)
# or use a static file name for saving
# this_output_tabdata_file="data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_occurrenceIDs_20230524.tsv"

if not os.path.exists(gbif_dataset_path):
    print("Where is the folder of our GBIF occurrence data?", gbif_dataset_path, "not found")
    print("We recommend using a subfolder, e.g. “data/Meise_doi-10.15468-dl.ax9zkh”")
else:
    print("All right, expected data found:\n- GBIF data found in", gbif_dataset_path ,"\n- Results will later be written to", this_output_tabdata_file)
    if not os.path.exists(os.path.join(gbif_dataset_path, gbif_occurrence_source_file)):
        print("What data source file is the right one? The expected CSV-file", gbif_occurrence_source_file, "was not found.")
        print("Set the file in Python variable 'gbif_occurrence_source_file' to the correct file name.")

All right, expected data found:
- GBIF data found in data/Meise_doi-10.15468-dl.ax9zkh 
- Results will later be written to data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830.tsv

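Because the GBIF download is tab-separated even though it is named `.csv`, it can help to check the separator before parsing. The following is a minimal sketch and not part of the original notebook: `guess_separator` is a hypothetical helper that assumes the file uses either tabs or commas and simply counts both in the header line.

```python
def guess_separator(header_line):
    """Guess the column separator of a header line (hypothetical helper).

    Assumption: the file uses either tabs or commas; whichever occurs
    more often in the header line is taken as the separator.
    """
    return "\t" if header_line.count("\t") >= header_line.count(",") else ","


# Example with a shortened GBIF-style header line
sample_header = "gbifID\tdatasetKey\toccurrenceID\trecordedBy\teventDate"
separator = guess_separator(sample_header)
print(repr(separator))
```

For the real file, the header line could be read with `open(...).readline()` and the result passed as `sep=` to `pd.read_csv`.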

### Read GBIF Occurrence Data

Get `recordedBy` and `eventDate` from `gbif_occurrence_source_file=0165208-230224095556074.csv`, but first look at the data and its columns.

In [10]:
import pandas as pd # to read data

# Reading the whole 1 GB file at once does not work (yet), so read a single row to get the column names
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    nrows=1
)

pprint.pprint(occurrences.columns)

Index(['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class',
       'order', 'family', 'genus', 'species', 'infraspecificEpithet',
       'taxonRank', 'scientificName', 'verbatimScientificName',
       'verbatimScientificNameAuthorship', 'countryCode', 'locality',
       'stateProvince', 'occurrenceStatus', 'individualCount',
       'publishingOrgKey', 'decimalLatitude', 'decimalLongitude',
       'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation',
       'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day',
       'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord',
       'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber',
       'identifiedBy', 'dateIdentified', 'license', 'rightsHolder',
       'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted',
       'mediaType', 'issue'],
      dtype='object')


In [11]:
# just see the first rows
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy", "eventDate"],
    nrows=50 # reading all data at once gets the process killed (out of memory)
)
occurrences.head()


|   | occurrenceID | eventDate | recordedBy |
|---|---|---|---|
| 0 | http://www.botanicalcollections.be/specimen/BR... |  | Lebrun J. |
| 1 | http://www.botanicalcollections.be/specimen/BR... |  | Jurion F. |
| 2 | http://www.botanicalcollections.be/specimen/BR... |  | Dubois H. |
| 3 | http://www.botanicalcollections.be/specimen/BR... | 1960-04-16T00:00:00 | Hendrickx F.L. |
| 4 | http://www.botanicalcollections.be/specimen/BR... | 1921-05-01T00:00:00 | Claessens J. |


We follow <https://towardsdatascience.com/tips-and-tricks-for-loading-large-csv-files-into-pandas-dataframes-part-2-5fc02fc4e3ab> and keep only rows that have an `occurrenceID`.

For large data sets it is better to read the file in pieces by defining a `chunksize`; otherwise pandas tries to load everything at once and the process gets stuck:

In [12]:
starttime = time.time()

chunks_occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy", "eventDate"],
    chunksize=100000
)

print("read large data as chunk", time.time() - starttime, 'seconds')

def filter_having_occurrenceID(df):
    df = df[df.occurrenceID.notnull()]
    print("filter having occurrenceID: " + str(df.shape))
    # print(df.shape)
    return df

starttime = time.time()
chunk_list = [] # used for storing dataframes
for chunk in chunks_occurrences:
    # each chunk is a dataframe
    # perform data filtering
    filtered_chunk = filter_having_occurrenceID(chunk)
    # Once the data filtering is done, append the filtered chunk to list
    chunk_list.append(filtered_chunk)

# concatenate all filtered chunks in the list into one DataFrame
occurrences = pd.concat(chunk_list)

print("process data having only occurrenceID took", time.time() - starttime, ' seconds')
# occurrences.dropna(subset=['eventDate'], inplace=True) # test to keep NA for eventDate
occurrences.head()

read large data as chunk 0.00852656364440918 seconds
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
filter having occurrenceID:

|   | occurrenceID | eventDate | recordedBy |
|---|---|---|---|
| 0 | http://www.botanicalcollections.be/specimen/BR... |  | Lebrun J. |
| 1 | http://www.botanicalcollections.be/specimen/BR... |  | Jurion F. |
| 2 | http://www.botanicalcollections.be/specimen/BR... |  | Dubois H. |
| 3 | http://www.botanicalcollections.be/specimen/BR... | 1960-04-16T00:00:00 | Hendrickx F.L. |
| 4 | http://www.botanicalcollections.be/specimen/BR... | 1921-05-01T00:00:00 | Claessens J. |

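The chunked read-and-filter pattern can be tried without the 1 GB download. The following is a minimal sketch on a tiny in-memory sample; the rows are made-up placeholders, and `ignore_index=True` is an addition here that renumbers the rows after filtering (the cell above keeps the original index).

```python
import io

import pandas as pd

# Tiny tab-separated sample standing in for the GBIF file (made-up values)
sample_tsv = (
    "occurrenceID\teventDate\trecordedBy\n"
    "id-1\t1960-04-16T00:00:00\tCollector A\n"
    "\t1921-05-01T00:00:00\tCollector B\n"
    "id-3\t\tCollector C\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(io.StringIO(sample_tsv), sep="\t", chunksize=2)

# Keep only rows that have an occurrenceID, chunk by chunk
filtered = [chunk[chunk["occurrenceID"].notnull()] for chunk in chunks]
sample_occurrences = pd.concat(filtered, ignore_index=True)

print(sample_occurrences.shape)
```

Only one chunk at a time is parsed, so peak memory depends on `chunksize` and on how many rows survive the filter rather than on the size of the input file.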

In [13]:
### Convert to date/time
#
# occurrences.dtypes
#   occurrenceID    object
#   recordedBy      object
#   eventDate       object
#   dtype: object

# df['dates'] = pd.to_datetime(df['dates'], format='%Y%m%d')
# pd.to_datetime("12/29/2020  9:09:37 PM", utc=True)
# pd.to_datetime("1904-07-01T00:00:00", utc=True)

# occurrences['eventDate']= pd.to_datetime(occurrences.eventDate, utc=True) # Out of bounds nanosecond timestamp: 1652-01-01T00:00:00
#  because of the nanosecond range limitation of pandas timestamps, see https://stackoverflow.com/a/69507200/1240387
#  workaround: use datetime (requires `from datetime import datetime`)
#  occurrences['eventDate'] = occurrences['eventDate'].apply(lambda x: datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
# or use pd.Period(…)
occurrences['eventDate'] = occurrences['eventDate'].apply(lambda x: pd.Period(x, freq='ms'))
occurrences.head()

|   | occurrenceID | eventDate | recordedBy |
|---|---|---|---|
| 0 | http://www.botanicalcollections.be/specimen/BR... | NaT | Lebrun J. |
| 1 | http://www.botanicalcollections.be/specimen/BR... | NaT | Jurion F. |
| 2 | http://www.botanicalcollections.be/specimen/BR... | NaT | Dubois H. |
| 3 | http://www.botanicalcollections.be/specimen/BR... | 1960-04-16 00:00:00.000 | Hendrickx F.L. |
| 4 | http://www.botanicalcollections.be/specimen/BR... | 1921-05-01 00:00:00.000 | Claessens J. |

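The `pd.Period` conversion can be illustrated on single values. The sketch below is not part of the original notebook: `to_period_or_nat` is a hypothetical helper, and the fallback to `pd.NaT` for unparseable text is an assumption about how one might want to treat such values, not something the data set is known to require.

```python
import pandas as pd


def to_period_or_nat(value, freq="ms"):
    """Convert an ISO date string to pd.Period; return pd.NaT if that fails.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        return pd.Period(value, freq=freq)
    except (ValueError, TypeError):
        return pd.NaT


early = to_period_or_nat("1652-01-01T00:00:00")  # the date that was out of bounds above
missing = to_period_or_nat(float("nan"))         # missing eventDate
broken = to_period_or_nat("unknown")             # unparseable text

print(early.year, missing is pd.NaT, broken is pd.NaT)
```

Applied to the column, this would read `occurrences['eventDate'].apply(to_period_or_nat)`.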

In [14]:

occurrences.dtypes

occurrenceID       object
eventDate       period[L]
recordedBy         object
dtype: object

In [15]:
# group and aggregate data: 

occurrences_unique=occurrences.groupby(['recordedBy']).agg(
    occurrenceID_count= ('occurrenceID', 'count'), # use count function
    occurrenceID_firstsample=('occurrenceID', lambda x: list(x)[0]) # custom function, to get the first entry
    , eventDate_mean=('eventDate', 'mean')
    , eventDate_min=('eventDate', 'min')
    , eventDate_max=('eventDate', 'max')
).reset_index()

occurrences_unique.head()

|   | recordedBy | occurrenceID_count | occurrenceID_firstsample | eventDate_mean | eventDate_min | eventDate_max |
|---|---|---|---|---|---|---|
| 0 | 'Discovery' Exped. | 1 | http://www.botanicalcollections.be/specimen/BR... | 1934-04-01 00:00:00.000 | 1934-04-01 00:00:00.000 | 1934-04-01 00:00:00.000 |
| 1 | 'Discovery' Exped. 1934-35 | 3 | http://www.botanicalcollections.be/specimen/BR... | 1934-12-28 16:00:00.000 | 1934-12-14 00:00:00.000 | 1935-01-20 00:00:00.000 |
| 2 | 'Discovery' Expedition | 1 | http://www.botanicalcollections.be/specimen/BR... | 1934-01-01 00:00:00.000 | 1934-01-01 00:00:00.000 | 1934-01-01 00:00:00.000 |
| 3 | 'Engledow in Bolton J.J.' | 2 | http://www.botanicalcollections.be/specimen/BR... | 1998-07-08 12:00:00.000 | 1998-07-08 00:00:00.000 | 1998-07-09 00:00:00.000 |
| 4 | 'Engledow in Bolton' | 2 | http://www.botanicalcollections.be/specimen/BR... | 1998-07-08 00:00:00.000 | 1998-07-08 00:00:00.000 | 1998-07-08 00:00:00.000 |

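The named-aggregation syntax used in the grouping step follows the pattern `result_column=(source_column, aggregation)`. The following is a minimal sketch on made-up rows; it deliberately differs from the notebook cell in two ways: dates are plain `datetime64` values instead of periods, and the built-in `"first"` aggregation replaces the custom lambda (note that `"first"` skips missing values).

```python
import pandas as pd

# Made-up sample rows; names and IDs are placeholders
sample = pd.DataFrame(
    {
        "recordedBy": ["Collector A", "Collector A", "Collector B"],
        "occurrenceID": ["id-1", "id-2", "id-3"],
        "eventDate": pd.to_datetime(["1998-07-08", "1998-07-09", "2008-06-05"]),
    }
)

# Named aggregation: result_column=(source_column, aggregation)
summary = sample.groupby("recordedBy").agg(
    occurrenceID_count=("occurrenceID", "count"),
    occurrenceID_firstsample=("occurrenceID", "first"),
    eventDate_min=("eventDate", "min"),
    eventDate_max=("eventDate", "max"),
).reset_index()

print(summary)
```

Each collector name ends up as one row with its record count, one sample `occurrenceID`, and the earliest and latest collecting date.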

In [16]:
print("Write these tabbed data into", this_output_tabdata_file)

occurrences_unique.to_csv(this_output_tabdata_file, sep='\t', index=False)

Write these tabbed data into data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830.tsv


## Parsing with dwcagent_bin

This step requires Ruby itself and the Ruby gem package <https://libraries.io/rubygems/dwc_agent> to be installed.

You can use the Ruby script `bin/agent_parse4tsv.rb`; change the input and output file(s) as needed:
```bash
cd bin
ruby agent_parse4tsv.rb --help # display usage and help of the script

ruby agent_parse4tsv.rb \
  --input  ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830.tsv \
  --output ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv

# or, to measure how fast it parses, prefix the command with `time`; --logfile also writes a log file of skipped names
time ruby agent_parse4tsv.rb --logfile \
  --input  ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830.tsv \
  --output ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv
#   Now:
#   - read data from ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830.tsv
#   - write data to  ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv
#   - write log of skipped names into output directory as well: occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv.log
# real    1m6,024s
# user    0m35,417s
# sys     0m24,186s
```

Then look at the first lines of the parsed data, e.g.
```bash
head ../data/Meise_doi-10.15468-dl.ax9zkh/occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv | column --table --separator $'\t'
# … you could get something like:
# family     given  suffix  particle  dropping_particle  nick  appellation  title  occurrenceID_count  occurrenceID_firstsample                                  eventDate_mean           eventDate_min            eventDate_max
# Azofeifa   A.                                                                    2                   https://herbarium.bgbm.org/object/B200211416              1998-04-24 12:00:00.000  1998-03-10 00:00:00.000  1998-06-09 00:00:00.000
# A. Cano    E.                                                                    1                   https://herbarium.bgbm.org/object/B100699397              2008-06-05 00:00:00.000  2008-06-05 00:00:00.000  2008-06-05 00:00:00.000
# Selmons    Ad                                                                    1                   https://herbarium.bgbm.org/object/B100379213              1917-07-01 00:00:00.000  1917-07-01 00:00:00.000  1917-07-01 00:00:00.000
# Aaronsohn  A.                                                                    1                   https://herbarium.bgbm.org/object/B100379341              1908-06-20 00:00:00.000  1908-06-20 00:00:00.000  1908-06-20 00:00:00.000
# Ani        H.             Abbas al                                               1                   http://id.snsb.info/snsb/collection/462713/563871/241553  1964-11-18 00:00:00.000  1964-11-18 00:00:00.000  1964-11-18 00:00:00.000
# Abbe       L.B.                                                                  1                   https://herbarium.bgbm.org/object/BGT0003826              1960-03-18 00:00:00.000  1960-03-18 00:00:00.000  1960-03-18 00:00:00.000
# Abbe       E.C.                                                                  1                   https://herbarium.bgbm.org/object/BGT0003826              1960-03-18 00:00:00.000  1960-03-18 00:00:00.000  1960-03-18 00:00:00.000
```