## Create Occurrence Data Set including eventDate

### Store GBIF Occurrence Data Set locally

For Naturalis Biodiversity Center (NL), see the example data set <https://doi.org/10.15468/dl.uw8rxk>. We saved all of it in the local data directory `data/Naturalis_doi-10.15468-dl.uw8rxk/`:
- The CSV data file is larger than 3 GB. Please download it first (**it is not included in the official GitHub documentation**) or change the code here to read your own input data.

In [1]:
import os, time, pprint

gbif_dataset_path="data/Naturalis_doi-10.15468-dl.uw8rxk"
gbif_occurrence_source_file="0165211-230224095556074.csv" # note: despite the .csv extension, the file is tab-separated (\t), not comma-separated

# join file name dynamically for saving results
this_output_tabdata_file=os.path.join(
    gbif_dataset_path, (
        "occurrence_recordedBy_eventDate_occurrenceIDs_%s.tsv" % 
        # '20230913'
        time.strftime('%Y%m%d')
    )
)
# or use a static file name for saving
# this_output_tabdata_file="data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_occurrenceIDs_20230524.tsv"

if not os.path.exists(gbif_dataset_path):
    print("Where is the folder of the GBIF occurrence data?", gbif_dataset_path, "not found")
    print("Recommendation: use a subfolder, e.g. “data/Naturalis_doi-10.15468-dl.uw8rxk”")
else:
    print("All right, expected data found:\n- GBIF data found in", gbif_dataset_path ,"\n- Results will later be written to", this_output_tabdata_file)
    if not os.path.exists(os.path.join(gbif_dataset_path, gbif_occurrence_source_file)):
        print("Which data source file is the right one? The expected CSV file", gbif_occurrence_source_file, "was not found.")
        print("Set the file in Python variable 'gbif_occurrence_source_file' to the correct file name.")

All right, expected data found:
- GBIF data found in data/Naturalis_doi-10.15468-dl.uw8rxk 
- Results will later be written to data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913.tsv


### Read GBIF Occurrence Data

Get `recordedBy` and `eventDate` from `gbif_occurrence_source_file=0165211-230224095556074.csv`, but look into the data first: the data columns and so on.

In [2]:
import pandas as pd # to read data

# Reading the whole file at once does not work for data this large, so read a single row to inspect the columns
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    nrows=1
)

pprint.pprint(occurrences.columns)

Index(['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class',
       'order', 'family', 'genus', 'species', 'infraspecificEpithet',
       'taxonRank', 'scientificName', 'verbatimScientificName',
       'verbatimScientificNameAuthorship', 'countryCode', 'locality',
       'stateProvince', 'occurrenceStatus', 'individualCount',
       'publishingOrgKey', 'decimalLatitude', 'decimalLongitude',
       'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation',
       'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day',
       'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord',
       'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber',
       'identifiedBy', 'dateIdentified', 'license', 'rightsHolder',
       'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted',
       'mediaType', 'issue'],
      dtype='object')
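
If only the column names are needed, a header-only read avoids loading any data rows. A minimal sketch, assuming a tiny in-memory stand-in for the tab-separated GBIF file (the three sample columns and the single row are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated GBIF download
sample_tsv = io.StringIO(
    "gbifID\toccurrenceID\trecordedBy\n"
    "1\thttps://example.org/specimen/1\tForsten EA\n"
)

# nrows=0 parses only the header line and returns an empty DataFrame
header_only = pd.read_csv(sample_tsv, sep="\t", nrows=0)
column_names = header_only.columns.tolist()
print(column_names)
```

With the real file, the `io.StringIO` object would be replaced by the path built from `gbif_dataset_path` and `gbif_occurrence_source_file`.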


In [3]:
# just look at the first rows
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy", "eventDate"],
    nrows=50 # reading all data at once gets the process killed (out of memory)
)
occurrences.head()


Unnamed: 0,occurrenceID,eventDate,recordedBy
0,https://data.biodiversitydata.nl/naturalis/spe...,1842-05-13T00:00:00,Forsten EA
1,https://data.biodiversitydata.nl/naturalis/spe...,1902-08-01T00:00:00,Goethart JWC; Jongmans WJ
2,https://data.biodiversitydata.nl/naturalis/spe...,1972-06-27T00:00:00,Wilde WJJO de; Wilde-Duyfjes BEE de
3,https://data.biodiversitydata.nl/naturalis/spe...,1929-07-29T00:00:00,Kloos Jr AW
4,https://data.biodiversitydata.nl/naturalis/spe...,1975-05-23T00:00:00,X


We follow <https://towardsdatascience.com/tips-and-tricks-for-loading-large-csv-files-into-pandas-dataframes-part-2-5fc02fc4e3ab> and keep only records that have an `occurrenceID`.

For large data sets it is better to read the file with a defined `chunksize`; otherwise pandas tries to load everything into memory at once and the process gets stuck:

In [4]:
starttime = time.time()

chunks_occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, gbif_occurrence_source_file), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy", "eventDate"],
    chunksize=100000
)

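# note: with chunksize, read_csv only sets up an iterator here; the rows are read in the loop below
print("read large data as chunk", time.time() - starttime, 'seconds')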

def filter_having_occurrenceID(df):
    df = df[df.occurrenceID.notnull()]
    print("filter having occurrenceID: " + str(df.shape))
    # print(df.shape)
    return df

starttime = time.time()
chunk_list = [] # used for storing dataframes
for chunk in chunks_occurrences:
    # each chunk is a dataframe
    # perform data filtering
    filtered_chunk = filter_having_occurrenceID(chunk)
    # Once the data filtering is done, append the filtered chunk to list
    chunk_list.append(filtered_chunk)

# concatenate all filtered chunks into one DataFrame
occurrences = pd.concat(chunk_list)

print("process data having only occurrenceID took", time.time() - starttime, ' seconds')
# occurrences.dropna(subset=['eventDate'], inplace=True) # test to keep NA for eventDate
occurrences.head()

read large data as chunk 0.016310453414916992 seconds
filter having occurrenceID: (100000, 3)
filter having occurrenceID: (100000, 3)
…

Unnamed: 0,occurrenceID,eventDate,recordedBy
0,https://data.biodiversitydata.nl/naturalis/spe...,1842-05-13T00:00:00,Forsten EA
1,https://data.biodiversitydata.nl/naturalis/spe...,1902-08-01T00:00:00,Goethart JWC; Jongmans WJ
2,https://data.biodiversitydata.nl/naturalis/spe...,1972-06-27T00:00:00,Wilde WJJO de; Wilde-Duyfjes BEE de
3,https://data.biodiversitydata.nl/naturalis/spe...,1929-07-29T00:00:00,Kloos Jr AW
4,https://data.biodiversitydata.nl/naturalis/spe...,1975-05-23T00:00:00,X
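
The very short first timing above is expected: with `chunksize`, `pd.read_csv` returns an iterator and parses the rows only when the loop consumes it. A minimal sketch of the same read-filter-concatenate pattern, assuming made-up toy data with five records (one of them without an `occurrenceID`):

```python
import io
import pandas as pd

# Hypothetical toy data: five records, the second one has no occurrenceID
toy_tsv = io.StringIO(
    "occurrenceID\teventDate\trecordedBy\n"
    "id-1\t1842-05-13T00:00:00\tForsten EA\n"
    "\t1902-08-01T00:00:00\tGoethart JWC\n"
    "id-3\t1972-06-27T00:00:00\tKloos Jr AW\n"
    "id-4\t\tX\n"
    "id-5\t1975-05-23T00:00:00\tX\n"
)

# With chunksize, read_csv returns an iterator; rows are parsed lazily
reader = pd.read_csv(toy_tsv, sep="\t", chunksize=2)

# Keep only the rows of each chunk that have an occurrenceID
kept = [chunk[chunk["occurrenceID"].notnull()] for chunk in reader]
toy_occurrences = pd.concat(kept)

print(len(kept))              # number of chunks read
print(toy_occurrences.shape)  # rows and columns after filtering
```

Note that `pd.concat` keeps the original row labels, so the index of the result has gaps where rows were filtered out.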


In [5]:
### Convert to date/time
#
# occurrences.dtypes
#   occurrenceID    object
#   recordedBy      object
#   eventDate       object
#   dtype: object

# df['dates'] = pd.to_datetime(df['dates'], format='%Y%m%d')
# pd.to_datetime("12/29/2020  9:09:37 PM", utc=True)
# pd.to_datetime("1904-07-01T00:00:00", utc=True)

# occurrences['eventDate']= pd.to_datetime(occurrences.eventDate, utc=True) # Out of bounds nanosecond timestamp: 1652-01-01T00:00:00
#  this fails because of the nanosecond range limits of pandas timestamps, see https://stackoverflow.com/a/69507200/1240387
#  workaround: use datetime
#  occurrences['eventDate'] = occurrences['eventDate'].apply(lambda x: datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
# or use pd.Period(…)
occurrences['eventDate'] = occurrences['eventDate'].apply(lambda x: pd.Period(x, freq='ms'))
occurrences.head()

Unnamed: 0,occurrenceID,eventDate,recordedBy
0,https://data.biodiversitydata.nl/naturalis/spe...,1842-05-13 00:00:00.000,Forsten EA
1,https://data.biodiversitydata.nl/naturalis/spe...,1902-08-01 00:00:00.000,Goethart JWC; Jongmans WJ
2,https://data.biodiversitydata.nl/naturalis/spe...,1972-06-27 00:00:00.000,Wilde WJJO de; Wilde-Duyfjes BEE de
3,https://data.biodiversitydata.nl/naturalis/spe...,1929-07-29 00:00:00.000,Kloos Jr AW
4,https://data.biodiversitydata.nl/naturalis/spe...,1975-05-23 00:00:00.000,X


In [6]:

occurrences.dtypes

occurrenceID       object
eventDate       period[L]
recordedBy         object
dtype: object
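
The reason for choosing `pd.Period` over `pd.to_datetime` can be checked on single values. A minimal sketch, assuming the out-of-bounds date `1652-01-01T00:00:00` from the error message quoted in the code comments:

```python
import pandas as pd

# pandas nanosecond timestamps only start in the year 1677
print(pd.Timestamp.min)

# A Period with millisecond frequency can hold earlier dates
early = pd.Period("1652-01-01T00:00:00", freq="ms")
print(early.year)

# A missing value becomes NaT instead of raising an error
missing = pd.Period(float("nan"), freq="ms")
print(pd.isna(missing))
```

Newer pandas versions display the millisecond frequency as `ms`, while older ones use the alias `L`, as in the `period[L]` output above.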

In [7]:
# group by the recordedBy string and aggregate the data:

occurrences_unique = occurrences.groupby(['recordedBy']).agg(
    occurrenceID_count=('occurrenceID', 'count'),  # use count function
    occurrenceID_firstsample=('occurrenceID', lambda x: list(x)[0]),  # custom function, to get the first entry
    eventDate_mean=('eventDate', 'mean'),
    eventDate_min=('eventDate', 'min'),
    eventDate_max=('eventDate', 'max'),
).reset_index()

occurrences_unique.head()

Unnamed: 0,recordedBy,occurrenceID_count,occurrenceID_firstsample,eventDate_mean,eventDate_min,eventDate_max
0,!,43,https://data.biodiversitydata.nl/naturalis/spe...,1884-07-30 06:00:00.000,1840-04-01 00:00:00.000,1962-02-22 00:00:00.000
1,!; Mafumo A,1,https://data.biodiversitydata.nl/naturalis/spe...,1980-08-21 00:00:00.000,1980-08-21 00:00:00.000,1980-08-21 00:00:00.000
2,'Insub',59,https://data.biodiversitydata.nl/naturalis/spe...,1978-04-16 09:21:21.355,1971-05-14 00:00:00.000,1978-05-30 00:00:00.000
3,'Landloopers',185,https://data.biodiversitydata.nl/naturalis/spe...,1919-11-02 20:35:24.591,1910-06-20 00:00:00.000,1969-05-24 00:00:00.000
4,1989 Expedition to Sulawesi RMNH,1,https://data.biodiversitydata.nl/naturalis/spe...,NaT,NaT,NaT
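
The named aggregation used here can be tried on a few rows. A minimal sketch with made-up toy data; it swaps in the built-in `"first"` aggregation for the custom lambda and uses plain ISO 8601 date strings instead of periods, so it illustrates the pattern rather than reproducing the cell above:

```python
import pandas as pd

# Hypothetical toy records; the last one has no recordedBy value
toy = pd.DataFrame({
    "occurrenceID": ["id-1", "id-2", "id-3", "id-4"],
    "recordedBy": ["Forsten EA", "X", "X", None],
    "eventDate": ["1842-05-13", "1975-05-23", "1910-06-20", "1929-07-29"],
})

# ISO 8601 date strings sort chronologically, so min/max work on them here
toy_unique = toy.groupby("recordedBy").agg(
    occurrenceID_count=("occurrenceID", "count"),
    occurrenceID_firstsample=("occurrenceID", "first"),
    eventDate_min=("eventDate", "min"),
    eventDate_max=("eventDate", "max"),
).reset_index()

print(toy_unique)
```

By default `groupby` drops rows whose grouping key is missing, so the record without `recordedBy` does not appear in the result.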


In [8]:
print("Write these tabbed data into", this_output_tabdata_file)

occurrences_unique.to_csv(this_output_tabdata_file, sep='\t', index=False)

Write these tabbed data into data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913.tsv
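
Before handing the file to the next step, it can help to read it back once. A minimal sketch, assuming a small made-up table and a temporary directory instead of the real output path:

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the aggregated table
toy_unique = pd.DataFrame({
    "recordedBy": ["Forsten EA", "X"],
    "occurrenceID_count": [1, 2],
    "eventDate_min": [pd.Period("1842-05-13", freq="ms"), pd.NaT],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, "toy_output.tsv")
    toy_unique.to_csv(tmp_file, sep="\t", index=False)
    # Read the file back with the same separator
    read_back = pd.read_csv(tmp_file, sep="\t")

print(read_back.shape)
print(read_back["eventDate_min"].isna().tolist())
```

Periods are written as plain text, and missing dates end up as empty cells that pandas reads back as missing values.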


## Parsing with dwcagent_bin

Ruby itself and the Ruby gem <https://libraries.io/rubygems/dwc_agent> have to be installed.

You can use the Ruby script `bin/agent_parse4tsv.rb` and change the file(s) for input and output:
```bash
cd bin
ruby agent_parse4tsv.rb --help # display usage and help of the script (use --develop to also keep the original input data in the parsed results)

# to also get the source data of the name strings, use --develop
ruby agent_parse4tsv.rb  --develop \
  --input  ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913.tsv \
  --output ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv

# or, to measure how fast it parses, prefix the command with time; --logfile also writes a log file of skipped names
time ruby agent_parse4tsv.rb --logfile \
  --input  ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913.tsv \
  --output ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv
# Done.
# We have 10741 empty parsing results detected.
#   You can also use --develop to get a full result table including the used source data of each parsed line
# Wrote log file of skipped names to
#   ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv_dwcagent_3.0.9.0.log
# Wrote data to
#   ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv
# -------------------------
# 
# real    1m44,766s
# user    1m1,331s
# sys     0m30,897s
```

Then look at the first data lines, e.g.:
```bash
head ../data/Naturalis_doi-10.15468-dl.uw8rxk/occurrence_recordedBy_eventDate_occurrenceIDs_20230913_parsed.tsv | column --table --separator $'\t'
# … you could get something like:
# family       given   suffix  …  occurrenceID_count  occurrenceID_firstsample                                            eventDate_mean           eventDate_min            eventDate_max
# Mafumo       A.              …  1                   https://data.biodiversitydata.nl/naturalis/specimen/WAG.1180536     1980-08-21 00:00:00.000  1980-08-21 00:00:00.000  1980-08-21 00:00:00.000
# Bioloog      Jaars           …  8                   https://data.biodiversitydata.nl/naturalis/specimen/L.4126740       1978-03-03 15:00:00.000  1946-01-01 00:00:00.000  1982-10-08 00:00:00.000
# Diraviadass  A.              …  1                   https://data.biodiversitydata.nl/naturalis/specimen/L.1744328       1980-04-27 00:00:00.000  1980-04-27 00:00:00.000  1980-04-27 00:00:00.000
# Anthony      E.C.            …  1                   https://data.biodiversitydata.nl/naturalis/specimen/L.1744328       1980-04-27 00:00:00.000  1980-04-27 00:00:00.000  1980-04-27 00:00:00.000
# Gonggrijp    J.W.            …  1                   https://data.biodiversitydata.nl/naturalis/specimen/U.1333750                                                         
# Haan         G.A.L.          …  1                   https://data.biodiversitydata.nl/naturalis/specimen/L.1760631       1938-02-17 00:00:00.000  1938-02-17 00:00:00.000  1938-02-17 00:00:00.000
# Persoon      Herb            …  2                   https://data.biodiversitydata.nl/naturalis/specimen/L.1449322                                                         
# Kogel        T.              …  1                   https://data.biodiversitydata.nl/naturalis/specimen/L%20%200667578  1983-11-09 00:00:00.000  1983-11-09 00:00:00.000  1983-11-09 00:00:00.000
# Matthew      G.F.            …  1                   https://data.biodiversitydata.nl/naturalis/specimen/L.1517941       1979-08-05 00:00:00.000  1979-08-05 00:00:00.000  1979-08-05 00:00:00.000
```
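
The parsed table can also be inspected with pandas. A minimal sketch, assuming a made-up excerpt that contains only three of the columns shown above (`family`, `given`, `occurrenceID_count`); the real file has more columns:

```python
import io
import pandas as pd

# Hypothetical excerpt of the parsed TSV with only a few of its columns
parsed_tsv = io.StringIO(
    "family\tgiven\toccurrenceID_count\n"
    "Mafumo\tA.\t1\n"
    "Bioloog\tJaars\t8\n"
    "Persoon\tHerb\t2\n"
)

parsed = pd.read_csv(parsed_tsv, sep="\t")

# Sort so that the names with the most occurrences come first
top = parsed.sort_values("occurrenceID_count", ascending=False)
print(top.head())
```

Sorting by `occurrenceID_count` is one way to bring frequent but doubtful parsing results, such as institution or herbarium labels, to the top for a manual check.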