## Store GBIF Occurrence Data Set locally

The Virtual Herbarium Germany (BGBM) data set is available at <https://doi.org/10.15468/dl.tued2e>. We saved all of it into the local data directory `data/VHde_0195853-230224095556074_BGBM/`:
- We need `occurrence.txt` as the basic data file. It is about 1 GB in size, so please download it first (**it will not be in the official GitHub documentation**) or change the code here to read your own input data.

In [8]:
import os
from datetime import datetime

gbif_dataset_path="data/VHde_0195853-230224095556074_BGBM"
# join file name for saving results
this_output_tabdata_file=os.path.join(
    gbif_dataset_path, ("occurrence_recordedBy_occurrenceIDs_%s.tsv" % datetime.today().strftime('%Y%m%d'))
)

if not os.path.exists(gbif_dataset_path):
    print("Where is the folder of the GBIF occurrence data?", gbif_dataset_path, "not found")
    print("Recommendation: use a subfolder, e.g. “data/VHde_0195853-230224095556074_BGBM”")
else:
    print("OK, GBIF data found. Results will later be written to", this_output_tabdata_file)

OK, GBIF data found. Results will later be written to data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230703.tsv
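
Since `occurrence.txt` is not shipped with the repository, it can help to check for the file itself and not only for the folder. The following is a minimal sketch; the helper name `describe_input_file` is hypothetical and not part of the original notebook.

```python
import os

def describe_input_file(dataset_path, filename="occurrence.txt"):
    """Return (exists, size_in_bytes) for the expected GBIF data file.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(dataset_path, filename)
    if not os.path.isfile(path):
        return False, 0
    return True, os.path.getsize(path)

exists, size = describe_input_file("data/VHde_0195853-230224095556074_BGBM")
if exists:
    print("occurrence.txt found, size: %.2f GB" % (size / 1e9))
else:
    print("occurrence.txt is missing, please download it from GBIF first")
```

Reporting the size gives a quick hint whether the download was complete before the longer reading steps start.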


## Read GBIF Occurrence Data

Get `recordedBy` from `occurrence.txt` and look into the data first: the data columns and so on.

In [2]:
import pandas as pd # to read data

# Read only the first row to get the column names; reading the whole 1 GB file at once does not work here
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, "occurrence.txt"), sep="\t", low_memory=False,
    nrows=1
)

print(list(occurrences.columns)) # 259 columns

['gbifID', 'abstract', 'accessRights', 'accrualMethod', 'accrualPeriodicity', 'accrualPolicy', 'alternative', 'audience', 'available', 'bibliographicCitation', 'conformsTo', 'contributor', 'coverage', 'created', 'creator', 'date', 'dateAccepted', 'dateCopyrighted', 'dateSubmitted', 'description', 'educationLevel', 'extent', 'format', 'hasFormat', 'hasPart', 'hasVersion', 'identifier', 'instructionalMethod', 'isFormatOf', 'isPartOf', 'isReferencedBy', 'isReplacedBy', 'isRequiredBy', 'isVersionOf', 'issued', 'language', 'license', 'mediator', 'medium', 'modified', 'provenance', 'publisher', 'references', 'relation', 'replaces', 'requires', 'rights', 'rightsHolder', 'source', 'spatial', 'subject', 'tableOfContents', 'temporal', 'title', 'type', 'valid', 'institutionID', 'collectionID', 'datasetID', 'institutionCode', 'collectionCode', 'datasetName', 'ownerInstitutionCode', 'basisOfRecord', 'informationWithheld', 'dataGeneralizations', 'dynamicProperties', 'occurrenceID', 'catalogNumber', 

In [3]:
# just see the first rows
occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, "occurrence.txt"), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy"],
    nrows=50 # reading all data at once gets the process killed for lack of memory
)
occurrences.head()

| | occurrenceID | recordedBy |
|---|---|---|
| 0 | | Kurt Harz |
| 1 | | Eugen Erdner |
| 2 | | Alois Zick |
| 3 | | J. Kraenzle |
| 4 | | Hermann Poeverlein |


We follow <https://towardsdatascience.com/tips-and-tricks-for-loading-large-csv-files-into-pandas-dataframes-part-2-5fc02fc4e3ab> and keep only the rows that have an `occurrenceID`.

For large data sets it is better to read the file with a defined `chunksize`, because otherwise pandas tries to load everything into memory at once and the process gets stuck or is killed:

In [4]:
import time

starttime = time.time()

chunks_occurrences = pd.read_csv(
    os.path.join(gbif_dataset_path, "occurrence.txt"), sep="\t", low_memory=False,
    usecols=["occurrenceID", "recordedBy"],
    chunksize=100000
)

print("read large data as chunk", time.time() - starttime, 'seconds')


read large data as chunk 0.014245271682739258 seconds
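
The very short time reported above is plausible because `pd.read_csv` with `chunksize` returns an iterator (a `TextFileReader`) and defers parsing the rows until the chunks are requested. A small sketch with in-memory data shows the behaviour; the IDs and collector names are invented for illustration and do not come from the GBIF file.

```python
import io
import pandas as pd

# Made-up tab-separated data with 10 rows (illustration only)
tsv = "occurrenceID\trecordedBy\n" + "".join(
    "id%d\tcollector%d\n" % (i, i) for i in range(10)
)

# With chunksize, read_csv returns an iterator and not a DataFrame
reader = pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=4)
reader_type = type(reader).__name__
print(reader_type)  # TextFileReader

# The rows are parsed only while iterating over the chunks
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)  # [4, 4, 2]
```

Note that the iterator is exhausted after one pass, so the `read_csv` call has to be repeated before the chunks can be looped over a second time.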


In [5]:
def filter_having_occurrenceID(df):
    df = df[df.occurrenceID.notnull()]
    print("filter having occurrenceID: " + str(df.shape))
    # print(df.shape)
    return df

starttime = time.time()
chunk_list = [] # used for storing dataframes
for chunk in chunks_occurrences:
    # each chunk is a dataframe
    # perform data filtering
    filtered_chunk = filter_having_occurrenceID(chunk)
    # Once the data filtering is done, append the filtered chunk to list
    chunk_list.append(filtered_chunk)

# concatenate all the data frames in the list into one
occurrences = pd.concat(chunk_list)

print("process data having only occurrenceID took", time.time() - starttime, ' seconds')
occurrences.head()

filter having occurrenceID: (75997, 2)
filter having occurrenceID: (100000, 2)
filter having occurrenceID: (5567, 2)
filter having occurrenceID: (32993, 2)
filter having occurrenceID: (99970, 2)
filter having occurrenceID: (72362, 2)
filter having occurrenceID: (100000, 2)
filter having occurrenceID: (100000, 2)
filter having occurrenceID: (27751, 2)
filter having occurrenceID: (0, 2)
filter having occurrenceID: (0, 2)
process data having only occurrenceID took 16.236688375473022  seconds


| | occurrenceID | recordedBy |
|---|---|---|
| 18004 | https://je.jacq.org/JE00003434 | Ecklon,C.F. & Zeyher,C.L.P. |
| 18005 | https://je.jacq.org/JE00003433 | Ecklon,C.F. & Zeyher,C.L.P. |
| 18006 | https://je.jacq.org/JE00003435 | Ecklon,C.F. & Zeyher,C.L.P. |
| 18007 | https://je.jacq.org/JE00003436 | Ecklon,C.F. & Zeyher,C.L.P. |
| 18008 | https://je.jacq.org/JE00003430 | Zenker,G. |


In [6]:
# group and aggregate data: 
occurrences_unique=occurrences.groupby(['recordedBy']).agg(
    occurrenceID_count= ('occurrenceID', 'count'), # use count function
    occurrenceID_firstsample=('occurrenceID', lambda x: list(x)[0]) # custom function, to get the first entry
).reset_index()


occurrences_unique.tail()

| | recordedBy | occurrenceID_count | occurrenceID_firstsample |
|---|---|---|---|
| 52725 | Żelazny,J. | 4 | https://herbarium.bgbm.org/object/B100344466 |
| 52726 | Ždanova,O. | 5 | https://herbarium.bgbm.org/object/B100263330 |
| 52727 | Žíla,V. | 3 | https://herbarium.bgbm.org/object/B100009590 |
| 52728 | Волкова Е. | 1 | https://herbarium.bgbm.org/object/B100530714 |
| 52729 | Жирова,O. | 1 | https://herbarium.bgbm.org/object/B100630811 |
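
To see what the grouping step does, here is a small sketch on invented toy data (the IDs are made up; only the collector names are borrowed from the outputs above). It uses the built-in `"first"` aggregation as an assumed stand-in for the custom lambda; `"first"` returns the first non-null value, which is the same result here because rows without an `occurrenceID` were filtered out beforehand.

```python
import pandas as pd

# Toy data for illustration only; the last row has no collector name
toy = pd.DataFrame({
    "occurrenceID": ["id1", "id2", "id3", "id4"],
    "recordedBy": ["Zenker,G.", "Kurt Harz", "Zenker,G.", None],
})

toy_unique = toy.groupby("recordedBy").agg(
    occurrenceID_count=("occurrenceID", "count"),        # number of records per name
    occurrenceID_firstsample=("occurrenceID", "first"),  # first non-null ID per name
).reset_index()

print(toy_unique)
```

Rows with a missing `recordedBy` are dropped by `groupby` by default, so the fourth toy row does not appear in the result. The same applies to the real data: occurrences without a collector name are not counted in `occurrences_unique`.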


In [9]:
print("Write these tabbed data into", this_output_tabdata_file)

occurrences_unique.to_csv(this_output_tabdata_file, sep='\t')

Write these tabbed data into data/VHde_0195853-230224095556074_BGBM/occurrence_recordedBy_occurrenceIDs_20230703.tsv
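
Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. A minimal sketch with toy data (invented for illustration, written to an in-memory buffer and not to the real output file) shows the resulting header and how the file can be read back:

```python
import io
import pandas as pd

# Toy stand-in for occurrences_unique (illustration only)
toy_unique = pd.DataFrame({
    "recordedBy": ["Kurt Harz", "Zenker,G."],
    "occurrenceID_count": [1, 2],
})

buffer = io.StringIO()
toy_unique.to_csv(buffer, sep="\t")

# The header starts with an empty field for the index column
header = buffer.getvalue().splitlines()[0].split("\t")
print(header)  # ['', 'recordedBy', 'occurrenceID_count']

# Read the data back, using the first column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(list(restored.columns))  # ['recordedBy', 'occurrenceID_count']
```

Any tool that reads the TSV afterwards has to account for that extra first column; passing `index=False` to `to_csv` would omit it, if that suits the downstream script better.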


## Parsing with dwcagent_bin

As dependencies, the Ruby gem `dwc_agent` (<https://libraries.io/rubygems/dwc_agent>) and Ruby itself have to be installed.

You can use the Ruby script `bin/agent_parse4tsv.rb`; change the file input and output inside its code first. After that you can run it like:
```bash
cd bin
ruby agent_parse4tsv.rb

# or if you want to measure how fast it parses use
time ruby agent_parse4tsv.rb
# real    0m41,923s
# user    0m23,390s
# sys     0m16,252s
```
