R Script that preps a dataset for a collections dashboard
data01raw
functions
output-sample
supplementary
.gitignore
LICENSE
Master.R
README.md
collprep.Rproj
dash005DarPrep.R
dash010CatPrep.R
dash015AccPrep.R
dash020FullBind.R
dash021Where.R
dash022What.R
dash023When.R
dash024Who.R
dash025Experience.R
dash026LoansPrep.R
dash027VisitPrep.R
dash028Ecoregions.R
dash030FullExport.R
dash050InstData.R
fmnh-coll-dash2.Rproj


collections-dashboard-prep

These scripts prepare EMu and GBIF collections data for a Collections Dashboard.

Catalogue/Specimen records are combined with Accession/Storage Unit records to count catalogued and backlogged items in the collections.

How to use these scripts

  1. Clone repo locally.
  2. Match raw catalog dataset structure to /data01raw/CatDash03bu.csv
  3. Match raw accessions dataset structure to /data01raw/AccBacklogBU.csv
  4. Retrieve GBIF datasets as DwC Archive files in /data01raw/CatDwC/
  5. Run the prep scripts by running Master.R

To run Master.R in the RStudio console:

  • First, type setwd("path/to/local/repo") to set the working directory.
  • Second, type source("Master.R") to run the prep scripts.
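The two steps above can be wrapped in a small function. This is only a sketch: `run_prep` is a hypothetical helper that is not part of this repo, and it assumes Master.R sits at the top level of the local clone.

```r
# Hypothetical helper (not part of this repo): run Master.R from a local clone.
run_prep <- function(repo_dir) {
  master <- file.path(repo_dir, "Master.R")
  if (!file.exists(master)) {
    stop("Master.R not found in: ", repo_dir)
  }
  old_wd <- setwd(repo_dir)            # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE)   # restore it when the function exits
  source("Master.R")                   # Master.R runs the prep scripts
  invisible(TRUE)
}

# Usage (replace with the path to your local clone):
# run_prep("path/to/local/repo")
```

Checking for the file first gives a clearer error than the one `source()` raises when the working directory is wrong.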

Notes about fields in the raw input data

Datasets here were exported from the FMNH EMu collections database. Darwin Core fields were used when possible, but not all fields mapped directly to Darwin Core fields. For example, "description" fields in accession records often include information about "Where" as well as "What".

The full list of ecatalogue fields is as follows:

Core and Quality-related fields:

  • irn
  • DarGlobalUniqueIdentifier
  • CatDepartment
  • DarCatalogNumber
  • DarInstitutionCode
  • DarCollectionCode
  • AdmDateInserted
  • AdmDateModified
  • DarIndividualCount
  • DarBasisOfRecord
  • DarImageURL
  • MulHasMultiMedia
  • DarCollector
  • CatLegalStatus
  • DarStateProvince

Where-related fields:

  • DarLatitude
  • DarLongitude
  • DarCountry
  • DarContinent
  • DarContinentOcean
  • DarWaterBody

Who-related fields:

  • DesEthnicGroupSubgroup_tab

What-related fields:

  • EcbNameOfObject
  • DesMaterials_tab
  • DarOrder
  • DarScientificName
  • IdeTaxonRef_tab.ClaRank
  • IdeTaxonRef_tab.ComName_tab
  • DarRelatedInformation
  • CatProject_tab
  • IdeFiledAs_tab (to be added)

WhenAge-related fields:

  • DarEarliestAge
  • DarEarliestEon
  • DarEarliestEpoch
  • DarEarliestEra
  • DarEarliestPeriod
  • AttPeriod_tab
  • DarYearCollected
  • DarMonthCollected

The full list of efmnhtransactions (accession record) fields is as follows:

Count-related fields (used for calculating backlogged items):

  • irn
  • AccCatalogue
  • AccTotalItems
  • AccTotalObjects
  • AccCount_tab
  • PriAccessionNumberRef.CatCatalog
  • PriAccessionNumberRef.DarIndividualCount
  • PriAccessionNumberRef.irn
  • PriAccessionNumberRef.DarBasisOfRecord
  • PriAccessionNumberRef.CatItemsInv

What- & Who-related fields:

  • AccDescription_tab
  • AccAccessionDescription

Where-related fields:

  • AccGeography_tab
  • AccLocality
  • AccCollectionEventRef.ColSiteRef.LocContinent_tab
  • AccCollectionEventRef.ColSiteRef.LocCountry_tab
  • AccCollectionEventRef.ColSiteRef.LocOcean_tab

Notes about fields in the output "FullDash" dataset

Where, What, WhenAge, Who

These fields are prepped in the corresponding scripts: dash021Where.R, dash022What.R, dash023When.R, and dash024Who.R. They broadly accommodate both cultural and natural history datasets, incorporating standard Darwin Core fields when possible. The input dataset groupings (listed above) indicate which input fields correspond to these output fields. Note: dash022What.R references the /supplementary/WhatComNames.csv lookup to join common names from ITIS with the specimen dataset (on the DarOrder field).
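As an illustration of the common-name join described above, here is a minimal base R sketch on toy data. It is not the code in dash022What.R; the `CommonName` column and the sample values are assumptions, and only the DarOrder join key comes from the text.

```r
# Toy specimen records; DarOrder is the join key named in the text above.
specimens <- data.frame(
  irn = c(1, 2, 3),
  DarOrder = c("Carnivora", "Rodentia", NA),
  stringsAsFactors = FALSE
)

# Stand-in for /supplementary/WhatComNames.csv.
# The column name "CommonName" is an assumption, not the file's real header.
com_names <- data.frame(
  DarOrder = c("Carnivora", "Rodentia"),
  CommonName = c("carnivores", "rodents"),
  stringsAsFactors = FALSE
)

# Left join: keep every specimen, even those without a matching order.
what <- merge(specimens, com_names, by = "DarOrder", all.x = TRUE)
what <- what[order(what$irn), ]
```

Using `all.x = TRUE` keeps records that have no DarOrder value (cultural objects, for example); their common name is simply left as NA.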

1) Fields prepped in dash020FullBind.R:

Quality

A ranking based on the following criteria (poor = 1; good = 9):

  • 1 = Digital accession record exists
  • 2 = Total Object (lots) > 0 OR Total Items (specimens) > 0
  • 3 = Locality Not Null
  • 4 = Catalogue # Not NULL
  • 5 = Reverse attached catalogue records Not NULL
  • 6 = Has Digital Catalogue record
  • 7 = Has Partial Data
  • 8 = PriCoordinateIndicator = Yes OR HasMultimedia = Yes
  • 9 = PriCoordinateIndicator = Yes AND HasMultimedia = Yes AND Has Full Data = Yes

Partial Data = Has 3 or 4 of the following:

  • IdeTaxonRef_tab.ClaRank = Family, Genus, Species, Subspecies or Variety
  • DarStateProvince Not NULL
  • DarCollector Not NULL
  • DarYearCollected Not NULL
  • DarCatalogNumber Not NULL

Full Data = Has all 5 of the above
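The Partial Data and Full Data flags can be sketched as a count of how many of the five criteria each record meets. This is a rough illustration on toy data, not the code in dash020FullBind.R; the `not_null` helper and the choice to treat empty strings as NULL are assumptions.

```r
# Toy catalogue records using the five input fields listed above.
cat_recs <- data.frame(
  IdeTaxonRef_tab.ClaRank = c("Species", "Order", "Genus"),
  DarStateProvince = c("Illinois", NA, "Goias"),
  DarCollector = c("A. Collector", NA, ""),
  DarYearCollected = c(1923, NA, 1950),
  DarCatalogNumber = c("12345", "67890", "555"),
  stringsAsFactors = FALSE
)

# Hypothetical helper. Assumption: NA and empty strings both count as NULL.
not_null <- function(x) !is.na(x) & trimws(as.character(x)) != ""

id_ranks <- c("Family", "Genus", "Species", "Subspecies", "Variety")

# Adding logical vectors counts how many criteria each record meets (0 to 5).
criteria_met <-
  (cat_recs$IdeTaxonRef_tab.ClaRank %in% id_ranks) +
  not_null(cat_recs$DarStateProvince) +
  not_null(cat_recs$DarCollector) +
  not_null(cat_recs$DarYearCollected) +
  not_null(cat_recs$DarCatalogNumber)

cat_recs$PartialData <- criteria_met %in% c(3, 4)
cat_recs$FullData <- criteria_met == 5
```

In this toy set the first record meets all five criteria (Full Data), the second meets only one (neither flag), and the third meets four because its collector is blank (Partial Data).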

RecordType

Indicates whether the record is "Catalog" or "Accession" data, and therefore part of the catalogued or backlogged items.

DarIndividualCount

The number of items catalogued, from the DarIndividualCount field of a catalogue record.

Backlog

The number of items backlogged = the number of accessioned (or inventoried) items minus the number of catalogued items.
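A worked example of the backlog arithmetic on toy numbers. AccTotalItems is one of the input fields listed earlier; the `CataloguedItems` column is a hypothetical stand-in for the summed DarIndividualCount of the catalogue records attached to each accession, and the real scripts may derive it differently.

```r
# Toy accession records. "CataloguedItems" is a hypothetical column holding
# the number of already-catalogued items for each accession.
acc <- data.frame(
  irn = c(101, 102, 103),
  AccTotalItems = c(500, 40, 12),
  CataloguedItems = c(120, 40, 0)
)

# Backlog = accessioned (or inventoried) items minus catalogued items.
acc$Backlog <- acc$AccTotalItems - acc$CataloguedItems
```

Here the three accessions have backlogs of 380, 0, and 12 items respectively.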

TaxIDRank

The taxonomic level to which a specimen has been identified.

HasMM

A binary value where "1" = has Multimedia attached, and "0" = no Multimedia attached.

DarInstitutionCode, DarCollectionCode

The name of the institution and collection to which a record belongs. NOTE: The donut chart references the /supplementary/CollectionDomain2.csv lookup table to group collections into domains. Institutions would be invited to specify their own collection-to-domain mappings. (Standardized vocabulary needed from the community.)

2) Extra fields prepped in dash023When.R:

Department

The name of the department to which a record belongs. NOTE: dash023When.R references the /supplementary/Departments.csv lookup table to standardize department names while calculating specimen ages. Institutions would be invited to specify their own collection-to-department mappings. (This vocabulary should be standardized and/or consolidated with CollectionDomain2.csv)

URL

Collections listed in summary stats will link to these URLs. NOTE: dash023When.R references the /supplementary/CollDashEd.csv lookup table to link URLs to collections. Institutions would be invited to specify their own collections and corresponding URLs.

WhenAgeFrom/To/Mid & DarYearCollected

Numeric values for the age of geology specimens & anthropology artifacts, or for the collection year of botany & zoology specimens. Anthropological and geological terms are mapped to numeric dates in /supplementary/WhenAttPerLUT.csv & /supplementary/WhenChronoLUTemu.csv

WhenOrder

Ordinal values between 1 and 53 that group numeric ages into time-groups; necessary for the chart to function.

WhenTimeLabel

Labels corresponding to the 53 "WhenOrder" groups, ranging from 4.6 billion years ago to 2020. Loosely, ranges are grouped by geologic periods/epochs/eras prior to ~18th century dates, and grouped by decade after 18th century dates. Chart labels and corresponding date ranges are listed in /supplementary/WhenYearRanges2.csv. Range divisions were chosen in an attempt to fit data to the current chart layout, but please tell us if you know of more valid/sensible alternatives.
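One way to assign numeric ages to ordered time-groups is `findInterval()`. The sketch below uses only four made-up groups rather than the 53 in WhenYearRanges2.csv; the column names, break points, labels, and the convention that negative numbers mean years before 1 CE are all assumptions for illustration.

```r
# Hypothetical stand-in for /supplementary/WhenYearRanges2.csv with only
# four groups; column names and break points are illustrative.
year_ranges <- data.frame(
  WhenOrder = 1:4,
  FromYear = c(-4.6e9, 1700, 1900, 1910),
  WhenTimeLabel = c("Before 1700", "1700-1899", "1900-1909", "1910-1919"),
  stringsAsFactors = FALSE
)

# Mid-point ages/years for some toy records (negative = years before 1 CE).
when_mid <- c(-6.6e7, 1850, 1905, 1919)

# findInterval() returns the index of the last FromYear <= each value.
# FromYear must be sorted in increasing order.
idx <- findInterval(when_mid, year_ranges$FromYear)
when_order <- year_ranges$WhenOrder[idx]
when_label <- year_ranges$WhenTimeLabel[idx]
```

Values below the first break point would get index 0 and drop out of the result, so the first range should start at or before the oldest possible age.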

3) Extra fields prepped in dash028Ecoregions.R:

Bioregion

dash028Ecoregions.R references the /supplementary/EcoRegionCountires.csv lookup table to map specimens to one of the WWF-defined ecoregions based on their country or ocean data. Currently, in cases where countries or oceans are in multiple ecoregions, specimens are likewise associated with multiple ecoregions.
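The one-to-many behaviour described above can be seen in a small base R sketch. The lookup columns and the region names ("Region A" and so on) are placeholders, not actual WWF assignments or the real headers of the lookup file.

```r
# Toy specimen records.
specimens <- data.frame(
  irn = c(1, 2),
  DarCountry = c("Chile", "Iceland"),
  stringsAsFactors = FALSE
)

# Stand-in for the country-to-ecoregion lookup table.
# Column names and region assignments are placeholders.
eco_lut <- data.frame(
  DarCountry = c("Chile", "Chile", "Iceland"),
  Bioregion = c("Region A", "Region B", "Region C"),
  stringsAsFactors = FALSE
)

# A country listed under two regions yields two rows for each of its specimens.
bio <- merge(specimens, eco_lut, by = "DarCountry", all.x = TRUE)
```

Because record 1 now appears twice, any item counts summed after a join like this would count it once per region.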

Notes about extra output datasets

LUTs (WhatLUTB.csv, WhenAgeLUT.csv, WhereLUT.csv, WhoLUT.csv)

Lookup tables are exported for What, When, Where, and Who fields. These are used by the dashboard search fields.

WhoExperience.csv

A count of individuals in each type of staff role (Collections, Research, Volunteer, Other) in each collection. This is produced by the dash025Experience.R script, using the /data01raw/emuPartiesExp/ dataset (sample data provided), which includes NamDepartment, NamBranch, EMu Group, and NamRoles_tab fields from eparties records for emu-users.

(In EMu/eparties, retrieve EMu user records, and report the above fields)
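A count like this can be produced with `table()`. The sketch below is not the code in dash025Experience.R: it assumes NamDepartment identifies the collection, and `Role` is a hypothetical column already reduced to one of the four staff-role types.

```r
# Toy stand-in for the eparties export; "Role" is a hypothetical column
# already reduced to one of the staff-role types named above.
staff <- data.frame(
  NamDepartment = c("Botany", "Botany", "Botany", "Zoology"),
  Role = c("Collections", "Collections", "Volunteer", "Research"),
  stringsAsFactors = FALSE
)

# Cross-tabulate department by role, then reshape to one row per pair.
who_exp <- as.data.frame(
  table(NamDepartment = staff$NamDepartment, Role = staff$Role),
  responseName = "Count",
  stringsAsFactors = FALSE
)

# Drop department/role combinations with no staff.
who_exp <- who_exp[who_exp$Count > 0, ]
```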

LoanSumCount.csv

A count of total items loaned and total loans per year per collection. This is produced by the dash026LoansPrep.R script, using the /data01raw/emuLoans/ dataset (sample data provided), which includes the following fields from efmnhtransactions records for loans:

  • item counts (InvCount_tab, ObcTotalItems, ObcTotalObjects, ObuTotalItems, ObuTotalObjects, TraTotalInvoiceItems, TraTotalItemsLoaned, TraTotalItemsOutstanding)
  • loan types (InvTransactionType_tab, LoaLoanType, LoaStatus, TraTransactionType)
  • loan dates (TraDateAuthorized, TraDateProcessed)
  • department (AccCatalogue, SumDepartment)
  • description (InvDescription_tab, InvGeography_tab)

(In EMu/efmnhtransactions, retrieve loan records, and run the "DashboardTrans - Copy" report)
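The per-year, per-collection summary can be sketched with `aggregate()`. This is an illustration on toy data, not the code in dash026LoansPrep.R: it assumes SumDepartment identifies the collection, that TraDateProcessed supplies the year, and that dates are ISO-formatted strings; `LoanCount` and `Year` are hypothetical helper columns.

```r
# Toy loan records using three of the fields listed above.
loans <- data.frame(
  SumDepartment = c("Botany", "Botany", "Zoology", "Botany"),
  TraDateProcessed = c("2015-03-01", "2015-11-20", "2015-06-15", "2016-01-10"),
  TraTotalItemsLoaned = c(10, 5, 3, 7),
  stringsAsFactors = FALSE
)

# Assumption: dates are "YYYY-MM-DD" strings; real EMu exports may differ.
loans$Year <- as.integer(format(as.Date(loans$TraDateProcessed), "%Y"))
loans$LoanCount <- 1  # each row is one loan

# Sum items and loans for every department/year pair.
loan_sum <- aggregate(
  loans[, c("TraTotalItemsLoaned", "LoanCount")],
  by = list(SumDepartment = loans$SumDepartment, Year = loans$Year),
  FUN = sum
)
```

For the toy data, Botany in 2015 ends up with 2 loans totalling 15 items.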

VisitSumCount.csv

A count of total visitors and total visits per year per collection. This is produced by the dash027VisitPrep.R script, using the /data01raw/emuConsult/ dataset (sample data provided), which includes the following fields from efmnhrepatriation records for collection visits:

  • department (SecDepartment_tab)
  • total visitors (ResNoOfVisitors, ResResearchersRef_tab[eparties].NamBriefName)
  • dates (ResCommencementDate, ResCompletionDate)
  • record type (InfRecordType)

(In EMu/efmnhrepatriation, retrieve research visit records, and run the "VisitorDays - Copy" report)

Data & Development Acknowledgements