Convert IOPAN protist data from "total database" 2009-2013 into Darwin Core #44

Closed
cnrdh opened this issue Sep 20, 2021 · 11 comments

cnrdh commented Sep 20, 2021

The "total database" contains:

~/npolar/marine-db$ cat data/deposit/iopan/protist-biodiversity/total_database_npi2009-2013.tsv | ./bin/csv-transform --ndjson | ndjson-map d.name| sort | uniq -c
   1468 "ALK09"
    961 "ALK10"
   3424 "ICE10"
   2579 "ICE12"
   1109 "MER09.01"
   1507 "MOSJ11"
   1638 "MOSJ12"
   2268 "MOSJ13"

ICE10, ICE12 and MOSJ11 are excluded, since there are alternate sources with more data.

After removal, we are left with:

~/npolar/marine-db$ cat data/input/iopan/2009-2012-2013-protist-biodiversity-iopan.ndjson| ndjson-map [d.year,d.expedition] | sort | uniq -c
   1468 [2009,"Alkekonge-2009"]
   1109 [2009,"MERCLIM-2009"]
    961 [2010,"Alkekonge-2010"]
   1638 [2012,"MOSJ2012"]
   2268 [2013,"MOSJ2013"]

cnrdh commented Sep 21, 2021

Documentation of transform, with input/output example.

Input

~/npolar/marine-db$ cat data/deposit/iopan/protist-biodiversity/total_database_npi2009-2013.tsv| ./bin/csv-transform --ndjson | ndjson-filter 'd.name === "MOSJ13" && d.no==="1670" && d.data==="2013-07-29" && d.takson==="Thalassiosira pacifica"'
{
	"name": "MOSJ13",
	"no": "1670",
	"station ": "R10",
	"depth [m]": "0",
	"data": "2013-07-29",
	"V-taken [ml]": "10",
	"Vth filtered [L]": "32",
	"V bottle [ml]": "100",
	"Class/Phylum": "Bacillariophyceae ",
	"takson": "Thalassiosira pacifica",
	"takson_add": "",
	"Taxon_full": "Thalassiosira pacifica ",
	"AphiaID": "",
	"K": "450.02",
	"N": "6",
	"fields": "60",
	"magn": "10",
	"cells in chamb": "45.002",
	"cells in V bottle [ml]": "450.02",
	"cells in 1000 ml": "14.063125",
	"Gear": "Micro"
}

Output

~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-filter 'd.fieldNumber==="MOSJ13-1670" && d.scientificName==="Thalassiosira pacifica"'
{
	"maximumDepthInMeters": 0,
	"magnification": 10,
	"identifiedBy": "iopan.pl",
	"organismQuantityType": "cells/l",
	"scientificName": "Thalassiosira pacifica",
	"materialSampleID": "MOSJ13-1670@MOSJ2013",
	"year": 2013,
	"expedition": "MOSJ2013",
	"locationID": "R10",
	"fieldNumber": "MOSJ13-1670",
	"basisOfRecord": "Occurrence",
	"organismQuantity": 14.063125,
	"individualCount": 6,
	"sampleSizeValue": 0.42664770454646467,
	"sampleSizeUnit": "l",
	"occurrenceStatus": "present",
	"quantificationStatus": "verified",
	"fieldsInCount": 60,
	"maxFields": 450.02,
	"takenVolume": 10,
	"bottleVolume": 100,
	"initialVolume": 32,
	"cellsInChamber": 45.002,
	"gear": "Niskin bottle"
}
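
For readers who want the mapping spelled out, here is a minimal sketch of the record-level transform, inferred only from the input/output pair above. It is not the actual `dwc-occurrence-csv-transform` code. The function name `toOccurrence`, the helper `num` and the `name-no` rule for `fieldNumber` are assumptions; the expedition names are the ones that appear in the listings in this issue.

```javascript
// Hypothetical sketch, not the real transform.
// Expedition names as they appear in the listings in this issue.
const EXPEDITIONS = {
  ALK09: "Alkekonge-2009",
  ALK10: "Alkekonge-2010",
  ICE10: "ICE2010",
  ICE12: "ICE2012",
  "MER09.01": "MERCLIM-2009",
  MOSJ11: "MOSJ2011",
  MOSJ12: "MOSJ2012",
  MOSJ13: "MOSJ2013",
};

// Empty and non-numeric strings become null instead of NaN.
const num = (v) => {
  if (v == null || String(v).trim() === "") return null;
  const n = Number(v);
  return Number.isNaN(n) ? null : n;
};

function toOccurrence(row) {
  // The TSV has headers like "station " and values like "Bacillariophyceae ", so trim both.
  const d = Object.fromEntries(
    Object.entries(row).map(([k, v]) => [k.trim(), typeof v === "string" ? v.trim() : v])
  );
  // Assumed rule: fieldNumber is "<name>-<no>". Some ice stations do not follow it.
  const fieldNumber = `${d.name}-${d.no}`;
  const expedition = EXPEDITIONS[d.name] ?? null;
  return {
    fieldNumber,
    expedition,
    materialSampleID: `${fieldNumber}@${expedition}`,
    year: num(String(d.data ?? "").slice(0, 4)), // year comes from the date, not the name
    locationID: d.station,
    maximumDepthInMeters: num(d["depth [m]"]),
    scientificName: d.takson,
    individualCount: num(d.N),
    organismQuantity: num(d["cells in 1000 ml"]),
    organismQuantityType: "cells/l",
    fieldsInCount: num(d.fields),
    maxFields: num(d.K),
    takenVolume: num(d["V-taken [ml]"]),
    bottleVolume: num(d["V bottle [ml]"]),
    initialVolume: num(d["Vth filtered [L]"]),
    cellsInChamber: num(d["cells in chamb"]),
    magnification: num(d.magn),
  };
}
```

Gear normalisation, `sampleSizeValue` and `quantificationStatus` are left out of this sketch on purpose.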

cnrdh commented Sep 21, 2021

The input Gear is a mixed bag:

     95 "h. net"
    796 "h.net"
    818 "Micro"
   1373 "micro"
    548 "Micropl"
   1695 "Niskin"
   9627 "niskin"
      2 null

After:

     891 "Handnet"
  14063 "Niskin bottle"
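
A minimal sketch of how the gear spellings above could be collapsed into the two output values. The grouping is inferred from the counts only (95 + 796 = 891 for "Handnet", everything else including the 2 nulls sums to 14063), so mapping "Micro", "Micropl" and null to "Niskin bottle" is an assumption, and `normalizeGear` is a hypothetical name.

```javascript
// Hypothetical sketch; the groupings are inferred from the before/after counts.
const HANDNET = new Set(["hnet"]);
const NISKIN = new Set(["micro", "micropl", "niskin", ""]); // "" covers null/missing gear

function normalizeGear(gear) {
  // Lower-case and drop spaces and dots so "h. net" and "h.net" share one key.
  const key = String(gear ?? "").toLowerCase().replace(/[\s.]/g, "");
  if (HANDNET.has(key)) return "Handnet";
  if (NISKIN.has(key)) return "Niskin bottle";
  // Fail loudly on spellings not seen in the listing above.
  throw new Error(`Unknown gear: ${JSON.stringify(gear)}`);
}

// Re-tally the input counts from the listing above.
const inputCounts = [
  ["h. net", 95], ["h.net", 796], ["Micro", 818], ["micro", 1373],
  ["Micropl", 548], ["Niskin", 1695], ["niskin", 9627], [null, 2],
];
const tally = {};
for (const [gear, n] of inputCounts) {
  const g = normalizeGear(gear);
  tally[g] = (tally[g] ?? 0) + n;
}
console.log(tally); // { Handnet: 891, 'Niskin bottle': 14063 }
```

Throwing on an unknown spelling, rather than defaulting silently, makes new variants visible the next time a source file is swapped in.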

cnrdh commented Sep 21, 2021

Passes GBIF validation, but there are quite a few taxon issues:

Taxon match higherrank: 4617
Taxon match none: 1203
Taxon match fuzzy: 35

cnrdh commented Sep 21, 2021

No errors in quantification

~/npolar/marine-db$ cat data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson | ndjson-map [d.gear,d.quantificationStatus] | sort | uniq -c
    715 ["Handnet","incalculable"]
     20 ["Handnet","verified"]
     22 ["Niskin bottle","calculated"]
     13 ["Niskin bottle","incalculable"]
  12677 ["Niskin bottle","verified"]
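
A sketch of what the quantification check could look like, reconstructed from the MOSJ13-1670 example earlier in this issue (450.02 × 6 / 60 = 45.002 cells in chamber; × 100 / 10 = 450.02 cells in bottle; / 32 L = 14.063125 cells/l). The meaning given to each status here is an assumption: "verified" when the recomputed value matches the supplied one, "calculated" when only the inputs exist, "incalculable" when inputs are missing. The "error" status, the function names and the tolerance are hypothetical.

```javascript
// Hypothetical sketch; the formula is inferred from one worked example in this issue.
function cellsPerLitre(o) {
  const { maxFields, individualCount, fieldsInCount, takenVolume, bottleVolume, initialVolume } = o;
  const inputs = [maxFields, individualCount, fieldsInCount, takenVolume, bottleVolume, initialVolume];
  const usable = inputs.every((v) => typeof v === "number" && Number.isFinite(v));
  if (!usable || fieldsInCount === 0 || takenVolume === 0 || initialVolume === 0) return null;
  const cellsInChamber = (maxFields * individualCount) / fieldsInCount;
  const cellsInBottle = (cellsInChamber * bottleVolume) / takenVolume;
  return cellsInBottle / initialVolume; // initialVolume is in litres
}

function quantificationStatus(o, tolerance = 1e-6) {
  const expected = cellsPerLitre(o);
  if (expected === null) return "incalculable";
  if (o.organismQuantity == null) return "calculated";
  // Relative comparison, since the inputs are decimal strings parsed to floats.
  const relErr = Math.abs(expected - o.organismQuantity) / Math.max(Math.abs(expected), Number.EPSILON);
  return relErr <= tolerance ? "verified" : "error";
}

const example = {
  maxFields: 450.02, individualCount: 6, fieldsInCount: 60,
  takenVolume: 10, bottleVolume: 100, initialVolume: 32,
  organismQuantity: 14.063125,
};
console.log(quantificationStatus(example)); // "verified"
```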

cnrdh changed the title from 'IOPAN protist data from "total database" 2019-2013' to 'Convert IOPAN protist data from "total database" 2019-2013 into Darwin Core' on Sep 21, 2021

cnrdh commented Sep 21, 2021

These occurrences must be merged with sampling event metadata.
As it stands now, a large number of occurrences from 2010 and 2012 have no matching event.
Needs further work to clarify.

$ ndjson-join --left d.fieldNumber <(cat $total_database_npi | ./bin/dwc-occurrence-csv-transform  | ndjson-filter 'd.expedition !== "MOSJ2011"') $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map '[d.expedition]' | sort | uniq -c
    309 ["ICE2010"]
    180 ["ICE2012"]

cnrdh pinned this issue on Sep 22, 2021

cnrdh commented Sep 23, 2021

Phew, not as bad: the 489 unmatched lines above come from just 20 missing samples:

~/npolar/marine-db$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map '[d.expedition,d.fieldNumber]' | sort | uniq -c
     34 ["ICE2010","ICE10-152"]
     32 ["ICE2010","ICE10-155"]
     16 ["ICE2010","ICE10-156"]
     13 ["ICE2010","ICE10-157"]
     13 ["ICE2010","ICE10-158"]
     18 ["ICE2010","ICE10-253"]
     27 ["ICE2010","ICE10-379"]
     36 ["ICE2010","ICE10-380"]
     38 ["ICE2010","ICE10-381"]
     46 ["ICE2010","ICE10-382"]
     21 ["ICE2010","ICE10-383"]
     15 ["ICE2010","ICE10-384"]
     30 ["ICE2012","Agneta"]
     30 ["ICE2012","Divehole"]
     14 ["ICE2012","ICE12-760"]
     15 ["ICE2012","ICE12-822"]
     25 ["ICE2012","ICE12-Core2.1.1"]
     27 ["ICE2012","ICE12-Core2.1.2"]
     20 ["ICE2012","Pond"]
     19 ["ICE2012","Ridge"]
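
The same check as the `ndjson-join --left` pipeline above, sketched in plain JavaScript: keep occurrences whose `fieldNumber` has no matching event, then count lines per missing sample. It assumes both files carry `fieldNumber`, as the join expression implies. `unmatchedSamples` and the tiny arrays are made up for illustration.

```javascript
// Hypothetical sketch of the left join + null filter + count.
function unmatchedSamples(occurrences, events) {
  const known = new Set(events.map((e) => e.fieldNumber));
  const counts = new Map(); // fieldNumber -> number of occurrence lines
  for (const o of occurrences) {
    if (known.has(o.fieldNumber)) continue; // has an event, so not missing
    counts.set(o.fieldNumber, (counts.get(o.fieldNumber) ?? 0) + 1);
  }
  return counts;
}

// Illustrative data only, not taken from the real files.
const events = [{ fieldNumber: "ICE10-100" }];
const occurrences = [
  { fieldNumber: "ICE10-100", scientificName: "A" },
  { fieldNumber: "ICE10-152", scientificName: "A" },
  { fieldNumber: "ICE10-152", scientificName: "B" },
  { fieldNumber: "Pond", scientificName: "A" },
];
const missing = unmatchedSamples(occurrences, events);
console.log(missing.size, [...missing.values()].reduce((a, b) => a + b, 0)); // 2 3
```

`missing.size` is the number of distinct missing samples, and the sum of the values is the number of affected occurrence lines.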

cnrdh commented Sep 23, 2021

The coordinates (XY) could be recovered by matching on locationID, e.g. all ICE10-15x samples are from "R4", and all other R4 events from ICE10 are at [22.1166, 80.605] (longitude, latitude).

cat $events | grep ICE10- | grep R4 | ndjson-reduce 'p.x = [...new Set([...p.x, d.decimalLongitude])], p.y = [...new Set([...p.y, d.decimalLatitude])], p' '{x:[],y:[]}'
{"x":[22.1166],"y":[80.605]}

$ ndjson-join --left d.fieldNumber data/input/iopan/2009-2010-2012-2013-protist-biodiversity-iopan.ndjson $events | ndjson-filter 'd[1]===null' | ndjson-map d[0] | ndjson-map [d.fieldNumber,d.locationID] | sort  | uniq -c
     30 ["Agneta","Underice"]
     30 ["Divehole","5mdepth"]
     34 ["ICE10-152","R4"]
     32 ["ICE10-155","R4"]
     16 ["ICE10-156","R4"]
     13 ["ICE10-157","R4"]
     13 ["ICE10-158","R4"]
     18 ["ICE10-253","R6b"]
     27 ["ICE10-379","ICE10-16"]
     36 ["ICE10-380","ICE10-16"]
     38 ["ICE10-381","ICE10-16"]
     46 ["ICE10-382","ICE10-16"]
     21 ["ICE10-383","ICE10-16"]
     15 ["ICE10-384","ICE10-16"]
     14 ["ICE12-760","Floe1"]
     15 ["ICE12-822","Floe1"]
     25 ["ICE12-Core2.1.1","1mbelowice"]
     27 ["ICE12-Core2.1.2","undericehole"]
     20 ["Pond","1mbelowice"]
     19 ["Ridge","Floe1"]
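
A sketch of the recovery idea: build a lookup from the events and reuse a position only when a locationID maps to exactly one coordinate pair within an expedition. It assumes events carry `expedition` and `locationID` next to `decimalLongitude` and `decimalLatitude`; only the last two are confirmed by the `ndjson-reduce` call above. Function names are hypothetical.

```javascript
// Hypothetical sketch; assumes events have expedition and locationID fields.
function coordinatesByLocation(events) {
  const seen = new Map(); // "expedition|locationID" -> Map("lon,lat" -> [lon, lat])
  for (const e of events) {
    if (typeof e.decimalLongitude !== "number" || typeof e.decimalLatitude !== "number") continue;
    const key = `${e.expedition}|${e.locationID}`;
    if (!seen.has(key)) seen.set(key, new Map());
    seen.get(key).set(`${e.decimalLongitude},${e.decimalLatitude}`, [e.decimalLongitude, e.decimalLatitude]);
  }
  const unique = new Map();
  for (const [key, coords] of seen) {
    // Skip stations that were sampled at more than one position.
    if (coords.size === 1) unique.set(key, [...coords.values()][0]);
  }
  return unique;
}

function fillCoordinates(occurrence, lookup) {
  const xy = lookup.get(`${occurrence.expedition}|${occurrence.locationID}`);
  return xy ? { ...occurrence, decimalLongitude: xy[0], decimalLatitude: xy[1] } : occurrence;
}

// Illustrative events; only the R4 position is taken from this issue.
const lookup = coordinatesByLocation([
  { expedition: "ICE2010", locationID: "R4", decimalLongitude: 22.1166, decimalLatitude: 80.605 },
  { expedition: "ICE2010", locationID: "R4", decimalLongitude: 22.1166, decimalLatitude: 80.605 },
  { expedition: "ICE2010", locationID: "X1", decimalLongitude: 10, decimalLatitude: 80 },
  { expedition: "ICE2010", locationID: "X1", decimalLongitude: 11, decimalLatitude: 80 },
]);
console.log(fillCoordinates({ fieldNumber: "ICE10-152", expedition: "ICE2010", locationID: "R4" }, lookup));
```

Location IDs with no counterpart in the events, such as "Floe1" or "ICE10-16", are returned unchanged and still need another source.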

cnrdh commented Sep 23, 2021

Found an alternative source of ICE10 data.
Here all station R4 / bottle 15x samples appear to be dated "2010-08-19" (some Excel date mess).

Source:
Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2010/Original IOPAS/konghau_database_completeICE10.xls

cnrdh commented Sep 23, 2021

Alternate source of ICE2012 (2670 lines of data vs 2579 in "total database" :/)

Anette Wold/Marinbiology Database/Phytoplankton taxonomy/2012/Original/ice2012wcolumndatabase.xlsx

cnrdh commented Sep 23, 2021

Consider swapping in ICE2012 from ice2012wcolumndatabase, but it needs cleaning first.
Errors in JSON schema validation:

      4 [{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""}]
     43 [{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"}]
      1 [{"keyword":"type","dataPath":".organismQuantity","schemaPath":"#/properties/organismQuantity/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".individualCount","schemaPath":"#/properties/individualCount/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"type","dataPath":".maximumDepthInMeters","schemaPath":"#/properties/maximumDepthInMeters/type","params":{"type":"number,null"},"message":"should be number,null"},{"keyword":"pattern","dataPath":".scientificName","schemaPath":"#/properties/scientificName/pattern","params":{"pattern":"^[A-Z][a-z\\s-]"},"message":"should match pattern \"^[A-Z][a-z\\s-]\""},{"keyword":"type","dataPath":".fieldsInCount","schemaPath":"#/properties/fieldsInCount/type","params":{"type":"integer"},"message":"should be integer"}]
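
To find the offending rows before cleaning, the rules named in the errors above can be mirrored in a few lines without a validator library. This is only a sketch of those five rules, not the real schema. As in JSON Schema, a property that is absent is not checked and `pattern` only applies to strings. `schemaFailures` is a hypothetical name.

```javascript
// Hypothetical sketch mirroring only the rules visible in the error output above.
const numberOrNull = (v) => v === null || (typeof v === "number" && Number.isFinite(v));

const RULES = {
  // pattern "^[A-Z][a-z\\s-]": capital letter, then a lower-case letter, whitespace or hyphen
  scientificName: (v) => typeof v !== "string" || /^[A-Z][a-z\s-]/.test(v),
  maximumDepthInMeters: numberOrNull,
  organismQuantity: numberOrNull,
  individualCount: numberOrNull,
  fieldsInCount: (v) => Number.isInteger(v),
};

// Returns the failing properties in the same ".property" form as dataPath above.
function schemaFailures(record) {
  return Object.entries(RULES)
    .filter(([key, ok]) => record[key] !== undefined && !ok(record[key]))
    .map(([key]) => `.${key}`);
}

// Illustrative record: depth left as a string range, name starting in lower case.
console.log(schemaFailures({
  scientificName: "cf. Thalassiosira",
  maximumDepthInMeters: "0-5",
  organismQuantity: 12.5,
  individualCount: 3,
  fieldsInCount: 60,
})); // [ '.scientificName', '.maximumDepthInMeters' ]
```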

cnrdh commented Oct 15, 2021

Checking 2009 against the alternate source ALK2009_Mercl2009.xls, which contains 1920 Niskin + 662 Micro = 2582.
The total database has 1944 + 633 = 2577 when selecting 2009 by expedition name (and 2617 when selecting by year equal to 2009; oh my, but the extra 40 are "ICE10-116" occurrences and thus excluded already).

$ cat $total_database_npi | ./bin/csv-transform --ndjson | ndjson-filter '/09/.test(d.name)' | ndjson-map d.Gear | sort | uniq -c
    633 "micro"
   1944 "niskin"

cnrdh changed the title from 'Convert IOPAN protist data from "total database" 2019-2013 into Darwin Core' to 'Convert IOPAN protist data from "total database" 2009-2013 into Darwin Core' on Nov 9, 2021
cnrdh closed this as completed on Nov 10, 2021
cnrdh unpinned this issue on Nov 10, 2021