# Flow cytometry

Let's first load the libraries that we are going to use today. If you are not sure whether a library is installed, you can wrap *install.packages* in an *if* statement, like this:

In [1]:
if (!requireNamespace("RNetCDF", quietly = TRUE)) {
  install.packages("RNetCDF")
}
library(RNetCDF)
if (!requireNamespace("writexl", quietly = TRUE)) {
  install.packages("writexl")
}
library(writexl)
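
If you need several packages, repeating this block gets tedious. Here is a minimal sketch of a helper that performs the same check for a vector of package names. The function name *load_packages* is made up for this example, and the demonstration call uses packages that ship with R so that nothing gets installed.

```r
# Hypothetical helper: install each package only if it is missing, then attach it
load_packages <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # character.only = TRUE is needed because pkg is a string, not a bare name
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstration with packages that ship with R
loaded <- load_packages(c("stats", "utils"))
```

For this tutorial you could call it as `load_packages(c("RNetCDF", "writexl"))`.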

## Accessing the data

The data are available via a THREDDS server that you can find here:
https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/

If you open the link above, you can navigate to different NetCDF files, and you will be linked to OPeNDAP data access forms like this one:
https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2021704_station_P6_in_the_Northern_Barents_Sea.nc.html

OPeNDAP makes it possible to access data over the internet without having to download the data to your computer first. You can just remove the *.html* suffix from the URL above and use the result in your R script in the same way that you might use an absolute filepath to a file on your computer.
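
If you have copied the address of a data access form, you can also strip the suffix in R rather than by hand. This is just a sketch using base R; the variable names *form_url* and *data_url* are made up for this example.

```r
# Address of the OPeNDAP data access form (ends in .html)
form_url <- 'https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2021704_station_P6_in_the_Northern_Barents_Sea.nc.html'

# Remove the trailing ".html" to get the URL that can be opened as a NetCDF file
data_url <- sub("\\.html$", "", form_url)
data_url
```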

So now let's load the data into R for this cast.

In [2]:
url <- 'https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2021704_station_P6_in_the_Northern_Barents_Sea.nc'
data <- open.nc(url)
print.nc(data)

netcdf classic {
dimensions:
	depth = 14 ;
	maxStrlen64 = 64 ;
	taxon = 18 ;
variables:
	NC_FLOAT depth(depth) ;
		NC_CHAR depth:standard_name = "depth" ;
		NC_CHAR depth:long_name = "Depth below surface in sea water" ;
		NC_CHAR depth:coverage_content_type = "coordinate" ;
		NC_CHAR depth:units = "m" ;
		NC_CHAR depth:positive = "down" ;
	NC_CHAR taxon_name(maxStrlen64, taxon) ;
		NC_CHAR taxon_name:standard_name = "biological_taxon_name" ;
		NC_CHAR taxon_name:long_name = "FCM_group" ;
		NC_INT taxon_name:string_length = 80 ;
	NC_DOUBLE abundance(depth, taxon) ;
		NC_CHAR abundance:standard_name = "number_concentration_of_biological_taxon_in_sea_water" ;
		NC_CHAR abundance:long_name = "concentration_of_organisms_of_FCM_group_per_mL_sample" ;
		NC_CHAR abundance:coverage_content_type = "physicalMeasurement" ;
		NC_CHAR abundance:units = "ml-1" ;
		NC_CHAR abundance:coordinates = "taxon_name" ;

// global attributes:
		NC_CHAR :title = "Flow cytometry measurements (abundance of virus,

The data have 3 dimensions: *depth*, *taxon* and *maxStrlen64*.

Let's look at each variable in turn, as it can take some time to understand why the data are structured like this. The structure follows the recommendations of the CF conventions, and you can read the relevant section here:
https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#taxon-names-and-identifiers

Note that our file is a bit different because we have a depth profile instead of a time series.

### Depth

This is an easy one. *depth* is just a 1D variable.

In [3]:
depth <- var.get.nc(data, "depth")
depth

### Taxon names

Here we have 2 dimensions, *maxStrlen64* and *taxon*.

One question might immediately spring to mind. Why do we need 2 dimensions for the taxon names? Well, the *taxon* dimension defines the number of taxa listed in the file (18 in this case). *maxStrlen64* is the maximum number of characters in the name of each taxon.

Text is often stored like this in CF-NetCDF because the classic NetCDF format has no string data type, and some programmes don't allow you to write multiple characters into a single data value. So let's extract the taxon names into something we can use.

In [4]:
taxon_names <- var.get.nc(data, "taxon_name")
taxon_names

What has happened here?! The data seem to have been extracted as a 1-dimensional array, one value (word) per element. See, the dimension of our *taxon_names* variable is just *18*, not *18 x 64*. RNetCDF has automatically extracted the data into a useful format for us.

In [5]:
dim(taxon_names)
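
To get a feel for what a 2D character variable looks like before RNetCDF joins it up, here is a small sketch in base R. It is only an illustration of the idea, not how RNetCDF works internally. The three names, the width of 8 and the use of spaces as padding are assumptions made for this example.

```r
# Hypothetical inputs: three taxon names stored in a fixed width of 8 characters
names_in <- c("RedPico", "Virus", "HetNano")
max_len <- 8

# Pad each name to max_len characters (spaces are used here purely for illustration)
padded <- sprintf(paste0("%-", max_len, "s"), names_in)

# Split into single characters: one column per taxon, one row per character position
char_matrix <- sapply(padded, function(x) strsplit(x, "")[[1]], USE.NAMES = FALSE)
dim(char_matrix)

# Rebuild the names: paste each column together and drop the trailing padding
names_out <- trimws(apply(char_matrix, 2, paste, collapse = ""), which = "right")
names_out
```

The character matrix has one dimension for the position within the name and one for the taxon, which mirrors the *maxStrlen64* and *taxon* dimensions in the file.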

### Abundance

The *abundance* data have 2 dimensions, *depth* and *taxon*.

In [6]:
abundance <- var.get.nc(data, 'abundance')
abundance

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
974,1313,766,351,196,25,10,3,38,185009.14,45521.02,135831.81,71846.44,63985.37,1917733.1,1438756.9,396709.3,82266.91
2011,950,818,115,17,6,4,0,63,185009.14,45521.02,135831.81,71846.44,63985.37,2173491.8,1669287.0,414076.8,90127.97
2397,1019,897,112,10,5,5,0,75,144241.32,38208.41,102742.23,69469.84,33272.39,2240403.0,1748995.0,402925.0,88483.0
1419,1106,929,154,23,11,6,1,79,189396.71,38574.04,144241.32,88482.63,55758.68,4696526.5,3987202.9,588665.4,120658.14
1133,978,821,118,39,9,5,0,78,205850.09,39670.93,161060.33,103290.68,57769.65,2303473.5,1762340.0,457038.4,84095.06
712,696,544,137,15,13,2,0,99,162522.85,38025.59,120109.69,82998.17,37111.52,2106032.9,1557586.8,477148.1,71297.99
425,325,254,67,4,6,2,1,35,168555.76,41499.09,122669.1,83912.25,38756.86,1804387.6,1257769.7,460694.7,85923.22
193,122,106,14,2,0,1,2,46,117733.09,36197.44,79707.5,50091.41,29616.09,1645338.2,1124314.4,424131.6,96892.14
100,75,58,14,3,1,0,2,41,112979.89,36745.89,75502.74,47531.99,27970.75,1687385.7,1206581.4,409506.4,71297.99
105,54,42,12,0,1,0,0,55,99451.55,32906.76,66179.16,42595.98,23583.18,1696526.5,1248628.9,365630.7,82266.91


So the data are easily extracted as a 2D matrix where each row represents a different depth and each column represents a different taxon.
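
Because rows are depths and columns are taxa, you can pull out a depth profile for one taxon, or all taxa at one depth, with ordinary matrix indexing. The sketch below rebuilds a small part of the matrix by hand (the first three depths and three of the taxa, with values copied from this cast) so that it runs without a connection to the server. The variable names *virus_profile* and *shallowest* are made up for this example.

```r
# A small subset of the cast: 3 depths (rows) x 3 taxa (columns)
depth <- c(0.5, 9, 19)
taxon_names <- c("RedPico", "RedNano", "Virus")
abundance <- matrix(
  c(974, 2011, 2397,                  # RedPico at each depth
    1313, 950, 1019,                  # RedNano at each depth
    1917733.1, 2173491.8, 2240403.0), # Virus at each depth
  nrow = length(depth)                # matrix() fills column by column
)

# Depth profile for one taxon: all rows, one column
virus_profile <- abundance[, taxon_names == "Virus"]

# All taxa at the shallowest depth: one row, all columns
shallowest <- abundance[which.min(depth), ]
```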

## Creating a dataframe including all of this information together

So let's now combine all these variables into a dataframe that we can export to a CSV or XLSX file.

We want the taxon names as the column headers and the depth to be written for each row.

In [7]:
# Convert the matrix to a dataframe
abundance_df <- as.data.frame(abundance)

# Assign the taxon names as the column headers
colnames(abundance_df) <- taxon_names

# Adding a depth column
abundance_df$depth <- depth

# Move the depth column to the front
abundance_df <- abundance_df[, c(ncol(abundance_df), 1:(ncol(abundance_df)-1))]

abundance_df

depth,RedPico,RedNano,RedNanoSmall,RedNanoLarge,OraPico,RedNanoVeryLarge,OraNano,OraPicoProk,HetNano,HetProk,HetLNA,HetHNA,HetProkMedium,HetProkLarge,Virus,VirusSmall,VirusMedium,VirusLarge
<dbl[1d]>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.5,974,1313,766,351,196,25,10,3,38,185009.14,45521.02,135831.81,71846.44,63985.37,1917733.1,1438756.9,396709.3,82266.91
9.0,2011,950,818,115,17,6,4,0,63,185009.14,45521.02,135831.81,71846.44,63985.37,2173491.8,1669287.0,414076.8,90127.97
19.0,2397,1019,897,112,10,5,5,0,75,144241.32,38208.41,102742.23,69469.84,33272.39,2240403.0,1748995.0,402925.0,88483.0
30.0,1419,1106,929,154,23,11,6,1,79,189396.71,38574.04,144241.32,88482.63,55758.68,4696526.5,3987202.9,588665.4,120658.14
40.0,1133,978,821,118,39,9,5,0,78,205850.09,39670.93,161060.33,103290.68,57769.65,2303473.5,1762340.0,457038.4,84095.06
49.0,712,696,544,137,15,13,2,0,99,162522.85,38025.59,120109.69,82998.17,37111.52,2106032.9,1557586.8,477148.1,71297.99
59.0,425,325,254,67,4,6,2,1,35,168555.76,41499.09,122669.1,83912.25,38756.86,1804387.6,1257769.7,460694.7,85923.22
89.0,193,122,106,14,2,0,1,2,46,117733.09,36197.44,79707.5,50091.41,29616.09,1645338.2,1124314.4,424131.6,96892.14
119.0,100,75,58,14,3,1,0,2,41,112979.89,36745.89,75502.74,47531.99,27970.75,1687385.7,1206581.4,409506.4,71297.99
150.0,105,54,42,12,0,1,0,0,55,99451.55,32906.76,66179.16,42595.98,23583.18,1696526.5,1248628.9,365630.7,82266.91
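
The wide layout above is convenient for spreadsheets. Some plotting and analysis tools prefer a long layout instead, with one row per depth and taxon. Here is a sketch of how that could be built in base R, again using a small hand-made subset of this cast so that it runs on its own. The name *long_df* is made up for this example.

```r
# A small subset of the cast: 3 depths (rows) x 3 taxa (columns)
depth <- c(0.5, 9, 19)
taxon_names <- c("RedPico", "RedNano", "Virus")
abundance <- matrix(
  c(974, 2011, 2397,
    1313, 950, 1019,
    1917733.1, 2173491.8, 2240403.0),
  nrow = length(depth)
)

# as.vector() unrolls the matrix column by column, so the depths repeat
# once per taxon and each taxon name repeats once per depth
long_df <- data.frame(
  depth = rep(depth, times = length(taxon_names)),
  taxon = rep(taxon_names, each = length(depth)),
  abundance = as.vector(abundance)
)
long_df
```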


To export the dataframe to a CSV or XLSX file:

In [8]:
# Write to CSV
write.csv(abundance_df, "../data/flow_cytometry_one_cast.csv", row.names = FALSE)

# Write to XLSX using writexl
write_xlsx(abundance_df, "../data/flow_cytometry_one_cast.xlsx")
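
Writing will fail if the output folder does not exist yet. Here is a sketch of how you might create the folder first and then check the result by reading the file back in. It writes to a temporary folder and uses a tiny made-up dataframe; the names *out_dir*, *out_file* and *example.csv* are assumptions for this example, so swap in your own paths.

```r
# Hypothetical output folder inside R's temporary directory
out_dir <- file.path(tempdir(), "data")

# Create the folder if needed; no warning if it already exists
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# A tiny dataframe with two depths from this cast
small_df <- data.frame(depth = c(0.5, 9), RedPico = c(974, 2011))

# Write to CSV, then read it back to check the contents
out_file <- file.path(out_dir, "example.csv")
write.csv(small_df, out_file, row.names = FALSE)
read_back <- read.csv(out_file)
read_back
```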

## Looping through multiple files

What if we want to loop through and access data from all the casts?

We can think of this as the human interface:
https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/contents.html

And this as the machine interface:
https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/catalog.xml

If you have never looked at an XML file like this before, don't worry. What we want to do is access the URL paths of each file and dump them to a list that we can loop through. We need to install another package first.

In [9]:
if (!requireNamespace("xml2", quietly = TRUE)) {
  install.packages("xml2")
}
library(xml2)
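
If XML is new to you, it can help to try the xml2 functions on a tiny example before pointing them at the server. The catalog fragment below is made up for this example (the dataset names and paths are hypothetical, and the real catalog contains more elements), but it uses the same *thredds* prefix and *access* elements that we will search for.

```r
library(xml2)

# A made-up catalog fragment with two datasets, each offering two access methods
catalog_text <- '<thredds:catalog xmlns:thredds="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <thredds:dataset name="cast_A.nc">
    <thredds:access serviceName="dap" urlPath="/projects/example/cast_A.nc"/>
    <thredds:access serviceName="file" urlPath="/projects/example/cast_A.nc"/>
  </thredds:dataset>
  <thredds:dataset name="cast_B.nc">
    <thredds:access serviceName="dap" urlPath="/projects/example/cast_B.nc"/>
    <thredds:access serviceName="file" urlPath="/projects/example/cast_B.nc"/>
  </thredds:dataset>
</thredds:catalog>'

# read_xml() accepts a string of XML as well as a URL
catalog <- read_xml(catalog_text)

# Keep only the <thredds:access> elements whose serviceName is "dap"
example_nodes <- xml_find_all(catalog, ".//thredds:access[@serviceName='dap']", ns = xml_ns(catalog))

# Pull the urlPath attribute out of each matching element
example_paths <- xml_attr(example_nodes, "urlPath")
example_paths
```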

Now let's make a list of the OPeNDAP URLs for each file.

In [10]:
# Read the XML file
xml_url <- 'https://opendap1.nodc.no/opendap/hyrax/projects/nansen_legacy/cytometry/catalog.xml'
xml_content <- read_xml(xml_url)

# Extract all <thredds:access> nodes with serviceName="dap"
dap_nodes <- xml_find_all(xml_content, ".//thredds:access[@serviceName='dap']", ns = xml_ns(xml_content))

# Extract the URLs
dap_urls <- xml_attr(dap_nodes, "urlPath")

# Base URL
base_url <- "https://opendap1.nodc.no/opendap/"

# Combine base URL with URL paths to get full URLs
full_urls <- paste0(base_url, dap_urls)

# Print the full URLs
print(full_urls)

  [1] "https://opendap1.nodc.no/opendap//projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2018707_station_P1_in_the_Northern_Barents_Sea.nc"                                   
  [2] "https://opendap1.nodc.no/opendap//projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2018707_station_P2_in_the_Northern_Barents_Sea.nc"                                   
  [3] "https://opendap1.nodc.no/opendap//projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2018707_station_P3_in_the_Northern_Barents_Sea.nc"                                   
  [4] "https://opendap1.nodc.no/opendap//projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_2018707_station_P4_in_the_Northern_Barents_Sea.nc"                                   
  [5] "https://opendap1.nodc.no/opendap//projects/nansen_legacy/cytometry/Flow_cytometry_measurements_during_Nansen_Legacy_cruise_201870
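
Notice the doubled slash (*opendap//projects*) in the URLs printed above. It appears because the base URL ends with a slash and each URL path starts with one. If you would rather have clean URLs, a small helper like the one sketched here trims the extra slashes before joining. The function name *join_url* and the file name *example.nc* are made up for this example.

```r
# Hypothetical helper: join a base URL and a path with exactly one slash between them
join_url <- function(base, path) {
  paste0(sub("/+$", "", base), "/", sub("^/+", "", path))
}

clean_url <- join_url(
  "https://opendap1.nodc.no/opendap/",
  "/projects/nansen_legacy/cytometry/example.nc"
)
clean_url
```

Because *sub* and *paste0* are vectorised, the same helper also works on a whole vector of paths at once.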

Now we need to use a for loop to access each file in turn. Here is a quick example of how a for loop works, in case you are not familiar with them.

In [11]:
animals <- c('pig', 'dog', 'horse')

for (animal in animals) {
    message <- paste('Hello there,',animal)
    print(message)
}

[1] "Hello there, pig"
[1] "Hello there, dog"
[1] "Hello there, horse"


Now let's write the data from each NetCDF file to a different CSV file.

In [12]:
for (url in full_urls) {
    data <- open.nc(url)
    
    depth <- var.get.nc(data, "depth")
    taxon_names <- var.get.nc(data, "taxon_name")
    abundance <- var.get.nc(data, "abundance")
    
    # Close the connection once the variables have been read
    close.nc(data)
    
    # Writing the data to a dataframe
    abundance_df <- as.data.frame(abundance)
    colnames(abundance_df) <- taxon_names
    abundance_df$depth <- depth
    # Move the depth column to the front
    abundance_df <- abundance_df[, c(ncol(abundance_df), 1:(ncol(abundance_df)-1))]
    
    filename <- basename(url) # Extract the filename of the NetCDF file
    csv_filename <- sub("\\.nc$", ".csv", filename) # Replace .nc extension with .csv
    
    # Write to CSV (the ../data/ folder must already exist)
    write.csv(abundance_df, paste0("../data/", csv_filename), row.names = FALSE)
}
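
When you loop over many remote files, one unreachable or malformed file will stop the whole loop with an error. One way to guard against this, sketched here, is to wrap the work for each file in *tryCatch* and keep a note of the files that failed. The function *process_one* and the example URLs are made up stand-ins so that the sketch runs without a connection; in practice the body of the loop above would take their place.

```r
# Hypothetical stand-in for the per-file work; it fails on anything that is not a .nc URL
process_one <- function(url) {
  if (!grepl("\\.nc$", url)) {
    stop("not a NetCDF URL")
  }
  sub("\\.nc$", ".csv", basename(url))
}

# Made-up URLs: the second one triggers the error on purpose
urls <- c(
  "https://example.org/data/cast_P1.nc",
  "https://example.org/data/broken.txt",
  "https://example.org/data/cast_P2.nc"
)

results <- list()
failed <- character(0)

for (url in urls) {
  outcome <- tryCatch(
    process_one(url),
    error = function(e) {
      # Report the problem and carry on with the next file
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(outcome)) {
    failed <- c(failed, url)
  } else {
    results[[url]] <- outcome
  }
}

failed
```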

## Citing the data

Parent-child relationships have been used in this data publication. Here is the landing page for all the data from all cruises.
https://doi.org/10.21335/NMDC-1588963816

This is the recommended citation if you use all of the data or data from many different cruises.

*Oliver Müller; Elzbieta Petelenz; Tatiana Tsagkaraki; Maria Langvad; Lasse Olsen; Anna Grytaas; Stefan Thiele; Hilde Stabell; Evy Skjoldal; Selina Våge; Gunnar Bratbak (2023) Flow cytometry measurements (abundance of virus, bacteria and small protists (primarily <20μm)) during Nansen Legacy cruises https://doi.org/10.21335/NMDC-1588963816*

You can navigate to different *parts* from the landing page, where there is one part per cruise. Each cruise has its own citation that you can use if you are using data from only that cruise. For example:

*Oliver Müller; Elzbieta Petelenz; Tatiana Tsagkaraki; Maria Langvad; Hilde Stabell; Gunnar Bratbak (2023) Flow cytometry measurements (abundance of virus, bacteria and small protists (primarily <20μm)) during Nansen Legacy cruise 2019711 (from November 28th to December 17th in 2019) in the Northern Barents Sea https://doi.org/10.21335/NMDC-2099951995*

You can also navigate further to one part per cast. You can cite one or a couple of casts if these are the only data you have used from this collection. For example:

*Oliver Müller; Elzbieta Petelenz; Tatiana Tsagkaraki; Maria Langvad; Hilde Stabell; Gunnar Bratbak (2023) Flow cytometry measurements (abundance of virus, bacteria and small protists (primarily <20μm)) during Nansen Legacy cruise 2019711 (from November 28th to December 17th in 2019) at station P3 (NLEG07) in the Northern Barents Sea https://doi.org/10.21335/NMDC-2099951995_P3(NLEG07)*

## Citing this tutorial

If you find this tutorial series useful for your work, consider citing the repository:

Luke Marsden. (2024, May 24). Accessing Nansen Legacy data in R. Zenodo. https://doi.org/10.5281/zenodo.11277693

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11277693.svg)](https://doi.org/10.5281/zenodo.11277693)