# EMBO Practical Course "Advanced methods in bioimage analysis"

***

Homepage: https://www.embl.org/about/info/course-and-conference-office/events/bia23-01/

***

## Day 2 - Session 1: Image Data Management - 11:30 to 12:30 "GO!"

### Continuing from `5_Cloud` in Python!...

## Software versions used for this workshop: (TODO)

   * awscli                    1.22.87
   * dask                      2022.4.0
   * fsspec                    2022.3.0
   * napari                    0.4.15
   * numpy                     1.22.3
   * ome-zarr                  0.4.0
   * openjdk                   11.0.9.1
   * tifffile                  2022.3.25
   * zarr                      2.11.1
   * vizarr                    0.2


In [6]:
%%bash
##
## Setup & Sanity checks
##

YOURNAME=$(whoami)
WORKDIR=/scratch/${YOURNAME}/session1/
test -e ${WORKDIR} || {
    echo "Please run the POSIX notebook first."
    exit 1
}

In [7]:
import os
YOURNAME = os.getlogin()
%env YOURNAME=$YOURNAME

env: YOURNAME=jamoore


In [8]:
%cd /scratch/{YOURNAME}/session1

/System/Volumes/Data/scratch/jamoore/session1


## License
Copyright (C) 2023 German BioImaging. All Rights Reserved.
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details. You should have received a copy of the GNU General
Public License along with this program; if not, write to the
Free Software Foundation,
Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

## 2. Looking into an OME-Zarr

`mri.ome.zarr` was exported from ImageJ with the following macro:

```%java
run("MRI Stack");
Stack.setXUnit("mm"); // same unit for Y and Z
run("Properties...", "channels=1 slices=27 frames=1 pixel_width=1 pixel_height=1 voxel_depth=7");
run("Export Current Image To OME-ZARR...", "imagename=mri savedirectory=/tmp downsamplingmethod=Average usedefaults=true");
run("Open OME ZARR From File System...", "directory=/tmp/mri.ome.zarr");
```
<img src="mri.png" style="height:300px" />


The metadata in a Zarr fileset is stored in (hidden) files starting with ".z".

In [4]:
!find mri.ome.zarr -name ".z*"

mri.ome.zarr/.zattrs
mri.ome.zarr/.zgroup
mri.ome.zarr/s0/.zarray
mri.ome.zarr/s0/.zattrs


These are broken up into groups (folders) or arrays (data). The `.zgroup` files are fairly simple:

In [5]:
# %load mri.ome.zarr/.zgroup
{
  "zarr_format": 2
}

{'zarr_format': 2}

Each `.zattrs` file contains user-supplied metadata. OME-Zarrs use these attributes to describe how an n-dimensional Zarr array should be interpreted as an image.

In [6]:
# %load mri.ome.zarr/.zattrs
{
  "multiscales": [
    {
      "axes": [
        {
          "name": "z",
          "type": "space",
          "unit": "millimeter"
        },
        {
          "name": "y",
          "type": "space",
          "unit": "millimeter"
        },
        {
          "name": "x",
          "type": "space",
          "unit": "millimeter"
        }
      ],
      "datasets": [
        {
          "path": "s0",
          "coordinateTransformations": [
            {
              "type": "scale",
              "scale": [
                7.0,
                1.0,
                1.0
              ]
            }
          ]
        }
      ],
      "name": "mri",
      "type": "Average",
      "version": "0.4"
    }
  ]
}

{'multiscales': [{'axes': [{'name': 'z',
     'type': 'space',
     'unit': 'millimeter'},
    {'name': 'y', 'type': 'space', 'unit': 'millimeter'},
    {'name': 'x', 'type': 'space', 'unit': 'millimeter'}],
   'datasets': [{'path': 's0',
     'coordinateTransformations': [{'type': 'scale',
       'scale': [7.0, 1.0, 1.0]}]}],
   'name': 'mri',
   'type': 'Average',
   'version': '0.4'}]}
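
To make the interpretation concrete, the sketch below pairs each axis with its scale and unit. The helper name `voxel_size` is hypothetical, and it assumes a version 0.4 `multiscales` block with a single `scale` transformation per dataset, as in the example above.

```python
def voxel_size(attrs, dataset_index=0):
    """Map each axis name to (scale, unit) for one resolution level."""
    multiscale = attrs["multiscales"][0]
    axes = multiscale["axes"]
    transforms = multiscale["datasets"][dataset_index]["coordinateTransformations"]
    # Take the first "scale" transformation; the example above has exactly one.
    scale = next(t["scale"] for t in transforms if t["type"] == "scale")
    return {ax["name"]: (s, ax.get("unit")) for ax, s in zip(axes, scale)}


# Abbreviated copy of mri.ome.zarr/.zattrs from the cell above.
attrs = {
    "multiscales": [
        {
            "axes": [
                {"name": "z", "type": "space", "unit": "millimeter"},
                {"name": "y", "type": "space", "unit": "millimeter"},
                {"name": "x", "type": "space", "unit": "millimeter"},
            ],
            "datasets": [
                {
                    "path": "s0",
                    "coordinateTransformations": [
                        {"type": "scale", "scale": [7.0, 1.0, 1.0]}
                    ],
                }
            ],
            "version": "0.4",
        }
    ]
}

print(voxel_size(attrs))
# {'z': (7.0, 'millimeter'), 'y': (1.0, 'millimeter'), 'x': (1.0, 'millimeter')}
```

The 7 mm spacing along z matches the `voxel_depth=7` set in the ImageJ macro.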

The `.zattrs` for each array can be fairly simple:

In [7]:
# %load mri.ome.zarr/s0/.zattrs
{
  "_ARRAY_DIMENSIONS": [
    "z",
    "y",
    "x"
  ]
}

{'_ARRAY_DIMENSIONS': ['z', 'y', 'x']}

The `.zarray` files specify details about storage like compression and array dimensions:

In [8]:
# %load mri.ome.zarr/s0/.zarray
{
  "shape": [
    27,
    226,
    186
  ],
  "chunks": [
    16,
    128,
    128
  ],
  "fill_value": "0",
  "dtype": "|u1",
  "filters": [],
  "dimension_separator": "/",
  "zarr_format": 2,
  "compressor": {
    "id": "gzip",
    "level": -1
  },
  "order": "C"
}

{'shape': [27, 226, 186],
 'chunks': [16, 128, 128],
 'fill_value': '0',
 'dtype': '|u1',
 'filters': [],
 'dimension_separator': '/',
 'zarr_format': 2,
 'compressor': {'id': 'gzip', 'level': -1},
 'order': 'C'}

All the other files in the tree are **"chunks"**, pieces of an array that have been written to separate files:

In [9]:
!tree mri.ome.zarr

mri.ome.zarr
└── s0
    ├── 0
    │   ├── 0
    │   │   ├── 0
    │   │   └── 1
    │   └── 1
    │       ├── 0
    │       └── 1
    └── 1
        ├── 0
        │   ├── 0
        │   └── 1
        └── 1
            ├── 0
            └── 1

7 directories, 8 files


The levels of this hierarchy can be interpreted as:
```
mri.ome.zarr
└── resolution-level
    └── z-chunk-index
        └── y-chunk-index
            └── x-chunk-index
```
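
The chunk paths follow directly from the `shape` and `chunks` entries in `.zarray`. As a rough sketch (the helper names `chunk_grid` and `chunk_region` are hypothetical), the following enumerates every chunk key and the voxel range it covers:

```python
import itertools
import math

shape = (27, 226, 186)    # "shape" from s0/.zarray
chunks = (16, 128, 128)   # "chunks" from s0/.zarray


def chunk_grid(shape, chunks):
    """Number of chunks along each axis (ceiling division)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_region(index, shape, chunks):
    """Half-open (start, stop) voxel range per axis covered by one chunk index."""
    return tuple(
        (i * c, min((i + 1) * c, s)) for i, s, c in zip(index, shape, chunks)
    )


grid = chunk_grid(shape, chunks)
for index in itertools.product(*(range(n) for n in grid)):
    key = "s0/" + "/".join(map(str, index))  # "/" is the dimension_separator
    print(key, chunk_region(index, shape, chunks))
```

With these values the grid is 2 x 2 x 2, which accounts for the 8 chunk files reported by `tree`.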

In [31]:
!ls -ltrah mri.ome.zarr/s0/0/0/0

-rw-r--r--  1 jamoore  wheel   148K Apr  5 22:39 mri.ome.zarr/s0/0/0/0
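
The 148K on disk is the compressed size: an uncompressed chunk of 16 x 128 x 128 `|u1` values is 262144 bytes (256 KiB). The sketch below shows how such a chunk could be decoded with only the standard library. It assumes the settings from `.zarray` above (gzip compressor, no filters, `"order": "C"`, one byte per voxel); the helper names `decode_chunk` and `voxel` are hypothetical.

```python
import gzip

CHUNK_SHAPE = (16, 128, 128)  # "chunks" from s0/.zarray


def decode_chunk(raw, chunk_shape=CHUNK_SHAPE):
    """Decompress a gzip-compressed uint8 chunk and check its length."""
    data = gzip.decompress(raw)
    expected = chunk_shape[0] * chunk_shape[1] * chunk_shape[2]
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return data


def voxel(data, z, y, x, chunk_shape=CHUNK_SHAPE):
    """Read one uint8 voxel from a C-ordered (z, y, x) chunk."""
    _, ny, nx = chunk_shape
    return data[(z * ny + y) * nx + x]
```

For example, `decode_chunk(open("mri.ome.zarr/s0/0/0/0", "rb").read())` would return the raw bytes of the first chunk. In practice a library such as `zarr` does this for you; the point here is only that a chunk is an ordinary compressed file.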


***

## 3. Converting your data to OME-NGFF

The two basic commands are `bioformats2raw` and `raw2ometiff`. Together they provide a pipeline to scalably convert large images into OME-TIFF. The intermediate output of `bioformats2raw` is itself an OME-Zarr fileset, which is what we use here. The primary caveat is that the full pipeline requires roughly **twice** the storage during conversion.


### 3.1 Conversion tools

https://forum.image.sc/t/converting-whole-slide-images-to-ome-tiff-a-new-workflow/32110/4

<img src="images/blog-2019-12-converting-whole-slide-images.jpg" style="height:200px" />



In [32]:
!bioformats2raw

Missing required parameters: '<inputPath>', '<outputLocation>'
Usage: <main class> [-p] [--no-hcs] [--[no-]nested] [--no-ome-meta-export]
                    [--no-root-group] [--overwrite]
                    [--use-existing-resolutions] [--version] [--debug
                    [=<logLevel>]] [--extra-readers[=<extraReaders>[,
                    <extraReaders>...]]]... [--options[=<readerOptions>[,
                    <readerOptions>...]]]... [-s[=<seriesList>[,
                    <seriesList>...]]]...
                    [--additional-scale-format-string-args=<additionalScaleFormatStringArgsCsv>]
                    [-c=<compressionTy

In [33]:
import os, shutil
if os.path.exists("/tmp/trans_norm_out"):
    shutil.rmtree("/tmp/trans_norm_out")

In [34]:
%%time
!bioformats2raw --debug=OFF --progress 1885619/trans_norm.tif /tmp/trans_norm_out

[0/0]   0% │                                 │   0/571 (0:00:00 / ?)
571/571 (0:00:01 / 0:00:00)
CPU times: user 278 ms, sys: 122 ms, total: 401 ms
Wall time: 10.4 s
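
If you prefer to drive the conversion from a Python script rather than a notebook cell, something like the following sketch could work. It uses only the flags already shown above (`--debug=OFF`, `--progress`); the function names `bioformats2raw_command` and `convert` are hypothetical.

```python
import shutil
import subprocess


def bioformats2raw_command(input_path, output_path, progress=True):
    """Build the argument list for the conversion shown above."""
    cmd = ["bioformats2raw", "--debug=OFF"]
    if progress:
        cmd.append("--progress")
    cmd += [str(input_path), str(output_path)]
    return cmd


def convert(input_path, output_path):
    """Run bioformats2raw if it is on PATH; raise otherwise."""
    if shutil.which("bioformats2raw") is None:
        raise FileNotFoundError("bioformats2raw not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(bioformats2raw_command(input_path, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.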


In [35]:
!ls /tmp/trans_norm_out

0   OME


In [15]:
!find /tmp/trans_norm_out -name ".z*"

/tmp/trans_norm_out/.zattrs
/tmp/trans_norm_out/.zgroup
/tmp/trans_norm_out/0/.zattrs
/tmp/trans_norm_out/0/.zgroup
/tmp/trans_norm_out/0/0/.zarray


In [36]:
!ome_zarr -q info /tmp/trans_norm_out/0

/private/tmp/trans_norm_out/0 [zgroup]
 - metadata
   - Multiscales
 - data
   - (1, 1, 571, 30, 30)
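
The five numbers are easier to read once labelled. The sketch below assumes the array uses the (t, c, z, y, x) axis order, which to our understanding is the `bioformats2raw` default; check the `axes` entry in the `.zattrs` file if in doubt.

```python
AXES = ("t", "c", "z", "y", "x")  # assumed axis order, not read from the file
shape = (1, 1, 571, 30, 30)       # from `ome_zarr info` above

labelled = dict(zip(AXES, shape))
print(labelled)
# {'t': 1, 'c': 1, 'z': 571, 'y': 30, 'x': 30}
```

Under that assumption the image is a single-channel stack of 571 planes of 30 x 30 pixels, which is consistent with the 571 steps shown by the progress bar.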


## 4. Data from S3
We're going to start off by looking at some images you will likely have seen during the OMERO or IDR sessions.

**Our goal is to share these *without* using OMERO.**

<table>
    <tr>
        <td>
            <img alt="idr0062 thumbnails" src="images/training-1.png" style="height:150px"/>
        </td>
        <td>
            <img alt="idr0062 thumbnails" src="images/training-2.png" style="height:150px"/>
        </td>
        <td>
            <img alt="idr0023 3D screenshot" src="images/training-3.png" style="height:150px"/>
        </td>
    </tr>
</table>
    
The left two images are from the ilastik plugin guide presented by Petr: https://omero-guides.readthedocs.io/en/latest/ilastik/docs/ilastik_fiji.html

They are available in the "idr0062" project on the workshop server: https://workshop.openmicroscopy.org/webclient/?show=project-1952

The original dataset can be found in IDR study idr0062 by Blin _et al._: https://idr.openmicroscopy.org/webclient/?show=project-801

The image on the right is from idr0023 by Szymborska _et al._: http://idr.openmicroscopy.org/webclient/?show=project-52 and is **much** smaller. (Specifically, http://idr.openmicroscopy.org/webclient/?show=image-1885619)


### 4.1 Minio client

There are many types of cloud storage and many tools for accessing them, but here we're going to focus on a single one: `mc`.

`mc` is provided by the minio project and is described as "a modern alternative to UNIX commands like ls, cat, cp, mirror, diff, find etc." The quickstart guide can be found at https://docs.minio.io/docs/minio-client-quickstart-guide.html. For our purposes we'll focus on how to use it to upload and manage data in S3.

In [17]:
!mc

NAME:
  mc - MinIO Client for cloud storage and filesystems.

USAGE:
  mc [FLAGS] COMMAND [COMMAND FLAGS | -h] [ARGUMENTS...]

COMMANDS:
  alias      set, remove and list aliases in configuration file
  ls         list buckets and objects
  mb         make a bucket
  rb         remove a bucket
  cp         copy objects
  mirror     synchronize object(s) to a remote site
  cat        display object contents
  head       display first 'n' lines of an object
  pipe       stream STDIN to an object
  share      generate URL for temporary access to an object
  find       search for objects
  sql        run sql queries on objects
  stat       show object metadata
  mv         move objects
  tree       list buckets and objects in a tree format
  du         summarize disk usage recursively
  retention  set retention for object(s)
  legalhold  manage legal hold for object(s)
  diff       list differences in object name, size, and date between two buckets
  rm         re

### 4.2 Connections

The minio project provides a safe space for you to learn about S3: https://play.minio.io:9000/minio/ Here we've used the `mc` command to find the access information:

 * **"AccessKey"** is basically a user name.
 * **"SecretKey"** is basically a password. 
 * The URL is our **"endpoint"**, which differentiates it from the S3 servers provided by Amazon.

You can log in to the webpage and explore what the many other users have uploaded at https://play.minio.io:9000/minio/

The other two important concepts are:
 * **"buckets"**, which are roughly like shared namespaces with permissions
 * and **"keys"**, which we'll get to in a second.
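
To see how endpoint, bucket and key fit together, the sketch below splits a URL used later in this notebook into its three parts. It assumes a path-style address (the bucket is the first path segment), and the helper name `split_s3_url` is hypothetical.

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split a path-style S3 URL into (endpoint, bucket, key)."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}"
    # The first path segment is the bucket; everything after it is the key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return endpoint, bucket, key


print(split_s3_url("https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr"))
# ('https://uk1s3.embassy.ebi.ac.uk', 'idr', 'zarr/v0.1/6001240.zarr')
```

Note that the key contains slashes, but it is still one string rather than a path through directories.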

In [37]:
!mc config host list play

play
  URL       : https://play.min.io
  AccessKey : Q3AM3UQ867SPQQA43P2F
  SecretKey : zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG
  API       : S3v4
  Path      : auto

### 4.3 Using `mc` with a public S3 bucket 

In [38]:
!mc config host add public https://uk1s3.embassy.ebi.ac.uk "" ""

Added `public` successfully.

In [42]:
!mc ls public/idr/zarr/v0.3/

[2022-04-06 15:58:45 CEST]     0B 9836842.zarr/
[2022-04-06 15:58:45 CEST]     0B idr0040A/
[2022-04-06 15:58:45 CEST]     0B idr0051A/
[2022-04-06 15:58:45 CEST]     0B idr0052A/
[2022-04-06 15:58:45 CEST]     0B idr0075A/
[2022-04-06 15:58:45 CEST]     0B idr0079A/
[2022-04-06 15:58:45 CEST]     0B idr0094A/
[2022-04-06 15:58:45 CEST]     0B idr0095B/
[2022-04-06 15:58:45 CEST]     0B idr0109A/

In [43]:
!mc cp public/idr/share/gbi2022/1885619/trans_norm.tif /tmp/

..._norm.tif:  2.04 MiB / 2.04 MiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  3.93 MiB/s 0s

In [44]:
!ls -ltrah /tmp/trans_norm.tif

-rw-r--r--  1 jamoore  wheel   2.0M Apr  6 15:59 /tmp/trans_norm.tif


***

## 5. Publishing your data with S3 ⚠️

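You can then move the generated output to S3. **Note: this won't work unless you have write access to the target. The workshop originally used a configured `mc` alias named "uk1"; the command below uses the `play` alias instead.**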

In [46]:
!time mc cp --recursive /tmp/trans_norm_out/0/ play/gbi2022/1885619/trans_norm.ome.zarr/

.../0/99/0/0:  773.27 KiB / 773.27 KiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  132.28 KiB/s 5s
real	0m7.297s
user	0m2.158s
sys	0m2.216s


If you are using binder, you may need to access the link directly:

https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/gbi2022/1885619/trans_norm.ome.zarr/

In [24]:
from IPython.display import IFrame
IFrame("https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr/share/gbi2022/1885619/trans_norm.ome.zarr/", width=700, height=350)

This viewer is [vizarr](https://github.com/hms-dbmi/vizarr) from the [Gehlenborg lab](http://gehlenborglab.org/) at Harvard Medical School. It is hosted at https://hms-dbmi.github.io/vizarr and can, for example, display data from the IDR: [link](http://hms-dbmi.github.io/vizarr/v0.1?source=https%3A%2F%2Fuk1s3.embassy.ebi.ac.uk%2Fidr%2Fzarr%2Fv0.1%2F6001240.zarr).
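
The links in this notebook pass the Zarr location in the `source` query parameter, sometimes percent-encoded and sometimes not. A small sketch for building the encoded form is shown below; the helper name `vizarr_url` is hypothetical.

```python
from urllib.parse import quote

VIZARR = "https://hms-dbmi.github.io/vizarr/"


def vizarr_url(source, viewer=VIZARR):
    """Return a vizarr link with the Zarr location percent-encoded."""
    # safe="" also encodes ":" and "/" so the source is one opaque parameter.
    return f"{viewer}?source={quote(source, safe='')}"


print(vizarr_url("https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr"))
# https://hms-dbmi.github.io/vizarr/?source=https%3A%2F%2Fuk1s3.embassy.ebi.ac.uk%2Fidr%2Fzarr%2Fv0.1%2F6001240.zarr
```

Encoding the source is the safer choice when the Zarr URL itself contains characters such as `?` or `&`.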

## 6. Extras (time-permitting)

### 6.1 A larger example (idr0062)

If you are using binder, you may need to access the link directly:

https://hms-dbmi.github.io/vizarr/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001240.zarr

In [25]:
from IPython.display import IFrame
IFrame("http://hms-dbmi.github.io/vizarr/v0.1?source=https%3A%2F%2Fuk1s3.embassy.ebi.ac.uk%2Fidr%2Fzarr%2Fv0.1%2F6001240.zarr", width=700, height=350)

### 6.2 Renaming

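Another important distinction from filesystems is that, though it looks like an object lives in a directory, you should really think of the entire string after the bucket as a single "key". S3 has no rename operation, so "renaming a directory" means copying and then deleting every object under that prefix.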

In [26]:
!mc mv --recursive uk1/idr/share/gbi2022/1885619/trans_norm.ome.zarr/ uk1/idr/share/gbi2022/1885619/renamed.ome.zarr/

.../0/99/0/0:  773.27 KiB / 773.27 KiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  144.06 KiB/s 5s
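
Why a rename touches every object can be illustrated with a toy model. The sketch below treats a bucket as a plain Python dictionary from key to bytes; it is only an analogy, not how `mc` is implemented, and the function name `rename_prefix` is hypothetical.

```python
def rename_prefix(store, old, new):
    """Simulate `mv --recursive` on a flat key/value store.

    Every key starting with `old` is re-inserted under a new key and the
    old entry is removed; there is no directory entry to rename.
    """
    moved = 0
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)
        moved += 1
    return moved


# A toy "bucket": keys are plain strings and "/" has no special meaning.
bucket = {
    "share/gbi2022/1885619/trans_norm.ome.zarr/.zattrs": b"{}",
    "share/gbi2022/1885619/trans_norm.ome.zarr/0/.zarray": b"{}",
    "share/gbi2022/1885619/trans_norm.ome.zarr/0/0/0/0/0/0": b"chunk",
}

count = rename_prefix(
    bucket,
    "share/gbi2022/1885619/trans_norm.ome.zarr/",
    "share/gbi2022/1885619/renamed.ome.zarr/",
)
print(count, sorted(bucket))
```

The cost of the operation grows with the number of objects, which is why renaming a Zarr with many chunks is slow compared to renaming a folder on a local disk.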

### 6.3 Other resources

<table>
    <tr>
        <td>
            <a href="https://downloads.openmicroscopy.org/presentations/2020/Dundee/Workshops/NGFF/zarr_diagram/">
<img src="images/resources-1.png" alt="Screenshot of the Zarr diagram from OME2020" style="height:200px"/>
            </a>
        </td>
        <td>
<a href="https://downloads.openmicroscopy.org/presentations/2020/Dundee/Workshops/NGFF/zarr_diagram/">Diagram for how data moves</a>
        </td>
    </tr>
    <tr>
        <td>
      <a href="https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/">      
<img src="images/resources-2.png" alt="Screenshot of the OME blog post on publishing OME-Zarr files" style="height:200px"/>
            </a>
        </td>
        <td>
<a href="https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/">Blog post for an easy way to publish OME-Zarr files</a>
        </td>
    </tr>
</table>    

### 6.4 Trying more with minio's play

You can repeat the upload steps against minio's play server. Note, however, that play buckets get deleted every 24 hours.

In [27]:
!mc mb --ignore-existing play/gbi2022

Bucket created successfully `play/gbi2022`.

In [28]:
!mc policy set public play/gbi2022

Access permission for `play/gbi2022` is set to `public`

In [29]:
!mc cp -r /tmp/trans_norm_out/0/ play/gbi2022/trans_norm.ome.zarr/

.../0/99/0/0:  773.27 KiB / 773.27 KiB  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  134.17 KiB/s 5s

Now run: `napari https://play.minio.io/gbi2022/trans_norm.ome.zarr`

## 7. Take homes

<br/>
<big><big>
    <ol>
        <li>
The simplicity & transparency of Zarr files makes them ideal for exploration & the cloud. 
        </li>
         <br/>
        <li>
The primary downside is that working with many small files can introduce bottlenecks for uploading (& even deleting).
        </li>
        <br/>
        <li>
Working with S3 is very different from a file system, fewer (GUI) tools exist, and each S3 implementation may be slightly different.
        </li>
        <br/>
        <li>
The benefits in sharing potential (and in some cases cost-savings) can be significant, especially if there's an enabled ecosystem that works for you.
        </li>
    </ol>
</big></big>
