# Section 4: Data - A modern approach 
![Pangeo Logo](images/pangeo_logo_small.png)

[Pangeo Website](https://pangeo.io/)

## A Historical Digression - Reading ancient texts

![Codex Sinaiticus](https://www.bl.uk/britishlibrary/~/media/bl/global/dst%20discovering%20sacred%20texts/collection%20items/codexsinaiticus-add_ms_43725_f244v.jpg?w=608&h=342&hash=11BF1F0A1DE8CAC524DE050F912D7AD7)

Original manuscripts had no spaces or paragraphs.

Conveying meaning through whitespace was not invented by the Python language!

Simply reading out what was written was difficult, requiring specific knowledge and skill.

Talk about lectors (readers).

We want people reading a document to need minimal skill in the act of reading itself, so that they can focus on the content.

Similarly, for people reading our data today, we want to minimise the technical skill needed to read the data, so that they can focus on understanding the content of the dataset.

https://pangeo.io/data.html

## Analysis Ready Data


* Metadata rich. Users should be given an object whose metadata can be used
to define operations (e.g. a mean over ensemble realization).
* Analysis can be readily achieved without knowledge of the underlying storage of the
data (e.g. file paths, chunking; note efficiency considerations).
* Users should not have to interact manually with the storage system; this should be
handled automatically as necessary. Ideally the storage should be abstracted away
from the user.
* Simple analysis (e.g. descriptive statistics, subsetting) should be possible with a
minimum number of additional lines of code, and on the basis of metadata (e.g.
`mean(dataset, axis="time")` or similar).
* Custom analytics should be possible by creating functions which take in an
analysis-ready dataset and return another.
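
As a minimal sketch of the last two points, the example below uses xarray with a small synthetic dataset. The variable name `air_temperature`, the coordinate values and the helper `anomaly_from_ensemble_mean` are assumptions made for illustration only. It shows a reduction expressed through dimension names rather than axis positions, and a custom function that accepts a dataset and returns another.

```python
import numpy as np
import xarray as xr

# Synthetic, illustrative data: 3 ensemble members x 4 time steps.
temperature = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("realization", "time"),
    coords={"realization": [0, 1, 2], "time": [0, 6, 12, 18]},
    attrs={"units": "K", "long_name": "air temperature"},
    name="air_temperature",
)
dataset = xr.Dataset({"air_temperature": temperature})

# Simple analysis driven by metadata: reduce over a named dimension,
# with no reference to axis positions, file paths or chunks.
ensemble_mean = dataset.mean(dim="realization")
time_mean = dataset.mean(dim="time")


def anomaly_from_ensemble_mean(ds: xr.Dataset) -> xr.Dataset:
    """Custom analytic: takes a dataset and returns another dataset."""
    return ds - ds.mean(dim="realization")


anomalies = anomaly_from_ensemble_mean(dataset)
```

Because the function both consumes and produces a dataset, its result can be passed straight into further metadata-led operations.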


### Metadata
What is metadata, and why is it important?

## Cloud optimised Data

* Metadata should be retrievable with low latency, and without pulling from a
large number of individual objects (a low-cost operation). It should be consolidated in
some way.
* Subsets (or chunks) of data should be retrievable either using a whole-object fetch
or easily computable byte-range requests, in order to leverage large-scale parallelism
and avoid unnecessary network transfers.
* Consider “eventual consistency”: if an object in cloud storage is
updated from one node and another node later reads the same object, the state
change may not have propagated.
* For performance, it should ideally be possible to construct processing on ARCO
(analysis-ready, cloud-optimised) data lazily, so that no processing is carried out
until explicitly requested by the user. Any
computation should also be processed close to the data, in order to avoid
unnecessary network traffic.
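
To make "easily computable byte-range requests" concrete, here is a small sketch using only the Python standard library. It assumes a row-major array of 64-bit floats stored as one object and chunked by rows; the in-memory `stored_object` stands in for an object in cloud storage, and the names `chunk_byte_range` and `read_chunk` are hypothetical helpers, not part of any library.

```python
import io
import struct

ITEM_SIZE = 8  # bytes per float64 value
N_ROWS, N_COLS = 6, 4
ROWS_PER_CHUNK = 2

# Stand-in for an object in cloud storage: 24 float64 values, row-major.
values = [float(i) for i in range(N_ROWS * N_COLS)]
stored_object = io.BytesIO(struct.pack(f"<{len(values)}d", *values))


def chunk_byte_range(chunk_index: int) -> tuple[int, int]:
    """Return (start, stop) byte offsets of a chunk, with stop exclusive."""
    chunk_bytes = ROWS_PER_CHUNK * N_COLS * ITEM_SIZE
    start = chunk_index * chunk_bytes
    return start, start + chunk_bytes


def read_chunk(chunk_index: int) -> list[float]:
    """Fetch one chunk only, analogous to an HTTP range request."""
    start, stop = chunk_byte_range(chunk_index)
    stored_object.seek(start)
    raw = stored_object.read(stop - start)
    return list(struct.unpack(f"<{len(raw) // ITEM_SIZE}d", raw))


second_chunk = read_chunk(1)
```

Because the offsets follow from the chunk index by arithmetic alone, many workers could each request a different chunk in parallel without first reading the rest of the object.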


## Challenges
* Cloud requirements may differ from HPC requirements. What works well
on one platform may not work well on another.
* There is unlikely to be one complete solution or concrete implementation; instead we need to
agree principles such as the use of open standards and interoperability.
* The solution should be as language agnostic as possible. The Informatics Lab is largely
Python focussed, R and Scala are also important in the data science community, and
Fortran/C++ are in use across the Met Office. (Possibly also Java in Technology?)
* Even within one language (e.g. Python), there may be many possible “analysis
ready” objects owing to different libraries. For numeric arrays, numpy is standard; for
tabular data, pandas DataFrames (with dask wrapping both); for ND arrays, xarray datasets
are an emerging standard, but the Met Office prefers to use Iris for domain-specific functionality.
Interoperability between Iris and xarray is limited.
* Users need training: the order of operations can be important and have significant cost
implications, e.g. in xarray/dask, compute -> subset vs subset -> compute! This is especially
worrying if using auto-scaling groups.
* Data should ideally be discoverable. Consider how to “publish” datasets and make
them findable: which metadata are important? It should be possible for all users to
publish data, to prevent duplication of effort and multiple redundant copies (metadata
might need to include e.g. responsible owner, chunking strategy, data location
(MASS vs cloud)). Consider e.g. catalogues and search.

The broader “common data platform”:
* Should be platform agnostic (as much as possible), with seamless working across
on-prem and cloud solutions (including federated/multi-cloud), noting data
locality/data gravity issues.
* For performance and cost reasons, computation is often tied to data due to data
gravity, which makes multi-cloud challenging, especially if it is desirable to abstract
away or hide this information from end users (cost implications).

### Lazy Loading

In [None]:
# demo lazy loading of data in iris
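
Until the iris demo is written, the idea can be sketched in plain Python. The toy classes below are not iris or dask; `CountingSource` and `LazyArray` are hypothetical names invented for this illustration. The sketch shows that subsetting a lazy object only records what is wanted, that data is read when `compute()` is called, and why subset -> compute is cheaper than compute -> subset.

```python
class CountingSource:
    """Hypothetical stand-in for a file or object store that counts reads."""

    def __init__(self, values):
        self._values = list(values)
        self.values_read = 0

    def __len__(self):
        return len(self._values)

    def read(self, start, stop):
        self.values_read += stop - start
        return self._values[start:stop]


class LazyArray:
    """Toy deferred array: records a slice, reads values only on compute()."""

    def __init__(self, source, start=0, stop=None):
        self._source = source
        self._start = start
        self._stop = len(source) if stop is None else stop

    def __getitem__(self, key: slice) -> "LazyArray":
        # Subsetting only narrows the recorded range; nothing is read yet.
        start, stop, step = key.indices(self._stop - self._start)
        if step != 1:
            raise ValueError("this sketch supports contiguous slices only")
        stop = max(stop, start)
        return LazyArray(self._source, self._start + start, self._start + stop)

    def compute(self) -> list:
        return self._source.read(self._start, self._stop)


source = CountingSource(range(1_000_000))
lazy = LazyArray(source)

# subset -> compute: only the requested values are read.
subset_first = lazy[100:110].compute()
cost_subset_first = source.values_read

# compute -> subset: everything is read, then most of it is discarded.
source.values_read = 0
compute_first = lazy.compute()[100:110]
cost_compute_first = source.values_read
```

Both orderings return the same ten values, but the second reads the whole source to get them. On a cluster that scales automatically, that difference translates directly into cost.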

## Data types and tools

Introduce gridded and tabular data

### Tabular Data - Pandas

In [None]:
# demo loading accessing and manipulating tabular data in pandas
# dataset XBT
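
A possible starting point for this demo is sketched below with pandas. The rows are synthetic stand-ins rather than real XBT data, and the column names `profile_id`, `depth_m` and `temperature_c` are assumptions chosen for illustration. The data is read from an in-memory CSV so that the example is self-contained.

```python
import io

import pandas as pd

# Synthetic rows standing in for temperature profiles; not real XBT data.
csv_text = """profile_id,depth_m,temperature_c
1,0,18.0
1,50,15.0
1,100,12.0
2,0,20.0
2,50,16.0
2,100,12.0
"""

# Loading: read_csv accepts a path, URL or file-like object.
profiles = pd.read_csv(io.StringIO(csv_text))

# Accessing: filter rows with a boolean mask on a column.
surface = profiles[profiles["depth_m"] == 0]

# Manipulating: add a derived column and aggregate by group.
profiles["temperature_k"] = profiles["temperature_c"] + 273.15
mean_by_depth = profiles.groupby("depth_m")["temperature_c"].mean()
```

Here `mean_by_depth` is a Series indexed by depth, so the mean temperature at a given depth can be looked up by label, for example `mean_by_depth.loc[50]`.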

### Gridded Data - iris and xarray

In [None]:
# demo loading accessing and manipulating gridded data in iris/xarray (we can borrow this material from introduct)

# datasets - era5
# datasets - UKV
# dataset CMIP6
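
As a placeholder for the real datasets listed above, the sketch below builds a small synthetic grid in xarray. The variable name `t2m`, the coordinate values and the data are all made up for illustration. It shows selection by coordinate label rather than integer position, and a reduction over named spatial dimensions.

```python
import numpy as np
import xarray as xr

# Synthetic temperature field on a small lat/lon grid; all values are made up.
lats = [50.0, 52.0, 54.0]
lons = [-4.0, -2.0, 0.0, 2.0]
times = np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]")
data = np.arange(24.0).reshape(2, 3, 4)

t2m = xr.DataArray(
    data,
    dims=("time", "latitude", "longitude"),
    coords={"time": times, "latitude": lats, "longitude": lons},
    attrs={"units": "K"},
    name="t2m",
)

# Accessing by coordinate label rather than integer position.
point = t2m.sel(latitude=52.0, longitude=-2.0)
region = t2m.sel(latitude=slice(50.0, 52.0), longitude=slice(-2.0, 2.0))

# Manipulating: average over both spatial dimensions, leaving a time series.
area_mean = t2m.mean(dim=["latitude", "longitude"])
```

Label-based slices in `sel` include both end points, so `region` covers two latitudes and three longitudes for each time step.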

## Sharing data - FAIR principles

* Findable
* Accessible
* Interoperable
* Reusable

### Data Catalogues

What is a catalogue?
What are the advantages?

Example catalogues
* AWS earth - https://aws.amazon.com/earth/
* pangeo catalog - https://catalog.pangeo.io/
* STAC - https://stacspec.org/

In [None]:
# demo loading an intake catalogue

In [1]:
# demo creating an intake catalogue
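
Before the intake demos are filled in, the underlying idea can be sketched with the standard library alone. This is not intake's API: the catalogue here is a plain dictionary, and every entry name, location and owner below is a hypothetical placeholder. The sketch shows a catalogue being created, published as JSON, loaded again and searched by metadata.

```python
import json

# Hypothetical entries: names, locations and owners are placeholders.
catalogue = {
    "sea_temperature_profiles": {
        "description": "Tabular ocean temperature profiles",
        "location": "s3://example-bucket/profiles.csv",
        "format": "csv",
        "owner": "data-team@example.org",
        "keywords": ["ocean", "tabular"],
    },
    "surface_air_temperature": {
        "description": "Gridded surface air temperature",
        "location": "s3://example-bucket/t2m.zarr",
        "format": "zarr",
        "owner": "data-team@example.org",
        "chunking": {"time": 24},
        "keywords": ["atmosphere", "gridded"],
    },
}

# "Publishing": serialise the catalogue so that others can load it.
published = json.dumps(catalogue, indent=2)

# "Loading": a user reads the catalogue back and searches by metadata.
loaded = json.loads(published)


def search(entries: dict, keyword: str) -> list[str]:
    """Return the names of entries tagged with the given keyword."""
    return sorted(
        name for name, entry in entries.items() if keyword in entry["keywords"]
    )


gridded_names = search(loaded, "gridded")
```

A user who finds an entry this way needs only its name; the location, format and chunking stay in the catalogue, which is what lets storage details be hidden from the analysis code.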