# Section 4: Data - A modern approach 
![Pangeo Logo ](images/pangeo_logo_small.png)

[Pangeo Website](https://pangeo.io/)

It's all very well having a distributed, massively parallel scientific compute platform available, but if we don't use it properly, our tasks won't make best use of the infrastructure. There are two primary blockers to making best use of the platform: poorly written source code and poorly structured datasets. With the tools and libraries now available for scientific computing, such as dask, users of the platform, and specifically our first use case of Scientific Analyst or Researcher, can write code that clearly communicates its intentions to someone reading it, while the underlying libraries ensure that it is executed efficiently by the compute platform (meeting the goal of separation of concerns). The next challenge is the data we work with. One might think the data is what it is, and that it's up to the code to deal with it appropriately, but that is not the case. Those who create the dataset, primarily the second use case of Data Engineer, can do a lot to make it easier to use later. As we shall see, the challenge of presenting data so it is easy to consume and understand is an old one.

## A Historical Digression - Reading ancient texts

![Codex Sinaiticus](https://www.bl.uk/britishlibrary/~/media/bl/global/dst%20discovering%20sacred%20texts/collection%20items/codexsinaiticus-add_ms_43725_f244v.jpg?w=608&h=342&hash=11BF1F0A1DE8CAC524DE050F912D7AD7)

When we look at text, we see a lot of elements that we may not think much about, but that add a lot to our understanding of what the text contains. For example, there are punctuation marks to divide the text into sentences and phrases. In addition, there are gaps left to indicate paragraphs, quotes, lists etc. Using whitespace for meaning was by no means invented by Python! As you can see from the accompanying picture, text in ancient manuscripts often had none of these helpful elements. In some languages the vowels were not even explicitly specified. All the text was nominally there, but actually reading it, particularly reading it out loud, required a lot of skill to interpret what was recorded on the page. There was a profession of *lektors*, who were professional readers, because reading any substantial text required a lot of training and practice. Much of the burden of reading was placed on the consumer of the manuscript rather than the producer.

Over time our way of writing has evolved to provide as much help for the reader as possible, so that the person reading can focus on the actual content of the document they are reading, rather than the skill of reading itself. We expect a lot more of the writer, but they are the one who best knows the context and the meaning, and so they are best placed to help others interpret the text through whitespace, punctuation etc.

## The Goal - Data for reproducible, shareable research
That "fun" historical digression does have a point relevant to this course: we want to make it as easy as possible for our data to be used by researchers, and for data producers to provide as much help as possible in accessing, loading and analysing the data. It is not enough that the raw data is present; it needs to be described by sufficient metadata and structured for efficient access. Data consumers should require skills in the domain the data describes to understand the contents, but require as little as possible in terms of skills in the areas of data and software engineering.

We might think of different levels of descriptiveness and ease of access for datasets.

* *Raw Data* - A series of text or binary files with the data values but no description of what they mean, e.g. units. Finding the data you need and pulling it together into a coherent dataset requires knowledge from elsewhere.
* *Described data files* - Data is provided as a series of files which contain a description of the data contained within (metadata). Accessing a whole dataset requires figuring out paths to many different files, which may not have consistent metadata across files.
* *Described dataset* - All data in a dataset is accessed through a single descriptor and contains all descriptions necessary for someone skilled in the domain the data describes to interpret the data. The "behind the scenes" structure may still be a series of files and directories, but the user consumes the dataset as a unified whole.


https://pangeo.io/data.html

In creating a dataset we should ensure our data follows the FAIR principles, that is, that it is *Findable*, *Accessible*, *Interoperable* and *Reusable*, and that it is also *Analysis-Ready* and *Cloud Optimised*. Let's dig a bit deeper into what those terms really mean and what we can do to ensure we are following those principles.

## FAIR data

The FAIR principles are intended to ensure that it is easy for others to use the data that we produce, so that research can build on the work of others as efficiently as possible rather than endlessly reinventing the wheel. The principles state that data should be:

* **Findable** - A researcher should easily be able to find the data that exists relating to the problem that they are working on. This relies on a sufficiently detailed description of each dataset being contained in the metadata and being accessible without reading the whole dataset.
* **Accessible** - Once a researcher has found a dataset that they have determined will be a useful input to their pipeline in addressing a research question, they should easily be able to access that data.
* **Interoperable** - The user should be able to load the data into the tool of their choice and integrate their use of the data with the rest of their research pipeline.
* **Reusable** - Data can easily be used by others who have not played a part in creating it, and ideally should be usable by those who are not specialists in the particular domain of the data.



More Info:
* https://www.go-fair.org/fair-principles/
* https://www.nature.com/articles/sdata201618

## Analysis Ready Data

The concept of analysis ready data is closely aligned with the FAIR principles, but takes the reusable and interoperable principles in particular further. When asking if data is *analysis-ready*, we are really asking: once I have loaded the dataset in my favourite tool, is it ready to use in my analysis or for training a statistical or machine-learning model, or do I have to do a lot of preprocessing to get it into a state where it is ready to use in one of these ways? Data that is analysis ready should have the following attributes:

* *Metadata rich* - Users should be provided with an object whose metadata can be used to define operations (e.g. mean over ensemble realization).
* *Hidden infrastructure* - Analysis can be readily achieved without knowledge of the underlying storage of the data (e.g. file paths, chunking etc.), although efficiency considerations may still apply.
  * Users should not have to manually interact with the storage system; this should be handled automatically as necessary. Ideally the storage should be abstracted away from the user.
* *Simple analysis with simple code* - Basic analysis of a subset of the data (e.g. descriptive statistics, subsetting) should be possible with a minimum number of additional lines of code, and on the basis of metadata.
  * e.g. `mean(dataset, axis='time')`
* *Supports modular analysis* - Custom analytics should be possible by creating functions which take in an analysis ready dataset and return another.
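
The attributes above can be sketched with xarray, one of the libraries discussed later in this section. This is a minimal illustration rather than a recommended implementation: the data is synthetic, and the variable name `air_temperature`, the coordinate values and the helper `time_mean` are all assumptions made for the example.

```python
import numpy as np
import xarray as xr

# Synthetic values standing in for a real dataset: 4 time steps on a 2 x 3 grid.
values = np.arange(24, dtype="float64").reshape(4, 2, 3)

# Named dimensions, coordinates and units are the metadata that make the
# object "metadata rich". All names and values here are illustrative.
temperature = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={
        "time": np.arange(4),
        "latitude": [50.0, 55.0],
        "longitude": [-5.0, 0.0, 5.0],
    },
    name="air_temperature",
    attrs={"units": "K"},
)
dataset = xr.Dataset({"air_temperature": temperature})


def time_mean(ds):
    """Modular analysis: take an analysis-ready dataset and return another."""
    # The operation is defined by dimension name, not by axis position.
    return ds.mean(dim="time", keep_attrs=True)


result = time_mean(dataset)
print(result["air_temperature"].dims)
```

Because the mean is requested by dimension name, the same function works on any dataset with a `time` dimension, whatever its axis order.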


This is still a new concept, and as such all the implications and meanings are still being worked out. Partly this is because deciding whether something is analysis ready requires knowing something about what analysis is to be performed.

## Cloud optimised Data

* Metadata should be retrievable with low latency, and without pulling from a large number of individual objects (a low cost operation). It should be consolidated in some way.
* Subsets (or chunks) of data should be retrievable either using a whole object fetch, or easily computable byte-range requests, in order to leverage large scale parallelism and avoid unnecessary network transfers.
* The concept of “eventual consistency” needs to be considered: if an object in cloud storage is updated from one node and another node later reads the same object, the state change may not have propagated.
* For performance, processing on analysis-ready, cloud-optimised (ARCO) data should ideally be constructed lazily, so that no processing is carried out until explicitly requested by the user. Any computation should also be carried out close to the data, in order to avoid unnecessary network traffic.
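
To illustrate what "easily computable byte-range requests" could mean, here is a small sketch. It assumes the simplest possible layout, which real formats may not follow: uncompressed, fixed-size chunks stored back to back in a single object, after a header of known size. The function names and the numbers are hypothetical.

```python
def chunk_byte_range(chunk_index, chunk_elements, itemsize, header_bytes=0):
    """Return the inclusive (first, last) byte offsets of one chunk.

    Assumes fixed-size, uncompressed chunks stored contiguously.
    """
    chunk_bytes = chunk_elements * itemsize
    first = header_bytes + chunk_index * chunk_bytes
    last = first + chunk_bytes - 1
    return first, last


def range_header(first, last):
    """Build an HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={first}-{last}"}


# Chunk 3 of an array of 4-byte values, 1000 values per chunk, 128-byte header.
first, last = chunk_byte_range(
    chunk_index=3, chunk_elements=1000, itemsize=4, header_bytes=128
)
header = range_header(first, last)
print(header)
```

Because the offsets follow from arithmetic alone, many workers can each fetch a different chunk in parallel without first reading the rest of the object.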




The more one optimises for a specific use case, the less optimised the data is likely to be for other cases. So one wants to make data ready for as broad a spectrum of possible uses as one can, while also optimising for the most common use cases without excluding others.

## Challenges
* Cloud requirements may be different from HPC requirements. What may work well
on one platform may not work well on another.
* Unlikely to be one complete solution/concrete implementation – instead need to
agree principles such as use of open standards and interoperability.
* Solution should be language agnostic as much as possible. Informatics Lab is largely Python focussed, but R and Scala are also important in the data science community, and Fortran/C++ are in use across the Met Office. (Possibly also Java in Technology?)
* Even within one language (e.g. Python), there might be many possible “analysis ready” objects owing to different libraries. For numeric arrays, numpy is standard; for tabular data, pandas DataFrames (with dask wrapping both); for ND arrays, xarray datasets are an emerging standard, but MO prefers to use Iris for domain specific functionality. Interoperability between Iris and xarray is not brilliant.
* Users need training – order of operations can be important and have significant cost
implications. e.g: in xarray/dask, compute -> subset vs subset -> compute! Especially
worrying if using auto-scaling groups.
* Data should ideally be discoverable. Consider how to “publish” datasets, and make
them findable – which metadata are important? It should be possible for all users to
publish data to prevent duplication of effort/multiple redundant copies (metadata
might need to include e.g: responsible owner, chunking strategy, data location
(MASS vs cloud)). (consider e.g: catalogues and search)
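
The point about order of operations can be made concrete with a toy model in plain Python. It is not how xarray or dask are implemented; the `ChunkedSource` class and both helper functions are hypothetical, and simply count how many chunks each strategy reads.

```python
class ChunkedSource:
    """Toy stand-in for chunked remote storage that counts chunk reads."""

    def __init__(self, n_chunks, chunk_len):
        self.n_chunks = n_chunks
        self.chunk_len = chunk_len
        self.chunks_read = 0

    def read_chunk(self, index):
        self.chunks_read += 1
        start = index * self.chunk_len
        return list(range(start, start + self.chunk_len))


def compute_then_subset(source, start, stop):
    """Load every chunk, then slice out the part that was wanted."""
    data = []
    for index in range(source.n_chunks):
        data.extend(source.read_chunk(index))
    return data[start:stop]


def subset_then_compute(source, start, stop):
    """Work out which chunks overlap the request and load only those."""
    first = start // source.chunk_len
    last = (stop - 1) // source.chunk_len
    data = []
    for index in range(first, last + 1):
        data.extend(source.read_chunk(index))
    offset = first * source.chunk_len
    return data[start - offset:stop - offset]


eager_source = ChunkedSource(n_chunks=100, chunk_len=10)
lazy_source = ChunkedSource(n_chunks=100, chunk_len=10)
eager_result = compute_then_subset(eager_source, 42, 47)
lazy_result = subset_then_compute(lazy_source, 42, 47)
print(eager_source.chunks_read, lazy_source.chunks_read)
```

Both strategies return the same values, but the first reads every chunk while the second reads only the chunk that overlaps the request. On a platform where reads and compute are billed, that difference is the cost implication.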

The broader “common data platform”:
* Should be platform agnostic (as much as possible), with seamless working across
on-prem and cloud solutions (including federated/multi cloud). NB: noting data
locality/data gravity issues.
* For performance and cost reasons, computation is often tied to data due to data gravity, which makes multi-cloud challenging, especially if it is desirable to abstract away/hide this information from end users (cost implications).

### Lazy Loading

In [None]:
# demo lazy loading of data in iris
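
Until the Iris demo is filled in, the idea can be sketched with dask directly (Iris relies on dask for its lazy data). This is a stand-in under stated assumptions, not the Iris API: the array is generated rather than loaded from a file, and the shape and chunk sizes are arbitrary.

```python
import dask.array as da

# Describe an array split into 16 chunks; no values are created yet.
lazy = da.ones((2000, 2000), chunks=(500, 500))

# Still lazy: this only records the operations to perform on each chunk.
lazy_mean = (lazy * 2).mean()

# Work happens only when the result is explicitly requested.
result = lazy_mean.compute()
print(type(lazy_mean).__name__, float(result))
```

Until `compute()` is called, `lazy_mean` is a description of the calculation rather than a number, which is what allows a scheduler to run the chunks in parallel and close to the data.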

## Data types and tools

Introduce gridded and tabular data

### Tabular Data - Pandas

In [None]:
# demo loading accessing and manipulating tabular data in pandas
# dataset XBT
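
As a placeholder for the XBT demo, the following sketch shows the kind of loading, subsetting and aggregation pandas supports. The table is built by hand: the column names and values are invented for illustration and do not reflect the real XBT dataset's schema.

```python
import pandas as pd

# Hypothetical XBT-like records; column names and values are illustrative only.
profiles = pd.DataFrame(
    {
        "profile_id": [1, 1, 1, 2, 2, 2],
        "depth_m": [0.0, 10.0, 20.0, 0.0, 10.0, 20.0],
        "temperature_degC": [15.0, 14.0, 12.0, 18.0, 17.0, 16.0],
    }
)

# Subset rows with a boolean mask: keep measurements in the top 10 m.
shallow = profiles[profiles["depth_m"] <= 10.0]

# Aggregate by a labelled column: mean temperature per profile.
mean_by_profile = shallow.groupby("profile_id")["temperature_degC"].mean()
print(mean_by_profile)
```

Both steps refer to columns by name, so the code states what is being selected and averaged without relying on column positions.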

### Gridded Data - iris and xarray

In [None]:
# demo loading accessing and manipulating gridded data in iris/xarray (we can borrow this material from introduct)

# datasets - era5
# datasets - UKV
# dataset CMIP6
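
Pending the real datasets listed above, a small synthetic grid is enough to show label-based access to gridded data in xarray. The field name, coordinate values and units below are assumptions made for the example and are not taken from ERA5, UKV or CMIP6.

```python
import numpy as np
import xarray as xr

# Synthetic field: 2 time steps on a 3 x 4 latitude/longitude grid.
values = np.arange(24, dtype="float64").reshape(2, 3, 4)
field = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={
        "time": [0, 1],
        "latitude": [50.0, 55.0, 60.0],
        "longitude": [-10.0, -5.0, 0.0, 5.0],
    },
    name="air_temperature",
    attrs={"units": "K"},
)

# Select the grid point nearest to a location, using coordinate values.
point = field.sel(latitude=54.0, longitude=-4.0, method="nearest")

# Select a region by coordinate range (label slices include both ends).
region = field.sel(latitude=slice(50.0, 55.0), longitude=slice(-5.0, 5.0))

# Reduce over named dimensions to get one value per time step.
regional_mean = region.mean(dim=("latitude", "longitude"))
print(regional_mean.values)
```

Selection is written in terms of latitude and longitude values rather than array indices, so the user does not need to know how the grid is laid out in storage.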

## Sharing data - FAIR principles

* Findable
* Accessible
* Interoperable
* Reusable

### Data Catalogues

* What is a catalogue?
* What are the advantages?

Example catalogues
* AWS earth - https://aws.amazon.com/earth/ 
* pangeo catalog - https://catalog.pangeo.io/
* STAC - https://stacspec.org/
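
To make the idea of a catalogue concrete, here is a minimal sketch in plain Python. It does not use intake or STAC and does not follow either of their formats: the JSON layout, the entry names, the `example.org` owner and the `s3://example-bucket` locations are all hypothetical. The fields mirror the metadata suggested earlier (responsible owner, chunking strategy, data location).

```python
import json

# A hypothetical catalogue: one entry per dataset, keyed by name.
CATALOGUE_TEXT = """
{
  "example_temperature": {
    "description": "Gridded air temperature (illustrative entry)",
    "keywords": ["temperature", "gridded"],
    "owner": "data.engineer@example.org",
    "chunking": {"time": 24},
    "location": "s3://example-bucket/temperature.zarr"
  },
  "example_profiles": {
    "description": "Ocean temperature profiles (illustrative entry)",
    "keywords": ["temperature", "ocean", "tabular"],
    "owner": "data.engineer@example.org",
    "chunking": null,
    "location": "s3://example-bucket/profiles.csv"
  }
}
"""


def search(catalogue, keyword):
    """Return the names of all entries tagged with the given keyword."""
    return sorted(
        name for name, entry in catalogue.items() if keyword in entry["keywords"]
    )


catalogue = json.loads(CATALOGUE_TEXT)
ocean = search(catalogue, "ocean")
temperature = search(catalogue, "temperature")
location = catalogue["example_profiles"]["location"]
print(ocean, location)
```

The user finds a dataset by what it contains and is handed its location, so the search touches only the small catalogue file and never the data itself.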

In [None]:
# demo loading an intake catalogue

In [1]:
# demo creating an intake catalogue