# Section 1: The History and Philosophy of Scientific Computing

![Pangeo Logo](images/pangeo_logo_small.png)

In this series of lectures, we will consider not just what a Pangeo platform is and how to create one, but also the context in which the need for such a platform arose, and why existing platforms and paradigms were not (and are not) addressing the needs of scientific computing. Section 1 focuses on what users of scientific computing systems really need, and on the systems they have been provided with so far. At best these systems only partially address their needs; at worst they distort the nature of scientific enquiry so that it fits inside the technology available, rather than the technology we create being driven by the needs of scientific enquiry. We will start by looking at a few typical use cases of scientific computing systems. We will then describe existing solutions and examine their shortcomings.

## What is a scientific computing system?

A scientific computing system is one that is primarily used for creating, processing and analysing scientific data from observations and simulations for the purpose of developing scientific knowledge and understanding.

## Key concepts and components of a scientific computing system

First, let's define the necessary components of a scientific computing system.
![Conceptual Elements of a Scientific Computing System](images/SystemElements_conceptual.png)

* *User interaction device* - The user needs a way to interact with the system to develop and initiate workflows, and to interact with data through visualisations.
* *Computing resource* - At the core of a platform are some pieces of silicon (it is still all silicon for the time being) which perform many operations per second. For a time, these pieces of silicon evolved to do more and more computation without users having to do much except wait and regularly buy new silicon. Due to fundamental physical limits, a newer chip is now less likely to be faster than its predecessor and more likely to do more computation in parallel. So to get more from our silicon, we have to do computing a little differently. After converging on CPUs for all compute for a time, we are once again seeing compute diverge into specialised silicon forms such as GPUs, *field-programmable gate arrays* (FPGAs) and *application-specific integrated circuits* (ASICs).
* *Data storage* - Users are consuming, producing, transforming and analysing data. Volumes of data have grown faster than our ability to use them, often because operations are O(n^2) or O(n^3), so the cost of processing grows much faster than the data itself.
* *Software tools and libraries* - Users need software that translates the high-level operations they want to perform on data into operations on silicon. Any particular project uses a variety of general tools, general science tools and domain-specific tools. An individual may be working with several, sometimes mutually incompatible, toolsets and need to be able to switch between them. Conversely, a project may want the same toolset on multiple different platforms.
* *Source code* - One of the primary forms of expressing scientific knowledge in the current era is computer code. Theories and equations are translated into source code. Scientists need to be able to work easily and efficiently with source code in a way that facilitates robust, reproducible science rather than hindering it.
* *Results communication* - Communicating results in a variety of ways is increasingly recognised as part of the research process rather than a separate activity, and not just in applied or particularly high-impact, high-visibility research such as climate change. Scientific compute systems should also facilitate sharing key results, and the elements that underlie those results, in a secure, intuitive and elegant way.
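
To make the scaling remark in the data storage item concrete, here is a minimal sketch of how the work required by an O(n^2) or O(n^3) operation grows as data volume grows. The function name `relative_cost` and the growth factors are illustrative assumptions, not measurements from any real system.

```python
def relative_cost(growth_factor: float, exponent: int) -> float:
    """Return how much more work an O(n**exponent) operation needs
    when the input size n is multiplied by growth_factor."""
    return growth_factor ** exponent


# Hypothetical growth factors, chosen purely for illustration.
for growth in (2, 10, 100):
    linear = relative_cost(growth, 1)
    quadratic = relative_cost(growth, 2)
    cubic = relative_cost(growth, 3)
    print(
        f"data x{growth}: O(n) work x{linear:g}, "
        f"O(n^2) work x{quadratic:g}, O(n^3) work x{cubic:g}"
    )
```

The point of the sketch is only that a modest increase in data volume can translate into a much larger increase in compute demand when the analysis is not linear in the size of the data.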

![Scientific Research Loop](images/researchLoop.png)

Some key concepts associated with this description:
* **burst workflow** - Scientific research generally involves formulating a question that data and compute can help answer, then using the compute to find the answer, then reflecting on the results and formulating the next question. Compute usage is intense, but ideally for a short duration, so that you can reflect on the results as soon as possible after formulating a question, and continue around the loop to gain knowledge and understanding.
* **interactivity** - The reason we are doing large operations on big data is ultimately to interact with and reflect on the results. As data volumes are so large, we cannot just look at all the data directly. We need sophisticated analysis techniques to present reduced, processed forms to work with, which we ultimately interact with through visualisation, because humans are visual creatures.
* **scaling** - We want to be able to throw any size of problem at our compute and have it scale the compute resource up to the size of the data, so we can process data of any size quickly and intuitively, without waiting for long periods.
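
The burst workflow idea can be sketched with some simple arithmetic. The example below assumes an idealised, perfectly parallel job with no scheduling or communication overhead, which real workloads rarely achieve. The name `wall_time_hours` and the figure of 64 core-hours are hypothetical and chosen only for illustration.

```python
def wall_time_hours(core_hours: float, cores: int) -> float:
    """Wall-clock time for an idealised, perfectly parallel job (no overhead)."""
    return core_hours / cores


job_core_hours = 64.0  # hypothetical amount of work in one analysis step

for cores in (4, 64, 512):
    hours = wall_time_hours(job_core_hours, cores)
    # The total resource consumed is the same in every case ...
    consumed = hours * cores
    # ... but the time spent waiting before reflecting on the results is not.
    print(
        f"{cores:4d} cores: {hours * 60:7.1f} min wall time, "
        f"{consumed:.0f} core-hours used"
    )
```

Under these assumptions the same amount of compute is consumed whether it is spread over many hours on a few cores or a few minutes on many cores; what changes is how quickly the researcher can get back to the next question in the loop.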

## Use cases of Scientific Computing

Talking in the abstract about a platform is difficult and not very helpful when what we want from our platform is to enable and optimise the experience of specific users. So let's think about some typical users of a scientific compute platform.

These are based on use cases described in this blog post about Pangeo by Niall Robinson:
https://medium.com/informatics-lab/what-do-we-want-from-a-dream-data-platform-as-a-service-c38558c25f29

### Use case 1 - Scientific Analyst

This is perhaps the most straightforward use case, and the first one that comes to mind when we think of scientific computing. This is the person who wants to produce some analysis using observations, simulations and models of some sort, and to use the results in a paper or report. This person may want to:
* find relevant data
* run analysis at scale on a large dataset
* train a statistical or machine-learning model
* visualise subsets or reductions of a dataset
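
As a rough illustration of what "visualising a reduction of a dataset" can mean, the sketch below reduces a long series to block means while holding only one block in memory at a time. It uses only the Python standard library; the function name `block_means` and the synthetic data are assumptions made for this example, and a real platform would typically delegate this kind of work to array libraries.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def block_means(values: Iterable[float], block_size: int) -> Iterator[float]:
    """Yield the mean of each consecutive block of `block_size` values.

    Only one block is held in memory at a time, so the input can be
    much larger than the available memory.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    iterator = iter(values)
    while True:
        block: List[float] = list(islice(iterator, block_size))
        if not block:
            return
        yield sum(block) / len(block)


# Stand-in for a large dataset: a generator, so nothing is stored up front.
large_series = (float(i % 100) for i in range(1_000_000))

# Reduce one million points to one thousand, a size that is easy to plot.
summary = list(block_means(large_series, block_size=1_000))
print(len(summary), summary[:3])
```

The reduced `summary` is small enough to plot or inspect interactively, even though the full series would be unwieldy to look at directly.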


### Use case 2 - Data Engineer

This user produces data to be used by other users. They want to be able to make data available robustly, reliably and as soon as possible. They don't want data updates or downloads constrained by local usage patterns or compute constraints.

* produce a new version of a dataset
* create a new dataset that aggregates, subsets or transforms one or more other datasets
* make data available that is Analysis-Ready, Cloud-Optimised (ARCO) and follows the FAIR principles (Findable, Accessible, Interoperable, Reusable)


### Use case 3 - Software Engineer

This user develops the tools used by the other use cases. They need to be able to develop tools in the same environment in which they will be deployed and test the tools on realistic test cases using real world data.

* frequently test new developments with a suite of test cases
* test new user environments with existing software
* optimise specific use cases from other users where performance is particularly important
* debug problems experienced by other users, requiring running in the same environment as the user.

### Use case 4 - Application Developer

This user is looking to present data and analysis to a customer, which may be specific people or organisations or the general public. Tasks they might engage in include:

* create a dashboard for one or more datasets
* create custom reports based on model and observational data

### Uses not covered

There are certain use cases that we are not discussing here. For example, we have not talked about running large-scale, physics-based models. This is a very demanding, specialised case within scientific computing, for which specialised hardware in the form of a high-performance computing (HPC) system is required. We are focusing on a more general-purpose computing platform.

## Previous models of computing

Before we talk about the Pangeo computing model, let's look at two other models of computing and consider in what ways they do or do not meet the needs of the users we have introduced.

## The desktop model of computing

The desktop model of scientific computing is the original, and often still the default, way of approaching scientific analysis using a computer. While the need for specialised hardware for running simulations, in the form of supercomputers, has long been recognised, the analysis of the data produced by simulations is often ignored from a compute perspective, even though this analysis is how value is actually extracted from the investment in simulation hardware and software. What are the elements of this model?
* Compute - users have desktops which may be powerful. Each user has a dedicated desktop.
* Data - data is downloaded from local or remote servers and stored on the desktop for access
* Libraries and tools - either this is left entirely to each user, or the environment is locked down and the user has no control

What are the advantages and disadvantages of this model?

*Advantages*
* fixed up front cost
* dedicated resource per user
* easy to work interactively and intuitively

*Disadvantages*
* limited scale of compute; for bigger jobs the only option is to wait longer
* as technology advances over the course of a project, the user is unable to take advantage of those developments
* limited scale of data due to need to store and load on a desktop
* high data transfer costs, especially in time, due to slow transfer from core to leaf nodes of the network
* non-specialists need a lot of understanding of computing to do compute at any scale (poor separation of concerns)
* difficulty sharing code and tools
* sharing of resources within a team or project is very ad hoc and can be technically challenging
* sharing of results is disconnected from the rest of the work


As a result of these disadvantages, the way in which we do scientific computing, and thus our scientific inquiry itself, is shaped not by the fundamental problem we are solving or the answers we are seeking, but by how our computers work, limiting our questions to those that fit inside our small boxes. This is the wrong way around: we want our compute to be designed to make answering fundamental questions and solving fundamental problems easier.

![Desktop model of computing](images/DesktopModel_diagram.png)

## Scheduled Linux clusters

The next step up from individual desktops is to pool resources into a cluster. Access to the compute resource is through a scheduler, which ensures that all users get a fair share of the compute resources. With pooling of compute and storage, there is more scope for scaling to meet the requirements of specific jobs, so rather than waiting a long time for a large analysis job, more resources can be requested for a short time.

*Elements*:
* Compute - a cluster of compute nodes (each vaguely analogous to a powerful PC), where each user can request a portion of a node or multiple nodes for a period of time based on the needs of the particular problem they are working on.
* Data - networked storage shared across the nodes. Key data can be cached locally, with only one copy needed for multiple users.
* Tools and libraries - managed environments can be provided, reducing setup time for users.

*Advantages*:
* pooling of compute and storage for greater flexibility to meet demands
* greater utilisation of resources (return on investment)
* known upfront cost
* can be expanded/upgraded 
* central management of environment facilitates getting started
* sharing within team and project is facilitated by shared storage, compute and environment

*Disadvantages*:
* generally requires more in-house expertise to manage than the desktop model
* batch scheduling generally does not support interactivity well, or requires inefficient use of resources to do so
* central environment management can also become a barrier due to the fast-shifting requirements of projects
* clusters need to be secured so do not facilitate external sharing of results or wider collaboration

![A Diagram of a scientific computing ecosystem built around a scheduled Linux Cluster](images/LinuxScheduledClusterModel_diagram.png)

### Models & Use Cases
Having considered each model abstractly, let's consider some specifics of how each model helps or hinders each of our imaginary users.

*Analyst*
* Desktop supports a natural interactive workflow for the researcher, but limits the size of the question that can be considered
* A cluster requires the researcher to learn a lot about getting stuff running on a cluster that eats away at time for the research they are interested in
* A cluster forces an unnatural workflow on the researcher

*Data Engineer*
* Desktop model separates data production from sharing, making it likely that users will not have the latest version of a dataset that the data engineer has produced
* Producing data is a burst workflow, so helped by a cluster
* Sharing within a team/project is also helped by a cluster, but sharing with a wider interested community is not

*Software Engineer*
* The (traditional) desktop model makes developing and testing for a particular environment difficult when the users of the software you write may have quite a different environment.
* This makes it difficult to ensure the software gives reproducible results, and also makes it difficult to reproduce the bugs and problems of users.
* A cluster makes it easier to deploy for the team, but sharing your software more widely is difficult

*Application Developer*
* Desktop model does not support deploying shared applications, so other servers are required that are quite separate from the usual scientific compute infrastructure. This can be a big "science to services" gap.
* Cluster can support internal applications, but is usually not configured for an application server model, so still usually requires some server outside the usual research compute.
* Cluster can be a backend for generating results, but turnaround time is usually too slow for proper interactivity.


## Existing models: the verdict

We can see from our discussion of use cases and existing models that there is a fundamental tension between two requirements: an intuitive, interactive workflow for performing the compute required for scientific research, and the need to process data at scale to do that same research.

Our desktop supports interactivity, but with hard limits on scale that can only be circumvented by waiting a long time, which removes the interactivity. In our Linux cluster, we have much greater compute power available that can be deployed as required for particular jobs and projects. A key cost of this batch compute power is that it tends not to be very interactive. Whereas on the desktop you could easily iterate through a problem, firing off computing tasks when you need to and quickly visualising or otherwise interacting with the results, the batch model means you have to structure your jobs a certain way, send them off to be computed, wait for them to run and return, and then follow some other process to interact with the results.

In both cases, our research questions and process are being squashed and shrunk by the need to fit the problem to the compute. This is partly because neither system has been designed for the requirements of today's scientific computing. We want a system that combines the easy interactivity of working on a local desktop with the ability to scale provided by a compute cluster. How might we start designing and building such a system?