diff --git a/01_history_philosophy_scientific_computing.ipynb b/01_history_philosophy_scientific_computing.ipynb index 247c6ac..b6fb7d3 100644 --- a/01_history_philosophy_scientific_computing.ipynb +++ b/01_history_philosophy_scientific_computing.ipynb @@ -2,10 +2,11 @@ "cells": [ { "cell_type": "markdown", - "id": "swedish-granny", + "id": "coordinated-sunglasses", "metadata": {}, "source": [ "# Section 1: The History and Philosophy of Scientific Computing\n", + "![Pangeo Logo](images/pangeo_logo_small.png)\n", "\n", "What is a Pangeo? Is it :\n", "* a computing platform\n", @@ -17,12 +18,14 @@ "\n", "It is, in a sense, the last one, though *Pangeo as a Lifestyle* (PaaL) is still in the early stages of development. Some users may interact with Pangeo through a particular implementation and get the sense it is a platform or software package, if one were to pick just one of those, it would be a community that has come together to develop tools and recipes for tackling challenges around big data in scientific computing. The mindset is also important, because there is no one set of tools to address this, and the tools keep evolving. Instead it is about approaching those challenges in a certain way.\n", "\n", - "In this series of lectures, we will be considering not just what a Pangeo platform is and how to create it, but also the context in which the need for such a platform arose and why existing platforms and paradigms were (and are) not addressing the needs of scientific computing then and now. Section 1 will be focused on what user of scientific compouting systems really need and what systems they have been provided with so far that at best only partially adress their needs, and in a worst case distort the nature of the scientific enquiry so that fits inside thecnology available, rather than technology we create being driven by the needs of scientific enquiry. We will start by looking at a few typical use cases of scientific computing systems. 
We will then describe existing solutions and examine their shortcomings." + "In this series of lectures, we will be considering not just what a Pangeo platform is and how to create it, but also the context in which the need for such a platform arose and why existing platforms and paradigms were (and are) not addressing the needs of scientific computing then and now. Section 1 will be focused on what users of scientific computing systems really need and what systems they have been provided with so far that at best only partially address their needs, and in a worst case distort the nature of the scientific enquiry so that it fits inside the technology available, rather than the technology we create being driven by the needs of scientific enquiry. We will start by looking at a few typical use cases of scientific computing systems. We will then describe existing solutions and examine their shortcomings.\n", + "\n", + "Pangeo community website: https://pangeo.io/" ] }, { "cell_type": "markdown", - "id": "clean-antique", + "id": "obvious-developer", "metadata": {}, "source": [ "## Key concepts and components of scientific system\n", @@ -32,7 +35,7 @@ }, { "cell_type": "markdown", - "id": "classified-terror", + "id": "vital-civilian", "metadata": {}, "source": [ "First lets define the necessary components of a scientific computing system.\n", @@ -46,7 +49,7 @@ }, { "cell_type": "markdown", - "id": "waiting-comparison", + "id": "expired-estimate", "metadata": {}, "source": [ "Some key concepts associated with this description:\n", @@ -57,7 +60,7 @@ }, { "cell_type": "markdown", - "id": "oriental-request", + "id": "headed-public", "metadata": {}, "source": [ "## Use cases of Scientific Computing" @@ -65,7 +68,7 @@ }, { "cell_type": "markdown", - "id": "capital-joining", + "id": "arbitrary-template", "metadata": {}, "source": [ "Talking in the abstract about a platform is difficult and note very helpful when what we want from our platform is to enable and optimise the experiences for 
specific usesd. So lets think about some typical users of a scientific compute platform.\n", @@ -76,7 +79,7 @@ }, { "cell_type": "markdown", - "id": "atlantic-object", + "id": "settled-rates", "metadata": {}, "source": [ "### Use case 1 - Scientific Analyst\n", @@ -92,7 +95,7 @@ }, { "cell_type": "markdown", - "id": "turkish-steps", + "id": "specialized-regulation", "metadata": {}, "source": [ "### Use case 2 - Data Engineer\n", @@ -106,7 +109,7 @@ }, { "cell_type": "markdown", - "id": "acquired-offer", + "id": "frank-handling", "metadata": {}, "source": [ "### Use case 3 - Software Engineer\n", @@ -121,7 +124,7 @@ }, { "cell_type": "markdown", - "id": "stopped-jerusalem", + "id": "reasonable-clarity", "metadata": {}, "source": [ "### Use Case 4 - Application Developer\n", @@ -134,7 +137,7 @@ }, { "cell_type": "markdown", - "id": "understood-ferry", + "id": "lined-puppy", "metadata": {}, "source": [ "### Uses not covered\n", @@ -144,7 +147,7 @@ }, { "cell_type": "markdown", - "id": "indie-gateway", + "id": "awful-apollo", "metadata": {}, "source": [ "## Previous models of computing\n", @@ -154,7 +157,7 @@ }, { "cell_type": "markdown", - "id": "promotional-participation", + "id": "fresh-employer", "metadata": {}, "source": [ "## The desktop model of computing\n", @@ -187,7 +190,7 @@ }, { "cell_type": "markdown", - "id": "vulnerable-accessory", + "id": "aerial-ancient", "metadata": {}, "source": [ "![Desktop model of computing ](images/DesktopModel_diagram.png)" @@ -196,14 +199,14 @@ { "cell_type": "code", "execution_count": null, - "id": "interior-package", + "id": "motivated-welsh", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "lined-command", + "id": "living-fortune", "metadata": {}, "source": [ "## Scheduled Linux clusters\n", @@ -233,7 +236,7 @@ }, { "cell_type": "markdown", - "id": "exceptional-birmingham", + "id": "occasional-filling", "metadata": {}, "source": [ "![A Diagram of a scientific computing ecosystem built around 
a scheduled Linux Cluster ](images/LinuxScheduledClusterModel_diagram.png)" @@ -241,7 +244,7 @@ }, { "cell_type": "markdown", - "id": "amino-intensity", + "id": "million-india", "metadata": {}, "source": [ "### Models & Use Cases\n", @@ -270,7 +273,7 @@ }, { "cell_type": "markdown", - "id": "quiet-commerce", + "id": "southwest-hospital", "metadata": {}, "source": [ "## Existing model: the verdict\n", @@ -291,7 +294,7 @@ { "cell_type": "code", "execution_count": null, - "id": "usual-hebrew", + "id": "immune-bracelet", "metadata": {}, "outputs": [], "source": [] @@ -299,9 +302,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "pangeo_lectures", "language": "python", - "name": "python3" + "name": "pangeo_lectures" }, "language_info": { "codemirror_mode": { @@ -313,7 +316,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.7.10" } }, "nbformat": 4, diff --git a/02_building_blocks_of_scientific_computing-Copy1.ipynb b/02_building_blocks_of_scientific_computing-Copy1.ipynb index de0e10c..4127b58 100644 --- a/02_building_blocks_of_scientific_computing-Copy1.ipynb +++ b/02_building_blocks_of_scientific_computing-Copy1.ipynb @@ -2,20 +2,22 @@ "cells": [ { "cell_type": "markdown", - "id": "exterior-eugene", + "id": "reflected-arthur", "metadata": {}, "source": [ "# Section 2: The building blocks of a Scientific Computing Platform\n", "\n", "![Pangeo Logo](images/pangeo_logo_small.png)\n", "\n", + "[Pangeo Website](https://pangeo.io/)\n", + "\n", "In the first section we introduced some typical users of a scientific compute platfo\n", "rm and typical tasks that such users may wish to perform on such a platform. We then looked at 2 models for delivering the compute capability that users require, the desktop model and the cluster model. Each of these has its advantages and disadvantages. 
The next step is to consider how we could design and build a platform that combines the advantages of different systems and removes (as much as possible) the disadvtages. In this notebook we will discuss those principles by looking at the key goals of what we'll call the *Pangeo model of Scientific Computing*." ] }, { "cell_type": "markdown", - "id": "cellular-terror", + "id": "continental-thong", "metadata": {}, "source": [ "![Pangeo model of scientific computing](images/PangeoModel_diagram.png)" @@ -23,7 +25,7 @@ }, { "cell_type": "markdown", - "id": "intellectual-announcement", + "id": "located-chess", "metadata": {}, "source": [ "## Goal 1 - An interactive platform that scales\n", @@ -33,7 +35,7 @@ }, { "cell_type": "markdown", - "id": "nasty-lawsuit", + "id": "infectious-american", "metadata": {}, "source": [ "### Affordable data storage\n", @@ -50,7 +52,7 @@ }, { "cell_type": "markdown", - "id": "emerging-imagination", + "id": "social-acrobat", "metadata": {}, "source": [ "### Distributed compute\n", @@ -62,7 +64,7 @@ }, { "cell_type": "markdown", - "id": "auburn-manor", + "id": "scenic-warner", "metadata": {}, "source": [ "There are multiple levels of distribution:\n", @@ -74,7 +76,7 @@ }, { "cell_type": "markdown", - "id": "specialized-manitoba", + "id": "cathedral-crawford", "metadata": {}, "source": [ "This distributed nature of compute is another factor in creating **cloud-optimised** datasets, so our data facilitates our exploitation of this massively distributed computed resource rather than hinders it." 
@@ -82,7 +84,7 @@ }, { "cell_type": "markdown", - "id": "progressive-storage", + "id": "placed-consortium", "metadata": {}, "source": [ "### Scalable compute\n", @@ -93,7 +95,7 @@ }, { "cell_type": "markdown", - "id": "wound-result", + "id": "internal-maine", "metadata": {}, "source": [ "### Elastic compute\n", @@ -107,7 +109,7 @@ }, { "cell_type": "markdown", - "id": "introductory-edwards", + "id": "criminal-reporter", "metadata": {}, "source": [ "### Interactive workflow\n", @@ -122,7 +124,7 @@ }, { "cell_type": "markdown", - "id": "metric-infection", + "id": "partial-stereo", "metadata": {}, "source": [ "## Goal 2 - Reproducible Research" @@ -130,7 +132,7 @@ }, { "cell_type": "markdown", - "id": "informational-copying", + "id": "ultimate-turner", "metadata": {}, "source": [ "Computing underlies in some way many of the scientific results published in peer-reviewed jounral, through statistical analysis, visulaisation of data, simulations etc. The source code that produces the published results is thus a key output of the scientific process. In order to trust the scientific result derived from the source code and the undelying libraries, we need to trust that the code is doing what it set out to do (verification) and that what the software aims to do is scientitfically correct (validation) \n", @@ -144,7 +146,7 @@ }, { "cell_type": "markdown", - "id": "impossible-avenue", + "id": "geographic-geometry", "metadata": {}, "source": [ "## Goal 3 - Shareable research" @@ -152,7 +154,7 @@ }, { "cell_type": "markdown", - "id": "crude-carrier", + "id": "realistic-watershed", "metadata": {}, "source": [ "The goal of shareable research is closely aligned with the reproducible research and is made possible by many of the same components. Historically sharing of research has been done just through what is published in the peer-reviewed journal. The element underpinning those results have not been shared by default. 
Today we recognise that the elements that underpin the published results, including source code and environments used to generate the results, are also very important to other researchers and if we want to share our results, we want interested parties to see all elements of that research. We need to be able to share our code and data. Other researchers can pick apart our research pipeline and interrogate each part themselves. This is partly to review research results, but also so elements that are useful can be pulled and reused in other research that shares common elements. A common, easily accessible platform for computing makes it much easier to share research pipelines with others" @@ -160,7 +162,7 @@ }, { "cell_type": "markdown", - "id": "sunset-commissioner", + "id": "union-sport", "metadata": {}, "source": [ "## Goal 4 - Cost effective compute" @@ -168,7 +170,7 @@ }, { "cell_type": "markdown", - "id": "centered-trust", + "id": "corporate-starter", "metadata": {}, "source": [ "In many ways the barriers to a scalable, interactiver platform suitable for today's scientific computing come down to one concept: cost. We could, to a certain extent, create such a platform by spending enough money to give each scientist a large enough desktop, though in many domains the size of data to be processed makes even this impractical. Cost limits are usually hit first. Our Pangeo platform needs to provide the interactive compute which scales elastically to meet demand at a cost that is affordadble for your average researcher/research institution. Cloud computing service have made this possible. One pays only for the resources that are actually used while running your job. When the job is finished, the resource is relinquished and no further cost is incurred (for compute, any data stored needs to continue to be stored usually, which incurs an ongoing cost, but this is relatively small)." 
@@ -176,7 +178,7 @@ }, { "cell_type": "markdown", - "id": "romantic-settle", + "id": "covered-universe", "metadata": {}, "source": [ "## Goal 5 - Separation of concerns\n", @@ -187,7 +189,7 @@ }, { "cell_type": "markdown", - "id": "emotional-springfield", + "id": "brilliant-beaver", "metadata": {}, "source": [ "## Building a real system\n", @@ -197,15 +199,17 @@ }, { "cell_type": "markdown", - "id": "sublime-china", + "id": "right-andorra", "metadata": {}, "source": [ + "Pangeo Architecture: https://pangeo.io/architecture.html\n", + "\n", "As you might expect, we not are the first people to ever think about this. Why, then, do we think we might have a better chance of doing a better job? What has changed? There has been a significant development in the past 20 years in terms of availability of compute and how we access and pay for it that makes a different model of scientific compute, which it is worth discussing briefly before describing the goals of our Pangeo platform." ] }, { "cell_type": "markdown", - "id": "thirty-armstrong", + "id": "tropical-pencil", "metadata": {}, "source": [ "### Cloud computing\n", @@ -224,7 +228,7 @@ }, { "cell_type": "markdown", - "id": "prostate-poetry", + "id": "angry-delaware", "metadata": {}, "source": [ "### A stack of standard tools\n", @@ -234,9 +238,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "pangeo_lectures", "language": "python", - "name": "python3" + "name": "pangeo_lectures" }, "language_info": { "codemirror_mode": { @@ -248,7 +252,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.7.10" } }, "nbformat": 4, diff --git a/03_scalable_interactive_compute_pangeo_implementation.ipynb b/03_scalable_interactive_compute_pangeo_implementation.ipynb index a87eabe..bab0b21 100644 --- a/03_scalable_interactive_compute_pangeo_implementation.ipynb +++ b/03_scalable_interactive_compute_pangeo_implementation.ipynb @@ -2,130 
+2,1550 @@ "cells": [ { "cell_type": "markdown", - "id": "indoor-retail", + "id": "prime-graph", "metadata": {}, "source": [ - "# Section 3: The Pangeo Approach - An Implementation Guide" + "# Section 3: The Pangeo Approach - An Implementation Guide\n", + "![Pangeo Logo](images/pangeo_logo_small.png)\n", + "\n", + "[Pangeo Website](https://pangeo.io/)" + ] + }, + { + "cell_type": "markdown", + "id": "solid-invitation", + "metadata": {}, + "source": [ + "![A stack of technologies](images/pangeoStackElements_buildYourOwn.png)" + ] + }, + { + "cell_type": "markdown", + "id": "unexpected-oxygen", + "metadata": {}, + "source": [ + "In this section we will look at specific tools that allow you to build your Pangeo implementation. As the diagram shows, a Pangeo implementation is not built from one fixed stack of tools, or even from one of a selection of monolithic stacks. Rather, there are certain categories for which a specific tool must be chosen, but each of the tools should be able to be swapped in and out depending on the requirements of the specific implementation. The categories of tools include:\n", + "* *Compute platform* - Where will the actual computation take place? Options - HPC, public cloud provider (AWS, Azure, Alibaba, Digital Ocean), private cloud (e.g. European Weather Cloud, JASMIN), Cluster, Local Machine\n", + " * Compute mode - How will compute be triggered? Options - Interactive notebooks, batch scheduler, serverless architecture (e.g. AWS Lambda)\n", + " * How will my compute be scaled elastically? Options - Kubernetes, AWS ECS, Dask Cloud Provider\n", + "* *Data storage* - Where will the data be stored? Options - Distributed cloud storage, Relational Database, Data warehouse\n", + " * Data format - What format will the data be stored in? 
Options - NetCDF, CSV, RDS, Zarr, TileDB\n", + " * Data model - What will handle interpreting data and metadata as a cohesive data model?\n", + " * Data arrays - What will handle the raw processing of arrays of numbers? Options: dask, numpy\n", + "* *Interaction* - How will I interact with the compute and data? Options - Jupyter notebook, dashboard website\n", + "* *Environment management* - What will create the software environment for my research? Options: conda, pip, docker containers\n" + ] + }, + { + "cell_type": "markdown", + "id": "mobile-stockholm", + "metadata": {}, + "source": [ + "## Cloud computing\n", + "\n", + "The advent of cloud computing services allows us to provision the compute services we need for scientific computing in a completely different way, and one that is more suitable for the sort of workflows and expertise that we can expect an average researcher to have. We can use the different core services of the cloud providers for the different elements of a Pangeo system in different ways. Although you can generally swap different elements in and out of the stack of a particular Pangeo implementation, you generally need to choose one platform provider, e.g. AWS, Azure etc., as it is either necessary or optimal for that provider's services to work together. We will now take a look at what cloud services we might use in our stack. \n" + ] + }, + { + "cell_type": "markdown", + "id": "freelance-civilization", + "metadata": {}, + "source": [ + "### Low-level services\n", + "\n", + "When setting up a compute platform we start by thinking in terms of low-level components of CPUs and storage space. All major *cloud service providers* (CSPs) have comparable offerings in this space. 
The ability to easily provision computing resources like these is often called **infrastructure as a service**.\n", + "\n", + "https://en.wikipedia.org/wiki/Infrastructure_as_a_service\n", + "\n", + "The table below shows the names of the comparable services on different platforms.\n", + " \n", + " \n", + "Service / Provider | AWS | Azure | GCP \n", + "--- |--- |--- |--- \n", + "Compute (VM) | EC2 | Azure VM | Compute Engine\n", + "Object Storage | S3 | Blob Storage | Cloud Storage\n", + "\n", + "Comparison of offerings: http://comparecloud.in\n", + "\n", + "Our Pangeo implementation will use the APIs provided by CSPs to quickly obtain the resources needed to spin up our platform and configure them for appropriate access, interoperability and security." + ] + }, + { + "cell_type": "markdown", + "id": "composed-scholar", + "metadata": {}, + "source": [ + "### High-level service - Platform as a Service etc.\n", + "\n", + "As the offerings from CSPs have developed, new, more specialised services have been created. Using low-level services, users have to set up all aspects of the environment for their particular application, choosing appropriate configurations for sharing, security, reliability etc. Usually this requires specialised software engineers or infrastructure engineers. For a large organisation, there are sufficient people and skills to maintain the goal of separation of concerns, but this is not true for smaller groups and organisations. Instead, one can use higher-level services where the technical details are taken care of. Increasingly, higher-level service components, such as data warehousing and machine learning platforms where low-level configuration is taken care of, are becoming part of the software stack for Pangeo implementations. 
\n", + "\n", + "https://en.wikipedia.org/wiki/Platform_as_a_service\n", + "\n", + "Service / Provider | AWS | Azure | GCP \n", + "--- |--- |--- |--- \n", + "Machine Learning | SageMaker | AzureML | DataLab / Cloud AutoML\n", + "Database | RDS | Azure SQL DB | Cloud SQL\n", + "Data Warehouse | \n", + "Query aaS | Athena | Data Lake Analytics | BigQuery" + ] + }, + { + "cell_type": "markdown", + "id": "offensive-romance", + "metadata": {}, + "source": [ + "We also have third-party providers of these higher-level services, building value-added layers on the low-level infrastructure of major CSPs to deliver specialised services, for example database solutions ([TileDB](https://tiledb.com/) or [MongoDB](https://www.mongodb.com/cloud)) or machine learning platforms ([Determined AI](https://determined.ai/enterprise/)).\n", + "\n", + "CSPs liken the development of cloud computing to the development of an integrated grid for electricity distribution. In the early days of electricity each factory had its own generators and required expertise in electrical engineering. With an electricity grid, central suppliers provide the electricity and the associated expertise to run it. This is the direction that computing is going in. The trade-off is that while higher-level services are easier to get started with and use, they are often less portable, resulting in vendor lock-in. So we balance the convenience of higher-level services in our Pangeo implementation with the goals of reproducible, shareable research, which favour open-source tools deployed on low-level services." 
+ ] + }, + { + "cell_type": "markdown", + "id": "opponent-harvey", + "metadata": {}, + "source": [ + "### More information on cloud providers\n", + "\n", + "* AWS\n", + "* Azure\n", + "* Digital Ocean\n", + "\n", + "Comparison of cloud providers: https://www.varonis.com/blog/aws-vs-azure-vs-google/" + ] + }, + { + "cell_type": "markdown", + "id": "curious-influence", + "metadata": {}, + "source": [ + "## Creating and sharing the tool stack\n", + "\n", + "One of the challenges of computing platforms is setting up the right environment of tools and libraries to support the scientific research being done, without this task consuming all the researcher's time. There have been substantial developments in this space which make this task easier, support the goals of reproducible and shareable research, and aid in separation of concerns.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "undefined-fellowship", + "metadata": {}, + "source": [ + "### Environment managers - pip and conda\n", + "\n", + "Particularly in the Python ecosystem, tools such as *pip* and *conda* allow one to specify the tools to deploy on a particular compute instance as a file, allowing an **infrastructure as code** (IaC) approach to tools. As with cloud resource provision, complete specification of the configuration as a file allows others to reproduce the environment and thus reproduce the scientific research. This is not yet always as easy and trouble-free as we would hope, but these tools have gone a long way towards this goal and are often used as part of a Pangeo implementation to configure the research environment.\n", + "\n", + "Additional info\n", + "* pip https://pip.pypa.io/en/stable/\n", + "* conda https://docs.conda.io/en/latest/\n" + ] + }, + { + "cell_type": "markdown", + "id": "worldwide-departure", + "metadata": {}, + "source": [ + "## Containers\n", + "\n", + "Another similar tool is *containers*. 
These behave much like lightweight virtual machines, although they share the host operating system's kernel rather than virtualising hardware, and are intended for running a single task efficiently and at scale. As with environment managers, you completely configure an *image* through a configuration file which specifies what should be installed and configured inside the container. You then create a particular *instance* of your container from the image. One can create hundreds of instances to run in parallel. Compute jobs can then be distributed among these containers at a task or distribution level, to make use of the massively parallel, distributed nature of the compute and data storage infrastructure. Over time, repositories of ready-made container images have built up, so a researcher should not need to do much configuration to get started.\n", + "\n", + "\n", + "\n", + "Additional information\n", + "* Docker: tool for creating and running containers - https://docker.com \n", + "* Docker Hub: library of ready-to-use containers - https://hub.docker.com/" + ] + }, + { + "cell_type": "markdown", + "id": "lyric-begin", + "metadata": {}, + "source": [ + "## Orchestration\n", + "\n", + "The challenge in distributed computing is always getting the many individual workers to coordinate the work they are doing. Before they start doing any actual work, the cluster of workers must be set up appropriately from the cloud resources we have requested to enable this inter-task communication. This is the job of orchestration software. 
Once again, we use an infrastructure as code approach to specify how many workers we want and how they should be configured, and the orchestration software then acquires and sets up the resources, such as cloud VMs running containers.\n", + "\n", + "Additional info:\n", + "* Kubernetes - https://kubernetes.io/\n", + "* AWS Elastic Container Service - https://aws.amazon.com/ecs/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc\n", + "* Azure Container services - https://azure.microsoft.com/en-gb/product-categories/containers/" + ] + }, + { + "cell_type": "markdown", + "id": "military-devices", + "metadata": {}, + "source": [ + "## Task Distribution\n", + "\n", + "With our compute cluster set up and running, we then need a way to execute our tasks in a distributed fashion. We need a way to handle splitting our large dataset into sub-domains, where a particular operation can be performed separately on each sub-domain on a separate compute worker. One library that does this is dask." + ] + }, + { + "cell_type": "markdown", + "id": "significant-advancement", + "metadata": {}, + "source": [ + "Dask is a task scheduling library which supports **lazy execution**. This means that it doesn't actually do any calculations until it needs to. So when you string together a series of operations, for example:\n", + "\n", + "* load data\n", + "* extract subset for a country or region\n", + "* calculate mean for each year for country\n", + "* plot annual means\n", + "\n", + "the calculation will only be triggered when you try to plot the data, as it then needs the actual numbers. Before that point it creates a [*task graph*](https://docs.dask.org/en/latest/graphs.html), describing all the tasks that need to be computed and which tasks are dependent on other tasks. When it decides it needs the results, all of the elements in the graph are calculated in the order required by the dependencies. 
" + ] + }, + { + "cell_type": "markdown", + "id": "crucial-commitment", + "metadata": {}, + "source": [ + "How does it do this is a massively parallel way to speed up execution? There are three parts to the dask compute resources\n", + "\n", + "* a client - usually the computer we are interacting on \n", + "* a scheduler - the instance the divides up the task and communicates with the workers\n", + "* workers - compute instances doing the actual work sent to them by the scheduler.\n", + "\n", + "When computation is triggered, the scheduler figures out how to assign jobs to workers in the correct order according to the task graph, and then gets the results back from the workers. The task graph will split up operation into parallel operations (*task distribution*) as well as splliting large array by chunk (*data distribution*). Each task is a separate node in the graph and will be sent to a separate worker. This allows for massive, elastic, scaling of compute organised interactively." + ] + }, + { + "cell_type": "markdown", + "id": "persistent-justice", + "metadata": {}, + "source": [ + "Additional Information:\n", + "* dask -https://dask.org/\n", + "* dask distributed - https://distributed.dask.org/en/latest/\n", + "* dask cloud provider https://cloudprovider.dask.org/en/latest/" + ] + }, + { + "cell_type": "markdown", + "id": "defensive-syntax", + "metadata": {}, + "source": [ + "## Distributed computing with dask - A Demonstration\n", + "\n", + "At last we come to a demonstration of actually computing with actual source code. This will show how we can do a fairly simple mean calculationm on a large array. This seem quite simple, but is similar to many of the questions we want to ask, which are string together a series of simple operation, often something subset by time and location, calculate mean,min and max. The challenge is doing so an an ensemble of global climate predictions for 100 years! 
So what does this look like in dask?\n", + "\n", + "We start by creating a client object, which also creates connected scheduler and worker objects. We're creating this locally, but this could be connected to a cluster on any sort of infrastructure:\n", + "* local machine\n", + "* cloud cluster\n", + "* on-premises cluster\n", + "* HPC\n", + "\n", + "The cluster abstracts away the details of the implementation." ] }, { "cell_type": "code", - "execution_count": null, - "id": "appreciated-crest", + "execution_count": 15, + "id": "caring-writer", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import dask.distributed" + ] }, { "cell_type": "markdown", - "id": "supposed-billy", + "id": "political-moore", "metadata": {}, "source": [ - "## Cloud computing" + "Our client shows some details and provides us with a dashboard we can look at, which for a local cluster is at `localhost:8787/status`" ] }, { "cell_type": "code", - "execution_count": null, - "id": "taken-freeze", + "execution_count": 16, + "id": "consolidated-rebel", "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
Client\n", + "Cluster: Workers: 4, Cores: 8, Memory: 33.73 GB\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "client = dask.distributed.Client()\n", + "client" + ] }, { "cell_type": "markdown", - "id": "quick-regulation", + "id": "composite-beijing", "metadata": {}, "source": [ - "## Containers" + "Now we can set up our computation. In this case we want to find the mean of an array. To use dask for our computation, we use a dask array data structure rtather than a standard numpy array. Dask aims to present the same interface for major data type, for example\n", + "* numpy array - dask array\n", + "* pandas dataframe - dask dataframe\n", + "\n", + "You can also manually create a graph through creating *delayed* functions through the dask API, where normal python functions are added to a task graph to be executed later. Here we are using the dask data types, and this will construct our task graph for us." ] }, { "cell_type": "code", - "execution_count": null, - "id": "nasty-portrait", + "execution_count": 17, + "id": "controlling-alert", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import dask.array" + ] }, { - "cell_type": "markdown", - "id": "narrative-stephen", + "cell_type": "code", + "execution_count": 19, + "id": "worldwide-interest", "metadata": {}, + "outputs": [], "source": [ - "## orchestration - kubernetes" + "my_array = dask.array.random.random((1000, 1000), chunks=(100, 100))" ] }, { "cell_type": "code", - "execution_count": null, - "id": "collect-undergraduate", + "execution_count": 20, + "id": "statutory-secondary", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Array Chunk
Bytes 8 B 8 B
Shape () ()
Count 239 Tasks 1 Chunks
Type float64 numpy.ndarray
\n", + "
\n", + "\n", + "
" + ], + "text/plain": [ + "dask.array" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_array.mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "adult-coordination", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "my_array.mean().compute()" + ] }, { "cell_type": "markdown", - "id": "floating-philadelphia", + "id": "polar-potato", "metadata": {}, "source": [ - "## distribution - Dask" + "![Dask Task Graph](images/dask_taskGraph.png)\n", + "This is what the graph looks like for our small operation. Each chunk is a node in the graph, and the scheduler then gathers the results together to present in our notebook.\n", + "\n", + "Task graphs can also be visualised using the Graphviz library: \n", + " * https://docs.dask.org/en/latest/graphviz.html\n", + " \n", + "With these technologies, our scientific compute platform can scale to deal with an arbitrarily big dataset. We add more distributed workers and split the processing into more tasks in the task graph, which those workers can execute in parallel. This is similar in some ways to a scheduled cluster system, with the difference being that the way it is triggered can be transparent to the user, who simply writes code in a natural way to perform operations on arrays, while dask distributes the work.\n", + "\n", + "Video demonstration of scaling with a dask cluster:\n", + "* https://www.youtube.com/watch?v=R2xntfsDxtA" + ] + }, + { + "cell_type": "markdown", + "id": "functional-saying", + "metadata": {}, + "source": [ + "## Interactivity & portability\n", + "\n", + "What about interacting with our system? 
A core goal of our system was a high level of rich interaction that provided an intuitive experience for the scientific researcher, augmenting rather than interrupting their train of thought as they consider their current research question. The desktop naturally provides an interactive workflow, but doesn't scale easily. We need a solution that is hosted close to the scalable compute. A browser-based solution means that we can have a graphically-rich, interactive experience in the cloud, using our desktop/laptop/tablet as a *thin-client*. The most common tool of this sort is the **Jupyter notebook**, which is how this material is being presented.\n", + "\n", + "Jupyter notebooks are an example of the [*literate programming*](https://en.wikipedia.org/wiki/Literate_programming) paradigm described by computer scientist Donald Knuth, where the presentation of the computer code is accompanied by an explanation of what it does, and the order and structure of presentation is focused on the human audience for the code, rather than the computer which will execute it.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "surprised-cheat", + "metadata": {}, + "source": [ + "### Components and structure\n", + "Jupyter notebooks are composed of cells, which can be code or markdown. Through these they contain three elements:\n", + "* *code*: in a code cell - you run a cell to execute all code in the cell. \n", + " * code executes in a specified *kernel*. \n", + "* *results*: below a code cell - when a code cell is run, it may produce a result as text, an image such as a plot, or rich content specified as HTML, which might be interactive.\n", + "* *markdown fragments including text, images etc.*: a markdown cell - Explanation of the code and results is found in markdown cells, which should be interspersed with the code and results." 
+ ] + }, + { + "cell_type": "markdown", + "id": "smooth-elephant", + "metadata": {}, + "source": [ + "For examples of how notebooks can be used as part of an interactive computing experience, there are many galleries with real world demonstrations including:\n", + "* Pangeo gallery http://gallery.pangeo.io/\n", + "* Iris examples https://scitools-iris.readthedocs.io/en/latest/generated/gallery/index.html" + ] + }, + { + "cell_type": "markdown", + "id": "focused-spirituality", + "metadata": {}, + "source": [ + "## Visualisation and Dashboarding\n", + "\n", + "A key component of interactivity is to visualise the data we are using and the results we produce. A Pangeo implementation will include tools in the software stack for visualisation. Common visualisation tools used with Pangeo include:\n", + "* [Matplotlib](https://matplotlib.org/) Widely-used Python library for plotting, primarily creates static output.\n", + "* [Bokeh](https://docs.bokeh.org/en/latest/index.html) A Python library to create JavaScript web pages for interactive visualisations.\n", + "* [Plotly](https://plotly.com/python/) An ecosystem for creating data science focused apps and visualisations\n", + "\n", + "Combined with the scaling provided by dask, we can easily load and analyse large datasets and visualise the results. More than this, we can use interactive tools with this capacity to generate results on demand. " + ] + }, + { + "cell_type": "markdown", + "id": "catholic-principle", + "metadata": {}, + "source": [ + "Let's look at an example of analysing climate data from the Coupled Model Intercomparison Project 6th phase (CMIP6) on the fly using the stack we have built up.\n", + "\n", + "http://gallery.pangeo.io/repos/pangeo-gallery/cmip6/intake_ESM_example.html\n", + "\n", + "We are loading our data through an *intake catalog*, which we will investigate further in section 4, which is focused on data." 
] }, { "cell_type": "code", - "execution_count": null, - "id": "cleared-copyright", + "execution_count": 1, + "id": "figured-syndrome", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import intake" + ] }, { "cell_type": "code", - "execution_count": null, - "id": "overhead-species", + "execution_count": 2, + "id": "pharmaceutical-lender", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import numpy\n", + "import iris\n", + "import matplotlib.pyplot\n", + "import iris.quickplot" + ] }, { "cell_type": "code", - "execution_count": null, - "id": "civic-winner", + "execution_count": 3, + "id": "reported-karaoke", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

pangeo-cmip6 catalog with 6851 dataset(s) from 442624 asset(s):

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
unique
activity_id17
institution_id35
source_id85
experiment_id160
member_id550
table_id37
variable_id709
grid_label10
zstore442624
dcpp_init_year60
version619
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "cat_url = \"https://storage.googleapis.com/cmip6/pangeo-cmip6.json\"\n", + "col = intake.open_esm_datastore(cat_url)\n", + "col" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "analyzed-athens", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

pangeo-cmip6 catalog with 6851 dataset(s) from 442624 asset(s):

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
unique
activity_id17
institution_id35
source_id85
experiment_id160
member_id550
table_id37
variable_id709
grid_label10
zstore442624
dcpp_init_year60
version619
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "col" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "liquid-equipment", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--> The keys in the returned dictionary of datasets are constructed as follows:\n", + "\t'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " \n", + " 100.00% [2/2 00:00<00:00]\n", + "
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "tasmax_dwd_ssp585 = col.search(\n", + " experiment_id=[\"ssp585\"],\n", + " variable_id=\"tasmax\",\n", + " grid_label=\"gn\",\n", + " institution_id='DWD',\n", + ").to_dataset_dict()['ScenarioMIP.DWD.MPI-ESM1-2-HR.ssp585.Amon.gn'].tasmax.to_iris()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "amateur-palestine", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "
Air Temperature (K)--timelatitudelongitude
Shape11032192384
Dimension coordinates
\ttime-x--
\tlatitude--x-
\tlongitude---x
Auxiliary coordinates
\tmember_idx---
Scalar coordinates
\theight2.0 m
Attributes
\tcell_measuresarea: areacella
\tcommentmaximum near-surface (usually, 2 meter) air temperature (add cell_method...
Cell methods
\tmean: area
\tmaximum within days: time
\tmean over days: time
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tasmax_dwd_ssp585" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "hindu-movement", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import iris.analysis" + ] }, { - "cell_type": "markdown", - "id": "quick-transaction", + "cell_type": "code", + "execution_count": 9, + "id": "continuing-hampshire", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "
Air Temperature (K)--time
Shape11032
Dimension coordinates
\ttime-x
Auxiliary coordinates
\tmember_idx-
Scalar coordinates
\theight2.0 m
\tlatitude51 degrees_north
\tlongitude-3 degrees_east
Attributes
\tcell_measuresarea: areacella
\tcommentmaximum near-surface (usually, 2 meter) air temperature (add cell_method...
Cell methods
\tmean: area
\tmaximum within days: time
\tmean over days: time
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_points = [('latitude', 51), ('longitude', -3)]\n", + "tasmax_dwd_ssp585_timeseries = tasmax_dwd_ssp585.interpolate(sample_points, iris.analysis.Linear())\n", + "tasmax_dwd_ssp585_timeseries" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "solar-canvas", "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], "source": [ - "## Interactivity & portability - Jupyter Labs" + "fig1 = matplotlib.pyplot.figure('cmip6_ts', figsize=(26,10))\n", + "ax1 = fig1.add_subplot(1,1,1)\n", + "ax1.plot([str(i1) for i1 in tasmax_dwd_ssp585_timeseries.coord('time').cells()][:20], tasmax_dwd_ssp585_timeseries[0,:].data[:20])" ] }, { "cell_type": "code", - "execution_count": null, - "id": "marine-continent", + "execution_count": 12, + "id": "nasty-requirement", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import bokeh\n", + "import bokeh.plotting\n", + "import bokeh.io" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "brave-bowling", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " Loading BokehJS ...\n", + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " var force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + " var JS_MIME_TYPE = 'application/javascript';\n", + " var HTML_MIME_TYPE = 'text/html';\n", + " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " var CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " var script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " var cell = handle.cell;\n", + "\n", + " var id = cell.output_area._bokeh_element_id;\n", + " var server_id = cell.output_area._bokeh_server_id;\n", + " // Clean up Bokeh references\n", + " if (id != null && id in Bokeh.index) {\n", + " Bokeh.index[id].model.document.clear();\n", + " delete Bokeh.index[id];\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " var id = msg.content.text.trim();\n", + " if (id in Bokeh.index) {\n", + " Bokeh.index[id].model.document.clear();\n", + " delete Bokeh.index[id];\n", + " }\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + 
" cell.notebook.kernel.execute(cmd);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " var output_area = handle.output_area;\n", + " var output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " var bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " var script_attrs = bk_div.children[0].attributes;\n", + " for (var i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", + " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " var toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " var props = {data: data, 
metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? */\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " var events = require('base/js/events');\n", + " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + "\n", + " \n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " var NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"
    \\n\"+\n", + " \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n", + " \"
  • use INLINE resources instead, as so:
  • \\n\"+\n", + " \"
\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded() {\n", + " var el = document.getElementById(\"1002\");\n", + " if (el != null) {\n", + " el.textContent = \"BokehJS is loading...\";\n", + " }\n", + " if (root.Bokeh !== undefined) {\n", + " if (el != null) {\n", + " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(display_loaded, 100)\n", + " }\n", + " }\n", + "\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error(url) {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (let i = 0; i < css_urls.length; i++) {\n", + " const url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " 
element.onerror = on_error.bind(null, url);\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.3.0.min.js\": \"HjagQp6T0/7bxYTAXbLotF1MLAGWmhkY5siA1Gc/pcEgvgRPtMsRn0gQtMwGKiw1\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.3.0.min.js\": \"ZEPPTjL+mdyqgIq+/pl9KTwzji8Kow2NnI3zWY8+sFinWP/SYJ80BnfeJsa45iYj\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.3.0.min.js\": \"exLqv2ACDRIaV7ZK1iL8aGzGYQvKVuT3U2CT7FsQREBxRah6JrkVCoFy0koY1YqV\"};\n", + "\n", + " for (let i = 0; i < js_urls.length; i++) {\n", + " const url = js_urls[i];\n", + " const element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error.bind(null, url);\n", + " element.async = false;\n", + " element.src = url;\n", + " if (url in hashes) {\n", + " element.crossOrigin = \"anonymous\";\n", + " element.integrity = \"sha384-\" + hashes[url];\n", + " }\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " \n", + " var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-2.3.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.3.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.3.0.min.js\"];\n", + " var css_urls = [];\n", + " \n", + "\n", + " var inline_js = [\n", + " function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + " function(Bokeh) {\n", + " \n", + " \n", + " }\n", + " ];\n", + "\n", + " function 
run_inline_js() {\n", + " \n", + " if (root.Bokeh !== undefined || force === true) {\n", + " \n", + " for (var i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + " if (force === true) {\n", + " display_loaded();\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " var cell = $(document.getElementById(\"1002\")).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + "\n", + " }\n", + "\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"
    \\n\"+\n \"
  • re-rerun `output_notebook()` to attempt to load from CDN again, or
  • \\n\"+\n \"
  • use INLINE resources instead, as so:
  • \\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"1002\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.3.0.min.js\": 
\"HjagQp6T0/7bxYTAXbLotF1MLAGWmhkY5siA1Gc/pcEgvgRPtMsRn0gQtMwGKiw1\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.3.0.min.js\": \"ZEPPTjL+mdyqgIq+/pl9KTwzji8Kow2NnI3zWY8+sFinWP/SYJ80BnfeJsa45iYj\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.3.0.min.js\": \"exLqv2ACDRIaV7ZK1iL8aGzGYQvKVuT3U2CT7FsQREBxRah6JrkVCoFy0koY1YqV\"};\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n if (url in hashes) {\n element.crossOrigin = \"anonymous\";\n element.integrity = \"sha384-\" + hashes[url];\n }\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n \n var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-2.3.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.3.0.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.3.0.min.js\"];\n var css_urls = [];\n \n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n function(Bokeh) {\n \n \n }\n ];\n\n function run_inline_js() {\n \n if (root.Bokeh !== undefined || force === true) {\n \n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"1002\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n 
}\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "bokeh.io.output_notebook()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "revolutionary-replacement", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "(function(root) {\n", + " function embed_document(root) {\n", + " \n", + " var docs_json = {\"cf918a2f-e7fb-4598-adde-0b4882c85a73\":{\"defs\":[{\"extends\":null,\"module\":null,\"name\":\"DataModel\",\"overrides\":[],\"properties\":[]}],\"roots\":{\"references\":[{\"attributes\":{\"below\":[{\"id\":\"1012\"}],\"center\":[{\"id\":\"1015\"},{\"id\":\"1019\"}],\"height\":400,\"left\":[{\"id\":\"1016\"}],\"renderers\":[{\"id\":\"1037\"}],\"title\":{\"id\":\"1040\"},\"toolbar\":{\"id\":\"1027\"},\"width\":1000,\"x_range\":{\"id\":\"1004\"},\"x_scale\":{\"id\":\"1008\"},\"y_range\":{\"id\":\"1006\"},\"y_scale\":{\"id\":\"1010\"}},\"id\":\"1003\",\"subtype\":\"Figure\",\"type\":\"Plot\"},{\"attributes\":{},\"id\":\"1008\",\"type\":\"LinearScale\"},{\"attributes\":{},\"id\":\"1045\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"bottom_units\":\"screen\",\"fill_alpha\":0.5,\"fill_color\":\"lightgrey\",\"left_units\":\"screen\",\"level\":\"overlay\",\"line_alpha\":1.0,\"line_color\":\"black\",\"line_dash\":[4,4],\"line_width\":2,\"right_units\":\"screen\",\"syncable\":false,\"top_units\":\"screen\"},\"id\":\"1026\",\"type\":\"BoxAnnotation\"},{\"attributes\":{\"overlay\":{\"id\":\"1026\"}},\"id\":\"1022\",\"type\":\"BoxZoomTool\"},{\"attributes\":{},\"id\":\"1044\",\"type\":\"AllLabels\"},{\"attributes\":{},\"id\":\"1020\",\"type\":\"PanTool\"},{\"attributes\":{},\"id\":\"1021\",\"type\":\"WheelZoomTool\"},{\"attributes\":{\"line_color\":\"#1f77b4\",\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"1035\",\"type\":\"Line\"},{\"attributes\":{},\"id\":\"1004\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1041\",\"type\":\"AllLabels\"},{\"attributes\":{},\"id\":\"1040\",\"type\":\"Title\"},{\"attributes\":{\"data\":{\"x\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,
43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,5
57,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031],\"y\":{\"__ndarray__\":\"DgqMQ4hwjEPrHI1DpWWOQ5eaj0ObCpFDzxCRQyXYkEN
+opBDzg+PQ9iqjUPhzYtD1qmMQ1EnjEPUC4xDqcGNQ8RykEMFX5FDS3ORQ/LNkEMsqpBDzTOPQ+P/i0MLUIxDOkSMQ43TjENFxIxDPNeNQ/H5j0P8NZFD5RKTQ+oSkUOjKZBDifKOQ3ZfjUPNkoxDaYmMQ/5ljENB5ItDhtiNQwwokEP8MJFDw5+RQwV6k0PsyZBDdgOPQ3eQjUPoBo1DtteLQ056ikO9BY1DGcqPQyRXkEOM25FDXy+RQz0fkkNxpo9DC4WOQ01CjUP00YxDb1yMQ6Hfi0MW+YxDWZiNQybkjkOoEZFDVSeSQwxFkUNsq5BDbFOOQ5k5jUNvWotD+wqMQ6dKjUMHRY1DHbOMQ8Qwj0NiOZFDU0GSQwo0kkO11ZBDHZyPQ4ryjEM9fItDz9iMQ+bWjEM1iYxDe5WNQ8wTj0NtKJJDHMCSQ9kjkkOatY9DoYuOQ/AojUNdLIxDDFqLQxcPjUOzKoxD0YqNQw3Qj0M9BpFD3FeSQ+UkkUPKsZBDJESPQ3CpjUO/1YxD/2KLQzpZi0NCP4xDZK+OQz1ejkN0KJFDpXCSQ07FkUPCnJBDnjKPQ5S8jUNwkIxDw9aLQzMOjEOHPI1DHN+OQ2mgj0OUBJND3K2SQ8ofkkN4opBD6iyQQ0jgjEO5r4tDKfCLQ9PNjENeZIxDhIaNQ2Y2j0OyV5BDyuCRQ4s/kUMrG5FD/FWPQ94RjkPGUYxDt3WMQ+1mjEOzD41D0NGPQ0USj0MibJBDwKSRQ65ekUMKMpRDBe2PQ238jEN2noxDyECLQ8bri0NtrYxDkPyNQ2yXj0NQ5JFDoDKSQ2MFkkP3mpFDRQeQQ7ENjkMvPI1Dxi+MQxgkjUOmN41DgfCNQwhfkEOFKpJD2VyTQ52MkkPq6pFDU0mRQ2eVjUPkbYtDyiuMQ+I3i0NBE41DjCOPQ0L6j0N6hpJDJzCUQyRulENW15FD9H6PQ1UBjkMXSY1DvxKMQ2E9jUPNpY1Dnv6NQ9jsj0Oyi5JDPnWSQ+Ark0NSiZBD00qPQ2oWjUN2AY1DP9qNQ1HOjUP6Eo1DjuWNQxv6jkOGxZFDRCKTQz2IkkP9mZBDK3iPQ0NkjUNpwoxD/FCMQ6r8i0O0/IxDAc6PQ4Z2kENNy5FDrt+RQ4yHkUN3tJBDJriPQz+GjEOd4YxDo8mLQ2RkikMnvotDbOuNQ5sNj0MTxpFDN6iSQwwVk0OW85BDMM6OQxXkjUNdcoxD3SSNQ5Uci0P8/4tDtB2PQxu6j0OW15FDfLSRQ+KkkUO+NpJDxCePQzbgjUMXy4xDItmLQzPvjEMYs4xD2qCNQw/8j0NR5pBDTaqQQ0QwkUPHR5BDjqeOQz60jEM1I41D0xaMQ4QDi0OOUIxDhHWNQ87gjkO3qZBDF/mQQ1J2kUOSDJBDpV6PQ0cNjkOYno1DZkGMQ5iojEObh4xDk8qNQ5yzkEPhk5NDHSqTQ38Uk0PsA5JDKS+PQw2wjUOvuoxDNFmMQy4MjUN5mY1DBiCOQ5ojkEN0qZFDVuaSQwEfkkNfhpBDK2GOQ8sAjUOHSI1D+sWLQ/ajjEOvAI1D1GSNQ6Jyj0PNC5JDEN6RQ10YkkOedJBD2PmOQ+hAjUMSY4xDqqmLQyVxi0P7ZIxDVvaNQ/QEkEPTmZBD4O+RQ/yGkkOV7pBDecmPQ7WrjUPEIY1DzGSKQ2CMikMaiYtDk1SOQ0Drj0NRvJBDxiSSQzG8kUPEUpBDBG6PQ994jUNShIxDgz2LQ8Tpi0N5e4xDeQWOQxMzj0MlxJFD0KGRQzYfkkOeipFDRx2PQ2PajUPkfYxDzIiLQxKUjEPShIxDgxOOQ3IokUMVtpJD4tyUQ63ek0N6IpND7VKQQzrqjEM5NY1DKwqNQ74+i0MvBY1Diy+OQ+r9kEP985FDtCaTQ+Q5k0P4HZNDXkaPQ+YYjkNOJ41D0a2MQ/Zri0PQ4YtDtHONQ28fj0MaXZFDX7eSQ0ZIkkOWyJFDsDKPQ4UmjkO
smIxDM3uMQ22ZjEPqR4xD9K6NQ1Zlj0PVjZFDElCUQ7QOkkOc3pJDBH+PQ6zZjUN2CYxD+YyMQ5q0jEPUQI1DplGOQ3cXkEOqWJBDaJSRQzmHk0PtOJBDK6ePQ4NWjUMne4xDK5SLQ2EIi0NzxoxDarmOQ1k1j0M0MZFDr6SSQ6vUk0PZEpFDSgWPQ1QYjkMNso1D1iCNQ8a1jENn9oxDFcSNQxmykEO26pFDrQCSQwGFkUNlmZBDBTWPQ7VVjUMYaI1DMFiMQyDNi0OzW4xDnNuOQyPdj0PvgJNDAxuUQzqblEP/ZpND08iOQ/fvjUOVyoxD5omLQxWhi0PwX4xDWPCOQ8GYkEPnaZJDJ0GSQ/52kkNWv5FDBfWOQ2NojUPAPYxDYTmMQxg2jEPw/Y1D2JKQQ8aEj0OsAZJDdySSQ/CJkUNIPJJDxRGPQyZWjUPqHoxDk9iLQ9xgjEPbm41DLgCPQ+dhkEOtFJJD7PuSQ3rhkUMQkZFDdjiPQxRijkOwbY1DOECMQwtjjEOIGoxDAp+NQ32skEMIxZFDAYmSQ/ZKk0MIzJFD/MCPQ0G2jUNJtYxDfpOKQ47OjEN/yoxDFJSPQ0OOj0Px2pBDZA2UQ/IdlENjWJNDnLePQ2MWj0P9KYtDWimNQysxjUNQTI1DyvGNQwiBj0P1ppFD/HqSQ9P2kkNO7pFDPfCPQ8VwjUOuQ41D/jaNQ6sAjUOQXo1DBWyPQ7/skkN6H5JDy82VQ5obk0NgtpJDRw6PQxqUjkN1NY1DLEGLQ+F/jEPcF41DSSaOQzmWkEOPiJFDyH6TQy53kkNZK5FDTq6PQ6rXjUO8boxDtPGLQ9wajEMIyItDIQ6OQ+uMj0O9UpJDqOWSQ/dok0MMMZNDf2+QQ7Y3jUP5m41DaNeMQx0ijEOMUI5D+ReOQ7VpkUPzJZNDSTuWQ+5Qk0NQNJNDxc+PQ0b0jUPURY1DEGyMQwchjUPZ2YxDs7KNQ6/1j0MprJFD1NaSQ3B8kkO3yJJDqSKQQ7TZjUPfi4xDuv2MQ5m8jEOL2Y1DvIuNQ1zUj0MU+5BD4F2SQ0bokUNaSZJDhMSRQ2icjkOFrI1D9ByMQx2/i0NxzY1Dr3aOQ6wKkUMSPpFDE0SVQ+ZAlEMyRJJD2OyPQ2rgjUN6UY1D7lyMQ9yyjEPVsI1DRuiOQ4xakENgd5JD7/ORQ6xNkkMBWZFDG4ePQ+iljUNH2oxDz/yMQw+YjEOEFY1D/wmQQ4prj0PTlZFD1DOSQ2ZrlEP+TJND2TmQQ527jUOllYtDJ7OMQwz2jEP3SY1DKlOOQ/JEkUPTMJNDGamTQwmmkkM7YpJDQ9yRQ5qJjkOa1YxD1KyMQ9oGi0PiWY1D7DSOQ2rSj0PXEZFDL22SQ9tKkUOGr5FDZmaPQyCgjUNOVYxDyR+LQ9ohjEOfDY1D6QKOQ6TKjkPWzZFDBDGSQ4zAkkOhaJFDKS+QQ+mGjkPAM4xDg3yMQzWKjEN7I41D/VeOQ0a8kEOk4JFD/w2TQ7xLkkNS7JFD+YmQQ91djkMPRI1DYtyMQxO3jEOYjoxDeEmOQ+IokUOjAJFDW4aSQ2gCkkNmTZFDDtKPQ/mwjUPq+41DBpmNQ4U+jUN7v4xDyDeOQyvokUMIn5FDyYiTQ8L/kkN6/ZFDyCCRQyeVjkNawo1DrpuLQ7Gri0Nnt4xDqRePQ4b5j0OwvZFDfSOVQ4WwlEPcXpND3d2PQ95hjkNFt4xDyOaMQ79gjEPaYo1DoEuOQ+LckUNsM5JDuuaRQ7XjkUOqNJJDuGePQ7FRjkOZF41Dv3uMQwchjEOWyYxDK2OOQ9X5j0M6QpJDZbyRQ1g7kkOZPJFDFVCPQ9nyjUM9toxDuWqMQ8gIjUPAso1Du9+OQ3uzkEP2gZJDq9WTQ1lXk0Ne4pJDODqRQ7L/jkMKKY5D0taNQ//Ei0NJz4xDsNOOQ7IGkENIv5FDqPSSQ/Iek0NEnJJDGW+QQ/zLjkNwCI5DgWmMQ7gijEP
TnoxDrbmPQ4MrkUME9pJD0DiUQ0GYkkOvCZNDIPiPQ4cHj0Owj41DeWKNQ9FUjUPEuY1DhlGOQ/03kUOadpFDTqmTQ3+slEMrAZRD2leQQ3VfjkPWgI1DHSWNQ7NAjUP9cI1DVrqQQxCzkUMK0ZJDSZOUQxx7k0PUS5NDNIKQQ9HujUPr0I5D74iNQ4iHjEPaho1DgPOPQy50kEPofpFDgsiSQ/3tk0NgI5JDeTyQQ1tbjkMM5Y1DFLyMQ6NEjUODaIxDlH6OQ4VEkEO3iZNDiz6SQ7b7k0NJv5JDtJaRQ1yzjkO9H41DbV6LQ5+xjUNifo1DAJePQ62xj0NEvJJDsDCSQ8uFkkNYPJFDQ8OPQ3GOjkOfe41D3/qMQwebi0OR4Y1DT/2OQ6AMkkPFEJNDTQeUQ2NglUOX15JDysWPQ0ZrjUPyfY5D8EONQwGNjEMhz4xDUmCOQ61dkENaLZFDlPySQxbSkkN2uJJDDAyRQ4KVjUMbbY1DshKNQxSAjUNXOI1Dm5GOQ2dpkEPjZZFDQqWSQy/Fk0NJkpNDcPyQQ6n2jkNkno5DjKuLQzi1jEN2d41DoXmQQ30MkEPtWZFDk0qSQ0X3kkMPD5FDSCSQQ2xJjkPfN41DTnWNQyALi0P0ZYxDyMaPQx7pkENtKZJDJF+UQ+0AlEMJVpRDfN6QQ8LcjkOb7o1DebGMQ78MjUNQzI1DUBKQQ98WkUMMp5NDn8CUQ+e1k0Nm25FDKRaPQ3OqjkMaW41DHEuMQ/pgjEMFu4xDobePQ1uvkUNxAZNDo6WVQ84LlEOa65JD1DyQQ7CwjkMf2IxDDLqMQ+HMi0NiJY5D74eOQ6b+j0OxmZFDGwqTQwswk0Mh9pJDK22QQxMYjkO9qYxDGFKNQ0OHjUPfXY5DogWPQyDBj0PgW5NDqaiUQ/JdlUN8Q5RDI32QQ9T3jkN6zo1Dg/GMQ3QLjUP01YxDZXGPQ0uekUOYR5JDO5+SQ1DHk0OtuZFD+vSQQ32tjkPaOo5D9O2LQ1JLjENaFI5DLiKQQ0zVkEO3V5NDVu+SQwAvlEMolJJDQ2mQQ91fjkNcfY1DHbWMQy88jUOIsI5DOZOPQ7hykUPxZ5JDUvWUQzA0lkOwxZJDrXmQQ/gBjkPCiI1De5WMQ5E5i0MD8oxD7fqNQwFVkUMZypFDRKCUQ62klEM+eZJDCuePQzipjUNv+41DuDaNQ72kjUOxWY1Dz5iPQ7U9kkP4EJNDqlqUQ31XlEOxU5NDDGGQQwxaj0Nwn41DB22NQ+wvjEMhGo5DKlWPQ4dmkUMQy5JDlTqTQx3Ik0Mqg5JDcIeQQzHOjkOCLI5DGveNQycKjEMt2Y1D2syNQ09mkEOEO5JD2F6TQ1nikkNhLZJDww+QQ9q4jUMGNI1DL2SMQ7QIjUOWnYxDUOiOQ1NqkkOKoJNDdECVQ0Z6lEMR8ZJDie6PQxjNjkMatI5D\",\"dtype\":\"float32\",\"order\":\"little\",\"shape\":[1032]}},\"selected\":{\"id\":\"1048\"},\"selection_policy\":{\"id\":\"1049\"}},\"id\":\"1034\",\"type\":\"ColumnDataSource\"},{\"attributes\":{},\"id\":\"1006\",\"type\":\"DataRange1d\"},{\"attributes\":{},\"id\":\"1042\",\"type\":\"BasicTickFormatter\"},{\"attributes\":{\"line_alpha\":0.1,\"line_color\":\"#1f77b4\",\"x\":{\"field\":\"x\"},\"y\":{\"field\":\"y\"}},\"id\":\"1036\",\"type\":\"Line\"},{\"attributes\":{},\"id\":\"1017\",\"type\":\"BasicTicker\"},{\"attributes\":{\"
formatter\":{\"id\":\"1045\"},\"major_label_policy\":{\"id\":\"1044\"},\"ticker\":{\"id\":\"1013\"}},\"id\":\"1012\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1024\",\"type\":\"ResetTool\"},{\"attributes\":{},\"id\":\"1010\",\"type\":\"LinearScale\"},{\"attributes\":{\"active_multi\":null,\"tools\":[{\"id\":\"1020\"},{\"id\":\"1021\"},{\"id\":\"1022\"},{\"id\":\"1023\"},{\"id\":\"1024\"},{\"id\":\"1025\"}]},\"id\":\"1027\",\"type\":\"Toolbar\"},{\"attributes\":{},\"id\":\"1013\",\"type\":\"BasicTicker\"},{\"attributes\":{\"axis\":{\"id\":\"1012\"},\"ticker\":null},\"id\":\"1015\",\"type\":\"Grid\"},{\"attributes\":{},\"id\":\"1048\",\"type\":\"Selection\"},{\"attributes\":{\"source\":{\"id\":\"1034\"}},\"id\":\"1038\",\"type\":\"CDSView\"},{\"attributes\":{\"axis\":{\"id\":\"1016\"},\"dimension\":1,\"ticker\":null},\"id\":\"1019\",\"type\":\"Grid\"},{\"attributes\":{\"formatter\":{\"id\":\"1042\"},\"major_label_policy\":{\"id\":\"1041\"},\"ticker\":{\"id\":\"1017\"}},\"id\":\"1016\",\"type\":\"LinearAxis\"},{\"attributes\":{},\"id\":\"1023\",\"type\":\"SaveTool\"},{\"attributes\":{},\"id\":\"1049\",\"type\":\"UnionRenderers\"},{\"attributes\":{\"data_source\":{\"id\":\"1034\"},\"glyph\":{\"id\":\"1035\"},\"hover_glyph\":null,\"muted_glyph\":null,\"nonselection_glyph\":{\"id\":\"1036\"},\"view\":{\"id\":\"1038\"}},\"id\":\"1037\",\"type\":\"GlyphRenderer\"},{\"attributes\":{},\"id\":\"1025\",\"type\":\"HelpTool\"}],\"root_ids\":[\"1003\"]},\"title\":\"Bokeh Application\",\"version\":\"2.3.0\"}};\n", + " var render_items = [{\"docid\":\"cf918a2f-e7fb-4598-adde-0b4882c85a73\",\"root_ids\":[\"1003\"],\"roots\":{\"1003\":\"391cfc12-6f93-4b9e-8f6c-9bcb8d53aa17\"}}];\n", + " root.Bokeh.embed.embed_items_notebook(docs_json, render_items);\n", + "\n", + " }\n", + " if (root.Bokeh !== undefined) {\n", + " embed_document(root);\n", + " } else {\n", + " var attempts = 0;\n", + " var timer = setInterval(function(root) {\n", + " if (root.Bokeh !== undefined) {\n", + " 
clearInterval(timer);\n", + " embed_document(root);\n", + " } else {\n", + " attempts++;\n", + " if (attempts > 100) {\n", + " clearInterval(timer);\n", + " console.log(\"Bokeh: ERROR: Unable to run BokehJS code because BokehJS library is missing\");\n", + " }\n", + " }\n", + " }, 10, root)\n", + " }\n", + "})(window);" + ], + "application/vnd.bokehjs_exec.v0+json": "" + }, + "metadata": { + "application/vnd.bokehjs_exec.v0+json": { + "id": "1003" + } + }, + "output_type": "display_data" + } + ], + "source": [ + "p = bokeh.plotting.figure(plot_width=1000, plot_height=400)\n", + "\n", + "# add a line renderer\n", + "# p.line([str(i1) for i1 in tasmax_dwd_ssp585_timeseries.coord('time').cells()][:20], tasmax_dwd_ssp585_timeseries[0,:].data[:20])\n", + "p.line(range( tasmax_dwd_ssp585_timeseries[0,:].data.shape[0]), tasmax_dwd_ssp585_timeseries[0,:].data)\n", + "\n", + "# p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)\n", + "bokeh.plotting.show(p)" + ] + }, + { + "cell_type": "markdown", + "id": "approximate-mercy", + "metadata": {}, + "source": [ + "## Existing example installations \n", + "\n", + "Pangeo has been installed in many different places around the world. 
Here are some example installations:\n", + "* Informatics Lab research deployment - AWS, Azure\n", + "* Cheyenne HPC\n", + "* JASMIN academic computing service\n", + "\n", + "Other services are available that contain many of the elements of a Pangeo implementation:\n", + "* AWS SageMaker, Azure ML\n", + "* JupyterLab running on a local computer\n" + ] }, { "cell_type": "markdown", - "id": "olive-board", + "id": "bizarre-siemens", "metadata": {}, "source": [ - "## Dashboarding" + "## Getting started\n", + "\n", + "If you want to set up your own Pangeo instance, the Pangeo community has lots of different recipes and examples for doing so, available through the Pangeo community website, with community help available through Discourse:\n", + "\n", + "* Deployment guide: https://pangeo.io/setup_guides/index.html\n", + "* Pangeo Discourse: https://discourse.pangeo.io/" ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "pangeo_lectures", "language": "python", - "name": "python3" + "name": "pangeo_lectures" }, "language_info": { "codemirror_mode": { @@ -137,7 +1557,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.7.10" } }, "nbformat": 4, diff --git a/04_data_a_modern_approach.ipynb b/04_data_a_modern_approach.ipynb index 87944c4..cdbd0bc 100644 --- a/04_data_a_modern_approach.ipynb +++ b/04_data_a_modern_approach.ipynb @@ -2,170 +2,5961 @@ "cells": [ { "cell_type": "markdown", - "id": "simplified-wings", + "id": "developed-electric", "metadata": {}, "source": [ - "# Section 4: Data - A modern approach " + "# Section 4: Data - A modern approach \n", + "![Pangeo Logo ](images/pangeo_logo_small.png)\n", + "\n", + "[Pangeo Website](https://pangeo.io/)" + ] + }, + { + "cell_type": "markdown", + "id": "upper-marketing", + "metadata": {}, + "source": [ + "It's all very well having a distributed, massively parallel scientific compute platform available. 
If we don't use it properly, our tasks won't make the best use of the infrastructure. There are two primary blockers to making the best use of the platform: poorly written source code and poorly structured datasets. With the tools and libraries now available for scientific computing, such as dask, the platform should be easy to set up and use for interactive computing at scale, both for users in general and specifically for our first use case, the Scientific Analyst or Researcher. We can write code that clearly communicates our intentions to someone reading the code, while underlying libraries ensure that it is efficiently executed by the compute platform (meeting the goal of separation of concerns). The next challenge is the data we work with. One might think the data is what it is, and that it's up to the code to deal with it appropriately, but that is not the case. Those who create the dataset, primarily the second use case of Data Engineer, can do a lot to make it easier to use later. The challenge of presenting data so it is easy to consume and understand is an old one." + ] + }, + { + "cell_type": "markdown", + "id": "british-bennett", + "metadata": {}, + "source": [ + "## An Historical Digression - Reading ancient texts\n", + "\n", + "![Codex Sinaiticus](https://www.bl.uk/britishlibrary/~/media/bl/global/dst%20discovering%20sacred%20texts/collection%20items/codexsinaiticus-add_ms_43725_f244v.jpg?w=608&h=342&hash=11BF1F0A1DE8CAC524DE050F912D7AD7)\n", + "\n", + "When we look at text, we see a lot of elements that we may not think much about but that add a lot to our understanding of what the text contains. For example, there are punctuation marks to divide the text into sentences and phrases. In addition, there are gaps left to indicate paragraphs, quotes, lists etc. Using whitespace for meaning was by no means invented by Python! As you can see from the accompanying pictures, text in ancient manuscripts often had none of these helpful elements. 
In some languages the vowels were not even explicitly specified. All the text was nominally there, but actually reading it, particularly reading it out loud, required a lot of skill to interpret what was recorded on the page. There was a profession of *lektors*, professional readers, because reading any substantial text required a lot of training and practice. Much of the burden of reading was placed on the consumer of the manuscript rather than the producer.\n", + "\n", + "Over time our way of writing has evolved to provide as much help for the reader as possible, so that the person reading can focus on the actual content of the document they are reading, rather than the skill of reading itself. We expect a lot more of the writer, but they are the one who best knows the context and the meaning and so are best placed to help others interpret the text through whitespace, punctuation etc." + ] + }, + { + "cell_type": "markdown", + "id": "antique-heaven", + "metadata": {}, + "source": [ + "## The Goal - Data for reproducible, shareable research\n", + "That \"fun\" historical digression does have a point relevant to this course: we want to make it as easy as possible for our data to be used by researchers, and for data producers to provide as much help as possible in accessing, loading and analysing the data. It is not enough that the raw data is present; it needs to be described by sufficient metadata and structured for efficient access. Data consumers should need skills in the domain the data describes to understand the contents, but should need as little as possible in terms of data and software engineering skills. 
\n" + ] + }, + { + "cell_type": "markdown", + "id": "choice-centre", + "metadata": {}, + "source": [ + "![Data Description Levels](images/data_description_levels.png)" + ] + }, + { + "cell_type": "markdown", + "id": "synthetic-corporation", + "metadata": {}, + "source": [ + "We might think of different levels of descriptiveness and ease of access for datasets.\n", + "\n", + "* *Raw Data* - A series of text or binary files with the data values but no description of what they mean e.g. units. Finding the data you need and pulling it together into a coherent dataset requires knowledge from elsewhere.\n", + "* *Described data files* - Data is provided as a series of files which contain a description of the data contained within (metadata). Accessing a whole dataset requires figuring out paths to many different files, which may not have consistent metadata across files. \n", + "* *Described dataset* - All data in a dataset is accessed through a single descriptor and contains all descriptions necessary for someone skilled in the domain the data describes to interpret the data. The \"behind the scenes\" structure may still be a series of files and directories, but the user consumes the dataset as a unified whole.\n" + ] + }, + { + "cell_type": "markdown", + "id": "assumed-adjustment", + "metadata": {}, + "source": [ + "https://pangeo.io/data.html" + ] + }, + { + "cell_type": "markdown", + "id": "successful-packing", + "metadata": {}, + "source": [ + "In creating a dataset we should ensure our data follows the FAIR principles, that it is *Findable*, *Accessible*, *Interoperable* and *Reusable*, and is *Analysis-Ready* and *Cloud Optimised*. Let's dig a bit deeper into what those terms really mean and what we can do to ensure we are following those principles." 
+ ] + }, + { + "cell_type": "markdown", + "id": "working-junction", + "metadata": {}, + "source": [ + "## FAIR data\n", + "\n", + "The FAIR principles are intended to ensure that it is easy for others to use the data that we produce, so that research can build on the work of others as efficiently as possible rather than endlessly reinventing the wheel. The principles state that data should be:\n", + "\n", + "* **Findable** - A researcher should easily be able to find the data that exists related to the problem that they are working on. This relies on sufficiently detailed descriptions of datasets being contained in the metadata and being accessible without reading the whole dataset. An emerging technology is *data search engines*, which aim to make finding data as easy as finding a web page is through web search engines.\n", + "* **Accessible** - Once a researcher has found a dataset that they have determined will be a useful input to their pipeline in addressing a research question, they should easily be able to access that data.\n", + "* **Interoperable** - The user should be able to load the data into the tool of their choice and integrate it with the rest of their research pipeline.\n", + "* **Reusable** - Data can easily be used by others who have not played a part in creating it, and ideally should be usable by those who are not specialists in the particular domain of the data.\n" + ] + }, + { + "cell_type": "markdown", + "id": "intermediate-hometown", + "metadata": {}, + "source": [ + "More Info:\n", + "* https://www.go-fair.org/fair-principles/\n", + "* https://www.nature.com/articles/sdata201618" + ] + }, + { + "cell_type": "markdown", + "id": "dimensional-convention", + "metadata": {}, + "source": [ + "## Analysis Ready Data\n", + "\n", + "The concept of analysis ready data is closely aligned with the FAIR principles, but takes the reusable and interoperable principles in particular further. 
When asking if data is *analysis-ready*, we are really asking: once I have loaded the dataset in my favourite tool, is it ready to use in my analysis or for training a statistical or machine-learning model, or do I have to do a lot of preprocessing to get it into a state where it is ready to use in one of these ways? Data that is analysis ready should have the following attributes (thanks to Aaron Hopkinson for the following descriptions):" + ] + }, + { + "cell_type": "markdown", + "id": "copyrighted-attendance", + "metadata": {}, + "source": [ + "* *Metadata rich* - Users should be provided with an object whose metadata can be used\n", + "to define operations (e.g. mean over ensemble realization).\n", + "* *Hidden infrastructure* - Analysis can be readily achieved without knowledge of the underlying storage of the\n", + "data (e.g. file paths, chunking etc – note efficiency considerations)\n", + " * Users should not have to manually interact with the storage system; this should be\n", + "handled automatically as necessary. Ideally the storage should be abstracted away\n", + "from the user.\n", + "* *Simple analysis with simple code* - Basic analysis of a subset of the data (e.g. descriptive statistics, subsetting) should be possible with a minimum number of additional lines of code, and on the basis of metadata. \n", + " * e.g. `mean(dataset, axis='time')`\n", + "* *Supports modular analysis* - Custom analytics should be possible by creating functions which take in an analysis ready dataset and return another.\n" + ] + }, + { + "cell_type": "markdown", + "id": "saved-helmet", + "metadata": {}, + "source": [ + "This is still a new concept and as such all the implications and meanings are still being worked out. Partly this is because deciding whether something is analysis ready requires knowing something about what analysis is to be performed. For many use cases there are common requirements that a dataset must meet to be usable without further processing. 
We want this to be true of analysis ready data for as many use cases as possible." + ] + }, + { + "cell_type": "markdown", + "id": "incomplete-cement", + "metadata": {}, + "source": [ + "## Cloud optimised Data\n", + "\n", + "As we've seen, the cloud infrastructure underpinning a Pangeo is different to the structure of the desktop or cluster models. We need to make sure that the way we structure our datasets enables efficient access for distributed access patterns. We call data that is presented in this way *cloud optimised*. Some of the characteristics of cloud optimised data include (thanks to Aaron Hopkinson for the following descriptions):" + ] + }, + { + "cell_type": "markdown", + "id": "protected-message", + "metadata": {}, + "source": [ + "* *Fast metadata access* - Metadata should be available with low latency, and without pulling from a large number of individual objects (low cost operation). It should be consolidated in\n", + "some way.\n", + "* *Lazy access* - For performance, it should ideally be possible to construct processing on ARCO (analysis-ready, cloud-optimised) data lazily, so that no processing is carried out until explicitly requested by the user. Any computation should also be processed close to the data, in order to avoid unnecessary network traffic.\n", + "* *Data appropriate chunking* - Subsets (or chunks) of data should be retrievable either using a whole object fetch, or easily computable byte-range requests, in order to leverage large scale parallelism and avoid unnecessary network transfers.\n", + "* *Consistency* - The concept of “eventual consistency” needs to be considered: if an object in cloud storage is updated from one node and another node later reads the same object, the state change may not have propagated.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "acting-fiber", + "metadata": {}, + "source": [ + "\n", + "The more one optimises for a specific use case, the less optimised the data is likely to be for other cases. This is especially true for chunking. 
So one wants to make data access efficient for as broad a spectrum of possible uses as one can, while focusing on optimising for the most common use cases. " + ] + }, + { + "cell_type": "markdown", + "id": "changed-porter", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "These principles are a good starting point for thinking about how we present data, but they do not give easy answers for exactly what should be done. There are often choices to be made and different requirements to balance. For example, optimal chunking is specific to the way the data is accessed, so some judgement is required in terms of what is best for a particular dataset. Other challenges include:\n", + "* *Different platforms* - Cloud requirements may be different from HPC requirements. What may work well on one platform may not work well on another.\n", + "* *Language agnostic* - How do we ensure that our data can be accessed from tools written in different programming languages?\n", + "* *Flexible, interoperable standards* - There is unlikely to be one complete solution/concrete implementation – instead we need to agree principles such as use of open standards and interoperability.\n" + ] + }, + { + "cell_type": "markdown", + "id": "alert-hypothetical", + "metadata": {}, + "source": [ + "## Data types and tools\n", + "\n", + "Let's look at some practical examples of data. In weather and climate science, we distinguish between two sorts of data which are stored and accessed a bit differently, so we'll look at the specifics of each separately. The two data types are:\n", + "* *Tabular data* - Data stored in rows and columns (a bit like a spreadsheet or database architecture).\n", + "* *Gridded data* - Data is stored as a multi-dimensional array, where the dimensions of the array can include latitude, longitude, height, data time, generation time, ensemble member number etc." 
+ ] + }, + { + "cell_type": "markdown", + "id": "renewable-missouri", + "metadata": {}, + "source": [ + "### Tabular Data - XBT Project\n", + "\n", + "Tabular data is the sort of data often stored in a traditional relational database system or a spreadsheet. Important characteristics of tabular data:\n", + "* Each column is a variable/field/feature\n", + "* Each row is a measurement/observation/record/data point\n", + "* Metadata is often contained in column headers, but can also be stored as columns in the table\n", + "* No convenient way to store metadata for the whole table\n", + "* Each table is often stored as a separate file\n", + "* Common file formats: CSV, Parquet, SQL database\n", + "\n", + "In this example we will look at an ocean temperature dataset collected in the World Ocean Database. Each measurement contains a temperature profile through the depth of the ocean at a particular location, taken using eXpendable BathyThermographs (XBTs). This has been the subject of a machine learning project at the Met Office to fill in missing data using ML techniques.\n", + "\n", + "More information:\n", + "* XBT project repository - https://github.com/MetOffice/XBTs_classification\n", + "* World Ocean Database - https://www.ncei.noaa.gov/products/world-ocean-database" ] }, { "cell_type": "code", - "execution_count": null, - "id": "secondary-specification", + "execution_count": 1, + "id": "horizontal-cinema", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import pandas" + ] }, { - "cell_type": "markdown", - "id": "nervous-trinity", + "cell_type": "code", + "execution_count": 2, + "id": "cosmetic-trinity", "metadata": {}, + "outputs": [], "source": [ - "## Analysis Ready Data" + "data_paths = [f's3://xbt-data/csv_with_imeta/xbt_{year}' for year in range(1966,2015)]" ] }, { "cell_type": "code", - "execution_count": null, - "id": "vocal-guatemala", + "execution_count": 3, + "id": "peripheral-brain", "metadata": {}, "outputs": [], - "source": [] + 
"source": [ + "xbt_1966 = pandas.read_csv(data_paths[0])" + ] }, { - "cell_type": "markdown", - "id": "individual-gospel", + "cell_type": "code", + "execution_count": 4, + "id": "equivalent-difference", "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Unnamed: 0 Unnamed: 0.1 country lat lon \\\n", + "0 0 0 UNITED STATES 32.966667 -117.633331 \n", + "1 1 1 UNITED STATES 33.016666 -118.116669 \n", + "2 2 2 UNITED STATES 33.066666 -118.466667 \n", + "3 3 3 UNITED STATES 32.700001 -118.666664 \n", + "4 4 4 UNITED STATES 32.933334 -117.916664 \n", + "... ... ... ... ... ... \n", + "1745 1745 1745 UNITED STATES 28.716667 -145.666672 \n", + "1746 1746 1746 UNITED STATES 18.650000 106.983330 \n", + "1747 1747 1747 UNITED STATES 29.700001 -143.616669 \n", + "1748 1748 1748 UNITED STATES 30.299999 -142.366669 \n", + "1749 1749 1749 UNITED STATES 18.583334 106.900002 \n", + "\n", + " date year month day institute ... \\\n", + "0 19660412 1966 4 12 US NAVY SHIPS OF OPPORTUNITY ... \n", + "1 19660413 1966 4 13 US NAVY SHIPS OF OPPORTUNITY ... \n", + "2 19660414 1966 4 14 US NAVY SHIPS OF OPPORTUNITY ... \n", + "3 19660414 1966 4 14 US NAVY SHIPS OF OPPORTUNITY ... \n", + "4 19660414 1966 4 14 US NAVY SHIPS OF OPPORTUNITY ... \n", + "... ... ... ... ... ... ... \n", + "1745 19661231 1966 12 31 US DOC NOAA NMFS (MONTEREY; CA) ... \n", + "1746 19661231 1966 12 31 0 ... \n", + "1747 19661231 1966 12 31 US DOC NOAA NMFS (MONTEREY; CA) ... \n", + "1748 19661231 1966 12 31 US DOC NOAA NMFS (MONTEREY; CA) ... \n", + "1749 19661231 1966 12 31 0 ... \n", + "\n", + " instrument model manufacturer \\\n", + "0 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "1 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "2 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "3 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "4 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "... ... ... ... \n", + "1745 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "1746 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "1747 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "1748 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "1749 XBT: T4 (SIPPICAN) T4 SIPPICAN \n", + "\n", + " temperature_profile \\\n", + "0 [16.153478622436523, 15.893913269042969, 15.69... \n", + "1 [16.353145599365234, 16.32319450378418, 15.873... 
\n", + "2 [15.165132522583008, 15.165132522583008, 14.81... \n", + "3 [15.115216255187988, 15.115216255187988, 14.96... \n", + "4 [15.923863410949707, 15.923863410949707, 15.67... \n", + "... ... \n", + "1745 [20.515398025512695, 20.415565490722656, 20.11... \n", + "1746 [23.70085334777832, 23.501188278198242, 23.501... \n", + "1747 [20.016233444213867, 19.816566467285156, 18.21... \n", + "1748 [19.616901397705078, 19.51706886291504, 18.219... \n", + "1749 [23.201688766479492, 23.30152130126953, 23.301... \n", + "\n", + " temperature_quality_flag \\\n", + "0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... \n", + "1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... \n", + "2 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] \n", + "3 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... \n", + "4 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... \n", + "... ... \n", + "1745 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] \n", + "1746 [0 0 0 0 0] \n", + "1747 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] \n", + "1748 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] \n", + "1749 [0 0 0 0 0 0 0] \n", + "\n", + " depth_profile max_depth \\\n", + "0 [-1.1172676086425781, 6.054140567779541, 7.078... 466.892670 \n", + "1 [-1.1095428466796875, 0.9393265247344971, 4.01... 466.852051 \n", + "2 [-1.1249465942382812, 6.047884464263916, 12.19... 70.602089 \n", + "3 [-1.1200714111328125, 12.199983596801758, 13.2... 466.907410 \n", + "4 [-1.1018257141113281, 14.260088920593262, 17.3... 466.811493 \n", + "... ... ... \n", + "1745 [-1.3544235229492188, 51.20085525512695, 128.4... 474.280792 \n", + "1746 [-1.4307975769042969, 8.897333145141602, 19.22... 50.191364 \n", + "1747 [-1.3290176391601562, 94.42085266113281, 95.45... 459.816345 \n", + "1748 [-1.317779541015625, 106.7505874633789, 107.77... 471.017609 \n", + "1749 [-1.4271049499511719, 8.900050163269043, 19.22... 101.760010 \n", + "\n", + " depth_quality_flag imeta_applied \\\n", + "0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 
1 \n", + "1 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 \n", + "2 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 1 \n", + "3 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 \n", + "4 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0... 1 \n", + "... ... ... \n", + "1745 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0 \n", + "1746 [0 0 0 0 0] 1 \n", + "1747 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0 \n", + "1748 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0 \n", + "1749 [0 0 0 0 0 0 0] 1 \n", + "\n", + " id \n", + "0 2052528 \n", + "1 2052529 \n", + "2 2052530 \n", + "3 2052531 \n", + "4 2052532 \n", + "... ... \n", + "1745 2054254 \n", + "1746 3411243 \n", + "1747 2054256 \n", + "1748 2054257 \n", + "1749 3411246 \n", + "\n", + "[1750 rows x 22 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "### Metadata" + "xbt_1966" ] }, { "cell_type": "code", - "execution_count": null, - "id": "future-freeze", + "execution_count": 5, + "id": "pretty-seventh", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import matplotlib\n", + "import matplotlib.pyplot" + ] }, { - "cell_type": "markdown", - "id": "rotary-scanning", + "cell_type": "code", + "execution_count": 6, + "id": "generic-census", "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], "source": [ - "## Cloud optimised Data" + "fig1 = matplotlib.pyplot.figure('xbt_profile', figsize=(16,10))\n", + "ax1 = fig1.add_subplot(1,1,1, title='Depth vs temperature')\n", + "_ = ax1.plot(eval(xbt_1966.loc[0,'depth_profile']),eval(xbt_1966.loc[0,'temperature_profile']))" ] }, { "cell_type": "code", - "execution_count": null, - "id": "black-billy", + "execution_count": 7, + "id": "amber-freeze", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import dask.dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "horizontal-november", + "metadata": {}, + "outputs": [], + "source": [ + "#dask.dataframe.read_csv(data_paths)" + ] }, { "cell_type": "markdown", - "id": "orange-killing", + "id": "forced-fantasy", "metadata": {}, "source": [ - "### Lazy Loading" + "Things to note about the data engineering of this dataset:\n", + "* data is split by year for parallel access\n", + "* data is currently accessed through multiple files; it would be better to present it as a coherent dataset that abstracts away file access particulars\n", + "* minimal metadata in column headers; columns such as country, institute etc. are metadata, whereas the depth and temperature profiles are the data." + ] + }, + { + "cell_type": "markdown", + "id": "crude-mandate", + "metadata": {}, + "source": [ + "## Gridded data\n", + "\n", + "Gridded data is stored as a multi-dimensional array. This usually covers a geographic area for a period of time, so it will include at least latitude, longitude and time as dimensions of the array. Some important characteristics of gridded data include:\n", + "* Separate physical phenomena e.g. 
temperature, wind speed, precipitation, are stored as separate arrays\n", + "* In addition to latitude, longitude and time, array dimensions might include ensemble member number, height (for representing the 3D atmosphere at multiple levels), forecast time (when the data was created) and others depending on context.\n", + "* Metadata is stored in a separate data structure for each array\n", + "\n" ] }, { "cell_type": "code", - "execution_count": null, - "id": "acting-respect", + "execution_count": 9, + "id": "reverse-newspaper", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import boto3\n", + "import botocore\n", + "import datetime\n", + "import matplotlib.pyplot as plt\n", + "import os.path\n", + "import s3fs" + ] }, { - "cell_type": "markdown", - "id": "greek-voluntary", + "cell_type": "code", + "execution_count": 10, + "id": "military-black", "metadata": {}, + "outputs": [], "source": [ - "## Data types and tools" + "import xarray as xr\n", + "import matplotlib.pyplot\n", + "import iris\n", + "import iris.quickplot\n", + "import cartopy.crs" ] }, { "cell_type": "code", - "execution_count": null, - "id": "dressed-instruction", + "execution_count": 11, + "id": "applied-culture", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "era5_bucket = 'era5-pds'\n", + "\n", + "# AWS access / secret keys required\n", + "s3 = boto3.resource('s3')\n", + "bucket = s3.Bucket(era5_bucket)\n", + "\n", + "# No AWS keys required\n", + "client = boto3.client('s3', config=botocore.client.Config(signature_version=botocore.UNSIGNED))" + ] }, { - "cell_type": "markdown", - "id": "geological-consistency", + "cell_type": "code", + "execution_count": 12, + "id": "dangerous-straight", + "metadata": {}, + "outputs": [], + "source": [ + "year_list = []\n", + "paginator = client.get_paginator('list_objects')\n", + "result = paginator.paginate(Bucket=era5_bucket, Delimiter='/')\n", + "for prefix in result.search('CommonPrefixes'):\n", + " year_list += 
[prefix.get('Prefix')]" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "direct-chess", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "There are 19 objects available for January, 2019\n", + "--\n", + "2019/01/data/air_pressure_at_mean_sea_level.nc\n", + "2019/01/data/air_temperature_at_2_metres.nc\n", + "2019/01/data/air_temperature_at_2_metres_1hour_Maximum.nc\n", + "2019/01/data/air_temperature_at_2_metres_1hour_Minimum.nc\n", + "2019/01/data/dew_point_temperature_at_2_metres.nc\n", + "2019/01/data/eastward_wind_at_100_metres.nc\n", + "2019/01/data/eastward_wind_at_10_metres.nc\n", + "2019/01/data/integral_wrt_time_of_surface_direct_downwelling_shortwave_flux_in_air_1hour_Accumulation.nc\n", + "2019/01/data/lwe_thickness_of_surface_snow_amount.nc\n", + "2019/01/data/northward_wind_at_100_metres.nc\n", + "2019/01/data/northward_wind_at_10_metres.nc\n", + "2019/01/data/precipitation_amount_1hour_Accumulation.nc\n", + "2019/01/data/sea_surface_temperature.nc\n", + "2019/01/data/sea_surface_wave_from_direction.nc\n", + "2019/01/data/sea_surface_wave_mean_period.nc\n", + "2019/01/data/significant_height_of_wind_and_swell_waves.nc\n", + "2019/01/data/snow_density.nc\n", + "2019/01/data/surface_air_pressure.nc\n", + "2019/01/main.nc\n" + ] + } + ], "source": [ - "### Tabular Data - Pandas" + "keys = []\n", + "date = datetime.date(2019,1,1) # update to desired date\n", + "prefix = date.strftime('%Y/%m/')\n", + "\n", + "response = client.list_objects_v2(Bucket=era5_bucket, Prefix=prefix)\n", + "response_meta = response.get('ResponseMetadata')\n", + "\n", + "if response_meta.get('HTTPStatusCode') == 200:\n", + " contents = response.get('Contents')\n", + " if contents == None:\n", + " print(\"No objects are available for %s\" % date.strftime('%B, %Y'))\n", + " else:\n", + " for obj in contents:\n", + " keys.append(obj.get('Key'))\n", + " print(\"There are %s objects available for %s\\n--\" % (len(keys), 
date.strftime('%B, %Y')))\n", + " for k in keys:\n", + " print(k)\n", + "else:\n", + " print(\"There was an error with your request.\")" ] }, { "cell_type": "code", - "execution_count": null, - "id": "fifteen-allowance", + "execution_count": 14, + "id": "identical-essay", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "fs1 = s3fs.S3FileSystem()" + ] }, { - "cell_type": "markdown", - "id": "lyric-turner", + "cell_type": "code", + "execution_count": 15, + "id": "dimensional-throat", "metadata": {}, + "outputs": [], "source": [ - "## Gridded data -iris and xarray" + "path1 = f's3://{era5_bucket}/2019/01/data/air_temperature_at_2_metres.nc'" ] }, { "cell_type": "code", - "execution_count": null, - "id": "approximate-interference", + "execution_count": 16, + "id": "buried-classification", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "fileObj1 = fs1.open(path1)" + ] }, { - "cell_type": "markdown", - "id": "packed-heading", + "cell_type": "code", + "execution_count": 17, + "id": "accessory-roller", "metadata": {}, + "outputs": [], "source": [ - "## Sharing data - FAIR principles" + "temp_201901 = xr.open_dataset(fileObj1, engine='h5netcdf')\n" ] }, { "cell_type": "code", - "execution_count": null, - "id": "solved-rental", + "execution_count": 18, + "id": "going-report", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.Dataset>\n",
+       "Dimensions:                      (lat: 721, lon: 1440, time0: 744)\n",
+       "Coordinates:\n",
+       "  * lon                          (lon) float32 0.0 0.25 0.5 ... 359.5 359.8\n",
+       "  * lat                          (lat) float32 90.0 89.75 89.5 ... -89.75 -90.0\n",
+       "  * time0                        (time0) datetime64[ns] 2019-01-01 ... 2019-0...\n",
+       "Data variables:\n",
+       "    air_temperature_at_2_metres  (time0, lat, lon) float32 ...\n",
+       "Attributes:\n",
+       "    source:       Reanalysis\n",
+       "    institution:  ECMWF\n",
+       "    title:        ERA5 forecasts
" + ], + "text/plain": [ + "\n", + "Dimensions: (lat: 721, lon: 1440, time0: 744)\n", + "Coordinates:\n", + " * lon (lon) float32 0.0 0.25 0.5 ... 359.5 359.8\n", + " * lat (lat) float32 90.0 89.75 89.5 ... -89.75 -90.0\n", + " * time0 (time0) datetime64[ns] 2019-01-01 ... 2019-0...\n", + "Data variables:\n", + " air_temperature_at_2_metres (time0, lat, lon) float32 ...\n", + "Attributes:\n", + " source: Reanalysis\n", + " institution: ECMWF\n", + " title: ERA5 forecasts" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "temp_201901" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "disturbed-maldives", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "temp_201901_array = temp_201901.air_temperature_at_2_metres" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "opponent-ballot", + "metadata": {}, + "outputs": [], + "source": [ + "europe_temp = temp_201901.loc[{'time0':slice('2019-01-05','2019-01-07'), 'lat':slice(60,40),'lon':slice(0,15)}].air_temperature_at_2_metres" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "black-think", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.DataArray 'air_temperature_at_2_metres' (time0: 72, lat: 81, lon: 61)>\n",
+       "[355752 values with dtype=float32]\n",
+       "Coordinates:\n",
+       "  * lon      (lon) float32 0.0 0.25 0.5 0.75 1.0 ... 14.0 14.25 14.5 14.75 15.0\n",
+       "  * lat      (lat) float32 60.0 59.75 59.5 59.25 59.0 ... 40.75 40.5 40.25 40.0\n",
+       "  * time0    (time0) datetime64[ns] 2019-01-05 ... 2019-01-07T23:00:00\n",
+       "Attributes:\n",
+       "    least_significant_digit:  [1]\n",
+       "    standard_name:            air_temperature\n",
+       "    units:                    K\n",
+       "    long_name:                2 metre temperature\n",
+       "    nameECMWF:                2 metre temperature\n",
+       "    shortNameECMWF:           2t\n",
+       "    nameCDM:                  2_metre_temperature_surface\n",
+       "    product_type:             analysis
" + ], + "text/plain": [ + "\n", + "[355752 values with dtype=float32]\n", + "Coordinates:\n", + " * lon (lon) float32 0.0 0.25 0.5 0.75 1.0 ... 14.0 14.25 14.5 14.75 15.0\n", + " * lat (lat) float32 60.0 59.75 59.5 59.25 59.0 ... 40.75 40.5 40.25 40.0\n", + " * time0 (time0) datetime64[ns] 2019-01-05 ... 2019-01-07T23:00:00\n", + "Attributes:\n", + " least_significant_digit: [1]\n", + " standard_name: air_temperature\n", + " units: K\n", + " long_name: 2 metre temperature\n", + " nameECMWF: 2 metre temperature\n", + " shortNameECMWF: 2t\n", + " nameCDM: 2_metre_temperature_surface\n", + " product_type: analysis" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "europe_temp" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "european-theater", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "\n", + " \n", + " \n", + "\n", + "
Air Temperature (K)    time    latitude    longitude
Shape                  72      81          61
Dimension coordinates
\ttime                 x       -           -
\tlatitude             -       x           -
\tlongitude            -       -           x
Attributes
\tleast_significant_digit    [1]
\tnameCDM                    2_metre_temperature_surface
\tnameECMWF                  2 metre temperature
\tproduct_type               analysis
\tshortNameECMWF             2t
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "europe_temp_cube = europe_temp.to_iris()\n", + "europe_temp_cube" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "individual-geneva", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "masked_array(data=[276.5, 276.6875, 276.875, 276.75, 276.6875, 276.6875,\n", + " 276.6875, 276.6875, 276.5625, 276.625, 277.1875, 277.5,\n", + " 277.8125, 277.9375, 278.0, 277.875, 277.25, 276.9375,\n", + " 276.875, 276.875, 277.0, 277.0, 277.0625, 277.3125,\n", + " 277.5, 277.5, 277.5625, 277.625, 277.6875, 278.0,\n", + " 278.3125, 278.1875, 278.3125, 278.3125, 278.8125,\n", + " 279.4375, 280.0625, 280.375, 280.75, 280.8125,\n", + " 280.9375, 280.75, 280.5625, 280.3125, 280.1875, 280.0,\n", + " 281.1875, 281.25, 281.4375, 281.3125, 281.125,\n", + " 280.8125, 280.6875, 280.5, 280.5625, 280.4375,\n", + " 280.4375, 280.6875, 281.1875, 281.75, 282.375,\n", + " 282.5625, 282.5625, 282.625, 282.75, 283.0, 283.0625,\n", + " 282.9375, 282.8125, 282.5625, 282.6875, 282.8125],\n", + " mask=[False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False,\n", + " False, False, False, False, False, False, False, False],\n", + " fill_value=1e+20,\n", + " dtype=float32)" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "europe_temp_cube.interpolate([('latitude', 51.0),('longitude', 1)],iris.analysis.Linear()).data" + 
] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "hazardous-cassette", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig1 = matplotlib.pyplot.figure(figsize=(8,16))\n", + "ax1 = fig1.add_subplot(1,1,1,projection=cartopy.crs.PlateCarree())\n", + "iris.quickplot.contourf(europe_temp_cube[0,:,:],axes=ax1)\n", + "ax1.coastlines()" + ] }, { "cell_type": "markdown", - "id": "genuine-sandwich", + "id": "random-deadline", "metadata": {}, "source": [ - "### Data Catalogues" + "### Data Catalogues\n", + "\n", + "We have looked at some datasets which we have accessed primarily as individual files. This is the second level in the levels of data abstraction presented earlier. The user still needs to know a lot about reading files and then joining the data read in from files into a coherent object in memory. We want to move to presenting data not just as a series of dozens or even millions of files that the user has to join together. We want to abstract away the storage details and present the user with a **dataset**. We would ideally like to group similar datasets together to make them accessible through **data catalogues**. One tool for doing this is the *Intake* catalog library. \n", + "\n", + "This is a framework for creating catalogues of datasets and accessing them in such a way that the user gets back from the catalogue an object in memory that is ready for analysis. This is usually a *lazy object*, where the metadata is loaded so there is sufficient information to schedule computation through something like a Dask task graph, but the actual data will only be loaded when the computation of the task graph is triggered. At that time the data that is actually needed for the computation will be downloaded, with each distributed worker downloading only the data needed for its part in the task graph. 
Catalogues also support searching and subsetting, so you can find the data in the catalogue easily and then access only the data that you need.\n", + "\n", + "Intake has 2 main parts:\n", + "* *Catalog yaml files* - A yaml file describing the data in the catalog.\n", + " * These can be nested, so a catalog can be composed of the contents of several other catalogues.\n", + "* *Driver python modules* - Access to the data is handled through drivers.\n", + " * Intake supplies drivers in the standard install for the most common data types e.g. CSV\n", + " * Many different data formats can be supported through this framework by implementing additional drivers\n", + " * Drivers can include custom functionality needed to support different modes of accessing particular sorts of data.\n", + "\n", + "Catalogs can be installed like a normal python package, through conda, pip or similar. Installing a catalog adds its entries to intake.\n", + " \n", + "More information:\n", + "* Intake https://intake.readthedocs.io/en/latest/\n", + "* Intake ESM - A package for accessing earth system data\n", + "* Met Office Weather Forecast Model Intake Catalog - https://github.com/informatics-lab/intake_informaticslab" ] }, { "cell_type": "code", - "execution_count": null, - "id": "organized-specific", + "execution_count": 25, + "id": "known-arkansas", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "import intake" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "choice-spirituality", + "metadata": {}, + "outputs": [], + "source": [ + "def print_cat_items(cat, indent=0):\n", + " if(isinstance(cat, intake.catalog.Catalog)):\n", + " for cat_item in list(cat):\n", + " print(\" \"*indent + cat_item)\n", + " print_cat_items(cat[cat_item], indent+2)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "pressing-switch", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "met_office\n", + " air_quality\n", + " 
air_quality_hourly\n", + " air_quality_daily\n", + " weather_forecasts\n", + " mogreps_uk\n", + " single_level\n", + " height_level\n", + " pressure_level\n", + " depth_level\n", + " mogreps_g\n", + " single_level\n", + " height_level\n", + " pressure_level\n", + " depth_level\n", + " weather_continuous_timeseries\n", + " ukv_daily_timeseries\n", + " ukv_hourly_timeseries\n" + ] + } + ], + "source": [ + "print_cat_items(intake.cat)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "excellent-serum", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.Dataset>\n",
+       "Dimensions:                                                  (forecast_period: 127, forecast_reference_time: 408, latitude: 960, longitude: 1280, realization: 18)\n",
+       "Coordinates:\n",
+       "  * forecast_period                                          (forecast_period) timedelta64[ns] ...\n",
+       "  * forecast_reference_time                                  (forecast_reference_time) datetime64[ns] ...\n",
+       "  * latitude                                                 (latitude) float64 ...\n",
+       "  * longitude                                                (longitude) float64 ...\n",
+       "  * realization                                              (realization) int64 ...\n",
+       "Data variables: (12/24)\n",
+       "    cloud_amount_of_low_cloud                                (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    cloud_amount_of_medium_cloud                             (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    cloud_amount_of_total_cloud                              (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    fog_fraction_at_screen_level                             (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    hail_fall_accumulation-PT01H                             (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    height_ASL_at_freezing_level                             (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    ...                                                       ...\n",
+       "    temperature_at_screen_level_min-PT01H                    (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    temperature_at_surface                                   (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    visibility_at_screen_level                               (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    wind_direction_at_10m                                    (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    wind_speed_at_10m                                        (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>\n",
+       "    wind_speed_at_10m_max-PT01H                              (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array<chunksize=(1, 1, 18, 960, 1280), meta=np.ndarray>
" + ], + "text/plain": [ + "\n", + "Dimensions: (forecast_period: 127, forecast_reference_time: 408, latitude: 960, longitude: 1280, realization: 18)\n", + "Coordinates:\n", + " * forecast_period (forecast_period) timedelta64[ns] ...\n", + " * forecast_reference_time (forecast_reference_time) datetime64[ns] ...\n", + " * latitude (latitude) float64 ...\n", + " * longitude (longitude) float64 ...\n", + " * realization (realization) int64 ...\n", + "Data variables: (12/24)\n", + " cloud_amount_of_low_cloud (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " cloud_amount_of_medium_cloud (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " cloud_amount_of_total_cloud (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " fog_fraction_at_screen_level (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " hail_fall_accumulation-PT01H (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " height_ASL_at_freezing_level (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " ... 
...\n", + " temperature_at_screen_level_min-PT01H (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " temperature_at_surface (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " visibility_at_screen_level (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " wind_direction_at_10m (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " wind_speed_at_10m (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array\n", + " wind_speed_at_10m_max-PT01H (forecast_reference_time, forecast_period, realization, latitude, longitude) float32 dask.array" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mogreps_g_ds = intake.cat.met_office.weather_forecasts.mogreps_g.single_level(license_accepted=True).to_dask()\n", + "mogreps_g_ds" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "upper-spine", + "metadata": {}, + "outputs": [], + "source": [ + "subset_uk = mogreps_g_ds.loc[{'forecast_reference_time': '2021-03-01T00:00:00', 'realization': 0, 'latitude': slice(50, 60), 'longitude': slice(-5,5)}]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "substantial-check", + "metadata": {}, + "outputs": [], + "source": [ + "import iris.quickplot" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "dominican-powell", + "metadata": {}, + "outputs": [], + "source": [ + "time_series_exeter = subset_uk.temperature_at_surface.interp(latitude=51, longitude=-3)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "competent-nature", + "metadata": {}, + "outputs": [], + "source": [ + "time_series_exeter_cube = time_series_exeter.to_iris()" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": 
"empirical-marketplace", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "iris.quickplot.plot(time_series_exeter_cube)" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "accredited-electronics", + "metadata": {}, + "outputs": [], + "source": [ + "temp_cube = subset_uk.temperature_at_surface[0,:,:].to_iris()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "aquatic-civilization", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "fig1 = matplotlib.pyplot.figure(figsize=(8,16))\n", + "ax1 = fig1.add_subplot(1,1,1,projection=cartopy.crs.PlateCarree())\n", + "iris.quickplot.contourf(temp_cube,axes=ax1)\n", + "ax1.coastlines()" + ] + }, + { + "cell_type": "markdown", + "id": "electoral-challenge", + "metadata": {}, + "source": [ + "In this example, we see that there is intuitive access to the dataset as a whole and to parts of the dataset. We can perform simple operations with minimal code, and interact using physically meaningful values, e.g. latitude and longitude, rather than arbitrary numerical indices. We can see here that a catalogue provides a more intuitive way for a researcher to access the data without having to understand the technical details of how it is being stored. \n", + "\n", + "Features of this catalogue include:\n", + "* Quick access to metadata to see what the dataset contains.\n", + "* Lazy loading, so operations can be set up quickly.\n", + "* An intuitive interface to the data through meaningful values, e.g. latitude, longitude.\n", + "* A task graph is created based on the lazy data object.\n", + "* When computation is triggered, it is distributed to workers in a cluster.\n", + "* Results can be quickly computed and gathered to be interacted with in the notebook." 
+ ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "pangeo_lectures", "language": "python", - "name": "python3" + "name": "pangeo_lectures" }, "language_info": { "codemirror_mode": { @@ -177,7 +5968,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.9" + "version": "3.7.10" } }, "nbformat": 4, diff --git a/05_practical_next_steps.ipynb b/05_practical_next_steps.ipynb new file mode 100644 index 0000000..c1a52e0 --- /dev/null +++ b/05_practical_next_steps.ipynb @@ -0,0 +1,106 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "standing-reception", + "metadata": {}, + "source": [ + "# Welcome to the Pangeo Lifestyle! \n", + "\n", + "![Pangeo Logo ](images/pangeo_logo_small.png)\n", + "\n", + "[Pangeo Website](https://pangeo.io/)\n", + "\n", + "Hopefully, while making your way through these notebooks, you have been convinced of the importance of a scientific computing platform that is fit for purpose, combining a rich, interactive user experience with the capacity to operate at the scales required by the size of the datasets used in today's cutting-edge research. The philosophy of how to approach the challenges of big data is as important as the stack of software that forms a particular implementation of the Pangeo model of a computing platform. This is an evolving field, and as tools mature and new technologies become more widely available (e.g. FPGAs and ASICs in the short term, genetic and quantum computing in the long term), the way that this model is implemented will continue to evolve, but the principles behind what a good system looks like will remain.\n", + "\n", + "The next step is to try some of this out for yourself, ideally on an actual research problem you are considering, to see how this approach to data and computing can help your research. Below are some links to help you get started with a Pangeo model of computing." 
+ ] + }, + { + "cell_type": "markdown", + "id": "increased-vegetable", + "metadata": {}, + "source": [ + "### Getting Started - A Platform\n", + "\n", + "* JupyterLab on a local machine or VM - https://jupyter.org/\n", + "* Notebooks on JASMIN - https://help.jasmin.ac.uk/article/4851-jasmin-notebook-service\n", + "* Commercial Cloud\n", + " * Azure ML - https://azure.microsoft.com/en-gb/services/machine-learning/\n", + " * AWS SageMaker - https://aws.amazon.com/sagemaker/\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "committed-double", + "metadata": {}, + "source": [ + "### Getting Started - Example Notebooks\n", + "\n", + "* Pangeo Gallery - http://gallery.pangeo.io/\n", + "* Iris gallery - https://scitools-iris.readthedocs.io/en/latest/generated/gallery/index.html\n", + "* Kaggle competitions - https://www.kaggle.com/\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "collectible-cancellation", + "metadata": {}, + "source": [ + "### Getting Started - Example Data\n", + "\n", + "* STAC - https://stacspec.org/\n", + "* Pangeo Data - https://mldata.pangeo.io/\n", + "* AWS Earth - https://aws.amazon.com/earth/\n", + "* AI for Earth - https://www.microsoft.com/en-us/ai/ai-for-earth-tech-resources\n" + ] + }, + { + "cell_type": "markdown", + "id": "express-spoke", + "metadata": {}, + "source": [ + "### More Talks\n", + "\n", + "* Pangeo on YouTube - https://www.youtube.com/playlist?list=PLuQQBBQFfpgpF3eGnlgXNqWmXhND3Vzg_\n" + ] + }, + { + "cell_type": "markdown", + "id": "unique-facing", + "metadata": {}, + "source": [ + "### Get Involved - How to contribute\n", + "\n", + "Pangeo is primarily a community of people trying to solve common problems around scientific computing and big data. 
A good way to learn is to get involved in community efforts.\n", + "\n", + "* Community Homepage - https://pangeo.io/\n", + "* Discourse for Discussion - https://discourse.pangeo.io/\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_amazonei_mxnet_p27", + "language": "python", + "name": "conda_amazonei_mxnet_p27" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.16" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/images/dask_taskGraph.png b/images/dask_taskGraph.png new file mode 100644 index 0000000..1f6b33d Binary files /dev/null and b/images/dask_taskGraph.png differ diff --git a/images/data_description_levels.png b/images/data_description_levels.png new file mode 100644 index 0000000..4a7e524 Binary files /dev/null and b/images/data_description_levels.png differ diff --git a/images/pangeoStackElements_buildYourOwn.png b/images/pangeoStackElements_buildYourOwn.png new file mode 100644 index 0000000..e791c23 Binary files /dev/null and b/images/pangeoStackElements_buildYourOwn.png differ diff --git a/requirements.yml b/requirements.yml index f6ddfa0..e26aa1f 100644 --- a/requirements.yml +++ b/requirements.yml @@ -2,6 +2,7 @@ name: pangeo-lectures-env channels: - defaults - conda-forge + - informaticslab dependencies: - python=3.7 - jupyter @@ -13,4 +14,11 @@ dependencies: - cartopy - bokeh - scikit-learn - - jupyterlab \ No newline at end of file + - jupyterlab + - intake + - intake-esm + - s3fs + - adlfs + - gcsfs + - boto3 + - intake_informaticslab diff --git a/sagemaker_setup.sh b/sagemaker_setup.sh new file mode 100644 index 0000000..d38b65c --- /dev/null +++ b/sagemaker_setup.sh @@ -0,0 +1,7 @@ +conda env create --file SageMaker/PangeoLectures/requirements.yml + +conda activate pangeo-lectures-env + +python -m 
ipykernel install --user --name pangeo-lectures + +conda deactivate \ No newline at end of file