# Calculating averages and standard deviations - ARCTIC

Purpose:
--------
This notebook calculates the means and standard deviations that go into the configuration file used to train CGnet for machine-learning detection of atmospheric rivers and tropical cyclones.\
See the ClimateNet repo here: https://github.com/andregraubner/ClimateNet

Prerequisites:
--------------
* Data processed here: /glade/work/tking/cgnet/QA_xml/all_arctic_converted_masks/split_files/
    * notebook to generate h5 files: /glade/work/tking/cgnet/QA_xml/XML_to_h5.ipynb
    * add underlying data with PolarLatLon.ipynb
* Possibly also data processed here: /glade/derecho/scratch/tking/cgnet/high_lat_QC/from_nersc/2dlatlon/polar/renamed/tmq/formatted_for_inference/
    * script: see /glade/work/tking/cgnet/ML-extremes/scripts/batch_std.sh and standard_dev.py

Authors/Contributors:
---------------------
* Teagan King
* John Truesdale
* Katie Dagon

## Import libraries

In [1]:
import xarray as xr
import numpy as np

### Set up Dask

In [2]:
# Import dask
import dask

# Use dask jobqueue
from dask_jobqueue import PBSCluster

# Import a client
from dask.distributed import Client

# Setup your PBSCluster
nmem1 = '10GiB' # PBSCluster specification
nmem2 = '10GB' # pbs specification
cluster = PBSCluster(
    cores=1, # The number of cores you want
    memory=nmem1, # Amount of memory
    processes=1, # How many processes
    queue='casper', # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
    local_directory='/glade/derecho/scratch/$USER/local_dask', # Use your local directory
    resource_spec='select=1:ncpus=1:mem='+nmem2, # Specify resources
    account='P06010014', # Input your project ID here, previously this was known as 'project', now is 'account'
    walltime='02:00:00', # Amount of wall time
    # interface='ib0', # Interface to use
)

# Scale up
cluster.scale(30)

# Change your url to the dask dashboard so you can see it
dask.config.set({'distributed.dashboard.link':'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/proxy/{port}/status'})

# Setup your client
client = Client(cluster)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 39873 instead


In [3]:
dask.config.set({'distributed.dashboard.link':'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/GPU/proxy/{port}/status'}) # need to include name of server if named!

<dask.config.set at 0x14d5c41d2760>

In [5]:
cluster.scale(0)

In [4]:
client

Connection method: Cluster object
Cluster type: dask_jobqueue.PBSCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/tking/GPU/proxy/39873/status

Workers: 0
Total threads: 0
Total memory: 0 B

Comm: tcp://128.117.208.97:41253
Started: Just now


## Set file paths

In [22]:
file_path = '/glade/work/tking/cgnet/QA_xml/all_arctic_converted_masks/split_files/*/'

## Open training data

### Original training data from NERSC

In [23]:
%%time
# ds_pr = xr.open_mfdataset(file_path+"pr_*.nc", concat_dim='time', combine='nested', parallel=True)
# ds_prw = xr.open_mfdataset(file_path+"prw_*.nc", concat_dim='time', combine='nested', parallel=True)
# ds_psl = xr.open_mfdataset(file_path+"psl_*.nc", concat_dim='time', combine='nested', parallel=True)
# ds_ua850 = xr.open_mfdataset(file_path+"ua850_*.nc", concat_dim='time', combine='nested', parallel=True)
# ds_va850 = xr.open_mfdataset(file_path+"va850_*.nc", concat_dim='time', combine='nested', parallel=True)
# ds_windhusavi = xr.open_mfdataset(file_path+"windhusavi_*.nc", concat_dim='time', combine='nested', parallel=True)

ds = xr.open_mfdataset(file_path+"*.nc", concat_dim='time', combine='nested', parallel=True)

# The polar files are separated by variable instead of containing all vars

Task exception was never retrieved
future: <Task finished name='Task-48727' coro=<Client._gather.<locals>.wait() done, defined at /glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/site-packages/distributed/client.py:1994> exception=AllExit()>
Traceback (most recent call last):
  File "/glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/site-packages/distributed/client.py", line 1999, in wait
    raise AllExit()
distributed.client.AllExit


CPU times: user 15 s, sys: 7.26 s, total: 22.3 s
Wall time: 2min 27s


## Function to calculate unweighted global mean and standard deviation

In [6]:
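def mean_std(ds, var):
    """Return the unweighted mean and standard deviation of ds[var]."""
    # unweighted mean across all dimensions (time and space)
    var_mean = ds[var].mean().compute()

    # unweighted std across all dimensions (time and space)
    var_std = ds[var].std().compute()

    return var_mean, var_std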
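
The function above treats every grid cell equally. On a regular latitude-longitude grid, cells near the pole cover less area than cells farther south, so an area-weighted statistic can differ from the unweighted one. The block below is a minimal sketch of a cosine-latitude weighting using plain NumPy. It assumes a `(time, lat, lon)` array with a 1D latitude coordinate in degrees. The function name `weighted_mean_std` and the toy data are hypothetical, and the polar files used in this notebook may need differently constructed weights (for example if latitude is stored as a 2D array).

```python
import numpy as np


def weighted_mean_std(field, lat_deg):
    """Area-weighted mean and std of a (time, lat, lon) array.

    Assumption: regular lat-lon grid, so cell area scales with cos(latitude).
    """
    weights = np.cos(np.deg2rad(lat_deg))  # shape (lat,)
    # Expand the 1D latitude weights to the full shape of the field
    w = np.broadcast_to(weights[None, :, None], field.shape)
    mean = np.average(field, weights=w)
    var = np.average((field - mean) ** 2, weights=w)
    return float(mean), float(np.sqrt(var))


# Toy data: 2 time steps, 3 latitudes, 4 longitudes.
# Each latitude row holds a constant value (1, 2, 3 from south to north).
lat = np.array([60.0, 75.0, 89.0])
field = np.ones((2, 3, 4)) * np.array([1.0, 2.0, 3.0])[None, :, None]

w_mean, w_std = weighted_mean_std(field, lat)
print("unweighted mean:", field.mean())
print("weighted mean:  ", w_mean)
print("weighted std:   ", w_std)
```

In this toy case the weighted mean falls below the unweighted mean because the southernmost row, which holds the smallest values, carries the largest weight.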

## Calculate values

### Original NERSC training data

In [33]:
%%time
all_means = {}
all_std = {}
for var in ['pr', 'tmq', 'ivt', 'psl']:
    means, std = mean_std(ds, var)
    all_means[var] = means
    all_std[var] = std

CPU times: user 7.69 s, sys: 468 ms, total: 8.16 s
Wall time: 39 s


In [35]:
for var in ['pr', 'tmq', 'ivt', 'psl']:
    print("{} unweighted global mean: {}".format(var, all_means[var].values))
    print("{} unweighted std: {}".format(var, all_std[var].values))

pr unweighted global mean: 3.307231634227589e-05
pr unweighted std: 0.00013132695522992795

tmq unweighted global mean: 21.593790761325735
tmq unweighted std: 13.6215941728325

ivt unweighted global mean: 175.08854037051293
ivt unweighted std: 147.9106153573991

psl unweighted global mean: 101001.37087934873
psl unweighted std: 1387.3006430965993
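
These per-variable means and standard deviations are the values destined for the training configuration file, where they are typically used to standardize each input field. The sketch below shows that standardization (a z-score) and one way to serialize the statistics as JSON, using the `tmq` and `psl` values printed above. The `"fields"` key and the `mean`/`std` layout are assumptions about the configuration format and should be checked against the ClimateNet repo. The helper `normalize` is hypothetical.

```python
import json

# Values copied from the printed output above (tmq and psl only, for brevity).
stats = {
    "tmq": {"mean": 21.593790761325735, "std": 13.6215941728325},
    "psl": {"mean": 101001.37087934873, "std": 1387.3006430965993},
}


def normalize(value, var):
    """Standardize a raw value with the stored mean and std (z-score)."""
    return (value - stats[var]["mean"]) / stats[var]["std"]


# Hypothetical config fragment; the "fields" key name is an assumption.
config_fragment = json.dumps({"fields": stats}, indent=2)
print(config_fragment)

# A value one standard deviation above the mean maps to approximately 1.
one_sigma = stats["tmq"]["mean"] + stats["tmq"]["std"]
print(normalize(one_sigma, "tmq"))
```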



## Calculate means and standard deviations for inference data

In [None]:
inference_path = '/glade/derecho/scratch/tking/cgnet/high_lat_QC/from_nersc/2dlatlon/polar/renamed/tmq/formatted_for_inference/'



In [None]:
%%time
ds = xr.open_mfdataset(inference_path+"*.nc", concat_dim='time', combine='nested', parallel=True)

In [5]:
ds['tmq'].mean().compute()

tmq mean is 176.30763

In [None]:
ds['tmq'].std().compute()

tmq std is 148.23515

In [None]:
# end
print('done')

In [None]:
# script: see /glade/work/tking/cgnet/ML-extremes/scripts/batch_std.sh and standard_dev.py
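
The contents of `batch_std.sh` and `standard_dev.py` are not shown in this notebook. As a generic illustration only, and not a description of those scripts, the sketch below shows one way a mean and standard deviation can be accumulated chunk by chunk (for example file by file) without holding all of the data in memory. The function name `combine_stats` and the random test data are hypothetical.

```python
import numpy as np


def combine_stats(chunks):
    """Return the overall mean and population std of several arrays.

    Accumulates the count, sum, and sum of squares per chunk, so only one
    chunk needs to be in memory at a time.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:
        data = np.asarray(chunk, dtype=np.float64)
        n += data.size
        total += data.sum()
        total_sq += np.square(data).sum()
    mean = total / n
    # Clip tiny negative values caused by floating-point round-off
    var = max(total_sq / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)


# Synthetic stand-in for per-file data
rng = np.random.default_rng(0)
chunks = [rng.normal(20.0, 5.0, size=1000) for _ in range(4)]

mean, std = combine_stats(chunks)
print(mean, std)
```

The result is a population standard deviation (`ddof=0`), which matches the default of `.std()` in both NumPy and xarray. The sum-of-squares form can lose precision when the mean is very large relative to the spread, so a numerically safer update such as Welford's algorithm may be preferable in that case.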