# Training CGNet

Purpose:
--------
This notebook trains CGNet, a machine learning model for detecting atmospheric rivers (ARs) and tropical cyclones (TCs).\
See ClimateNet repo here: https://github.com/andregraubner/ClimateNet

Authors/Contributors:
---------------------
* Teagan King
* John Truesdale
* Katie Dagon

## Import libraries

In [1]:
import os
import sys
import json
import numpy as np

sys.path.append("/glade/work/kdagon/ClimateNet") # append path to ClimateNet repo
from climatenet.utils.data import ClimateDatasetLabeled, ClimateDataset
from climatenet.models import CGNet
from climatenet.utils.utils import Config
from climatenet.track_events import track_events
from climatenet.analyze_events import analyze_events
from climatenet.visualize_events import visualize_events

from os import path

## Config file
Use `get_averages_and_standard_devs.ipynb` to calculate the per-variable means and standard deviations for the given training dataset; the config file stores these values.

In [3]:
config = Config('/glade/work/kdagon/ML-extremes/trained_models/config_021523.json')

In [4]:
config.description

'The basic CGNet model. You can use this config to train your own model, or load it with our trained weights.'

In [18]:
config.train_batch_size

16
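As a rough illustration of what the means/stds calculation involves, the sketch below accumulates a mean and standard deviation over several arrays without loading them all at once. It is not the code from `get_averages_and_standard_devs.ipynb`; the function name `running_mean_std` and the random stand-in arrays are hypothetical.

```python
import numpy as np

def running_mean_std(chunks):
    """Return (mean, std) over all values in an iterable of arrays."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:
        values = np.asarray(chunk, dtype=np.float64).ravel()
        count += values.size
        total += values.sum()
        total_sq += np.square(values).sum()
    mean = total / count
    # population variance: E[x^2] - (E[x])^2, clipped at zero against round-off
    std = np.sqrt(max(total_sq / count - mean**2, 0.0))
    return mean, std

# Hypothetical stand-ins for one variable (e.g. TMQ) read from two files.
rng = np.random.default_rng(0)
file_a = rng.normal(25.0, 15.0, size=(4, 8))
file_b = rng.normal(25.0, 15.0, size=(6, 8))

mean, std = running_mean_std([file_a, file_b])
print(mean, std)
```

Accumulating sums in float64 keeps the round-off small even for variables with a large offset such as PSL.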

## Confirm GPU resources
GPUs can be requested through the JupyterHub launch page.\
Current resource request (2/15/23): 1 node, 4 CPUs, 64 GB memory, 2 V100 GPUs

In [5]:
# requires loading pytorch into environment
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())

True
2


## Instantiate CGNet model given config file

In [7]:
%%time
cgnet = CGNet(config)

CPU times: user 39 ms, sys: 1.44 ms, total: 40.5 ms
Wall time: 39.7 ms


In [19]:
cgnet.optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)

## Set train, test data

In [10]:
train_path = "/glade/campaign/cgd/amp/jet/ClimateNet_12012020/portal.nersc.gov/project/ClimateNet/climatenet_new"

train = ClimateDatasetLabeled(path.join(train_path, 'train'), config)
test = ClimateDatasetLabeled(path.join(train_path, 'test'), config)

In [13]:
train.fields

{'TMQ': {'mean': 24.927238169017997, 'std': 15.817276954650879},
 'U850': {'mean': 1.0356735863118816, 'std': 8.29762077331543},
 'V850': {'mean': 0.20847854977498861, 'std': 6.231630802154541},
 'PSL': {'mean': 101095.03520124489, 'std': 1461.225830078125}}
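The `fields` entries carry the mean and standard deviation from the config file. A minimal sketch of how such values are typically used, assuming each variable is standardized as `(x - mean) / std` before being passed to the model (the helper `standardize` and the sample values are hypothetical):

```python
import numpy as np

# Statistics copied from the `train.fields` output above (two of four variables).
fields = {
    "TMQ": {"mean": 24.927238169017997, "std": 15.817276954650879},
    "PSL": {"mean": 101095.03520124489, "std": 1461.225830078125},
}

def standardize(values, stats):
    """Shift and scale values to roughly zero mean and unit variance."""
    values = np.asarray(values, dtype=np.float32)
    return (values - stats["mean"]) / stats["std"]

# Hypothetical TMQ samples: below the mean, at the mean, above the mean.
tmq = np.array([10.0, 24.927238169017997, 40.0])
tmq_z = standardize(tmq, fields["TMQ"])
print(tmq_z)
```

Because the same statistics are reused for the test and inference datasets, inference data with a different climatology would not end up centered at zero.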

In [16]:
train.length

398
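The dataset length and the batch size together determine the iteration count shown in the progress bars. A small sketch, assuming the last partial batch is kept and that evaluation uses the same batch size of 16 (the helper `n_batches` is hypothetical):

```python
import math

def n_batches(n_samples, batch_size):
    """Number of iterations per pass when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)

train_batches = n_batches(398, 16)  # training set length and train_batch_size
test_batches = n_batches(61, 16)    # test set length, same batch size assumed
print(train_batches, test_batches)
```

These counts agree with the 25 iterations per training epoch and the 4 evaluation iterations reported in the notebook output.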

## Train model

Memory use holds at ~13.6 GB during training.\
Each epoch takes ~1 min to run.

In [11]:
cgnet.train(train) # use ~20 epochs for non-test runs
# mean IOU should be around 0.75 after all epochs?
# the Weights & Biases site could be used to track ML performance
# tuning the training hyperparameters might improve the model

  0%|          | 0/25 [00:00<?, ?it/s]

Epoch 1:


Loss: 0.7898991703987122: 100%|██████████| 25/25 [01:09<00:00,  2.80s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[2.74805881e+08 2.24246630e+07 3.32901550e+07]
 [1.10457800e+06 3.46212000e+05 1.75557000e+05]
 [5.91861400e+06 1.16004800e+06 1.28992200e+07]]
IOUs:  [0.81413377 0.01373255 0.24136139] , mean:  0.35640923478750725
Epoch 2:


Loss: 0.783986508846283: 100%|██████████| 25/25 [01:05<00:00,  2.64s/it] 
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.15537449e+08 4.87274000e+05 1.44959760e+07]
 [1.06048700e+06 4.42264000e+05 1.23596000e+05]
 [7.74903100e+06 5.88690000e+04 1.21699820e+07]]
IOUs:  [0.92988314 0.2035747  0.35175947] , mean:  0.4950724371682615
Epoch 3:


Loss: 0.7827636003494263: 100%|██████████| 25/25 [01:05<00:00,  2.61s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.16616019e+08 8.61547000e+05 1.30431330e+07]
 [7.93235000e+05 7.27465000e+05 1.05647000e+05]
 [7.41917000e+06 8.51770000e+04 1.24735350e+07]]
IOUs:  [0.93470646 0.28272247 0.37654065] , mean:  0.5313231949008911
Epoch 4:


Loss: 0.7855181097984314: 100%|██████████| 25/25 [01:06<00:00,  2.65s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17169463e+08 8.57412000e+05 1.24938240e+07]
 [7.33760000e+05 7.96472000e+05 9.61150000e+04]
 [7.20713800e+06 8.77810000e+04 1.26829630e+07]]
IOUs:  [0.93709143 0.30972569 0.38943235] , mean:  0.5454164902594998
Epoch 5:


Loss: 0.7828373908996582: 100%|██████████| 25/25 [01:05<00:00,  2.62s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17174957e+08 9.62691000e+05 1.23830510e+07]
 [6.67936000e+05 8.59138000e+05 9.92730000e+04]
 [6.86814000e+06 8.66110000e+04 1.30231310e+07]]
IOUs:  [0.93822985 0.32109518 0.40120297] , mean:  0.5535093328230187
Epoch 6:


Loss: 0.780777096748352: 100%|██████████| 25/25 [01:05<00:00,  2.62s/it] 
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17551376e+08 8.96819000e+05 1.20725040e+07]
 [6.53158000e+05 8.82849000e+05 9.03400000e+04]
 [6.75682800e+06 8.35470000e+04 1.31375070e+07]]
IOUs:  [0.93969382 0.33868285 0.40874954] , mean:  0.5623754049266888
Epoch 7:


Loss: 0.7802695631980896: 100%|██████████| 25/25 [01:05<00:00,  2.63s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17120978e+08 1.02662900e+06 1.23730920e+07]
 [5.89703000e+05 9.40475000e+05 9.61690000e+04]
 [6.48350600e+06 7.92380000e+04 1.34151380e+07]]
IOUs:  [0.93935634 0.34421718 0.41344589] , mean:  0.5656731379215608
Epoch 8:


Loss: 0.7799997925758362: 100%|██████████| 25/25 [01:05<00:00,  2.60s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17773701e+08 9.30261000e+05 1.18167370e+07]
 [6.27115000e+05 9.11871000e+05 8.73610000e+04]
 [6.51002700e+06 8.36590000e+04 1.33841960e+07]]
IOUs:  [0.94111157 0.34537075 0.41980442] , mean:  0.5687622462781506
Epoch 9:


Loss: 0.7831518054008484: 100%|██████████| 25/25 [01:04<00:00,  2.59s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17452533e+08 1.06332300e+06 1.20048430e+07]
 [5.81570000e+05 9.57589000e+05 8.71880000e+04]
 [6.27657600e+06 8.72000000e+04 1.36141060e+07]]
IOUs:  [0.94093787 0.34484474 0.42451334] , mean:  0.5700986508434784
Epoch 10:


Loss: 0.7795233130455017: 100%|██████████| 25/25 [01:03<00:00,  2.55s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17731803e+08 1.06670200e+06 1.17221940e+07]
 [5.69050000e+05 9.71772000e+05 8.55250000e+04]
 [6.09196600e+06 8.49960000e+04 1.38009200e+07]]
IOUs:  [0.94231623 0.34980427 0.43418779] , mean:  0.5754360983189896
Epoch 11:


Loss: 0.7827109098434448: 100%|██████████| 25/25 [01:04<00:00,  2.57s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17932581e+08 1.03582200e+06 1.15522960e+07]
 [5.60802000e+05 9.82011000e+05 8.35340000e+04]
 [6.08588900e+06 7.74770000e+04 1.38145160e+07]]
IOUs:  [0.94295175 0.35844449 0.43697861] , mean:  0.5794582830442437
Epoch 12:


Loss: 0.7796162366867065: 100%|██████████| 25/25 [01:03<00:00,  2.54s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.17774804e+08 1.09885800e+06 1.16470370e+07]
 [5.56496000e+05 9.89295000e+05 8.05560000e+04]
 [5.87684900e+06 8.38020000e+04 1.40172310e+07]]
IOUs:  [0.94308055 0.35218673 0.44210759] , mean:  0.5791249552464285
Epoch 13:


Loss: 0.7797759771347046: 100%|██████████| 25/25 [01:04<00:00,  2.56s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.1838413e+08 1.0574860e+06 1.1079083e+07]
 [5.4822000e+05 9.9669300e+05 8.1434000e+04]
 [5.8321630e+06 7.5790000e+04 1.4069929e+07]]
IOUs:  [0.94503742 0.36116999 0.45185139] , mean:  0.5860196006838368
Epoch 14:


Loss: 0.78229820728302: 100%|██████████| 25/25 [01:03<00:00,  2.54s/it]  
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.18053095e+08 1.06192100e+06 1.14056830e+07]
 [5.45596000e+05 9.97864000e+05 8.28870000e+04]
 [5.66397100e+06 8.43180000e+04 1.42295930e+07]]
IOUs:  [0.94453373 0.35990371 0.45221473] , mean:  0.5855507246464694
Epoch 15:


Loss: 0.7772184610366821: 100%|██████████| 25/25 [01:03<00:00,  2.56s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.18413168e+08 1.06169500e+06 1.10458360e+07]
 [5.42639000e+05 1.00391400e+06 7.97940000e+04]
 [5.58662100e+06 7.77810000e+04 1.43134800e+07]]
IOUs:  [0.94582863 0.36297117 0.46018855] , mean:  0.5896627796592041
Epoch 16:


Loss: 0.7783818244934082: 100%|██████████| 25/25 [01:03<00:00,  2.56s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.1822104e+08 1.0560520e+06 1.1243607e+07]
 [5.3489300e+05 1.0139800e+06 7.7474000e+04]
 [5.4872380e+06 7.8153000e+04 1.4412491e+07]]
IOUs:  [0.94555882 0.3673106  0.46047823] , mean:  0.5911158798631818
Epoch 17:


Loss: 0.7813333868980408: 100%|██████████| 25/25 [01:04<00:00,  2.58s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.18298695e+08 1.05524800e+06 1.11667560e+07]
 [5.34563000e+05 1.01044400e+06 8.13400000e+04]
 [5.37576000e+06 8.05990000e+04 1.45215230e+07]]
IOUs:  [0.94610388 0.3658121  0.46504622] , mean:  0.5923207341773779
Epoch 18:


Loss: 0.7830923795700073: 100%|██████████| 25/25 [01:03<00:00,  2.55s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.18462197e+08 1.04817900e+06 1.10103230e+07]
 [5.30976000e+05 1.01621500e+06 7.91560000e+04]
 [5.35913400e+06 8.24270000e+04 1.45363210e+07]]
IOUs:  [0.94664674 0.36860077 0.46789687] , mean:  0.594381461820222
Epoch 19:


Loss: 0.7803710699081421: 100%|██████████| 25/25 [01:03<00:00,  2.52s/it]
  0%|          | 0/25 [00:00<?, ?it/s]

Epoch stats:
[[3.18753268e+08 1.05405600e+06 1.07133750e+07]
 [5.21263000e+05 1.02060300e+06 8.44810000e+04]
 [5.28368200e+06 7.75780000e+04 1.46166220e+07]]
IOUs:  [0.9477519  0.3700544  0.47493977] , mean:  0.5975820234985322
Epoch 20:


Loss: 0.7817890048027039: 100%|██████████| 25/25 [01:03<00:00,  2.55s/it]

Epoch stats:
[[3.18470937e+08 1.06732100e+06 1.09824410e+07]
 [5.12900000e+05 1.03284500e+06 8.06020000e+04]
 [5.20397100e+06 7.82790000e+04 1.46956320e+07]]
IOUs:  [0.94716048 0.37260633 0.47342764] , mean:  0.5977314837509967





Training plateaus at a mean IOU of about 0.5977 in the last epoch (per-class IOUs of 0.947, 0.373, and 0.473). Possible next steps: increase the number of epochs, or adjust the batch size and other hyperparameters.
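The IOU values in the training log can be recomputed from the 3x3 matrix printed under "Epoch stats", assuming that matrix is a confusion matrix of pixel counts. A minimal sketch (the helper `iou_per_class` is hypothetical, not a ClimateNet function); the result does not depend on whether rows are labels or predictions, because the union uses both the row and the column sum:

```python
import numpy as np

def iou_per_class(confusion_matrix):
    """Intersection over union for each class of a square confusion matrix."""
    cm = np.asarray(confusion_matrix, dtype=np.float64)
    intersection = np.diag(cm)
    # union = all pixels labeled OR predicted as the class
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return intersection / union

# Matrix copied from the "Epoch stats" output of epoch 20.
epoch20 = np.array([
    [3.18470937e+08, 1.06732100e+06, 1.09824410e+07],
    [5.12900000e+05, 1.03284500e+06, 8.06020000e+04],
    [5.20397100e+06, 7.82790000e+04, 1.46956320e+07],
])

ious = iou_per_class(epoch20)
print(ious, ious.mean())
```

The first class dominates the pixel counts, so its high IOU pulls the mean up while the two smaller classes stay below 0.5.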

## Evaluate model on test data

In [22]:
test.fields

{'TMQ': {'mean': 24.927238169017997, 'std': 15.817276954650879},
 'U850': {'mean': 1.0356735863118816, 'std': 8.29762077331543},
 'V850': {'mean': 0.20847854977498861, 'std': 6.231630802154541},
 'PSL': {'mean': 101095.03520124489, 'std': 1461.225830078125}}

In [23]:
test.length

61

In [24]:
%%time
cgnet.evaluate(test)

100%|██████████| 4/4 [00:09<00:00,  2.46s/it]

Evaluation stats:
[[4.7718497e+07 1.6514600e+05 2.7706590e+06]
 [1.2726200e+05 1.6578600e+05 3.5460000e+03]
 [8.1544900e+05 1.3236000e+04 2.1893150e+06]]
IOUs:  [0.92483061 0.34904079 0.37797609] , mean:  0.5506158312616695
CPU times: user 1.62 s, sys: 1.69 s, total: 3.31 s
Wall time: 9.86 s





The mean IOU on the test data is about 0.55, compared with about 0.60 on the training data in the last epoch.

## Save out model

In [26]:
# makes a folder by the name specified, puts the trained model and config file in there
cgnet.save_model('/glade/work/kdagon/ML-extremes/trained_models/trained_cgnet.021523')

In [48]:
# this is how to load a previously trained model
cgnet.load_model('/glade/work/kdagon/ML-extremes/trained_models/trained_cgnet.021523')

## Set inference data

In [38]:
year = 2000  # used in the single-year path below and in the output file names when saving masks
#inference_path = "/glade/campaign/cgd/amp/jet/ClimateNet/Climate_data_"+str(year)
inference_path = "/glade/scratch/tking/cgnet/historical_2000_2005/split_files" # these files raise a torch input type error during prediction

inference = ClimateDataset(inference_path, config)  # could test a different config with means/stds computed from the inference data

In [39]:
inference.fields

{'TMQ': {'mean': 24.927238169017997, 'std': 15.817276954650879},
 'U850': {'mean': 1.0356735863118816, 'std': 8.29762077331543},
 'V850': {'mean': 0.20847854977498861, 'std': 6.231630802154541},
 'PSL': {'mean': 101095.03520124489, 'std': 1461.225830078125}}

In [40]:
inference.length

17520

## Inference mode

In [35]:
%%time
class_masks = cgnet.predict(inference) # masks with 1==TC, 2==AR

100%|██████████| 183/183 [06:38<00:00,  2.18s/it]


CPU times: user 2min 17s, sys: 1min 8s, total: 3min 26s
Wall time: 6min 46s


In [41]:
# fails with the new files: TMQ is stored as a double (float64) and needs to be reprocessed as a float (float32)
class_masks = cgnet.predict(inference) # masks with 1==TC, 2==AR

  0%|          | 0/1095 [00:02<?, ?it/s]


RuntimeError: Input type (torch.cuda.DoubleTensor) and weight type (torch.cuda.FloatTensor) should be the same
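The error above says the input tensor is double precision (float64) while the model weights are single precision (float32). A minimal sketch of the cast that resolves this kind of mismatch, shown on a NumPy array with made-up values rather than on the actual files:

```python
import numpy as np

# Hypothetical TMQ values; NumPy creates float64 ("double") by default.
tmq_double = np.linspace(0.0, 60.0, 5)

# Cast to float32 ("float") so the data matches float32 model weights.
tmq_single = tmq_double.astype(np.float32)

print(tmq_double.dtype, tmq_single.dtype)
```

The same idea should carry over to the files themselves, for example by casting the variable with `astype("float32")` in xarray before writing the reprocessed files; that step is an assumption about the workflow and has not been run here.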

## Save out masks

In [36]:
%%time
class_masks.to_netcdf("/glade/scratch/kdagon/cgnet/class_masks."+str(year)+".nc")

CPU times: user 55.9 ms, sys: 4.16 s, total: 4.22 s
Wall time: 7.53 s


## Track events
Create masks with event IDs

Note: memory spikes here - resource intensive!

In [None]:
%%time
event_masks = track_events(class_masks)

In [None]:
%%time
event_masks.to_netcdf("/glade/scratch/kdagon/cgnet/event_masks."+str(year)+".test.nc")

## Analyze events
Produce some visualizations

In [None]:
analyze_events(event_masks, class_masks, "/glade/scratch/kdagon/cgnet/")

In [None]:
visualize_events(event_masks, inference, "/glade/scratch/kdagon/cgnet/")   

## Weights and Biases
https://docs.wandb.ai/quickstart \
Still figuring out how this works...

In [42]:
import wandb

In [44]:
wandb.login()

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:  ········
wandb: Appending key for api.wandb.ai to your netrc file: /glade/u/home/kdagon/.netrc


True

In [46]:
wandb.init(project="climatenet-test", entity="katie-dagon")

Ignoring settings passed to wandb.setup() which has already been configured.
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:  ········
wandb: Appending key for api.wandb.ai to your netrc file: /glade/u/home/kdagon/.netrc
wandb: wandb version 0.13.10 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.2
wandb: Run data is saved locally in wandb/run-20230215_172859-2auiboc1
wandb: Syncing run decent-river-1





In [47]:
wandb.config = {
  "learning_rate": 0.001,
  "epochs": 20,
  "batch_size": 16
}

In [49]:
wandb.log({"loss": loss})

NameError: name 'loss' is not defined
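The `NameError` arises because no variable named `loss` exists in the notebook namespace; the loss shown in the progress bar is presumably local to `cgnet.train`. One possible workaround is to build a metrics dictionary per epoch and pass it to a logging function. In this sketch `epoch_metrics` is a hypothetical helper, the numbers are copied from the epoch 20 output, and a plain list stands in for `wandb.log` so the example runs without a W&B login:

```python
import numpy as np

def epoch_metrics(epoch, loss, confusion_matrix):
    """Bundle the loss and mean IOU of one epoch into a loggable dict."""
    cm = np.asarray(confusion_matrix, dtype=np.float64)
    intersection = np.diag(cm)
    ious = intersection / (cm.sum(axis=0) + cm.sum(axis=1) - intersection)
    return {"epoch": epoch, "loss": float(loss), "mean_iou": float(ious.mean())}

logged = []
log_fn = logged.append  # stand-in; swap for wandb.log after wandb.init()

# Values copied from the epoch 20 training output.
log_fn(epoch_metrics(
    epoch=20,
    loss=0.7817890048027039,
    confusion_matrix=[
        [3.18470937e+08, 1.06732100e+06, 1.09824410e+07],
        [5.12900000e+05, 1.03284500e+06, 8.06020000e+04],
        [5.20397100e+06, 7.82790000e+04, 1.46956320e+07],
    ],
))
print(logged[0])
```

Logging every epoch would require calling the function from inside the training loop, which means editing or wrapping `CGNet.train`; that change is not part of this notebook.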