# Introduction

This Python script acquires the data used to emulate Kessler microphysics in the supercell (climate) test case: it loads the data from a `netCDF` file and extracts it into `numpy` arrays for an ML model.

Microphysics consists of 4 flow variables: temperature, water vapor, cloud water \[liquid\], and precipitation/rain \[liquid\].

* **Input data**: Microphysics of a single grid cell with dry air density
    - Size of a single input to NN model: $N_{\text{micro}} + 1 = 5$ for 2D/3D simulation
* **Output data**: Microphysics of the given cell after emulation (at the next time step)
    - Size of corresponding output from NN model: $[N_{\text{micro}}] = [4]$ for 2D/3D simulation
* **Training data size**:
    - Input:  $5 \times N_{\text{train}}$
    - Output: $4  \times N_{\text{train}}$
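
To make these sizes concrete, the sketch below builds arrays with the stated layout from synthetic random values. The names `n_micro`, `n_train`, `micro_before`, `dry_density`, and `micro_after`, as well as the ordering of variables along the feature axis, are assumptions for illustration only; the actual layout is determined by the NetCDF file. The arrays are stored samples-first, so the shapes are the transpose of the sizes listed above.

```python
import numpy as np

n_micro = 4      # temperature, water vapor, cloud water, rain
n_train = 1000   # hypothetical number of training samples

rng = np.random.default_rng(0)

# Synthetic stand-ins for the microphysics state and the dry air density
micro_before = rng.random((n_train, n_micro), dtype=np.float32)
dry_density = rng.random((n_train, 1), dtype=np.float32)
micro_after = rng.random((n_train, n_micro), dtype=np.float32)

# One input row = 4 microphysics variables + dry air density (assumed order)
nn_input = np.concatenate([micro_before, dry_density], axis=1)
# One output row = the 4 microphysics variables at the next time step
nn_output = micro_after

print(nn_input.shape)   # (1000, 5)
print(nn_output.shape)  # (1000, 4)
```

Each row is one training sample, so a model trained on these arrays maps 5 values to 4 values per grid cell.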


**By Matt Norman and Murali Gopalakrishnan Meena, ORNL**

In [1]:
!pip install netCDF4
from netCDF4 import Dataset
import numpy as np
import os

path = 'supercell_kessler_data.nc'
data_link = "https://www.dropbox.com/s/nonpheml3309q7d/supercell_kessler_data.nc?dl=0"

# Download the data if necessary
if not os.path.isfile(path):
    print(f"Downloading data from:\n {data_link}...")
    !wget {data_link} -O {path}

print('Reading dataset...')

# Open NetCDF4 file, allocate input and output data arrays
nc = Dataset(path,'r')
num_samples, num_vars_in, stencil_size = nc.variables['inputs'].shape
input_from_file  = np.empty(nc.variables['inputs' ].shape, dtype=np.single)
output_from_file = np.empty(nc.variables['outputs'].shape, dtype=np.single)

# We need to chunk the reading to avoid overflowing available memory
num_chunks = 20
chunk_size = int(np.ceil(num_samples / num_chunks))
# Loop over chunks and load data
for ichunk in range(num_chunks) :
  ibeg = ichunk * chunk_size
  iend = min((ichunk + 1) * chunk_size, num_samples)  # Ensure we don't go past the last index
  input_from_file [ibeg:iend,:,:] = nc.variables['inputs' ][ibeg:iend,:,:]
  output_from_file[ibeg:iend,:]   = nc.variables['outputs'][ibeg:iend,:]
  print(f'  * Finished reading chunk {ichunk+1} of {num_chunks}')

nc.close()

print('Shuffling dataset...')

# Randomly shuffle the samples before saving to file
permuted_indices = np.random.permutation(num_samples)
input_from_file  = input_from_file [permuted_indices,:,:]
output_from_file = output_from_file[permuted_indices,:]

print('Saving data to file...')

np.savez('supercell_kessler_data.npz',
         input_from_file=input_from_file, output_from_file=output_from_file)


Collecting netCDF4
  Downloading netCDF4-1.6.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Collecting cftime
  Downloading cftime-1.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (241 kB)
Installing collected packages: cftime, netCDF4
Successfully installed cftime-1.6.1 netCDF4-1.6.0
Downloading data from:
 https://www.dropbox.com/s/nonpheml3309q7d/supercell_kessler_data.nc?dl=0...
--2022-07-24 15:30:30--  https://www.dropbox.com/s/nonpheml3309q7d/supercell_kessler_data.nc?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.9.18, 2620:100:601f:18::a27d:912
Connecting to www.dropbox.com (www.dropbox.com)|162.125.9.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/nonpheml3309q7d/supercell_kessler_data.nc [following]
--2022-07-24 15:30:31--  https://www