# Introduction

This Python notebook acquires the data used to emulate Kessler microphysics in the supercell (climate) test case: it loads the samples from a netCDF file and extracts them for training an ML model.

Microphysics consists of four flow variables: temperature, water vapor, cloud water \[liquid\], and precipitation/rain \[liquid\].

* **Input data**: Microphysics of 1D cell stencil ($3 \times 1$) for a given cell
    - Size of a single input to NN model: $[N_{\text{micro}} \times N_{\text{coarse stencil cells} }] = [4 \times 3]$ for 2D/3D simulation
* **Output data**: Microphysics of the given cell after emulation (at the next time step)
    - Size of corresponding output from NN model: $[N_{\text{micro}}] = [4]$ for 2D/3D simulation
* **Training data size**:
    - Input:  $12 \times N_{\text{train}}$
    - Output: $4  \times N_{\text{train}}$
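The shape bookkeeping above can be checked on synthetic data. The following is a minimal sketch, assuming the raw inputs are stored as `(N, 4, 3)` and the raw outputs as `(N, 4)`. The array contents and the names `n_train`, `raw_inputs` and `raw_outputs` are placeholders for illustration, not real simulation data.

```python
import numpy as np

n_micro, n_stencil, n_train = 4, 3, 5   # n_train is a small placeholder value

# Synthetic stand-ins for the raw data: inputs (N, 4, 3), outputs (N, 4)
raw_inputs = np.arange(n_train * n_micro * n_stencil, dtype=float).reshape(
    n_train, n_micro, n_stencil)
raw_outputs = np.zeros((n_train, n_micro))

# Flatten each sample's 4 x 3 block into 12 features, then put samples in columns
train_inputs = raw_inputs.reshape(n_train, n_micro * n_stencil).T
train_outputs = raw_outputs.T

print(train_inputs.shape)   # (12, 5)
print(train_outputs.shape)  # (4, 5)

# Row-major flattening: feature row = variable_index * n_stencil + stencil_index
var, cell, sample = 2, 1, 3
row = var * n_stencil + cell
```

With this layout, the three stencil values of one variable occupy three consecutive rows of the flattened input, so `train_inputs[row, sample]` holds the same value as `raw_inputs[sample, var, cell]`.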


**By MGM, ORNL**

2022 April 05

**Edited:**
* 2022 Apr. 05:
    * initial code complete

# Import libraries

In [1]:
import numpy as np

!pip install netCDF4
import netCDF4
from netCDF4 import Dataset

import matplotlib.pyplot as plt



# Parameters: Number of training and testing samples

In [2]:
Ntrain       = int(1e6)         # number of training samples (np.int was removed in NumPy 1.24)
Ntest        = int(1e4)         # number of testing samples
shuffledata  = True             # randomly shuffle data or not (***)

savedata  = True

# Load fluid flow data

* Load snapshots stored in `netCDF` format

In [3]:
nc_data = Dataset('Data_training/supercell_micro_surrogate_data.nc','r')

* Extract variables

In [4]:
data_ip = nc_data.variables["inputs"]
data_op = nc_data.variables["outputs"]
# Define dimensions
[Nsampls, Nmicro, Nstenc] = data_ip.shape
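# Index of each microphysics variable along the Nmicro axis
idT = 0  # temperature
idV = 1  # water vapor
idC = 2  # cloud water
idP = 3  # precipitation/rain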


print(f'Input shape = {data_ip.shape}')
print(f'Output shape = {data_op.shape}')
print(f'Total number of data points = {Nsampls:,}')

Input shape = (9788877, 4, 3)
Output shape = (9788877, 4)
Total number of data points = 9,788,877


# Extract training & testing data

In [5]:
# Make sure the file holds enough samples for both training and testing
if Nsampls < Ntrain + Ntest:
    print("Need more samples!!!!!")
else:
    # Indices of the training samples: the first Ntrain entries, optionally shuffled
    if shuffledata:
        samplList = np.random.permutation(Ntrain)
    else:
        samplList = np.arange(0, Ntrain)
    # Training data: flatten each (Nmicro, Nstenc) block, one sample per column
    datatrain_IP = data_ip[samplList, :, :].reshape( (Ntrain, Nmicro*Nstenc) ).T
    datatrain_OP = data_op[samplList, :].T
    # Testing data: the Ntest samples that follow the training block
    datatest_IP = data_ip[Ntrain:Ntrain+Ntest, :, :].reshape( (Ntest, Nmicro*Nstenc) ).T
    datatest_OP = data_op[Ntrain:Ntrain+Ntest, :].T
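The split logic can be tried on a small in-memory array before running it against the full netCDF file. The sketch below is only an illustration under assumptions: the data are random placeholders, the sizes are tiny, and the seeded `np.random.default_rng` generator is a substitution made here for reproducibility (the notebook itself uses `np.random.permutation` without a seed).

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # fixed seed so the shuffle is repeatable

n_samples, n_micro, n_stencil = 20, 4, 3
n_train, n_test = 12, 4                 # placeholder sizes

# Random stand-ins for the "inputs" and "outputs" variables
inputs = rng.random((n_samples, n_micro, n_stencil))
outputs = rng.random((n_samples, n_micro))

if n_samples < n_train + n_test:
    raise ValueError("not enough samples for the requested split")

# Shuffle the order of the first n_train samples; inputs and outputs share the order
order = rng.permutation(n_train)
train_ip = inputs[:n_train][order].reshape(n_train, n_micro * n_stencil).T
train_op = outputs[:n_train][order].T

# The test block is the next n_test samples, left in file order
test_ip = inputs[n_train:n_train + n_test].reshape(n_test, n_micro * n_stencil).T
test_op = outputs[n_train:n_train + n_test].T

print(train_ip.shape, train_op.shape)   # (12, 12) (4, 12)
print(test_ip.shape, test_op.shape)     # (12, 4) (4, 4)
```

Because the same `order` indexes both arrays, column `k` of `train_ip` and column `k` of `train_op` still belong to the same sample after shuffling, and no training index can reach into the test block.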

# Save data

* Save all the data arrays in a `.npz` file

In [6]:
if savedata:
    np.savez('Data_training/supercell_micro_Ntrain'+str(Ntrain)+'.npz',
             Ntrain=Ntrain, Ntest=Ntest,
             datatrain_IP=datatrain_IP, datatrain_OP=datatrain_OP, 
             datatest_IP=datatest_IP, datatest_OP=datatest_OP)
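A saved `.npz` archive can be read back with `np.load`, which is a quick way to confirm that the keys and shapes are as expected. The sketch below is a minimal round trip under assumptions: it writes zero-filled placeholder arrays to a temporary directory instead of `Data_training/`, and the sizes are tiny stand-ins for the real ones.

```python
import os
import tempfile

import numpy as np

n_train, n_test = 6, 2                  # placeholder sizes
arrays = dict(
    Ntrain=n_train, Ntest=n_test,
    datatrain_IP=np.zeros((12, n_train)), datatrain_OP=np.zeros((4, n_train)),
    datatest_IP=np.zeros((12, n_test)), datatest_OP=np.zeros((4, n_test)),
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, f"supercell_micro_Ntrain{n_train}.npz")
    np.savez(path, **arrays)
    # Copy every array out of the archive before the file is closed
    with np.load(path) as archive:
        loaded = {name: archive[name] for name in archive.files}

print(sorted(loaded))
print(loaded["datatrain_IP"].shape)     # (12, 6)
print(int(loaded["Ntrain"]))            # 6
```

Scalars such as `Ntrain` come back as zero-dimensional arrays, so they are converted with `int(...)` before being used as sizes.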