# Data pre-processing for u-net model

In this notebook we prepare the data to be ingested by the u-net model. Three features are used: HRV_norm, IR_108 and the channel difference WV_062-IR_108. IR_108 carries information about the cloud top height, the channel difference about the water content of the cloud, and HRV about the structure of the cloud. In principle, one of the main advantages of the u-net model is that it can learn the spatial structure of these variables.

In [1]:
import glob
import os
import pyart
import numpy as np
import pandas as pd
from copy import deepcopy


## You are using the Python ARM Radar Toolkit (Py-ART), an open source
## library for working with weather radar data. Py-ART is partly
## supported by the U.S. Department of Energy as part of the Atmospheric
## Radiation Measurement (ARM) Climate Research Facility, an Office of
## Science user facility.
##
## If you use this software to prepare a publication, please cite:
##
##     JJ Helmus and SM Collis, JORS 2016, doi: 10.5334/jors.119



In [2]:
# Suppress annoying IPython warnings. Not ideal, since this also suppresses potentially relevant warnings
import warnings
warnings.filterwarnings('ignore')



## Auxiliary functions

In [3]:
# Function to read original dataset
# data is stored as (nz, ny, nx), we return (nx, ny)
def read_nc(fname):
    sat_grid = pyart.io.read_grid(fname)
    # Each file is expected to contain a single field;
    # if there are several, the last one is returned
    field_name = list(sat_grid.fields.keys())[-1]
    return np.transpose(np.squeeze(sat_grid.fields[field_name]['data']))
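
The axis handling in `read_nc` can be checked on a small array without any satellite file. The following is a minimal sketch; the array `field` and its dimensions are hypothetical stand-ins for one gridded field with a single vertical level.

```python
import numpy as np

# Hypothetical stand-in for one gridded field:
# 1 vertical level (nz), 3 rows (ny), 4 columns (nx)
field = np.arange(12, dtype=np.float32).reshape(1, 3, 4)

# squeeze drops the singleton vertical axis,
# transpose swaps the two remaining axes: (ny, nx) -> (nx, ny)
image = np.transpose(np.squeeze(field))

print(field.shape)  # (1, 3, 4)
print(image.shape)  # (4, 3)
```

Note that `np.squeeze` removes every axis of length 1, so this only yields a 2D image when the horizontal dimensions are both larger than 1.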

In [4]:
# Function for minmax scaling
def minmax_scaling(data, vmin, vmax):
    data2 = deepcopy(data)
    data2[data2>vmax] = vmax
    data2[data2<vmin] = vmin
    return (data2-vmin)/(vmax-vmin)
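
As a quick check of the scaling function, the sketch below applies it to a few made-up values using the IR_108 limits defined later in the notebook (200 and 311). The function is repeated so that the block runs on its own; the `sample` values are hypothetical.

```python
from copy import deepcopy

import numpy as np


def minmax_scaling(data, vmin, vmax):
    # Same logic as in the notebook: clip to [vmin, vmax], then rescale to [0, 1]
    data2 = deepcopy(data)
    data2[data2 > vmax] = vmax
    data2[data2 < vmin] = vmin
    return (data2 - vmin) / (vmax - vmin)


# Hypothetical values: two of them lie outside the [200, 311] range
sample = np.array([150., 200., 255.5, 311., 400.])
scaled = minmax_scaling(sample, 200., 311.)

# Out-of-range values are clipped to 0 or 1, the midpoint maps to 0.5
print(scaled)
```

Because the function works on a deep copy, the input array is left unchanged.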

## Some global variables

In [5]:
fbasepath = '/data/pyrad_products/MSG_ML/'
features = ['HRV_norm', 'IR_108', 'WV_062-IR_108']
nfeatures = len(features)
target = 'POH90'

vmins = [0., 200., -78.]
vmaxs = [100., 311., 9.]

We use min-max normalization to put all variables in the 0-1 range. The minimum and maximum values for each variable were obtained from the EDA. The features matrix has shape (nx, ny, n channels), with channels HRV_norm, IR_108 and WV_062-IR_108. The target matrix has shape (nx, ny, n classes), where the classes are no hail (0) and hail (1). We transform POH90 to 0 (no hail or not computed) and 1 (probability of hail above 90%). These shapes are the ones required by the u-net as implemented in the unet package.
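
The shapes described above can be illustrated on a tiny grid. This is a sketch under assumptions: the grid size is made up, and the POH90 coding (0 = not computed, 1 = no hail, 2 = hail) is inferred from the recoding step in the processing loop, not taken from the product documentation.

```python
import numpy as np

nx, ny = 2, 3    # hypothetical image size
n_features = 3   # HRV_norm, IR_108, WV_062-IR_108

# Features matrix: one scaled channel per feature
X = np.zeros((nx, ny, n_features), dtype=np.float32)

# Assumed POH90 codes: 0 = not computed, 1 = no hail, 2 = hail
poh90 = np.array([[0, 1, 2],
                  [2, 0, 1]])

# Binary target: 1 where hail, 0 otherwise
y = np.where(poh90 == 2, 1, 0)

# One-hot encoding: row k of the 2x2 identity matrix encodes class k
y_onehot = np.eye(2, dtype=np.float32)[y]

print(X.shape)         # (2, 3, 3)
print(y_onehot.shape)  # (2, 3, 2)
```

Indexing the identity matrix with an integer label array appends the class axis as the last dimension, which gives the (nx, ny, n classes) layout.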

In [6]:
years = ['2018', '2019', '2020']
months = ['04', '05', '06', '07', '08', '09']
for year in years:
    for month in months:
        # Get list of files and data size
        flist = glob.glob(fbasepath+'*/NETCDF/'+features[0]+'/'+year+month+'*.nc')
        if len(flist) == 0:
            continue
        flist.sort()
        img_size = read_nc(flist[0]).shape
        data_size = img_size[0]*img_size[1]
        
        for fname in flist:
            # Get time step
            bfile = os.path.basename(fname)
            dt_str = bfile[0:14]
            print(dt_str, end="\r", flush=True)
            
            # Read all files corresponding to a time step
            # Put them in features and target matrices
            X = np.empty((img_size[0], img_size[1], nfeatures), dtype=np.float32)
            for i, (vmin, vmax, feature) in enumerate(zip(vmins, vmaxs, features)):
                flist_aux = glob.glob(fbasepath+'*/NETCDF/'+feature+'/'+dt_str+'*.nc')
                data = read_nc(flist_aux[0])
                data = minmax_scaling(data, vmin, vmax)  
                X[:, :, i] = data
               
            flist_aux = glob.glob(fbasepath+'*/NETCDF/'+target+'/'+dt_str+'*.nc')
            y = read_nc(flist_aux[0])
            
            # Only hail/no hail: value 1 is mapped to 0 (no hail), value 2 to 1 (hail)
            y[y == 1] = 0
            y[y == 2] = 1
            
            # One-hot encoding (y must hold integer class labels 0/1)
            y_onehot = np.eye(2)[y]
             
            # Save data into a .npz file
            np.savez('/data/ml_course/05_Capstone_project/dl_data/'+dt_str+'_data.npz', features=X, targets=y_onehot)

20200731173000
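
The saved files can be read back with `np.load`, using the same keys (`features` and `targets`) that were passed to `np.savez`. The sketch below is a round trip on random data in a temporary directory; the array sizes are hypothetical and the file name only mimics the naming pattern used above.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample: 4x5 image, 3 feature channels, 2 classes
X = np.random.rand(4, 5, 3).astype(np.float32)
y_onehot = np.eye(2)[np.random.randint(0, 2, size=(4, 5))]

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, '20200731173000_data.npz')
    np.savez(fname, features=X, targets=y_onehot)

    # Arrays are read from the archive when accessed by key
    with np.load(fname) as sample:
        features = sample['features']
        targets = sample['targets']

print(features.shape, targets.shape)

# Class labels can be recovered from the one-hot target
labels = np.argmax(targets, axis=-1)
```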