# Check whether images contain invalid values

To read the NetCDF files containing the data, we use the MeteoSwiss version of Py-ART, which is available [on GitHub](https://github.com/meteoswiss-mdr/pyart). The conda package is available [from conda-forge](https://anaconda.org/conda-forge/pyart_mch).

In [1]:
import glob
import os
import pyart
import numpy as np
import pandas as pd
from copy import deepcopy


## You are using the Python ARM Radar Toolkit (Py-ART), an open source
## library for working with weather radar data. Py-ART is partly
## supported by the U.S. Department of Energy as part of the Atmospheric
## Radiation Measurement (ARM) Climate Research Facility, an Office of
## Science user facility.
##
## If you use this software to prepare a publication, please cite:
##
##     JJ Helmus and SM Collis, JORS 2016, doi: 10.5334/jors.119



In [2]:
# Suppress annoying IPython warnings. Not ideal, since this also suppresses potentially relevant warnings.
import warnings
warnings.filterwarnings('ignore')

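The blanket filter in the cell above also hides warnings that might matter. A narrower alternative could look like the sketch below, which assumes the noise consists of `DeprecationWarning` messages and silences only that category within a limited scope. The helper name `run_quietly` is hypothetical and not part of the notebook.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while ignoring only DeprecationWarning (hypothetical helper)."""
    with warnings.catch_warnings():
        # Assumption: the unwanted messages are DeprecationWarnings.
        # Other categories (e.g. RuntimeWarning) are still reported.
        warnings.filterwarnings('ignore', category=DeprecationWarning)
        return func(*args, **kwargs)
```

Because the filter lives inside `warnings.catch_warnings()`, the previous warning configuration is restored as soon as the call returns.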


## Auxiliary functions

In [3]:
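# Read a gridded NetCDF file and return the data of its field.
# Data is stored as (nz, ny, nx); we squeeze and transpose it to (nx, ny).
# Each file is expected to hold a single field; if there are several,
# the data of the last one is returned.
def read_nc(fname):
    sat_grid = pyart.io.read_grid(fname)
    data = None
    for field_name in sat_grid.fields.keys():
        data = np.transpose(np.squeeze(sat_grid.fields[field_name]['data']))
    return data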

In [4]:
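# For each variable, report the files that contain at least one NaN.
# Relies on the global variable fbasepath, which must be defined before the call.
def check_nans(variables):
    for var in variables:
        flist = sorted(glob.glob(fbasepath+'*/NETCDF/'+var+'/*.nc'))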
        print(var, len(flist))
        nnan = 0
        for fname in flist:
            # Get time step
            bfile = os.path.basename(fname)
            dt_str = bfile[0:14]
            print(dt_str, end="\r", flush=True)
            
            data = read_nc(fname).flatten()
            ind = np.where(np.isnan(data))[0]
            if ind.size > 0:
                print('File '+dt_str+' contains '+str(ind.size)+' NaNs')
                nnan += 1
        print('Number of files with NaNs:', nnan)

## Some global variables

In [5]:
fbasepath = '/data/pyrad_products/MSG_ML/'

## Check variables

In [6]:
check_nans(['IR_108_text'])

IR_108_text 4355
File 20180404114500 contains 4 NaNs
File 20180404121000 contains 9 NaNs
File 20180423143000 contains 4 NaNs
File 20180429142000 contains 2 NaNs
File 20180429143000 contains 1 NaNs
File 20180429143500 contains 12 NaNs
File 20180429144000 contains 3 NaNs
File 20180429150500 contains 2 NaNs
File 20180429155000 contains 1 NaNs
File 20180429160000 contains 10 NaNs
File 20180429161000 contains 1 NaNs
File 20180429161500 contains 3 NaNs
File 20180429162500 contains 9 NaNs
File 20180429163000 contains 20 NaNs
File 20180429164000 contains 5 NaNs
File 20180429164500 contains 9 NaNs
File 20180429165000 contains 7 NaNs
File 20180429165500 contains 1 NaNs
File 20180429170000 contains 5 NaNs
File 20180429170500 contains 1 NaNs
File 20180429171000 contains 21 NaNs
File 20180429171500 contains 16 NaNs
File 20180506142500 contains 1 NaNs
File 20180506143000 contains 5 NaNs
File 20180506144500 contains 9 NaNs
File 20180506150500 contains 1 NaNs
File 20180506151000 contains 5 NaNs
File 2

2680 files contain invalid values in the IR_108_text variable, although most of them have very few invalid pixels. These invalid values appear because limited floating-point precision can make the computed local variance slightly negative, so that the standard deviation (the square root of the local variance) becomes NaN. The invalid values could safely be replaced by 0s.
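The mechanism can be illustrated with a minimal sketch. It only assumes that the texture is the square root of a local variance, as stated above; how that variance is computed in the actual processing chain is not shown here, and the helper names `std_from_variance` and `fill_invalid` are hypothetical.

```python
import numpy as np


def std_from_variance(variance):
    """Square root of a variance array, treating small negative values as 0."""
    variance = np.asarray(variance, dtype=float)
    # Clipping at zero prevents sqrt from returning NaN for values that are
    # negative only because of floating-point round-off.
    return np.sqrt(np.maximum(variance, 0.0))


def fill_invalid(texture, fill_value=0.0):
    """Replace NaNs in an already computed texture field by fill_value."""
    texture = np.asarray(texture, dtype=float)
    return np.where(np.isnan(texture), fill_value, texture)


# Illustrative values only: the last variance is slightly negative.
variance = np.array([4.0, 0.0, -1e-7])
with np.errstate(invalid='ignore'):
    raw = np.sqrt(variance)            # last element becomes NaN
clipped = std_from_variance(variance)  # last element becomes 0
filled = fill_invalid(raw)             # NaN replaced after the fact
```

Clipping before the square root and filling NaNs afterwards give the same result here; the second option is the one that applies to files that have already been written.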

In [7]:
check_nans(['HRV_norm_text'])

HRV_norm_text 4353
File 20180612155000 contains 9862 NaNs
File 20180807125500 contains 13415 NaNs
Number of files with NaNs: 2


Visual inspection has shown that the two files containing such a large number of invalid values have corrupted HRV values. They will be removed from the dataset.
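One possible way to exclude these two time steps is sketched below. It assumes, as in `check_nans`, that the first 14 characters of the file name are the time stamp. The helper `drop_corrupted` and the example paths are hypothetical; the two time stamps are taken from the output above.

```python
import os

# Time stamps of the two files reported above for HRV_norm_text
CORRUPTED_TIMESTAMPS = {'20180612155000', '20180807125500'}


def drop_corrupted(flist, bad_timestamps=CORRUPTED_TIMESTAMPS):
    """Return the file list without the files whose time stamp is flagged."""
    return [
        fname for fname in flist
        # Assumption: the base name starts with a YYYYMMDDhhmmss time stamp
        if os.path.basename(fname)[0:14] not in bad_timestamps
    ]


# Hypothetical file names, for illustration only
example_files = [
    '/some/path/20180612154500_example.nc',
    '/some/path/20180612155000_example.nc',
    '/some/path/20180807125500_example.nc',
]
kept = drop_corrupted(example_files)
```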

In [8]:
check_nans(['WV_062-IR_108_text'])

WV_062-IR_108_text 4355
File 20180429170500 contains 1 NaNs
File 20180506170000 contains 1 NaNs
File 20180509143500 contains 1 NaNs
File 20180509162500 contains 1 NaNs
File 20180512144000 contains 1 NaNs
File 20180515130000 contains 1 NaNs
File 20180515154500 contains 1 NaNs
File 20180515160500 contains 2 NaNs
File 20180520141000 contains 1 NaNs
File 20180526164500 contains 2 NaNs
File 20180527173000 contains 3 NaNs
File 20180530151500 contains 1 NaNs
File 20180530155500 contains 1 NaNs
File 20180530165000 contains 3 NaNs
File 20180530171500 contains 1 NaNs
File 20180531100000 contains 2 NaNs
File 20180531134500 contains 1 NaNs
File 20180531155000 contains 1 NaNs
File 20180604120000 contains 1 NaNs
File 20180604145000 contains 1 NaNs
File 20180604160500 contains 1 NaNs
File 20180604161500 contains 1 NaNs
File 20180604164500 contains 5 NaNs
File 20180605162500 contains 2 NaNs
File 20180611073500 contains 4 NaNs
File 20180611105500 contains 2 NaNs
File 20180611112000 contains 1 NaNs
File

The NaNs appear for the same reason as in the IR_108_text variable.

In [9]:
check_nans(['IR_108', 'HRV', 'HRV_norm', 'WV_062-IR_108'])

IR_108 4355
Number of files with NaNs: 0
HRV 4353
Number of files with NaNs: 0
HRV_norm 4353
Number of files with NaNs: 0
WV_062-IR_108 4355
Number of files with NaNs: 0


The data check has shown that the original variables contain only valid values. NaN values may appear as artifacts of the texture computation. These NaN values could safely be treated as 0s. However, they appear mostly in areas that are of no interest for the RF models (since they are thresholded), and there is sufficient data to train the model. Therefore, we will simply remove them from the RF model dataset. The data check has also exposed two files containing corrupted HRV values. These files will be removed from the U-Net dataset.
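Removing the affected pixels from the RF dataset could be done along the lines of the sketch below. It assumes the training data is arranged as a 2D array with one row per pixel and one column per variable; that layout and the helper name `drop_nan_samples` are assumptions, not something defined in this notebook.

```python
import numpy as np


def drop_nan_samples(features, labels):
    """Keep only the samples (rows) whose features are all valid numbers."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # A row is valid if none of its feature values is NaN
    valid = ~np.isnan(features).any(axis=1)
    return features[valid], labels[valid]


# Illustrative values only: the second sample has a NaN texture value
features = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
labels = np.array([0, 1, 1])
clean_features, clean_labels = drop_nan_samples(features, labels)
```

Filtering features and labels with the same boolean mask keeps the two arrays aligned.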