Scikit-learn may only be available in my non-GPU environment

In [199]:
import numpy as np
import math
import xarray as xr
import dask
from sklearn.model_selection import train_test_split
import tensorflow as tf
import gc
#from sklearn.preprocessing import StandardScaler
#from sklearn.preprocessing import MinMaxScaler

Import the data sample. This one has already been sliced by lat/lon, and the vertical velocity has been pulled out

In [200]:
path = '/DFS-L/DATA/pritchard/gmooers/Workflow/MAPS/SPCAM/Small_Sample/Data_Points/One_Day_Merged_Data.nc'
real_ds = xr.open_dataset(path)

In [201]:
w_velocity = real_ds['CRM_W'].values
w_velocity = np.squeeze(w_velocity)

The array is currently laid out as:

(location, time, level, crm_x)

where location indexes the 109 lat/lon points that pass the filtering test for probable deep convection in the afternoon (LST)

Time is in 15-minute intervals (96 steps for the one day)

30 vertical levels divide up the atmosphere

128 CRM columns in the x direction per GCM grid cell

(109, 96, 30, 128)

The array needs to be reordered so that time is the leading dimension and can be shuffled

In [202]:
# Current layout is (location, time, level, crm_x)
coords, t, lev, crm_x = w_velocity.shape
# Swap the first two axes to get (time, location, level, crm_x);
# astype makes a float64 copy, so w_new is independent of w_velocity
w_new = np.swapaxes(w_velocity, 0, 1).astype(np.float64)

Check whether the array contains any NaN values

In [203]:
np.isnan(w_new).any()

False

For the morphology tests, i.e. feeding in low-resolution image snapshots, I do not want a diurnal cycle, so I will shuffle along the time dimension:

https://www.tensorflow.org/api_docs/python/tf/random/shuffle

I use the TensorFlow built-in function here, which shuffles an array of more than two dimensions along its first axis
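
As far as I can tell, NumPy can do the same thing: `Generator.permutation` (and `np.random.shuffle`) act along the first axis of an array of any rank and leave each sub-array intact. A minimal sketch on a small stand-in array; `w_demo` and the seed are made up for illustration and are not this notebook's data:

```python
import numpy as np

# Hypothetical stand-in for w_new, shape (time, location, level, crm_x)
w_demo = np.arange(4 * 3 * 2 * 5, dtype=np.float64).reshape(4, 3, 2, 5)

rng = np.random.default_rng(0)  # seed chosen only to make the demo repeatable
order = rng.permutation(w_demo.shape[0])  # random ordering of the time indices
w_demo_shuffled = w_demo[order]  # reorder whole time slices; inner axes untouched

print(w_demo_shuffled.shape)
```

Keeping `order` around also makes it possible to trace a shuffled snapshot back to its original time step.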


In [204]:
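# TF 1.x graph mode: build the shuffle op (it permutes along axis 0, here time),
# then evaluate it in a session to get a NumPy array back
w_shuffled = tf.random.shuffle(w_new)
sess = tf.InteractiveSession()
w_numpy = w_shuffled.eval()
sess.close()
gc.collect()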

96

Split the data into training and test sets:

I will do an 80/20 split along the shuffled time axis for now (76 and 20 of the 96 time steps)

In [205]:
w_train = w_numpy[:int(4*len(w_numpy)/5),:,:,:]
w_test = w_numpy[int(4*len(w_numpy)/5):,:,:,:]

All array values must be scaled to between 0 and 1

Standardization rather than normalization seems appropriate
- applied to both training and validation data

https://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization

Scikit-learn has built-in functions to do this:

https://scikit-learn.org/stable/modules/preprocessing.html


Actually, disregard the above: the scikit-learn scalers only accept 2D arrays, so I will have to do this manually

In [206]:
print(np.max(w_numpy))
print(np.min(w_numpy))

24.763671875
-8.849842071533203


Method 1:

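Standardization: assign z-scores so that the values have mean $\mu = 0$ and standard deviation $\sigma = 1$

$x' = \frac{x - \mu}{\sigma}$

Here $\mu$ and $\sigma$ are computed over the level and crm_x axes of each (time, location) snapshot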

In [207]:
# Mean and standard deviation over the (level, crm_x) axes of each snapshot
rescaled_train = (w_train - w_train.mean(axis=(2,3), keepdims=True)) / w_train.std(axis=(2,3), keepdims=True)
rescaled_test = (w_test - w_test.mean(axis=(2,3), keepdims=True)) / w_test.std(axis=(2,3), keepdims=True)

In [208]:
print(np.max(rescaled_train))
print(np.min(rescaled_train))
print(np.max(rescaled_test))
print(np.min(rescaled_test))

30.640037919268302
-21.487122774948084
31.45649386707715
-19.78183682514497
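
Note that the cell above standardizes each (time, location) snapshot with its own mean and standard deviation. A possible alternative, sketched here under the assumption that one global pair of statistics is wanted, takes $\mu$ and $\sigma$ from the training set only and reuses them on the test set. The arrays `train_demo` and `test_demo` are random stand-ins, not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for w_train and w_test, shape (time, location, level, crm_x)
train_demo = rng.normal(loc=0.5, scale=2.0, size=(8, 3, 4, 6))
test_demo = rng.normal(loc=0.5, scale=2.0, size=(2, 3, 4, 6))

# One mean and one standard deviation, taken from the training data only
mu = train_demo.mean()
sigma = train_demo.std()

train_z = (train_demo - mu) / sigma
test_z = (test_demo - mu) / sigma  # reuse the training statistics on the test set

print(train_z.mean(), train_z.std())
```

With this variant the training set has mean 0 and standard deviation 1 by construction, while the test set only approximately does.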


Method 2:

Normalization: scale each value in the array to between 0 and 1. This seems to be the method of choice in most "image" problems, where pixel values are divided by 255 to fall between 0 and 1, so I will defer to it for now

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$

NumPy's built-in interpolation function, np.interp, does this in one line of code; the minimum and maximum come from the training set for both splits

https://stackoverflow.com/questions/36000843/scale-numpy-array-to-certain-range

In [209]:
rescaled_train = np.interp(w_train, (w_train.min(), w_train.max()), (0, +1))

In [210]:
rescaled_test = np.interp(w_test, (w_train.min(), w_train.max()), (0, +1))

In [211]:
print(np.max(rescaled_train))
print(np.min(rescaled_train))
print(np.max(rescaled_test))
print(np.min(rescaled_test))

1.0
0.0
0.848476042326487
0.0
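
One thing to keep in mind: `np.interp` holds values outside the given range at the end points, so any test value below the training minimum comes out as exactly 0 (which may be why the test minimum above is exactly 0.0) and any value above the training maximum comes out as exactly 1. A small sketch with made-up numbers comparing it against the explicit formula:

```python
import numpy as np

# Hypothetical stand-ins: the test values fall partly outside the training range
train_demo = np.array([-8.0, 0.0, 24.0])
test_demo = np.array([-10.0, 4.0, 30.0])

lo, hi = train_demo.min(), train_demo.max()

# np.interp clamps anything outside [lo, hi] to the end points 0 and 1
test_interp = np.interp(test_demo, (lo, hi), (0, 1))

# The explicit formula does not clamp, so out-of-range values leave [0, 1]
test_formula = (test_demo - lo) / (hi - lo)

print(test_interp)
print(test_formula)
```

Whether clamping is wanted depends on the model; it keeps the inputs in [0, 1] but hides how far a test value sits outside the training range.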


Transform the arrays into a set of "low-res images", i.e. 30x128 arrays that can be fed into the VAE as snapshots

In [212]:
final_train = np.zeros(shape=(int(4*t/5)*coords, lev, crm_x))
final_train[:,:,:] = np.nan
count = 0

for i in range(len(rescaled_train)):
    for j in range(len(rescaled_train[i])):
        final_train[count, :, :] = rescaled_train[i,j,:,:]
        count = count+1

In [213]:
final_test = np.zeros(shape=((t-int(4*t/5))*coords, lev, crm_x))
final_test[:,:,:] = np.nan
count = 0

for i in range(len(rescaled_test)):
    for j in range(len(rescaled_test[i])):
        final_test[count, :, :] = rescaled_test[i,j,:,:]
        count = count+1
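
The two loops above appear to be equivalent to a single C-order reshape that merges the time and location axes. A quick check on a toy array (the name `demo` and its shape are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for rescaled_train, shape (time, location, level, crm_x)
demo = np.arange(3 * 2 * 4 * 5, dtype=np.float64).reshape(3, 2, 4, 5)
n_time, n_loc, n_lev, n_x = demo.shape

# Loop version, same pattern as the cells above
looped = np.full((n_time * n_loc, n_lev, n_x), np.nan)
count = 0
for i in range(n_time):
    for j in range(n_loc):
        looped[count] = demo[i, j]
        count += 1

# Reshape version: merge the time and location axes in C (row-major) order
reshaped = demo.reshape(-1, n_lev, n_x)

print(np.array_equal(looped, reshaped))
```

If that holds for the real data too, `rescaled_train.reshape(-1, lev, crm_x)` would avoid both the preallocation and the counter.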

Save the training and test datasets to the standard Preprocessed_Data folder for use in the VAE

In [214]:
np.save('/fast/gmooers/Preprocessed_Data/W_Trial/W_Training.npy', final_train)
np.save('/fast/gmooers/Preprocessed_Data/W_Trial/W_Test.npy', final_test)