# Preprocessing

The script to preprocess the data is:

preprocess_RG.py

This Python script scales all the variables of interest, stacks them together, and combines the latitude, longitude, and time dimensions into one dimension called "sample".
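The flattening into a "sample" dimension can be sketched with NumPy. This is only an illustration, not the code in preprocess_RG.py: the array sizes and the (time, lev, lat, lon) axis order are assumptions, and the real script may order or stack the dimensions differently.

```python
import numpy as np

# Hypothetical grid sizes, chosen only for illustration.
n_time, n_lev, n_lat, n_lon = 4, 30, 3, 5

# One variable on an assumed (time, lev, lat, lon) grid.
field = np.arange(n_time * n_lev * n_lat * n_lon, dtype=float).reshape(
    n_time, n_lev, n_lat, n_lon
)

# Move the vertical axis last, then flatten time, lat, and lon into "sample".
stacked = np.moveaxis(field, 1, -1).reshape(-1, n_lev)

print(stacked.shape)  # (60, 30)
```

Each row of `stacked` is then one atmospheric column, which is the unit the neural network sees as a single training example.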


You can specify which variables you want in your neural network input and output layers through the yml file:

SPCAM5.yml



You simply add variables to, or remove them from, the lists:

In [None]:
inputs: [TBP, QBP, PS, SOLIN, SHFLX, LHFLX]
outputs: [PTTEND, PTEQ, FSNT, FSNS, FLNT, FLNS, PRECT]

### Training

The full command to preprocess the training data is:

In [None]:
python3 /fast/gmooers/Real_Geography_Manuscript/preprocess_RG.py --config_file /fast/gmooers/Real_Geography_Manuscript/SPCAM5.yml --in_dir /DFS-L/DATA/pritchard/gmooers/Workflow/SPCAM_DATA/SPCAM5/2_Degree_Res/ --aqua_names TimestepOutput_Neuralnet_SPCAM_216.cam.h1.2014* --out_dir /fast/gmooers/Real_Geography_Manuscript/Preprocessed_Data/Full_Year_2021/ --out_pref full_physics_essentials_train_month01

Broken down into pieces, the additional arguments refer to:

--in_dir : Where your raw (unprocessed) simulation data is stored

--aqua_names : The names of the NetCDF files of unprocessed data you want to include in the training data (probably the year or years you want)

--out_dir : Where you want to put your preprocessed training data once preprocessing finishes

--out_pref : The prefix you want to give your preprocessed training data files

### Validation/Testing

You will want to use the same command as above for validation (or testing) data, except that you will want one additional argument to ensure the validation data is normalized the same way as the training data:

--ext_norm : {path to the normalization file created by the command that preprocessed the training data}
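To make the difference between the two commands concrete, here is a small sketch that assembles both argument lists. `build_preprocess_cmd` is a hypothetical helper, not part of the repository. The 2015 file pattern for validation and the location of the normalization file (taken to be `--out_dir` plus the `norm_fn` name used in the training script) are assumptions, so check them against your own output directory.

```python
def build_preprocess_cmd(aqua_names, out_pref, ext_norm=None):
    """Assemble the preprocessing command; pass ext_norm for validation/test data."""
    base = "/fast/gmooers/Real_Geography_Manuscript/"
    cmd = [
        "python3", base + "preprocess_RG.py",
        "--config_file", base + "SPCAM5.yml",
        "--in_dir", "/DFS-L/DATA/pritchard/gmooers/Workflow/SPCAM_DATA/SPCAM5/2_Degree_Res/",
        "--aqua_names", aqua_names,
        "--out_dir", base + "Preprocessed_Data/Full_Year_2021/",
        "--out_pref", out_pref,
    ]
    if ext_norm is not None:
        # Reuse the training normalization instead of computing a new one.
        cmd += ["--ext_norm", ext_norm]
    return cmd


train_cmd = build_preprocess_cmd(
    "TimestepOutput_Neuralnet_SPCAM_216.cam.h1.2014*",
    "full_physics_essentials_train_month01",
)

# Assumed file pattern and normalization file path, for illustration only.
valid_cmd = build_preprocess_cmd(
    "TimestepOutput_Neuralnet_SPCAM_216.cam.h1.2015*",
    "full_physics_essentials_valid_month02",
    ext_norm="/fast/gmooers/Real_Geography_Manuscript/Preprocessed_Data/"
             "Full_Year_2021/full_physics_essentials_train_month01_norm.nc",
)

print(" ".join(train_cmd))
print(" ".join(valid_cmd))
```

The only structural difference is the trailing `--ext_norm` pair, so the training command has to run first to produce the normalization file.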

# Training

There is a sample script to train a simple feed-forward network. Most of the hyperparameters (activation function, number of layers, layer width, etc.) are hard-coded and can be adjusted as needed:

Deep_Training.py

Within the script you'll want to change the path to the training and validation data (line 16)

In [None]:
DATADIR = '/fast/gmooers/Preprocessed_Data/7_Years_Spaced/'

If you changed the name in the --out_pref argument, you will also need to change the file names in the DataGenerator() arguments (lines 18-40):

In [None]:
train_gen = DataGenerator(
    data_dir=DATADIR, 
    feature_fn='full_physics_essentials_train_month01_shuffle_features.nc',
    target_fn='full_physics_essentials_train_month01_shuffle_targets.nc',
    batch_size=512,
    norm_fn='full_physics_essentials_train_month01_norm.nc',
    fsub='feature_means', 
    fdiv='feature_stds', 
    tmult='target_conv',
    shuffle=True,
)

valid_gen = DataGenerator(
    data_dir=DATADIR, 
    feature_fn='full_physics_essentials_valid_month02_features.nc',
    target_fn='full_physics_essentials_valid_month02_targets.nc',
    batch_size=512,
    norm_fn='full_physics_essentials_train_month01_norm.nc',  # SAME NORMALIZATION FILE!
    fsub='feature_means', 
    fdiv='feature_stds', 
    tmult='target_conv',
    shuffle=False,
)
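The argument names `fsub`, `fdiv`, and `tmult` suggest that features are shifted and scaled while targets are multiplied by a conversion factor. The sketch below shows that transform with NumPy. It is an assumption about what `DataGenerator` does, based only on the argument names, and the numbers standing in for `feature_means`, `feature_stds`, and `target_conv` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the arrays stored in the normalization file.
feature_means = np.array([250.0, 0.005, 1.0e5])
feature_stds = np.array([30.0, 0.004, 2.0e3])
target_conv = np.array([1004.0, 2.5e6])

# A small batch of raw features and targets (values are made up).
x_raw = feature_means + feature_stds * rng.standard_normal((512, 3))
y_raw = rng.standard_normal((512, 2))

# Assumed transform: subtract fsub, divide by fdiv, multiply targets by tmult.
x_norm = (x_raw - feature_means) / feature_stds
y_norm = y_raw * target_conv

print(x_norm.shape, y_norm.shape)
```

If the validation generator used statistics computed from the validation data instead, the same physical input would map to a different normalized value than the one the network was trained on, which is why both generators point at the same `norm_fn`.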

The TensorBoard logging can be removed for simplicity (lines 107, 23, 124).

Change where the trained model (.h5 file) and loss curves are saved on lines 129 and 152.

# Model Predictions

The file to generate neural network predictions from the training data is:

Model_Predictions.py

Again, you will want to change the path to the data directory and, if necessary, the names of the validation data files (lines 31-45):

In [None]:
DATADIR = 'Preprocessed_Data/RG_Spaced_10_Years/'



valid_gen = DataGenerator(
    data_dir=DATADIR, 
    feature_fn='full_physics_essentials_test_month02_features.nc',
    target_fn='full_physics_essentials_test_month02_targets.nc',
    batch_size=512,
    norm_fn='full_physics_essentials_train_month01_norm.nc',  # SAME NORMALIZATION FILE!
    fsub='feature_means', 
    fdiv='feature_stds', 
    tmult='target_conv',
    shuffle=False,
)

and the trained model (line 54)

In [None]:
model = keras.models.load_model('Models/8_Years_Linear.h5')

And the test data (line 56)

In [None]:
path_to_file = 'Preprocessed_Data/RG_Spaced_10_Years/full_physics_essentials_test_month02_features.nc'

And finally save the model predictions (line 107)

In [None]:
myds.to_netcdf('Models/Test_Final_Linear_DNN_Year.nc')

# Analysis

A quick check is to look at how well the neural network emulates a vertical cross-section of the atmosphere. You can compute those $R^2$ values with the notebook:

https://github.com/gmooers96/RG_Manuscript_Revised/blob/main/Post_Processing/Figure_1_2_3/Lat_Pressure_Timestep_R2_Heating.ipynb

You'll again want to change the filepaths for the model predictions and test data you generated. In the test data (axis 2) 0-30 are heating, 30-60 are moistening, and 61 is rainfall

In [None]:
import xarray as xr

# Ground truth (SPCAM) targets; the first 30 columns are the heating tendencies.
path_to_file = "/fast/gmooers/Real_Geography_Manuscript/Preprocessed_Data/RG_Spaced_10_Years/full_physics_essentials_test_month02_targets.nc"
ds = xr.open_dataset(path_to_file)
truths = ds.targets[:, :30].values
lons = ds.lon.values
lats = ds.lat.values
print('halfway')
# Neural network predictions for the same 30 heating columns.
path_to_file = "/fast/gmooers/Real_Geography_Manuscript/Models/Final_Sherpa_DNN_Annual.nc"
ds = xr.open_dataset(path_to_file)
features = ds.Prediction[:, :30].values

For example, the code above extracts the heating predictions from the neural network and the corresponding truths from the SPCAM data.
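Once truths and predictions are available as (sample, level) arrays, a latitude-pressure $R^2$ field can be sketched as follows. This is not the linked notebook's code. `r2_lat_lev` is a hypothetical helper, and it assumes the samples are ordered time, then latitude, then longitude, and that $R^2$ is pooled over time and longitude. The notebook may aggregate differently.

```python
import numpy as np


def r2_lat_lev(truth, pred, n_time, n_lat, n_lon):
    """Coefficient of determination at each (lat, level), pooled over time and lon."""
    n_lev = truth.shape[1]
    t = truth.reshape(n_time, n_lat, n_lon, n_lev)
    p = pred.reshape(n_time, n_lat, n_lon, n_lev)
    # Sum of squared errors and total variance over the time and lon axes.
    sse = ((t - p) ** 2).sum(axis=(0, 2))
    mean = t.mean(axis=(0, 2), keepdims=True)
    svar = ((t - mean) ** 2).sum(axis=(0, 2))
    return 1.0 - sse / svar


# Synthetic data standing in for the arrays loaded from the NetCDF files.
rng = np.random.default_rng(0)
n_time, n_lat, n_lon, n_lev = 20, 3, 4, 30
truth = rng.standard_normal((n_time * n_lat * n_lon, n_lev))
pred = truth + 0.1 * rng.standard_normal(truth.shape)

r2 = r2_lat_lev(truth, pred, n_time, n_lat, n_lon)
print(r2.shape)  # (3, 30)
```

The result has one value per latitude and vertical level, which is the shape needed for a latitude-pressure cross-section plot.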

Once you have the (2D) $R^2$ values, take the .npy file of them saved by the notebook above and then go to:

https://zenodo.org/record/4558716

And retrieve the .npy files for latitude (x) and pressure (z) coordinates to visualize the plot properly.

"X_Coords.npy"
"Z_Coords.npy"

An example of code to visualize the plot:

In [None]:
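# Assumes `ax` is a single Matplotlib Axes, `fz` is a base font size,
# and X_Coords, Z_Coords, and R2 are already loaded as NumPy arrays.
ax.pcolor(X_Coords, Z_Coords, R2, cmap='Blues', vmin=0, vmax=1.0, rasterized=True)
# Outline the R^2 = 0.7 and R^2 = 0.9 contours.
ax.contour(X_Coords, Z_Coords, R2, [0.7], colors='pink', linewidths=[4])
ax.contour(X_Coords, Z_Coords, R2, [0.9], colors='orange', linewidths=[4])
ax.set_title("(g) Best Real Geography Heating", fontsize=fz*0.9)
# Reverse the y-axis limits (the original took them from another panel, ax[3,0]).
ax.set_ylim(ax.get_ylim()[::-1])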