# Convert Lat/Lon Files to Polar Stereographic Files

This notebook provides a workflow for converting lat/lon netCDF files to polar stereographic coordinates. We then combine the polar stereographic underlying data files with the mask files so that the result can be used as input to cgnet. Getting the data into the proper format required the following:
 - A template file serves as the base for modifying the netCDF files of the existing mask arrays. The template file includes all of the proper GIS attributes.
 - A Python script reads the GeoTIFF file and creates a netCDF file with the x, y, lat, and lon coordinate dimensions and variables.

Authors:
 - John Truesdale
 - Teagan King

Some background notes from John on methods for regridding:
----------------------------------------------------------
 - QGIS was used to assign a standard stereographic coordinate system and create a GeoTIFF version of one of the polar JPEG plots.
 - Giving a standard location to each JPEG pixel consisted of finding matching features at the pixel level (the tip of an island, the most inset part of a prominent bay) between our polar JPEGs and a map that is already georeferenced and has known locations for those pixels. With 3-10 pixels in common you can fit a linear regression that maps all of the remaining pixels of the JPEG.
 - Once the GIS application had calculated the transform from pixel to a standard coordinate system, I saved all of that information in a GeoTIFF file.
 - The polar projection JPEGs on which the ClimateNet masks are drawn were created with Python's matplotlib, and the coordinate information can be taken from matplotlib; this was checked against QGIS.
 - Because the LLNL polar JPEGs use a projected coordinate system, the underlying unit of the stereographic projection is meters. The x and y variables in the GeoTIFF and the converted netCDF file contain the offset in meters of every pixel (row,col) of the ar_mask array with respect to one of the standard south pole stereographic coordinate systems.
 - In the square projected polar image, the longitude lines converge at the pole and the latitudes form a set of nested circles. Described in lat/lon coordinates, this is known as a curvilinear grid: each pixel (array location) requires its own lat/lon pair to specify its position. A straight line along any row or column of the JPEG raster (or ar_mask array) crosses different lat and lon values at every pixel. The coordinate information for our rectangular ar_mask array therefore contains two-dimensional lat and lon variables that describe the entire grid of 1152x1152 points, with a unique lat/lon value for each pixel (array location) of ar_mask. The standard netCDF way of denoting a curvilinear coordinate is to create the dimensions that define the size of the ar_mask array (x,y), add lat/lon variables that are each dimensioned (x,y) and hold the lat/lon coordinates of each point, and finally add metadata to the ar_mask array noting that the coordinates for this variable are the lat/lon variables rather than the dimension variables x,y.
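
To make the curvilinear-grid point concrete, the following is a minimal sketch that inverts a south polar stereographic projection by hand. It assumes a spherical Earth of radius 6,371 km, a projection plane tangent at the pole, and a central meridian of 0 degrees. These are illustrative assumptions, not the CRS stored in the GeoTIFF, and the function names are hypothetical. In the notebook itself this conversion is done with `rasterio.warp.transform`.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth, not the GeoTIFF's ellipsoid


def south_polar_stereo_to_latlon(x_m, y_m, lon0_deg=0.0):
    """Invert a spherical south polar stereographic projection (tangent at the pole).

    x_m, y_m are offsets from the pole in meters; returns (lat, lon) in degrees.
    """
    rho = math.hypot(x_m, y_m)  # distance from the pole on the projection plane
    lat = math.degrees(2.0 * math.atan(rho / (2.0 * EARTH_RADIUS_M))) - 90.0
    lon = lon0_deg + math.degrees(math.atan2(x_m, y_m))
    return lat, lon


def curvilinear_latlon(x_coords, y_coords):
    """Build 2-D lat and lon lists dimensioned (y, x) from 1-D x and y offsets."""
    lat2d, lon2d = [], []
    for y in y_coords:
        pairs = [south_polar_stereo_to_latlon(x, y) for x in x_coords]
        lat2d.append([lat for lat, _ in pairs])
        lon2d.append([lon for _, lon in pairs])
    return lat2d, lon2d


# One raster row (fixed y) crosses a different latitude and longitude at every pixel
lat2d, lon2d = curvilinear_latlon([0.0, 1.0e6, 2.0e6], [1.0e6])
for lat, lon in zip(lat2d[0], lon2d[0]):
    print(f"lat={lat:8.3f}  lon={lon:8.3f}")
```

Because neither lat nor lon is constant along a row, both have to be stored as full two-dimensional variables, which is what the template file provides.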

Generate remapped IVT/TMQ/etc underlying data:  
 - ESMF_RegridWeightGen --ignore_unmapped --dst_regional -m bilinear -w map_fv0.23x0.31_to_sp_stereo_near.nc -s /glade/p/cesmdata/inputdata/share/scripgrids/fv0.23x0.31_141008.nc -d /glade/u/home/jet/sp_stereographic_SCRIP.nc  
 - ncremap -m ./map_fv0.23x0.31_to_sp_stereo_near.nc -i windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001-200012.nc -o polar_ivt/windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001-200012.nc  
 - see /glade/scratch/tking/cgnet/high_lat_QC/
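
The two commands above can be assembled programmatically when several files need the same treatment. The sketch below only builds and prints the command lines, using exactly the flags shown above; nothing is executed. The helper names are hypothetical, and passing the lists to `subprocess.run` would be a separate step that assumes ESMF and NCO are loaded in the environment.

```python
import shlex


def regrid_weight_command(src_grid, dst_grid, weight_file, method="bilinear"):
    """Argument list for ESMF_RegridWeightGen with the flags used in this notebook."""
    return [
        "ESMF_RegridWeightGen", "--ignore_unmapped", "--dst_regional",
        "-m", method, "-w", weight_file, "-s", src_grid, "-d", dst_grid,
    ]


def ncremap_command(weight_file, input_file, output_file):
    """Argument list for applying an existing weight file with ncremap."""
    return ["ncremap", "-m", weight_file, "-i", input_file, "-o", output_file]


weight_file = "map_fv0.23x0.31_to_sp_stereo_near.nc"
ivt_file = "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001-200012.nc"

weights_cmd = regrid_weight_command(
    "/glade/p/cesmdata/inputdata/share/scripgrids/fv0.23x0.31_141008.nc",
    "/glade/u/home/jet/sp_stereographic_SCRIP.nc",
    weight_file,
)
remap_cmd = ncremap_command("./" + weight_file, ivt_file, "polar_ivt/" + ivt_file)

# Print the commands for inspection or for pasting into a batch script
print(shlex.join(weights_cmd))
print(shlex.join(remap_cmd))
```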

In [1]:
import datetime as dt
import glob
import os
import pdb
import urllib.request

import cartopy.crs as ccrs
import cftime
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from netCDF4 import Dataset, date2num
from rasterio.warp import transform

## Generate template file (only needs to be done once)

In [9]:
# # Read the data
# input_path = '/glade/work/tking/cgnet/polar_regridding/data-2003-04-29-02-0-copy-sav1.tif'

# da = xr.open_rasterio(input_path)
# yval=da['y']
# ryval=np.flip(yval)
# # Compute the lon/lat coordinates with rasterio.warp.transform
# ny, nx = len(da['y']), len(da['x'])
# x, y = np.meshgrid(da['x'], ryval)
# # Rasterio works with 1D arrays
# lon, lat = transform(da.crs, {'init': 'EPSG:4326'},
#                      x.flatten(), y.flatten())
# lon = np.asarray(lon).reshape((ny, nx))
# lat = np.asarray(lat).reshape((ny, nx))
# da.coords['lon'] = (('y', 'x'), lon)
# da.coords['lat'] = (('y', 'x'), lat)

In [10]:
# da.to_netcdf(path='/glade/work/tking/cgnet/polar_regridding/data-2003-04-29-02-0-sav1-rev-latlon.nc')
# # use for just antarctic

## set up dask

In [2]:
# Import dask
import dask

# Use dask jobqueue
from dask_jobqueue import PBSCluster

# Import a client
from dask.distributed import Client

# Setup your PBSCluster
nmem='25GB' # specify memory here so it duplicates below
cluster = PBSCluster(
    cores=1, # The number of cores you want
    memory=nmem, # Amount of memory
    processes=1, # How many processes
    queue='casper', # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
    local_directory='/glade/scratch/$USER/local_dask', # Use your local directory
    resource_spec='select=1:ncpus=1:mem='+nmem, # Specify resources
    account='P93300313', # Input your project ID here, previously this was known as 'project', now is 'account'
    walltime='08:00:00', # Amount of wall time
    # interface='ib0', # Interface to use
)

# Scale up
cluster.scale(10)

# Change your url to the dask dashboard so you can see it
dask.config.set({'distributed.dashboard.link':'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/proxy/{port}/status'})

# Setup your client
client = Client(cluster)

Task exception was never retrieved
future: <Task finished name='Task-25' coro=<_wrap_awaitable() done, defined at /glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError('Command exited with non-zero exit code.\nExit code: 32\nCommand:\nqsub /glade/derecho/scratch/tking/tmp/tmpq0dmniva.sh\nstdout:\n\nstderr:\nqsub: \nUnknown queue requested. Please correct or contact\ncislhelp@ucar.edu for assistance\n                        \n\n')>
Traceback (most recent call last):
  File "/glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/glade/u/home/tking/.conda/envs/polar_ARs/lib/python3.9/site-packages/dask_jobqueue/core.py", line 411, in start
    out = await self._submit_job(fn)
  File "/glade/u/home/t

In [3]:
client

Connection method: Cluster object
Cluster type: dask_jobqueue.PBSCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/tking/proxy/8787/status
Workers: 0
Total threads: 0
Total memory: 0 B
Comm: tcp://128.117.211.174:34051
Started: Just now


## define dictionary of file names

In [4]:
tmq_dict = {2000: "prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc",
            2001: "prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200101010000-200112312100.nc",
            2002: "prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200201010000-200212312100.nc",
            2003: "prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200301010000-200312312100.nc",
            2004: "prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200401010000-200412312100.nc"}

ivt_dict = {2000: "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001-200012.nc",
            2001: "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200101-200112.nc",
            2002: "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200201-200212.nc",
            2003: "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200301-200312.nc",
            2004: "windhusavi_3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200401-200412.nc"}

psl_dict = {2000: "psl_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc",
            2001: "psl_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200101010000-200112312100.nc",
            2002: "psl_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200201010000-200212312100.nc",
            2003: "psl_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200301010000-200312312100.nc",
            2004: "psl_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200401010000-200412312100.nc"}

pr_dict = {2000: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312359.nc",
           2001: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200101010000-200112312359.nc",
           2002: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200201010000-200212312359.nc",
           2003: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200301010000-200312312359.nc",
           2004: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200401010000-200412312359.nc",
           2005: "pr_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200501010000-200512052359.nc"}
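
Since the file names follow a fixed pattern, the dictionaries above could also be generated rather than typed out, which reduces the chance of a typo in one entry. This is a sketch under that assumption; the helper names and the `generated_*` variables are hypothetical, and the partial 2005 `pr` file is added explicitly because its end stamp differs.

```python
RUN_LABEL = "CAM5-1-025degree_All-Hist_est1_v1-0_run002"


def a3hr_filename(variable, year, end_stamp="12312100"):
    """File name for the A3hr variables (prw, psl, pr) for one year."""
    return f"{variable}_A3hr_{RUN_LABEL}_{year}01010000-{year}{end_stamp}.nc"


def ivt_filename(year):
    """File name for the integrated vapor transport (windhusavi) files."""
    return f"windhusavi_3hr_{RUN_LABEL}_{year}01-{year}12.nc"


years = range(2000, 2005)
generated_tmq = {year: a3hr_filename("prw", year) for year in years}
generated_psl = {year: a3hr_filename("psl", year) for year in years}
generated_ivt = {year: ivt_filename(year) for year in years}

# pr files end at 23:59 instead of 21:00, and the 2005 file stops on December 5
generated_pr = {year: a3hr_filename("pr", year, end_stamp="12312359") for year in years}
generated_pr[2005] = a3hr_filename("pr", 2005, end_stamp="12052359")

print(generated_tmq[2000])
print(generated_pr[2005])
```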

## Change to 2-dimensional lat/lon before regridding 

## Regrid

In [6]:
# only do once

for var in ['ivt']: # pr , 'psl', 'tmq'
    if var=='pr':
        dictionary = pr_dict
    elif var=='psl':
        dictionary = psl_dict
    elif var=='ivt':
        dictionary = ivt_dict
    elif var=='tmq':
        dictionary = tmq_dict
    for year in [2000]: #2001,2002,2003,2004]:
        ds_before_regrid = xr.open_dataset('/glade/derecho/scratch/tking/cgnet/high_lat_QC/from_nersc/2dlatlon/{}'.format(dictionary[year]))
        ds_before_regrid
        # we want this to be two dimensional lat/lon instead of 1

        # mesh is one useful tool but could do by hand
        # duplicate lat/lon array for lon/lat number of times
        # in order to have lat (y,x) and lon (y,x)

        # x replaces the lat dimension (see swap_dims below)
        # y replaces the lon dimension
        # dimensions should be time, x, y, 
        # follow example here: https://xesmf.readthedocs.io/en/latest/notebooks/Curvilinear_grid.html
        y_len = ds_before_regrid.lon.shape[0]
        x_len = ds_before_regrid.lat.shape[0]

        ds_before_regrid['lat_val'] = (('y','x'), np.tile(ds_before_regrid.lat, (y_len,1)))
        ds_before_regrid['lon_val'] = (('y','x'), np.transpose(np.tile(ds_before_regrid.lon, (x_len,1))))

        # Name of the source variable inside each input file
        source_names = {'ivt': 'windhusavi', 'psl': 'psl', 'tmq': 'prw', 'pr': 'pr'}
        ds_before_regrid[var] = ds_before_regrid[source_names[var]].swap_dims({'lat':'x','lon':'y'})

        ds_before_regrid = ds_before_regrid.drop_dims('lat')
        ds_before_regrid = ds_before_regrid.drop_dims('lon')
        
        ds_before_regrid.to_netcdf('/glade/derecho/scratch/tking/cgnet/high_lat_QC/from_nersc/updated_latlon/{}'.format(dictionary[year]))
        print("done with {} year".format(var), str(year))

done with year 2000
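
The comments in the cell above mention that a mesh could be used instead of tiling by hand. A small sketch with made-up coordinate values (three latitudes, four longitudes) shows that the `np.tile` construction and `np.meshgrid` produce the same pair of two-dimensional arrays:

```python
import numpy as np

# Small stand-ins for the 1-D coordinates of an input file (values are illustrative)
lat = np.array([-90.0, -89.75, -89.5])
lon = np.array([0.0, 0.3125, 0.625, 0.9375])

n_lon, n_lat = lon.shape[0], lat.shape[0]

# Same construction as in the cell above: both arrays end up shaped (n_lon, n_lat)
lat_2d = np.tile(lat, (n_lon, 1))
lon_2d = np.transpose(np.tile(lon, (n_lat, 1)))

# np.meshgrid builds the same pair of arrays in one call
lat_mesh, lon_mesh = np.meshgrid(lat, lon)

print(lat_2d.shape, lon_2d.shape)
print(np.array_equal(lat_2d, lat_mesh), np.array_equal(lon_2d, lon_mesh))
```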


## regrid TMQ/IVT/PSL/PR data from lat/lon to polar

In [3]:
# submit scripts below in batch scripts, see example at
#     /glade/scratch/tking/cgnet/high_lat_QC/from_nersc/2dlatlon/remap_script and batch_remap.sh

In [314]:
# module load gnu/9.1.0
# module load esmf_libs/8.0.0
# module load esmf-8.0.0-ncdfio-mpi-O
# module load nco/4.7.9

In [315]:
# set srcgrid=f09
# set dstgrid=sp_stereo
# set srcgridfile=/glade/p/cesmdata/inputdata/share/scripgrids/fv0.23x0.31_141008.nc
# set dstgridfile=/glade/u/home/jet/sp_stereographic_SCRIP.nc
# set srcinitfile=prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc
# set dstinitfile=polar_tmq/prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100_polar_nearest.nc

# #create the map file
# ESMF_RegridWeightGen --ignore_unmapped --src_regional -m neareststod -w map_${srcgrid}_to_${dstgrid}_near.nc -s ${srcgridfile} -d ${dstgridfile}

# #use the mapfile to remap srcinitfile to dstinitfile
# ncremap -m ./map_${srcgrid}_to_${dstgrid}_near.nc -i ${srcinitfile} -o ${dstinitfile}

## Rename regridded files to avoid lat/lon being dimension and variable name

In [4]:
# use ncrename to rename the output lat/lon dims to not conflict with vars
# ncrename -v lon,longitude -d lat,latitude prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc renamed/prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc

tmq_ds = xr.open_dataset('/glade/scratch/tking/cgnet/high_lat_QC/prw/2dlatlon/polar_tmq/renamed/renamed_2/prw_A3hr_CAM5-1-025degree_All-Hist_est1_v1-0_run002_200001010000-200012312100.nc')


In [5]:
tmq_ds

## Generate combined mask/underlying data files by creating netcdf from scratch

### run the next cell if temp.nc (temporary file to fill) already exists

In [11]:
rm /glade/u/home/tking/work/cgnet/QA_xml/all_antarctic_converted_masks/temp.nc

### Note on time adjustment
We'll need to use the bug fix included in the cells below for the first two rounds of QC'd data; the issue has been fixed in the masks of later datasets.

### Create temp file with correct attributes

In [5]:
temp_file = '/glade/work/tking/cgnet/polar_regridding/data-2003-04-29-02-0-sav1-rev-latlon.nc'
temp = xr.open_dataset(temp_file)

ncfile = Dataset('/glade/u/home/tking/work/cgnet/QA_xml/all_antarctic_converted_masks/temp.nc',mode='w',format='NETCDF4_CLASSIC') 

# Create dimensions
y_dim = ncfile.createDimension('y', 1152)        # vertical displacement axis
x_dim = ncfile.createDimension('x', 1152)        # horizontal displacement axis
time_dim = ncfile.createDimension('time', None)  # unlimited axis (can be appended to)
sample_id_dim = ncfile.createDimension('sample_id', 6)

# Include time variable and relevant attributes
time = ncfile.createVariable('time', np.float64, ('time',))
time.units = 'hours since 1970-01-01'
time.calendar = 'noleap'
time.long_name = 'time'

# Include date information
date = ncfile.createVariable('date', np.float64, ('time',))
date.long_name = 'current date'
datesec = ncfile.createVariable('datesec', np.float64, ('time',))
datesec.long_name = 'current seconds of current date'

# Include ar_mask variable and relevant attributes
ar_mask = ncfile.createVariable('ar_mask', np.float64, ('time','sample_id','y','x'))
ar_mask.long_name = "Atmospheric River Mask"
ar_mask.standard_name = "AR flag"
ar_mask.flag_values = 0, 1
ar_mask.flag_meanings = "Background Atmospheric_River"

# Include underlying data
tmq = ncfile.createVariable('tmq',np.float64,('time','y','x'))
ivt = ncfile.createVariable('ivt',np.float64,('time','y','x'))
psl = ncfile.createVariable('psl',np.float64,('time','y','x'))
pr = ncfile.createVariable('pr',np.float64,('time','y','x'))

# include y, x, lat, & lon from temp file
y = ncfile.createVariable('y',np.float64,('y',))
y.long_name = 'vertical offset from pole'
y.units = 'meters'

x = ncfile.createVariable('x',np.float64,('x',))
x.long_name = 'horizontal offset from pole'
x.units = 'meters'

lat = ncfile.createVariable('lat', np.float64,('y','x'))
lat.units = 'degrees_north'
lat.long_name = 'latitude'

lon = ncfile.createVariable('lon', np.float64,('y','x'))
lon.units = 'degrees_east'
lon.long_name = 'longitude'

# Add the y, x, lat, & lon data values to the netcdf file
ncfile['y'][:] = temp.y
ncfile['x'][:] = temp.x
ncfile['lat'][:,:] = temp.lat
ncfile['lon'][:,:] = temp.lon

# Copy temp file metadata
template_var = temp.__xarray_dataarray_variable__
for data_var in (ivt, tmq, psl, pr):
    data_var.transform = template_var.transform
    data_var.crs = template_var.crs
    data_var.coordinates = 'lat lon'

# Copy global metadata
ncfile.transform = temp.__xarray_dataarray_variable__.transform
ncfile.crs = temp.__xarray_dataarray_variable__.crs
ncfile.res = temp.__xarray_dataarray_variable__.res
ncfile.nodatavals = temp.__xarray_dataarray_variable__.nodatavals
ncfile.scales = temp.__xarray_dataarray_variable__.scales
ncfile.offsets = temp.__xarray_dataarray_variable__.offsets
ncfile.AREA_OR_POINT = temp.__xarray_dataarray_variable__.AREA_OR_POINT
ncfile.coordinates = "lat lon"

### Loop through mask files and underlying data files; add both to temp file

My process has been to adjust the year in the for loop, move temp.nc to a new name, check the results, and then run for a different year. One could also rename temp.nc above to correspond with the year and then not bother with renaming the files. 
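
The loop below derives the date, the time of day, and the sample index from each mask file name. As a standalone sketch of that step, the parser here assumes names of the form `data-YYYY-MM-DD-NN-S.nc` with an optional `_K` suffix for repeated samples (inferred from the file names printed in the output) and applies the round 3 rule of `hour = S*3 - 2` at 30 minutes past. The pattern and the function name are assumptions, not part of the original workflow.

```python
import os
import re

# Pattern inferred from names such as data-2003-01-22-10-1.nc and data-2003-01-22-10-1_1.nc
MASK_NAME = re.compile(
    r"data-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-\d+-(?P<step>\d+)(?:_(?P<sample>\d+))?\.nc"
)


def parse_mask_filename(path):
    """Return the date, round 3 time of day, and sample id encoded in a mask file name."""
    match = MASK_NAME.fullmatch(os.path.basename(path))
    if match is None:
        raise ValueError(f"unrecognized mask file name: {path}")
    step = int(match["step"])
    return {
        "year": int(match["year"]),
        "month": int(match["month"]),
        "day": int(match["day"]),
        "hour": step * 3 - 2,  # round 3 rule used in the loop
        "minute": 30,
        "sample_id": int(match["sample"] or 0),  # no suffix means the first sample
    }


print(parse_mask_filename("netcdfs/data-2003-01-22-10-1_1.nc"))
print(parse_mask_filename("netcdfs/data-2003-01-25-80-8.nc"))
```

A regular expression fails loudly on names that do not fit the pattern, which may help with the naming irregularities noted in the TODO inside the loop.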

In [None]:
print('starting at {}'.format(dt.datetime.now()))
directory_of_underlying_data = "/glade/derecho/scratch/tking/cgnet/high_lat_QC/from_nersc/2dlatlon/polar/renamed/"
time_index = -1
round_val = 3 # change this value to indicate whether or not bug fix from round 1 is used

# The years below correspond to the mask's listed years (ie, incorrect years from round 1)
for year in [2003]:

    if round_val == 1:
        mask_file_list = sorted(glob.glob('/glade/work/tking/cgnet/QA_xml/round_1/h5/qa*/antarctic/netcdfs/data-{}-*'.format(year)))
        # mask_file_list = sorted(glob.glob('/glade/work/tking/cgnet/QA_xml/h5/qa*/antarctic/netcdfs/data-{}-*'.format(year)))
    elif round_val == 2:
        mask_file_list = sorted(glob.glob('/glade/work/tking/cgnet/QA_xml/round_2/h5/qa*/antarctic/netcdfs/data-{}-*'.format(year)))
    elif round_val == 3:
        mask_file_list = sorted(glob.glob('/glade/work/tking/cgnet/QA_xml/round_3/h5/qa*/antarctic/netcdfs/data-{}-*'.format(year))) #[:5]
    
    if round_val == 1:
        shifted_year = year - 1
    else:
        shifted_year = year
    
    # Bug fix from 2000 data being pulled in previously:
    if round_val == 1:
        shifted_year = 2000

    tmq_ds = xr.open_dataset(directory_of_underlying_data+'tmq/{}'.format(tmq_dict[shifted_year]))
    psl_ds = xr.open_dataset(directory_of_underlying_data+'psl/{}'.format(psl_dict[shifted_year]))
    ivt_ds = xr.open_dataset(directory_of_underlying_data+'ivt/{}'.format(ivt_dict[shifted_year]))
    pr_ds = xr.open_dataset(directory_of_underlying_data+'pr/{}'.format(pr_dict[shifted_year]))
    
    for antarctic_mask_file in mask_file_list[:]:
        time = antarctic_mask_file.split('/')[-1].split('data-')[1].split('.nc')[0].split('-00-2')[0].split('_')[0]
        time_year = int(time.split('-')[0])  # or = year
        
        print(antarctic_mask_file)
        print(time)
        
        # For round 1, assume that the nc file is ~named~ 2001, but underlying data is 2000
        if round_val == 1:
            time_year = 2001
        
        time_month = int(time.split('-')[1])
        time_day = int(time.split('-')[2])
        if round_val==1 or round_val==2:
            time_hour = 22  # all files were 00
            time_mins = 30
        if round_val==3:
            time_hour = int(time.split('-')[4])*3-2  # 1.5 hours off of time provided... so subtract 2 hr and add a half hour
            time_mins = 30
            # if time_month >= 3: # TODO: figure this bit out...
            #     time_day = time_day-1


        # print(time_year)
        # print(time_month)
        # print(time_day)
        # print(time_hour)
        # print(time_mins)

        date_number = date2num(dt.datetime(time_year, time_month, time_day, time_hour, time_mins), 'hours since 1970-01-01')

        # ---------------------------------------------------------------------------------
        # Fix for indexing bugs is now in all_chey_arctic.ipynb and all_chey_antarctic.ipynb,
        # So this fix will not be needed after round 1 data is processed.
        # In this fix, if data is from 2000, adjust days by 4, otherwise, adjust days by 5 (to account for leap days)
        if round_val == 1:
            if time_year == 2000:
                leap_year_adjustment = 4
            if time_year >= 2001:
                leap_year_adjustment = 5
            if time_month in [1, 3, 5, 7, 8, 10, 12]: # months with 31 days
                days_in_month = 31
                # adjust year for indexing issue unless last few days in file (because of leap year, these got included in correct file)
                if time_month == 12 and time_day < (days_in_month - leap_year_adjustment):
                    time_year = time_year - 1
                else:
                    time_year = time_year - 1
                if time_day < (days_in_month - leap_year_adjustment):
                    time_day = time_day + leap_year_adjustment
                else:
                    time_day = ((time_day + leap_year_adjustment) - days_in_month) + 1 # add 4 days for leap year, subtract days in month, add 1 because month starts at 1 not 0.
                    time_month += 1 # use one of the first few days in the next month
                    if time_month == 13:  # no 13th month, so loop back to January
                        time_month = 1
            elif time_month in [4, 6, 9, 11]: # months with 30 days
                time_year = time_year - 1
                days_in_month = 30
                if time_day < (days_in_month - leap_year_adjustment):
                    time_day = time_day + leap_year_adjustment
                else:
                    time_day = ((time_day + leap_year_adjustment) - days_in_month) + 1
                    time_month += 1 # use one of the first few days in the next month
            elif time_month == 2:
                time_year = time_year - 1
                days_in_month = 28
                if time_day < (days_in_month - leap_year_adjustment):
                    time_day = time_day + leap_year_adjustment
                else:
                    time_day = ((time_day + leap_year_adjustment) - days_in_month) + 1
                    time_month += 1 # use one of the first few days in the next month
        # ---------------------------------------------------------------------------------

        # zero-pad month and day to two digits
        time_m_formatted = '{:02d}'.format(time_month)
        time_d_formatted = '{:02d}'.format(time_day)

        # TODO: Some files (mostly 2002) start at _0 and sometimes just end with -2 and if repeat date, include _1.nc
        # Below is a bug fix for that, but we should ensure the workflow that generated this issue gets fixed
        if antarctic_mask_file[-5:-4] == '_':
            sample_id = int(antarctic_mask_file[-4:-3])
        else:
            sample_id = 0
        
        if sample_id == 0:
            time_index+=1
        
        qa_aa_ds = xr.open_dataset(antarctic_mask_file)
        
        # reorientation needed for viewing with matplotlib
        qa_aa_ds.ar_masks['phony_dim_0'] = qa_aa_ds.ar_masks['phony_dim_0'][::-1]
        qa_aa_ds.reindex(phony_dim_0=list(reversed(qa_aa_ds.phony_dim_0)))
        
        # fill in date and datesec
        date[time_index] = int(str(time_year)+time_m_formatted+time_d_formatted)
        if round_val==1 or round_val==2:
            datesec[time_index] = 81000  # corresponds to 22:30
        if round_val==3:
            datesec[time_index] = (3600*time_hour)+(60*time_mins)
        # fill in ar_mask
        ar_mask[time_index,sample_id,:,:]=qa_aa_ds.ar_masks

        pr = xr.DataArray(pr)
        psl = xr.DataArray(psl)
        tmq = xr.DataArray(tmq)
        ivt = xr.DataArray(ivt)

        # timestamp used to select the nearest record from each underlying dataset
        timestamp = cftime.DatetimeNoLeap(time_year, time_month, time_day, time_hour, time_mins, 0, 0, has_year_zero=True)
        # index of each field along the 'variable' axis created by to_array();
        # the positions differ between rounds
        if round_val==1 or round_val==2:
            var_index = {'pr': 5, 'psl': 5, 'tmq': 5, 'ivt': 7}
        if round_val==3:
            var_index = {'pr': 3, 'psl': 3, 'tmq': 3, 'ivt': 6}
        ncfile['pr'][time_index,:,:] = pr_ds.sel(time=timestamp, method='nearest').isel(height=0).to_array().dropna('variable').dropna('nvertices').dropna('nb2')[var_index['pr'],:,:,1,1]
        ncfile['psl'][time_index,:,:] = psl_ds.sel(time=timestamp, method='nearest').to_array().dropna('variable').dropna('nvertices').dropna('nbnd')[var_index['psl'],:,:,1,1]
        ncfile['tmq'][time_index,:,:] = tmq_ds.sel(time=timestamp, method='nearest').to_array().dropna('variable').dropna('nvertices').dropna('nbnd')[var_index['tmq'],:,:,1,1]
        ncfile['ivt'][time_index,:,:] = ivt_ds.sel(time=timestamp, method='nearest').to_array().dropna('variable').dropna('nvertices').dropna('bound')[var_index['ivt'],:,:,1,1]
            
        print('all data added for {}'.format(str(time_year)+time_m_formatted+time_d_formatted))

        # time reported on netcdf frame:
        time_val = cftime.DatetimeNoLeap(time_year, time_month, time_day, time_hour, time_mins, 0, 0, has_year_zero=True) - cftime.DatetimeNoLeap(1970, 1, 1, 0, 0, 0, 0, has_year_zero=True)
        ncfile['time'][time_index] = (time_val.days * 24) + (time_val.seconds / 3600)
    
    # TODO: figure out how to run last few dates in year separately?
    
    ncfile.close()
print('wrote netcdf at {}'.format(dt.datetime.now()))

starting at 2024-11-01 08:40:29.560687
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-01-22-10-1.nc
2003-01-22-10-1
all data added for 20030122
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-01-22-10-1_1.nc
2003-01-22-10-1
all data added for 20030122
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-01-25-80-8.nc
2003-01-25-80-8
all data added for 20030125
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-01-25-80-8_1.nc
2003-01-25-80-8
all data added for 20030125
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-02-23-20-2.nc
2003-02-23-20-2
all data added for 20030223
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-02-23-20-2_1.nc
2003-02-23-20-2
all data added for 20030223
/glade/work/tking/cgnet/QA_xml/round_3/h5/qa1/antarctic/netcdfs/data-2003-02-28-60-6.nc
2003-02-28-60-6
all data added for 20030228
/glade/work/tking/cgnet/QA_xml/
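
The loop above stores `time` as hours since 1970-01-01 in a `noleap` calendar, computed by subtracting two `cftime.DatetimeNoLeap` objects. The same number can be checked by hand, since every year in that calendar has 365 days. The following is a minimal sketch using only the standard library; the function name is hypothetical.

```python
# Days in each month of a noleap (365-day) calendar
DAYS_PER_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def noleap_hours_since_1970(year, month, day, hour=0, minute=0):
    """Hours elapsed since 1970-01-01 00:00 in a calendar without leap days."""
    days = (year - 1970) * 365 + sum(DAYS_PER_MONTH[:month - 1]) + (day - 1)
    return days * 24 + hour + minute / 60


# First mask in the output above: 2003-01-22, round 3 step 1, i.e. 01:30
print(noleap_hours_since_1970(2003, 1, 22, 1, 30))
```

Comparing this value with the `time` variable written to the file is a quick way to confirm that the year, day, and hour adjustments in the loop produced the intended timestamp.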

In [7]:
# TODO:
#       1. MOVE TEMP_FILLED.NC to 2003 filename
#       2. start ML!

In [None]:
# view the data in polar format from command line:
# ncks -v pr -d sample_id,1 -d time,1 temp.nc out.nc
# ncks -d sample_id,1 -d time,1 temp.nc out.nc

# PART 2:

### Use the standard ESMF mapping procedure to go from the projected stereographic polar mask grids to our regular gridded CESM data, using Steve's Python code and John's mapping file commands

In [35]:
# We want to do this to get masks and state information on two different grids
# need new mapping file to do this, then run through Steve's routine

In [None]:
# Steve Yeager has a utility function for remapping CAM-SE output (see remap_camse function below):
#     https://github.com/sgyeager/mypyutils/blob/main/mypyutils/regrid_utils.py

import xarray as xr
import numpy as np
import scipy.sparse as sps
import cf_xarray

def remap_camse(ds, dsw, varlst=None):
    #dso = xr.full_like(ds.drop_dims('ncol'), np.nan)
    dso = ds.drop_dims('ncol').copy()
    lonb = dsw.xc_b.values.reshape([dsw.dst_grid_dims[1].values, dsw.dst_grid_dims[0].values])
    latb = dsw.yc_b.values.reshape([dsw.dst_grid_dims[1].values, dsw.dst_grid_dims[0].values])
    weights = sps.coo_matrix((dsw.S, (dsw.row-1, dsw.col-1)), shape=[dsw.dims['n_b'], dsw.dims['n_a']])
    if not varlst:
        # default: remap every variable that has an 'ncol' dimension, except the grid variables
        # (a new list is built on each call, so no state is shared between calls)
        varlst = [varname for varname in list(ds) if 'ncol' in ds[varname].dims]
        if 'lon' in varlst: varlst.remove('lon')
        if 'lat' in varlst: varlst.remove('lat')
        if 'area' in varlst: varlst.remove('area')
    for varname in varlst:
        shape = ds[varname].shape
        invar_flat = ds[varname].values.reshape(-1, shape[-1])
        remapped_flat = weights.dot(invar_flat.T).T
        remapped = remapped_flat.reshape([*shape[0:-1], dsw.dst_grid_dims[1].values,
                                          dsw.dst_grid_dims[0].values])
        dimlst = list(ds[varname].dims[0:-1])
        dims={}
        coords={}
        for it in dimlst:
            dims[it] = dso.dims[it]
            coords[it] = dso.coords[it]
        dims['lat'] = int(dsw.dst_grid_dims[1])
        dims['lon'] = int(dsw.dst_grid_dims[0])
        coords['lat'] = latb[:,0]
        coords['lon'] = lonb[0,:]
        remapped = xr.DataArray(remapped, coords=coords, dims=dims, attrs=ds[varname].attrs)
        dso = xr.merge([dso, remapped.to_dataset(name=varname)])
    return dso
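
The core of `remap_camse` is a sparse matrix product: the ESMF weight file lists, for each nonzero weight, a 1-based destination index (`row`), a 1-based source index (`col`), and the weight itself (`S`). As a sketch of that step alone, the example below applies a tiny made-up weight set with plain NumPy instead of `scipy.sparse`. The function name and all values are illustrative and do not come from a real mapping file.

```python
import numpy as np


def apply_weights(field, row, col, S, n_b):
    """Remap the last axis of `field` (n_a source cells) onto n_b destination cells.

    row and col are 1-based indices, as stored in ESMF weight files.
    """
    field = np.asarray(field, dtype=float)
    flat = field.reshape(-1, field.shape[-1])  # collapse leading dims such as time
    out = np.zeros((flat.shape[0], n_b))
    for r, c, s in zip(row, col, S):
        out[:, r - 1] += s * flat[:, c - 1]  # accumulate each weighted contribution
    return out.reshape(*field.shape[:-1], n_b)


# Made-up weights: destination 1 averages sources 1 and 2, destination 2 copies source 3
row = [1, 1, 2]
col = [1, 2, 3]
S = [0.5, 0.5, 1.0]

source = np.array([[10.0, 20.0, 30.0],   # time step 0
                   [1.0, 2.0, 3.0]])     # time step 1
print(apply_weights(source, row, col, S, n_b=2))
```

The explicit loop is only for readability; for real mapping files with millions of weights, the sparse matrix product used in `remap_camse` is the practical choice.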

In [None]:
# Here is a notebook demonstrating how this is used:
#     /glade/u/home/yeager/analysis/python/toshare/CLM_field_regrid.ipynb