
## A micro workflow to batch update CMR records (C, G, or V).

In [1]:
from numpy import nan
from json import dumps
from os.path import dirname
from requests.exceptions import MissingSchema as MissingSchemaError
from requests import get
from pandas import DataFrame, notnull

uat = "https://cmr.uat.earthdata.nasa.gov/search"
cmr = "https://cmr.earthdata.nasa.gov/search"

The search URL to get all POCLOUD variables from the UAT environment:

In [2]:
url = f"{uat}/variables.umm_json?provider=POCLOUD&page_size=2000"
print(url)

https://cmr.uat.earthdata.nasa.gov/search/variables.umm_json?provider=POCLOUD&page_size=2000


Get ALL variable records for POCLOUD. Print the number of results.

In [3]:
res = get(url).json()
res['hits']

210
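
The request above asks for up to 2000 records in a single page, which comfortably covers the 210 hits here. For a provider with more records than fit in one page, the results could be collected page by page. A minimal sketch, assuming CMR's `page_num` search parameter is used alongside `page_size`; `fetch_all_items` and its `fetch_json` callback are hypothetical helper names, not part of the original workflow:

```python
from typing import Callable, Dict, List


def fetch_all_items(
    base_url: str,
    fetch_json: Callable[[str], Dict],
    page_size: int = 2000,
) -> List[Dict]:
    """Collects 'items' across result pages until 'hits' records are gathered.

    fetch_json is any callable that takes a URL and returns the parsed
    JSON response (hypothetical hook, so the paging logic can be tested
    without network access).
    """
    items: List[Dict] = []
    page_num = 1
    while True:
        page = fetch_json(f"{base_url}&page_size={page_size}&page_num={page_num}")
        items.extend(page["items"])
        # Stop once everything is collected, or if a page comes back empty.
        if len(items) >= page["hits"] or not page["items"]:
            return items
        page_num += 1


# Possible usage with the names defined earlier in this notebook:
# items = fetch_all_items(
#     f"{uat}/variables.umm_json?provider=POCLOUD",
#     lambda u: get(u).json(),
# )
```

Passing the fetch function in as an argument keeps the paging loop independent of `requests`, so it can be exercised with a stub before touching the real service.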

Reformat the variable search metadata and the umm metadata into two data frames for convenient processing:

In [4]:
meta = DataFrame.from_records([r['meta'] for r in res['items']])
meta.iloc[0]

revision-id                                      3
deleted                                      False
format           application/vnd.nasa.cmr.umm+json
provider-id                                POCLOUD
user-id                                   jmcnelis
native-id                                    agc_c
concept-id                     V1234656554-POCLOUD
revision-date                 2020-06-12T21:48:04Z
concept-type                              variable
Name: 0, dtype: object

In [5]:
def _meta_table(columns: list=list(meta.columns), index: str="concept-id"):
    """Reindexes the search metadata table and subsets the columns."""
    return meta[columns].copy().set_index(meta[index])


_meta_table([
    'native-id',
    'concept-id',
    'revision-id',
    'revision-date',
    'user-id',
])

concept-id,native-id,concept-id,revision-id,revision-date,user-id
V1234656554-POCLOUD,agc_c,V1234656554-POCLOUD,3,2020-06-12T21:48:04Z,jmcnelis
V1234656555-POCLOUD,agc_ku,V1234656555-POCLOUD,2,2020-06-12T21:48:30Z,jmcnelis
V1234656556-POCLOUD,agc_numval_c,V1234656556-POCLOUD,2,2020-06-12T21:48:31Z,jmcnelis
V1234656557-POCLOUD,agc_numval_ku,V1234656557-POCLOUD,2,2020-06-12T21:48:31Z,jmcnelis
V1234656558-POCLOUD,agc_rms_c,V1234656558-POCLOUD,2,2020-06-12T21:48:31Z,jmcnelis
...,...,...,...,...,...
V1234656688-POCLOUD,wind_speed_model_u_era,V1234656688-POCLOUD,2,2020-06-12T21:50:04Z,jmcnelis
V1234656689-POCLOUD,wind_speed_model_v,V1234656689-POCLOUD,2,2020-06-12T21:50:05Z,jmcnelis
V1234656690-POCLOUD,wind_speed_model_v_era,V1234656690-POCLOUD,2,2020-06-12T21:50:05Z,jmcnelis
V1234656691-POCLOUD,wind_speed_rad,V1234656691-POCLOUD,2,2020-06-12T21:50:06Z,jmcnelis


Make a similar table for the UMM metadata and display the first row:

In [6]:
umm = DataFrame.from_records([r['umm'] for r in res['items']])
umm.iloc[0]

VariableType                                              SCIENCE_VARIABLE
DataType                                                             int16
Offset                                                                   0
Scale                                                                 0.01
Characteristics          {'GroupPath': '/', 'IndexRanges': {'LatRange':...
FillValues                 [{'Value': 32767, 'Type': 'SCIENCE_FILLVALUE'}]
Sets                     [{'Name': 'agc_c', 'Type': 'General', 'Size': ...
Dimensions               [{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...
Definition                                            C band corrected AGC
Name                                                                 agc_c
AcquisitionSourceName                                      radar altimeter
ValidRanges                                [{'Min': -32768, 'Max': 32767}]
Units                                                                   dB
LongName                                              C band corrected AGC
VariableSubType                                                        NaN
Name: 0, dtype: object

We need to replace the `GroupPath` field of every variable (to resolve issues identified by M. Gangl).

`GroupPath` is nested inside the `Characteristics` field:

In [7]:
umm.Characteristics

0      {'GroupPath': '/', 'IndexRanges': {'LatRange':...
1      {'GroupPath': '/', 'IndexRanges': {'LatRange':...
2      {'GroupPath': '/', 'IndexRanges': {'LatRange':...
3      {'GroupPath': '/', 'IndexRanges': {'LatRange':...
4      {'GroupPath': '/', 'IndexRanges': {'LatRange':...
                             ...                        
205    {'GroupPath': '/', 'IndexRanges': {'LatRange':...
206    {'GroupPath': '/', 'IndexRanges': {'LatRange':...
207    {'GroupPath': '/', 'IndexRanges': {'LatRange':...
208    {'GroupPath': '/', 'IndexRanges': {'LatRange':...
209    {'GroupPath': '/', 'IndexRanges': {'LatRange':...
Name: Characteristics, Length: 210, dtype: object

## The main step to batch fix any CMR records

*Define a simple function that replaces the bad metadata in place.*

Apply the function over the `Characteristics` column to strip the trailing variable name from each `GroupPath`:

In [8]:
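def _fix_GroupPath(x):
    """Strips the trailing variable name from a record's GroupPath."""
    # Records without Characteristics come through as NaN, not as a dict.
    if not isinstance(x, dict):
        return None
    p = dirname(x['GroupPath'])
    # dirname returns "" when the path has no parent; use the root group.
    # Note: x is modified in place, so re-running this cell strips one more
    # level from any path that still has a parent group.
    x['GroupPath'] = "/" if p == "" else p
    return x


# Modify and replace the characteristics of all variables to fix GroupPath.
umm.Characteristics = umm.Characteristics.apply(_fix_GroupPath)

# Display the first five rows.
umm.head()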

,VariableType,DataType,Offset,Scale,Characteristics,FillValues,Sets,Dimensions,Definition,Name,AcquisitionSourceName,ValidRanges,Units,LongName,VariableSubType
0,SCIENCE_VARIABLE,int16,0.0,0.01,"{'GroupPath': '/', 'IndexRanges': {'LatRange':...","[{'Value': 32767, 'Type': 'SCIENCE_FILLVALUE'}]","[{'Name': 'agc_c', 'Type': 'General', 'Size': ...","[{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...",C band corrected AGC,agc_c,radar altimeter,"[{'Min': -32768, 'Max': 32767}]",dB,C band corrected AGC,
1,SCIENCE_VARIABLE,int16,0.0,0.01,"{'GroupPath': '/', 'IndexRanges': {'LatRange':...","[{'Value': 32767, 'Type': 'SCIENCE_FILLVALUE'}]","[{'Name': 'agc_ku', 'Type': 'General', 'Size':...","[{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...",Ku band corrected AGC,agc_ku,radar altimeter,"[{'Min': -32768, 'Max': 32767}]",dB,Ku band corrected AGC,
2,SCIENCE_VARIABLE,int16,0.0,1.0,"{'GroupPath': '/', 'IndexRanges': {'LatRange':...","[{'Value': 127, 'Type': 'SCIENCE_FILLVALUE'}]","[{'Name': 'agc_numval_c', 'Type': 'General', '...","[{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...",number of valid points used to compute C band AGC,agc_numval_c,radar altimeter,"[{'Min': 0, 'Max': 20}]",count,number of valid points used to compute C band AGC,
3,SCIENCE_VARIABLE,int16,0.0,1.0,"{'GroupPath': '/', 'IndexRanges': {'LatRange':...","[{'Value': 127, 'Type': 'SCIENCE_FILLVALUE'}]","[{'Name': 'agc_numval_ku', 'Type': 'General', ...","[{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...",number of valid points used to compute Ku band...,agc_numval_ku,radar altimeter,"[{'Min': 0, 'Max': 20}]",count,number of valid points used to compute Ku band...,
4,SCIENCE_VARIABLE,int16,0.0,0.01,"{'GroupPath': '/', 'IndexRanges': {'LatRange':...","[{'Value': 32767, 'Type': 'SCIENCE_FILLVALUE'}]","[{'Name': 'agc_rms_c', 'Type': 'General', 'Siz...","[{'Name': 'time', 'Size': 2240, 'Type': 'TIME_...",RMS of the C band AGC,agc_rms_c,radar altimeter,"[{'Min': -32768, 'Max': 32767}]",dB,RMS of the C band AGC,
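
To make the path handling concrete, the same `dirname` logic can be tried on a few standalone values. This is a minimal sketch: the example paths are hypothetical and are not taken from the POCLOUD records, and `parent_group` is an illustrative name. It uses `posixpath.dirname` so that `/` is treated as the separator on any operating system.

```python
from posixpath import dirname


def parent_group(path: str) -> str:
    """Returns the group that contains the variable at 'path'."""
    parent = dirname(path)
    # An empty result means the path had no parent, so use the root group.
    return "/" if parent == "" else parent


# Hypothetical paths: root-level, bare name, and nested group.
for p in ["/agc_c", "agc_c", "/data_01/ku/agc_c"]:
    print(p, "->", parent_group(p))
```

Applying the function to an already-corrected nested path, such as `/data_01/ku`, would return `/data_01`, which is why the fix should be applied only once per record.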


Replace any null values that were converted to NaN back to Python None:

In [9]:
# Cast to object first so that numeric columns can hold None instead of NaN.
umm = umm.astype(object).where(notnull(umm), None)
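
The reason this step matters is that `json.dumps` serializes a float NaN as the bare token `NaN`, which is not valid JSON. A minimal sketch of the same idea on a plain dict, assuming a record where a missing field arrived as NaN; `drop_nulls` is a hypothetical helper name and the sample record is made up for illustration:

```python
from json import dumps
from math import isnan


def drop_nulls(record: dict) -> dict:
    """Removes keys whose value is None or a float NaN."""

    def is_null(value) -> bool:
        return value is None or (isinstance(value, float) and isnan(value))

    return {k: v for k, v in record.items() if not is_null(v)}


# Made-up record with one missing field represented as NaN.
record = {"Name": "agc_c", "VariableSubType": float("nan"), "Offset": 0.0}

print(dumps(record))              # contains the non-standard token NaN
print(dumps(drop_nulls(record)))  # the missing field is omitted entirely
```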

The `GroupPath` of every record is now updated.

Loop over the rows, rebuild the records, and write one curl command per record to an external script for record keeping:

In [10]:
from shlex import quote

# Get a new CMR token by calling dedicated shell script.
TOKEN = !bash /Users/jmcnelis/Configuration/scripts/cmr/get-echo-token.sh
TOKEN = TOKEN[0]

# Open a shell script for writing; the context manager closes it on exit.
with open("grouppath-fix.sh", "w") as scr:

    # Write the shebang:
    scr.write("#!/bin/bash\n")

    # Loop the umm metadata table.
    for ix, row in umm.iterrows():

        # Turn the row back into a dict, dropping Nones:
        rec = {k: v for k, v in row.to_dict().items() if v is not None}

        # Write the curl command to the script. The doubled backslashes emit
        # literal shell line continuations, and quote() protects any single
        # quotes inside the JSON payload.
        scr.write(f"""
curl -i -XPUT \\
-H "Content-type: application/vnd.nasa.cmr.umm+json" \\
-H "Echo-Token: {TOKEN}" \\
https://cmr.uat.earthdata.nasa.gov/ingest/providers/POCLOUD/variables/{row['Name']} \\
--data-binary {quote(dumps(rec))}
""")
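
Because the JSON payload ends up inside a shell command, any apostrophe in a field such as `Definition` could end a single-quoted string early. A minimal sketch of a quoting check using the standard library's `shlex`; `curl_put` is a hypothetical helper, and the URL, token, and record below are placeholders rather than real CMR values:

```python
from json import dumps, loads
from shlex import quote, split


def curl_put(url: str, token: str, record: dict) -> str:
    """Builds one shell-safe curl command that PUTs 'record' to 'url'."""
    return " ".join([
        "curl", "-i", "-XPUT",
        "-H", quote("Content-type: application/vnd.nasa.cmr.umm+json"),
        "-H", quote(f"Echo-Token: {token}"),
        quote(url),
        "--data-binary", quote(dumps(record)),
    ])


# Placeholder inputs; the apostrophe is the case worth checking.
command = curl_put(
    "https://example.invalid/variables/agc_c",
    "PLACEHOLDER-TOKEN",
    {"Name": "agc_c", "Definition": "sensor's corrected gain"},
)

# Parse the command the way a POSIX shell would and recover the payload.
args = split(command)
print(loads(args[-1]))
```

Splitting the generated command with `shlex.split` and decoding the last argument gives a quick offline check that the payload survives shell parsing before the script is run against CMR.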