## First attempt at converting the ES-DOC Errata issues into a CSV for filtering our CMIP6 catalogs

- Download the web page https://errata.es-doc.org/static/index.html and save it as `file.html`
![es-doc](../assets/es-doc.png)

In [None]:
! /bin/grep "<tr id=" file.html > issues.txt
# For now, edit issues.txt, keeping only the lines we need  
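
The hand-editing step could probably be replaced by a pattern match. The following is only a sketch, under the assumption that the rows we want carry a UUID-shaped `id` attribute (like the `issue_uid` values shown further down) and that other `<tr id=` rows do not. The helper name `extract_issue_ids` and the sample lines are made up for illustration.

```python
import re

# Assumption: issue rows look like <tr id="<uuid>" ...>, with the uuid in the
# usual 8-4-4-4-12 lowercase hex layout.
UUID_ROW = re.compile(r'<tr id="([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"')

def extract_issue_ids(lines):
    """Return the uuid-shaped ids found in <tr id="..."> lines, in order."""
    ids = []
    for line in lines:
        match = UUID_ROW.search(line)
        if match:
            ids.append(match.group(1))
    return ids

# Hypothetical sample lines: one issue row and one row we would have deleted by hand
sample = [
    '<tr id="29e52dd0-8034-1002-8579-587903b1eb40" class="x">',
    '<tr id="header-row">',
]
print(extract_issue_ids(sample))
```

With something like this, `issues.txt` would not need manual cleanup, as long as the UUID assumption holds for the downloaded page.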

In [10]:
import os
import pandas as pd
from glob import glob
import qgrid

In [13]:
filepath = 'issues.txt'
issues = []
with open(filepath) as fp:
    for line in fp:
        # The issue id is the first double-quoted string on each <tr id="..."> line
        issue = line.split('"')[1]
        issues.append(issue)
        # Command to retrieve this issue with esgissue (execution left commented out)
        command = '/usr/bin/esgissue retrieve --id ' + issue
        #print(command)
        #os.system(command)

In [4]:
import json

columns = ['uid', 'title', 'description', 'project', 'severity', 'status', 'urls']

rows = []     # one dict per issue
df_list = []  # one list of dataset ids per issue
for issue in issues:
    file_dsets = '/home/naomi/.esdoc/errata/dsets_dw/dset_'+issue+'.txt'
    file_issue = '/home/naomi/.esdoc/errata/issue_dw/issue_'+issue+'.json'

    with open(file_issue) as json_file:
        dict_issue = json.load(json_file)

    # Some issues have no 'urls' entry; default to an empty list
    dict_issue.setdefault('urls', [])
    rows.append(dict_issue)

    # The dataset file lists one dataset id per line
    df_dsets = pd.read_csv(file_dsets, sep=r'\s+', header=None)
    df_dsets = df_dsets.rename(columns={0: "file_id"})
    df_list.append(list(df_dsets.file_id.values))

# Build the frame in one step (DataFrame.append was removed in pandas 2.0),
# keeping the expected columns first and any extra JSON keys after them
df = pd.DataFrame(rows)
df = df.reindex(columns=columns + [c for c in df.columns if c not in columns])
df['file_ids'] = df_list
df = df.rename(columns={"uid": "issue_uid"})
#df.file_ids.values[0]

In [5]:
df

Unnamed: 0,issue_uid,title,description,project,severity,status,urls,file_ids
0,29e52dd0-8034-1002-8579-587903b1eb40,Incorrect ocean oxygen data,Certain ocean oxygen fields are erroneous and ...,cmip6,critical,wontfix,[],[CMIP6.AerChemMIP.NCAR.CESM2-WACCM.hist-1950HC...
1,ce889690-1ef3-6f46-9152-ccb27fc42490,Missing scaling factor from all lossch4 data.,lossch4 is missing a 1.e6/6.022e23 scaling fac...,cmip6,critical,resolved,[],[CMIP6.AerChemMIP.NCAR.CESM2-WACCM.hist-1950HC...
2,b53253df-011d-d063-ae4d-3a45834a5364,Issues in ScenarioMIP ssp370 CNRM-ESM2-1 r1i1p...,Problem of reproductibility with inconsistenci...,cmip6,critical,new,[],[CMIP6.ScenarioMIP.CNRM-CERFACS.CNRM-ESM2-1.ss...
3,acd51055-f7bc-3d58-b980-a6fb11ee287b,Convert climatologies to monthly time serie,"co2mass, ch4global and n2oglobal variables hav...",cmip6,medium,resolved,[],[CMIP6.CFMIP.IPSL.IPSL-CM6A-LR.abrupt-0p5xCO2....
4,0472d183-b8ae-2a4a-8384-7363cdb6b297,Issues in aqua-P4K experiment,An error was detected in the aqua-p4K experime...,cmip6,critical,new,[],[CMIP6.CFMIP.CNRM-CERFACS.CNRM-CM6-1.aqua-p4K....
...,...,...,...,...,...,...,...,...
72,7f680e48-6bfd-60a1-9e10-0d4acdc0fdb0,Some sea ice variables in 3D instead of 1D,Some sea ice variables have been written as a ...,cmip6,low,resolved,[],[CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1...
73,45f9e7b9-1844-7a92-8b54-10d954e621db,Time instantaneous data with time boundaries,Several files with an instantaneous time axis ...,cmip6,low,wontfix,[],[CMIP6.CFMIP.IPSL.IPSL-CM6A-LR.abrupt-0p5xCO2....
74,54720dfd-7c8f-7a68-9986-ceac4b77ffd0,Integers instead of PFTs names,"The ""veget"" coordinate is a list of integers (...",cmip6,low,resolved,[],[CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1...
75,6f1ce955-84de-c62b-9a62-9fb7422369f2,Integers instead of ocean passages names,"The ""sector"" dimensions is a list of integers ...",cmip6,low,resolved,[],[CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1...


In [14]:
keywords = ['issue_uid','source_id', 'experiment_id', 'member_id', 'table_id', 'variable_id', 'grid_label', 'version', 'file_id']
rows = []
for issue_uid, file_ids in zip(df.issue_uid.values, df.file_ids.values):
    for file in file_ids:
        #print(file)
        try:
            # Expect nine dot-separated parts, the last one being grid_label#version
            [fill,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_version] = file.split('.')
            [grid_label,version] = grid_version.split('#')
        except ValueError:
            print('not working for ',file)
            continue  # skip ids that do not match the expected layout
        # issue_uid comes first so the values line up with keywords
        klist = [issue_uid,source_id,experiment_id,member_id,table_id,variable_id,grid_label,version,file]
        rows.append(dict(zip(keywords, klist)))
# One row per (issue, dataset) pair
df_expand = pd.DataFrame(rows, columns=keywords)
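
To make the splitting step concrete, here is a small sketch that parses a single dataset id into the same facets. The helper name `parse_dataset_id` is hypothetical, and the example id is made up to follow the nine-part layout the loop above assumes (it is not taken from the errata data).

```python
def parse_dataset_id(dataset_id):
    """Split a dataset id into facets, or return None if the layout is unexpected."""
    parts = dataset_id.split('.')
    # Expect 9 dot-separated parts, the last one shaped like grid_label#version
    if len(parts) != 9 or '#' not in parts[-1]:
        return None
    grid_label, version = parts[-1].split('#', 1)
    names = ['source_id', 'experiment_id', 'member_id', 'table_id', 'variable_id']
    # parts[0:3] are the project, activity_id and institution_id, which we drop
    facets = dict(zip(names, parts[3:8]))
    facets.update(grid_label=grid_label, version=version, file_id=dataset_id)
    return facets

# Made-up id for illustration only
example = 'CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.o2.gn#20180727'
print(parse_dataset_id(example))
```

Returning `None` for malformed ids makes it easy to skip them instead of carrying stale values over from the previous iteration.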

In [15]:
qgrid.show_grid(df_expand)

QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': True, 'defau…
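
None of the cells above writes the CSV or applies it to a catalog yet. A minimal sketch of how that last step might look, assuming the catalog is a pandas DataFrame with the same facet columns and that `version` is stored as a string in both tables. The `errata` and `catalog` frames below are tiny made-up stand-ins (for `df_expand` and a real catalog), not actual data.

```python
import pandas as pd

facets = ['source_id', 'experiment_id', 'member_id', 'table_id',
          'variable_id', 'grid_label', 'version']

# Hypothetical stand-in for df_expand: one dataset flagged by one issue
errata = pd.DataFrame(
    [['issue-a', 'MODEL-X', 'historical', 'r1i1p1f1', 'Omon', 'o2', 'gn', '20190101']],
    columns=['issue_uid'] + facets,
)

# Hypothetical catalog: the first row matches the errata entry, the second does not
catalog = pd.DataFrame(
    [['MODEL-X', 'historical', 'r1i1p1f1', 'Omon', 'o2', 'gn', '20190101'],
     ['MODEL-X', 'historical', 'r1i1p1f1', 'Amon', 'tas', 'gn', '20190101']],
    columns=facets,
)

# With no path argument, to_csv returns the CSV text; pass a file name to write it
csv_text = errata.to_csv(index=False)

# Anti-join: keep only catalog rows that have no match in the errata table
flagged = catalog.merge(errata[facets].drop_duplicates(),
                        on=facets, how='left', indicator=True)
clean = flagged[flagged['_merge'] == 'left_only'].drop(columns='_merge')
print(clean)
```

Whether to drop every flagged dataset or only those with a given `severity` or `status` is a separate decision; that would need those columns joined in from `df` on `issue_uid`.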