In [1]:
import os
#os.environ['IVOA_REGISTRY']="http://vao.stsci.edu/RegTAP/TapService.aspx"

import pyvo as vo
import warnings
# There are a number of relatively unimportant warnings that show up, so for now, suppress them:
warnings.filterwarnings("ignore", module="astropy.nddata.blocks.*")
warnings.filterwarnings("ignore", module="pyvo.utils.xml.*")
warnings.filterwarnings("ignore", module="urllib3.connectionpool.*")

# T. Dower said: "For RegTAP, the preferred URL is now
#  https://mast.stsci.edu/vo-tap/api/v0.1/registry.
#  OAI-PMH is still on the old system."
navo_new_regtap = vo.dal.TAPService('https://mast.stsci.edu/vo-tap/api/v0.1/registry')
navo_old_regtap = vo.dal.TAPService('https://vao.stsci.edu/RegTAP/TapService.aspx')
gavo_regtap = vo.dal.TAPService('https://dc.zah.uni-heidelberg.de/__system__/tap/run')
euvo_regtap = vo.dal.TAPService('https://registry.euro-vo.org/regtap/tap')

from pyvo import registry
from astropy.coordinates import SkyCoord

# Registry Spring Cleaning notebook

Following up on Markus' [Confessions of a Registry Janitor](https://blog.g-vo.org/registry-a-janitor-speaks-out.html), I propose some regular checks of the Registry metadata.  We already have checks of the validity of services, for instance, in the Operations group weather reports.  These checks would be complementary.

## Check 1:  spot check numbers between different registries

What's the best way to get the list of current registries?  Querying one of the registries to find them seems circular.  The [RofR](https://rofr.ivoa.net) still points to the old NAVO RegTAP; on the other hand, there is a bug in the new one.

In [2]:
result = registry.search(datamodel="regtap").to_table()
print(result['ivoid','access_urls'])

            ivoid              ...
------------------------------ ...
        ivo://aip.gavo.org/tap ...
ivo://archive.stsci.edu/regtap ...
      ivo://esavo/registry/tap ...
         ivo://org.gavo.dc/tap ...
                ivo://purx/tap ...


In [3]:
def compare(query):
    """Run `query` on three RegTAP services and report which ivoids differ."""
    # NAVO now uses the new MAST endpoint; the old one was
    # https://vao.stsci.edu/RegTAP/TapService.aspx
    navo_regtap = vo.dal.TAPService('https://mast.stsci.edu/vo-tap/api/v0.1/registry')
    gavo_regtap = vo.dal.TAPService('https://dc.zah.uni-heidelberg.de/__system__/tap/run')
    euvo_regtap = vo.dal.TAPService('https://registry.euro-vo.org/regtap/tap')
    sets = []
    for name, regtap in [('NAVO', navo_regtap), ('GAVO', gavo_regtap), ('EUVO', euvo_regtap)]:
        try:
            sias = regtap.search(query)
            print(f"{name} RegTAP finds {len(sias)}")
            sets.append(set(sias.to_table()['ivoid'].data))
        except Exception as e:
            print(f"{name} RegTAP gives error: {e}")
            # Keep one entry per registry so the unpacking below still
            # works after a failed query (the set is then simply empty).
            sets.append(set())
    navo, gavo, euvo = sets

    print("Unique to NAVO:")
    print(navo - gavo - euvo)
    print("Unique to GAVO:")
    print(gavo - navo - euvo)
    print("Unique to EUVO:")
    print(euvo - navo - gavo)

    print("Missing from NAVO but in both of the others:")
    print((gavo & euvo) - navo)
    print("Missing from GAVO but in both of the others:")
    print((navo & euvo) - gavo)
    print("Missing from EUVO but in both of the others:")
    print((navo & gavo) - euvo)

    print("Anything not in all three")
    print((navo | gavo | euvo) - (navo & gavo & euvo))

    return sets

In [4]:
sets = compare("select * from rr.capability where standard_id like 'ivo://ivoa.net/std/sia%'")

NAVO RegTAP finds 477
GAVO RegTAP finds 477
EUVO RegTAP finds 474
Unique to NAVO:
{'ivo://nasa.heasarc/skyview/planck030', 'ivo://irsa.ipac/spitzer/images/level1', 'ivo://vopdc.obspm/dfbs', 'ivo://irsa.ipac/spitzer/images/level2', 'ivo://irsa.ipac/dss/images', 'ivo://org.gavo.dc/maidanak/res/rawframes/rawframes'}
Unique to GAVO:
{'ivo://org.gavo.dc/lswscans/res/positions/siap'}
Unique to EUVO:
set()
Missing from NAVO but in both of the others:
{'ivo://org.gavo.dc/dasch/q/im', 'ivo://fai.kz/maksutov_50_telescope/q/i', 'ivo://fai.kz/schmidt_telescope_lc/q/i', 'ivo://iucaa/crts/siap', 'ivo://cxc.harvard.edu/cscr2.1.siap'}
Missing from GAVO but in both of the others:
{'ivo://irsa.ipac/herschel/images/z0mgs_dust'}
Missing from EUVO but in both of the others:
{'ivo://vopdc.obspm/gepi/vopsat/esor', 'ivo://vopdc.obspm/gepi/vopsat', 'ivo://vopdc.obspm/gepi/vopsat/srcj'}
Anything not in all three
{'ivo://org.gavo.dc/dasch/q/im', 'ivo://vopdc.obspm/gepi/vopsat/srcj', 'ivo://vopdc.obspm/gepi/vopsa

In [5]:
sets = compare("select * from rr.capability where standard_id like 'ivo://ivoa.net/std/hips%'")

NAVO RegTAP finds 26
GAVO RegTAP finds 637
EUVO RegTAP finds 576
Unique to NAVO:
set()
Unique to GAVO:
{'ivo://cds/p/dm/simbad-biblio/pub-dates/2022', 'ivo://cds/p/vphas/dr4/u', 'ivo://cds/p/act-planck/dr4dr6/map_healpix_u', 'ivo://cds/p/planckrevised/co21', 'ivo://cds/p/co-dame-2022', 'ivo://cds/p/skymapper/dr4/g', 'ivo://cds/p/dm/simbad-biblio/pub-dates/2021', 'ivo://cds/p/act-planck/dr4dr6/f090', 'ivo://cds/p/ast3ii/dr1', 'ivo://cds/p/denis/i', 'ivo://cds/p/vphas/dr4/halpha', 'ivo://cds/p/fds/dr1/r', 'ivo://cds/p/euclid/ero/nisp.h', 'ivo://cds/p/jwst/f444w', 'ivo://cds/p/vphas/dr4/color', 'ivo://cds/p/vphas/dr4/g', 'ivo://cds/p/planckrevised/co32', 'ivo://cds/p/jwst/f200w', 'ivo://cds/p/dm/flux-rp/i/355/gaiadr3', 'ivo://cds/p/galacticnucleus/dr1/color', 'ivo://cds/p/act-planck/dr4dr6/f220', 'ivo://cds/p/jwst/f150w', 'ivo://cds/p/jwst/f212n', 'ivo://cds/p/euclid/ero/nisp.j', 'ivo://cds/p/act-planck/dr4dr6/color_mw', 'ivo://cds/p/act-planck/dr4dr6/map_healpix_b', 'ivo://cds/p/jwst/ope

In [6]:
sets = compare("select * from rr.capability where standard_id like 'ivo://ivoa.net/std/cone%' and ivoid not like '%vizier%'")

NAVO RegTAP finds 1987
GAVO RegTAP finds 1956
EUVO RegTAP finds 1980
Unique to NAVO:
{'ivo://irsa.ipac/euclid/catalogs/mercat', 'ivo://sao.ru/dsa-cats/wsdb', 'ivo://astron.nl/hetdex/lotss-dr1-raw/cone'}
Unique to GAVO:
set()
Unique to EUVO:
{'ivo://astronet.ru/cas/wise'}
Missing from NAVO but in both of the others:
{'ivo://kasi_vo/nsvs/cs', 'ivo://cds.simbad/scs', 'ivo://cxc.harvard.edu/cscr2.1', 'ivo://wfau.roe.ac.uk/glimpse-dsa', 'ivo://au.csiro/psrda/atnf_pulsar.scs'}
Missing from GAVO but in both of the others:
{'ivo://astronet.ru/cas/ucac2', 'ivo://astronet.ru/cas/sdssdr7-field', 'ivo://astronet.ru/cas/sdssdr5-photoobjall', 'ivo://astronet.ru/cas/tycho2-suppl_2', 'ivo://astronet.ru/cas/ucac4', 'ivo://astronet.ru/cas/twomass-xsc', 'ivo://astronet.ru/cas/ucac3', 'ivo://astronet.ru/cas/gsc2_3_2', 'ivo://au.csiro/psrda/atnf_pulsar_scs', 'ivo://astronet.ru/cas/gaiadr2-gaia_source', 'ivo://astronet.ru/cas/tycho2-suppl_1', 'ivo://astronet.ru/cas/sdssdr6-phototag', 'ivo://astronet.ru/cas/

The hard part is then going through those differences and understanding why they exist.  What other information would we want to look at?
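One way to start on the "why" is to group the differing identifiers by authority, since a gap often traces back to a single publishing registry. The sketch below is only an illustration: the helper names `authority_of` and `summarize_by_authority` are made up here, and the input is the "Unique to NAVO" SIA set printed above.

```python
from collections import Counter

def authority_of(ivoid):
    """Return the authority part of an ivoid, e.g. 'irsa.ipac'."""
    without_scheme = ivoid.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0]

def summarize_by_authority(ivoids):
    """Count identifiers per authority, most frequent first."""
    return Counter(authority_of(i) for i in ivoids).most_common()

# The "Unique to NAVO" SIA identifiers from the output above
unique_to_navo = {
    'ivo://nasa.heasarc/skyview/planck030',
    'ivo://irsa.ipac/spitzer/images/level1',
    'ivo://vopdc.obspm/dfbs',
    'ivo://irsa.ipac/spitzer/images/level2',
    'ivo://irsa.ipac/dss/images',
    'ivo://org.gavo.dc/maidanak/res/rawframes/rawframes',
}
for authority, n in summarize_by_authority(unique_to_navo):
    print(f"{authority:15}: {n}")
```

The same helper could be applied to any of the sets returned by `compare`.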

## Check 2:  UCDs 

#### Look at all UCDs in the Registry

In [7]:
from astropy.io.votable.ucd import check_ucd
query="""
  select distinct ucd, count(*) as cnt
  from rr.table_column 
  group by ucd 
  order by cnt desc
  """
result = gavo_regtap.search(query)

all_ucds = result.to_table()
invalid_ucds = []
for i,u in enumerate(all_ucds['ucd'].data):
    if not check_ucd(u):
        invalid_ucds.append((u,all_ucds['cnt'][i]))
print(f"Found {len(invalid_ucds)} invalid UCDs")
print(f"  The top 10 bad UCD values by number of instances are")
x=[print(f"{c[0]:25}: {c[1]}") for c in invalid_ucds[0:10] ]

Found 79 invalid UCDs
  The top 10 bad UCD values by number of instances are
                         : 254736
??                       : 30342
vox:image_filesize       : 138
????                     : 70
vox:image_mjdateobs      : 59
image?                   : 49
vox:bandpass_id          : 47
vox:bandpass_hilimit     : 39
vox:bandpass_lolimit     : 39
vox:bandpass_refvalue    : 39


Note that the numbers of "??" and "????" have not changed since [Markus' post in 2023](https://blog.g-vo.org/registry-a-janitor-speaks-out.html).
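To see which kinds of problem dominate, the invalid values can be bucketed into a few rough classes. This is a minimal sketch: the class names and the function `ucd_failure_kind` are assumptions for illustration, and the input is the top-10 list printed above rather than the full `invalid_ucds`.

```python
from collections import defaultdict

def ucd_failure_kind(ucd):
    """Assign a rough, hypothetical failure class to an invalid UCD string."""
    s = ucd.strip()
    if not s:
        return "empty"
    if "?" in s:
        return "placeholder"
    if s.lower().startswith("vox:"):
        return "vox namespace"
    return "other"

def total_by_kind(invalid):
    """Sum the usage counts of (ucd, count) pairs per failure class."""
    totals = defaultdict(int)
    for ucd, cnt in invalid:
        totals[ucd_failure_kind(ucd)] += int(cnt)
    return dict(totals)

# Top-10 values copied from the output above
top10 = [
    ("", 254736), ("??", 30342), ("vox:image_filesize", 138),
    ("????", 70), ("vox:image_mjdateobs", 59), ("image?", 49),
    ("vox:bandpass_id", 47), ("vox:bandpass_hilimit", 39),
    ("vox:bandpass_lolimit", 39), ("vox:bandpass_refvalue", 39),
]
print(total_by_kind(top10))
```

Passing `invalid_ucds` instead of `top10` would give the totals over all invalid values.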

In [8]:
invalid_ucds_cv = []
for i,u in enumerate(all_ucds['ucd'].data):
    if not check_ucd(u,check_controlled_vocabulary=True):
        invalid_ucds_cv.append((u,all_ucds['cnt'][i]))
print(f"Found {len(invalid_ucds_cv)} that are not valid under UCD1+ controlled vocabulary")
print(f"  The top 10 bad UCD values by number of instances are")
for c in invalid_ucds_cv[0:10]:
    print(f"{c[0]:25}: {c[1]}")

Found 935 that are not valid under UCD1+ controlled vocabulary
  The top 10 bad UCD values by number of instances are
                         : 254736
??                       : 30342
error                    : 13825
code_misc                : 8538
phot_mag                 : 6291
fit_param                : 4505
obs.field                : 4334
number                   : 3090
id_number                : 2694
phot_intensity_adu       : 2512


#### UCDs at different publishers

Get the publishers with the most resources in the Registry, excluding VizieR (CDS), and check those.

In [9]:
publishers = gavo_regtap.search("""
    select distinct role_ivoid, count(*) as cnt , role_name
    from rr.res_role 
    where base_role = 'publisher' and role_name != 'CDS'
    group by role_ivoid, role_name
    order by cnt desc
    """).to_table()[0:10]
publishers

role_ivoid,cnt,role_name
object,int32,object
ivo://nasa.heasarc/asd,1091,NASA/GSFC HEASARC
ivo://irsa.ipac/irsa,563,NASA/IPAC Infrared Science Archive
,244,The GAVO DC team
,217,Planetary Data System
ivo://wfau.roe.ac.uk,123,"WFAU, Institute for Astronomy, University of Edinburgh"
ivo://archive.stsci.edu/stsci-arc,96,Space Telescope Science Institute Archive
ivo://svo.cab,71,SVO CAB
ivo://noirlab.edu,65,NSF NOIRLab Astro Data Lab Team
,57,Paris Astronomical Data Centre
,36,ASTRON


Adding publishers represented by the in-person attendees:

In [10]:
from astropy.table import Table, vstack

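# Strip each name so that the trailing space after the last entry
# does not end up inside the ILIKE pattern below.
inperson = [name.strip() for name in """\
ESO, PDS, AAS, NED, China-VO, INFN, \
CfA, CXC, Rubin, INAF, PADC, \
Hyderabad, UCLA \
""".split(',')]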

for institute in inperson:
    query=f""" 
    select distinct role_ivoid, count(*) as cnt , role_name
    from rr.res_role 
    where base_role = 'publisher' and role_name != 'CDS'
    and ( role_ivoid ilike '%{institute}%' or role_name ilike '%{institute}%' )
    group by role_ivoid, role_name
    order by cnt desc
    """
    r=gavo_regtap.search(query).to_table()
    if len(r) > 0:
        publishers=vstack([publishers, r[0]])
publishers

role_ivoid,cnt,role_name
object,int32,object
ivo://nasa.heasarc/asd,1091,NASA/GSFC HEASARC
ivo://irsa.ipac/irsa,563,NASA/IPAC Infrared Science Archive
,244,The GAVO DC team
,217,Planetary Data System
ivo://wfau.roe.ac.uk,123,"WFAU, Institute for Astronomy, University of Edinburgh"
ivo://archive.stsci.edu/stsci-arc,96,Space Telescope Science Institute Archive
ivo://svo.cab,71,SVO CAB
ivo://noirlab.edu,65,NSF NOIRLab Astro Data Lab Team
,57,Paris Astronomical Data Centre
,36,ASTRON


In [11]:
## Helper function to grab metadata, group it by publisher,
##   look for invalid values, and print a summary
def validate_publishers(query, publist, badvallist, label, quiet=False):
    import pandas as pd #  Handy functions
    warnings.filterwarnings("ignore", message=".*This pattern is interpreted as a regular expression.*")
    for p in [pp.strip() for pp in publist['role_name'].data]:
        #  Look at all the metadata from this publisher
        if not quiet: print(f"\nlooking at publisher {p}")
        try:
            #  Double any single quote so the name stays a valid ADQL string literal
            results = gavo_regtap.search(query.replace("xxxx", p.replace("'", "''")))
        except Exception as e:
            print(f"    Encountered exception {e} during query on publisher {p}")
            continue
        if len(results) == 0:
            if not quiet: print(f"    publisher {p} publishes no such metadata (?)")
            continue
        if not quiet:
            print(f"    publisher {p} publishes {len(results)} distinct values of {label}")

        #  Convert once and reuse for both columns
        tab = results.to_table()
        df = pd.DataFrame(data={
            label: tab[label].data.data,
            "cnt": tab['cnt'].data.data
        })
        pcount = 0
        for c in badvallist:  #  invalid_ucds or invalid_ucds_cv (this is huge)
            #  c is a tuple of (value, count)
            if c[0]=='':
                matches = df[label].astype(str).str.len() == 0
            elif '?' in c[0]:
                #  Substring match on "?": every bad value containing "?" reports
                #  the same total, the count of all values that contain a "?"
                matches = df[label].str.contains("?",regex=False)
            else:
                matches = df[label] == c[0]
            cnt = df[matches]['cnt'].sum() # normally a single row
            if cnt == 0:
                continue
            print(f"    value '{c[0]}' used {cnt} times")
            pcount += 1
            if pcount > 10:  break

In [12]:
query = f"""
        select ucd, count(*) as cnt from ( rr.res_role natural join rr.table_column )
        where role_name = 'xxxx'
        group by ucd 
        """
validate_publishers( query, publishers, invalid_ucds, "ucd")


looking at publisher NASA/GSFC HEASARC
    publisher NASA/GSFC HEASARC publishes 2451 distinct values of ucd
    value '' used 8596 times

looking at publisher NASA/IPAC Infrared Science Archive
    publisher NASA/IPAC Infrared Science Archive publishes no such metadata (?)

looking at publisher The GAVO DC team
    publisher The GAVO DC team publishes 830 distinct values of ucd
    value '' used 1373 times
    value 'vox:image_filesize' used 44 times
    value 'vox:image_mjdateobs' used 2 times

looking at publisher Planetary Data System
    publisher Planetary Data System publishes 41 distinct values of ucd
    value '' used 2604 times

looking at publisher WFAU, Institute for Astronomy, University of Edinburgh
    publisher WFAU, Institute for Astronomy, University of Edinburgh publishes 855 distinct values of ucd
    value '' used 301789 times
    value '??' used 60549 times
    value '????' used 60549 times
    value 'image?' used 60549 times
    value '???' used 60549 times
    

In [13]:
query = f"""
        select ucd, count(*) as cnt from ( rr.res_role natural join rr.table_column )
        where role_name = 'xxxx'
        group by ucd 
        """
validate_publishers( query, publishers, invalid_ucds_cv, "ucd")


looking at publisher NASA/GSFC HEASARC
    publisher NASA/GSFC HEASARC publishes 2451 distinct values of ucd
    value '' used 8596 times

looking at publisher NASA/IPAC Infrared Science Archive
    publisher NASA/IPAC Infrared Science Archive publishes no such metadata (?)

looking at publisher The GAVO DC team
    publisher The GAVO DC team publishes 830 distinct values of ucd
    value '' used 1373 times
    value 'vox:image_filesize' used 44 times
    value 'vox:image_mjdateobs' used 2 times
    value 'eq.pos.ra' used 8 times
    value 'eq.pos.dec' used 4 times

looking at publisher Planetary Data System
    publisher Planetary Data System publishes 41 distinct values of ucd
    value '' used 2604 times

looking at publisher WFAU, Institute for Astronomy, University of Edinburgh
    publisher WFAU, Institute for Astronomy, University of Edinburgh publishes 855 distinct values of ucd
    value '' used 301789 times
    value '??' used 60549 times
    value 'error' used 25260 times
 

In [14]:
culprits = []
for i,u in enumerate(all_ucds['ucd'].data):
    if not check_ucd(u,check_controlled_vocabulary=True):
        culprits.append((u,all_ucds['cnt'][i]))
print(f"Found {len(culprits)} that are not valid under UCD1+ controlled vocabulary")
print(f"  The top 10 bad UCD values by number of instances are")
x=[print(f"{c[0]:25}: {c[1]}") for c in culprits[0:10] ]

Found 935 that are not valid under UCD1+ controlled vocabulary
  The top 10 bad UCD values by number of instances are
                         : 254736
??                       : 30342
error                    : 13825
code_misc                : 8538
phot_mag                 : 6291
fit_param                : 4505
obs.field                : 4334
number                   : 3090
id_number                : 2694
phot_intensity_adu       : 2512


## Check 3:  authors

In [15]:
query = f"""
    select distinct role_name, count(*) as cnt 
    from rr.res_role 
    where base_role = 'creator' 
    group by role_name
    """
gavo_regtap.search(query).to_table()

  warn("Partial result set. Potential causes MAXREC, async storage space, etc.",


role_name,cnt
object,int32
"Lagrange A.-M.,Langlois M.",1
"Burstein D.,Bohlin R.C.",1
DAVILA H.,1
"Dumusque X.,Fulton B.J.",1
"Queloz D.,Rauer H.",1
Hillwig T.C.,7
Ranadive P.,1
Lalitha S.,9
Moriarty-Schieven G.,6
...,...


The creator names come in several formats:
* Last F.
* Last F., Last2 F.
* Last, F.
* F. Last, Last2. F.

At least where commas appear, they separate two authors rather than a surname from its initial (as in "Last, F").
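The formats listed above can be told apart with a few regular expressions. This is a rough sketch, not a full name parser: the patterns and the function `classify_creator` are assumptions made for illustration, and multi-word surnames simply fall into "other".

```python
import re

INITIALS = r"[A-Z]\.(?:-?[A-Z]\.)*"   # e.g. "F.", "R.C.", "A.-M."
SURNAME = r"[^\s,.]+"                 # single token, e.g. "Moriarty-Schieven"
LAST_F = re.compile(rf"{SURNAME} {INITIALS}")
LAST_COMMA_F = re.compile(rf"{SURNAME}, {INITIALS}")
F_LAST = re.compile(rf"{INITIALS} {SURNAME}")

def classify_creator(role_name):
    """Return a rough format label for one role_name value."""
    s = role_name.strip()
    if LAST_F.fullmatch(s):
        return "Last F."
    parts = [p.strip() for p in s.split(",")]
    if len(parts) > 1 and all(LAST_F.fullmatch(p) for p in parts):
        return "Last F., Last2 F."
    if LAST_COMMA_F.fullmatch(s):
        return "Last, F."
    if F_LAST.fullmatch(s):
        return "F. Last"
    return "other"

# Values taken from the table above
for name in ["Lagrange A.-M.,Langlois M.", "DAVILA H.", "Moriarty-Schieven G."]:
    print(f"{name!r}: {classify_creator(name)}")
```

Counting the labels over the whole `role_name` column would show how common each format is, bearing in mind that the query above returned only a partial result set.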

In [16]:
names = gavo_regtap.search("select distinct role_name, count(*) as cnt from rr.res_role where base_role = 'creator' group by role_name").to_table()
names

  warn("Partial result set. Potential causes MAXREC, async storage space, etc.",


role_name,cnt
object,int32
"Lagrange A.-M.,Langlois M.",1
"Burstein D.,Bohlin R.C.",1
DAVILA H.,1
"Dumusque X.,Fulton B.J.",1
"Queloz D.,Rauer H.",1
Hillwig T.C.,7
Ranadive P.,1
Lalitha S.,9
Moriarty-Schieven G.,6
...,...


## Check 4:  subjects and the UAT

In [17]:
subjects = gavo_regtap.search("select res_subject, count(*) as cnt from rr.res_subject group by res_subject order by cnt desc").to_table()
subjects

res_subject,cnt
object,int32
visible-astronomy,7270
galaxies,4355
infrared-photometry,4333
spectroscopy,4294
photometry,3837
radial-velocity,2856
surveys,2761
redshifted,2669
variable-stars,2033
...,...


In [18]:
import urllib.request, json 
with urllib.request.urlopen("https://raw.githubusercontent.com/astrothesaurus/UAT/master/UAT.json") as url:
    uat = json.load(url)

In [19]:
#  Generator that goes through the nested JSON and looks for a key anywhere down in it
def item_generator(json_input, lookup_key):
    if isinstance(json_input, dict):
        for k, v in json_input.items():
            if k == lookup_key:
                yield v
            else:
                yield from item_generator(v, lookup_key)
    elif isinstance(json_input, list):
        for item in json_input:
            yield from item_generator(item, lookup_key)

In [20]:
uat_name_list = [x.lower() for x in item_generator(uat,'name')]
print(f"Found {len(uat_name_list)} names in the UAT")
print(uat_name_list[0:10])

Found 4335 names in the UAT
['astrophysical processes', 'astrophysical magnetism', 'cosmic magnetic fields theory', 'emerging flux tubes', 'magnetic fields', 'geomagnetic fields', 'magnetic anomalies', 'primordial magnetic fields', 'gravitation', 'relativity']


In [21]:
invalid_subjects = []
correct_subjects = []
for i,s in enumerate(subjects['res_subject'].data):
    if s.lower() in uat_name_list:
        correct_subjects.append((s,subjects['cnt'][i]))
    else:
        invalid_subjects.append((s,subjects['cnt'][i]))
print(f"Found {len(invalid_subjects)} Registry res_subject entries \
that are not in the UAT and {len(correct_subjects)} that are.")
print(f"  The top 10 bad subject values by number of instances are")
x=[print(f"{c[0]}: {c[1]}") for c in invalid_subjects[0:10] ]

Found 1013 Registry res_subject entries that are not in the UAT and 240 that are.
  The top 10 bad subject values by number of instances are
visible-astronomy: 7270
infrared-photometry: 4333
radial-velocity: 2856
variable-stars: 2033
Wide-band photometry: 1937
multiple-stars: 1869
x-ray-sources: 1723
open-star-clusters: 1677
chemical-abundances: 1611
infrared-sources: 1433


In [22]:
import re
result = [u for u in uat_name_list if re.search("^star.*",u)]
print(f"Found {len(result)} matches to 'star' such as")
print(result[0:10])

Found 17 matches to 'star' such as
['star-planet interactions', 'starburst galaxies', 'starburst galaxies', 'starburst galaxies', 'star atlases', 'star counts', 'star counts', 'star lore', 'starspots', 'starspots']


In [23]:
query = f"""
        select top 10 res_subject, count(*) as cnt from ( rr.res_role natural join rr.res_subject )
        where role_name = 'xxxx'
        group by res_subject order by cnt desc
        """
validate_publishers( query, publishers, invalid_subjects, "res_subject")


looking at publisher NASA/GSFC HEASARC
    publisher NASA/GSFC HEASARC publishes 10 distinct values of res_subject
    value 'Survey Source' used 654 times
    value 'Observation' used 84 times
    value 'Star' used 70 times
    value 'Galaxy' used 25 times
    value 'GRB' used 31 times
    value 'AGN' used 22 times
    value 'Cluster of Galaxies' used 15 times
    value 'Optical Counterpart' used 11 times
    value 'XRB' used 11 times

looking at publisher NASA/IPAC Infrared Science Archive
    publisher NASA/IPAC Infrared Science Archive publishes 10 distinct values of res_subject
    value '' used 137 times
    value 'extragalactic survey' used 101 times
    value 'all sky survey' used 58 times
    value 'high redshift galaxies' used 17 times

looking at publisher The GAVO DC team
    publisher The GAVO DC team publishes 10 distinct values of res_subject
    value 'proper-motions' used 31 times
    value 'milky-way-galaxy' used 13 times
    value 'virtual-observatories' used 75 tim

## Check 5:  concepts

In [24]:
reg_uat_concept_list = gavo_regtap.search("select distinct uat_concept from rr.subject_uat").to_table()["uat_concept"].data
print(f"There are {len(reg_uat_concept_list)} distinct uat_concept values in the registry's subject_uat table")

There are 471 distinct uat_concept values in the registry's subject_uat table


In [25]:
bad=[]
for c in reg_uat_concept_list:
    # lower case and replace - with space
    if c.lower().replace("-"," ") not in uat_name_list:
        bad.append(c)
print(f"There are {len(bad)} concepts not found in the UAT such as:")
print(bad[0:10])

There are 47 concepts not found in the UAT such as:
['active-galactic-nuclei ', 'astrl', 'astronomical-simulations ', 'early-type-galaxies', 'early-type-stars', 'earth-planet', 'earth-planet-', 'exoplanet-atmospheric-composition', 'gamma-ray-astronomy', 'gamma-ray-bursts']
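Some of the values above, such as `'active-galactic-nuclei '`, carry trailing whitespace, so part of this list may be formatting noise rather than unknown concepts. The sketch below separates the two cases. The names `normalize` and `split_fixable` are hypothetical, and `sample_uat_names` is a stand-in for `uat_name_list`; whether these labels really occur in the UAT is not checked here.

```python
def normalize(concept):
    """Strip whitespace, lower-case, and turn hyphens into spaces."""
    return concept.strip().lower().replace("-", " ")

def split_fixable(bad_concepts, uat_names):
    """Split concepts into those that match after stripping and the rest."""
    names = set(uat_names)
    fixable = [c for c in bad_concepts if normalize(c) in names]
    still_bad = [c for c in bad_concepts if normalize(c) not in names]
    return fixable, still_bad

# Hypothetical stand-in for uat_name_list
sample_uat_names = ["active galactic nuclei", "astronomical simulations"]
# A few values from the output above
sample_bad = ['active-galactic-nuclei ', 'astrl',
              'astronomical-simulations ', 'earth-planet-']

fixable, still_bad = split_fixable(sample_bad, sample_uat_names)
print("match after stripping whitespace:", fixable)
print("still not found:", still_bad)
```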


In [26]:
query = f"""
        select top 10 uat_concept, count(*) as cnt from ( rr.res_role natural join rr.subject_uat )
        where role_name = 'xxxx'
        group by uat_concept order by cnt desc
        """
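# validate_publishers indexes each entry as (value, count), so wrap the bare
# strings; passing `bad` directly compares only the first character of each
# concept and reports nothing.
validate_publishers( query, publishers, [(b, 0) for b in bad], "uat_concept")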


looking at publisher NASA/GSFC HEASARC
    publisher NASA/GSFC HEASARC publishes 10 distinct values of uat_concept

looking at publisher NASA/IPAC Infrared Science Archive
    publisher NASA/IPAC Infrared Science Archive publishes 10 distinct values of uat_concept

looking at publisher The GAVO DC team
    publisher The GAVO DC team publishes 10 distinct values of uat_concept

looking at publisher Planetary Data System
    publisher Planetary Data System publishes 6 distinct values of uat_concept

looking at publisher WFAU, Institute for Astronomy, University of Edinburgh
    publisher WFAU, Institute for Astronomy, University of Edinburgh publishes 7 distinct values of uat_concept

looking at publisher Space Telescope Science Institute Archive
    publisher Space Telescope Science Institute Archive publishes 10 distinct values of uat_concept

looking at publisher SVO CAB
    publisher SVO CAB publishes 8 distinct values of uat_concept

looking at publisher NSF NOIRLab Astro Data Lab 

## Check 6: Spatial coverage

Spatial coverage enables registry-wide spatial searches.  HEASARC, for example, declares full-sky coverage for all of its services, even when the data are not a full-sky survey but a sample of sources distributed across the sky.
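The `coverage` column holds a MOC in its ASCII form, where `0/0-11` lists all twelve order-0 HEALPix cells and therefore means the whole sky. As a rough illustration, the sky fraction of such a string can be computed directly. The function `moc_sky_fraction` is a hypothetical helper, and it assumes a normalized MOC in which no cell is listed twice or overlaps a coarser cell.

```python
from fractions import Fraction

def moc_sky_fraction(moc):
    """Fraction of the sky covered by an ASCII MOC such as '0/0-11'."""
    total = Fraction(0)
    order = None
    for token in moc.replace(",", " ").split():
        if "/" in token:
            # "order/cells" starts a new HEALPix order
            order_text, token = token.split("/", 1)
            order = int(order_text)
        if not token:
            continue
        if order is None:
            raise ValueError("cell given before any order")
        first, _, last = token.partition("-")
        n_cells = int(last or first) - int(first) + 1
        # There are 12 * 4**order cells at a given order
        total += Fraction(n_cells, 12 * 4 ** order)
    return total

print(moc_sky_fraction("0/0-11"))       # all order-0 cells
print(moc_sky_fraction("1/0-3 2/16"))   # a small patch
```

A fraction of 1 for a catalog of scattered pointings is the pattern this check looks for.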

In [27]:
query = f"""
        select distinct role_name, count(*) as cnt 
        from ( rr.res_role natural join rr.stc_spatial )
        where base_role = 'publisher' and ( coverage = '' or coverage = '0/0-11' )
        group by role_name 
        order by cnt desc
        """
gavo_regtap.search(query).to_table()

role_name,cnt
object,int32
NASA/GSFC HEASARC,1010
CDS,84
The GAVO DC team,39
\nChandra X-ray Observatory\n,8
ASTRON,7
BSDC,7
NASA/IPAC Infrared Science Archive,7
Canadian Astronomy Data Centre,4
ChiVO,2
Paris Astronomical Data Centre - IMCCE,2


## Check 7: Relationships

I believe this is still under discussion, so these entries are not necessarily wrong, depending on who you ask.  ;)

In [28]:
gavo_regtap.search("""
    select distinct role_name, count(*) as cnt
    from ( rr.relationship natural join rr.res_role )
    where relationship_type = 'related-to' and base_role = 'publisher'
    group by role_name
    order by cnt desc
""").to_table()

role_name,cnt
object,int32
CDS,162080
"WFAU, Institute for Astronomy, University of Edinburgh",51
\n International Virtual Observatory Alliance\n,12
CSIRO,7
IDOC D2S,7
IDOC GINCO,6
Paris Astronomical Data Centre - GEPI,6
Mullard Space Science Laboratory,4
Paris Astronomical Data Centre - LESIA,4
...,...


## To be expanded.  Now what to do with this?  

* Report cross-checks between registries to their admins.  
* Compile a report of issues as above and advertise at IVOA Interop's Registry (or Ops?) session.  
* Compile a report of issues found for each publisher and email them yearly to request updates.  
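For the per-publisher report, the findings from the checks above could be collected into a short plain-text summary. This is only a sketch of one possible layout: `format_publisher_report` and the shape of `findings` are assumptions, and the two example rows reuse the HEASARC numbers printed earlier in this notebook.

```python
def format_publisher_report(publisher, findings):
    """Build a text report from (check, description, count) tuples."""
    lines = [f"Registry metadata report for {publisher}", ""]
    # Largest problems first
    for check, description, count in sorted(findings, key=lambda f: -f[2]):
        lines.append(f"- [{check}] {description}: {count}")
    return "\n".join(lines)

# Numbers taken from the outputs of Check 2 and Check 6 above
findings = [
    ("Spatial coverage", "resources with empty or full-sky coverage", 1010),
    ("UCD", "table columns with an empty UCD", 8596),
]
report = format_publisher_report("NASA/GSFC HEASARC", findings)
print(report)
```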


## Scratch 