# Human-to-machine-readable LSST Data Products Schema Converter
<small>v0.3</small>

Extracts schemas from human-readable sources stored in Google Docs
(or similar) and converts them to the LSST-specific [Felis/YAML format](https://felis.lsst.io/).

Rerun this notebook when you need to generate Felis-format updates for the baseline schema.

## Data Model of the Spreadsheets

- Schemas are stored in a spreadsheet structure convertible
  to .csv format.

- The first column (the `marker`) is used to mark which rows
  have semantic meaning within this spec. The `marker` MUST
  be equal to 'TABLE', 'HEADERS', 'COLUMN', or be empty. Any
  rows where the `marker` is empty will be ignored.
  
- A row with `marker=TABLE` signifies the start of a table
  definition. The second column in that row specifies the
  table name. The third column should contain a description
  of the table. Other columns are ignored; this allows 
  additional human-readable data to be added to this row.
  
  Example:
  ```
     "TABLE","SSObject","LSST-computed per-object quantities. 1:1 relationship with MPCORB. Recomputed daily, upon MPCORB ingestion.",""
  ```

- The first row with a non-empty `marker` column following the
  `marker=TABLE` row MUST have `marker=HEADERS` (the "headers"
  row). The headers row names the DDL-related data to follow.
  Spreadsheet columns with header names not listed in
  `valid_headers` (defined in the reader code below) are ignored;
  this allows additional human-readable columns to be maintained
  with the data (for example, the pipeline that originates a
  certain column, or various comments).
  
  Example:
  ```
      "HEADERS","Column name","Type","Not NULL","unit","UCD","Description","Origin","","","",""
  ```

- The `marker=HEADERS` row MUST be followed by one or more
  `marker=COLUMN` rows. Each of these rows defines the properties
  of a column in the current table, with the properties corresponding
  to those defined by the headers row.
  
  Example:
  ```
      "COLUMN","MOIDTrueAnomaly","FLOAT","","deg","","True anomaly of the MOID point","SSOCP","","","",""
  ```

- Multiple tables can be given in a single spreadsheet by repeating
  the table structure above.

- The format can easily be extended to capture relationships, 
  indices and constraints by adding further `marker` keywords.
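
The rules above can be checked mechanically. The following is a minimal sketch, not part of the converter: it parses a small in-memory CSV with the standard `csv` module (rather than pandas, to stay dependency-free) and verifies the order of the markers. The function name `check_markers` and the sample rows are made up for illustration.

```python
import csv
import io

VALID_MARKERS = {"TABLE", "HEADERS", "COLUMN", ""}

# Hypothetical sample sheet following the data model above.
SAMPLE = '''"TABLE","SSObject","LSST-computed per-object quantities.",""
"","free-form note, ignored","",""
"HEADERS","Column name","Type","Description"
"COLUMN","ssObjectId","BIGINT","Unique identifier of the object."
"COLUMN","MOIDTrueAnomaly","FLOAT","True anomaly of the MOID point"
'''

def check_markers(rows):
    """Return the table names found; raise ValueError on a malformed sheet."""
    names = []
    state = "START"  # START -> TABLE -> HEADERS -> COLUMN -> (TABLE ...)
    for lineno, row in enumerate(rows, start=1):
        marker = row[0] if row else ""
        if marker not in VALID_MARKERS:
            raise ValueError(f"row {lineno}: unknown marker {marker!r}")
        if marker == "":
            continue  # rows without a marker carry no meaning
        if marker == "TABLE":
            if state in ("TABLE", "HEADERS"):
                raise ValueError(f"row {lineno}: previous table has no columns")
            names.append(row[1])
        elif marker == "HEADERS":
            if state != "TABLE":
                raise ValueError(f"row {lineno}: HEADERS must follow TABLE")
        elif marker == "COLUMN":
            if state not in ("HEADERS", "COLUMN"):
                raise ValueError(f"row {lineno}: COLUMN must follow HEADERS")
        state = marker
    if state in ("TABLE", "HEADERS"):
        raise ValueError("last table has no columns")
    return names

table_names = check_markers(csv.reader(io.StringIO(SAMPLE)))
print(table_names)  # ['SSObject']
```

Running a check like this before conversion would turn a misplaced or misspelled marker into an error message with a row number instead of a failed assertion deeper in the reader.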

## Daily Usage

In daily usage, the source-of-truth table schemas are maintained in
a spreadsheet tool (e.g., Google Docs). The first column, `marker`,
is kept hidden so as not to confuse the user or clutter the workspace.

- A human maintainer can easily maintain insight into the schema.
- It can be edited with equal ease (subject to usual change control requirements).
- It can be shared with other stakeholders w/o a need for special tools or
  understanding of formats.
- Derived formats (e.g., SQL DDL, [the proposed YAML serialization](https://gist.github.com/brianv0/b9f3d8c0e4bc61899293816f2eb16ff1), etc.) can be extracted by running tools
  that read the data model defined above.

A working example can be seen at:

   https://docs.google.com/spreadsheets/d/1E0rTlvuJC0CvpLNsuWLK0x70uhpZww4v6GB5QkiQr-Q

## Core Reader Code

In [1]:
import pandas as pd

valid_headers = ['Column name', 'Type', 'Not NULL', 'unit', 'UCD', 'Description', 'Unique']

def extract_tables(df):
    """
    Given a DataFrame holding data that follows the data model described above,
    extract the individual table schemas and yield them to the caller.
    """
    df = df.copy()
    df.columns = ["marker"] + list(range(1, len(df.columns)))
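    # Drop rows without a marker and renumber, so that "the next row" below
    # means the next row that carries a marker (as the data model requires).
    df = df[df["marker"].notnull()].reset_index(drop=True)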

    table_list = df[df.marker == "TABLE"]
    for iloc, (start, table) in enumerate(table_list.iterrows()):
        # Extract the [start, end] index range for the table
        table_name = table.iloc[1]
        description = table.iloc[2]
        try:
            end = table_list.index[iloc+1]-1
        except IndexError:
            end = df.index[-1]

        # Extract table headers
        headers = df.loc[start+1]
        assert headers['marker'] == "HEADERS", f"expected a HEADERS row after TABLE {table_name!r}"
        headers = headers[headers.notnull() & headers.isin(valid_headers)]

        # Extract table data
        table=df[headers.index].loc[start+2:end]
        table.columns = headers.values
        table = table.fillna('').reset_index(drop=True)
        
        # attach description metadata
        table.description = description

        yield (table_name, table)

def read_google_csv(url):
    # Input example: https://docs.google.com/spreadsheets/d/1E0rTlvuJC0CvpLNsuWLK0x70uhpZww4v6GB5QkiQr-Q
    # For URL formatting spec, see https://stackoverflow.com/questions/33713084/download-link-for-google-spreadsheets-csv-export-with-multiple-sheets
    url += "/gviz/tq?tqx=out:csv"
    return pd.read_csv(url, header=None)
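
`read_google_csv` appends the export suffix without naming a sheet. The sketch below shows one way to build the export URL for a specific sheet of a multi-sheet document. The helper name `gviz_csv_url` and the sheet name `"Solar System"` are hypothetical, and the `sheet` query parameter is taken from the Stack Overflow discussion linked in the code above, so treat it as an assumption to verify against your own spreadsheet.

```python
from urllib.parse import quote

def gviz_csv_url(doc_url, sheet=None):
    """Build a CSV export URL from a Google Sheets document URL."""
    url = doc_url.rstrip("/") + "/gviz/tq?tqx=out:csv"
    if sheet is not None:
        # Assumption: the gviz endpoint selects a sheet by name via `sheet=`.
        url += "&sheet=" + quote(sheet)
    return url

base = "https://docs.google.com/spreadsheets/d/1E0rTlvuJC0CvpLNsuWLK0x70uhpZww4v6GB5QkiQr-Q"
default_url = gviz_csv_url(base)
named_url = gviz_csv_url(base, sheet="Solar System")  # made-up sheet name
print(default_url)
print(named_url)
```

The resulting string can be passed to `pd.read_csv(url, header=None)` in the same way as in `read_google_csv`.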

## Load the schema from Google Spreadsheets (Solar System Data Product tables)

In [2]:
import os.path
try:
    df = pd.read_csv('_cache.csv', header=None)
except FileNotFoundError:
    df = read_google_csv("https://docs.google.com/spreadsheets/d/1E0rTlvuJC0CvpLNsuWLK0x70uhpZww4v6GB5QkiQr-Q")
    df.to_csv('_cache.csv', index=False, header=False) # cache it

tables = dict( extract_tables(df) )
tables.keys()

dict_keys(['MPCORB', 'MPCORBDESIGMAP', 'SSObject', 'SSSource'])

In [3]:
tables['MPCORB'].description

'The orbit catalog produced by the Minor Planet Center. Ingested daily. O(10M) rows by survey end. The columns are described at https://minorplanetcenter.net//iau/info/MPOrbitFormat.html'

## 1. Pretty-print a table

In [4]:
def show_table(tables, name):
    from IPython.display import display, Markdown

    display(Markdown('### ' + name))
    display(tables[name])

show_table(tables, 'SSSource')

### SSSource

| | Column name | Type | Not NULL | unit | UCD | Description | Unique |
|---|---|---|---|---|---|---|---|
| 0 | ssObjectId | BIGINT | y | | meta.id;src | Unique identifier of the object. | |
| 1 | diaSourceId | BIGINT | y | | meta.id;src | Unique identifier of the observation | |
| 2 | mpcUniqueId | BIGINT | | | | MPC unique identifier of the observation | |
| 3 | nearbyObj | BIGINT[6] | | | | Closest Objects (3 stars and 3 galaxies) in Le... | |
| 4 | nearbyObjDist | FLOAT[6] | | | | Distances to nearbyObj | |
| 5 | nearbyObjLnP | FLOAT[6] | | | | Natural log of the probability that the observ... | |
| 6 | eclipticLambda | DOUBLE | y | deg | | Ecliptic longitude | |
| 7 | eclipticBeta | DOUBLE | y | deg | | Ecliptic latitude | |
| 8 | galacticL | DOUBLE | y | deg | | Galactic longitude | |
| 9 | galacticB | DOUBLE | y | deg | | Galactic latitute | |


## 2. Dump the schema in Felis/YAML format

In [5]:
import os.path
import urllib.request

import yaml

baseline = "https://raw.githubusercontent.com/lsst/cat/master/yml/baselineSchema.yaml"

# Download the baseline schema once and reuse the cached copy afterwards.
if not os.path.exists('_baseline.yml'):
    urllib.request.urlretrieve(baseline, '_baseline.yml')
with open('_baseline.yml') as fp:
    byml = yaml.safe_load(fp)

In [6]:
import yaml, re

_re_type_to_datatype = re.compile(r'^([A-Za-z]+)(\((\d+)\))*(\[(\d+)\])*$')
_dct_type_to_datatype = {
#   <typename>  <datatype>    <is_array?>
    'INTEGER':  ('int',        False),
    'BIGINT':   ('long',       False),
    'FLOAT':    ('float',      False),
    'DOUBLE':   ('double',     False),
    'VARCHAR':  ('char',       True),
    'DATETIME': ('timestamp',  False),
    'BLOB':     ('binary',     True),
}

def type_to_datatype(type_):
    # Helper to convert from textual type representation,
    # to the corresponding machine datatype. Includes
    # an extension where a type can be suffixed by `[N]` to
    # indicate this column should be repeated N times (an
    # array).
    m = _re_type_to_datatype.match(type_)
    if m is None:
        raise ValueError("Could not parse type spec '%s'" % type_)

    ty, length, repcount = m.group(1), m.group(3), m.group(5)
    ty = ty.upper()
    if length is not None:
        length = int(length)
    if repcount is not None:
        repcount = int(repcount)

    datatype, expectLength = _dct_type_to_datatype[ty]
    if length is not None:
        ty = f"{ty}({length})"
    if expectLength and length is None:
        length = 1

    return datatype, ty, length, repcount

# Keep the ordering of dumped key/value pairs in YAML.
#
# Inspired by https://stackoverflow.com/a/21912744, but assuming
# we're running Python 3.6+ where dict preserves key ordering.
def ordered_dump(data, stream=None, Dumper=yaml.Dumper, **kwds):
    class OrderedDumper(Dumper):
        pass
    def _dict_representer(dumper, data):
        return dumper.represent_mapping(
            yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
            data.items())
    OrderedDumper.add_representer(dict, _dict_representer)
    return yaml.dump(data, stream, OrderedDumper, **kwds)

def dump_felis_yml(tables, *table_names):
    yml = []
    for table_name in table_names:
        table = tables[table_name]

        # Table information
        table_yml = {
            'name':          table_name,
            '@id':           f"#{table_name}",
            'description':   table.description,
            'primaryKey':    f"#{table_name}.{table.loc[0]['Column name']}",
            'mysql:engine':  'MyISAM',
            'mysql:charset': 'utf8mb4',
            'columns':       [],
        }

        # Column information
        for _, row in table.iterrows():
            nameBase, type_ = row['Column name'], row['Type']
            datatype, type_, length, repcount = type_to_datatype(type_)

            # if repcount is not None, replicate columns repcount times,
            # with suffixes [1..repcount]
            suffixes = map(str, range(1, repcount+1)) if repcount is not None else ['']
            for suffix in suffixes:
                name = nameBase + suffix

                column = {
                    'name':           name,
                    '@id':            f"#{table_name}.{name}",
                    'datatype':       datatype,
                    'length':         length,
                    'description':    row['Description'],
                    'mysql:datatype': type_,
                }

                if column['length'] is None:
                    del column['length']

                if row['UCD']:
                    column['ivoa:ucd'] = row['UCD']

                if row['unit']:
                    column['fits:tunit'] = row['unit']
                    
                table_yml['columns'].append(column)

        yml.append(table_yml)

    return ordered_dump(yml, default_flow_style=False)

yml = dump_felis_yml(tables, 'MPCORB', 'MPCORBDESIGMAP', 'SSObject', 'SSSource')
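
To make the type-string convention concrete, here is a small standalone sketch of how the regular expression splits a spreadsheet `Type` value into a base type, an optional length, and an optional repeat count, and how a repeat count turns one spreadsheet row into several column names. It copies the pattern from the cell above; `split_type` and `expand_names` are illustrative helpers, not functions used by the converter.

```python
import re

# Same pattern as `_re_type_to_datatype` in the converter.
TYPE_RE = re.compile(r'^([A-Za-z]+)(\((\d+)\))*(\[(\d+)\])*$')

def split_type(type_):
    """Split e.g. 'VARCHAR(32)' or 'FLOAT[6]' into (base, length, repcount)."""
    m = TYPE_RE.match(type_)
    if m is None:
        raise ValueError(f"Could not parse type spec {type_!r}")
    base, length, repcount = m.group(1), m.group(3), m.group(5)
    return (base.upper(),
            int(length) if length is not None else None,
            int(repcount) if repcount is not None else None)

def expand_names(name, repcount):
    """Repeat a column name with suffixes 1..repcount, or keep it as is."""
    if repcount is None:
        return [name]
    return [f"{name}{i}" for i in range(1, repcount + 1)]

print(split_type("DOUBLE"))       # ('DOUBLE', None, None)
print(split_type("VARCHAR(32)"))  # ('VARCHAR', 32, None)
print(split_type("FLOAT[6]"))     # ('FLOAT', None, 6)
print(expand_names("nearbyObjDist", 3))
# ['nearbyObjDist1', 'nearbyObjDist2', 'nearbyObjDist3']
```

Under this convention, a `FLOAT[6]` row such as `nearbyObjDist` in `SSSource` becomes six scalar columns in the Felis output, `nearbyObjDist1` through `nearbyObjDist6`.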

In [7]:
with open('solar-system-schema.felis.yml', 'w') as fp:
    fp.write(yml)

## 3. Dump the schema in LaTeX table format

In [62]:
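def escape_superscript(s):
    # Typeset the single character after the first '^' as a LaTeX superscript,
    # e.g. "km^2" -> "km$^{\rm 2}$". Only the first caret is handled.
    i = s.find('^')
    if i == -1: return s
    return r"%s$^{\rm %s}$%s" % (s[:i], s[i+1:i+2], s[i+2:])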

def dump_latex(out, name, table):
    out.write("% generated with https://github.com/mjuric/lsst-schema-converter; DO NOT EDIT BY HAND !!\n")
    out.write(r"\begin{schema}{{\tt %s} Table}{{\tt %s} Table}{tbl:%s}" % (name, name, name) + "\n")

    for _, row in table.iterrows():
        name, type_, desc = row['Column name'], row['Type'], row['Description']
        unit = row['unit'] if row['unit'] else "~"

        name = name.replace('_', r'\_')
        unit = escape_superscript(unit)
        desc = escape_superscript(desc)
        
        out.write(fr"{name} & {type_} & {unit} & {desc} \\" + "\n")

    out.write(r"\end{schema}" + "\n")
    out.write("% generated with https://github.com/mjuric/lsst-schema-converter; DO NOT EDIT BY HAND !!\n")

In [63]:
# Write it out to a series of LaTeX files
for name in ['MPCORB', 'SSObject', 'SSSource']:
    with open(f"{name}.tex", "w") as out:
        dump_latex(out, name, tables[name])
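
`dump_latex` escapes underscores in column names only. A description that contains other LaTeX special characters, such as `&` or `%`, would break the generated row. The following is a minimal sketch of a broader escaping step, assuming that backslash-escaping is sufficient for the characters involved; `escape_latex_text` is a hypothetical helper and is not part of the converter.

```python
import re

# Characters that can be escaped with a single backslash in LaTeX text.
# Backslash, braces, '~' and '^' are deliberately left alone.
_LATEX_SPECIALS = re.compile(r'([&%#_$])')

def escape_latex_text(s):
    """Backslash-escape &, %, #, _ and $ (hypothetical helper)."""
    return _LATEX_SPECIALS.sub(r'\\\1', s)

escaped = escape_latex_text("Flux & error for 10% of objects (see flag_bad)")
print(escaped)
# Flux \& error for 10\% of objects (see flag\_bad)
```

Such a helper would have to run before `escape_superscript`, since that function emits `$` and `\` itself and its output must not be escaped again.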