# Emulating a DRP `Object` Catalog with a Simple Analytic Model

_Ji Won Park, Phil Marshall_

Created: July 19, 2019 at the LSST DESC hack day

Last run: 2019-07-19

The goals for this demo notebook are to:

* Show what the `Analytic` model class does, and 
* Check that its outputs are sensible. 

We'll do this by making an emulated `Object` catalog using a very simple analytic emulator, and making the same plots that we use to evaluate BNN emulator performance. The idea is that the analytic model can serve as the baseline for any ML-based emulator.

### Requirements

For this notebook to run to completion, you will need a copy of the test object dataset and the dependencies listed in `requirements.txt`.

In [2]:
! pip install -r requirements.txt

Collecting astropy==3.0.3 (from -r requirements.txt (line 3))
  Downloading https://files.pythonhosted.org/packages/02/43/fb11c837f4ed422d867997db8207c5dd4a3ccccd83b9601d080086832335/astropy-3.0.3-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (7.5MB)
    100% |████████████████████████████████| 7.5MB 2.3MB/s eta 0:00:01
Collecting numpy>=1.16 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/0f/c9/3526a357b6c35e5529158fbcfac1bb3adc8827e8809a6d254019d326d1cc/numpy-1.16.4-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (13.9MB)
    100% |████████████████████████████████| 13.9MB 1.8MB/s eta 0:00:01
Collecting torch>=0.4 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/98/14/8fb914c6f13e9d889f4e14f6d811901fc48fde6be7f052756d08e861f960/torch-1.1.0.post2-c

In [4]:
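# Uncomment to fetch the dataset. Note that DerpData (below) reads it from raw_data/obj_master_tract4850.csv.
# ! curl -o obj_master_tract4850.csv "https://drive.google.com/file/d/1bEnSJ6YnkWyhXNaQdyjRWE8x3SS6XtVV/view?usp=sharing"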

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  120k    0  120k    0     0   232k      0 --:--:-- --:--:-- --:--:--  231k


### Setting Up

First we do some standard imports, then fix the random seeds and choose the device and default tensor type that `torch` will use.

In [5]:
import torch
import numpy as np
import json
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
%matplotlib inline

In [6]:
np.random.seed(2809)
torch.manual_seed(2809)
torch.cuda.manual_seed(2809)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':  # compare the device type string, not the torch.device object
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    torch.set_default_tensor_type('torch.FloatTensor')
print("device: ", device)

device:  cpu


## Emulating an `Object` Catalog

The `Analytic` model behaves like the `BNN` models, so we first follow the same steps to get the data into shape.

In [7]:
# Load the run configuration, closing the file when done
with open("args.txt") as f:
    args = json.load(f)

############
# Data I/O #
############

from derp_data import DerpData
import itertools

# X base columns
truth_cols = list('ugrizy') + ['ra_truth', 'dec_truth', 'redshift', 'star',]
truth_cols += ['mag_true_%s_lsst' %bp for bp in 'ugrizy']
truth_cols += ['size_bulge_true', 'size_minor_bulge_true', 'ellipticity_1_bulge_true', 'ellipticity_2_bulge_true', 'bulge_to_total_ratio_i']
truth_cols += ['size_disk_true', 'size_minor_disk_true', 'ellipticity_1_disk_true', 'ellipticity_2_disk_true',]
opsim_cols = ['m5_flux', 'PSF_sigma2', 'filtSkyBrightness_flux', 'airmass', 'n_obs']
# Y base columns
drp_cols = ['extendedness', 'ra_obs', 'dec_obs', 'Ixx', 'Ixy', 'Iyy', 'IxxPSF', 'IxyPSF', 'IyyPSF', ]
drp_cols_prefix = ['cModelFlux_', 'psFlux_']
drp_cols_suffix = []
#drp_cols_suffix = ['_ext_photometryKron_KronFlux_instFlux', '_base_CircularApertureFlux_70_0_instFlux', 
drp_cols += [t[0] + t[1] for t in list(itertools.product(drp_cols_prefix, list('ugrizy')))]
drp_cols += [t[1] + t[0] for t in list(itertools.product(drp_cols_suffix, list('ugrizy')))]


# Define dataset
data = DerpData(data_path='raw_data/obj_master_tract4850.csv',
    data_path2=None,
    X_base_cols=truth_cols + opsim_cols, 
    Y_base_cols=drp_cols, 
    args=args, ignore_null_rows=True, save_to_disk=True)
if not args['data_already_processed']:
    data.export_metadata_for_eval(device_type=device.type)
# Read metadata if reading processed data from disk:
with open("data_meta.txt") as f:
    data_meta = json.load(f)

X_cols = data_meta['X_cols']
Y_cols = data_meta['Y_cols']
train_indices = data_meta['train_indices']
val_indices = data_meta['val_indices']
X_dim = data_meta['X_dim']
Y_dim = data_meta['Y_dim']

from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data import DataLoader

# Split train vs. val
train_sampler = SubsetRandomSampler(train_indices)
val_sampler = SubsetRandomSampler(val_indices)

# Define dataloader
kwargs = {'num_workers': 1, 'pin_memory': True} if device.type == 'cuda' else {}
train_loader = DataLoader(data, batch_size=args['batch_size'], sampler=train_sampler, **kwargs)
val_loader = DataLoader(data, batch_size=args['batch_size'], sampler=val_sampler, **kwargs)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Deleting null rows: 70355 --> 67840
Standardized X except:  ['star', 'mag_true_u_lsst', 'mag_true_g_lsst', 'mag_true_r_lsst', 'mag_true_i_lsst', 'mag_true_z_lsst', 'mag_true_y_lsst', 'u_flux', 'g_flux', 'r_flux', 'i_flux', 'z_flux', 'y_flux']
Standardized Y except:  []
X has null columns:  []
Y has null columns:  []
Overall star frac: 0.21
Training star frac: 0.21
Validation star frac: 0.21
Saving processed data to disk...


FileNotFoundError: [Errno 2] No such file or directory: 'data/X.npy'

Let's take a look at the output columns of the catalog we are aiming to emulate:

In [8]:
data_meta['Y_cols']

NameError: name 'data_meta' is not defined

Now we instantiate the simple `Analytic` model, and have it predict the output catalog. Note that there is no training: the analytic model has a hard-coded astronomy model for the DRP object properties and their errors. So, we just compute the predicted mean properties and the log variances on them, and pass them both to a sampling function to make the emulated table.

In [12]:
import models
import solver
"""
Plan:

trainval_data = DerpData()
val_data = 

analytic = models.Analytic()
mean, logvar = analytic(X_val)
sample = solver.sample(mean, logvar)
"""

ModuleNotFoundError: No module named 'tensorflow.compiler'
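
To make the sampling step concrete, here is a minimal sketch of drawing an emulated table from predicted means and log variances. It uses NumPy rather than `torch`, and `sample_gaussian` is a hypothetical stand-in: the actual signature and behavior of `solver.sample` are not shown in this notebook, so treat the details below as assumptions.

```python
import numpy as np

def sample_gaussian(mean, logvar, rng):
    """Draw one sample per element from N(mean, exp(logvar)).

    Hypothetical helper for illustration; not the notebook's solver.sample.
    """
    mean = np.asarray(mean, dtype=float)
    logvar = np.asarray(logvar, dtype=float)
    sigma = np.exp(0.5 * logvar)  # log variance -> standard deviation
    return mean + sigma * rng.standard_normal(mean.shape)

# Toy inputs: 10000 objects, 2 output columns with sigma = 0.1 and 2.0
rng = np.random.default_rng(2809)
mean = np.zeros((10000, 2))
logvar = np.log(np.array([0.01, 4.0])) * np.ones_like(mean)
sample = sample_gaussian(mean, logvar, rng)
print("per-column scatter:", sample.std(axis=0))
```

Working with the log variance rather than the variance keeps the predicted scatter positive by construction, which is presumably why both model types return it.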

The `Analytic` model is very simple: it just adds Gaussian noise to the true parameters according to simple formulae for photometric, astrometric, and other errors. Its docstring describes the details:

In [11]:
help(models.Analytic)

Help on class Analytic in module models:

class Analytic(torch.nn.modules.module.Module)
 |  Painfully simple analytic astronomy model.  
 |  
 |  For positions:
 |      output ra, dec = input ra, dec + eps, eps ~ N(0, astrom^2)
 |  For both cModel flux and psFlux:
 |       output flux = input flux + eps, eps ~ N(0, sigma^2) where sigma = photometric noise from Javi's map
 |  For extendedness:
 |       output extendedness = not star
 |       Better would be to check for average psf > size
 |  For shapes:
 |      Ixx, Ixy, Iyy (SLRealized, copy and comment FIXME). 
 |      Propagate flux and position errors into sigma
 |  For PSF moments:
 |      Ixy = 0, Ixx, Iyy from PSF sigma. Zero uncertainty... 
 |  
 |  The forward method produces a model Gaussian sampling distribution characterized by 
 |  vectors of means and sigmas, that can be passed to the appropriate sampling function.
 |  
 |  Method resolution order:
 |      Analytic
 |      torch.nn.modules.module.Module
 |      builtins.
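
The rules in the docstring above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the `models.Analytic` implementation: the function name `analytic_forward`, its arguments, the column order, and the tiny variance used to represent "zero uncertainty" are all hypothetical, the shape moments `Ixx`, `Ixy`, `Iyy` are omitted, and the PSF is assumed to be a circular Gaussian so that `IxxPSF = IyyPSF = sigma^2`.

```python
import numpy as np

def analytic_forward(ra, dec, flux, is_star, psf_sigma2, astrom_sigma, flux_sigma):
    """Return (mean, logvar), each of shape (n_objects, 7).

    Columns: ra, dec, flux, extendedness, IxxPSF, IxyPSF, IyyPSF.
    Hypothetical sketch; not the notebook's models.Analytic.
    """
    n = len(ra)
    is_star = np.asarray(is_star, dtype=float)
    mean = np.column_stack([
        ra, dec, flux,      # means are the true positions and flux
        1.0 - is_star,      # extendedness = not star
        psf_sigma2,         # IxxPSF for a circular Gaussian PSF (assumption)
        np.zeros(n),        # IxyPSF = 0
        psf_sigma2,         # IyyPSF
    ])
    tiny = 1e-12  # assumed stand-in for "zero uncertainty", since log(0) is undefined
    var = np.column_stack([
        np.full(n, astrom_sigma**2),   # ra noise: eps ~ N(0, astrom^2)
        np.full(n, astrom_sigma**2),   # dec noise
        np.asarray(flux_sigma)**2,     # flux noise: eps ~ N(0, sigma^2)
        np.full(n, tiny), np.full(n, tiny), np.full(n, tiny), np.full(n, tiny),
    ])
    return mean, np.log(var)

# Toy inputs: one star and two galaxies (all values made up)
ra = np.array([10.0, 10.1, 10.2])
dec = np.array([-30.0, -30.1, -30.2])
flux = np.array([500.0, 120.0, 80.0])
is_star = np.array([1, 0, 0])
psf_sigma2 = np.array([0.09, 0.10, 0.11])
flux_sigma = np.array([5.0, 4.0, 4.0])
astrom_sigma = 1e-5
mean, logvar = analytic_forward(ra, dec, flux, is_star, psf_sigma2,
                                astrom_sigma, flux_sigma)
```

The returned pair has the same form as the output described for the forward method: a vector of means and a matching vector of (log) variances that a sampling function can consume.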

## Visualizing the Emulated Catalog

Let's compare the observed quantities from the DRP `Object` table with our simple emulations of them. Both sets of quantities are noisy; what we would like is for them to have similar noise properties. Simple scatter plots of `x_emulated` vs `x_observed` may not be very illuminating, whereas plotting the mean and stdev of `x` in bins of `x`, for both `x_emulated` and `x_observed`, should give more insight.
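
The binned comparison could be computed along the following lines. This is a sketch on synthetic data, not the notebook's plotting code: `binned_mean_std` is a hypothetical helper, the toy arrays stand in for real catalog columns, and binning both residuals by the true value is an assumption about which quantity defines the bins.

```python
import numpy as np

def binned_mean_std(x, y, bin_edges):
    """Mean and standard deviation of y in bins of x (NaN for empty bins)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.digitize(x, bin_edges) - 1  # bin index of each point
    n_bins = len(bin_edges) - 1
    means = np.full(n_bins, np.nan)
    stds = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = y[idx == b]
        if in_bin.size > 0:
            means[b] = in_bin.mean()
            stds[b] = in_bin.std()
    return means, stds

# Synthetic stand-ins for a true quantity and its observed and emulated versions
rng = np.random.default_rng(2809)
x_true = rng.uniform(20.0, 26.0, size=20000)
x_observed = x_true + rng.normal(0.0, 0.1, size=x_true.size)
x_emulated = x_true + rng.normal(0.0, 0.1, size=x_true.size)

bin_edges = np.linspace(20.0, 26.0, 7)  # six equal-width bins
mean_obs, std_obs = binned_mean_std(x_true, x_observed - x_true, bin_edges)
mean_emu, std_emu = binned_mean_std(x_true, x_emulated - x_true, bin_edges)
print("observed stdev per bin:", std_obs)
print("emulated stdev per bin:", std_emu)
```

If the emulator reproduces the noise properties well, the two sets of binned means and stdevs should track each other across bins; plotting them with `plt.errorbar` against the bin centers would give the figure described above.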