# Generator API for surveySimPP Integrators (DRAFT)

Juric, Schwamb, Bernardinelli et al.

## Gentle introduction to generators

What are Python generators? They are objects that generate sequences, and can be used in `for` loops.

If you've ever run code such as:
```python
for i in range(10):
    print(i)
```
you've used something that behaves very much like a generator: `range(10)` returns a lazy iterable that produces its values on demand. Such objects can be repeatedly "asked" for the next item in the sequence, until they're exhausted and the loop exits.
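
To make the "asking" concrete, here is a minimal sketch of roughly what a `for` loop does behind the scenes, using only the built-in `iter()` and `next()` functions. The variable names are ours, chosen for illustration.

```python
# Roughly what a for loop does: ask for items until the sequence is exhausted.
numbers = iter(range(3))  # obtain an iterator over 0, 1, 2

received = []
while True:
    try:
        received.append(next(numbers))  # "ask" for the next item
    except StopIteration:               # raised once nothing is left
        break

print(received)  # [0, 1, 2]
```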

Because generators return one item at a time, they allow streaming through large sequences that may not fit into memory all at once. Reading a file line by line is an example: the file object behaves like a generator, returning the next line each time the `for` loop asks for it:

```python
with open("big_file.txt") as fp:
    for line in fp:
        ...  # do something with line
```

These examples are all built-in generators, but Python allows us to write custom generators as well. The simplest way to do it is by writing a function that uses the `yield` keyword to return results. For example:

```python
def even_numbers(end):
    i = 0
    while i < end:
        if i % 2 == 0:
            yield i
        i += 1  # advance to the next integer so the loop terminates
```

This generator would be called by the user as follows:
```python
for i in even_numbers(100_000_000_000):
    print(i)
```
to print out all even numbers among the first hundred billion integers. Note that at no point is more than *one* number kept in memory, though to a user of the `for` loop it looks like iterating through a list.
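
As a small sketch of this laziness, `itertools.islice` can take just the first few values from such a generator without ever computing the rest. The compact `even_numbers` variant below is our own illustration and is meant to be equivalent in output to the one above.

```python
from itertools import islice

def even_numbers(end):
    """Yield the even integers in [0, end), one at a time."""
    for i in range(0, end, 2):
        yield i

# Ask for only the first five values; the remaining ones are never computed.
first_five = list(islice(even_numbers(100_000_000_000), 5))
print(first_five)  # [0, 2, 4, 6, 8]
```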

Here's another generator that returns prime numbers:

```python
def is_prime(n):
    if n == 2 or n == 3: return True
    if n < 2 or n % 2 == 0: return False
    if n < 9: return True
    if n % 3 == 0: return False
    r = int(n**0.5)
    # since all primes > 3 are of the form 6n ± 1
    # start with f=5 (which is prime)
    # and test f, f+2 for being prime
    # then loop by 6.
    f = 5
    while f <= r:
        if n % f == 0: return False
        if n % (f + 2) == 0: return False
        f += 6
    return True

# here's the generator
def primes(start, end=None):
    if end is None:
        start, end = 0, start

    for k in range(start, end):
        if is_prime(k):
            yield k
```

```python
for k in primes(30):
    print(f"The next prime is {k}")
```

```
The next prime is 2
The next prime is 3
The next prime is 5
The next prime is 7
The next prime is 11
The next prime is 13
The next prime is 17
The next prime is 19
The next prime is 23
The next prime is 29
```

Take a look [here](https://realpython.com/introduction-to-python-generators/) to learn more about generators.

## Why generators for surveySimPP's ephemerides generation module?

The input to `surveySimPP` is a catalog of ephemerides of objects that may be in a set of survey visits. The module to compute this is the *ephemerides generator* (for brevity, we'll call it ephemgen here).

For example, a typical LSST simulation uses a catalog of $\sim 10$ million objects. The LSST is expected to collect $\sim 3$ million visits over the 10 years of the survey, with each visit approximated by a circular FoV of radius $r \sim 1.8^\circ$.

**The `ephemgen` module's task is to take this catalog of $\sim 10$M objects, compute for each visit which of those objects are in the field of view, and pass this on to surveySimPP for further, more precise, processing.**

`surveySimPP` considers the outputs of `ephemgen` on a visit-by-visit basis. One can think of this relationship as `ephemgen` producing the list of ephemerides of objects present in a visit, passing it on to `surveySimPP` for further processing, then moving on to computing the ephemerides in the next visit, and so on.

It's a *pipeline* (a *stream*) of data, and generators are Python's way to build streaming pipelines. You could easily imagine `surveySimPP` doing something like:

```python
visits = load_lsst_visit_radecs("pointings.txt")
ephgen = EphemeridesGenerator(visits, generator_config)
for (visit_id, sources) in ephgen:
    # sources is a DataFrame (or a numpy structured array) of ephemerides of objects
    # that may be in visit visit_id
    sources = rejectOutsideFootprint(sources)
    sources = rejectTooFaint(sources)
    sources = addTrailingParameters(sources)
    ...  # do more processing

    save_generated_sources(sources)
```

(Actually, one may argue that `surveySimPP` itself should also be a generator: rather than saving the sources itself, it would yield them back to an outer loop. We leave that for a future proposal.)

The pseudocode above is readable and should be easy to understand. The details of the ephemerides generator are hidden from the rest of the code, which makes it easy to swap in different `ephemgen` implementations and leaves a lot of freedom in how each one is implemented.
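
A toy sketch of that swappability follows. Both generators and the `run_survey` consumer are hypothetical stand-ins invented for illustration, not part of `surveySimPP`; the point is only that the consumer relies on nothing but iteration.

```python
def toy_ephemgen_a(visit_ids):
    """Hypothetical generator: yields (visit_id, sources) for each visit."""
    for visit_id in visit_ids:
        yield visit_id, [f"obj{visit_id}-{k}" for k in range(2)]

def toy_ephemgen_b(visit_ids):
    """A different hypothetical implementation with the same interface."""
    for visit_id in visit_ids:
        yield visit_id, [f"alt{visit_id}"]

def run_survey(ephemgen):
    """Consumer code: relies only on iteration yielding (visit_id, sources)."""
    return {visit_id: len(sources) for visit_id, sources in ephemgen}

counts_a = run_survey(toy_ephemgen_a([1, 2]))
counts_b = run_survey(toy_ephemgen_b([1, 2]))
print(counts_a)  # {1: 2, 2: 2}
print(counts_b)  # {1: 1, 2: 1}
```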

This is why we argue that a generator-based API is a good next step in the design of surveySimPP APIs.

## Proposed API

Any compliant `ephemgen` implementation shall satisfy the following requirements:

### Instantiation

```python
ephemgen = Ephemgen(visits, orbits, **config)
```

where `visits` is an `ndarray`-like object holding a list of survey visits, guaranteed to have at least the following columns:
* `visitId`: unique integer identifying the visit
* `visitTimeTAI`: MJD of the visit midpoint time, in TAI
* `ra`: J2000 R.A. of the visit center
* `dec`: J2000 Dec of the visit center

This table may also hold the following additional columns:
* `obscode`: MPC observatory code of the observing site. If not present, assumed to be 500 (geocenter).
* `fov`: Field of view radius (circumscribed) of the visit, in radians. If not present, assumed to be 2 degrees (about 0.035 rad).
* `filter`: Bandpass name through which this visit was taken

Any optional column that is not present in the table **must** be accepted as a keyword argument by the `Ephemgen` constructor, in which case the value applies to every visit (the common case of a single survey with a single camera).
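
One way an implementation might honor this is sketched below, assuming `visits` is a numpy structured array. The helper name `fill_optional_columns` is hypothetical, and the sample values are made up for illustration.

```python
import numpy as np

def fill_optional_columns(visits, **defaults):
    """Return a copy of `visits`, adding any missing column from the keywords."""
    names = visits.dtype.names
    missing = [name for name in defaults if name not in names]
    # Keep the existing fields and append one new field per missing column.
    fields = [(n, visits.dtype[n]) for n in names]
    fields += [(n, np.asarray(defaults[n]).dtype) for n in missing]
    out = np.empty(visits.shape, dtype=fields)
    for n in names:
        out[n] = visits[n]
    for n in missing:
        out[n] = defaults[n]  # the single value is broadcast to every visit
    return out

visits = np.array(
    [(1, 60000.00, 10.0, -5.0), (2, 60000.01, 11.0, -5.0)],
    dtype=[("visitId", "i8"), ("visitTimeTAI", "f8"), ("ra", "f8"), ("dec", "f8")],
)
full = fill_optional_columns(visits, obscode="500", fov=np.radians(2.0))
print(full.dtype.names)
# ('visitId', 'visitTimeTAI', 'ra', 'dec', 'obscode', 'fov')
```

Columns already present in the table are left untouched, so a per-visit value takes precedence over the keyword in this sketch.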

The `orbits` table is a numpy structured array-like object of state vectors in heliocentric cometary coordinates. It must have at least the following columns:
* `objectId`: a string of 20 characters or fewer, uniquely identifying this object
* `format`: type of the coordinate record; at the moment, only `COM` is allowed.
* `q`: perihelion distance, AU
* `e`: eccentricity
* `i`: inclination (degrees)
* `Omega`: longitude of ascending node (degrees)
* `argperi`: argument of the perihelion (degrees)
* `t_p`: time of perihelion (MJD, TDB)
* `t_0`: epoch (MJD, TDB)

It may have the following columns:
* `H`: absolute magnitude in an unspecified band
* `H<filter>`: absolute magnitude in the \<filter\> band (e.g. `Hr` or `Hg`, etc.)
* `G`: slope parameter

See further down for how these data are used.
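
For concreteness, here is a sketch of how such a table could be built as a numpy structured array. The specific dtypes (for example `U20` and `f8`) and all numerical values are our assumptions for illustration; the proposal only fixes the column names and units.

```python
import numpy as np

orbit_dtype = np.dtype([
    ("objectId", "U20"),  # string of 20 characters or fewer
    ("format", "U3"),     # only "COM" is currently allowed
    ("q", "f8"),          # perihelion distance, AU
    ("e", "f8"),          # eccentricity
    ("i", "f8"),          # inclination, degrees
    ("Omega", "f8"),      # longitude of ascending node, degrees
    ("argperi", "f8"),    # argument of perihelion, degrees
    ("t_p", "f8"),        # time of perihelion, MJD (TDB)
    ("t_0", "f8"),        # epoch, MJD (TDB)
    ("H", "f8"),          # optional: absolute magnitude
    ("G", "f8"),          # optional: slope parameter
])

# A single made-up object, purely for illustration.
orbits = np.array(
    [("toy0001", "COM", 2.5, 0.1, 5.0, 80.0, 70.0, 59800.0, 60000.0, 15.0, 0.15)],
    dtype=orbit_dtype,
)

print(orbits["objectId"][0], orbits["q"][0])  # toy0001 2.5
```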

The `config` dictionary may contain additional initialization data for the specific implementation of `ephemgen`. It can be any kind of dictionary, but must not use keys that overlap with any of the column names that may appear in the `visits` argument.
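
The reason for the no-overlap rule can be seen in a small sketch: because per-visit defaults and implementation-specific settings arrive through the same keyword arguments, the constructor has to tell them apart by name. The helper `split_keywords` and the `integrator_step` key are hypothetical.

```python
# Column names that may appear in `visits` (taken from the lists above).
VISIT_COLUMNS = {"visitId", "visitTimeTAI", "ra", "dec", "obscode", "fov", "filter"}

def split_keywords(**kwargs):
    """Separate per-visit defaults from implementation-specific config."""
    visit_defaults = {k: v for k, v in kwargs.items() if k in VISIT_COLUMNS}
    config = {k: v for k, v in kwargs.items() if k not in VISIT_COLUMNS}
    return visit_defaults, config

visit_defaults, config = split_keywords(obscode="500", fov=0.035, integrator_step=1.0)
print(visit_defaults)  # {'obscode': '500', 'fov': 0.035}
print(config)          # {'integrator_step': 1.0}
```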

### Behavior

The returned `ephemgen` object shall implement the [generator interface](https://wiki.python.org/moin/Generators).

It shall be callable as follows:
```python
for (visit_metadata, sources) in ephemgen:
    ...  # do postprocessing
```
Each iteration must yield the visit metadata and a table of sources for the visit.
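
A minimal skeleton of a class with this behavior might look as follows. `ToyEphemgen` is a hypothetical illustration: it uses plain lists of dicts instead of `ndarray`-like tables, and its "ephemeris computation" is a placeholder that returns every object for every visit.

```python
from collections import namedtuple

class ToyEphemgen:
    """Hypothetical skeleton; a real implementation would propagate orbits."""

    def __init__(self, visits, orbits, **config):
        self.visits = visits  # here: a list of dicts, one per visit
        self.orbits = orbits  # here: a list of dicts, one per object
        self.config = config

    def _sources_in_visit(self, visit):
        # Placeholder for the real computation: tag every object with the visit.
        return [{"visitId": visit["visitId"], "objectId": orbit["objectId"]}
                for orbit in self.orbits]

    def __iter__(self):
        # Writing __iter__ with `yield` gives each for loop a fresh generator.
        if not self.visits:
            return
        VisitMetadata = namedtuple("VisitMetadata", list(self.visits[0]))
        for visit in self.visits:
            yield VisitMetadata(**visit), self._sources_in_visit(visit)

ephemgen = ToyEphemgen(
    visits=[{"visitId": 1, "ra": 10.0, "dec": -5.0},
            {"visitId": 2, "ra": 11.0, "dec": -5.0}],
    orbits=[{"objectId": "toy0001"}],
)
results = [(meta.visitId, len(sources)) for meta, sources in ephemgen]
print(results)  # [(1, 1), (2, 1)]
```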

`visit_metadata` shall be a NamedTuple-like object whose member variables correspond to the columns in the `visits` table that was passed to the `Ephemgen` constructor. That is, it will have `visit_metadata.visitId`, `visit_metadata.ra`, `visit_metadata.dec`, etc.

`sources` shall be an ndarray-like table holding the ephemerides of objects computed to be in the FoV of the particular visit. It shall contain the following columns:
* `visitId`: unique integer identifying the visit
* `visitTimeTAI`: MJD of the visit midpoint time, in TAI
* `objectId`: unique string identifying the object
* `ra`: J2000 R.A. coordinate of the object (degrees)
* `dec`: J2000 Dec coordinate of the object (degrees)
* `dist`: Topocentric distance (km)
* `mag`: Apparent magnitude (see below for details)
* `pmRa`: Proper motion, R.A. (degrees)
* `pmDec`: Proper motion, Dec (degrees)
* `rVel`: Radial velocity (km/s)
* ... what else do we want to require here...?

The returned table may have additional quantities, but they must be prefixed by `<ephemgen>.<fieldname>` where `<ephemgen>` is the name of this ephemeris generator, and `<fieldname>` is the name of the additional column the generator wishes to provide. Example: `assist.helio.dist` for heliocentric distance computed by the ASSIST ephemeris generator. These entries will be passed down, unchanged, through the postprocessing.
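
A consumer could check this naming rule along the following lines. The helper `check_columns` is hypothetical, and `REQUIRED` simply lists the required columns named above (the draft leaves open whether more will be added).

```python
REQUIRED = ("visitId", "visitTimeTAI", "objectId", "ra", "dec",
            "dist", "mag", "pmRa", "pmDec", "rVel")

def check_columns(column_names, ephemgen_name):
    """Return the extension columns, raising if the naming rule is broken."""
    missing = [c for c in REQUIRED if c not in column_names]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    extras = [c for c in column_names if c not in REQUIRED]
    bad = [c for c in extras if not c.startswith(ephemgen_name + ".")]
    if bad:
        raise ValueError(f"columns lacking the '{ephemgen_name}.' prefix: {bad}")
    return extras

columns = list(REQUIRED) + ["assist.helio.dist"]
extras = check_columns(columns, "assist")
print(extras)  # ['assist.helio.dist']
```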

The magnitude is computed as follows:

* If `filter` was specified for a given visit, the magnitude will be computed using the `H<filter>` entry for the given object. That entry **must** exist, or the code must return a NaN in this field.

* If `filter` was not specified for the given visit, the magnitude will be computed using the `H` entry. That entry **must** exist, or the code must return a NaN in this field.
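
The selection rule can be sketched as below. The helper `select_absolute_magnitude` is hypothetical and operates on a plain dict standing in for one row of the `orbits` table; it covers only the choice of absolute magnitude, not the conversion to apparent magnitude, which this draft does not specify. A NaN returned here would propagate through any later arithmetic into the `mag` field.

```python
import math

def select_absolute_magnitude(orbit, filter_name=None):
    """Pick H<filter> if a filter is given, else H; NaN if the entry is absent."""
    key = f"H{filter_name}" if filter_name is not None else "H"
    return orbit.get(key, math.nan)

orbit = {"objectId": "toy0001", "H": 15.0, "Hr": 14.8}  # made-up values

print(select_absolute_magnitude(orbit, "r"))  # 14.8
print(select_absolute_magnitude(orbit))       # 15.0
print(select_absolute_magnitude(orbit, "g"))  # nan
```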

## Questions

* Should we use UT instead of TAI for the observation time? Pros: consistent with current practice. Cons: inconsistent with non-asteroid survey practice (incl. Rubin).

## Drawbacks

* This does not lend itself naturally to distributed computation (at the Python level), and it's not obvious what the solution may be. On a practical level, just splitting the input catalog into chunks, running them separately, and then merging the results will get the job done.
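
The split/run/merge workaround could look roughly like this. `toy_ephemgen` and `chunked` are hypothetical stand-ins; in practice each chunk would be handled by a separate job rather than in a sequential loop.

```python
def toy_ephemgen(visit_ids, object_ids):
    """Hypothetical stand-in: pretends every object is visible in every visit."""
    for visit_id in visit_ids:
        yield visit_id, [(visit_id, obj) for obj in object_ids]

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

visit_ids = [1, 2]
object_ids = ["a", "b", "c", "d", "e"]

# Run each catalog chunk independently, then merge the results per visit.
merged = {visit_id: [] for visit_id in visit_ids}
for chunk in chunked(object_ids, size=2):
    for visit_id, sources in toy_ephemgen(visit_ids, chunk):
        merged[visit_id].extend(sources)

print(len(merged[1]))  # 5
```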

## Alternatives

* Use Ray (https://ray.io)? Pros: built-in parallelization, no need for external driver/split/merge code. Cons: heavy dependency, makes the API more complicated.