# Things to know

## When do we run the data production?

The data processing happens under three circumstances:

* When a full re-processing of the data is needed. In that case, all the following plans have to be run in order on all nights/targets available at the moment. This is usually done when a new code version is available, following important changes in several pieces of the code (e.g., SNF-0202 -> SNF-0203). When it happens, the first step is to produce the lists of nights/targets using the two scripts listed below. All the plans will then be run using `wrap_batch_jobs`, as explained below and in the [cookbook](SNfactoryDataProcessing.html) section.
    * `list_night` to create the list of nights on which the plans will be run;
    * `list_followed_targets`, to create the list of targets on which the plans will be run.
* When there are new observations, usually for a few nights. The plans will thus be run in their **incremental** mode, which only applies to the plans that run at the night level. We usually run the incremental production from `plan_file_quality` to `plan_gs_psf`.
* When a new "flux production" needs to be run. This is actually what happens most of the time. Since the first few steps of the pipeline (up to `plan_extract_star`) are now quite stable, the re-processing of the data is usually based on the incremental production of these first few steps. It is thus usually done as follows:
    * data transfer -> `plan_gs_psf`: incremental mode of the production (night-oriented plans). Usually done on a few new nights during which SNIFS has been used. This includes the following steps:
        * Data transfer and summit cleaning;
        * DB header update, data filling, and DB update (`snf_header`, `snf_db_make`, `snf_db_fill`, `SyncTarget`, `FlagRunKind`);
        * `plan_file_quality`;
        * `plan_cube_generation`;
        * `plan_extract_star`;
        * `plan_multi_standard`;
        * `tabPhotometricity`;
        * `plan_photometric_ratios`;
        * `plan_gs_psf`;

    * `plan_flux_solution` -> `plan_analyze_timeseries`: full reprocessing (target-oriented plans). Usually done on the full sample (~1300 targets). This includes the following steps:
        * `plan_flux_solution`;
        * `plan_flux_calibration`;
        * `plan_cubefit`;
        * `plan_extract_star`;
        * `plan_analyze_timeseries`.
        
**Note**: some of the plans can be both night-oriented and target-oriented, e.g., `plan_extract_star`.

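The three situations above can be summarised as data. The following is a minimal Python sketch, not part of the pipeline: the step names are taken from the lists above, while the `production_steps` helper, the kind names (`full`, `incremental`, `flux`) and the scope labels are illustrative assumptions.

```python
# Night-oriented steps, in the order listed above.
NIGHT_STEPS = (
    "plan_file_quality",
    "plan_cube_generation",
    "plan_extract_star",
    "plan_multi_standard",
    "tabPhotometricity",
    "plan_photometric_ratios",
    "plan_gs_psf",
)

# Target-oriented steps, in the order listed above.
TARGET_STEPS = (
    "plan_flux_solution",
    "plan_flux_calibration",
    "plan_cubefit",
    "plan_extract_star",
    "plan_analyze_timeseries",
)


def production_steps(kind):
    """Return (step, scope) pairs for a production kind (hypothetical helper)."""
    if kind == "full":
        # Everything is rerun on everything available.
        return ([(s, "all nights") for s in NIGHT_STEPS]
                + [(s, "all targets") for s in TARGET_STEPS])
    if kind == "incremental":
        # Only the night-oriented steps, only on the new nights.
        return [(s, "new nights") for s in NIGHT_STEPS]
    if kind == "flux":
        # Incremental first steps, then a full target-oriented reprocessing.
        return ([(s, "new nights") for s in NIGHT_STEPS]
                + [(s, "all targets") for s in TARGET_STEPS])
    raise ValueError("unknown production kind: %r" % (kind,))


for step, scope in production_steps("flux"):
    print(step, "->", scope)
```

Data transfer and the DB update steps are left out of this sketch because they are not plans launched through `batch_jobs`.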

## Post-mortem jobs

The `Osiris` job runs continuously in the demon queue and takes care of the post-mortem analysis of our jobs: it fetches the exit status and the stdout and stderr logs of the finished jobs, stores them, and changes the job status in the DB. This only concerns jobs that are launched via `batch_jobs` and `snf_qsub`. Osiris runs every 10 minutes, updating any jobs that finished in the meantime. Demon jobs can run indefinitely but are limited to 24h of CPU time, so to be on the safe side Osiris will stop automatically after 7 days, at which point a backup job (`Anubis`) takes over, changes its name to Osiris, spawns a new child job, and so on.

### Launching Osiris by hand

If `Osiris` does not seem to be working properly (the Job.Status values are not being flagged as **ENDED**) or if neither `Osiris` nor `Anubis` is in the queue, we need to relaunch them:

        qdel Osiris Anubis    # if they are running and seem bugged
        qsub $SNF_TASKS/Processing/scripts/osiris.sh

The logs (if any) of the post-mortem jobs are stored in `$HOME/db/SGE/osiris`.

`Osiris` will only update jobs which are launched *after* it starts. In other words, if you've been launching hundreds of jobs for two days and `Osiris` was stuck, this procedure will not fix the jobs. For that you either need to update each job by hand (see next section) or you can force the timestamp taken as reference for `Osiris`, like this:

        date
        Wed Sep 26 09:40:48 CEST 2012
        date --date "24 september 2012"
        Mon Sep 24 00:00:00 CEST 2012
        date --date "24 september 2012" +%s
        1348437600
        qsub $SNF_TASKS/Processing/scripts/osiris.sh 1348437600

This will force `Osiris` to search for jobs to update that were stored in the last 2 days (today is the 26th, and the timestamp we gave it corresponds to the 24th).

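For reference, the same reference timestamp can be computed in Python. This is a minimal sketch that assumes CEST is a fixed UTC+2 offset, as in the `date` output above; the helper name `reference_timestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST modelled as a fixed UTC+2 offset, matching the shell example.
CEST = timezone(timedelta(hours=2), "CEST")


def reference_timestamp(year, month, day, tz=CEST):
    """Return the epoch time (in seconds) of local midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=tz).timestamp())


ts = reference_timestamp(2012, 9, 24)
print(ts)  # 1348437600, the value passed to osiris.sh above

# How far back Osiris will look, relative to the "date" output above.
now = datetime(2012, 9, 26, 9, 40, 48, tzinfo=CEST)
print(now - datetime.fromtimestamp(ts, tz=CEST))  # 2 days, 9:40:48
```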
### Update job status by hand

The `Osiris` job updates the status of the finished jobs every 10 minutes. Sometimes it may miss a job, which will have a Job.Status in the DB as **Ending**. You can update the status of the job by hand, if you have the original .out and .err files for the job:

        cd $JOBDIR/<job output dir>/
        update_sge_job <filename>.out <filename>.err

## Launching jobs

We use two different scripts to launch jobs: `batch_jobs` and `wrap_batch_jobs`.


### `batch_jobs`
`batch_jobs` is a wrapper that creates a bunch of job scripts via one of these plan scripts (in order of their use in the cookbook):

* `plan_file_quality`        (z)
* `plan_cube_generation`     (c)
* `plan_extract_star`        (e)
* `plan_multi_standard`      (m)
* `plan_photometric_ratios`  (h)
* `plan_gs_psf`              (g)
* `plan_flux_solution`       (x)
* `plan_transmission`        (t)
* `plan_flux_calibration`    (f)
* `plan_ddt`                 (d)
* `plan_cubefit`             (b)
* `plan_analyze_timeseries`  (a)
   
The most important options of `batch_jobs` are:

* `--prefix`: set job prefix (required);
* `--meta_prefix`: change the `SNF-` preprefix (e.g. to your initials for debug runs);
* `--mode`: set mode, e.g. `plan_file_preprocessing` or `p` (required when creating, optional with `--submit`);
* `--outver`: force the output version; if present this argument will be added to the `-a` option;
* `--args`: pass ARGS to the `plan_*` command (use quotes!);
* `--nodb`: use option `--no_register` in plan scripts;
* `--keep`: keep going even if a `plan_*` or submit fails;
* `--submit`: submit the already created job scripts using `snf_qsub` (or `qsub` if `--nodb` was used to create);
* `--autosubmit`: automatically submit the jobs if script creation succeeds.

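As an illustration, a `batch_jobs` command line can be assembled and quoted from Python before being written to a `CMD` file. This is only a sketch: the short options `-p`, `-m` and `-a` are those used in the `CMD` example shown later in this page and are assumed to correspond to `--prefix`, `--mode` and `--args`; the `batch_jobs_command` helper and the `ES0202` prefix are hypothetical. Nothing is executed here.

```python
import shlex


def batch_jobs_command(prefix, mode, plan_args, list_file):
    """Build the argument list of a batch_jobs call (hypothetical helper)."""
    argv = ["batch_jobs", "-p", prefix, "-m", mode]
    if plan_args:
        # The plan arguments travel as ONE argument, hence the quoting below.
        argv += ["-a", plan_args]
    argv.append(list_file)
    return argv


argv = batch_jobs_command("ES0202", "e", "--truncateR 5100,9700 -R", "test.list")

# shlex.join quotes each argument so the line can be pasted into a shell.
command_line = shlex.join(argv)
print(command_line)
```

Keeping the command as a list of arguments until the last moment avoids losing the quotes around the plan arguments, which is the mistake the "use quotes!" remark above warns about.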
### `wrap_batch_jobs`

`wrap_batch_jobs` is a wrapper around `batch_jobs`, used to launch a production on a large list (more than 5-10) of targets or nights. This wrapper takes a list and a `batch_jobs` command line, splits this list into several sub-lists, and launches the shell script production on the batch queue system. Launching a production takes only a few minutes with `wrap_batch_jobs`, while it could take from 1 to ~10 hours using `batch_jobs` only.

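The splitting step can be pictured with the short Python sketch below. It is not the actual `wrap_batch_jobs` code: the `split_list` helper, the choice of contiguous near-equal chunks, and the night names are assumptions made for illustration.

```python
def split_list(items, n_sublists):
    """Split items into at most n_sublists contiguous chunks of near-equal size."""
    if n_sublists < 1:
        raise ValueError("n_sublists must be at least 1")
    n = min(n_sublists, len(items)) or 1
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical night identifiers, for illustration only.
nights = ["night_%02d" % i for i in range(1, 11)]
for sublist in split_list(nights, 3):
    print(len(sublist), sublist)
```

Each sub-list would then be handed to its own `batch_jobs` call on a separate worker, which is where the speed-up comes from.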
## Run a production
The SNfactory data production is carried out using the scripts and plans introduced above. There are many steps in the production flow, each of them depending on (at least) the previous one. The productions are run at the CC IN2P3 under the *snprod* account. After connecting to the CC, there are a few steps and things to know before running any kind of production code. For a more detailed presentation of each step of the production plan, have a look at the following [SNf twiki page](https://snf-doc.lbl.gov/twiki/bin/view/Tasks/NewDataProcessing).

### Code versions
The SNfactory code has changed a lot since the beginning of the project, and some of its parts are still evolving. Each time we thought that the code was stable enough to run a large-scale production, we tagged it as a stable CVS version. By default, when connecting to the CC under snprod, the last stable version is used, as seen in the prompt: `snprod@ccage009 [SNF-02-02] ~ $`. The list of available code versions can be obtained using the *snf_version* script:

    snprod@ccage009 [SNF-02-02] ~ $ snf_version
    Choose version among available ones:
    HEAD
    SNF-01-00
    SNF-01-01
    ...
    SNF-02-00
    SNF-02-01
    SNF-02-02
    
To go from the current version to another version, run the *snf_version* script again with the new version as argument:

    snprod@ccage009 [SNF-02-01] ~ $ snf_version SNF-02-02
    snprod@ccage009 [SNF-02-02] ~ $

What actually changes during this operation is most of the environment variables, which point to the code or to other directories used. For example:

    snprod@ccage009 [SNF-02-01] ~ $ echo $JOBDIR
    /afs/in2p3.fr/group/snovae/snprod/jobs/SNF-02-01
    snprod@ccage009 [SNF-02-02] ~ $ echo $JOBDIR
    /afs/in2p3.fr/group/snovae/snprod/jobs/SNF-02-02
    
    snprod@ccage009 [SNF-02-01] ~ $ echo $SNF_TASKS
    /afs/in2p3.fr/group/snovae/snf/SNFactory/Tasks/SNF-02-01
    snprod@ccage009 [SNF-02-02] ~ $ echo $SNF_TASKS
    /afs/in2p3.fr/group/snovae/snf/SNFactory/Tasks/SNF-02-02

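The pattern visible in these paths (a fixed root followed by the version tag) can be written down as a small Python sketch. Only the two variables shown above are covered; how `snf_version` actually sets the environment is not documented here, and the `version_environment` helper is hypothetical.

```python
import posixpath

# Root of the group area, as it appears in the paths above.
GROUP_ROOT = "/afs/in2p3.fr/group/snovae"


def version_environment(version):
    """Return the version-dependent variables shown above (hypothetical helper)."""
    return {
        "JOBDIR": posixpath.join(GROUP_ROOT, "snprod", "jobs", version),
        "SNF_TASKS": posixpath.join(GROUP_ROOT, "snf", "SNFactory", "Tasks", version),
    }


for name, path in sorted(version_environment("SNF-02-02").items()):
    print(name, "=", path)
```

The practical consequence is that job directories and code from two versions never mix, as long as the version is switched before anything is launched.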


### Job directory
Jobs are usually launched **from** the following directory when working under *snprod*:

    snprod@ccage019 [SNF-02-02] ~ $ cd $JOBDIR
    snprod@ccage019 [SNF-02-02] jobs/SNF-02-02 $ pwd
    /afs/in2p3.fr/group/snovae/snprod/jobs/SNF-02-02
    
Jobs will then be launched from the directory corresponding to the plan you want to run (e.g. PES for *plan_extract_star*). Each plan is either night-oriented or target-oriented. In both cases, the corresponding directories are automatically created by the *batch_jobs* script:

    snprod@ccage019 [SNF-02-02] jobs/SNF-02-02 $ ls PES
    04  05  06  07  08  09  10  11  12  13  14  CMD  nights.list  test.list


### Launch a job
To launch jobs, we usually first use the following scripts to create the lists of nights and targets:

    list_followed_targets
    list_night
    
and then launch the different plans using:

    batch_jobs or wrap_batch_jobs
    
if you respectively have a few jobs to launch or a long list of jobs to launch. *batch_jobs* will interactively launch jobs for a given list of nights/targets, while *wrap_batch_jobs* will do it on different workers in parallel, making job creation and launching much faster. Have a look at the different options and the many *CMD* files present in the job directories to find out which options have to be used for a given job. Here is an example using the last PES production made under SNF-02-02:

    snprod@ccage019 [SNF-02-02] SNF-02-02/PES $ more CMD
    batch_jobs -p ES${SNF_VERSION_LITE} -m e -a "--truncateR 5100,9700 -R" test.list

When a job or a list of jobs has been launched, you can check whether it is in the queue or running using the *qstat2* command:

    snprod@ccage019 [SNF-02-02] SNF-02-02/PES $ qstat2
    Jobname                  State  Time                            CPU
    Osiris                       r  2014-11-24T09:16:17           503/0
    SNF-0202-NEWMFRh-EG131       r  2014-11-26T14:25:59    60272/100000
    SNF-0202-NEWMFRh-P177D       r  2014-11-27T16:39:25     1993/100000
    Anubis                     hqw  2014-11-24T09:16:30             0/0
    ---
      3 running
      1 queued
      
As mentioned above in the section on the SNfactory database, we can currently run up to 360 jobs at a time; any other launched jobs wait in the queue for a slot to be freed. The DB can support up to 400 simultaneous connections, which should be enough to handle any production and personal connections at the same time. Both limits were raised between July and December 2015: before July 2015, they were 120 jobs and 250 DB connections.
      
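The `qstat2` listing shown above can also be summarised programmatically, for example to estimate how many of the 360 slots are still free. The sketch below parses the sample output from this page; it assumes the usual SGE state codes (`r` for running, any state containing `qw` for queued), and `count_states` is a hypothetical helper, not one of the SNfactory tools.

```python
MAX_RUNNING_JOBS = 360  # current limit quoted above

SAMPLE = """\
Jobname                  State  Time                            CPU
Osiris                       r  2014-11-24T09:16:17           503/0
SNF-0202-NEWMFRh-EG131       r  2014-11-26T14:25:59    60272/100000
SNF-0202-NEWMFRh-P177D       r  2014-11-27T16:39:25     1993/100000
Anubis                     hqw  2014-11-24T09:16:30             0/0
---
  3 running
  1 queued
"""


def count_states(text):
    """Count running and queued jobs in a qstat2-like listing."""
    running = queued = 0
    for line in text.splitlines()[1:]:  # skip the header line
        if line.startswith("---"):
            break  # the summary lines follow the separator
        fields = line.split()
        if len(fields) != 4:
            continue
        state = fields[1]
        if state == "r":
            running += 1
        elif "qw" in state:
            queued += 1
    return running, queued


running, queued = count_states(SAMPLE)
print(running, "running,", queued, "queued,",
      MAX_RUNNING_JOBS - running, "slots free")
```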
Here is a list of CC-IN2P3 documentation links about job submission:

* Job submission: http://cc.in2p3.fr/docenligne/969#jobcheckstatus
* Queues and limits: http://cctools.in2p3.fr/mrtguser/info_sge_queue.php
* Snovae info on the CC: http://cctools.in2p3.fr/mrtguser/info_manips_detail.php?group=snovae
* General info: http://cctools.in2p3.fr/mrtguser/

### Managing jobs
Managing a job or a list of jobs goes from launching them to checking their results and outputs. Several tools have been written for this purpose:
    
* `qstat2`: check if a job has been correctly launched, if it is running, finished, or still in queue (see above).
* `jobErrors`: check the .err and .out of a job for errors. Works only on regular jobs, i.e., not on MFR (python scripts), DDT (no .err) or cubefit (non-standard *SNf* outputs) jobs.
* `check_job_output`: compare the number of files that should have been registered at the end of a job with what has actually been registered in the DB after the job ends.
* `job_attrition`: for a given directory, check for *missing* jobs, i.e., not launched, launched but not registered, launched but killed, etc.
* `manage_jobs`: check whether a job or list of jobs exists in the DB, check or change their status and the number of associated processes, and delete a job in case of problem. When deleting a job, be very careful and first make sure that you won't delete jobs that we want to keep: deleting erases all files from disk and from the DB, and there is no way to come back to the previous state. Use option -db instead of -d: it will create a file that you will have to launch manually. Check its content before submitting it to the queue. Option -n will show you the list of jobs with no associated processes, which should be empty.
* `srb_cleanup`: compare data on disk and in the DB. Look for missing data either in the DB or on disk.
* `job_env_check`: check the environment variables of a job, or compare them between two jobs.

### Personal Production

There are a few options in `batch_jobs` that allow you to submit a "personal/test" production which will not interfere with the main production.

#### No registration in the DB

To avoid recording the results of your production on the central archive disks and in the DB, just add the `-L` option when running the `batch_jobs` script. In this case, not only will the produced files not be registered in the DB, but the job will also run in the directory from which `batch_jobs -s` submitted the jobs, so the produced files can be recovered from that same directory.

#### Registration in the DB

Each SNFactory member has (or should have) a "production ID" (Central test=0, Steven=1, Pierre=2, ...; ask for yours if you don't know it). This "production ID" can be used at production time to identify your test production so that it does not interfere with the main production. It is passed to the generation/submission of the jobs through the `--outver="Production ID"` option of `batch_jobs`.

## Manual DB-update recipe

The DB is "automatically" updated when new data come in, while the first steps of the pipeline are run (see steps [A](SNfactoryDataProcessing.html#a-header-db-update), [B](SNfactoryDataProcessing.html#b-db-update) and [C](SNfactoryDataProcessing.html#c-update-target-run-info) of the data processing cookbook). Unfortunately, it sometimes happens that the Target.Kind (and Type) and Run.Kind of specific targets are not updated correctly, or at least not the way we would like them to be. Some of these (good) targets will thus have their Kind set to 'Unknown', or some other undesirable value (same for the Run.Kind). It is then necessary to manually modify these targets and their runs in the DB so they can be processed by the pipeline like other regular targets. To do so, we use the following command lines, which have to be adapted to the specific targets/runs that you would like to modify in the DB.

### Change a Target.Kind

First, let's import the DB tables we will work on (usually the Target and Run tables):

    from processing.process.models import Target, Run

We then have to get the target (could be a run, see below) we would like to modify in the DB. 

**Note**: the selected target (G27-45) hasn't been observed by SNfactory but by another PI (Willman), so it is quite safe to interactively change some of its values, as long as we change them back at the end of the example session.

    tg = Target.objects.get(Name='G27-45')
    print tg.Name, tg.Kind, tg.Type, tg.PI

    G27-45 unkonwn NotSnf Willman


To modify the Kind or Type (or any other value) of this target in the DB, simply change it in your IPython session first, and then save it to the DB:

    tg.Kind = 'NewKind'
    tg.Type = 'NewType'
    tg.save()

Now check that the changes have been correctly saved in the DB

    tg = Target.objects.get(Name='G27-45')
    print tg.Name, tg.Kind, tg.Type

    G27-45 NewKind NewType


Then change them back to their original values (you would not change them back in real life):

    tg.Kind = 'unkonwn'
    tg.Type = 'NotSnf'
    tg.save()
    tg = Target.objects.get(Name='G27-45')
    print tg.Name, tg.Kind, tg.Type

    G27-45 unkonwn NotSnf


**Note**: You can use `Target.objects.filter(Name__contains='SOMEPATTERN')` (or `__regex`, `__startswith`, or `__in` with a list of names) to get several targets in case you want to apply the same type of modification to a series of targets. See the [DB data access](DataAccess.html#from-the-db) section for examples of more complex queries.

### Change a Run.Kind

Any table can be changed using the same scheme. For example, you can apply the same method and change the Run.Kind for all runs of the above target by looping on the runs (but we will only change one of them in this example)

    for r in tg.Runs_FK.all():
        print r.IdRun, r.Kind, r.Type

    8255097 unkonwn PHOTO
    8255098 unkonwn PHOTO
    8255064 unkonwn ACQUISITION
    8255065 unkonwn PHOTO
    8255066 unkonwn PHOTO
    8255067 unkonwn PHOTO
    8255068 unkonwn PHOTO
    8255069 unkonwn PHOTO
    8255092 unkonwn ACQUISITION
    8255093 unkonwn ACQUISITION
    8255094 unkonwn PHOTO
    8255095 unkonwn PHOTO
    8255096 unkonwn PHOTO


To change one of them

    r = Run.objects.get(IdRun=8255098)
    print r.IdRun, r.TargetId_FK.Name, r.Kind

    8255098 G27-45 unkonwn

    r.Kind = 'NewRunKind'
    print r.Kind
    r.save()

    NewRunKind


And then make sure it has been saved in the DB

    for r in tg.Runs_FK.all():
        print r.IdRun, r.Kind, r.Type

    8255097 unkonwn PHOTO
    8255098 NewRunKind PHOTO
    8255064 unkonwn ACQUISITION
    8255065 unkonwn PHOTO
    8255066 unkonwn PHOTO
    8255067 unkonwn PHOTO
    8255068 unkonwn PHOTO
    8255069 unkonwn PHOTO
    8255092 unkonwn ACQUISITION
    8255093 unkonwn ACQUISITION
    8255094 unkonwn PHOTO
    8255095 unkonwn PHOTO
    8255096 unkonwn PHOTO


And of course, change it back to its original value and save it, since we do not really want to change anything in this example

    r = Run.objects.get(IdRun=8255098)
    print r.IdRun, r.TargetId_FK.Name, r.Kind

    8255098 G27-45 NewRunKind

    r.Kind = 'unkonwn'
    r.save()

And check again before leaving this example

    r = Run.objects.get(IdRun=8255098)
    print r.IdRun, r.TargetId_FK.Name, r.Kind

    8255098 G27-45 unkonwn


You can of course do that while looping on the list of runs for a given target (`tg.Runs_FK.all()`), but be careful to always check what you have changed locally before saving anything in the DB, i.e., before calling `.save()`.
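
One way to make this check systematic is to list the pending changes first and only save on a second, explicit pass. The sketch below illustrates that pattern in standalone Python: `FakeRun` is a stand-in for the real `Run` model (in a real session the objects would come from `tg.Runs_FK.all()`), and the `change_kind` helper is hypothetical.

```python
class FakeRun(object):
    """Stand-in for a Run row; only mimics the attributes used here."""

    def __init__(self, IdRun, Kind):
        self.IdRun = IdRun
        self.Kind = Kind
        self.saved = False

    def save(self):
        # The real model would write to the DB at this point.
        self.saved = True


def change_kind(runs, new_kind, commit=False):
    """List the runs whose Kind would change; modify and save only if commit."""
    changes = []
    for run in runs:
        if run.Kind == new_kind:
            continue  # nothing to do for this run
        changes.append((run.IdRun, run.Kind, new_kind))
        if commit:
            run.Kind = new_kind
            run.save()
    return changes


runs = [FakeRun(8255097, "unkonwn"), FakeRun(8255098, "NewRunKind")]

# First pass: dry run, nothing is modified or saved.
for id_run, old, new in change_kind(runs, "NewRunKind"):
    print(id_run, old, "->", new)

# Second pass, once the list above has been checked by eye.
change_kind(runs, "NewRunKind", commit=True)
```

The dry run costs nothing and gives a last chance to catch a wrong selection before the DB is touched.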