## Example - difi on LSST SSP

[Assumed Inputs](#Assumed-Inputs)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In [1]:
import os
import sys
import numpy as np
import pandas as pd

import difi

### Installing `difi`

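Choose an installation directory for `difi` and `cd` into it. Clone the repository, `cd` into the newly created `difi` directory, and check out the `population-statistics` branch:  
```git clone git@github.com:moeyensj/difi.git```  
```git checkout population-statistics```  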

Then, pick one of the following options:  
1) Install a new environment, and then install `difi`:  
```conda create -n difi_py36 -c defaults -c conda-forge --file requirements.txt python=3.6```  
```conda activate difi_py36```  
```python -m ipykernel install --user --name=difi_py36 --display-name="difi (Python 3.6)"```  
```python setup.py develop```  
You may need to install ipykernel in your conda environment.

2) Or, activate an existing environment, and then install `difi`:  
```conda activate ENV```  
```conda install -c defaults -c conda-forge --file requirements.txt```  
```python setup.py develop```  

In [2]:
DATA_DIR = "/epyc/projects/lsst_ssp/difi/test_data"

In [3]:
! ls {DATA_DIR}

Veres_5x5deb_14nights.csv
Veres_5x5deb_14nights.tar.gz
heliolinc2_clusters_14nights_5x5_mean_state_filter_rms_3arcsec.csv


### Assumed Inputs

Let's take a look at a sample observations file used by the LSST software.

In [4]:
observations = pd.read_csv(os.path.join(DATA_DIR, "Veres_5x5deb_14nights.csv"), 
                           sep=",", 
                           dtype={
                               "obsId" : str,
                               "obj" : str,
                               "objId" : str,
                           },
                           index_col=False)

In [5]:
# Observation IDs, object names / IDs should all be object type
observations.dtypes

obj        object
time      float64
RA        float64
DEC       float64
x_obs     float64
y_obs     float64
z_obs     float64
vx_obs    float64
vy_obs    float64
vz_obs    float64
night       int64
obsId      object
objId      object
dtype: object

In [6]:
observations["obj"].value_counts()

FD           44948
NS           16363
S1000sTba       18
S1002KYPa       18
S100bYpla       17
             ...  
S100ojnDa        1
S100wdGOa        1
S101v7uba        1
S100zr7ya        1
S100XL7va        1
Name: obj, Length: 3479, dtype: int64

Before we continue, let's tell `difi` which columns to use. Between the observations and linkageMembers dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

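So let's define a dictionary that tells `difi` which columns hold this information. The extra `epoch_mjd` entry is not one of the three columns above; it is used later in this notebook to assign observations to nights.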

In [7]:
columnMapping = {
    # difi column name : data column name
    "linkage_id" : "clusterId",
    "obs_id" : "obsId",
    "truth" : "obj",
    "epoch_mjd" : "time",
}

Let's add a class column and make sure all noise classes have unique designations:

In [8]:
# New in difi: the old falsePositiveIDs and unknownIDs flags have been removed. You can now pass difi a column name
# that tells it which class each observation belongs to.
# All synthetic error observations must now also have unique designations.

duplicatedIDs = ["NS", "FD"]
for j in duplicatedIDs:
    mask = observations[columnMapping["truth"]].isin([j]) 
    newIDs = np.array(["{}_{:08d}".format(t, i) for i, t in enumerate(observations[mask][columnMapping["truth"]].values)])
    observations.loc[mask, columnMapping["truth"]] = newIDs
    
observations.loc[observations["obj"].str.match("^S1"), "class"] = "MBA"
observations.loc[observations["obj"].str.match("^S0"), "class"] = "NEO"
observations.loc[observations["obj"].str.match("^FD"), "class"] = "NOISE 1"
observations.loc[observations["obj"].str.match("^NS"), "class"] = "NOISE 2"
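
The renaming and class assignment above can be tried on a small table first. The following is a minimal sketch on made-up data: the `toy` frame, the `S0000001a` designation and the `prefix_to_class` dictionary are assumptions for illustration, not part of `difi` or of the test data.

```python
import pandas as pd

# Toy stand-in for the observations table (all values are made up)
toy = pd.DataFrame({
    "obsId": ["0", "1", "2", "3", "4"],
    "obj": ["FD", "S1000sTba", "NS", "FD", "S0000001a"],
})

# Give every noise detection its own designation, e.g. FD_00000000, FD_00000001
for label in ["NS", "FD"]:
    mask = toy["obj"] == label
    toy.loc[mask, "obj"] = [f"{label}_{i:08d}" for i in range(int(mask.sum()))]

# Assign a class from the two-character designation prefix
prefix_to_class = {"S1": "MBA", "S0": "NEO", "FD": "NOISE 1", "NS": "NOISE 2"}
toy["class"] = toy["obj"].str[:2].map(prefix_to_class)
print(toy)
```

After this step every truth label is unique per noise detection, so a noise detection can never be counted as a multi-observation object.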

In [9]:
linkageMembers_temp = pd.read_csv(os.path.join(DATA_DIR, "heliolinc2_clusters_14nights_5x5_mean_state_filter_rms_3arcsec.csv"), 
                            sep=",", 
                            index_col=False)

linkageMembers_temp["clusterId"] = linkageMembers_temp["clusterId"].astype(str)

In [10]:
linkageMembers_temp.head()

Unnamed: 0,clusterId,obsId,r,drdt,cluster_epoch,x_a,y_a,z_a,vx_a,vy_a,vz_a
0,0,[ 7810 10379 14869 17221],2.0,-0.05,52397.573056,-1.912016,-0.517735,-0.27535,-0.049106,-0.004258,-0.013888
1,1,[ 7944 10411 14995 17255],2.0,-0.05,52397.573056,-1.913942,-0.514824,-0.268834,-0.049628,-0.00289,-0.013386
2,2,[ 7903 10405 14953 17249],2.0,-0.05,52397.573056,-1.912431,-0.5198,-0.26761,-0.049297,-0.004211,-0.012578
3,3,[ 6766 7786 13913 14849],2.0,-0.05,52397.573056,-1.91579,-0.48332,-0.310907,-0.049265,-0.002261,-0.014801
4,4,[ 6759 7898 13908 14948],2.0,-0.05,52397.573056,-1.918823,-0.476485,-0.302475,-0.049448,-0.001904,-0.014106


Data wrangling into the `difi` format. 

In [11]:
allLinkages = linkageMembers_temp[["clusterId", "r", "drdt", 'cluster_epoch', 'x_a', 'y_a', 'z_a','vx_a', 'vy_a', 'vz_a']]
linkageMembers_temp = linkageMembers_temp[["clusterId", "obsId"]]

In [12]:
# Split each linkage into its different observation IDs
linkage_list = linkageMembers_temp[columnMapping["obs_id"]].str.strip("[").str.strip("]").str.split().tolist()

# Build initial DataFrame
linkageMembers = pd.DataFrame(pd.DataFrame(linkage_list, index=linkageMembers_temp["clusterId"].values).stack(), columns=[columnMapping["obs_id"]])

# Reset index 
linkageMembers.reset_index(1, drop=True, inplace=True)

# Make linkage_id its own column
linkageMembers[columnMapping["linkage_id"]] = linkageMembers.index

# Re-arrange column order 
linkageMembers = linkageMembers[[columnMapping["linkage_id"], columnMapping["obs_id"]]]

# Not all linkages have the same number of detections, empty detections needs to be dropped
linkageMembers[columnMapping["obs_id"]].replace("", np.nan, inplace=True)
linkageMembers.sort_values(by=["clusterId", "obsId"], inplace=True)
linkageMembers.dropna(inplace=True)
linkageMembers.reset_index(drop=True, inplace=True)
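
The same reshaping (one bracketed list of observation IDs per cluster into one row per cluster and observation) can also be written with `DataFrame.explode`. This is only a sketch on a made-up two-row table, and it assumes pandas 0.25 or newer, which may not match the Python 3.6 environment described above.

```python
import pandas as pd

# Toy stand-in for the cluster file: obsId holds a bracketed, space-separated list
raw = pd.DataFrame({
    "clusterId": ["0", "1"],
    "obsId": ["[ 7810 10379 14869 17221]", "[ 7944 10411]"],
})

linkage_members = (
    raw.assign(obsId=raw["obsId"].str.strip("[] ").str.split())
       .explode("obsId")                     # one row per (clusterId, obsId) pair
       .sort_values(["clusterId", "obsId"])  # string sort, as in the cell above
       .reset_index(drop=True)
)
print(linkage_members)
```

Because each list is exploded on its own, clusters of different lengths never produce empty placeholder cells, so no `dropna` step is needed in this variant.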


In [13]:
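# Count members per linkage and keep the IDs with at least 4 members.
# Filtering the value_counts() Series itself keeps IDs and counts aligned;
# unique() returns IDs in order of appearance, value_counts() in order of count.
member_counts = linkageMembers["clusterId"].value_counts()
keep = member_counts[member_counts >= 4].index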

In [14]:
linkageMembers = linkageMembers[linkageMembers["clusterId"].isin(keep)]
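
When filtering linkages by member count, the labels and the counts need to come from the same index. The sketch below, on a made-up table, shows why: `unique()` and `value_counts()` return the cluster IDs in different orders, so a boolean mask built from one should not be applied to the other.

```python
import pandas as pd

# Toy linkage membership table (made-up values)
toy = pd.DataFrame({
    "clusterId": ["a", "a", "b", "b", "b", "b", "c", "c", "c", "c", "c"],
    "obsId": [str(i) for i in range(11)],
})

counts = toy["clusterId"].value_counts()   # indexed by clusterId, sorted by count
keep = counts[counts >= 4].index           # labels and counts share one index
filtered = toy[toy["clusterId"].isin(keep)]

# unique() returns labels in order of appearance, value_counts() in order of count
print(list(toy["clusterId"].unique()))   # ['a', 'b', 'c']
print(list(counts.index))                # ['c', 'b', 'a']
```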

In [15]:
linkageMembers

Unnamed: 0,clusterId,obsId
0,0,10379
1,0,14869
2,0,150
3,0,17221
4,0,24132
...,...,...
10245,9,5905
10246,9,6747
10247,9,7232
10248,9,7759


### Analyzing Observations (Can I Find It?) 

Determining how a linking algorithm performs involves knowing what it should be able to link. `difi` comes with a function that analyzes findability with a simple assumption: any truth with at least `minObs` observations should be findable. We leave it to the user to determine more complicated findability metrics for their specific science cases.

Let's see what should be findable with MOPS under this simple metric.

In [16]:
allTruths, summary = difi.analyzeObservations(observations,
                                              minObs=4,
                                              classes="class",
                                              verbose=True,
                                              columnMapping=columnMapping)

Analyzing observations...


The `analyzeObservations` function returns two dataframes. Let's take a look at both of them in a little detail.

The allTruths dataframe lists each unique truth as a row, with columns giving the number of observations for that truth and whether it is findable (whether it has at least `minObs` observations).

In [17]:
allTruths

Unnamed: 0,obj,num_obs,findable
0,S1000sTba,18,1
1,S1002KYPa,18,1
2,S100bYpla,17,1
3,S100dNFZa,17,1
4,S100azRha,17,1
...,...,...,...
64783,FD_00014321,1,0
64784,NS_00008816,1,0
64785,NS_00015713,1,0
64786,FD_00017219,1,0


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [18]:
findable_known_truths = allTruths[allTruths["findable"] == 1][columnMapping["truth"]].unique()

As stated earlier, the `analyzeObservations` function uses a very simple findability criterion. The user can write their own metric and then generate a dataframe in the same style as the one above. Doing so allows `difi` to calculate summary statistics, but it is not necessary in order to determine whether truths were linked.
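
As a rough illustration of what "a dataframe in the same style" could look like, the sketch below builds a table with a truth column, `num_obs` and `findable` from a made-up observations table. The threshold rule mirrors the simple "at least `minObs` observations" criterion; the `toy` data and the `min_obs` variable are assumptions, and this is not `difi`'s own implementation.

```python
import pandas as pd

# Toy observations (made-up values); "obj" plays the role of the truth column
toy = pd.DataFrame({
    "obsId": [str(i) for i in range(7)],
    "obj": ["A", "A", "A", "A", "B", "B", "C"],
})
min_obs = 4

# Count unique observations per truth
all_truths = (
    toy.groupby("obj")["obsId"].nunique()
       .rename("num_obs")
       .reset_index()
)

# A truth is findable if it has at least min_obs observations
all_truths["findable"] = (all_truths["num_obs"] >= min_obs).astype(int)
all_truths = all_truths.sort_values("num_obs", ascending=False).reset_index(drop=True)
print(all_truths)
```

A custom metric would only need to replace the line that fills the `findable` column.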

Before we see how MOPS performed, let's take a look at the summary dataframe.

In [19]:
summary

Unnamed: 0,class,num_obs,num_truths,findable
0,All,77695,64788,1918
1,MBA,16337,3458,1913
2,NEO,47,19,5
3,NOISE 1,44948,44948,0
4,NOISE 2,16363,16363,0


In [20]:
summary.columns

Index(['class', 'num_obs', 'num_truths', 'findable'], dtype='object')

The summary dataframe has one row per class (plus an `All` row) with some summary numbers about the given observations.

A few things to note: the noise classes report zero findable truths. Every noise observation was given a unique designation above, so each noise truth has exactly one observation and can never reach `minObs`. The `unknownIDs` and `falsePositiveIDs` kwargs used by earlier versions of `difi` have been removed and were not passed in the function call above.

In [21]:
allTruths[allTruths["findable"] == 0].head()

Unnamed: 0,obj,num_obs,findable
1918,S100xdA3a,3,0
1919,S100dZUwa,3,0
1920,S10007Bwa,3,0
1921,S100tEIna,3,0
1922,S100rZf5a,3,0


However... MOPS can't link single detections and has a stricter findability metric. We can write our own function to calculate findability to get a more accurate measure of what should be findable by MOPS.

In [22]:
# This is where the user's specific science and domain knowledge comes in to
# help determine what is findable and what is not

def calcNight(mjd, midnight=0.166):
    night = mjd + 0.5 - midnight
    return night.astype(int)

def calcFindableMOPS(observations, 
                     trackletMinObs=2, 
                     trackMinNights=3, 
                     columnMapping=columnMapping):
    # Groupby night, then count number of occurrences per night
    night_designation_count = observations.groupby(["nid"])[columnMapping["truth"]].value_counts()
    night_designation_count = pd.DataFrame(night_designation_count)
    night_designation_count.rename(columns={columnMapping["truth"]: "num_obs"}, inplace=True)
    night_designation_count.reset_index(inplace=True)

    # Remove nightly detections that would not be linked into a tracklet
    night_designation_count = night_designation_count[night_designation_count["num_obs"] >= trackletMinObs]

    # Groupby object then count number of nights
    try: 
        designation_night_count = pd.DataFrame(night_designation_count.groupby([columnMapping["truth"]])["nid"].value_counts())
    except Exception:
        # No objects satisfy the requirements, return empty array
        return np.array([])
    designation_night_count.rename(columns={"nid": "num_nights"}, inplace=True)
    designation_night_count.reset_index(inplace=True)

    # Grab objects that meet the night requirement
    tracklet_nights_possible = designation_night_count[columnMapping["truth"]].value_counts()
    return tracklet_nights_possible.index[tracklet_nights_possible >= trackMinNights].values

observations["nid"] = calcNight(observations[columnMapping["epoch_mjd"]])
findable_by_mops = calcFindableMOPS(observations)

These are the objects that should actually be findable by MOPS.

In [23]:
linkageMembers["clusterId"].nunique()

1113

In [24]:
len(findable_by_mops)

762
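
The stricter criterion (at least two detections on a night to form a tracklet, and tracklets on at least three nights) can be checked by hand on a small table. The sketch below is a compact restatement on made-up data, not the function used above: `calc_night` follows the same formula as `calcNight`, while the `toy` frame and the `groupby(...).size()` formulation are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def calc_night(mjd, midnight=0.166):
    # The integer part rolls over half a day away from `midnight`
    return (mjd + 0.5 - midnight).astype(int)

# Object A: two detections on each of three nights. Object B: one usable night only.
toy = pd.DataFrame({
    "obj": ["A"] * 6 + ["B"] * 4,
    "time": [52391.10, 52391.12, 52392.10, 52392.12, 52393.10, 52393.12,
             52391.10, 52391.12, 52392.10, 52393.10],
})
toy["nid"] = calc_night(toy["time"])

per_night = toy.groupby(["obj", "nid"]).size()    # detections per object per night
tracklet_nights = per_night[per_night >= 2]       # nights that can yield a tracklet
nights_per_obj = tracklet_nights.groupby(level="obj").size()
findable = nights_per_obj[nights_per_obj >= 3].index.values
print(findable)
```

Only object A satisfies both requirements here; object B has enough detections in total (four) to pass the simple `minObs=4` test, which is why the two metrics can disagree.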

Now let's modify the allTruths dataframe to update findability.

In [25]:
### CAREFUL HERE, SET TO TRUE IF YOU WANT TO COMPARE TO MOPS FINDABILITY
MAKE_FINDABILITY_MOPS = True

if MAKE_FINDABILITY_MOPS:
    allTruths["findable"] = np.zeros(len(allTruths), dtype=int)
    allTruths.loc[allTruths[columnMapping["truth"]].isin(findable_by_mops), "findable"] = 1

Let's confirm this worked as intended.

In [26]:
if MAKE_FINDABILITY_MOPS:
    assert len(findable_by_mops) == len(allTruths[allTruths["findable"] == 1])
    assert set(findable_by_mops) == set(allTruths[allTruths["findable"] == 1][columnMapping["truth"]].unique())

### Analyzing Linkages (Did I Find It?)

In [27]:
allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       allLinkages=None, 
                                                       allTruths=allTruths,
                                                       classes="class",
                                                       minObs=4, 
                                                       contaminationThreshold=20, 
                                                       verbose=True,
                                                       columnMapping=columnMapping)

All
----------------------------------------------------------------
                                   Number  (% class)  (% total)
All linkages:                        1113  (100.00%)  (100.00%)
Pure linkages:                        297  ( 26.68%)  ( 26.68%)
Complete pure linkages:                27  (  2.43%)  (  2.43%)
Partial linkages:                       0  (  0.00%)  (  0.00%)
Pure and partial linkages:            297  ( 26.68%)  ( 26.68%)
Mixed linkages:                       816  ( 73.32%)  ( 73.32%)

                                   Number  (% class) (% findable)
Unique All:
 ..in pure linkages:                  294  (  0.45%)  ( 38.58%)
 ..in complete pure linkages:          27  (  0.04%)  (  3.54%)
 ..in partial linkages:                 0  (  0.00%)  (  0.00%)
 ..in pure and partial linkages:      294  (  0.45%)  ( 38.58%)
 ..in mixed linkages:                1237  (  1.91%)  (162.34%)
 ..only in pure linkages:             294  (  0.45%)  ( 38.58%)
 ..only in partial l

The `analyzeLinkages` function returns three dataframes:
- allLinkages: each linkage is summarized as its own row. 
- allTruths: each truth is summarized as its own row. 
- summary: summary statistics, with one row per class. 

Let's now take a look at each individually.

In [28]:
linkageMembers[linkageMembers["clusterId"] == "0"]

Unnamed: 0,clusterId,obsId
0,0,10379
1,0,14869
2,0,150
3,0,17221
4,0,24132
5,0,25029
6,0,2519
7,0,2554
8,0,32197
9,0,3459


In [29]:

allLinkages[allLinkages["pure"] == 1]

Unnamed: 0,clusterId,num_members,num_obs,pure,partial,mixed,contamination,linked_truth
4,1000,1,4,1,0,0,0.0,S100j0LGa
5,1001,1,4,1,0,0,0.0,S100vmNaa
6,1002,1,4,1,0,0,0.0,S100gdjya
8,1004,1,4,1,0,0,0.0,S100mOYia
11,1007,1,4,1,0,0,0.0,S100EG7Fa
...,...,...,...,...,...,...,...,...
1100,889,1,6,1,0,0,0.0,S100M8CFa
1103,891,1,4,1,0,0,0.0,S100liu1a
1104,892,1,4,1,0,0,0.0,S100wLEQa
1107,895,1,5,1,0,0,0.0,S1001XkWa


In [31]:
clusterId = "889"
observations[observations["obsId"].isin(linkageMembers[linkageMembers["clusterId"].isin([clusterId])]["obsId"].values)]

Unnamed: 0,obj,time,RA,DEC,x_obs,y_obs,z_obs,vx_obs,vy_obs,vz_obs,night,obsId,objId,class,nid
56762,S100M8CFa,52404.116356,170.153954,-14.702835,-0.659029,-0.765055,-8e-06,0.012833,-0.011501,8.7e-05,13,56762,1268,MBA,52404
60265,S100M8CFa,52404.116805,170.153952,-14.702787,-0.659024,-0.76506,-8e-06,0.012833,-0.0115,8.7e-05,13,60265,1268,MBA,52404
61511,S100M8CFa,52404.117252,170.153956,-14.702751,-0.659018,-0.765065,-8e-06,0.012834,-0.0115,8.7e-05,13,61511,1268,MBA,52404
68166,S100M8CFa,52404.129973,170.154021,-14.701764,-0.658855,-0.765211,-7e-06,0.012854,-0.011491,8.4e-05,13,68166,1268,MBA,52404
68969,S100M8CFa,52404.130421,170.154028,-14.701729,-0.658849,-0.765216,-7e-06,0.012855,-0.011491,8.4e-05,13,68969,1268,MBA,52404
70934,S100M8CFa,52404.130869,170.154026,-14.701713,-0.658843,-0.765221,-7e-06,0.012855,-0.01149,8.4e-05,13,70934,1268,MBA,52404


For each linkage in `linkageMembers`, the allLinkages dataframe reports the number of unique truths ('num_members'), the number of unique observations ('num_obs'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage (if the linkage is 'partial') and, if the linkage is either 'pure' or 'partial', the linked truth ('linked_truth').

Here we briefly summarize the different linkage types possible:
- 'pure': all observations in a linkage belong to a unique truth
- 'partial': up to a certain percentage of observations from other truths is allowed, so long as one truth has at least the minimum required number of unique observations
- 'mixed': a linkage containing observations that belong to different truths. We avoid using the word 'false' for these clusters as they may contain unknown truths depending on the use case. We leave interpretation up to the user. 
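
The three linkage types can be sketched as a small classifier over the truth labels of a linkage's member observations. `classify_linkage` is a hypothetical helper written for this example, not part of `difi`, and the exact rules `difi` applies (for example, whether a pure linkage must also meet `minObs`) may differ.

```python
from collections import Counter

def classify_linkage(truths, min_obs=4, contamination_threshold=20.0):
    """Classify one linkage from the truth label of each member observation.

    Hypothetical helper for illustration; not part of difi.
    Returns (type, linked_truth, contamination_percent).
    """
    counts = Counter(truths)
    top_truth, top_count = counts.most_common(1)[0]
    # Percentage of observations that do not belong to the most common truth
    contamination = 100.0 * (len(truths) - top_count) / len(truths)
    if len(counts) == 1:
        return "pure", top_truth, 0.0
    if top_count >= min_obs and contamination <= contamination_threshold:
        return "partial", top_truth, contamination
    return "mixed", None, contamination

print(classify_linkage(["A", "A", "A", "A"]))
print(classify_linkage(["A", "A", "A", "A", "A", "NS_00000001"]))
print(classify_linkage(["A", "A", "B", "B"]))
```

In the second call one noise detection out of six gives a contamination of about 16.7%, which is under the 20% threshold, so the linkage counts as partial under these assumed rules.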

In [32]:
allLinkages[allLinkages["pure"] == 1]["linked_truth"]

4       S100j0LGa
5       S100vmNaa
6       S100gdjya
8       S100mOYia
11      S100EG7Fa
          ...    
1100    S100M8CFa
1103    S100liu1a
1104    S100wLEQa
1107    S1001XkWa
1110    S100fqrIa
Name: linked_truth, Length: 297, dtype: object

The allTruths dataframe shows, for each truth, whether it has been found in either a pure or a partial linkage. If it has, the found column is set to 1. 

Lastly, the summary dataframe contains overall statistics on the number of truths found, the completeness (if calculable) and so on.

In [33]:
summary

Unnamed: 0,class,completeness,findable,found,findable_found,findable_missed,not_findable_found,not_findable_missed,linkages,pure_linkages,pure_complete_linkages,partial_linkages,mixed_linkages,unique_in_pure,unique_in_pure_complete,unique_in_partial,unique_in_pure_and_partial,unique_in_pure_only,unique_in_partial_only,unique_in_mixed
0,All,18.897638,762,294,144,618,150,63876,1113,297,27,0,816,294,27,0,294,294,0,1237
1,MBA,18.897638,762,294,144,618,150,2546,1113,297,27,0,816,294,27,0,294,294,0,1001
2,NEO,0.0,0,0,0,0,0,19,3,0,0,0,3,0,0,0,0,0,0,2
3,NOISE 1,0.0,0,0,0,0,0,44948,90,0,0,0,90,0,0,0,0,0,0,167
4,NOISE 2,0.0,0,0,0,0,0,16363,65,0,0,0,65,0,0,0,0,0,0,67
