## Example - difi on MOPS

[Assumed Inputs](#Assumed-Inputs)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In [1]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append("..")

import difi

In [2]:
DATA_DIR = "data/mops/"

### Assumed Inputs

Let's take a look at a sample observations file used by the MOPS software.

In [3]:
observations = pd.read_csv(os.path.join(DATA_DIR, "night_52391_through_52406.dets"), 
                           sep=" ", 
                           index_col=False, 
                           header=None, 
                           names=["det_id", 
                                  "field_id", 
                                  "object_name", 
                                  "ra_deg", 
                                  "dec_deg", 
                                  "epoch_mjd", 
                                  "mag", 
                                  "mag_sigma"],
                           dtype={"det_id" : str})

In [4]:
observations

Unnamed: 0,det_id,field_id,object_name,ra_deg,dec_deg,epoch_mjd,mag,mag_sigma
0,137541512,1719609,4,171.392970,-14.233830,52391.002282,20.7856,0.105232
1,137541513,1719609,5,171.308411,-14.222651,52391.002282,19.7737,0.037289
2,137541533,1719609,24,171.105427,-13.838795,52391.002282,21.1030,0.146543
3,137541550,1719609,41,171.097453,-14.453792,52391.002282,20.7811,0.104742
4,137541564,1719609,54,171.740297,-14.035202,52391.002282,18.1394,0.008938
5,137541588,1719609,78,170.537911,-14.648319,52391.002282,17.7483,0.006731
6,137541598,1719609,88,170.692361,-14.499369,52391.002282,20.4922,0.077495
7,137541599,1719609,89,170.531883,-14.755033,52391.002282,20.1186,0.052732
8,137541608,1719609,98,171.096560,-14.599865,52391.002282,20.0936,0.051408
9,137541612,1719609,102,171.884102,-13.515641,52391.002282,20.3130,0.064372


MOPS writes its linkages to a text file, with the observation IDs for each linkage on a single line. Notice that there are also no linkage IDs.

In [5]:
! head {DATA_DIR}night_52391_through_52406.track

137541512 137543165 137615070 137620728 138216303 138216866 138221227 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144513728 144533645 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144513728 144533645 146991832 147084549 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144514371 144534274 
137541512 137543165 137615070 137620728 142747928 142763154 
137541512 137543165 137615070 137620728 142748009 142763229 
137541512 137543165 137615070 137620728 142748009 142763229 144513839 144533746 
137541512 137543165 137615070 137620728 142748120 142763338 
137541512 137543165 137615070 137620728 142748305 142763529 
137541512 137543165 137615070 137620728 142748337 142763570 


Before we continue, let's tell `difi` which columns to use. Between the observations and linkageMembers dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

So let's define a dictionary that tells `difi` which columns hold this information.

In [6]:
columnMapping = {
    # difi column name : data column name
    "linkage_id" : "track_id",
    "obs_id" : "det_id",
    "truth" : "object_name"
}
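
As a minimal sketch of how such a mapping can be used (on a toy dataframe, not the MOPS data, and not necessarily how `difi` uses it internally), the dictionary lets code look up a data column by its `difi` role, or be inverted to rename data columns to the `difi` names. The names `toy` and `to_difi_names` are illustrative only.

```python
import pandas as pd

# Same mapping as above: difi column name -> data column name
column_mapping = {
    "linkage_id": "track_id",
    "obs_id": "det_id",
    "truth": "object_name",
}

# Toy observations (hypothetical values)
toy = pd.DataFrame({"det_id": ["1", "2"], "object_name": ["4", "5"]})

# Look up a data column by its difi role
truth_values = toy[column_mapping["truth"]].tolist()

# Invert the mapping to rename data columns to difi names
to_difi_names = {data_name: difi_name for difi_name, data_name in column_mapping.items()}
renamed = toy.rename(columns=to_difi_names)
print(list(renamed.columns))
```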

We can convert the MOPS track file into the linkageMembers format required by `difi` using the following code.

In [7]:
linkageMembers = difi.readLinkagesByLineFile("data/mops/night_52391_through_52406.track",
                                             columnMapping=columnMapping)

In [8]:
linkageMembers

Unnamed: 0,track_id,det_id
0,1,137541512
1,1,137543165
2,1,137615070
3,1,137620728
4,1,138216303
5,1,138216866
6,1,138221227
7,2,137541512
8,2,137543165
9,2,137615070


The linkageMembers dataframe has just two columns, both of which `difi` needs. The first column holds the linkage ID and the second holds an observation ID, so each linkage occupies one row per unique observation it contains.
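
The conversion from one linkage per line to this long format can be sketched as follows. This is an illustrative re-implementation, assuming linkage IDs are simply assigned in file order starting at 1 (which is consistent with the output above). It is not `difi`'s actual code, and the helper name `linkages_by_line_to_frame` is hypothetical.

```python
import io
import pandas as pd

def linkages_by_line_to_frame(handle, linkage_col="track_id", obs_col="det_id"):
    """Turn whitespace-separated observation IDs (one linkage per line)
    into a long-format dataframe with one row per (linkage, observation)."""
    rows = []
    linkage_id = 0
    for line in handle:
        obs_ids = line.split()
        if not obs_ids:
            continue  # skip blank lines without consuming a linkage ID
        linkage_id += 1
        rows.extend((linkage_id, obs_id) for obs_id in obs_ids)
    return pd.DataFrame(rows, columns=[linkage_col, obs_col])

# Two lines in the same layout as the .track file shown earlier
sample = io.StringIO(
    "137541512 137543165 137615070 \n"
    "137541512 137543165 144513728 144533645 \n"
)
members = linkages_by_line_to_frame(sample)
print(members)
```

Observation IDs are kept as strings here, matching the `dtype={"det_id" : str}` choice made when reading the observations.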

Let's take a look at two specific examples:

In [9]:
linkageMembers[linkageMembers["track_id"] == 4]

Unnamed: 0,track_id,det_id
27,4,137541512
28,4,137543165
29,4,137615070
30,4,137620728
31,4,138216303
32,4,138216866
33,4,138221227
34,4,144514371
35,4,144534274


In [10]:
linkageMembers[linkageMembers["track_id"] == 14]

Unnamed: 0,track_id,det_id
97,14,137541512
98,14,137543165
99,14,137615070
100,14,137620728
101,14,144513738
102,14,144533658
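
These two linkages share their first four detections. A small sketch of how that overlap could be checked with plain Python sets, using only the rows printed above (the dataframe `toy_members` is a hand-built copy for illustration):

```python
import pandas as pd

# Rows copied from the two outputs above (track_id 4 and track_id 14)
toy_members = pd.DataFrame({
    "track_id": [4] * 9 + [14] * 6,
    "det_id": [
        "137541512", "137543165", "137615070", "137620728", "138216303",
        "138216866", "138221227", "144514371", "144534274",
        "137541512", "137543165", "137615070", "137620728",
        "144513738", "144533658",
    ],
})

obs_4 = set(toy_members.loc[toy_members["track_id"] == 4, "det_id"])
obs_14 = set(toy_members.loc[toy_members["track_id"] == 14, "det_id"])

# Observations that appear in both linkages
shared = sorted(obs_4 & obs_14)
print(shared)
```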


### Analyzing Observations (Can I Find It?) 

Determining how a linking algorithm performs involves knowing what it should be able to link. `difi` comes with a function that analyzes findability under a simple assumption: any truth with at least `minObs` observations should be findable. We leave it to the user to define more complicated findability metrics for their specific science cases.

Let's see what should be findable by MOPS under this simple metric.

In [11]:
allTruths, summary = difi.analyzeObservations(observations,
                                              minObs=5,
                                              unknownIDs=[],
                                              falsePositiveIDs=["-1", "-2"],
                                              verbose=True,
                                              columnMapping=columnMapping)

Analyzing observations...
Known truth observations: 103188
Unknown truth observations: 0
False positive observations: 3293
Percent known truth observations (%): 96.907
Percent unknown truth observations (%): 0.000
Percent false positive observations (%): 3.093
Unique truths: 17476
Unique known truths : 17474
Unique known truths with at least 5 detections: 8523

Total time in seconds: 0.06351709365844727



The `analyzeObservations` function returns two dataframes. Let's take a look at both of them in a little detail.

The allTruths dataframe lists each unique truth as a row, with columns for the number of observations of that truth and whether it is findable (whether it has at least `minObs` observations).

In [12]:
allTruths

Unnamed: 0,object_name,num_obs,findable
0,-1,2346,0
1,-2,947,0
2,7359,26,1
3,7928,26,1
4,7888,26,1
5,7839,25,1
6,7704,25,1
7,7684,25,1
8,7930,25,1
9,7863,25,1


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [13]:
findable_known_truths = allTruths[allTruths["findable"] == 1][columnMapping["truth"]].unique()
not_findable_known_truths = allTruths[(allTruths["findable"] == 0) & (~allTruths[columnMapping["truth"]].isin(["-1", "-2"]))][columnMapping["truth"]].unique()

As stated earlier, the `analyzeObservations` function has a very simple findability criterion. The user can write their own metric and then generate a dataframe in the same style as the one above. Doing so will allow `difi` to calculate summary statistics; however, this is not necessary to proceed with determining if truths were linked.

Before we see how MOPS performed, let's take a look at the summary dataframe.
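
A minimal sketch of building such a dataframe by hand, assuming the only columns needed are the truth column, `num_obs` and `findable` as shown in the allTruths output. The function name `custom_findable_table`, its parameters and the toy data are hypothetical; the metric here is just a placeholder to be replaced by a science-specific one.

```python
import pandas as pd

def custom_findable_table(observations, truth_col="object_name",
                          min_obs=3, ignore_ids=("-1", "-2")):
    """Count observations per truth and flag truths that meet a custom metric."""
    counts = observations[truth_col].value_counts()
    table = counts.rename_axis(truth_col).reset_index(name="num_obs")
    meets_metric = table["num_obs"] >= min_obs          # placeholder metric
    ignored = table[truth_col].isin(list(ignore_ids))   # e.g. false positives
    table["findable"] = (meets_metric & ~ignored).astype(int)
    return table

# Toy observations: "a" has 3 detections, "b" has 2, "-1" marks false positives
toy_observations = pd.DataFrame(
    {"object_name": ["a", "a", "a", "b", "b", "-1", "-1", "-1"]}
)
toy_truths = custom_findable_table(toy_observations)
print(toy_truths)
```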

In [14]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations
0,17476,17474,8523,103188,0,3293,96.907429,0.0,3.092571


In [15]:
summary.columns

Index(['num_unique_truths', 'num_unique_known_truths',
       'num_unique_known_truths_findable', 'num_known_truth_observations',
       'num_unknown_truth_observations', 'num_false_positive_observations',
       'percent_known_truth_observations',
       'percent_unknown_truth_observations',
       'percent_false_positive_observations'],
      dtype='object')

The summary dataframe is a simple single-row dataframe with summary numbers about the given observations.

A few things to note: any observations whose truths are listed in `unknownIDs` or `falsePositiveIDs` are ignored when counting the number of truths that should be findable. For example, in the function call a few cells earlier the `falsePositiveIDs` kwarg was set to `["-1", "-2"]`.
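
As a quick sanity check, the percentage columns can be reproduced from the count columns, assuming each percentage is taken relative to the total number of observations (an assumption that matches the printed values):

```python
# Counts copied from the summary output above
num_known = 103188
num_unknown = 0
num_false_positive = 3293

total = num_known + num_unknown + num_false_positive

percent_known = 100 * num_known / total
percent_unknown = 100 * num_unknown / total
percent_false_positive = 100 * num_false_positive / total

print(round(percent_known, 3), round(percent_unknown, 3), round(percent_false_positive, 3))
```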

In [16]:
allTruths[allTruths["findable"] == 0].head()

Unnamed: 0,object_name,num_obs,findable
0,-1,2346,0
1,-2,947,0
8525,436045,4,0
8526,423369,4,0
8527,421386,4,0


However, MOPS can't link single detections and has a stricter findability metric. We can write our own function to calculate findability to get a more accurate measure of what should be findable by MOPS.

In [17]:
# This is where the user's specific science and domain knowledge comes in to
# help determine what is findable and what is not

def calcNight(mjd, midnight=0.166):
    night = mjd + 0.5 - midnight
    return night.astype(int)

def calcFindableMOPS(observations, 
                     trackletMinObs=2, 
                     trackMinNights=3, 
                     falsePositiveIDs=["-1", "-2"],
                     unknownIDs=[],
                     columnMapping=columnMapping):
    # Groupby night, then count number of occurrences per night
    night_designation_count = observations[~observations[columnMapping["truth"]].isin(falsePositiveIDs + unknownIDs)].groupby(["nid"])[columnMapping["truth"]].value_counts()
    night_designation_count = pd.DataFrame(night_designation_count)
    night_designation_count.rename(columns={columnMapping["truth"]: "num_obs"}, inplace=True)
    night_designation_count.reset_index(inplace=True)

    # Remove nightly detections that would not be linked into a tracklet
    night_designation_count = night_designation_count[night_designation_count["num_obs"] >= trackletMinObs]

    # Groupby object then count number of nights
    try: 
        designation_night_count = pd.DataFrame(night_designation_count.groupby([columnMapping["truth"]])["nid"].value_counts())
    except Exception:
        # No objects satisfy the requirements, return empty array
        return np.array([])
    designation_night_count.rename(columns={"nid": "num_nights"}, inplace=True)
    designation_night_count.reset_index(inplace=True)

    # Grab objects that meet the night requirement
    tracklet_nights_possible = designation_night_count[columnMapping["truth"]].value_counts()
    return tracklet_nights_possible.index[tracklet_nights_possible >= trackMinNights].values

observations["nid"] = calcNight(observations["epoch_mjd"])
findable_by_mops = calcFindableMOPS(observations)

These are the objects that should actually be findable by MOPS.

In [18]:
findable_by_mops

array([  7973,   7674,   7636, ..., 416418, 499300, 423577])
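
To make the criterion concrete, here is a toy version of the same idea: at least `trackletMinObs` detections on a night, on at least `trackMinNights` distinct nights. It is a simplified variant written with `groupby(...).size()` rather than a copy of the cell above, and the data are invented for illustration.

```python
import numpy as np
import pandas as pd

def calc_night(mjd, midnight=0.166):
    # Same night-bucketing idea as calcNight above, on any array-like of MJDs
    return np.floor(np.asarray(mjd) + 0.5 - midnight).astype(int)

# Object "A": two detections on each of three nights
# Object "B": two detections on one night, single detections on two others
toy = pd.DataFrame({
    "object_name": ["A"] * 6 + ["B"] * 4,
    "epoch_mjd": [52391.00, 52391.02, 52392.00, 52392.02, 52393.00, 52393.02,
                  52391.00, 52391.02, 52392.00, 52393.00],
})
toy["nid"] = calc_night(toy["epoch_mjd"])

# Detections per object per night
per_night = toy.groupby(["object_name", "nid"]).size().reset_index(name="num_obs")

# Keep nights with enough detections to form a tracklet (trackletMinObs=2)
tracklet_nights = per_night[per_night["num_obs"] >= 2]

# Count tracklet nights per object and apply trackMinNights=3
nights_per_object = tracklet_nights.groupby("object_name")["nid"].nunique()
toy_findable = nights_per_object.index[nights_per_object >= 3].tolist()
print(toy_findable)
```

Object "B" has ten percent fewer detections than a naive count would need to fail, yet it is still not findable here because only one of its nights can form a tracklet.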

Now let's modify the allTruths dataframe to update findability.

In [19]:
allTruths["findable"] = np.zeros(len(allTruths), dtype=int)
allTruths.loc[allTruths[columnMapping["truth"]].isin(findable_by_mops), "findable"] = 1

Let's confirm this worked as intended.

In [20]:
assert len(findable_by_mops) == len(allTruths[allTruths["findable"] == 1])
assert set(findable_by_mops) == set(allTruths[allTruths["findable"] == 1][columnMapping["truth"]].unique())

### Analyzing Linkages (Did I Find It?)

In [21]:
allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       allLinkages=None, 
                                                       allTruths=allTruths,
                                                       summary=summary,
                                                       minObs=6, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["-1", "-2"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 4085
Known truth partial linkages: 2678
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 6
False positive partial linkages: 0
Mixed linkages: 43231
Total linkages: 50000
Mixed linkage percentage (%): 86.462
Unique known truths linked: 3116
Unique known truths missed: 3558
Completeness (%): 46.689

Total time in seconds: 0.6326088905334473


The `analyzeLinkages` function returns three dataframes:
- allLinkages: each linkage is summarized as its own row. 
- allTruths: each truth is summarized as its own row. 
- summary: summary statistics in a single row. 

Let's now take a look at each individually.

In [22]:
allLinkages

Unnamed: 0,track_id,num_members,num_obs,pure,partial,mixed,contamination,linked_truth
0,1,1,7,1,0,0,,4.0
1,2,1,9,1,0,0,,4.0
2,3,1,11,1,0,0,,4.0
3,4,2,9,0,0,1,,
4,5,2,6,0,0,1,,
5,6,2,6,0,0,1,,
6,7,2,8,0,0,1,,
7,8,2,6,0,0,1,,
8,9,2,6,0,0,1,,
9,10,2,6,0,0,1,,


For each linkage defined in the `linkageMembers` dataframe, the allLinkages dataframe reports the number of unique truths ('num_members'), the number of unique observations ('num_obs'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage (if the linkage is 'partial') and, if the linkage is 'pure' or 'partial', the linked truth ('linked_truth').

Here we briefly summarize the different linkage types possible:
- 'pure': all observations in a linkage belong to a unique truth
- 'partial': up to a certain percentage of observations from other truths is allowed, so long as one truth has at least the minimum required number of unique observations
- 'mixed': a linkage containing observations that belong to different truths. We avoid using the word 'false' for these linkages as they may contain unknown truths depending on the use case. We leave interpretation up to the user.
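
The three definitions above can be sketched as a small classifier. This is an illustration of the definitions as worded here, not `difi`'s implementation: the function name `classify_linkage` is hypothetical, and details such as whether the threshold comparison is inclusive, or whether pure linkages must also satisfy `minObs`, are assumptions.

```python
from collections import Counter

def classify_linkage(truths, min_obs=5, contamination_threshold=0.2):
    """Classify one linkage from the truths of its (non-empty) observations.

    Returns (linkage_type, linked_truth); linked_truth is None for 'mixed'.
    """
    counts = Counter(truths)
    top_truth, top_count = counts.most_common(1)[0]
    if len(counts) == 1:
        return "pure", top_truth
    # Fraction of observations that do not belong to the dominant truth
    contamination = 1 - top_count / len(truths)
    if top_count >= min_obs and contamination <= contamination_threshold:
        return "partial", top_truth
    return "mixed", None

print(classify_linkage(["4"] * 7))                         # single truth
print(classify_linkage(["4"] * 6 + ["5"], min_obs=6))      # one contaminant in seven
print(classify_linkage(["4"] * 4 + ["5"] * 3, min_obs=6))  # no dominant truth
```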

In [23]:
allTruths

Unnamed: 0,object_name,num_obs,findable,found_pure,found_partial,found
0,-1,2346,0,1,0,1
1,-2,947,0,0,0,0
2,7359,26,1,1,1,1
3,7928,26,1,1,1,1
4,7888,26,1,1,1,1
5,7839,25,1,1,1,1
6,7704,25,1,1,1,1
7,7684,25,1,1,1,1
8,7930,25,1,1,1,1
9,7863,25,1,1,1,1


The allTruths dataframe shows, for each truth, whether it has been found in a pure or a partial linkage. If it was found in either, the 'found' column is set to 1.

Lastly, the summary dataframe contains overall statistics such as the number of truths found and the completeness (if calculable).

In [24]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations,num_unique_known_truths_found,num_unique_known_truths_missed,percent_completeness,num_known_truths_pure_linkages,num_known_truths_partial_linkages,num_unknown_truths_pure_linkages,num_unknown_truths_partial_linkages,num_false_positive_pure_linkages,num_false_positive_partial_linkages,num_mixed_linkages,num_total_linkages
0,17476,17474,8523,103188,0,3293,96.907429,0.0,3.092571,3116,3558,46.688642,4085,2678,0,0,6,0,43231,50000
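
The printed completeness and mixed-linkage percentage can be reproduced from the counts in this row. The formulas below are inferred from the numbers (found divided by found plus missed, and mixed divided by total) rather than taken from `difi`'s source:

```python
# Counts copied from the summary row above
num_found = 3116
num_missed = 3558
num_mixed = 43231
num_total_linkages = 50000

# Completeness: fraction of findable truths that were actually linked
percent_completeness = 100 * num_found / (num_found + num_missed)

# Share of all linkages that are mixed
percent_mixed = 100 * num_mixed / num_total_linkages

print(round(percent_completeness, 3), round(percent_mixed, 3))
```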


Notice that we passed the allTruths dataframe as a kwarg to the `analyzeLinkages` function. This allows the function to access the 'findable' column and calculate completeness. You do not need to pass either the allTruths dataframe or the summary dataframe to use `analyzeLinkages`. Below is an example.

In [25]:
allTruths.drop(columns=["findable"], inplace=True)

allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       minObs=5, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["-1", "-2"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 4085
Known truth partial linkages: 2678
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 6
False positive partial linkages: 34
Mixed linkages: 43197
Total linkages: 50000
Mixed linkage percentage (%): 86.394
Unique known truths linked: 3116
Unique known truths missed: nan
Completeness (%): nan

Total time in seconds: 0.59568190574646


