## Example - difi on THOR

[Assumed Inputs](#Assumed-Inputs)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In [1]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append("..")

import difi

In [2]:
DATA_DIR = "data/thor/"

### Assumed Inputs

Let's take a look at a sample observations file used by the THOR software.

In [3]:
observations = pd.read_csv(os.path.join(DATA_DIR, "projected_obs.txt"), sep=" ", index_col=False)

In [4]:
observations

Unnamed: 0,obsId,visitId,fieldId,fieldRA_deg,fieldDec_deg,exp_mjd,night,designation,code,mjd_utc,...,HEclObsy_X_au,HEclObsy_Y_au,HEclObsy_Z_au,EccAnom,TrueAnom,PosAngle_deg,theta_x_deg,theta_y_deg,theta_x_eq_deg,theta_y_eq_deg
0,2762372,446,446,176.282918,4.230249,59740.246528,59740,NS,,,...,-0.192653,-0.996779,0.000045,,,,0.489358,2.817035,1.717123,2.787966
1,144825,446,446,176.282918,4.230249,59740.246528,59740,b2059,I11,59740.246528,...,-0.192653,-0.996779,0.000045,125.676341,129.980599,108.604453,-0.641889,2.190939,0.346042,2.563633
2,144824,446,446,176.282918,4.230249,59740.246528,59740,b1866,I11,59740.246528,...,-0.192653,-0.996779,0.000045,140.549907,147.632484,122.050248,1.372264,0.791495,1.686559,0.279560
3,144823,446,446,176.282918,4.230249,59740.246528,59740,b1839,I11,59740.246528,...,-0.192653,-0.996779,0.000045,258.896559,250.194633,119.132929,-0.009073,1.042414,0.456107,1.103462
4,144822,446,446,176.282918,4.230249,59740.246528,59740,b1539,I11,59740.246528,...,-0.192653,-0.996779,0.000045,46.326690,52.505146,113.130574,1.716755,0.621586,1.946502,-0.042785
5,144821,446,446,176.282918,4.230249,59740.246528,59740,b1474,I11,59740.246528,...,-0.192653,-0.996779,0.000045,119.709166,127.918445,130.267875,1.104170,2.431128,2.151497,2.137097
6,144820,446,446,176.282918,4.230249,59740.246528,59740,b1466,I11,59740.246528,...,-0.192653,-0.996779,0.000045,212.840821,207.915614,126.843938,0.102350,0.515342,0.329567,0.502012
7,144826,446,446,176.282918,4.230249,59740.246528,59740,b2080,I11,59740.246528,...,-0.192653,-0.996779,0.000045,257.213574,245.918476,138.790294,-1.270692,1.931149,-0.372587,2.534027
8,144819,446,446,176.282918,4.230249,59740.246528,59740,b1381,I11,59740.246528,...,-0.192653,-0.996779,0.000045,198.005796,195.210983,121.170538,1.223943,1.661516,1.930031,1.268023
9,144817,446,446,176.282918,4.230249,59740.246528,59740,b0578,I11,59740.246528,...,-0.192653,-0.996779,0.000045,113.426548,116.543209,119.622949,0.236565,1.059000,0.701295,1.022722


In [5]:
observations.columns

Index(['obsId', 'visitId', 'fieldId', 'fieldRA_deg', 'fieldDec_deg', 'exp_mjd',
       'night', 'designation', 'code', 'mjd_utc', 'Delta_au', 'RA_deg',
       'Dec_deg', 'dDelta/dt_au_p_day', 'dRA/dt_deg_p_day',
       'dDec/dt_deg_p_day', 'VMag', 'Alt_deg', 'PhaseAngle_deg',
       'LunarElon_deg', 'LunarAlt_deg', 'LunarPhase', 'SolarElon_deg',
       'SolarAlt_deg', 'r_au', 'HLon_deg', 'HLat_deg', 'TLon_deg', 'TLat_deg',
       'TOCLon_deg', 'TOCLat_deg', 'HOCLon_deg', 'HOCLat_deg', 'TOppLon_deg',
       'TOppLat_deg', 'HEclObj_X_au', 'HEclObj_Y_au', 'HEclObj_Z_au',
       'HEclObj_dX/dt_au_p_day', 'HEclObj_dY/dt_au_p_day',
       'HEclObj_dZ/dt_au_p_day', 'HEclObsy_X_au', 'HEclObsy_Y_au',
       'HEclObsy_Z_au', 'EccAnom', 'TrueAnom', 'PosAngle_deg', 'theta_x_deg',
       'theta_y_deg', 'theta_x_eq_deg', 'theta_y_eq_deg'],
      dtype='object')

Many of these columns carry useful information, depending on the use case. For `difi` to work, it only needs to know about two columns in the observations dataframe: i) the observation ID column and ii) the truth column. The true object does not have to be known for every observation: the `unknownIDs` and `falsePositiveIDs` arguments can be set to flag such observations.

Let's also take a look at an example output from the THOR linking algorithm.

In [6]:
linkageMembers = pd.read_csv(os.path.join(DATA_DIR, "clusterMembers.txt"), sep=" ", index_col=False)

In [7]:
linkageMembers

Unnamed: 0,cluster_id,obsId
0,1,146778
1,1,1457822
2,1,1457597
3,1,8762195
4,1,2131794
5,2,803967
6,2,5192185
7,2,1465807
8,2,8789778
9,2,9994899


The linkage dataframe has just two columns, both of which `difi` needs. The first column holds the linkage ID, and the second lists each unique observation in that linkage, one row per observation.

Let's take a look at two specific examples:

In [8]:
linkageMembers[linkageMembers["cluster_id"] == 4]

Unnamed: 0,cluster_id,obsId
15,4,5124233
16,4,1108512
17,4,1439908
18,4,7518417
19,4,2114059


In [9]:
linkageMembers[linkageMembers["cluster_id"] == 14]

Unnamed: 0,cluster_id,obsId
66,14,145253
67,14,794486
68,14,6363315
69,14,7559849
70,14,9966076


Between the observations and linkageMembers dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

So let's define a dictionary that tells `difi` which columns to use for this information.

In [10]:
columnMapping = {
    # difi column name : data column name
    "linkage_id" : "cluster_id",
    "obs_id" : "obsId",
    "truth" : "designation"
}
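
Before handing a mapping like this to `difi`, it can help to confirm that every mapped column actually exists in the dataframe that should contain it. The helper below is a minimal sketch and is not part of `difi`: the function name `missing_columns` and the toy dataframes are hypothetical and exist only for illustration.

```python
import pandas as pd


def missing_columns(observations, linkage_members, column_mapping):
    """Hypothetical helper (not part of difi): list mapped columns that are absent.

    Returns a list of (dataframe name, column name) tuples; an empty list
    means every mapped column was found.
    """
    missing = []
    # The observations dataframe should hold the observation ID and truth columns.
    for key in ("obs_id", "truth"):
        if column_mapping[key] not in observations.columns:
            missing.append(("observations", column_mapping[key]))
    # The linkage dataframe should hold the linkage ID and observation ID columns.
    for key in ("linkage_id", "obs_id"):
        if column_mapping[key] not in linkage_members.columns:
            missing.append(("linkageMembers", column_mapping[key]))
    return missing


# Toy data with the same column names as the THOR files above.
toy_observations = pd.DataFrame({"obsId": [1, 2, 3], "designation": ["a", "a", "NS"]})
toy_linkage_members = pd.DataFrame({"cluster_id": [1, 1], "obsId": [1, 2]})
toy_mapping = {"linkage_id": "cluster_id", "obs_id": "obsId", "truth": "designation"}

ok_result = missing_columns(toy_observations, toy_linkage_members, toy_mapping)

# A mapping that points at a truth column that does not exist.
bad_mapping = dict(toy_mapping, truth="object_name")
bad_result = missing_columns(toy_observations, toy_linkage_members, bad_mapping)
```

With the real data, the same call would be `missing_columns(observations, linkageMembers, columnMapping)`.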

### Analyzing Observations (Can I Find It?) 

Determining how a linking algorithm performs involves knowing what it should be able to link. `difi` comes with a function that analyzes findability with a simple assumption: any truth with at least `minObs` observations should be findable. We leave it to the user to determine more complicated findability metrics for their specific science cases.

Let's see what should be findable with THOR with this simple metric.

In [11]:
allTruths, summary = difi.analyzeObservations(observations,
                                              minObs=5,
                                              unknownIDs=[],
                                              falsePositiveIDs=["NS"],
                                              verbose=True,
                                              columnMapping=columnMapping)

Analyzing observations...
Known truth observations: 128734
Unknown truth observations: 0
False positive observations: 103070
Percent known truth observations (%): 55.536
Percent unknown truth observations (%): 0.000
Percent false positive observations (%): 44.464
Unique truths: 24450
Unique known truths : 24449
Unique known truths with at least 5 detections: 18273

Total time in seconds: 0.22783470153808594



The `analyzeObservations` function returns two dataframes. Let's take a look at both of them in a little detail.

The allTruths dataframe lists each unique truth as a row, with columns that give the number of observations each truth has and whether it is findable (that is, whether it has at least `minObs` observations).

In [12]:
allTruths

Unnamed: 0,designation,num_obs,findable
0,NS,103070,0
1,99955,8,1
2,00828,8,1
3,K10O64R,8,1
4,R8294,8,1
5,I4512,8,1
6,K11B44S,8,1
7,W8603,8,1
8,Q5248,8,1
9,D7237,8,1


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [13]:
findable_known_truths = allTruths[allTruths["findable"] == 1]["designation"].unique()
not_findable_known_truths = allTruths[(allTruths["findable"] == 0) & (~allTruths["designation"].isin(["NS"]))]["designation"].unique()

As stated earlier, the `analyzeObservations` function has a very simple findability criterion. The user can write their own metric and then generate a dataframe in the same style as the one above. Doing so allows `difi` to calculate summary statistics, but it is not necessary in order to determine whether truths were linked.

Before we see how THOR performed, let's take a look at the summary dataframe.
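
As an illustration of a user-defined metric, the sketch below marks a truth as findable only if it has enough observations and those observations span a minimum number of distinct nights. This is a hypothetical criterion, not one that `difi` provides: the function name `findable_by_nights`, the `min_nights` parameter, and the toy data are all assumptions. The returned dataframe uses the same three columns as the allTruths dataframe shown above.

```python
import pandas as pd


def findable_by_nights(observations, truth_col="designation", night_col="night",
                       min_obs=5, min_nights=2, excluded=("NS",)):
    """Hypothetical findability metric (not part of difi).

    A truth is findable if it has at least `min_obs` observations spread over
    at least `min_nights` distinct nights and is not in `excluded`.
    """
    # Count observations and distinct nights for each truth.
    all_truths = (
        observations.groupby(truth_col)
        .agg(num_obs=(night_col, "size"), num_nights=(night_col, "nunique"))
        .reset_index()
    )
    is_findable = (
        (all_truths["num_obs"] >= min_obs)
        & (all_truths["num_nights"] >= min_nights)
        & ~all_truths[truth_col].isin(excluded)
    )
    all_truths["findable"] = is_findable.astype(int)
    # Keep the same column layout as the allTruths dataframe above.
    return all_truths[[truth_col, "num_obs", "findable"]]


# Toy observations: "a" spans two nights, "b" has a single night,
# "c" has too few observations, and "NS" is treated as noise.
toy_observations = pd.DataFrame({
    "designation": ["a"] * 5 + ["b"] * 5 + ["c"] * 3 + ["NS"] * 6,
    "night": [1, 1, 1, 2, 2] + [1] * 5 + [1, 2, 3] + [1, 1, 2, 2, 3, 3],
})
toy_truths = findable_by_nights(toy_observations)
print(toy_truths)
```

Only truth "a" is marked findable here: "b" fails the night requirement, "c" has too few observations, and "NS" is excluded explicitly.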

In [14]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations
0,24450,24449,18273,128734,0,103070,55.535711,0.0,44.464289


In [15]:
summary.columns

Index(['num_unique_truths', 'num_unique_known_truths',
       'num_unique_known_truths_findable', 'num_known_truth_observations',
       'num_unknown_truth_observations', 'num_false_positive_observations',
       'percent_known_truth_observations',
       'percent_unknown_truth_observations',
       'percent_false_positive_observations'],
      dtype='object')

The summary dataframe is a very simple single row dataframe with some summary numbers about the given observations. 

One thing to note: any observations whose truths are listed in either `unknownIDs` or `falsePositiveIDs` are ignored when counting the number of truths that should be findable. For example, in the function call a few cells earlier the `falsePositiveIDs` kwarg was set to `["NS"]`, so 'NS' is not findable despite its large number of observations.

In [16]:
allTruths[allTruths["findable"] == 0].head()

Unnamed: 0,designation,num_obs,findable
0,NS,103070,0
18274,Q3224,4,0
18275,K15RJ2O,4,0
18276,U0828,4,0
18277,A3895,4,0


### Analyzing Linkages (Did I Find It?)

In [17]:
allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       allLinkages=None, 
                                                       allTruths=allTruths,
                                                       summary=summary,
                                                       minObs=5, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["NS"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 29441
Known truth partial linkages: 9418
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 341
False positive partial linkages: 60
Mixed linkages: 74011
Total linkages: 113271
Mixed linkage percentage (%): 65.340
Unique known truths linked: 11780
Unique known truths missed: 6493
Completeness (%): 64.467

Total time in seconds: 0.9671170711517334


The `analyzeLinkages` function returns three dataframes:
- allLinkages: each linkage is summarized as its own row. 
- allTruths: each truth is summarized as its own row. 
- summary: summary statistics in a single row. 

Let's now take a look at each individually.

In [18]:
allLinkages[allLinkages["mixed"] == 1]

Unnamed: 0,cluster_id,num_members,num_obs,pure,partial,mixed,contamination,linked_truth
0,1,5,5,0,0,1,,
1,2,3,5,0,0,1,,
2,3,5,5,0,0,1,,
3,4,4,5,0,0,1,,
4,5,5,5,0,0,1,,
5,6,5,5,0,0,1,,
6,7,4,5,0,0,1,,
7,8,3,5,0,0,1,,
8,9,5,5,0,0,1,,
9,10,3,5,0,0,1,,


For each cluster in the `linkageMembers` dataframe, allLinkages reports the number of unique truths ('num_members'), the number of unique observations ('num_obs'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage (if the linkage is 'partial') and, if the linkage is either 'pure' or 'partial', the linked truth ('linked_truth').

Here we briefly summarize the different linkage types possible:
- 'pure': all observations in a linkage belong to a unique truth
- 'partial': observations from other truths are allowed up to a certain percentage (the contamination threshold), so long as one truth has at least the minimum required number of unique observations
- 'mixed': a linkage containing observations that belong to different truths. We avoid using the word 'false' for these clusters as they may contain unknown truths depending on the use case. We leave interpretation up to the user.
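
The three definitions above can be sketched as a small classifier over the list of truths attached to one linkage's observations. This is a simplified illustration and not `difi`'s implementation: the function name `classify_linkage` is hypothetical, and treating the contamination threshold as inclusive and not applying `min_obs` to pure linkages are both assumptions.

```python
from collections import Counter


def classify_linkage(truths, min_obs=5, contamination_threshold=0.2):
    """Hypothetical classifier (not difi's implementation).

    `truths` is the list of truth labels of the observations in one linkage.
    Returns (linkage type, linked truth or None, contamination fraction).
    """
    counts = Counter(truths)
    top_truth, top_count = counts.most_common(1)[0]
    # Fraction of observations that do not belong to the most common truth.
    contamination = 1 - top_count / len(truths)

    if len(counts) == 1:
        # Every observation belongs to the same truth.
        return "pure", top_truth, 0.0
    if top_count >= min_obs and contamination <= contamination_threshold:
        # One truth dominates and the contamination is within the threshold.
        return "partial", top_truth, contamination
    return "mixed", None, contamination


pure_example = classify_linkage(["a"] * 5)
partial_example = classify_linkage(["a"] * 5 + ["b"])  # contamination = 1/6
mixed_example = classify_linkage(["a", "b", "c", "d", "e"])
```

With five observations and a threshold of 0.2, a single foreign observation already gives a contamination of exactly 0.2, which this sketch would reject for having fewer than `min_obs` observations of the dominant truth. How `difi` handles such edge cases is not shown in this notebook.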

In [19]:
allTruths

Unnamed: 0,designation,num_obs,findable,found_pure,found_partial,found
0,NS,103070,0,1,1,1
1,99955,8,1,1,0,1
2,00828,8,1,1,1,1
3,K10O64R,8,1,1,0,1
4,R8294,8,1,0,0,0
5,I4512,8,1,0,1,1
6,K11B44S,8,1,1,1,1
7,W8603,8,1,1,1,1
8,Q5248,8,1,0,0,0
9,D7237,8,1,1,0,1


The allTruths dataframe now shows, for each truth, whether it has been found in a pure linkage ('found_pure') or a partial linkage ('found_partial'). If it was found in either, the 'found' column is set to 1.

Lastly, the summary dataframe contains overall statistics such as the number of truths found, the completeness (if calculable) and the number of linkages of each type.

In [20]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations,num_unique_known_truths_found,num_unique_known_truths_missed,percent_completeness,num_known_truths_pure_linkages,num_known_truths_partial_linkages,num_unknown_truths_pure_linkages,num_unknown_truths_partial_linkages,num_false_positive_pure_linkages,num_false_positive_partial_linkages,num_mixed_linkages,num_total_linkages
0,24450,24449,18273,128734,0,103070,55.535711,0.0,44.464289,11780,6493,64.4667,29441,9418,0,0,341,60,74011,113271
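
The derived numbers in this summary can be reproduced from its counts. The arithmetic below is a sketch of relations that are consistent with the printed output (completeness as found truths over findable truths, and the mixed percentage as mixed linkages over total linkages). It is inferred from the numbers above rather than taken from the `difi` source.

```python
# Counts copied from the summary dataframe above.
num_findable = 18273   # num_unique_known_truths_findable
num_found = 11780      # num_unique_known_truths_found
num_mixed = 74011      # num_mixed_linkages
num_total = 113271     # num_total_linkages

# Assumed relations, inferred from the printed output.
num_missed = num_findable - num_found
percent_completeness = 100 * num_found / num_findable
percent_mixed = 100 * num_mixed / num_total

print(f"Unique known truths missed: {num_missed}")          # 6493
print(f"Completeness (%): {percent_completeness:.3f}")      # 64.467
print(f"Mixed linkage percentage (%): {percent_mixed:.3f}") # 65.340
```

Both the missed count and the completeness depend on the number of findable truths, which comes from the 'findable' column of allTruths.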


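Notice that we passed the allTruths dataframe as a kwarg to the `analyzeLinkages` function. This allows the function to access the 'findable' column and calculate completeness. You do not need to pass the allTruths dataframe or the summary dataframe to use `analyzeLinkages`, but without them the number of missed truths and the completeness cannot be calculated and are reported as `nan`. Below is an example. Note that it also sets `falsePositiveIDs` to `["-1"]` rather than `["NS"]`, so the 'NS' observations are counted as a known truth in its output.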

In [21]:
observations = pd.read_csv(os.path.join(DATA_DIR, "projected_obs.txt"), sep=" ", index_col=False)
linkageMembers = pd.read_csv(os.path.join(DATA_DIR, "clusterMembers.txt"), sep=" ", index_col=False)

allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       minObs=5, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["-1"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 29782
Known truth partial linkages: 9478
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 0
False positive partial linkages: 0
Mixed linkages: 74011
Total linkages: 113271
Mixed linkage percentage (%): 65.340
Unique known truths linked: 11781
Unique known truths missed: nan
Completeness (%): nan

Total time in seconds: 1.1861398220062256


