## Example - difi on THOR

[Assumed Inputs](#Assumed-Inputs)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In [1]:
import os
import sys
import numpy as np
import pandas as pd

sys.path.append("..")

import difi

In [2]:
DATA_DIR = "data/thor/"

### Assumed Inputs

Let's take a look at a sample observations file used by the THOR software.

In [3]:
observations = pd.read_csv(os.path.join(DATA_DIR, "projected_obs.txt"), sep=" ", index_col=False)

In [4]:
observations.head()

Unnamed: 0,ra,decl,field,nid,jd,exp_mjd,magpsf,sigmapsf,fid,ssnamenr,...,HEclObsy_Z_au,obs_id,designation,splitname,r_au,r_au_y,theta_x_deg,theta_y_deg,theta_x_eq_deg,theta_y_eq_deg
0,288.778972,-10.127515,385,612,2458367.0,58366.128588,19.0281,0.179443,2,72614,...,4.6e-05,96474,72614,72614,3.00519,3.00519,-1.34852,-0.421103,-1.714993,-0.997474
1,288.697314,-7.795108,385,612,2458367.0,58366.128588,19.6607,0.206323,2,251274,...,4.6e-05,96460,P1274,251274,2.499211,2.499211,-1.083862,1.172751,-1.796652,1.334933
2,291.193071,-6.171957,385,612,2458367.0,58366.128588,18.5207,0.13032,2,6446,...,4.6e-05,96459,06446,6446,2.233605,2.233605,0.918697,1.866463,0.699105,2.958085
3,287.070884,-10.26317,385,612,2458367.0,58366.128588,19.0319,0.142798,2,178915,...,4.6e-05,96458,H8915,178915,2.090124,2.090124,-2.590771,-0.270692,-3.423081,-1.133128
4,288.418863,-6.940926,385,612,2458367.0,58366.128588,18.4752,0.095678,2,51176,...,4.6e-05,96457,51176,51176,2.526557,2.526557,-1.160906,1.796177,-2.075103,2.189116


In [5]:
observations.columns

Index(['ra', 'decl', 'field', 'nid', 'jd', 'exp_mjd', 'magpsf', 'sigmapsf',
       'fid', 'ssnamenr', 'fieldRA_deg', 'fieldDec_deg', 'visit_id', 'mjd',
       'HEclObsy_X_au', 'HEclObsy_Y_au', 'HEclObsy_Z_au', 'obs_id',
       'designation', 'splitname', 'r_au', 'r_au_y', 'theta_x_deg',
       'theta_y_deg', 'theta_x_eq_deg', 'theta_y_eq_deg'],
      dtype='object')

There are potentially many different columns that carry useful information depending on the use case. To get `difi` to work, we only need to tell it about two columns in the observations dataframe: i) the observation ID column and ii) the truth column. Note that you do not need to know which object every single observation belongs to. You can set flags (the `unknownIDs` and `falsePositiveIDs` kwargs used later) to account for observations whose truth is unknown.
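
As a minimal sketch of how a truth column with gaps might be prepared, the snippet below fills missing truths with a single placeholder value. The toy dataframe, the placeholder string `"unknown"`, and the idea that this placeholder would then be listed in `unknownIDs` are assumptions made for illustration, not something taken from the THOR data or the `difi` documentation.

```python
import numpy as np
import pandas as pd

# Toy observations (hypothetical values): two detections have no known truth.
toy_observations = pd.DataFrame({
    "obs_id": [1, 2, 3, 4, 5],
    "designation": ["72614", np.nan, "P1274", "-1", np.nan],
})

# Hypothetical placeholder for observations whose truth is not known.
UNKNOWN_PLACEHOLDER = "unknown"
toy_observations["designation"] = (
    toy_observations["designation"].fillna(UNKNOWN_PLACEHOLDER).astype(str)
)

# Assumption: lists like these would be passed as the unknownIDs and
# falsePositiveIDs kwargs.
unknown_ids = [UNKNOWN_PLACEHOLDER]
false_positive_ids = ["-1"]

num_unknown = toy_observations["designation"].isin(unknown_ids).sum()
num_false_positive = toy_observations["designation"].isin(false_positive_ids).sum()
print(num_unknown, num_false_positive)
```

Keeping the truth column as strings avoids mixing numeric and string designations such as `72614` and `P1274`.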

Let's also take a look at an example output from the THOR linking algorithm.

In [6]:
linkageMembers = pd.read_csv(os.path.join(DATA_DIR, "clusterMembers.txt"), sep=" ", index_col=False)

In [7]:
linkageMembers.head(15)

Unnamed: 0,cluster_id,obs_id
0,1,96429
1,1,96524
2,1,96622
3,1,96644
4,1,96826
5,2,228575
6,2,228655
7,2,228668
8,2,228747
9,2,228839


The linkage dataframe has just two columns, both of which `difi` needs. The first column holds the linkage ID and the second holds an observation ID, so each linkage spans one row per unique observation it contains.
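
A quick way to sanity-check a table in this layout is to count the observations per linkage with `pandas`. The sketch below uses a small toy dataframe whose values are copied from the `head()` output above. The variable names are illustrative.

```python
import pandas as pd

# Toy linkage table in the same two-column layout as clusterMembers.txt.
toy_linkage_members = pd.DataFrame({
    "cluster_id": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "obs_id": [96429, 96524, 96622, 96644, 96826,
               228575, 228655, 228668, 228747, 228839],
})

# Number of unique observations in each linkage.
obs_per_linkage = toy_linkage_members.groupby("cluster_id")["obs_id"].nunique()
print(obs_per_linkage)

# Each (linkage, observation) pair is expected to appear only once.
has_duplicates = toy_linkage_members.duplicated(subset=["cluster_id", "obs_id"]).any()
print(has_duplicates)
```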

Let's take a look at two specific examples:

In [8]:
linkageMembers[linkageMembers["cluster_id"] == 4]

Unnamed: 0,cluster_id,obs_id
15,4,98994
16,4,99324
17,4,99518
18,4,99734
19,4,100071


In [9]:
linkageMembers[linkageMembers["cluster_id"] == 14]

Unnamed: 0,cluster_id,obs_id
66,14,93475
67,14,93578
68,14,93645
69,14,93704
70,14,93820
71,14,93873
72,14,93969


Between the observations and linkageMembers dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

So let's define a dictionary that tells `difi` which columns to use for this information.

In [10]:
columnMapping = {
    # difi column name : data column name
    "linkage_id" : "cluster_id",
    "obs_id" : "obs_id",
    "truth" : "designation"
}

### Analyzing Observations (Can I Find It?) 

Determining how a linking algorithm performs involves knowing what it should be able to link. `difi` comes with a function that analyzes findability with a simple assumption: any truth with at least x observations should be findable. We leave it to the user to determine more complicated findability metrics for their specific science cases.
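
The simple criterion can be sketched with plain `pandas` as follows. This is an illustration of the idea on toy data, not the code `difi` runs internally. The toy values and the variable names are assumptions.

```python
import pandas as pd

# Toy observations (hypothetical values) with a truth column named "designation".
toy_observations = pd.DataFrame({
    "obs_id": range(1, 13),
    "designation": ["A"] * 6 + ["B"] * 4 + ["-1"] * 2,
})

min_obs = 5
false_positive_ids = ["-1"]

# Count the unique observations that belong to each truth.
toy_truths = (
    toy_observations.groupby("designation")["obs_id"]
    .nunique()
    .rename("num_obs")
    .reset_index()
)

# A truth is findable if it has at least min_obs observations
# and is not one of the false positive IDs.
toy_truths["findable"] = (
    (toy_truths["num_obs"] >= min_obs)
    & ~toy_truths["designation"].isin(false_positive_ids)
).astype(int)
print(toy_truths)
```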

Let's see what should be findable with THOR using this simple metric.

In [11]:
allTruths, summary = difi.analyzeObservations(observations,
                                              minObs=5,
                                              unknownIDs=[],
                                              falsePositiveIDs=["-1"],
                                              verbose=True,
                                              columnMapping=columnMapping)

Analyzing observations...
Known truth observations: 6114
Unknown truth observations: 0
False positive observations: 13
Percent known truth observations (%): 99.788
Percent unknown truth observations (%): 0.000
Percent false positive observations (%): 0.212
Unique truths: 1994
Unique known truths : 1993
Unique known truths with at least 5 detections: 412

Total time in seconds: 0.027384042739868164



The `analyzeObservations` function returns two dataframes. Let's take a look at both of them in a little detail.

The allTruths dataframe lists each unique truth as a row, with columns that give the number of observations for that truth and whether it is findable (that is, whether it has at least `minObs` observations).

In [12]:
allTruths.head()

Unnamed: 0,designation,num_obs,findable
0,22014,26,1
1,23845,18,1
2,67372,17,1
3,01724,16,1
4,d6586,16,1


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [13]:
findable_objects = allTruths[allTruths["findable"] == 1]["designation"].unique()
not_findable_objects = allTruths[allTruths["findable"] == 0]["designation"].unique()

As stated earlier, the `analyzeObservations` function has a very simple findability criterion. The user can write their own metric and then generate a dataframe in the same style as the one above. Doing so will allow `difi` to calculate summary statistics. However, this is not necessary to proceed with determining if truths were linked.
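
As a hedged example of a user-defined metric, the sketch below requires a truth to have at least `min_obs` observations spread over at least `min_nights` distinct nights, and returns a dataframe with the same `designation`, `num_obs` and `findable` columns as the one above. The function name `custom_findable`, the `min_nights` parameter, the toy data, and the crude definition of a night as the integer part of `exp_mjd` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy observations (hypothetical values).
toy_observations = pd.DataFrame({
    "obs_id": range(1, 12),
    "designation": ["A"] * 6 + ["B"] * 5,
    "exp_mjd": [58366.1, 58366.2, 58367.1, 58367.2, 58368.1, 58368.2,
                58366.1, 58366.15, 58366.2, 58366.25, 58366.3],
})

def custom_findable(observations, min_obs=5, min_nights=3):
    """Hypothetical metric: enough observations on enough distinct nights."""
    obs = observations.copy()
    # Crude night identifier: integer part of the MJD (an assumption).
    obs["night"] = np.floor(obs["exp_mjd"]).astype(int)
    per_truth = obs.groupby("designation").agg(
        num_obs=("obs_id", "nunique"),
        num_nights=("night", "nunique"),
    ).reset_index()
    per_truth["findable"] = (
        (per_truth["num_obs"] >= min_obs)
        & (per_truth["num_nights"] >= min_nights)
    ).astype(int)
    # Same column layout as the allTruths dataframe shown above.
    return per_truth[["designation", "num_obs", "findable"]]

custom_truths = custom_findable(toy_observations)
print(custom_truths)
```

Here truth `A` has six observations over three nights and is flagged findable, while truth `B` has five observations on a single night and is not.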

Before we see how THOR performed, let's take a look at the summary dataframe.

In [14]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations
0,1994,1993,412,6114,0,13,99.787824,0.0,0.212176


In [15]:
summary.columns

Index(['num_unique_truths', 'num_unique_known_truths',
       'num_unique_known_truths_findable', 'num_known_truth_observations',
       'num_unknown_truth_observations', 'num_false_positive_observations',
       'percent_known_truth_observations',
       'percent_unknown_truth_observations',
       'percent_false_positive_observations'],
      dtype='object')

The summary dataframe is a very simple single row dataframe with some summary numbers about the given observations. 

One thing to note: any observations with truths that are listed in either `unknownIDs` or `falsePositiveIDs` are ignored when counting the number of truths that should be findable. For example, in the function call a few cells earlier the `falsePositiveIDs` kwarg was set to `["-1"]`.

In [16]:
allTruths[allTruths["findable"] == 0].head()

Unnamed: 0,designation,num_obs,findable
12,-1,13,0
413,46529,4,0
414,L9405,4,0
415,47014,4,0
416,A8885,4,0


### Analyzing Linkages (Did I Find It?)

In [17]:
allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       allLinkages=None, 
                                                       allTruths=allTruths,
                                                       summary=summary,
                                                       minObs=5, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["-1"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 59
Known truth partial linkages: 0
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 0
False positive partial linkages: 0
Mixed linkages: 9
Total linkages: 68
Mixed linkage percentage (%): 13.235
Unique known truths linked: 46
Unique known truths missed: 366
Completeness (%): 11.165

Total time in seconds: 0.10628318786621094


The `analyzeLinkages` function returns three dataframes:
- allLinkages: each linkage is summarized as its own row. 
- allTruths: each truth is summarized as its own row. 
- summary: summary statistics in a single row. 

Let's now take a look at each individually.

In [18]:
allLinkages.head(10)

Unnamed: 0,cluster_id,num_members,num_obs,pure,partial,mixed,contamination,linked_truth
0,1,1,5,1,0,0,,10051
1,2,1,5,1,0,0,,84162
2,3,1,5,1,0,0,,84162
3,4,1,5,1,0,0,,14299
4,5,1,5,1,0,0,,20751
5,6,1,5,1,0,0,,60892
6,7,1,5,1,0,0,,84070
7,8,1,6,1,0,0,,48687
8,9,1,5,1,0,0,,43284
9,10,1,5,1,0,0,,8111


For each cluster in `linkageMembers`, the allLinkages dataframe reports the number of unique truths ('num_members'), the number of unique observations ('num_obs'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage ('contamination', set only if the linkage is 'partial') and, if the linkage is either 'pure' or 'partial', the linked truth ('linked_truth').

Here we briefly summarize the different linkage types possible:
- 'pure': all observations in a linkage belong to a unique truth
- 'partial': up to a certain percentage of observations from other truths is allowed, so long as one truth has at least the minimum required number of unique observations
- 'mixed': a linkage containing observations that belong to different truths. We avoid using the word 'false' for these clusters as they may contain unknown truths depending on the use case, and we leave interpretation up to the user.
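
The three definitions above can be sketched as a small classifier that looks only at the truths of the observations in one linkage. This is one reading of the definitions, not `difi`'s implementation: the function name `classify_linkage`, the use of `<=` at the threshold boundary, and the choice to measure contamination against the most common truth are assumptions, and the sketch does not treat unknown or false positive IDs specially.

```python
import pandas as pd

def classify_linkage(truths, min_obs=5, contamination_threshold=0.2):
    """Classify one linkage from the truths of its observations (illustrative only)."""
    # value_counts sorts in descending order, so the first entry is the dominant truth.
    counts = pd.Series(truths).value_counts()
    if len(counts) == 1:
        return "pure"
    # Fraction of observations that do not belong to the dominant truth.
    contamination = 1.0 - counts.iloc[0] / counts.sum()
    if counts.iloc[0] >= min_obs and contamination <= contamination_threshold:
        return "partial"
    return "mixed"

pure_example = classify_linkage(["A"] * 5)
partial_example = classify_linkage(["A"] * 5 + ["B"])      # 1 of 6 observations contaminated
mixed_example = classify_linkage(["A"] * 3 + ["B"] * 2)    # no truth reaches min_obs
print(pure_example, partial_example, mixed_example)
```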

In [19]:
allTruths.head(10)

Unnamed: 0,designation,num_obs,findable,found_pure,found_partial,found
0,22014,26,1,0,0,0
1,23845,18,1,0,0,0
2,67372,17,1,0,0,0
3,01724,16,1,0,0,0
4,d6586,16,1,0,0,0
5,14562,15,1,0,0,0
6,17633,15,1,0,0,0
7,04068,14,1,0,0,0
8,38026,14,1,0,0,0
9,90337,14,1,0,0,0


The allTruths dataframe now shows, for each truth, whether it was found in a pure or a partial linkage. If it was found in either, the found column is set to 1.
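
These columns are enough to recompute a completeness figure by hand. The sketch below does so on a toy dataframe in the same layout, assuming completeness is the percentage of findable truths that were found, which is consistent with the verbose output above (46 linked out of 412 findable). The toy values and variable names are illustrative.

```python
import pandas as pd

# Toy dataframe in the same layout as allTruths (hypothetical values).
toy_truths = pd.DataFrame({
    "designation": ["A", "B", "C", "D"],
    "findable": [1, 1, 1, 0],
    "found_pure": [1, 0, 0, 0],
    "found_partial": [0, 1, 0, 0],
})
# A truth counts as found if it appears in a pure or a partial linkage.
toy_truths["found"] = (
    (toy_truths["found_pure"] == 1) | (toy_truths["found_partial"] == 1)
).astype(int)

findable = toy_truths["findable"] == 1
found = toy_truths["found"] == 1

# Assumed definition: percentage of findable truths that were found.
completeness = 100.0 * (findable & found).sum() / findable.sum()
missed = toy_truths[findable & ~found]["designation"].tolist()
print(completeness, missed)

# The same arithmetic with the numbers printed by analyzeLinkages above.
notebook_completeness = 100.0 * 46 / 412
print(round(notebook_completeness, 3))
```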

Lastly, the summary dataframe contains overall statistics such as the number of truths found and missed, the completeness (if calculable) and the number of linkages of each type.

In [20]:
summary

Unnamed: 0,num_unique_truths,num_unique_known_truths,num_unique_known_truths_findable,num_known_truth_observations,num_unknown_truth_observations,num_false_positive_observations,percent_known_truth_observations,percent_unknown_truth_observations,percent_false_positive_observations,num_unique_known_truths_found,num_unique_known_truths_missed,percent_completeness,num_known_truths_pure_linkages,num_known_truths_partial_linkages,num_unknown_truths_pure_linkages,num_unknown_truths_partial_linkages,num_false_positive_pure_linkages,num_false_positive_partial_linkages,num_mixed_linkages,num_total_linkages
0,1994,1993,412,6114,0,13,99.787824,0.0,0.212176,46,366,11.165049,59,0,0,0,0,0,9,68


Notice that we passed the allTruths dataframe as a kwarg to the `analyzeLinkages` function. This allows the function to access the 'findable' column and calculate completeness. You do not need to pass the allTruths dataframe or the summary dataframe to use `analyzeLinkages`, but without them the number of missed truths and the completeness are reported as `nan`, as in the example below.

In [21]:
observations = pd.read_csv(os.path.join(DATA_DIR, "projected_obs.txt"), sep=" ", index_col=False)
linkageMembers = pd.read_csv(os.path.join(DATA_DIR, "clusterMembers.txt"), sep=" ", index_col=False)

allLinkages, allTruths, summary = difi.analyzeLinkages(observations, 
                                                       linkageMembers, 
                                                       minObs=5, 
                                                       contaminationThreshold=0.2, 
                                                       unknownIDs=[],
                                                       falsePositiveIDs=["-1"],
                                                       verbose=True,
                                                       columnMapping=columnMapping)

Analyzing linkages...
Known truth pure linkages: 59
Known truth partial linkages: 0
Unknown truth pure linkages: 0
Unknown truth partial linkages: 0
False positive pure linkages: 0
False positive partial linkages: 0
Mixed linkages: 9
Total linkages: 68
Mixed linkage percentage (%): 13.235
Unique known truths linked: 46
Unique known truths missed: nan
Completeness (%): nan

Total time in seconds: 0.08266019821166992


