## Tutorial - A Case of Solar System Small-Body Linking

[Assumed Inputs](#Assumed-Inputs)  
[i) observations](#i.-observations)  
[ii) truth classes [optional]](#ii.-truth-classes-[optional])  
[iii) linkage_members](#iii.-linkage_members)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In this tutorial, we are going to take a look at an example use case for `difi`: Solar System small-body linking. One of the goals of Solar System science is discovering new asteroids, comets, and natural satellites, celestial objects that are commonly referred to as small bodies. A variety of algorithms perform this task, but all of them produce essentially the same data products and need the same form of analysis.

Each algorithm generates a proposed set of linkages of observations belonging to a unique set of objects. For example, when a linking algorithm is applied to a dataset containing several observations of the minor planet Ceres, it should be able to recover Ceres's observations as a proposed linkage. Given knowledge of which observations belong to which object (i.e., using a test dataset), we can use `difi` to analyze how well our linking algorithm performed.

In [1]:
import pandas as pd

import difi

from difi import __version__
print("difi version: {}".format(__version__))

difi version: 1.2.dev38+g0411672.d20230417


### Assumed Inputs

#### i. observations

Let's take a look at a sample observations file from a linking algorithm called THOR. Do not worry too much about the details of the columns or of the linking algorithm. We will find that `difi` only needs a small selection of the columns in the file.

In [2]:
observations = pd.read_csv(
    "observations.txt", 
    sep=" ", 
    index_col=False, 
    dtype={
        "obs_id" : str,
        "designation" : str
    },
)

In [3]:
observations

Unnamed: 0,obs_id,exp_mjd,ra_deg,dec_deg,theta_x_deg,theta_y_deg,mag,mag_sigma,filter,night_id,designation
0,51197,58365.292836,350.311888,-14.885022,-2.653011,-4.924894,19.7123,0.183362,1,611,-1
1,53549,58365.292836,350.368406,-14.289740,-2.482205,-4.612564,20.3548,0.183661,1,611,-1
2,53559,58365.292836,344.513829,-15.935858,-5.953186,-4.210763,17.3449,0.055973,1,611,-1
3,53548,58365.292836,350.341918,-14.362091,-2.513279,-4.645991,15.0328,0.042089,1,611,-1
4,53547,58365.292836,343.688043,-18.177766,-6.905754,-5.277483,19.8840,0.194024,1,611,-1
...,...,...,...,...,...,...,...,...,...,...,...
50729,281797,58378.389271,348.498612,3.481443,3.237752,5.078468,19.2464,0.113087,2,624,W5896
50730,281796,58378.389271,348.569671,3.286327,3.230641,4.955594,20.0634,0.145169,2,624,-1
50731,281795,58378.389271,348.976259,3.540390,3.510203,4.999753,19.5664,0.107874,2,624,J6958
50732,281805,58378.389271,347.858284,7.527130,3.845336,7.450702,19.7387,0.149677,2,624,-1


Here are all observations belonging to a single known object:

In [4]:
observations[observations["designation"] == "W5896"]

Unnamed: 0,obs_id,exp_mjd,ra_deg,dec_deg,theta_x_deg,theta_y_deg,mag,mag_sigma,filter,night_id,designation
7894,276568,58366.337928,351.272363,3.682254,2.39301,4.910555,19.3779,0.082262,2,612,W5896
15179,277309,58369.324375,350.596603,3.661589,2.601101,4.957225,19.6316,0.118469,1,615,W5896
17019,278025,58369.361215,350.587706,3.661208,2.603601,4.957814,19.0834,0.097237,2,615,W5896
23781,278649,58372.287593,349.911988,3.620334,2.807353,5.000349,20.3351,0.202122,1,618,W5896
25174,279038,58372.367651,349.892463,3.618959,2.812828,5.001517,19.7352,0.157495,2,618,W5896
34277,279826,58375.31,349.208887,3.559579,3.019046,5.040883,19.64,0.148998,1,621,W5896
36421,280673,58375.336713,349.20238,3.558916,3.020881,5.041213,19.0318,0.087064,2,621,W5896
47879,281338,58378.319167,348.515399,3.48339,3.232827,5.077611,19.7751,0.092342,1,624,W5896
50729,281797,58378.389271,348.498612,3.481443,3.237752,5.078468,19.2464,0.113087,2,624,W5896


The observations file contains observation IDs, object IDs (designation), the location on the sky, the time of the observation, and how bright the object appeared in the night sky. 

`difi` really only cares about two of the columns in the observations file: the observation ID column and the object ID column, hereafter sometimes referred to as the truth or label column.

Let's quickly take a look at the different truths in our observations.

In [5]:
observations["designation"].value_counts()

-1         12773
S9731         14
V2016         14
f1418         14
82134         14
           ...  
54314          1
H6854          1
V5597          1
K14SO1C        1
k9091          1
Name: designation, Length: 11227, dtype: int64

Notice how there is an object ID, "-1", that appears 12773 times in the observations? This object ID is reserved by THOR for observations that have no known associated object. This means that these observations may contain undiscovered Solar System small bodies and will certainly contain spurious detections such as noise and false positives.
  
`difi` has no problem handling these kinds of observations. `difi` assumes that each individual noise or unknown observation has its own unique truth ID. So let's go ahead and make that the case:

In [6]:
observations.loc[observations["designation"] == "-1", "designation"] = [
    "unknown{:06d}".format(i) for i in range(len(observations[observations["designation"] == "-1"]))
]

In [7]:
observations["designation"].value_counts()

82134            14
J7220            14
79145            14
S9731            14
f1418            14
                 ..
unknown005653     1
unknown005654     1
unknown005655     1
unknown005656     1
k9091             1
Name: designation, Length: 23999, dtype: int64

#### ii. truth classes [optional]

We just saw that for observations with no known associated object (undiscovered objects or observations of false positives), `difi` needs each observation to be labeled with a unique object ID. How about the remaining observations of known truths? Could these objects belong to different classes of truths? In the case of the Solar System, this might mean differentiating between Main Belt asteroids, near-Earth asteroids, trans-Neptunian objects, and so on. Quite often it is useful to analyze observations, and how a linking algorithm performs, by looking at different populations or classes of objects.

`difi` can handle two types of class inputs:
- i) a dictionary where each class name is a key and the unique truths that belong to that class are values
- ii) a column name in the observations dataframe that distinguishes the class of each observation

As part of this tutorial, there is another dataframe which can be read in that gives the class memberships of the different objects in our observations.


In [8]:
classes = pd.read_csv(
    "classes.txt",
    sep=" ", 
    index_col=False, 
    dtype={
        "designation" : str,
        "class" : str,
    },
)

In [9]:
classes

Unnamed: 0,designation,class
0,18109,NEO
1,13553,NEO
2,E0039,NEO
3,G2168,NEO
4,D6849,NEO
...,...,...
11222,G3731,Trojans
11223,02060,Centaurs
11224,E5452,TNO
11225,C0178,TNO


Using this dataframe let us create the first kind of valid class definition: the dictionary. 

In [10]:
classes_dict = {}
for c in classes["class"].unique():
    classes_dict[c] = classes[classes["class"].isin([c])]["designation"].unique()

In [11]:
classes_dict

{'NEO': array(['18109', '13553', 'E0039', 'G2168', 'D6849', 'f8896', '17511',
        'K01SH0C', 'X7248', 'a9454', '06456', '05786', 'K15R83J', 'Q7729',
        'K18P22K', 'K01R17Q', 'O3566', 'P3841', 'e7338'], dtype=object),
 'MBA': array(['H9891', 'D8181', 'O5993', ..., 'K15B83J', 'k0678', 'k9091'],
       dtype=object),
 'MCA': array(['K03SG9A', 'K07VW0E', 'C8451', 'S8638', 'K15TO1R', 'd0872',
        'P6460', 'Y7509', '02577', 'I6475', 'P3006', 'K14OX8H', 'A0015',
        'j5211', 'K15TN8F', 'R5568', 'K11UD8O', '96006', 'K05GB2B',
        'J96T11O', 'K11U19S', 'Z4683', 'F1888', 'V0522', 'H6611', 'U6798',
        'n4661', 'e2103', 'T0124', 'K11C69Y', 'R3364', 'K09UC6Q',
        'K05T12V', '32827', 'U6919', 'j2651', 'K12V84Q', 'f3949', 'g1650',
        'm4284', 'S3948', 'Z0243', 'j3242', 'U6805', 'J6299', 'M6046',
        '18181', '48621', 'K14W24D', 'J7595', 'D9798', 'K05WI6O', 'f4008',
        'K16A91B', 'm6813', 'd3446', '09671', 'Y4093', 'T1252', 'e1968',
        'e1979', 'S8860'

Notice how the "Unknown" class has not been updated to reflect the changes we made to the observations earlier. Let's go ahead and do that:

In [12]:
classes_dict["Unknown"] = observations[observations["designation"].str.contains("^unknown", regex=True)]["designation"].unique()
print(classes_dict["Unknown"])

['unknown000000' 'unknown000001' 'unknown000002' ... 'unknown012770'
 'unknown012771' 'unknown012772']


In [13]:
classes_dict

{'NEO': array(['18109', '13553', 'E0039', 'G2168', 'D6849', 'f8896', '17511',
        'K01SH0C', 'X7248', 'a9454', '06456', '05786', 'K15R83J', 'Q7729',
        'K18P22K', 'K01R17Q', 'O3566', 'P3841', 'e7338'], dtype=object),
 'MBA': array(['H9891', 'D8181', 'O5993', ..., 'K15B83J', 'k0678', 'k9091'],
       dtype=object),
 'MCA': array(['K03SG9A', 'K07VW0E', 'C8451', 'S8638', 'K15TO1R', 'd0872',
        'P6460', 'Y7509', '02577', 'I6475', 'P3006', 'K14OX8H', 'A0015',
        'j5211', 'K15TN8F', 'R5568', 'K11UD8O', '96006', 'K05GB2B',
        'J96T11O', 'K11U19S', 'Z4683', 'F1888', 'V0522', 'H6611', 'U6798',
        'n4661', 'e2103', 'T0124', 'K11C69Y', 'R3364', 'K09UC6Q',
        'K05T12V', '32827', 'U6919', 'j2651', 'K12V84Q', 'f3949', 'g1650',
        'm4284', 'S3948', 'Z0243', 'j3242', 'U6805', 'J6299', 'M6046',
        '18181', '48621', 'K14W24D', 'J7595', 'D9798', 'K05WI6O', 'f4008',
        'K16A91B', 'm6813', 'd3446', '09671', 'Y4093', 'T1252', 'e1968',
        'e1979', 'S8860'

We now have a dictionary with class names as keys and the unique truths belonging to each class as values.

The second option is adding a column to the observations file with the associated class of each observation. We can do that as follows:

In [14]:
for c, v in classes_dict.items():
    observations.loc[observations["designation"].isin(v), "class"] = c

In [15]:
observations

Unnamed: 0,obs_id,exp_mjd,ra_deg,dec_deg,theta_x_deg,theta_y_deg,mag,mag_sigma,filter,night_id,designation,class
0,51197,58365.292836,350.311888,-14.885022,-2.653011,-4.924894,19.7123,0.183362,1,611,unknown000000,Unknown
1,53549,58365.292836,350.368406,-14.289740,-2.482205,-4.612564,20.3548,0.183661,1,611,unknown000001,Unknown
2,53559,58365.292836,344.513829,-15.935858,-5.953186,-4.210763,17.3449,0.055973,1,611,unknown000002,Unknown
3,53548,58365.292836,350.341918,-14.362091,-2.513279,-4.645991,15.0328,0.042089,1,611,unknown000003,Unknown
4,53547,58365.292836,343.688043,-18.177766,-6.905754,-5.277483,19.8840,0.194024,1,611,unknown000004,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
50729,281797,58378.389271,348.498612,3.481443,3.237752,5.078468,19.2464,0.113087,2,624,W5896,MBA
50730,281796,58378.389271,348.569671,3.286327,3.230641,4.955594,20.0634,0.145169,2,624,unknown012771,Unknown
50731,281795,58378.389271,348.976259,3.540390,3.510203,4.999753,19.5664,0.107874,2,624,J6958,MBA
50732,281805,58378.389271,347.858284,7.527130,3.845336,7.450702,19.7387,0.149677,2,624,unknown012772,Unknown


Before we continue, let's make sure our dataframes have the expected column names. Between the observations and linkage_members dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

So let's define a dictionary that renames our columns to what `difi` needs.

In [16]:
column_mapping = {
    # data column name : difi column name
    "cluster_id" : "linkage_id",
    "obs_id" : "obs_id",
    "designation" : "truth"
}

observations.rename(columns=column_mapping, inplace=True)

#### iii. linkage_members

Up to now, we have only considered observations and the optional input of truth classes. The second input `difi` needs is a data product that describes the proposed linkages. We term this the linkage_members dataframe. The linkage_members dataframe has just two columns, both of which `difi` needs. The first column holds the linkage ID, and the second column lists, for each linkage, every unique observation in that linkage.

In [17]:
linkage_members = pd.read_csv(
    "linkage_members.txt",
    sep=" ", 
    index_col=False, 
    dtype={
        "cluster_id" : str,
        "obs_id" : str,
    },
)
linkage_members.rename(columns=column_mapping, inplace=True)

In [18]:
linkage_members

Unnamed: 0,linkage_id,obs_id
0,1,277110
1,1,275612
2,1,202500
3,1,276243
4,1,275927
...,...,...
393363,73394,283234
393364,73394,286532
393365,73394,280015
393366,73394,280537


In [19]:
linkage_members[linkage_members["linkage_id"] == "4"]

Unnamed: 0,linkage_id,obs_id
15,4,355788
16,4,357968
17,4,358149
18,4,356779
19,4,357048


In [20]:
linkage_members[linkage_members["linkage_id"] == "14"]

Unnamed: 0,linkage_id,obs_id
65,14,276814
66,14,282563
67,14,283293
68,14,277292
69,14,279113
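
As a hedged illustration of this long format, the sketch below flattens a mapping of linkage IDs to observation IDs into one (linkage_id, obs_id) row per observation, using the two linkages shown above. The `linkages` dictionary and the `to_linkage_members` helper are hypothetical names introduced here for illustration; they are not part of `difi`.

```python
def to_linkage_members(linkages):
    """Flatten {linkage_id: [obs_id, ...]} into (linkage_id, obs_id) rows."""
    rows = []
    for linkage_id, obs_ids in linkages.items():
        # dict.fromkeys drops repeated observation IDs while keeping their order,
        # so each unique observation is listed once per linkage.
        for obs_id in dict.fromkeys(obs_ids):
            rows.append((linkage_id, obs_id))
    return rows


# Linkages "4" and "14" as displayed in the tutorial output (IDs kept as strings).
linkages = {
    "4": ["355788", "357968", "358149", "356779", "357048"],
    "14": ["276814", "282563", "283293", "277292", "279113"],
}

rows = to_linkage_members(linkages)
for row in rows[:3]:
    print(row)
```

Passing such rows to `pd.DataFrame(rows, columns=["linkage_id", "obs_id"])` would give a dataframe with the same two columns as the one read from file.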


### Analyzing Observations (Can I Find It?) 

Determining how a linking algorithm performs involves knowing what it should be able to link.

`difi` comes with a function that analyzes findability with one of two simple assumptions (we term these findability metrics):
- 'min_obs' : Any truth with this many or more observations should be findable.
- 'nightly_linkages' : Any truth with enough observations to make intra-night linkages of a user-defined length, on enough distinct nights, is considered findable. This metric caters to the "tracklet"-building methodology.

The specific details and differences between the two metrics don't matter too much for this example but in short: the min_obs metric just requires a certain number of observations in a linking window whereas the nightly_linkages metric requires a specific cadence of observations for objects to be findable. 
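
To make the 'min_obs' idea concrete, here is a minimal plain-Python sketch of the rule "any truth with `min_obs` or more observations is findable". It is an illustration of the idea only, not `difi`'s implementation; the helper name `findable_by_min_obs` and the tuple layout are assumptions made for this example. The observation IDs are taken from the W5896 rows shown earlier.

```python
from collections import defaultdict


def findable_by_min_obs(observations, min_obs=6):
    """Map each findable truth to the observation IDs that make it findable.

    observations: iterable of (obs_id, truth) pairs.
    """
    obs_by_truth = defaultdict(list)
    for obs_id, truth in observations:
        obs_by_truth[truth].append(obs_id)
    # Keep only the truths that reach the observation-count threshold.
    return {t: ids for t, ids in obs_by_truth.items() if len(ids) >= min_obs}


w5896_ids = ["276568", "277309", "278025", "278649", "279038",
             "279826", "280673", "281338", "281797"]
observations_toy = [(obs_id, "W5896") for obs_id in w5896_ids]
observations_toy += [("281795", "J6958"), ("51197", "unknown000000")]

findable_toy = findable_by_min_obs(observations_toy, min_obs=6)
print(sorted(findable_toy))
```

Only W5896 reaches six observations in this toy sample, so it is the only truth returned.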

If these metrics don't satisfy the desired use case, don't worry. The `analyzeObservations` function can handle a callable as its metric keyword argument. This callable should return a `pandas.DataFrame` with one column of the truth IDs that are findable, and a column named 'obs_ids' containing `~numpy.ndarray`s of the observations that made each truth findable.

Let's see what should be findable with our generic linking algorithm using the simplest metric: 'min_obs'.

In [21]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=None,
    metric="min_obs",
    min_obs=6,
    detection_window=None,
)

The `analyzeObservations` function returns three dataframes. Let's take a look at all three in a little detail. 

The all_truths dataframe lists each unique truth as a row, with columns that give the number of observations each truth has and whether it is findable (whether it has `min_obs` or more observations).

In [22]:
all_truths

Unnamed: 0,truth,num_obs,findable
0,79145,14,1
1,82134,14,1
2,J7220,14,1
3,S9731,14,1
4,V2016,14,1
...,...,...,...
23994,unknown012768,1,0
23995,unknown012769,1,0
23996,unknown012770,1,0
23997,unknown012771,1,0


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [23]:
findable = all_truths[all_truths["findable"] == 1]["truth"].unique()
not_findable = all_truths[all_truths["findable"] == 0]["truth"].unique()

In [24]:
findable

array(['79145', '82134', 'J7220', ..., 'n3871', 'n3994', 'n6485'],
      dtype=object)

In [25]:
not_findable

array(['00436', '01231', '01964', ..., 'unknown012770', 'unknown012771',
       'unknown012772'], dtype=object)

The next dataframe returned is findable_observations. This data product has a column of all the truths that were deemed findable by the findability metric and a column "obs_ids" that contains arrays of the observation IDs that made each truth findable.

In [26]:
findable_observations

Unnamed: 0,truth,obs_ids
0,00237,"[15409, 15557, 15889, 56886, 57428, 57449, 59479]"
1,00559,"[55428, 55671, 56379, 57416, 57722, 59510]"
2,00733,"[276972, 199900, 277515, 200860, 278789, 20158..."
3,00894,"[283362, 285148, 285603, 279936, 280453, 281270]"
4,01010,"[127295, 127683, 128790, 129495, 130158, 130849]"
...,...,...
2127,n4964,"[55668, 56269, 56935, 57267, 57508, 57996, 59677]"
2128,n5598,"[282645, 284091, 284522, 278967, 285478, 28026..."
2129,n5849,"[271492, 272311, 273190, 274319, 270748, 27527..."
2130,n6485,"[283969, 284261, 285598, 286160, 286911, 287995]"


The last dataframe that was returned is the summary dataframe, which gives some per-class summary statistics.

In [27]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,23999,50734,2132


Let us now make `difi` aware of the classes we defined earlier so we can take a look at population statistics.

In [28]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=classes_dict,
    metric="min_obs",
    min_obs=6,
    detection_window=None,
)

Now that we have handed our class definitions to the `analyzeObservations` function, our summary dataframe will have updated with a per-class summary of the truths that should be findable.

In [29]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,23999,50734,2132
1,MBA,11111,37529,2103
2,Unknown,12773,12773,0
3,MCA,89,337,22
4,NEO,19,68,5
5,Trojans,4,11,0
6,TNO,2,9,1
7,Centaurs,1,7,1


As stated earlier, the `analyzeObservations` function has two built-in findability criteria. Let's take a look at the second one: 'nightly_linkages'. The `analyzeObservations` function calls a function that calculates which objects should be findable. The 'min_obs' metric from earlier is defined by `calcFindableMinObs`, while the 'nightly_linkages' metric is defined by `calcFindableNightlyLinkages`. Let's take a look at the latter to get an idea of the parameters we can configure.

In [30]:
difi.calcFindableNightlyLinkages?

Signature:
difi.calcFindableNightlyLinkages(
    observations: pandas.core.frame.DataFrame,
    linkage_min_obs: int = 2,
    max_obs_separation: float = 0.0625,
    min_linkage_nights: int = 3,
) -> pandas.core.frame.DataFrame
Docstring:
Finds the truths that have at least min_linkage_nights linkages of length
linkage_min_obs or more. Observations are considered to be in a possible intra-night
linkage if their observation time does not e

We have now run into a case where a metric requires additional columns in the observations file to determine what is and what is not findable. In particular, the 'nightly_linkages' metric needs the time of each observation and the night during which the observation occurred. So let's make sure we have what we need in our observations and update the column names accordingly.

In [31]:
# Current column_mapping
column_mapping

{'cluster_id': 'linkage_id', 'obs_id': 'obs_id', 'designation': 'truth'}

In [32]:
# Current observations
observations

Unnamed: 0,obs_id,exp_mjd,ra_deg,dec_deg,theta_x_deg,theta_y_deg,mag,mag_sigma,filter,night_id,truth,class
0,51197,58365.292836,350.311888,-14.885022,-2.653011,-4.924894,19.7123,0.183362,1,611,unknown000000,Unknown
1,53549,58365.292836,350.368406,-14.289740,-2.482205,-4.612564,20.3548,0.183661,1,611,unknown000001,Unknown
2,53559,58365.292836,344.513829,-15.935858,-5.953186,-4.210763,17.3449,0.055973,1,611,unknown000002,Unknown
3,53548,58365.292836,350.341918,-14.362091,-2.513279,-4.645991,15.0328,0.042089,1,611,unknown000003,Unknown
4,53547,58365.292836,343.688043,-18.177766,-6.905754,-5.277483,19.8840,0.194024,1,611,unknown000004,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
50729,281797,58378.389271,348.498612,3.481443,3.237752,5.078468,19.2464,0.113087,2,624,W5896,MBA
50730,281796,58378.389271,348.569671,3.286327,3.230641,4.955594,20.0634,0.145169,2,624,unknown012771,Unknown
50731,281795,58378.389271,348.976259,3.540390,3.510203,4.999753,19.5664,0.107874,2,624,J6958,MBA
50732,281805,58378.389271,347.858284,7.527130,3.845336,7.450702,19.7387,0.149677,2,624,unknown012772,Unknown


The observations file already has the observation time (exp_mjd), expressed as a Modified Julian Date (MJD), a time format used by astronomers in units of decimal days. Let us update our column_mapping dictionary to point to that column.

In [33]:
column_mapping["exp_mjd"] = "time"

The last column we still need is the "night" column. This column should indicate the night during which each observation occurred so that it can be used to isolate nightly observations. Conveniently, the observations file already has that information in the "night_id" column. Let's add that column name to the column_mapping dictionary:

In [34]:
column_mapping["night_id"] = "night"

In [35]:
observations.rename(columns=column_mapping, inplace=True)

In [36]:
# Any objects with at least 3 "tracklets" should be findable
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=classes_dict,
    metric="nightly_linkages",
    linkage_min_obs=2,          # a tracklet should be at least 2 observations
    max_obs_separation=1.5/24,  # these observations should be within 90 minutes
    min_linkage_nights=3,       # we need a tracklet on 3 unique nights
    detection_window=None,
)

In [37]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,23999,50734,493
1,MBA,11111,37529,484
2,Unknown,12773,12773,0
3,MCA,89,337,7
4,NEO,19,68,2
5,Trojans,4,11,0
6,TNO,2,9,0
7,Centaurs,1,7,0


Comparing this summary dataframe to the previous one shows that fewer objects are findable. This intuitively makes sense since the 'nightly_linkages' metric is much more restrictive.
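
To see why a cadence requirement is more restrictive, the sketch below applies a simplified reading of the rule to the nine W5896 observations listed earlier (times and night IDs copied from that table). It is plain Python, not `difi`'s implementation: the helper name `nights_with_linkage` is hypothetical, and treating a linkage as a run of consecutive observations each within `max_obs_separation` of the previous one is an assumption of this example.

```python
from collections import defaultdict


def nights_with_linkage(obs, linkage_min_obs=2, max_obs_separation=1.5 / 24):
    """Return the nights on which one truth has a qualifying intra-night linkage.

    obs: list of (time_mjd, night) pairs for a single truth.
    Assumption: a linkage is a run of at least linkage_min_obs consecutive
    observations, each within max_obs_separation days of the previous one.
    """
    times_by_night = defaultdict(list)
    for time, night in obs:
        times_by_night[night].append(time)

    good_nights = []
    for night, times in sorted(times_by_night.items()):
        times.sort()
        run = best = 1
        for prev, cur in zip(times, times[1:]):
            # Extend the current run, or start a new one if the gap is too large.
            run = run + 1 if cur - prev <= max_obs_separation else 1
            best = max(best, run)
        if best >= linkage_min_obs:
            good_nights.append(night)
    return good_nights


w5896 = [
    (58366.337928, 612),
    (58369.324375, 615), (58369.361215, 615),
    (58372.287593, 618), (58372.367651, 618),
    (58375.31, 621), (58375.336713, 621),
    (58378.319167, 624), (58378.389271, 624),
]

good_nights = nights_with_linkage(w5896)
findable_nightly = len(good_nights) >= 3   # min_linkage_nights=3
findable_min_obs = len(w5896) >= 6         # min_obs=6
print(good_nights, findable_nightly, findable_min_obs)
```

Under this reading, W5896 has pairs within 90 minutes on only two nights (the pairs on nights 618 and 624 are separated by more than 0.0625 days), so it passes the 'min_obs' threshold with nine observations but would not pass a three-night tracklet requirement.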

Before we proceed to the next section, let's rerun the min_obs metric so our data products reflect the assumed parameters of the linking algorithm that generated the data.

In [38]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=classes_dict,
    metric="min_obs",
    min_obs=6,
    detection_window=None,
)

In [39]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,23999,50734,2132
1,MBA,11111,37529,2103
2,Unknown,12773,12773,0
3,MCA,89,337,22
4,NEO,19,68,5
5,Trojans,4,11,0
6,TNO,2,9,1
7,Centaurs,1,7,1


### Analyzing Linkages (Did I Find It?)

We have described how to find the truths that should be findable by a linking algorithm. Now let's analyze actual linking algorithm performance by analyzing our linkages. As a reminder, our linkages are defined by the `linkage_members` dataframe:

In [40]:
linkage_members

Unnamed: 0,linkage_id,obs_id
0,1,277110
1,1,275612
2,1,202500
3,1,276243
4,1,275927
...,...,...
393363,73394,283234
393364,73394,286532
393365,73394,280015
393366,73394,280537


Our observations look as follows:

In [41]:
observations

Unnamed: 0,obs_id,time,ra_deg,dec_deg,theta_x_deg,theta_y_deg,mag,mag_sigma,filter,night,truth,class
0,51197,58365.292836,350.311888,-14.885022,-2.653011,-4.924894,19.7123,0.183362,1,611,unknown000000,Unknown
1,53549,58365.292836,350.368406,-14.289740,-2.482205,-4.612564,20.3548,0.183661,1,611,unknown000001,Unknown
2,53559,58365.292836,344.513829,-15.935858,-5.953186,-4.210763,17.3449,0.055973,1,611,unknown000002,Unknown
3,53548,58365.292836,350.341918,-14.362091,-2.513279,-4.645991,15.0328,0.042089,1,611,unknown000003,Unknown
4,53547,58365.292836,343.688043,-18.177766,-6.905754,-5.277483,19.8840,0.194024,1,611,unknown000004,Unknown
...,...,...,...,...,...,...,...,...,...,...,...,...
50729,281797,58378.389271,348.498612,3.481443,3.237752,5.078468,19.2464,0.113087,2,624,W5896,MBA
50730,281796,58378.389271,348.569671,3.286327,3.230641,4.955594,20.0634,0.145169,2,624,unknown012771,Unknown
50731,281795,58378.389271,348.976259,3.540390,3.510203,4.999753,19.5664,0.107874,2,624,J6958,MBA
50732,281805,58378.389271,347.858284,7.527130,3.845336,7.450702,19.7387,0.149677,2,624,unknown012772,Unknown


Using these two data products and a few keyword arguments we can analyze performance:

In [42]:
all_linkages, all_truths, summary = difi.analyzeLinkages(
    observations, 
    linkage_members, 
    classes=classes_dict,
    all_truths=all_truths,
    min_obs=6, 
    contamination_percentage=20, 
)

The `analyzeLinkages` function returns three dataframes:
- all_linkages: each linkage is summarized as its own row. 
- all_truths: each truth is summarized as its own row. 
- summary: per-class summary statistics

Let's now take a look at each individually.

In [43]:
all_linkages

Unnamed: 0,linkage_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
0,1,5,3,0,0,0,1,,0,0,0,
1,10,5,2,0,0,1,0,20.0,0,0,0,J4070
2,100,5,2,0,0,1,0,20.0,0,0,0,26576
3,1000,5,2,0,0,0,1,,0,0,0,
4,10000,5,2,0,0,1,0,20.0,0,0,0,44064
...,...,...,...,...,...,...,...,...,...,...,...,...
73389,9995,5,2,0,0,1,0,20.0,0,0,0,13849
73390,9996,5,2,0,0,1,0,20.0,0,0,0,32745
73391,9997,5,2,0,0,1,0,20.0,0,0,0,j3234
73392,9998,5,2,0,0,1,0,20.0,0,0,0,K3583


In [44]:
all_linkages[all_linkages["pure"] == 1]

Unnamed: 0,linkage_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
196,10174,5,1,1,0,0,0,0.0,0,0,0,60843
198,10176,6,1,1,0,0,0,0.0,1,0,1,50910
200,10178,5,1,1,0,0,0,0.0,0,0,0,E9391
202,1018,5,1,1,0,0,0,0.0,0,0,0,a2210
212,10189,6,1,1,0,0,0,0.0,1,0,1,F1781
...,...,...,...,...,...,...,...,...,...,...,...,...
73325,9937,5,1,1,0,0,0,0.0,0,0,0,J99T81T
73330,9941,6,1,1,1,0,0,0.0,1,0,1,13883
73333,9944,5,1,1,0,0,0,0.0,0,0,0,G6654
73337,9948,5,1,1,0,0,0,0.0,0,0,0,N9361


In [45]:
all_linkages[all_linkages["partial"] == 1]

Unnamed: 0,linkage_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
1,10,5,2,0,0,1,0,20.0,0,0,0,J4070
2,100,5,2,0,0,1,0,20.0,0,0,0,26576
4,10000,5,2,0,0,1,0,20.0,0,0,0,44064
6,10002,5,2,0,0,1,0,20.0,0,0,0,D4692
9,10005,5,2,0,0,1,0,20.0,0,0,0,K3583
...,...,...,...,...,...,...,...,...,...,...,...,...
73389,9995,5,2,0,0,1,0,20.0,0,0,0,13849
73390,9996,5,2,0,0,1,0,20.0,0,0,0,32745
73391,9997,5,2,0,0,1,0,20.0,0,0,0,j3234
73392,9998,5,2,0,0,1,0,20.0,0,0,0,K3583


In [46]:
all_linkages[all_linkages["mixed"] == 1]

Unnamed: 0,linkage_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
0,1,5,3,0,0,0,1,,0,0,0,
3,1000,5,2,0,0,0,1,,0,0,0,
5,10001,5,2,0,0,0,1,,0,0,0,
7,10003,6,2,0,0,0,1,,0,0,0,
8,10004,5,2,0,0,0,1,,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...
73380,9987,5,2,0,0,0,1,,0,0,0,
73383,999,5,3,0,0,0,1,,0,0,0,
73384,9990,5,3,0,0,0,1,,0,0,0,
73385,9991,5,3,0,0,0,1,,0,0,0,


For each linkage defined in the `linkage_members` dataframe, `analyzeLinkages` reports the number of unique truths ('num_members'), the number of unique observations ('num_obs'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage (if the linkage is considered 'partial'), and, if the linkage is either 'pure' or 'partial', the linked truth ('linked_truth').

Here we briefly summarize the different linkage types possible:
- 'pure': a linkage where all constituent observations belong to a single truth. This linkage class is further subdivided into 'pure_complete' linkages, which are pure linkages that contain all of an object's observations in the given observations dataframe.
- 'partial': a linkage that contains observations belonging to multiple truths but 
    equal to or more than min_obs observations of one truth and no more than the contamination threshold
    of observations of other truths. For example, a linkage with ten observations, eight of which belong to
    a single unique truth and two of which belong to other truths has contamination percentage 20%. If the threshold
    is set to 20% or greater, and min_obs is less than or equal to eight then the truth with the eight observations
    is considered found and the linkage is considered a partial linkage.
- 'mixed': all linkages that are neither pure nor partial.
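
The definitions above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not `difi`'s code: the function name `classify_linkage` is hypothetical, and because the tables above show five-observation linkages flagged as partial while their found flags stay at 0 (with `min_obs=6`), this sketch lets `min_obs` gate the found flag rather than the label itself.

```python
from collections import Counter


def classify_linkage(truths, min_obs=6, contamination_percentage=20.0):
    """Classify one linkage from the truth label of each of its observations."""
    counts = Counter(truths)
    top_truth, top_count = counts.most_common(1)[0]
    # Share of observations that do not belong to the dominant truth.
    contamination = 100.0 * (len(truths) - top_count) / len(truths)

    if len(counts) == 1:
        kind = "pure"
    elif contamination <= contamination_percentage:
        kind = "partial"
    else:
        # Mixed linkages have no linked truth and no contamination value.
        return {"type": "mixed", "contamination": None,
                "linked_truth": None, "found": False}

    return {"type": kind, "contamination": contamination,
            "linked_truth": top_truth, "found": top_count >= min_obs}


# The worked example from the text: eight observations of one truth, two others.
ten_obs = classify_linkage(["A"] * 8 + ["B", "C"])
# A five-observation linkage with one contaminant, like linkage 10 above.
five_obs = classify_linkage(["J4070"] * 4 + ["unknown"])
print(ten_obs)
print(five_obs)
```

The ten-observation linkage has a contamination of 20% and counts as found for truth "A"; the five-observation linkage is also partial at 20% contamination but has too few observations of its dominant truth to count as found.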

In [47]:
all_truths

Unnamed: 0,truth,num_obs,findable,found_pure,found_partial,found,pure,pure_complete,partial,partial_contaminant,mixed,obs_in_pure,obs_in_pure_complete,obs_in_partial,obs_in_partial_contaminant,obs_in_mixed
0,79145,14,1,7,12,19,10,0,64,4,47,82,0,327,4,153
1,82134,14,1,9,1,10,9,1,396,24,286,76,14,1588,24,964
2,J7220,14,1,8,2,10,11,1,2,0,63,82,14,21,0,169
3,S9731,14,1,4,31,35,6,0,164,7,152,41,0,790,7,449
4,V2016,14,1,9,0,9,12,0,95,25,166,88,0,380,25,481
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23994,unknown012768,1,0,0,0,0,0,0,0,0,10,0,0,0,0,10
23995,unknown012769,1,0,0,0,0,0,0,0,7,5,0,0,0,7,5
23996,unknown012770,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0
23997,unknown012771,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0


The all_truths dataframe shows for each truth if it has been found in either a pure or partial linkage. If found in either it sets the found column to 1. 

Lastly, the summary dataframe contains overall statistics such as the number of truths found and the completeness (if calculable).

In [48]:
summary

Unnamed: 0,class,num_members,num_obs,completeness,findable,found,findable_found,findable_missed,not_findable_found,not_findable_missed,...,unique_in_partial_linkages_only,unique_in_pure_and_partial_linkages,unique_in_partial_linkages,unique_in_partial_contaminant_linkages,unique_in_mixed_linkages,obs_in_pure_linkages,obs_in_pure_complete_linkages,obs_in_partial_linkages,obs_in_partial_contaminant_linkages,obs_in_mixed_linkages
0,All,23999,50734,72.983114,2132,1556,1556,576,0,21867,...,45,362,407,3916,6100,42171,11196,110403,27274,213520
1,MBA,11111,37529,73.656681,2103,1549,1549,554,0,9008,...,42,360,402,2290,3866,42012,11183,109775,21653,196945
2,Unknown,12773,12773,,0,0,0,0,0,12773,...,0,0,0,1598,2189,0,0,0,5381,14465
3,MCA,89,337,31.818182,22,7,7,15,0,67,...,2,2,4,22,36,159,13,600,187,1685
4,NEO,19,68,0.0,5,0,0,5,0,14,...,0,0,0,5,7,0,0,0,47,82
5,Trojans,4,11,,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
6,TNO,2,9,0.0,1,0,0,1,0,1,...,1,0,1,1,1,0,0,28,6,341
7,Centaurs,1,7,0.0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,2
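
The completeness values in this table are consistent with the ratio of 'findable_found' to 'findable', expressed as a percentage. The short sketch below reproduces a few of them from the numbers shown; the formula is inferred from the table rather than taken from `difi`'s source, and the `completeness` helper is a hypothetical name.

```python
import math


def completeness(findable_found, findable):
    """Percentage of findable truths that were found; NaN if none are findable."""
    if findable == 0:
        return float("nan")
    return 100.0 * findable_found / findable


# (findable_found, findable) pairs copied from the summary table above.
table = {"All": (1556, 2132), "MBA": (1549, 2103), "MCA": (7, 22), "Unknown": (0, 0)}
result = {name: completeness(found, findable) for name, (found, findable) in table.items()}

for name, value in result.items():
    print(name, "n/a" if math.isnan(value) else round(value, 6))
```

Classes with no findable members, such as Unknown and Trojans, have no defined completeness, which matches the empty entries in the table.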


Notice that we passed the all_truths dataframe as a keyword argument to the `analyzeLinkages` function. This allows the function to access the 'findable' column and calculate completeness. You do not need to pass the all_truths dataframe or the summary dataframe to use `analyzeLinkages`. Below is an example.

In [49]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=classes_dict,
    metric="min_obs",
    min_obs=6,
    detection_window=None,
)
all_truths.drop(columns=["findable"], inplace=True)

all_linkages, all_truths, summary = difi.analyzeLinkages(
    observations, 
    linkage_members, 
    classes=classes_dict,
    all_truths=all_truths,
    min_obs=6, 
    contamination_percentage=20, 
)

statistics can not be calculated.


In [50]:
summary

Unnamed: 0,class,num_members,num_obs,completeness,findable,found,findable_found,findable_missed,not_findable_found,not_findable_missed,...,unique_in_partial_linkages_only,unique_in_pure_and_partial_linkages,unique_in_partial_linkages,unique_in_partial_contaminant_linkages,unique_in_mixed_linkages,obs_in_pure_linkages,obs_in_pure_complete_linkages,obs_in_partial_linkages,obs_in_partial_contaminant_linkages,obs_in_mixed_linkages
0,All,23999,50734,,,1556,,,,,...,45,362,407,3916,6100,42171,11196,110403,27274,213520
1,MBA,11111,37529,,,1549,,,,,...,42,360,402,2290,3866,42012,11183,109775,21653,196945
2,Unknown,12773,12773,,,0,,,,,...,0,0,0,1598,2189,0,0,0,5381,14465
3,MCA,89,337,,,7,,,,,...,2,2,4,22,36,159,13,600,187,1685
4,NEO,19,68,,,0,,,,,...,0,0,0,5,7,0,0,0,47,82
5,Trojans,4,11,,,0,,,,,...,0,0,0,0,0,0,0,0,0,0
6,TNO,2,9,,,0,,,,,...,1,0,1,1,1,0,0,28,6,341
7,Centaurs,1,7,,,0,,,,,...,0,0,0,0,1,0,0,0,0,2


So what do all those columns track? The best way to find out is to check out the docstring:

In [51]:
difi.analyzeLinkages?

Signature:
difi.analyzeLinkages(
    observations: pandas.core.frame.DataFrame,
    linkage_members: pandas.core.frame.DataFrame,
    all_truths: Optional[pandas.core.frame.DataFrame] = None,
    min_obs: int = 5,
    contamination_percentage: float = 20.0,
[0;34m[0m    [0mclasses[0m[0;34m:[0m [0mOptional[0m[0;34m[[0m[0mdict[0m[0;34m][0m [0;34m=[0m [0;32