## Tutorial - A Case of Solar System Small-Body Linking

[Assumed Inputs](#Assumed-Inputs)  
[Analyzing Observations](#Analyzing-Observations-(Can-I-Find-It%3F))  
[Analyzing Linkages](#Analyzing-Linkages-(Did-I-Find-It%3F))

In this tutorial, we are going to take a look at an example use case for `difi`: Solar System small-body linking. One of the goals of Solar System science is discovering new asteroids, comets, and natural satellites; these celestial objects are commonly referred to as small bodies. A variety of algorithms tackle this task, but all of them produce essentially the same data product and need the same form of analysis. Each algorithm generates a set of proposed linkages, where each linkage groups observations that the algorithm believes belong to a single object.
For example, when a generic linking algorithm is applied to a data set containing several observations of the minor planet Ceres, it should be able to recover Ceres' observations as a proposed linkage. Given knowledge of which observations belong to which object, we can use `difi` to analyze how well such a linking algorithm performed.

In [1]:
import os
import sys
import numpy as np
import pandas as pd

import difi

from difi import __version__
print("difi version: {}".format(__version__))

difi version: 1.1.dev38+gb001aca.d20200918


### Assumed Inputs

Let's take a look at a sample observations file from a linking algorithm called MOPS. Do not worry too much about the details of the columns or of the linking algorithm.

In [2]:
observations = pd.read_csv(os.path.join("observations.txt"), 
                           sep=" ", 
                           index_col=False, 
                           header=None, 
                           names=["det_id", 
                                  "field_id", 
                                  "object_name", 
                                  "ra_deg", 
                                  "dec_deg", 
                                  "epoch_mjd", 
                                  "mag", 
                                  "mag_sigma"],
                           dtype={"det_id" : str,
                                  "object_name" : str})

In [3]:
observations.head()

Unnamed: 0,det_id,field_id,object_name,ra_deg,dec_deg,epoch_mjd,mag,mag_sigma
0,137541512,1719609,4,171.39297,-14.23383,52391.002282,20.7856,0.105232
1,137541513,1719609,5,171.308411,-14.222651,52391.002282,19.7737,0.037289
2,137541533,1719609,24,171.105427,-13.838795,52391.002282,21.103,0.146543
3,137541550,1719609,41,171.097453,-14.453792,52391.002282,20.7811,0.104742
4,137541564,1719609,54,171.740297,-14.035202,52391.002282,18.1394,0.008938


The observations file contains observation IDs, field IDs, object IDs, the location on the sky, the time of the observation, and how bright the object appeared in the night sky. 

`difi` really only cares about two of the columns in the observations file: the observation ID column and the object ID column, hereafter referred to as the truth column.

Let's quickly take a look at the different truths in our observations.

In [4]:
observations["object_name"].value_counts()

-1        2346
-2         947
7359        26
7928        26
7888        26
          ... 
659062       1
403579       1
258991       1
4207         1
897197       1
Name: object_name, Length: 17476, dtype: int64

Notice how there are two IDs (-1, -2) that have many more observations than the remaining truths. These are actually observations that belong to different classes of noise and were inserted into the observations to see how well the linking algorithm can handle spurious detections. 
  
`difi` has no problem handling these kinds of observations. If noise has been inserted into the observations dataset, `difi` assumes that each individual noise observation has its own unique truth ID. So let's go ahead and make that the case:

In [5]:
observations.loc[observations["object_name"] == "-1", "object_name"] = [
    "NS{:06d}".format(i) for i in range(len(observations[observations["object_name"] == "-1"]))
]
observations.loc[observations["object_name"] == "-2", "object_name"] = [
    "FD{:06d}".format(i) for i in range(len(observations[observations["object_name"] == "-2"]))
]

In [6]:
observations["object_name"].value_counts()

7928        26
7888        26
7359        26
7684        25
7839        25
            ..
NS001202     1
FD000172     1
FD000285     1
NS002326     1
FD000118     1
Name: object_name, Length: 20767, dtype: int64

How about the remaining detections? Could they belong to different classes of truths? In the case of the Solar System, this might mean differentiating between main-belt asteroids and near-Earth asteroids. For the purposes of this demonstration, let's just split the truths arbitrarily into two classes. Doing so will allow us to show how `difi` can handle population statistics.

In [7]:
# Grab unique object IDs
NS_mask = observations["object_name"].str.contains("NS", regex=True)
FD_mask = observations["object_name"].str.contains("FD", regex=True)
remaining = observations[(~FD_mask) & (~NS_mask)]["object_name"].unique()
MBA1, MBA2 = np.array_split(remaining, 2)

classes = {
    "NS" : observations[NS_mask]["object_name"].unique(),
    "FD" : observations[FD_mask]["object_name"].unique(),
    "MBA1" : MBA1,
    "MBA2" : MBA2
}

We now have a dictionary with class names as keys and the unique truths belonging to each class as values.

Now, let's take a look at a sample linkage input. MOPS outputs its linkages in a text file with the observation IDs for each linkage written on a single line. Notice that there are no linkage IDs.

In [8]:
! head linkages.txt

137541512 137543165 137615070 137620728 138216303 138216866 138221227 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144513728 144533645 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144513728 144533645 146991832 147084549 
137541512 137543165 137615070 137620728 138216303 138216866 138221227 144514371 144534274 
137541512 137543165 137615070 137620728 142747928 142763154 
137541512 137543165 137615070 137620728 142748009 142763229 
137541512 137543165 137615070 137620728 142748009 142763229 144513839 144533746 
137541512 137543165 137615070 137620728 142748120 142763338 
137541512 137543165 137615070 137620728 142748305 142763529 
137541512 137543165 137615070 137620728 142748337 142763570 


Before we continue, let's tell `difi` what columns to use. Between the observations and linkage_members dataframes, `difi` needs to know about just three columns:
- linkage_id: the ID assigned to each linkage
- obs_id : the observation ID from which linkages are made
- truth : the truth for every observation

So let's define a dictionary that tells `difi` which columns hold this information.

In [9]:
column_mapping = {
    # difi column name : data column name
    "linkage_id" : "track_id",
    "obs_id" : "det_id",
    "truth" : "object_name"
}

We can convert this format into the `linkage_members` format required by `difi` using the following code.

In [10]:
linkage_members = difi.readLinkagesByLineFile("linkages.txt",
                                              column_mapping=column_mapping)

In [11]:
linkage_members

Unnamed: 0,track_id,det_id
0,1,137541512
1,1,137543165
2,1,137615070
3,1,137620728
4,1,138216303
...,...,...
437114,50000,140131831
437115,50000,140168253
437116,50000,140241451
437117,50000,576903193


The linkage dataframe has just two columns, both of which `difi` needs. The first column holds the linkage ID and the second holds an observation ID, so each linkage spans one row per unique observation it contains.
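As a rough illustration of what this conversion involves, the sketch below turns one-linkage-per-line text into the same long, two-column layout. It is not `difi`'s implementation: the function name `read_linkages_by_line_sketch` is hypothetical, and numbering linkages from 1 in file order (as strings) is an assumption based on the output shown above.

```python
import io

import pandas as pd


def read_linkages_by_line_sketch(lines, linkage_col="track_id", obs_col="det_id"):
    """Hypothetical helper: one linkage per line -> one row per (linkage, observation)."""
    rows = []
    for linkage_number, line in enumerate(lines, start=1):
        # Each whitespace-separated token on a line is one observation ID.
        for obs_id in line.split():
            rows.append({linkage_col: str(linkage_number), obs_col: obs_id})
    return pd.DataFrame(rows, columns=[linkage_col, obs_col])


# Two made-up linkages in the same layout as linkages.txt.
sample_file = io.StringIO(
    "137541512 137543165 137615070 \n"
    "137541512 137543165 142747928 142763154 \n"
)
sketch_members = read_linkages_by_line_sketch(sample_file)
print(sketch_members)
```

The same observation ID may appear in several linkages, which is why the long format repeats it once per linkage.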

Let's take a look at two specific examples:

In [12]:
linkage_members[linkage_members["track_id"] == "4"]

Unnamed: 0,track_id,det_id
27,4,137541512
28,4,137543165
29,4,137615070
30,4,137620728
31,4,138216303
32,4,138216866
33,4,138221227
34,4,144514371
35,4,144534274


In [13]:
linkage_members[linkage_members["track_id"] == "14"]

Unnamed: 0,track_id,det_id
97,14,137541512
98,14,137543165
99,14,137615070
100,14,137620728
101,14,144513738
102,14,144533658


### Analyzing Observations (Can I Find It?) 

Determining how well a linking algorithm performs involves knowing what it should be able to link.

`difi` comes with a function that analyzes findability under one of two simple assumptions (we term these findability metrics):
- 'min_obs' : Any truth with this many or more observations should be findable.
- 'nightly_linkages' : Any truth with enough observations to make an intra-night linkage of a user-defined length, and any object with enough nights during which such linkages can be made are considered findable. This metric is more catered towards the "tracklet" building methodology as used by our generic linking algorithm.

If these metrics don't satisfy the desired use case, don't worry. The `analyzeObservations` function can handle a callable as its metric keyword argument. This callable should return a `pandas.DataFrame` with one column of the truth IDs that are findable, and a column named 'obs_ids' containing `~numpy.ndarray`s of the observations that made each truth findable.
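To make the expected return shape concrete, here is a sketch of a custom metric that treats a truth as findable when it has enough observations brighter than a magnitude limit. Everything about it is an assumption made for illustration: the name `calc_findable_bright_sketch`, its parameters, and in particular its call signature, since the arguments `difi` passes to a callable metric are not described here. Only the returned layout (a truth column plus an 'obs_ids' column of arrays) follows the description above.

```python
import pandas as pd


def calc_findable_bright_sketch(observations, column_mapping, max_mag=21.0, min_obs=3):
    """Hypothetical metric: findable if at least min_obs observations have mag <= max_mag."""
    truth_col = column_mapping["truth"]
    obs_col = column_mapping["obs_id"]
    bright = observations[observations["mag"] <= max_mag]
    rows = [
        # Keep the observation IDs that made this truth findable as a numpy array.
        {truth_col: truth, "obs_ids": group[obs_col].to_numpy()}
        for truth, group in bright.groupby(truth_col)
        if len(group) >= min_obs
    ]
    return pd.DataFrame(rows, columns=[truth_col, "obs_ids"])


# Toy observations (made up): truth "A" has three bright detections, "B" only one.
toy_observations = pd.DataFrame({
    "det_id": ["a1", "a2", "a3", "b1", "b2", "b3"],
    "object_name": ["A", "A", "A", "B", "B", "B"],
    "mag": [20.0, 20.5, 19.0, 22.0, 20.0, 23.0],
})
toy_mapping = {"obs_id": "det_id", "truth": "object_name"}
findable_sketch = calc_findable_bright_sketch(toy_observations, toy_mapping)
print(findable_sketch)
```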

Let's see what should be findable with our generic linking algorithm using the simplest metric: 'min_obs'.

In [14]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(observations,
                                                                      classes=None,
                                                                      metric="min_obs",
                                                                      min_obs=6,
                                                                      column_mapping=column_mapping)

The `analyzeObservations` function returns three dataframes. Let's take a look at all three in a little detail. 

The all_truths dataframe lists each unique truth as a row, with columns giving the number of observations of that truth and whether it is findable (here, whether it has at least `min_obs` observations).
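The 'min_obs' criterion amounts to counting observations per truth and comparing against a threshold. A minimal pandas sketch on a toy table follows; the toy data and variable names are made up for illustration, and this is not `difi`'s own code.

```python
import pandas as pd

# Toy observations: truth "A" has three detections, "B" two, and one noise detection.
toy_observations = pd.DataFrame({
    "det_id": ["1", "2", "3", "4", "5", "6"],
    "object_name": ["A", "A", "A", "B", "B", "NS000000"],
})
min_obs = 3

# Count observations per truth, then flag truths that reach the threshold.
num_obs = toy_observations["object_name"].value_counts()
toy_all_truths = pd.DataFrame({
    "object_name": num_obs.index,
    "num_obs": num_obs.values,
    "findable": (num_obs.values >= min_obs).astype(int),
})
print(toy_all_truths)
```

Because each noise detection carries its own unique truth ID, noise truths always have a single observation and can never pass a threshold greater than one.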

In [15]:
all_truths

Unnamed: 0,object_name,num_obs,findable
0,7359,26,1
1,7888,26,1
2,7928,26,1
3,7684,25,1
4,7704,25,1
...,...,...,...
20762,NS002341,1,0
20763,NS002342,1,0
20764,NS002343,1,0
20765,NS002344,1,0


One can trivially select the objects that should or should not be findable thanks to `pandas`.

In [16]:
findable = all_truths[all_truths["findable"] == 1][column_mapping["truth"]].unique()
not_findable = all_truths[all_truths["findable"] == 0][column_mapping["truth"]].unique()

In [17]:
findable

array(['7359', '7888', '7928', ..., '9825', '9875', '9911'], dtype=object)

In [18]:
not_findable

array(['10025', '10103', '10281', ..., 'NS002343', 'NS002344', 'NS002345'],
      dtype=object)

The next dataframe returned is findable_observations. This data product has a column of all the truths that were deemed findable by the findability metric and a column "obs_ids" that contains arrays of the observation IDs that made each truth findable.

In [19]:
findable_observations

Unnamed: 0,object_name,obs_ids
0,10008,"[139152477, 139221856, 140022991, 140098876, 1..."
1,1003647,"[142169190, 142171432, 142734469, 142751167, 1..."
2,1003654,"[142169206, 142171450, 142734487, 142751184, 1..."
3,1003656,"[142169208, 142171452, 142734489, 142751185, 1..."
4,10069,"[140023430, 140099333, 143526052, 143574336, 1..."
...,...,...
7983,9912,"[138937876, 138980248, 138982691, 139001243, 1..."
7984,9915,"[138982842, 138998705, 139036791, 139039371, 1..."
7985,9919,"[138999034, 139037130, 140058258, 140060964, 1..."
7986,9920,"[138983202, 138999145, 139037235, 139039745, 1..."


The last dataframe that was returned is the summary dataframe, which gives some per-class summary statistics.

In [20]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,20767,106481,7988


Let us now make `difi` aware of the classes we defined earlier so we can take a look at population statistics.

In [21]:
# Any truth with 6 or more observations is considered findable
all_truths, findable_observations, summary = difi.analyzeObservations(observations,
                                                                      classes=classes,
                                                                      metric="min_obs",
                                                                      min_obs=6,
                                                                      column_mapping=column_mapping)

Now that we have handed our class definitions to the `analyzeObservations` function, our summary dataframe will have updated with a per-class summary of the truths that should be findable.

In [22]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,20767,106481,7988
1,MBA1,8737,75347,6912
2,MBA2,8737,27841,1076
3,NS,2346,2346,0
4,FD,947,947,0


As stated earlier, the `analyzeObservations` function has two built-in findability criteria. Let's take a look at the second one: 'nightly_linkages'. The `analyzeObservations` function calls a function that calculates which objects should be findable. The 'min_obs' metric from earlier is defined by `calcFindableMinObs`, while the 'nightly_linkages' metric is defined by `calcFindableNightlyLinkages`. Let's take a look at the latter to get an idea of the parameters we can configure.

In [23]:
difi.calcFindableNightlyLinkages?

Signature:
difi.calcFindableNightlyLinkages(
    observations,
    linkage_min_obs=2,
    max_obs_separation=0.0625,
    min_linkage_nights=3,
    column_mapping={'obs_id': 'obs_id', 'truth': 'truth', 'time': 'time', 'night': 'night'},
)
Docstring:
Finds the truths that have at least min_linkage_nights linkages of length
linkage_min_obs or more. Observations are considered to be in a possible intra-night
link

We have now run into a case where a metric requires additional columns in the observations file to determine what is and is not findable. In particular, the 'nightly_linkages' metric needs the time of each observation and the night during which the observation occurred. So let's make sure we have what we need in our observations and update our column_mapping accordingly.

In [24]:
# Current column_mapping
column_mapping

{'linkage_id': 'track_id', 'obs_id': 'det_id', 'truth': 'object_name'}

In [25]:
# Current observations
observations.head()

Unnamed: 0,det_id,field_id,object_name,ra_deg,dec_deg,epoch_mjd,mag,mag_sigma
0,137541512,1719609,4,171.39297,-14.23383,52391.002282,20.7856,0.105232
1,137541513,1719609,5,171.308411,-14.222651,52391.002282,19.7737,0.037289
2,137541533,1719609,24,171.105427,-13.838795,52391.002282,21.103,0.146543
3,137541550,1719609,41,171.097453,-14.453792,52391.002282,20.7811,0.104742
4,137541564,1719609,54,171.740297,-14.035202,52391.002282,18.1394,0.008938


The observations file already has the observation time (epoch_mjd), given as a Modified Julian Date (MJD), a time format used by astronomers with units of decimal days. Let us update our column_mapping dictionary to point to that column.

In [26]:
column_mapping["time"] = "epoch_mjd"

The last column we still need is the "night" column. This column should indicate the night during which each observation occurred so that it can be used to isolate nightly observations. Let's add that column:

Note: here we use a little bit of domain knowledge to calculate the night. The details don't matter beyond understanding that the night column is just a unique ID for each night of observation. A better dataset would likely have included this information already.

In [27]:
def calcNight(mjd, midnight=0.166):
    night = mjd + 0.5 - midnight
    return night.astype(int)

observations["night"] = calcNight(observations[column_mapping["time"]])
column_mapping["night"] = "night"
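To see what this arithmetic does, the sketch below applies the same formula to a few hand-picked MJD values. The function `calc_night_sketch` simply mirrors `calcNight`, and the sample times are made up. With `midnight=0.166`, the night ID changes when the fractional part of the MJD passes 0.666, so observations taken shortly before and shortly after MJD midnight share one night ID.

```python
import numpy as np


def calc_night_sketch(mjd, midnight=0.166):
    # Same arithmetic as calcNight: shift the times, then truncate to an integer night ID.
    return (mjd + 0.5 - midnight).astype(int)


# Made-up observation times straddling MJD 52391.0, plus one later than fraction 0.666.
sample_mjds = np.array([52390.95, 52391.002282, 52391.25, 52391.70])
sample_nights = calc_night_sketch(sample_mjds)
print(sample_nights)  # [52391 52391 52391 52392]
```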

In [28]:
all_truths, findable_observations, summary = difi.analyzeObservations(
    observations,
    classes=classes,
    metric="nightly_linkages",
    linkage_min_obs=2, # number of observations in a nightly linkage
    max_obs_separation=1.5/24, # maximum temporal separation between consecutive observations (in decimal days)
    min_linkage_nights=3, # minimum number of unique nights on which a nightly linkage can be made
    column_mapping=column_mapping)

In [29]:
summary

Unnamed: 0,class,num_members,num_obs,findable
0,All,20767,106481,6359
1,MBA1,8737,75347,5911
2,MBA2,8737,27841,448
3,NS,2346,2346,0
4,FD,947,947,0


Comparing this summary dataframe to the previous one shows that fewer objects are findable. This intuitively makes sense since the 'nightly_linkages' metric is much more restrictive.
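A simplified sketch of the idea behind 'nightly_linkages' shows why it is stricter. The function below is a hypothetical illustration, not `difi`'s implementation, and the exact rules `difi` applies (for example, how the separation between observations is measured) may differ. It assumes a night counts when it contains a run of at least `linkage_min_obs` observations whose consecutive gaps are all within `max_obs_separation`.

```python
import pandas as pd


def findable_nightly_sketch(observations, linkage_min_obs=2,
                            max_obs_separation=1.5 / 24, min_linkage_nights=3):
    """Hypothetical re-creation of the nightly-linkage idea on a small dataframe."""
    findable = []
    for truth, truth_obs in observations.groupby("object_name"):
        linkage_nights = 0
        for _, night_obs in truth_obs.groupby("night"):
            times = sorted(night_obs["epoch_mjd"])
            # Longest run of observations whose consecutive gaps are small enough.
            run = longest = 1
            for earlier, later in zip(times, times[1:]):
                run = run + 1 if later - earlier <= max_obs_separation else 1
                longest = max(longest, run)
            if longest >= linkage_min_obs:
                linkage_nights += 1
        if linkage_nights >= min_linkage_nights:
            findable.append(truth)
    return findable


# Toy data: both truths have six observations, but only "A" has pairs on three nights.
toy = pd.DataFrame({
    "object_name": ["A"] * 6 + ["B"] * 6,
    "night": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 4],
    "epoch_mjd": [1.00, 1.02, 2.00, 2.02, 3.00, 3.02,
                  1.00, 1.02, 2.00, 2.02, 3.00, 4.00],
})
nightly_findable = findable_nightly_sketch(toy)
print(nightly_findable)
```

In this toy case both truths would pass a 'min_obs' threshold of six, while only "A" passes the nightly criterion.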

### Analyzing Linkages (Did I Find It?)

We have described how to find the truths that should be findable by a linking algorithm. Now let's analyze actual linking algorithm performance by analyzing our linkages. As a reminder, our linkages are defined by the `linkage_members` dataframe:

In [30]:
linkage_members

Unnamed: 0,track_id,det_id
0,1,137541512
1,1,137543165
2,1,137615070
3,1,137620728
4,1,138216303
...,...,...
437114,50000,140131831
437115,50000,140168253
437116,50000,140241451
437117,50000,576903193


Our observations look as follows:

In [31]:
observations

Unnamed: 0,det_id,field_id,object_name,ra_deg,dec_deg,epoch_mjd,mag,mag_sigma,night
0,137541512,1719609,4,171.392970,-14.233830,52391.002282,20.7856,0.105232,52391
1,137541513,1719609,5,171.308411,-14.222651,52391.002282,19.7737,0.037289,52391
2,137541533,1719609,24,171.105427,-13.838795,52391.002282,21.1030,0.146543,52391
3,137541550,1719609,41,171.097453,-14.453792,52391.002282,20.7811,0.104742,52391
4,137541564,1719609,54,171.740297,-14.035202,52391.002282,18.1394,0.008938,52391
...,...,...,...,...,...,...,...,...,...
106476,647044590,1731686,NS002342,264.678804,-14.372913,52406.180694,23.3985,0.113214,52406
106477,147435568,1731816,861639,279.590348,-42.205197,52406.239688,23.7766,0.125809,52406
106478,647319299,1731840,NS002343,279.617153,-42.209205,52406.250551,23.5980,0.113748,52406
106479,647319475,1731840,NS002344,279.309504,-42.968246,52406.250551,24.0731,0.171468,52406


Using these two data products and a few keyword arguments we can analyze performance:

In [32]:
all_linkages, all_truths, summary = difi.analyzeLinkages(observations, 
                                                         linkage_members, 
                                                         classes=classes,
                                                         all_truths=all_truths,
                                                         min_obs=6, 
                                                         contamination_percentage=20, 
                                                         column_mapping=column_mapping)

The `analyzeLinkages` function returns three dataframes:
- all_linkages: each linkage is summarized as its own row. 
- all_truths: each truth is summarized as its own row. 
- summary: per-class summary statistics

Let's now take a look at each individually.

In [33]:
all_linkages

Unnamed: 0,track_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
0,1,7,1,1,0,0,0,0.0,1,0,1,4
1,10,6,2,0,0,0,1,,0,0,0,
2,100,8,2,0,0,0,1,,0,0,0,
3,1000,6,6,0,0,0,1,,0,0,0,
4,10000,6,6,0,0,0,1,,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...
49995,9995,9,3,0,0,0,1,,0,0,0,
49996,9996,8,4,0,0,0,1,,0,0,0,
49997,9997,9,3,0,0,0,1,,0,0,0,
49998,9998,8,3,0,0,0,1,,0,0,0,


In [34]:
all_linkages[all_linkages["pure"] == 1]

Unnamed: 0,track_id,num_obs,num_members,pure,pure_complete,partial,mixed,contamination_percentage,found_pure,found_partial,found,linked_truth
0,1,7,1,1,0,0,0,0.0,1,0,1,4
5,10001,9,1,1,0,0,0,0.0,1,0,1,5717
14,1001,10,1,1,1,0,0,0.0,1,0,1,2659
18,10013,9,1,1,1,0,0,0.0,1,0,1,5722
25,1002,10,1,1,1,0,0,0.0,1,0,1,2670
...,...,...,...,...,...,...,...,...,...,...,...,...
49909,9917,6,1,1,0,0,0,0.0,1,0,1,5977
49922,9929,6,1,1,0,0,0,0.0,1,0,1,5987
49941,9946,6,1,1,1,0,0,0.0,1,0,1,6002
49958,9961,9,1,1,0,0,0,0.0,1,0,1,6010


For each linkage in `linkage_members`, the all_linkages dataframe reports the number of unique observations ('num_obs'), the number of unique truths ('num_members'), whether the linkage is 'pure', 'partial' or 'mixed', the contamination percentage (if the linkage is considered 'partial'), and, if the linkage is either 'pure' or 'partial', the linked truth ('linked_truth').
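The two counting columns can be pictured as a join followed by a group-by. The sketch below uses made-up toy tables with the same column names as this tutorial; it illustrates the bookkeeping only and is not how `difi` computes these columns internally.

```python
import pandas as pd

toy_observations = pd.DataFrame({
    "det_id": ["1", "2", "3", "4"],
    "object_name": ["A", "A", "B", "NS000000"],
})
toy_linkage_members = pd.DataFrame({
    "track_id": ["1", "1", "2", "2", "2"],
    "det_id": ["1", "2", "2", "3", "4"],
})

# Attach the truth of every observation to each linkage member row.
merged = toy_linkage_members.merge(toy_observations, on="det_id", how="left")

# Count unique observations and unique truths per linkage.
per_linkage = merged.groupby("track_id").agg(
    num_obs=("det_id", "nunique"),
    num_members=("object_name", "nunique"),
).reset_index()
print(per_linkage)
```

Linkage "1" contains two observations of a single truth, while linkage "2" mixes three observations from three different truths.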

Here we briefly summarize the different linkage types possible:
- 'pure': a linkage where all constituent observations belong to a single truth. This linkage class is further subdivided into 'pure_complete' linkages, which are pure linkages that contain all of an object's observations in the given observations.
- 'partial': a linkage that contains observations belonging to multiple truths but 
    equal to or more than min_obs observations of one truth and no more than the contamination threshold
    of observations of other truths. For example, a linkage with ten observations, eight of which belong to
    a single unique truth and two of which belong to other truths has contamination percentage 20%. If the threshold
    is set to 20% or greater, and min_obs is less than or equal to eight then the truth with the eight observations
    is considered found and the linkage is considered a partial linkage.
- 'mixed': all linkages that are neither pure nor partial.
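The definitions above can be written as a small decision rule for a single linkage. This is a sketch of the rules as described in this list, assuming the most common truth in a linkage is the candidate linked truth; the function name `classify_linkage_sketch` is hypothetical, and `difi`'s actual implementation may handle edge cases differently.

```python
from collections import Counter


def classify_linkage_sketch(truths, min_obs=6, contamination_percentage=20.0):
    """Classify one linkage from the list of truths of its observations.

    Returns a tuple (linkage_type, linked_truth, contamination in percent).
    """
    counts = Counter(truths)
    if len(counts) == 1:
        # Every observation belongs to the same truth.
        return "pure", next(iter(counts)), 0.0
    truth, num_obs_of_truth = counts.most_common(1)[0]
    contamination = 100.0 * (len(truths) - num_obs_of_truth) / len(truths)
    if num_obs_of_truth >= min_obs and contamination <= contamination_percentage:
        return "partial", truth, contamination
    return "mixed", None, None


# The worked example from the list: eight observations of one truth, two of others.
example = ["7359"] * 8 + ["NS000001", "FD000002"]
print(classify_linkage_sketch(example))                               # ('partial', '7359', 20.0)
print(classify_linkage_sketch(example, contamination_percentage=10))  # ('mixed', None, None)
print(classify_linkage_sketch(["7359"] * 7))                          # ('pure', '7359', 0.0)
```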

In [35]:
all_truths

Unnamed: 0,object_name,num_obs,findable,found_pure,found_partial,found,pure,pure_complete,partial,partial_contaminant,mixed,obs_in_pure,obs_in_pure_complete,obs_in_partial,obs_in_partial_contaminant,obs_in_mixed
0,7359,26,1,2,1,3,2,0,1,0,64,34,0,12,0,359
1,7888,26,1,3,6,9,3,0,6,4,53,54,0,72,12,204
2,7928,26,1,3,3,6,3,0,3,9,59,50,0,44,27,209
3,7684,25,1,4,4,8,4,1,4,0,80,82,25,52,0,340
4,7704,25,1,3,2,5,3,0,2,0,59,51,0,31,0,258
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20762,NS002341,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
20763,NS002342,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
20764,NS002343,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1
20765,NS002344,1,0,0,0,0,0,0,0,0,2,0,0,0,0,2


The all_truths dataframe shows, for each truth, how many pure and partial linkages found it ('found_pure' and 'found_partial'). In the output above the 'found' column is the sum of the two, so any non-zero value means the truth was found.

Lastly, the summary dataframe contains overall statistics on the number of truths found, the completeness (if calculable) and so on...
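The completeness values in this tutorial's summary are consistent with the percentage of findable truths that were actually found, which is undefined when a class has no findable truths. The sketch below checks that reading against the 'All', 'MBA2' and 'NS' rows of the summary for this dataset. The helper name `completeness_percent_sketch` is hypothetical, and the formula is inferred from the numbers rather than taken from `difi`'s documentation.

```python
import math


def completeness_percent_sketch(findable_found, findable):
    """Percentage of findable truths that were found; NaN if nothing was findable."""
    if findable == 0:
        return math.nan
    return 100.0 * findable_found / findable


# (findable_found, findable) pairs copied from the summary dataframe of this tutorial.
all_completeness = completeness_percent_sketch(2801, 6359)
mba2_completeness = completeness_percent_sketch(12, 448)
ns_completeness = completeness_percent_sketch(0, 0)

print(round(all_completeness, 6))   # 44.047806
print(round(mba2_completeness, 6))  # 2.678571
print(ns_completeness)              # nan
```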

In [36]:
summary

Unnamed: 0,class,num_members,num_obs,completeness,findable,found,findable_found,findable_missed,not_findable_found,not_findable_missed,...,unique_in_partial_linkages_only,unique_in_pure_and_partial_linkages,unique_in_partial_linkages,unique_in_partial_contaminant_linkages,unique_in_mixed_linkages,obs_in_pure_linkages,obs_in_pure_complete_linkages,obs_in_partial_linkages,obs_in_partial_contaminant_linkages,obs_in_mixed_linkages
0,All,20767,106481,44.047806,6359,3116,2801,3558,315,14093,...,946,587,1533,1497,20188,41600,18290,26002,5532,363985
1,MBA1,8737,75347,47.183218,5911,3070,2789,3122,281,2545,...,900,587,1487,947,8328,41600,18290,25507,4355,298448
2,MBA2,8737,27841,2.678571,448,46,12,436,34,8255,...,46,0,46,480,8584,0,0,495,1053,59121
3,NS,2346,2346,,0,0,0,0,0,2346,...,0,0,0,54,2336,0,0,0,101,4682
4,FD,947,947,,0,0,0,0,0,947,...,0,0,0,16,940,0,0,0,23,1734


Notice that we passed the all_truths dataframe as a keyword argument to the `analyzeLinkages` function. This allows the function to access the 'findable' column and calculate completeness. You do not need to pass the all_truths dataframe nor the summary dataframe to use `analyzeLinkages`; without it, the columns that depend on findability, such as completeness, are left empty. Below is an example.

In [37]:
all_truths.drop(columns=["findable"], inplace=True)

all_linkages, all_truths, summary = difi.analyzeLinkages(observations, 
                                                         linkage_members, 
                                                         classes=classes,
                                                         all_truths=None,
                                                         min_obs=6, 
                                                         contamination_percentage=20, 
                                                         column_mapping=column_mapping)

In [38]:
summary

Unnamed: 0,class,num_members,num_obs,completeness,findable,found,findable_found,findable_missed,not_findable_found,not_findable_missed,...,unique_in_partial_linkages_only,unique_in_pure_and_partial_linkages,unique_in_partial_linkages,unique_in_partial_contaminant_linkages,unique_in_mixed_linkages,obs_in_pure_linkages,obs_in_pure_complete_linkages,obs_in_partial_linkages,obs_in_partial_contaminant_linkages,obs_in_mixed_linkages
0,All,20767,106481,,,3116,,,,,...,946,587,1533,1497,20188,41600,18290,26002,5532,363985
1,MBA1,8737,75347,,,3070,,,,,...,900,587,1487,947,8328,41600,18290,25507,4355,298448
2,MBA2,8737,27841,,,46,,,,,...,46,0,46,480,8584,0,0,495,1053,59121
3,NS,2346,2346,,,0,,,,,...,0,0,0,54,2336,0,0,0,101,4682
4,FD,947,947,,,0,,,,,...,0,0,0,16,940,0,0,0,23,1734


So what do all those columns track? The best way to find out is to check out the docstring:

In [39]:
difi.analyzeLinkages?

Signature:
difi.analyzeLinkages(
    observations,
    linkage_members,
    all_truths=None,
    min_obs=5,
    contamination_percentage=20.0,
    classes=None,
    column_mapping={'linkage_id': 'linkage_id', 'obs_id': 'obs_id', 'truth': 'truth'},
)
Docstring:
Did I Find It? 

Given a data frame of observations and a data frame defining possible link