# Using the `unionfind` library to group similar data points

A library I find useful for grouping similar objects, such as similar text strings, is `unionfind`, available at https://pypi.org/project/unionfind/. It has almost no documentation, but it is not hard to figure out, and it works nicely.

In [1]:
import unionfind
import pandas as pd
from math import sqrt

Consider the numbers {0, 1, 2, 3, 4, 5, 6, 7}, and suppose we want to group some of them together. We initialise a `unionfind` object with 8 numbers.

In [2]:
uf = unionfind.unionfind(8)  # There are 8 items.
uf

<unionfind.unionfind at 0x111908f28>

To start with, each of our 8 numbers is in a group of its own.

In [3]:
uf.groups()

[[0], [1], [2], [3], [4], [5], [6], [7]]

Let's merge the group containing 1 with the group containing 5:

In [4]:
uf.unite(1, 5)

Now 1 and 5 are in the same group:

In [5]:
uf.groups()

[[0], [2], [3], [4], [1, 5], [6], [7]]

Let's do some more merges:

In [6]:
uf.unite(2, 4)
uf.unite(0, 7)

In [7]:
uf.groups()

[[3], [2, 4], [1, 5], [6], [0, 7]]

If we `unite` two items that already belong to larger groups, _their whole groups_ are merged.

In [8]:
uf.unite(5, 0)

In [9]:
uf.groups()

[[3], [2, 4], [6], [0, 1, 5, 7]]

So you can see that `unionfind` is basically a way of keeping track of which things belong in the same group.
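
To make the idea concrete, here is a minimal pure-Python sketch of how a structure like this could work. It is not the `unionfind` library's actual implementation; the class name `SimpleUnionFind` and its internals are assumptions made for illustration. It replays the same merges as above.

```python
class SimpleUnionFind:
    """Toy union-find over the integers 0..n-1 (illustrative only)."""

    def __init__(self, n):
        # parent[i] == i means that i is the representative (root) of its group.
        self.parent = list(range(n))

    def find(self, x):
        # Walk up to the root, shortening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def unite(self, a, b):
        # Pointing one root at the other merges the two whole groups.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self):
        # Collect members under their root. The order of the groups may
        # differ from what the real library returns.
        by_root = {}
        for x in range(len(self.parent)):
            by_root.setdefault(self.find(x), []).append(x)
        return list(by_root.values())


toy = SimpleUnionFind(8)
for a, b in [(1, 5), (2, 4), (0, 7), (5, 0)]:
    toy.unite(a, b)

print(sorted(toy.groups()))  # [[0, 1, 5, 7], [2, 4], [3], [6]]
```

The groups match the `uf.groups()` output above, up to ordering.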

We can also make a dictionary mapping each number to its group:

In [10]:
{n: i for i, grp in enumerate(uf.groups()) for n in grp}

{3: 0, 2: 1, 4: 1, 6: 2, 0: 3, 1: 3, 5: 3, 7: 3}

The expression above is a dictionary comprehension with a nested `for` - **woot!**
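
If the nested comprehension is hard to read, it can be unrolled into two ordinary loops. The sketch below uses a literal copy of the groups list from the output above, so that it runs without the library.

```python
# Copied from the uf.groups() output above.
groups = [[3], [2, 4], [6], [0, 1, 5, 7]]

mapping = {}
for i, grp in enumerate(groups):  # outer "for": one pass per group
    for n in grp:                 # inner "for": one pass per member of that group
        mapping[n] = i

print(mapping)  # {3: 0, 2: 1, 4: 1, 6: 2, 0: 3, 1: 3, 5: 3, 7: 3}
```

The `for` clauses in the comprehension appear in the same order as the loops here: outer first, inner second.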

## A (slightly) bigger example

Suppose we have a dataframe of 2D coordinates, and we want to put into the same group any points that are "close together" - which for this example we will take to mean that their (Euclidean) distance is less than 0.15.

In [11]:
points_df = pd.Series([0.808365940, 0.625088961, 0.527152717, 0.875271894, 0.260115151, 0.868444106, 0.924971694,
                       0.774657217, 0.873412998, 0.360637736, 0.990485171, 0.448918051, 0.915996106, 0.986437399]
    ).to_frame(name='X')

points_df['Y'] = [
    0.324543772, 0.709069147, 0.545833882, 0.31255227, 0.749837826, 0.058352554, 0.148190481, 0.133989841,
    0.382651849, 0.153702295, 0.542930595, 0.308365616, 0.817682371, 0.356522304]

In [12]:
points_df.round(3)

|   | X | Y |
|---|---|---|
| 0 | 0.808 | 0.325 |
| 1 | 0.625 | 0.709 |
| 2 | 0.527 | 0.546 |
| 3 | 0.875 | 0.313 |
| 4 | 0.26 | 0.75 |
| 5 | 0.868 | 0.058 |
| 6 | 0.925 | 0.148 |
| 7 | 0.775 | 0.134 |
| 8 | 0.873 | 0.383 |
| 9 | 0.361 | 0.154 |


Our basic strategy is to make a `unionfind` object, then loop over pairs of the points, and `unite` any that are close together.

In [13]:
num_points = points_df.shape[0]
threshold = 0.15

uf = unionfind.unionfind(num_points)

for i in range(num_points):
    for j in range(i):
        if sqrt(  (points_df['X'].iloc[i] - points_df['X'].iloc[j]) ** 2 
                + (points_df['Y'].iloc[i] - points_df['Y'].iloc[j]) ** 2)  < threshold:
            uf.unite(i, j)
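
The nested loop in the cell above visits every unordered pair of rows exactly once. As a small check, the sketch below compares that loop pattern with `itertools.combinations` on a hypothetical size of 5 (the value of `n` is chosen only for illustration).

```python
from itertools import combinations

n = 5  # small hypothetical size, just for illustration

# Same loop shape as above: j always runs over the indices below i.
nested_pairs = [(j, i) for i in range(n) for j in range(i)]
combo_pairs = list(combinations(range(n), 2))

print(len(nested_pairs))                    # 10, i.e. n * (n - 1) / 2
print(sorted(nested_pairs) == combo_pairs)  # True
```

The number of comparisons grows quadratically: the 14 points here need 91 comparisons, which is fine, but a very large dataframe would need a smarter strategy.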

In [14]:
uf.groups()

[[0, 3, 8, 13], [1], [2], [4], [5, 6, 7], [9], [10], [11], [12]]
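
One thing worth noticing in this output is that grouping is transitive: two points can share a group without being close to each other, as long as a chain of close points connects them. The sketch below checks this for points 5, 6 and 7, using coordinates copied from the data above; the helper name `distance` is just for illustration.

```python
from math import hypot

# Coordinates of points 5, 6 and 7, copied from the data above.
p5 = (0.868444106, 0.058352554)
p6 = (0.924971694, 0.148190481)
p7 = (0.774657217, 0.133989841)


def distance(p, q):
    # Euclidean distance between two (x, y) tuples.
    return hypot(p[0] - q[0], p[1] - q[1])


threshold = 0.15
print(distance(p5, p6) < threshold)  # True
print(distance(p5, p7) < threshold)  # True
print(distance(p6, p7) < threshold)  # False: 6 and 7 are linked only through 5
```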

Now we use `groups` to add a new column to the dataframe telling us which rows were grouped together.

In [15]:
group_list = [None] * num_points
for grp, members in enumerate(uf.groups()):
    for m in members:
        group_list[m] = grp

points_df['Group'] = group_list

By ordering the dataframe by the `Group` column, we can reassure ourselves that close-together points have been grouped:

In [16]:
points_df.sort_values('Group').round(2)

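|   | X | Y | Group |
|---|---|---|---|
| 0 | 0.81 | 0.32 | 0 |
| 3 | 0.88 | 0.31 | 0 |
| 8 | 0.87 | 0.38 | 0 |
| 13 | 0.99 | 0.36 | 0 |
| 1 | 0.63 | 0.71 | 1 |
| 2 | 0.53 | 0.55 | 2 |
| 4 | 0.26 | 0.75 | 3 |
| 5 | 0.87 | 0.06 | 4 |
| 6 | 0.92 | 0.15 | 4 |
| 7 | 0.77 | 0.13 | 4 |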


Let's turn the above into a reusable function. `similarity_pred` should be a predicate on a pair of dataframe rows.

In [17]:
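def group_similar(df, similarity_pred):
    """Return a copy of df with an extra 'Group' column.

    Two rows get the same group number if similarity_pred is true for them,
    either directly or through a chain of similar rows.
    """
    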
    df = df.copy()
    num_rows = df.shape[0]
    
    uf = unionfind.unionfind(num_rows)

    for i in range(num_rows):
        for j in range(i):
            if similarity_pred(df.iloc[i], df.iloc[j]):
                uf.unite(i, j)
    
    group_list = [None] * num_rows
    for grp, members in enumerate(uf.groups()):
        for m in members:
            group_list[m] = grp

    df['Group'] = group_list
    
    return df

With this function we can repeat the same grouping of the points:

In [18]:
def nearby_points(point1, point2):
    
    threshold = 0.15
    
    return sqrt(  (point1['X'] - point2['X']) ** 2 
                + (point1['Y'] - point2['Y']) ** 2)  < threshold

In [19]:
points_df = points_df[['X', 'Y']]
points_df

|   | X | Y |
|---|---|---|
| 0 | 0.808366 | 0.324544 |
| 1 | 0.625089 | 0.709069 |
| 2 | 0.527153 | 0.545834 |
| 3 | 0.875272 | 0.312552 |
| 4 | 0.260115 | 0.749838 |
| 5 | 0.868444 | 0.058353 |
| 6 | 0.924972 | 0.14819 |
| 7 | 0.774657 | 0.13399 |
| 8 | 0.873413 | 0.382652 |
| 9 | 0.360638 | 0.153702 |


In [20]:
grouped_df = group_similar(points_df, nearby_points)

grouped_df.sort_values('Group').round(2)

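|   | X | Y | Group |
|---|---|---|---|
| 0 | 0.81 | 0.32 | 0 |
| 3 | 0.88 | 0.31 | 0 |
| 8 | 0.87 | 0.38 | 0 |
| 13 | 0.99 | 0.36 | 0 |
| 1 | 0.63 | 0.71 | 1 |
| 2 | 0.53 | 0.55 | 2 |
| 4 | 0.26 | 0.75 | 3 |
| 5 | 0.87 | 0.06 | 4 |
| 6 | 0.92 | 0.15 | 4 |
| 7 | 0.77 | 0.13 | 4 |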


If we want to keep only the first item in each group, we can do this:

In [21]:
grouped_df.drop_duplicates(subset=['Group'], keep='first') # Can also keep 'last', or False to drop all duplicates

|   | X | Y | Group |
|---|---|---|---|
| 0 | 0.808366 | 0.324544 | 0 |
| 1 | 0.625089 | 0.709069 | 1 |
| 2 | 0.527153 | 0.545834 | 2 |
| 4 | 0.260115 | 0.749838 | 3 |
| 5 | 0.868444 | 0.058353 | 4 |
| 9 | 0.360638 | 0.153702 | 5 |
| 10 | 0.990485 | 0.542931 | 6 |
| 11 | 0.448918 | 0.308366 | 7 |
| 12 | 0.915996 | 0.817682 | 8 |
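
For intuition about what keeping the first item does, here is a rough pandas-free sketch of the same selection. The `group_list` below is written out by hand from the `uf.groups()` output earlier, and `first_rows` is a name chosen only for illustration.

```python
# Group label of each row, in row order, read off the uf.groups() output above.
group_list = [0, 1, 2, 0, 3, 4, 4, 4, 0, 5, 6, 7, 8, 0]

first_rows = {}
for row, grp in enumerate(group_list):
    # setdefault stores a row only the first time its group label is seen.
    first_rows.setdefault(grp, row)

print(sorted(first_rows.values()))  # [0, 1, 2, 4, 5, 9, 10, 11, 12]
```

These are the same row labels that survive in the table above.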


## Exercise!

Combine the things we have discussed today:

 - choose a dataset of text strings where you're interested in finding similar strings - e.g. mentions, news headlines, location names
 - think about what kind of string similarity you're interested in, and pick which of the three string similarity measures you think will work best (or define another one!)
 - define a predicate based on your chosen similarity measure - this will be a function that takes two strings and returns `True` (for similar) or `False` (for not similar); to turn a function returning a numerical similarity value into a predicate, you'll need to set a numerical threshold
 - use the `group_similar` function above, along with your chosen string similarity predicate, to group together the similar strings
 - inspect the results and see whether it worked the way you wanted!
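
As a starting point, here is one possible shape for such a predicate. It is only a sketch under some assumptions: it uses `difflib.SequenceMatcher` from the standard library (which may or may not be one of the measures you have in mind), a threshold of 0.8 picked arbitrarily, and a hypothetical column name `"text"`. Because `group_similar` passes whole rows to its predicate rather than bare strings, the sketch also wraps the string predicate so that it looks up a column first.

```python
from difflib import SequenceMatcher


def similar_strings(s1, s2, threshold=0.8):
    # Case-insensitive similarity ratio in [0, 1], turned into True/False
    # by comparing it against a threshold (0.8 is an arbitrary choice).
    ratio = SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
    return ratio >= threshold


def on_column(string_pred, column):
    # Adapt a predicate on two strings into a predicate on two rows.
    def row_pred(row1, row2):
        return string_pred(row1[column], row2[column])
    return row_pred


# "text" is a hypothetical column name; plain dicts stand in for rows here.
text_pred = on_column(similar_strings, "text")

print(text_pred({"text": "New York City"}, {"text": "new york city"}))  # True
print(text_pred({"text": "London"}, {"text": "Paris"}))                 # False
```

With a dataframe that has a `"text"` column, `text_pred` could then be passed as the second argument of `group_similar`.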