<img src="https://complexity.asu.edu/sites/default/files/comses.jpg" alt="comses.net's logo" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>

# **Entity Resolution of Tags on ComSES.net using Dedupe**

## **🚀 Introduction** 
><br/>   
> ❓ ComSES.net is a digital repository that supports discovery and good practices for software citation, digital preservation, reproducibility, and reuse.    <br/>
> Visit it <a href="https://www.comses.net/">here</a>   <br/>
> <br/>

<br/>
On ComSES.net, tags play an important role. They allow users to search for codebases that they are interested in:
<img src="public/tag-search.png" alt="searching for tags" style="border: 4px solid  gray; display: flex; justify-content: center;">
<sub> <strong>Figure 1. Searching for codebases using tags</strong> </sub>

There is also a cool metrics page where researchers can see how the use of various technologies has changed over time:
<img src="public/metrics-page.png" alt="metrics page" height=400 style="border: 4px solid  gray; display: flex; justify-content: center;">

<sub><strong> Figure 2. ComSES.net's awesome metrics page</strong></sub>

Since there are a great number of technologies available to use for research computing, ComSES does not enforce any rules in the way codebases are tagged by researchers during the codebase creation phase.


## **❓ Problem**
Allowing researchers to write their own tags is wonderful since it allows researchers to better describe their codebase. However, a few problems arise from this scenario.
1. Codebases with misspelled or abbreviated tags are hard to search for.
2. Different versions of the same technology aren't grouped together on ComSES.net's metrics page.
  For example, on the metrics page, we would ideally like to treat NetLogo 6.1.1, NetLogo 6.2.1 and NetLogo 6.2.2 as the same thing and to display metrics accordingly.
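
To make the second problem concrete, here is a minimal sketch of exact-match counting, which is roughly what happens when tags are compared as plain strings. The tag list is made up for illustration and is not taken from ComSES.net.

```python
from collections import Counter

# Hypothetical tags, as researchers might type them on different codebases.
tags = [
    "NetLogo 6.1.1",
    "NetLogo 6.2.1",
    "NetLogo 6.2.2",
    "NetLogo 6.2.2",
    "Netlogo",
]

# Exact-match counting treats every distinct string as its own technology.
counts = Counter(tags)
for tag, count in counts.most_common():
    print(f"{count}  {tag}")
```

All five tags refer to NetLogo, yet the counter reports four separate technologies.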


## **🕵️‍♀️ Existing strategies**
### **1. NLTK's Porter Stemmer**


> <br/>  
> ❓ NLTK is a leading platform for building Python programs to work with human language data. <br/>
> For more information, read the <a href="https://www.nltk.org">documentation</a> for NLTK    <br/>
> <br/>

Use NLTK's Porter stemmer to remove common suffixes and group tags by their stem.
<img src="public/nltk-intro.png" alt="NLTK's Porter stemmer applied to variants of the word connect" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**NLTK's Porter stemmer removing common suffixes from variants of the word connect**</sub>

Using the approach above, connect, connected, connection and connecting are treated as the same tag. However, this method solves neither of the problems identified in the **Problem** section.  
**1. Small spelling mistakes have drastic consequences:**
<img src="public/nltk-failure.png" alt="NLTK's Porter stemmer producing a different stem for a misspelled word" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**NLTK's Porter stemmer fails to account for spelling mistakes**</sub>

**2. Different versions of the same software are not grouped accordingly:**
<img src="public/nltk-versions.png" alt="NLTK's Porter stemmer leaving different versions of the same software ungrouped" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**Different versions of the same software are not grouped accordingly**</sub>
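
The behaviour shown in the screenshots can be reproduced in spirit without NLTK. The sketch below is not the Porter algorithm: `naive_stem` is a hypothetical, deliberately tiny suffix stripper that stands in for it, and the suffix list is an assumption chosen for this example only.

```python
from collections import defaultdict

# A tiny illustrative subset of suffixes; the real Porter stemmer has many more rules.
SUFFIXES = ("ing", "ion", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix. A toy stand-in for a real stemmer."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def group_by_stem(tags):
    """Group tags whose stems are identical."""
    groups = defaultdict(list)
    for tag in tags:
        groups[naive_stem(tag)].append(tag)
    return dict(groups)


groups = group_by_stem([
    "connect", "connected", "connection", "connecting",  # share one stem
    "conected",                                           # misspelling
    "NetLogo 6.1.1", "NetLogo 6.2.1",                      # two versions
])
for stem, members in groups.items():
    print(stem, "->", members)
```

The four correctly spelled variants land in one group, while the misspelling and each NetLogo version end up in groups of their own. Suffix stripping only helps when the tags differ by a known suffix.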

### **2. Regex**
> <br/>  
> ❓ Regular expressions (regex) are concise patterns for text manipulation and matching.   <br/>
> <br/>




Use regex to group software of various version numbers.  
This approach works reasonably well since most technologies have standard ways of specifying version numbers.

However, this method is labor-intensive: programmers must write a regular expression for every technology they would like grouped. ComSES.net currently has 3000+ tags in production, so this does not scale. Every time a new technology shows up on ComSES.net, a programmer also has to be notified to write a new regex for it.  
Like stemming, this method fails to account for small misspellings.
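
As a rough illustration of the regex approach and its limits, the sketch below strips a trailing version number from each tag and groups on what is left. The pattern and the `strip_version` helper are assumptions made for this example, not the expressions used on ComSES.net.

```python
import re
from collections import defaultdict

# Assumed pattern: optional whitespace, an optional "v", then dotted digits
# at the very end of the tag (e.g. " 6.1.1", " v2.0", "17").
VERSION_SUFFIX = re.compile(r"\s*v?\d+(?:\.\d+)*$", re.IGNORECASE)


def strip_version(tag: str) -> str:
    """Remove a trailing version number, if there is one."""
    return VERSION_SUFFIX.sub("", tag).strip()


tags = [
    "NetLogo 6.1.1",
    "NetLogo 6.2.1",
    "NetLogo 6.2.2",
    "C++17",
    "Netlogoo 6.1.1",  # misspelled on purpose
]

grouped = defaultdict(list)
for tag in tags:
    grouped[strip_version(tag)].append(tag)

for name, members in grouped.items():
    print(name, "->", members)
```

The three NetLogo versions are grouped, but the misspelled `Netlogoo 6.1.1` still ends up alone, and any technology with an unusual versioning scheme would need a pattern of its own.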



## **🤖 Using Machine Learning**
We want to group similar tags. In machine learning, this task is called [entity resolution](https://www.ibm.com/docs/en/iii/9.0.0?topic=insight-entity-resolution). 
> <br/>  
> ❓ Entity resolution is the process of identifying and merging   
> duplicate or similar records in a dataset.   <br/>
> <br/>

After searching the web for various solutions, we found one that fit our specifications. 
<img src="public/dedupe-logo.png" alt="Dedupe logo" style="border: 4px solid  gray; display: flex; justify-content: center;" height=150>

Dedupe is wonderful because it accounts for misspellings and abbreviations.  
It can also group different versions of the same technology without specialized regexes, as long as it is given enough training data.
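
The intuition behind tolerating misspellings is that tags are compared by similarity rather than by exact equality. The sketch below does not reproduce how Dedupe scores pairs. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the `similarity` helper is a name introduced here for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("agents", "agants"),      # misspelling
    ("comses", "comses.net"),  # abbreviation
    ("agents", "NetLogo"),     # unrelated tags
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

A misspelled tag still scores close to its correct spelling, while unrelated tags score low. A learned model can then decide where the cut-off between "same" and "different" should be.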

Taken straight from the website:   
> <br/>  
> 🤓 "Dedupe.io is a powerful tool that learns the best way to find similar rows in your data. Using cutting-edge research in machine learning we quickly and accurately identify matches in your Excel spreadsheet or database—saving you time and money."  
> Check out the <a href="https://dedupe.io/">website</a> for more information <br/>
> <br/>

Dedupe is both a SaaS and a Python library. We decided to use the Python library and took advantage of the following classes:
- **Dedupe** - A class for active-learning deduplication. It clusters tags that it believes map to the same entity.
- **Gazetteer** - A class for active-learning gazetteer matching. It matches a messy dataset against a 'canonical dataset'.

## **⭐ Clustering**

It is pretty easy to get started with the `Dedupe` class.  
You provide a list of fields and specify the variable type of each one.  
In our case, there is only a single field, `name`:


In [1]:
# %pip install dedupe

In [2]:
import dedupe

columns = [
    {"field": "name", "type": "String"},
]
deduper = dedupe.Dedupe(columns)

For this tutorial, we will use the following small dataset:

In [3]:
data = {
    0: {"name": "agent-based model"},
    1: {"name": "agent based model"},
    2: {"name": "ajent base moder (abms)"},
    3: {"name": "comses"},
    4: {"name": "comses.net"},
    5: {"name": "NetLogo 6.1.1"},
    6: {"name": "NetLogo 6.1.2"},
    7: {"name": "NetLogo 6.2.1"},
    8: {"name": "C#"},
    9: {"name": "C++15"},
    10: {"name": "C++16"},
    11: {"name": "C++17"},
    12: {"name": "The University of Waterloo"},
    13: {"name": "Arizona State University"},
    14: {"name": "Arizona International University"}
}

Dedupe needs the data before it can propose pairs to label.  
We pass it in as a dictionary keyed by record id:  

In [4]:
deduper.prepare_training(data)


### **💪 Training the clustering model**
To start out, you have to tell Dedupe which pairs of tags should be mapped together and which should not.  
The `Dedupe` class has a method that returns the pair of tags it is most uncertain about:


In [5]:
pair = deduper.uncertain_pairs()
print(pair)

[({'name': 'comses'}, {'name': 'comses.net'})]


After this, you pass the label back into the model. Since the two tags above refer to the same thing, we mark the pair as a match:

In [6]:
labeled_examples = {
    "match": pair,
    "distinct": [
    ],
}
deduper.mark_pairs(labeled_examples)

There is a nifty little console labelling function that dedupe provides:

In [None]:
dedupe.console_label(deduper)
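
When labels are already available somewhere, for example from an earlier curation pass, the same loop can be driven from code instead of the console. The sketch below is an assumption-laden outline: `label_pairs` and `same_first_word` are hypothetical helpers, and `FakeModel` is a stand-in that mimics only the `uncertain_pairs` and `mark_pairs` methods used above so the example runs without a trained model. It also assumes that `uncertain_pairs` raises `IndexError` once no candidate pairs are left.

```python
def label_pairs(model, decide, max_pairs=10):
    """Label up to max_pairs uncertain pairs with a decide(left, right) callback."""
    labeled = 0
    while labeled < max_pairs:
        try:
            pairs = model.uncertain_pairs()
        except IndexError:  # assumed signal that the candidate pool is empty
            break
        examples = {"match": [], "distinct": []}
        for left, right in pairs:
            label = "match" if decide(left, right) else "distinct"
            examples[label].append((left, right))
        model.mark_pairs(examples)
        labeled += len(pairs)
    return labeled


class FakeModel:
    """Stand-in with the two methods used above. Not part of dedupe."""

    def __init__(self, pairs):
        self._queue = list(pairs)
        self.marked = {"match": [], "distinct": []}

    def uncertain_pairs(self):
        return [self._queue.pop(0)]  # raises IndexError when the queue is empty

    def mark_pairs(self, examples):
        for label in ("match", "distinct"):
            self.marked[label].extend(examples[label])


def same_first_word(left, right):
    """Hypothetical rule: two tags match if their first words are equal."""
    first = lambda record: record["name"].split()[0].lower()
    return first(left) == first(right)


fake = FakeModel([
    ({"name": "NetLogo 6.1.1"}, {"name": "NetLogo 6.2.1"}),
    ({"name": "C#"}, {"name": "C++17"}),
])
count = label_pairs(fake, same_first_word)
print(count, fake.marked)
```

A rule this crude would defeat the purpose of active learning in practice. The point is only the shape of the loop: fetch a pair, decide, and pass the decision back under `"match"` or `"distinct"`.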

After labelling enough instances, you can start training the model like so:

In [8]:
deduper.train()

### **✨Clustering✨**
You will need to label more than one pair before the model starts producing useful results.  
Once you are done with this, you can move on to the true magic. You pass in the records you would like to cluster and set a threshold.
The records should be formatted as a dictionary keyed by record id:

In [9]:
data = {
    0: {'name': 'agents'},
    1: {'name': 'agent-based models'},
    2: {'name': 'agent-based models (abms)'},
    3: {'name': 'agent-based model'},
    4: {'name': 'agants'},
}

In [10]:
duplicates = deduper.partition(data, threshold=0.5)
duplicates

[((0, 4), (0.96180147, 0.96180147)),
 ((1, 2, 3), array([0.99858896, 0.99842318, 0.99854703]))]

The output is not self-explanatory. Each element is a tuple of the form
(tuple of ids of the records grouped together, sequence of confidence scores, one per id).
We use a small helper function to display the clusters.

In [11]:
def display_clusters(partitions):
    for index, partition in enumerate(partitions):
        print(f"GROUP {index}")
        for element in partition[0]:
            print(data[element])
        print("\n")

display_clusters(duplicates)

GROUP 0
{'name': 'agents'}
{'name': 'agants'}


GROUP 1
{'name': 'agent-based models'}
{'name': 'agent-based models (abms)'}
{'name': 'agent-based model'}

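For anything beyond eyeballing the result, it can help to flatten the clusters into one row per tag. The sketch below is one possible way to do that. `cluster_rows` is a helper introduced here, and the scores from the output above are written as plain tuples so the example runs without NumPy.

```python
# Records and cluster output copied from the example above.
data = {
    0: {"name": "agents"},
    1: {"name": "agent-based models"},
    2: {"name": "agent-based models (abms)"},
    3: {"name": "agent-based model"},
    4: {"name": "agants"},
}
duplicates = [
    ((0, 4), (0.96180147, 0.96180147)),
    ((1, 2, 3), (0.99858896, 0.99842318, 0.99854703)),
]


def cluster_rows(partitions, records):
    """Turn (record_ids, scores) clusters into one dictionary per record."""
    rows = []
    for cluster_id, (record_ids, scores) in enumerate(partitions):
        for record_id, score in zip(record_ids, scores):
            rows.append({
                "cluster_id": cluster_id,
                "record_id": record_id,
                "name": records[record_id]["name"],
                "confidence": float(score),
            })
    return rows


rows = cluster_rows(duplicates, data)
for row in rows:
    print(row)
```

Rows in this shape are easy to write to a CSV file or hand to a curator for review.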

🎉 **That's all in terms of clustering, let's move on to gazetteering now!**

## **⭐ Canonicalization / Gazetteering**

It is just as easy to get started with the `Gazetteer` class.  
As before, you provide a list of fields and specify the variable type of each one.  
In our case, there is only a single field:

In [12]:
columns = [
    {'field' : 'name', 'type': 'String'},
]
gazetteer = dedupe.Gazetteer(columns)

We will use the following small dataset for the examples:

In [13]:
data = {
    0: {"name": "C# 12"},
    1: {"name": "C# 11"},
    2: {"name": "C++ 17"},
    3: {"name": "C++ 18"},
    4: {"name": "Java 19"},
    5: {"name": "Java 12"},
}

canonical_data = {
    6: {"name": "C#"},
    7: {"name": "C++"},
    8: {"name": "Java"}
}

We first prepare the training data, much as we did for `Dedupe`, but this time we pass in both the messy data and the canonical data:

In [14]:
gazetteer.prepare_training(data, canonical_data)

### **💪 Training the model**
To start out, you have to tell Gazetteer which pairs of tags should be mapped together and which should not.  
The `Gazetteer` class has a method that returns the pair of tags it is most uncertain about:

In [15]:
pair = gazetteer.uncertain_pairs()
print(pair)

[({'name': 'C++ 17'}, {'name': 'C++'})]


After this, you pass the label back into the model. Since the two tags above refer to the same thing, we mark the pair as a match:


In [16]:
labeled_examples = {
    "match": pair,
    "distinct": [
    ],
}
gazetteer.mark_pairs(labeled_examples)

Like the Dedupe class, there's also a nifty little console label function:

In [None]:
dedupe.console_label(gazetteer)

Then, we train the model on the labelled data like so:

In [18]:
gazetteer.train()
gazetteer.index(canonical_data)

### **🤓 Canonicalization / Gazetteering**
You pass in the records you would like to map to the canonical list indexed above.
The records should be formatted as a dictionary keyed by record id:

In [19]:
test_data = {
    0: {"name": "C# 12"},
    1: {"name": "C# 11"},
    2: {"name": "C++ 17"},
}

In [20]:
matches = gazetteer.search(test_data, threshold=0.5)
print(matches)

[(0, ((6, 0.821512),)), (1, ((6, 0.821512),)), (2, ((7, 0.8383985),))]


As with clustering, the return value is not self-explanatory. Each element has the form
(id from test_data, tuple of matches), where each match is a pair (id from canonical_data, confidence score).
We made a small visualizer for this:

In [21]:
def display_gazetteer_results(matches):
    for match in matches:
        print("TAG:", test_data[match[0]], "  CANONICAL TAG: ", end="")
        for canonical_tag in match[1]:
            print(canonical_data[canonical_tag[0]], end = "")
        print("")
display_gazetteer_results(matches)

TAG: {'name': 'C# 12'}   CANONICAL TAG: {'name': 'C#'}
TAG: {'name': 'C# 11'}   CANONICAL TAG: {'name': 'C#'}
TAG: {'name': 'C++ 17'}   CANONICAL TAG: {'name': 'C++'}
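
To actually rewrite tags, the matches can be turned into a lookup from messy tag to canonical tag. The sketch below is one way to do it. `canonical_names` is a helper introduced here, and the `Fortran 90` record with its empty match tuple is a made-up case that assumes unmatched records come back with no candidates.

```python
canonical_data = {
    6: {"name": "C#"},
    7: {"name": "C++"},
    8: {"name": "Java"},
}
test_data = {
    0: {"name": "C# 12"},
    1: {"name": "C# 11"},
    2: {"name": "C++ 17"},
    3: {"name": "Fortran 90"},  # hypothetical tag with no canonical entry
}
# First three entries copied from the output above; the last one is assumed.
matches = [
    (0, ((6, 0.821512),)),
    (1, ((6, 0.821512),)),
    (2, ((7, 0.8383985),)),
    (3, ()),
]


def canonical_names(matches, messy, canonical):
    """Map each messy tag to its best canonical tag, or None if unmatched."""
    mapping = {}
    for record_id, candidates in matches:
        name = messy[record_id]["name"]
        if candidates:
            # Pick the candidate with the highest confidence score.
            best_id, _score = max(candidates, key=lambda candidate: candidate[1])
            mapping[name] = canonical[best_id]["name"]
        else:
            mapping[name] = None  # leave for a curator to review
    return mapping


mapping = canonical_names(matches, test_data, canonical_data)
print(mapping)
```

Tags that map to `None` are natural candidates for a curator to review and, if appropriate, add to the canonical list.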


## **🔨 Workflow**
Using these classes, we implemented a reasonably simple workflow for a curator to cluster the tags.

First, the curator clusters the tags:  
![Human-in-the-loop workflow for clustering](public/dedupe-workflow.png)  
<sub>Diagram of the human-in-the-loop workflow for clustering</sub>

<br/>

However, after creating the initial canonical list, it's usually a better idea to use the gazetteer to map new tags to the canonical list.
This workflow is fairly similar to the one for clustering:  
![Human-in-the-loop workflow for gazetteering](public/gazetteer-workflow.png)  
<sub>Diagram of the human-in-the-loop workflow for gazetteering</sub>



## **🙌 Conclusion**
We've covered how to cluster duplicate tags with `Dedupe` and how to map new tags onto a canonical list with `Gazetteer`. With duplicate tags merged, codebases become easier to search for, and metrics group different spellings and versions of the same technology together.