<img src="https://complexity.asu.edu/sites/default/files/comses.jpg" alt="comses.net's logo" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>

# **Entity Resolution of Tags on ComSES.net using Dedupe**

## **🚀 Introduction** 
><br/>   
> ❓ ComSES.net is a digital repository that supports discovery and good practices for software citation, digital preservation, reproducibility, and reuse.    <br/>
> Visit it <a href="https://www.comses.net/">here</a>   <br/>
> <br/>

<br/>
On ComSES.net, tags play an important role. They allow users to search for codebases that they are interested in:
<img src="public/tag-search.png" alt="searching for tags" style="border: 4px solid  gray; display: flex; justify-content: center;">
<sub> <strong>Figure 1. Searching for codebases using tags</strong> </sub>

There is also a cool metrics page where researchers can see how the use of various technologies has changed over time:
<img src="public/metrics-page.png" alt="ComSES.net's metrics page" height=400 style="border: 4px solid  gray; display: flex; justify-content: center;">

<sub><strong> Figure 2. ComSES.net's awesome metrics page</strong></sub>

Because so many technologies are available for research computing, ComSES does not enforce any rules on how researchers tag their codebases when they create them.


## **❓ Problem**
Allowing researchers to write their own tags is wonderful because it lets them describe their codebases more precisely. However, it causes a few problems:

1. Codebases with misspelled or abbreviated tags are hard to search for.
2. Different versions of the same technology aren't grouped together on ComSES.net's metrics page.  
   For example, on the metrics page, we would ideally treat NetLogo 6.1.1, NetLogo 6.2.1 and NetLogo 6.2.2 as the same technology and display metrics accordingly.


## **🕵️‍♀️ Existing strategies**
### **1. NLTK's Porter Stemmer**


> <br/>  
> ❓ NLTK is a leading platform for building Python programs to work with human language data. <br/>
> For more information, read the <a href="https://www.nltk.org">documentation</a> for NLTK    <br/>
> <br/>

Use NLTK's Porter stemmer to remove common suffixes and group tags accordingly.
<img src="image-6.png" alt="NLTK's Porter stemmer applied to variants of the word connect" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**NLTK's Porter stemmer removing common suffixes from variants of the word connect**</sub>

Using the approach above, connect, connected, connection and connecting are reduced to the same stem and are therefore treated as the same tag. However, this method does not solve either of the problems identified in the **Problem** section.  
**1. Small spelling mistakes have drastic consequences:**
<img src="image-5.png" alt="your-image-description" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**NLTK's porter-stemmer fails to account for spelling mistakes**</sub>

**2. Different versions of the same software are not grouped accordingly:**
<img src="image-7.png" alt="your-image-description" style="border: 4px solid  gray; display: flex; justify-content: center;" height=200>
<sub>**Different versions of the same software are not grouped accordingly**</sub>

### **2. Regex**
> <br/>  
> ❓ Regular expressions (regex) are concise patterns for text manipulation and matching.   <br/>
> <br/>




Use regex to group tags that differ only in their version number.  
This approach works reasonably well since most technologies have standard ways of specifying version numbers.

However, this method is labor-intensive: programmers have to write a regular expression for every technology they would like grouped. ComSES.net currently has 3000+ tags in production, so this does not scale. It also means that every time a new technology shows up on ComSES.net, a programmer has to be notified to write a new regex for it.
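
To make the trade-off concrete, here is a minimal sketch of the regex approach. The pattern and the helper names (`VERSION_SUFFIX`, `normalize_tag`, `group_by_base_name`) are illustrative assumptions, not the expressions used on ComSES.net. The idea is simply to strip a trailing version number so that different releases collapse to one name.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a separator, an optional "v", and a dotted version
# number at the end of the tag, e.g. "NetLogo 6.1.1" or "netlogo v6.2".
VERSION_SUFFIX = re.compile(r"[\s\-_]+v?\d+(?:\.\d+)*\s*$", re.IGNORECASE)


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and drop a trailing version number, if any."""
    stripped = VERSION_SUFFIX.sub("", tag.strip())
    # Fall back to the original tag if stripping left nothing behind.
    return (stripped or tag).strip().lower()


def group_by_base_name(tags):
    """Map each normalized name to the list of raw tags that share it."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalize_tag(tag)].append(tag)
    return dict(groups)


tags = ["NetLogo 6.1.1", "NetLogo 6.2.1", "NetLogo 6.2.2", "Python 3", "Netlogo"]
groups = group_by_base_name(tags)
print(groups)
```

The sketch also shows where this approach breaks down: a misspelled tag such as `Netlgo 6.1` still ends up in its own group, and a tag such as `covid-19` would be shortened to `covid`, so each pattern needs hand-tuning.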



## **🤖 Using Machine Learning**
We want to group similar tags. In machine learning, this task is called [entity resolution](https://www.ibm.com/docs/en/iii/9.0.0?topic=insight-entity-resolution). 
> <br/>  
> ❓ Entity resolution is the process of identifying and merging   
> duplicate or similar records in a dataset.   <br/>
> <br/>

After searching the web for various solutions, we found one that fit our requirements: Dedupe. 
<img src="image-10.png" alt="your-image-description" style="border: 4px solid  gray; display: flex; justify-content: center;" height=150>

Dedupe is wonderful because it accounts for misspellings and abbreviations.  
It can also group different versions of the same technology without specialized regexes, as long as it is given enough training data.

Taken straight from the website:   
> <br/>  
> 🤓 "Dedupe.io is a powerful tool that learns the best way to find similar rows in your data. Using cutting-edge research in machine learning we quickly and accurately identify matches in your Excel spreadsheet or database—saving you time and money."  
> Check out the <a href="https://dedupe.io/">website</a> for more information <br/>
> <br/>

Dedupe is available both as a hosted service (SaaS) and as a Python library. We decided to use the Python library and took advantage of the following classes:
- **Dedupe** - A class for active-learning deduplication. It clusters tags that it believes map to the same entity.
- **Gazetteer** - A class for active-learning gazetteer matching. It matches a messy dataset against a 'canonical dataset'.

### **⭐ Starting out**

Getting started with the `Dedupe` class is pretty easy.  
You simply provide a list of columns and specify the variable type of each column.  
In our case, we only had a single column:


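```python
%pip install dedupe
```

```python
import dedupe

# One entry per column; our tags only have a 'name' column.
# Dict-style definitions are the form accepted by dedupe 2.x; newer releases
# may expect dedupe.variables.String('name') instead.
columns = [
    {'field': 'name', 'type': 'String'},
]
deduper = dedupe.Dedupe(columns)
```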


### **💪 Training the clustering model**
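Before it can cluster anything, Dedupe needs examples of tag pairs that should (or should not) be mapped together.  
After `prepare_training` has sampled candidate pairs from your records, the Dedupe class has a simple function which returns the pair of tags that it is most uncertain about:

```python
# `data` maps record ids to records; see the Clustering section for its format.
deduper.prepare_training(data)

pair = deduper.uncertain_pairs()
print(pair)
# [({'name': 'agent-based model'}, {'name': 'agent based model'})]
```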




After this, you pass the label back to the model. Since the two tags above match, we do the following:
```python
labeled_examples = {
    "match": [({'name': 'agent-based model'}, {'name': 'agent based model'})],
    "distinct": [],
}
deduper.mark_pairs(labeled_examples)
```

On ComSES.net, we created a command line interface for labelling data.
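
A labelling loop along those lines can be sketched in a few lines. This is not the ComSES.net tool: `collect_labels`, `prompt_user` and the `ask` callback are hypothetical names, and the only thing taken from the text above is the shape of the dictionary passed to `mark_pairs`.

```python
def collect_labels(pairs, ask):
    """Sort candidate pairs into the dictionary shape used for labelling.

    pairs: iterable of (record_a, record_b) tuples, shaped like the
           output of uncertain_pairs() shown above.
    ask:   callable taking two tag names and returning "y" (same entity),
           "n" (different entities), or anything else to skip the pair.
    """
    labeled = {"match": [], "distinct": []}
    for record_a, record_b in pairs:
        answer = ask(record_a["name"], record_b["name"]).strip().lower()
        if answer == "y":
            labeled["match"].append((record_a, record_b))
        elif answer == "n":
            labeled["distinct"].append((record_a, record_b))
        # Any other answer leaves the pair unlabelled.
    return labeled


def prompt_user(name_a, name_b):
    """Ask on the terminal whether two tags refer to the same thing."""
    return input(f"Do '{name_a}' and '{name_b}' match? [y/n/skip] ")


def scripted_answer(name_a, name_b):
    """Stand-in for a human: tags match if they differ only by hyphens."""
    same = name_a.replace("-", " ") == name_b.replace("-", " ")
    return "y" if same else "n"


# Example run with scripted answers instead of a real prompt.
pairs = [
    ({"name": "agent-based model"}, {"name": "agent based model"}),
    ({"name": "netlogo"}, {"name": "python"}),
]
labels = collect_labels(pairs, scripted_answer)
print(labels)
```

In an interactive session you would pass `prompt_user` as `ask` and hand the resulting dictionary to `mark_pairs`. The dedupe library also offers a built-in `console_label` helper for terminal labelling, which may be enough for simple cases.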


### **✨Clustering✨**
You will need to label quite a few more pairs before your model can start working.  
Once you are done labelling, you train the model and move on to the true magic: you pass in the records you would like to cluster and set a threshold.
Your records should be formatted as a dictionary that maps record ids to records:
```python
data = {
    0: {'name': 'agents'},
    1: {'name': 'agent-based models'},
    2: {'name': 'agent-based models (abms)'},
    3: {'name': 'agent-based model'},
    4: {'name': 'agants'},
}
```
```python
# Fit the model on the labelled pairs, then cluster the records.
deduper.train()
duplicates = deduper.partition(data, threshold=0.5)
print(duplicates)
# [
#     ((1, 2, 3), (0.790, 0.860, 0.790)),
#     ((0, 4), (0.720, 0.720)),
# ]
```

The return value is a list of tuples, one per cluster. Each tuple has the form  
(ids of the records that are grouped together, confidence value for each id).  
In this scenario, these are the groupings:
Group 1:
```python
{'name': 'agent-based models'}
{'name': 'agent-based models (abms)'}
{'name': 'agent-based model'}
```

Group 2:
```python
{'name': 'agents'}
{'name': 'agants'}
```
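
Mapping the ids back to tags is mechanical. The helper below is a small sketch (the name `clusters_to_tags` is ours, not part of Dedupe) that takes output shaped like the `duplicates` list above and returns the tag names in each group together with their confidence values.

```python
def clusters_to_tags(clusters, records):
    """Replace record ids with tag names, keeping each id's confidence."""
    groups = []
    for record_ids, scores in clusters:
        group = [
            (records[record_id]["name"], float(score))
            for record_id, score in zip(record_ids, scores)
        ]
        groups.append(group)
    return groups


records = {
    0: {"name": "agents"},
    1: {"name": "agent-based models"},
    2: {"name": "agent-based models (abms)"},
    3: {"name": "agent-based model"},
    4: {"name": "agants"},
}
# Same shape and values as the example output above.
clusters = [
    ((1, 2, 3), (0.790, 0.860, 0.790)),
    ((0, 4), (0.720, 0.720)),
]

for number, group in enumerate(clusters_to_tags(clusters, records), start=1):
    print(f"Group {number}:")
    for name, score in group:
        print(f"  {name} ({score:.2f})")
```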


## **🔨 Workflow**
Using these classes, we implemented a reasonably simple workflow that lets a curator cluster the tags.
![Human-in-the-loop workflow](image-1.png)  
<sub>Diagram of the human-in-the-loop workflow</sub>

Every time the curator labels more tags, the model has more labelled data to learn from and gets better and better!


## **👩‍🚀 Future work**
It is definitely quite a pain to do 
