## Text UniSim Demo

This demo showcases how to use Text UniSim (TextSim) for efficient fuzzy string matching, near-duplicate detection, and string similarity using a real-world entity matching dataset.

For additional information, please see the documentation on [GitHub](https://github.com/google/unisim). For more details on the RETSim model used by UniSim, please see the [RETSim paper](https://arxiv.org/abs/2311.17264).

In [1]:
# installing needed dependencies
import os

# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import unisim
except ImportError:
    !pip install unisim

try:
    import datasets
except ImportError:
    !pip install datasets




In [2]:
import onnxruntime as rt

rt.get_device()

'GPU'

In [3]:
# imports
from datasets import load_dataset
from tabulate import tabulate
import pandas as pd

## Load Dataset

For this demo, we use the entity matching datasets available on [Hugging Face](https://huggingface.co/datasets/RUC-DataLab/ER-dataset). In this colab, we use the `restaurants1.csv` dataset, which contains restaurant names, phone numbers, and addresses.

Feel free to explore other examples they offer such as product matching (`walmart_amazon.csv`), paper citation matching (`dblp_scholar.csv`), and beer brands (`beer.csv`). The public datasets are from [DeepMatcher](https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md), [Magellan](https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository) and [WDC](http://webdatacommons.org/largescaleproductcorpus/v2/) and you can find a summary of them [here](https://github.com/ruc-datalab/DADER/tree/main).

In [4]:
!pip show unisim

Name: unisim
Version: 1.0.0
Summary: UniSim: Universal Similarity
Home-page: https://github.com/google/unisim
Author: Google
Author-email: unisim@google.com
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: jaxtyping, numpy, onnx, onnxruntime-gpu, pandas, tabulate, tqdm, usearch
Required-by: 


In [5]:
# import TextSim from UniSim
from unisim import TextSim


In [None]:
# load dataset from huggingface
ds_name = "restaurants1.csv"
dataset = load_dataset("RUC-DataLab/ER-dataset", data_files=ds_name, split="train")

print("Size of dataset:", len(dataset))

dataset_features = list(dataset.features.keys())
print("Dataset features:", dataset_features)

Repo card metadata block was not found. Setting CardData to empty.


Size of dataset: 450
Dataset features: ['A_NAME', 'A_PHONENUMBER', 'A_ADDRESS', 'B_NAME', 'B_PHONENUMBER', 'B_ADDRESS', 'label']


In [5]:
# get features corresponding to pairs of texts in the dataset (text1, text2)
text1_features = [x for x in dataset_features if x.startswith("A")]
text2_features = [x for x in dataset_features if x.startswith("B")]
is_match_feature = "label"

# create text pairs for example at idx in the dataset
def get_text_pair(idx):
    ex = dataset[idx]

    text1 = " ".join(str(ex[x]) for x in text1_features)
    text2 = " ".join(str(ex[x]) for x in text2_features)

    label = ex[is_match_feature]
    return [text1, text2, label]


#### Initialize TextSim

TextSim supports GPU acceleration with a specified `batch_size` parameter. If a GPU is not detected, TextSim will default to CPU. By default, TextSim saves a copy of your dataset, but you can set `store_data=False` to save memory when using larger datasets.

Additionally, TextSim supports Approximate Nearest Neighbor (ANN) search through [USearch](https://github.com/unum-cloud/usearch). Setting `index_type="approx"` will make TextSim significantly faster on large datasets (sub-linear search time). However, please note that while ANN search is very accurate, it does not guarantee that it will always find the closest match to a search query.

In [6]:
# create TextSim using default parameter settings
text_sim = TextSim(
    store_data=True, # set to False for large datasets to save memory
    index_type="exact", # set to "approx" for large datasets to use ANN search
    batch_size=128, # increasing batch_size on GPU may be faster
    use_accelerator=True, # uses GPU if available, otherwise uses CPU
)


#### Computing Similarity between Strings

You can compute the similarity between two strings using the `.similarity(string1, string2)` method. The similarity value is a float between 0 and 1, with 1.0 representing identical strings. This is the cosine similarity between the vector representations of the strings.
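
As a rough illustration of the cosine similarity mentioned above, the sketch below computes it by hand for toy vectors. The three-dimensional vectors and the helper name `cosine_similarity` are assumptions made for this example; real TextSim embeddings come from the RETSim model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# toy 3-dimensional "embeddings" (hypothetical values, not model outputs)
emb_a = [1.0, 2.0, 2.0]
emb_b = [2.0, 4.0, 4.0]   # points in the same direction as emb_a
emb_c = [2.0, -1.0, 0.0]  # orthogonal to emb_a

sim_same_direction = cosine_similarity(emb_a, emb_b)
sim_orthogonal = cosine_similarity(emb_a, emb_c)
print(sim_same_direction, sim_orthogonal)
```

Vectors that point in the same direction score 1, regardless of their length, while orthogonal vectors score 0.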

In this example, we compute the similarity between the first 5 pairs of the dataset.

In [7]:
example_data = [get_text_pair(idx) for idx in range(0, 5)]

for i in range(len(example_data)):
    text1, text2, is_match = example_data[i]  # ground truth is_match label

    # compute similarity between text pair using .similarity
    similarity = text_sim.similarity(text1, text2)

    example_data[i].append(similarity)

# display results in df
df = pd.DataFrame(example_data, columns=["text1", "text2", "match_label", "similarity"])
display(df.head())

| | text1 | text2 | match_label | similarity |
|---|---|---|---|---|
| 0 | 15 Romolo (415) 398-1359 15 Romolo Place, San ... | 15 Romolo (415) 398-1359 15 Romolo Pl, San Fra... | 1 | 0.971126 |
| 1 | 456 Shanghai Cuisine 1261 69 Mott Street, New ... | Shanghai Asian Manor (212) 766-6311 21 Mott St... | 0 | 0.77969 |
| 2 | 5A5 Steak Lounge (415) 989-2539 244 Jackson St... | Delicious Dim Sum (415) 781-0721 752 Jackson S... | 0 | 0.68784 |
| 3 | 9th Street Pizza (213) 627-7798 231 E 9th St, ... | Han Bat Sul Lung Tang (213) 383-9499 4163 W 5t... | 0 | 0.610615 |
| 4 | 9th Street Pizza (213) 627-7798 231 E 9th St, ... | Jun Won Restaurant (213) 383-8855 3100 W 8th S... | 0 | 0.617358 |


The first pair has a similarity of 0.97, which indicates that the two strings are near-duplicates: they share a name and phone number, and their addresses differ only in an abbreviation (`Place` vs. `Pl`). The other pairs score between 0.61 and 0.78, and their labels confirm that they do not represent the same entity.
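
To make the comparison with the ground-truth labels concrete, here is a small sketch that thresholds the five similarity values shown in the output above. The cutoff of `0.9` is an assumption borrowed from the `similarity_threshold` used later in this demo, and five examples are far too few to say anything about accuracy in general.

```python
# (match_label, similarity) pairs copied from the output above
rows = [
    (1, 0.971126),
    (0, 0.77969),
    (0, 0.68784),
    (0, 0.610615),
    (0, 0.617358),
]

threshold = 0.9  # assumed cutoff, the same value passed to .match later in the demo

# predict "match" whenever the similarity reaches the threshold
predictions = [int(sim >= threshold) for _, sim in rows]
labels = [label for label, _ in rows]
num_correct = sum(p == y for p, y in zip(predictions, labels))

print("predictions:", predictions)
print("labels:     ", labels)
print(f"{num_correct}/{len(rows)} pairs classified correctly")
```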

#### Fuzzy String Matching

TextSim offers efficient fuzzy string matching between two lists using the `.match` function. The `.match` function accepts `queries` (list of strings you want to find matches for) and `targets` (list of strings you are finding matches in). It returns a Pandas DataFrame, where each row contains a query, its most similar match found in targets, their similarity, and whether or not they are a match (if their similarity is >= `similarity_threshold`). `0.9` is a good starting point for `similarity_threshold` when matching near-duplicate strings.

In this example, we show that TextSim is able to match restaurant addresses accurately even when there are typos, abbreviations, and formatting differences.

In [8]:
# targets to match queries to
targets = [
    "Shanghai Asian Manor (212) 766-6311 21 Mott St, New York, NY 10013",
    "Delicious Dim Sum (415) 781-0721 752 Jackson St, San Francisco, CA 94133",
    "15 Romolo (415) 398-1359 15 Romolo Pl, San Francisco, CA 94133",
]

# search queries we are looking up and finding matches for
queries = [
    "Shanghai asia manor (212)-766-6311 21 Mott street, New York, NY 94133", # near-dup match (capitalization, typos, different format)
    "Googleplex (650) 253-0000 1600 Amphitheatre Pkwy, Mountain View, CA 94043", # no match
    "Sino-american books & arts (415) 421-3345 751 Jackson St, San Francisco, CA 94133", # no match, different places but similar address
]

# .match does fuzzy matching between queries and targets lists
results_df = text_sim.match(queries, targets, similarity_threshold=0.9)

# display results dataframe
with pd.option_context('display.max_colwidth', None):
    display(results_df.head(10))

| | query | target | similarity | is_match |
|---|---|---|---|---|
| 0 | Shanghai asia manor (212)-766-6311 21 Mott street, New York, NY 94133 | Shanghai Asian Manor (212) 766-6311 21 Mott St, New York, NY 10013 | 0.907005 | True |
| 1 | Googleplex (650) 253-0000 1600 Amphitheatre Pkwy, Mountain View, CA 94043 | 15 Romolo (415) 398-1359 15 Romolo Pl, San Francisco, CA 94133 | 0.466495 | False |
| 2 | Sino-american books & arts (415) 421-3345 751 Jackson St, San Francisco, CA 94133 | Delicious Dim Sum (415) 781-0721 752 Jackson St, San Francisco, CA 94133 | 0.746311 | False |
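
The best-match-plus-threshold logic behind the `is_match` column can be sketched without TextSim. This is only a conceptual illustration: `toy_similarity` and `toy_match` are hypothetical helpers, and the character-based `difflib.SequenceMatcher` ratio stands in for the RETSim embedding similarity, so the scores it produces will not agree with TextSim's.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the RETSim embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_match(queries, targets, similarity_threshold=0.9):
    """For each query, keep its most similar target and flag it if it clears the threshold."""
    rows = []
    for query in queries:
        best_target = max(targets, key=lambda t: toy_similarity(query, t))
        best_sim = toy_similarity(query, best_target)
        rows.append({
            "query": query,
            "target": best_target,
            "similarity": best_sim,
            "is_match": best_sim >= similarity_threshold,
        })
    return rows


toy_targets = ["15 Romolo Pl, San Francisco, CA", "21 Mott St, New York, NY"]
toy_queries = [
    "15 romolo pl, san francisco, ca",            # differs only in capitalization
    "1600 Amphitheatre Pkwy, Mountain View, CA",  # unrelated address
]

toy_rows = toy_match(toy_queries, toy_targets)
for row in toy_rows:
    print(row)
```

Every query always gets a closest target; the threshold alone decides whether that target is reported as a match.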


We can try this on the whole dataset now. We use the first text in each pair as the target and the second text as the search query.

In [9]:
targets = list(set([get_text_pair(idx)[0] for idx in range(0, len(dataset))]))
queries = list(set([get_text_pair(idx)[1] for idx in range(0, len(dataset))]))

print("Dataset examples:")
print("\n".join(targets[:5]))

Dataset examples:
State Street Brats (608) 255-5544 603 State St, Madison, WI
Quaker Steak & Lube (608) 831-5823 2259 Deming Way, Middleton, WI
9th Street Pizza (213) 627-7798 231 E 9th St, Los Angeles, CA
Han 202 (312) 949-1314 605 W. 31st Street, Chicago, IL
Maharana (608) 246-8525 1707 Thierer Rd, Madison, WI


In [10]:
results_df = text_sim.match(queries, targets)

with pd.option_context('display.max_colwidth', None):
    display(results_df.head(10))

| | query | target | similarity | is_match |
|---|---|---|---|---|
| 0 | Winton Deli Cafe (510) 786-2444 2042 W Winton Ave, Hayward, CA 94545 | Winton Deli (510) 786-2444 2042 W Winton Avenue, Hayward, CA | 0.931267 | True |
| 1 | Han Bat Sul Lung Tang (213) 383-9499 4163 W 5th St, Los Angeles, CA 90020 | Tinga (323) 954-9566 142 S La Brea, Los Angeles, CA | 0.705754 | False |
| 2 | Soot Bull Jeep (213) 387-3865 3136 W 8th St, Los Angeles, CA 90005 | Dong Il Jang (213) 383-5757 3455 W 8th St, Los Angeles, CA | 0.758085 | False |
| 3 | Gam Ja Gol (213) 381-6446 3003 W Olympic Blvd, Los Angeles, CA 90006 | Ramen Hayatemaru (310) 212-0055 11678 W Olympic Blvd, Los Angeles, CA | 0.769117 | False |
| 4 | Cup & Cup (646) 398-9990 15 E 31st St, New York, NY 10016 | 456 Shanghai Cuisine 1261 69 Mott Street, New York, NY | 0.692922 | False |
| 5 | Asiento (415) 829-3375 2730 21st St, San Francisco, CA 94110 | Asiento (415) 829-3375 2730 21st Street, San Francisco, CA | 0.947136 | True |
| 6 | Sake Lounge (608) 467-7770 406 W Gilman St, Madison, WI 53703 | Blue Velvet Lounge (608) 250-9900 430 W Gilman St, Madison, WI | 0.778796 | False |
| 7 | Chris A.'s review of Arts and Crafts Beer Parlo (646) 678-5263 26 W 8th St, New York, NY 10011 | Natsumi Bar And Lounge (212) 258-2988 226 W 50th Street, New York, NY | 0.631072 | False |
| 8 | Delicious Dim Sum (415) 781-0721 752 Jackson St, San Francisco, CA 94133 | Delarosa (415) 673-7100 2175 Chestnut Street, San Francisco, CA | 0.710805 | False |
| 9 | Local Mission Eatery (415) 655-3422 3111 24th St, San Francisco, CA 94110 | La Santaneca (415) 648-1034 3781 Mission Street, San Francisco, CA | 0.763338 | False |


### Indexing and Searching for Similar Texts in a Dataset

TextSim allows you to maintain, update, and query a large index to find similar texts. This gives you more control over indexing and querying your dataset, including how many similar texts you want to retrieve per query and detailed results.

You can use the `.add` method to add examples from your dataset to the index, then use `.search` to search the index and return the most similar texts to your search query.
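
To show the add-then-search workflow in isolation, the sketch below implements a tiny brute-force index in plain Python. Everything here is hypothetical: `ToyIndex` is not TextSim's implementation, and the character-bigram counts in `toy_embed` are a crude stand-in for learned embeddings. It compares the query against every stored text, which is the linear-time behavior that exact search implies.

```python
import math
from collections import Counter


def toy_embed(text):
    """Character-bigram counts; a crude stand-in for a learned text embedding."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Hypothetical exact (brute-force) index; not TextSim's implementation."""

    def __init__(self):
        self.texts, self.embeddings = [], []

    def add(self, texts):
        """Embed new texts and append them to the index."""
        self.texts.extend(texts)
        self.embeddings.extend(toy_embed(t) for t in texts)

    def reset(self):
        """Drop everything that was previously added."""
        self.texts, self.embeddings = [], []

    def search(self, query, k=5):
        """Return up to k (idx, similarity, text) tuples, most similar first."""
        q = toy_embed(query)
        scored = [(cosine(q, e), idx) for idx, e in enumerate(self.embeddings)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(idx, sim, self.texts[idx]) for sim, idx in scored[:k]]


index = ToyIndex()
index.add([
    "Winton Deli 2042 W Winton Avenue, Hayward, CA",
    "Cafe Zoma 2326 Atwood Ave, Madison, WI",
])
hits = index.search("winton deli cafe 2042 w winton ave, hayward, ca", k=1)
print(hits)
```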

In [11]:
# resets the index if you previously added things
text_sim.reset_index()

# adds the dataset of target examples to the index
text_sim.add(targets)

# for each query, search for the k=5 most similar texts
result_collection = text_sim.search(queries, similarity_threshold=0.9, k=5)

In [12]:
# texts are considered near-duplicate matches if their similarity >= similarity_threshold
total_matches = result_collection.total_matches
print("Total matches found:", total_matches)

Total matches found: 103


`result_collection.results` contains a list of results corresponding to each query. Each `Result` object contains the results of a search query, including the number of matches found (`.num_matches`), the idx/data/embedding of the query (`.query_idx`, `.query_data`, `.query_embedding`), and a list of `Match` objects (`.matches`).

The list of `Match` objects corresponds to the `k` most similar texts found for the query, sorted by similarity (most similar first). Each `Match` object contains info on whether it is a near-duplicate match (`.is_match`), the rank (`.rank`), the data (`.data`), the similarity value (`.similarity`), and the embedding (`.embedding`) of the matched text.
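
A sketch of how these attributes might be traversed is shown below. So that it runs on its own, it defines stand-in `Result` and `Match` dataclasses that carry only a subset of the attributes described above; they are not the UniSim classes, and the `rank` values are illustrative. Assuming the real objects expose the same attribute names, the helper `matched_pairs` (a hypothetical name) could be called on `result_collection.results` in the same way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Match:
    """Stand-in carrying a subset of the Match attributes described above."""
    rank: int
    data: str
    similarity: float
    is_match: bool


@dataclass
class Result:
    """Stand-in carrying a subset of the Result attributes described above."""
    query_idx: int
    query_data: str
    matches: List[Match] = field(default_factory=list)

    @property
    def num_matches(self):
        return sum(m.is_match for m in self.matches)


def matched_pairs(results):
    """Collect (query, matched text, similarity) for every near-duplicate match."""
    pairs = []
    for result in results:
        for match in result.matches:
            if match.is_match:
                pairs.append((result.query_data, match.data, match.similarity))
    return pairs


# hand-built example; texts and similarities are taken from this demo's outputs
demo_results = [
    Result(
        query_idx=0,
        query_data="Winton Deli Cafe (510) 786-2444 2042 W Winton Ave, Hayward, CA 94545",
        matches=[
            Match(0, "Winton Deli (510) 786-2444 2042 W Winton Avenue, Hayward, CA", 0.93, True),
            Match(1, "Cafe Zoma (608) 246-2009 2326 Atwood Ave, Madison, WI", 0.61, False),
        ],
    )
]

pairs = matched_pairs(demo_results)
print(pairs)
```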

You can visualize a search result using `text_sim.visualize(result)`.

In [13]:
# visualize results for each query using .visualize
query_idx = 0
result = result_collection.results[query_idx]
text_sim.visualize(result)

Query 0: "Winton Deli Cafe (510) 786-2444 2042 W Winton Ave, Hayward, CA 94545"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  ----------------------------------------------------------------
   49  True                0.93  Winton Deli (510) 786-2444 2042 W Winton Avenue, Hayward, CA
  205  False               0.64  Barriques Coffee (608) 268-6264 127 W Washington Ave, Madison, W
  254  False               0.61  Cafe Zoma (608) 246-2009 2326 Atwood Ave, Madison, WI
  166  False               0.61  Subway (608) 441-6887 2850 University Ave, Madison, WI
   34  False               0.61  Taco Bell (608) 249-7312 4120 E Washington Ave, Madison, WI


In [14]:
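# visualize the first query that has at least one near-duplicate match
first_matching_result = next(
    (r for r in result_collection.results if r.num_matches > 0), None
)

# guard against the case where no query clears the similarity threshold
if first_matching_result is None:
    print("No query has a match at this similarity threshold.")
else:
    text_sim.visualize(first_matching_result)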

Query 0: "Winton Deli Cafe (510) 786-2444 2042 W Winton Ave, Hayward, CA 94545"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  ----------------------------------------------------------------
   49  True                0.93  Winton Deli (510) 786-2444 2042 W Winton Avenue, Hayward, CA
  205  False               0.64  Barriques Coffee (608) 268-6264 127 W Washington Ave, Madison, W
  254  False               0.61  Cafe Zoma (608) 246-2009 2326 Atwood Ave, Madison, WI
  166  False               0.61  Subway (608) 441-6887 2850 University Ave, Madison, WI
   34  False               0.61  Taco Bell (608) 249-7312 4120 E Washington Ave, Madison, WI


You can keep adding examples and querying your index after you create it. This is useful for production use-cases, where you have incoming data or frequently need to query your index.

In [15]:
# add new example to the index
new_examples = ["Googleplex (650) 253-0000 1600 Amphitheatre Parkway, Mountain View, CA 94043"]
text_sim.add(new_examples)

# search for the example, with typos in our query
result_collection = text_sim.search(["googleplx 650-253-0000 1600 amphitheatre parkway, mountain view, ca 94043"], k=5)

result = result_collection.results[0]
text_sim.visualize(result)

Query 0: "googleplx 650-253-0000 1600 amphitheatre parkway, mountain view, ca 94043"
Most similar matches:

  idx  is_match      similarity  text
-----  ----------  ------------  ----------------------------------------------------------------
  304  True                0.92  Googleplex (650) 253-0000 1600 Amphitheatre Parkway, Mountain Vi
  195  False               0.52  KFC (608) 849-5004 600 W Main St, Waunakee, WI
    6  False               0.51  Gus's Diner (608) 318-0900 630 N Westmount Dr, Sun Prairie, WI
   29  False               0.5   Sweet Maple (415) 655-9169 2101 Sutter Street, San Francisco, CA
  194  False               0.49  Pho Nam (608) 836-7040 610 Junction Rd Suite 109, Madison, WI
