# Hands-On Lab 2 - String Matching

#### Welcome to the second hands-on lab of this NLP Workshop. In this task, you will experiment with several algorithms for fuzzy string matching.

Import Packages

In [3]:
import pandas as pd
import numpy as np
from thefuzz import fuzz
import textdistance as tx

import tensorflow as tf

import tensorflow_hub as hub

from absl import logging

# Reduce logging output.
logging.set_verbosity(logging.ERROR)


In [4]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print(f"module {module_url} loaded")

def embed(texts):
    # Encode a list of strings into sentence-embedding vectors
    return model(texts)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


Basic examples (Token vs. Edit)

In [5]:

s1 = "hello hello world"
s2 = "hello world world"

# Textdistance - Edit
print("Edit-based similarity:", tx.levenshtein.similarity(s1, s2))
print("Edit-based distance:", tx.levenshtein.distance(s1, s2), "\n")

# Textdistance - Token (with the default qval=1, the tokens are single characters)
print("Token-based Jaccard:", tx.jaccard.similarity(s1, s2))
print("Token-based Cosine:", tx.cosine.similarity(s1, s2), "\n")

# Thefuzz
print("Thefuzz ratio:", fuzz.ratio(s1, s2))
print("Thefuzz partial ratio:", fuzz.partial_ratio(s1, s2))
print("Thefuzz token sort ratio:", fuzz.token_sort_ratio(s1, s2))
print("Thefuzz token set ratio:", fuzz.token_set_ratio(s1, s2), "\n")

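# Sentence embeddings (Universal Sentence Encoder)

message_embeddings1 = embed([s1])
message_embeddings2 = embed([s2])
# Inner product of the two embeddings; higher values mean closer meaning
np.inner(message_embeddings1, message_embeddings2)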



Edit-based similarity: 13
Edit-based distance: 4 

Token-based Jaccard: 0.7
Token-based Cosine: 0.8235294117647058 

Thefuzz ratio: 65
Thefuzz partial ratio: 65
Thefuzz token sort ratio: 65
Thefuzz token set ratio: 100 



array([[0.8538301]], dtype=float32)
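
To make the textdistance numbers above less opaque, here is a minimal pure-Python sketch of how they can be reproduced. It assumes the default character-level tokenization, and the helper names `levenshtein_distance` and `char_overlap` are hypothetical, not part of textdistance. The edit-based similarity is taken as the longer string's length minus the edit distance, and the two token-based scores are computed from character multisets.

```python
from collections import Counter
from math import sqrt


def levenshtein_distance(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def char_overlap(a, b):
    """Sizes of the character-multiset intersection and union."""
    count_a, count_b = Counter(a), Counter(b)
    shared = sum((count_a & count_b).values())
    union = sum((count_a | count_b).values())
    return shared, union


s1 = "hello hello world"
s2 = "hello world world"

distance = levenshtein_distance(s1, s2)
similarity = max(len(s1), len(s2)) - distance
shared, union = char_overlap(s1, s2)
jaccard = shared / union
cosine = shared / sqrt(len(s1) * len(s2))

print("distance:", distance, "similarity:", similarity)
print("jaccard:", jaccard, "cosine:", cosine)
```

Only the middle word differs between the two strings, and turning "hello" into "world" takes four substitutions, which is where the distance of 4 and the similarity of 13 come from. The Jaccard and cosine scores stay high because both strings are built from almost the same characters.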

Load Data

In [6]:
read_kwargs = {
    "header": 0,
    "index_col": 0,
    "skip_blank_lines": False,
    "names": ["meetup_names", "given_names"]
}

data = pd.read_csv("../data/fuzzy_names.csv", **read_kwargs).dropna()

In [7]:
data

Unnamed: 0,meetup_names,given_names
199666335,Lynn,Lynn Zhang
achang0319,Cheng,Cheng Chang
AimOnTheEl,A,Aimee Light
andheartsjaz,Jaz Sophia Viccarro,Jasmine Wilson
AusSeattle,Rene,Rene Haase
...,...,...
user 98524592,Kevin N,Kevin Nasto
user 98968492,TR,TR Tuccio
user 99224232,Ariel Greenway,Ariel Ann Greenway
wkeithvan,Wm. Keith van der Meulen,Keith van der Meulen


Calculate similarities

In [12]:

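def get_score(s1, s2):
    # Uncomment exactly one alternative to try a different metric.
    # Note: these textdistance scores are not on thefuzz's 0-100 scale,
    # so the threshold used later has to be adjusted accordingly.
    # return tx.hamming.similarity(s1, s2)
    # return tx.damerau_levenshtein.similarity(s1, s2)
    # return tx.levenshtein.similarity(s1, s2)
    # return tx.ratcliff_obershelp.similarity(s1, s2)
    # return tx.jaccard.similarity(s1, s2)
    return fuzz.token_set_ratio(s1, s2)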

data['score'] = data.apply(lambda x: get_score(x.meetup_names, x.given_names), axis=1)

In [13]:
data

Unnamed: 0,meetup_names,given_names,score
199666335,Lynn,Lynn Zhang,100
achang0319,Cheng,Cheng Chang,100
AimOnTheEl,A,Aimee Light,17
andheartsjaz,Jaz Sophia Viccarro,Jasmine Wilson,30
AusSeattle,Rene,Rene Haase,100
...,...,...,...
user 98524592,Kevin N,Kevin Nasto,83
user 98968492,TR,TR Tuccio,100
user 99224232,Ariel Greenway,Ariel Ann Greenway,100
wkeithvan,Wm. Keith van der Meulen,Keith van der Meulen,100
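
Several rows score 100 even though the two names differ in length, for example "Lynn" against "Lynn Zhang". The sketch below illustrates why, under stated assumptions: it is a simplified approximation of the token-set idea, not thefuzz's actual implementation. It uses `difflib.SequenceMatcher` as a stand-in string matcher and only lowercases and splits on whitespace, so punctuation handling is omitted. The names `ratio` and `token_set_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1, s2):
    """Compare the shared tokens against each side's full token set."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))

    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()

    # If one name's tokens are a subset of the other's, one of the
    # combined strings equals `common`, so the best pairing is a perfect match.
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )


print(token_set_ratio_sketch("Lynn", "Lynn Zhang"))
print(token_set_ratio_sketch("Kevin N", "Kevin Nasto"))
print(token_set_ratio_sketch("A", "Aimee Light"))
```

Because a name whose tokens are fully contained in the other always reaches the maximum score, this metric rewards partial names such as a first name alone. That suits this dataset, but it can also produce confident matches between different people who share a first name.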


Select the best matches based on score

In [14]:
threshold = 80
best_matches = data[data['score'] > threshold]

Calculate the ratio of selected matches

In [15]:
best_matches.shape[0] / data.shape[0]

0.6782608695652174
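
The matched fraction depends on the chosen threshold, so it can help to check a few values before settling on one. The sketch below is illustrative only: it uses just the nine scores visible in the preview table above rather than the full dataset, and `match_ratio` is a hypothetical helper that mirrors the strict `>` comparison used in the selection step.

```python
# Scores copied from the nine rows shown in the preview table;
# the full dataset has more rows, so these fractions are illustrative.
preview_scores = [100, 100, 17, 30, 100, 83, 100, 100, 100]


def match_ratio(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    if not scores:
        return 0.0
    return sum(score > threshold for score in scores) / len(scores)


for threshold in (50, 80, 90):
    print(threshold, round(match_ratio(preview_scores, threshold), 3))
```

On the real data the same idea can be applied by passing `data['score']` converted to a list, or by repeating the boolean-mask selection for each threshold.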