## Finding duplicates using the recordlinkage toolkit

In [1]:
# Import the necessary libraries
import recordlinkage
import pandas as pd

**Step 1:** Read the sample category dataset

In [2]:
sample_category_df = pd.read_csv('./Category_list.csv')
sample_category_df

Unnamed: 0.1,Unnamed: 0,categories,category_link
0,0,Halal,https://www.ubereats.com/gb/search?kn=Halal&pl...
1,1,Pizza,https://www.ubereats.com/gb/search?kn=Pizza&pl...
2,2,Breakfast and brunch,https://www.ubereats.com/gb/search?kn=Breakfas...
3,3,Chinese,https://www.ubereats.com/gb/search?kn=Chinese&...
4,4,[],https://www.ubereats.com/gb/search?carid=eyJwb...
...,...,...,...
149,149,Latin fusion,https://www.ubereats.com/gb/search?kn=LatinFus...
150,150,Northeastern Thai,https://www.ubereats.com/gb/search?kn=Northeas...
151,151,Gluten-free friendly,https://www.ubereats.com/gb/search?kn=GlutenFr...
152,152,Belgian,https://www.ubereats.com/gb/search?kn=Belgian&...


**Step 2:** Find the empty category values and clean them


In [3]:
# Drop rows whose category is the empty-list placeholder '[]'
sample_category_df = sample_category_df[sample_category_df['categories'] != '[]']

**Step 3:** Create a record linkage indexer


In [4]:
indexer = recordlinkage.Index()

**Step 4:** Define a blocking rule so that only records sharing the same `categories` value are paired


In [5]:
indexer.block('categories')

<Index>

**Step 5:** Generate pairs of records to compare for potential duplicates


In [6]:
pairs = indexer.index(sample_category_df)
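Steps 4 and 5 can be pictured in plain Python. The sketch below is not how recordlinkage implements blocking internally. It only illustrates the idea of grouping records by a key and pairing records within each group. The names `block_pairs`, `full_pair_count` and the `toy` data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(values):
    """Return candidate pairs (larger_id, smaller_id) whose blocking keys are identical.

    `values` maps a record id to its blocking key.
    """
    blocks = defaultdict(list)
    for record_id, key in values.items():
        blocks[key].append(record_id)

    pairs = []
    for ids in blocks.values():
        # Every combination of two records inside the same block is a candidate pair.
        for smaller, larger in combinations(sorted(ids), 2):
            pairs.append((larger, smaller))
    return sorted(pairs)


def full_pair_count(n):
    """Number of pairs if every record were compared with every other record."""
    return n * (n - 1) // 2


# Illustrative miniature of the category column (not the real dataset).
toy = {0: "Halal", 1: "Pizza", 6: "Convenience", 77: "Convenience", 152: "Belgian"}

print(block_pairs(toy))           # [(77, 6)]
print(full_pair_count(len(toy)))  # 10
```

Blocking cuts the number of comparisons sharply, but because the blocking key here is the same column that is later compared, only records with identical category names become candidate pairs. Near-duplicates such as a singular and plural spelling would never be paired under this rule.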

**Step 6:** Create a comparison object to specify how to compare records


In [7]:
compare_cl = recordlinkage.Compare()

**Step 7:** Add a string comparison on `categories` using the Levenshtein method with a threshold of 0.65


In [8]:
compare_cl.string('categories', 'categories', method='levenshtein', threshold=0.65)

<Compare>
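To make the threshold concrete, here is a minimal pure-Python sketch of a thresholded Levenshtein comparison. It assumes the similarity is normalised as 1 minus the edit distance divided by the longer string's length, which is a common convention. The helper names are hypothetical and this is not the library's own code.

```python
def levenshtein(a, b):
    """Edit distance: fewest single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def thresholded(a, b, threshold=0.65):
    """Return 1.0 when the similarity reaches the threshold, otherwise 0.0."""
    return 1.0 if similarity(a, b) >= threshold else 0.0


print(thresholded("Convenience", "Convenience"))  # 1.0
print(thresholded("Pizza", "Pizzas"))             # 1.0
print(thresholded("Halal", "Pizza"))              # 0.0
```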

**Step 8:** Compute the comparison features for the pairs of records

In [9]:
features = compare_cl.compute(pairs, sample_category_df)

**Step 9:** Classification - Set a threshold to classify duplicates


In [10]:
# Keep pairs whose summed score reaches 1; a cutoff of 0 would keep every pair
matches = features[features.sum(axis=1) >= 1]

In [11]:
matches

Unnamed: 0,Unnamed: 1,0
77,6,1.0
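The classification step amounts to summing each pair's feature scores and keeping the pairs that reach a cutoff. The plain-Python sketch below illustrates this with a hypothetical feature table. The pairs other than (77, 6) are invented for illustration, and `classify` is not a recordlinkage function.

```python
def classify(features, min_score):
    """Keep the pairs whose summed feature score is at least `min_score`."""
    return {pair: scores for pair, scores in features.items()
            if sum(scores) >= min_score}


# One 0/1 feature per candidate pair, mirroring a single string comparison.
features = {(77, 6): [1.0], (3, 1): [0.0], (150, 149): [0.0]}

print(classify(features, min_score=1))       # {(77, 6): [1.0]}
print(len(classify(features, min_score=0)))  # 3
```

Scores are never negative, so a cutoff of 0 lets every candidate pair through. With a single 0/1 feature, a cutoff of 1 keeps only the pairs that passed the comparison.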


Identify the duplicate values.

In [12]:
sample_category_df['categories'].loc[77]

'Convenience'

In [13]:
sample_category_df['categories'].loc[6]

'Convenience'
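Once the duplicate pairs are known, a natural follow-up is to keep one record from each pair. The sketch below is one possible approach, assuming each pair lists the later record first, as in (77, 6). The function name and sample data are hypothetical.

```python
def drop_duplicates(records, matched_pairs):
    """Return `records` without the first (later) member of each matched pair."""
    to_drop = {later for later, _earlier in matched_pairs}
    return {rid: value for rid, value in records.items() if rid not in to_drop}


# Illustrative subset of the category list, keyed by row index.
records = {0: "Halal", 6: "Convenience", 77: "Convenience"}

print(drop_duplicates(records, [(77, 6)]))  # {0: 'Halal', 6: 'Convenience'}
```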

**Conclusion:**

Using the recordlinkage package, we were able to find duplicate category names in the list of categories extracted from the Uber Eats page. This saves us from having to check manually for duplicates after extraction.