## Detecting duplicate records with the Recordlinkage toolkit
This notebook uses the Python recordlinkage toolkit to find restaurant records that appear in both of two sample restaurant datasets and are therefore considered **duplicates.**

In [1]:
import recordlinkage
import pandas as pd

**Step 1:** Read the two sample restaurant datasets, one per sheet of the same workbook

In [2]:
sample_restaurant_df_1 = pd.read_excel('./Final_df.xlsx', sheet_name='Halal')
sample_restaurant_df_2 = pd.read_excel('./Final_df.xlsx', sheet_name='Pizza')

In [3]:
sample_restaurant_df_1.head()

Unnamed: 0.1,Unnamed: 0,Restaurant_ID,Restaurant_name,Restaurant page URL link,Delivery fee,Estimated_delivery_time_min,Estimated_delivery_time_max,Estimated_delivery_time_average,Rating,status,promotion_uuid
0,0,2d9c18e6-49a0-5f9c-a539-395911893bb2,Shahs Halal Food,https://www.ubereats.com/store/shahs-halal-foo...,2.79,35,50,42.5,4.5,,
1,1,9c99ec23-a6e2-4556-b096-0d2de4471e39,The Halal Guys,https://www.ubereats.com/store/the-halal-guys-...,0.29,15,30,22.5,4.2,,f5f190ae-8214-4b07-96d8-52ee6e63623e
2,2,ca792fb2-2785-5428-bb43-730815c2fee1,Shahs halal food,https://www.ubereats.com/store/shahs-halal-foo...,4.29,35,50,42.5,4.6,,
3,3,7253197a-bc59-512d-98de-09e9f1a69f1f,Shah s Halal Food,https://www.ubereats.com/store/shahs-halal-foo...,4.29,30,45,37.5,,,
4,4,4b3e8064-3664-5ae7-872e-6378416586a1,Shahs Halal Food,https://www.ubereats.com/store/shahs-halal-foo...,4.29,35,50,42.5,,,


In [4]:
sample_restaurant_df_2.head()

Unnamed: 0.1,Unnamed: 0,Restaurant_ID,Restaurant_name,Restaurant page URL link,Delivery fee,Estimated_delivery_time_min,Estimated_delivery_time_max,Estimated_delivery_time_average,Rating,status,promotion_uuid
0,0,d3e8d5b5-4fcd-52e8-9afc-043e8fbfb1c5,jojo s Grill Pizza,https://www.ubereats.com/store/jojos-grill-%26...,4.29,50,65,57.5,,,f0682412-4f44-4e6b-b28a-60d1f6475159
1,1,3182c2d0-1278-5176-9713-2402b4aa79d5,Express Grill Pizza,https://www.ubereats.com/store/express-grill-%...,4.29,40,55,47.5,,,
2,2,5dfc8138-b296-4bae-80e9-091b684eae25,Tennessee Pizza BBQ,https://www.ubereats.com/store/tennessee-pizza...,4.29,35,50,42.5,,,c548b39d-d1a2-494c-b06b-b6561c7d8ac0
3,3,c8ca909f-436d-4bf5-9282-7b19046b063b,Snappy Tomato Pizza,https://www.ubereats.com/store/snappy-tomato-p...,4.29,30,45,37.5,4.6,,7238111f-7593-4d4d-8893-dca43cef1102
4,4,7362875a-9f25-5f21-8a4b-99b60d65b89d,Mamma Mia Pizza,https://www.ubereats.com/store/mamma-mia-pizza...,2.0,25,45,35.0,4.1,,58686ea9-6d1d-42b6-9027-0e08d18ed419


**Step 2:** Strip leading and trailing whitespace from the **Restaurant_name** values so that otherwise identical names compare equal.

In [5]:
sample_restaurant_df_1['Restaurant_name'] = sample_restaurant_df_1['Restaurant_name'].str.strip()
sample_restaurant_df_2['Restaurant_name'] = sample_restaurant_df_2['Restaurant_name'].str.strip()
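
Stripping only removes whitespace at the ends of a string. As a minimal plain-Python sketch (the names come from the preview above, with padding added to the first one for illustration; the variable names are hypothetical), the following shows that names differing in case or internal spacing remain distinct after stripping:

```python
# Plain-Python illustration of what stripping does and does not fix.
# The padding around the first name is added here for illustration only.
names = ["  Shahs Halal Food ", "Shahs halal food", "Shah s Halal Food"]

# str.strip() removes leading and trailing whitespace only.
stripped = [name.strip() for name in names]
print(stripped)

# Case and internal spacing are untouched, so the three names stay distinct.
print(len(set(stripped)))
```

If such variants should also be treated as the same restaurant, further normalization (for example lowercasing) would be needed before comparing.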

**Step 3:** Create a record linkage indexer

In [6]:
indexer = recordlinkage.Index()

**Step 4:** Define a blocking key. Blocking on **Restaurant_name** restricts the candidate pairs to records whose names are exactly equal


In [7]:
indexer.block('Restaurant_name')

<Index>
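
To make the effect of blocking concrete, here is a minimal plain-Python sketch of the idea. It is not the recordlinkage implementation; `block_pairs` and the two short name lists are hypothetical and only illustrate that records are paired when their blocking keys are exactly equal:

```python
from collections import defaultdict

def block_pairs(left, right):
    """Return (left_index, right_index) pairs whose keys are exactly equal."""
    by_key = defaultdict(list)
    for j, key in enumerate(right):
        by_key[key].append(j)
    return [(i, j) for i, key in enumerate(left) for j in by_key.get(key, [])]

# Illustrative names only; the real dataframes hold many more rows.
left_names = ["Shahs Halal Food", "Shahs halal food", "China Express"]
right_names = ["China Express", "Shahs Halal Food", "LDN Pizza Co"]

candidate_pairs = block_pairs(left_names, right_names)
print(candidate_pairs)  # [(0, 1), (2, 0)]
```

Only 2 of the 9 possible pairs survive, which is the point of blocking. The trade-off is that "Shahs halal food" is never paired with "Shahs Halal Food", because the two keys differ in case.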

**Step 5:** Generate pairs of records to compare for potential duplicates


In [8]:
pairs = indexer.index(sample_restaurant_df_1,sample_restaurant_df_2)

**Step 6:** Create a comparison object to specify how to compare records


In [9]:
compare_cl = recordlinkage.Compare()

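**Step 7:** Compare the **Restaurant_name** values using Levenshtein similarity. With threshold=1, a pair scores 1.0 only when the similarity reaches 1 (an exact match) and 0.0 otherwise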


In [10]:
compare_cl.string('Restaurant_name', 'Restaurant_name', method='levenshtein', threshold=1)

<Compare>
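
The sketch below shows how a Levenshtein-based similarity and a threshold interact. It is a plain-Python illustration, not the library's code: the helper names are hypothetical, and it assumes the similarity is normalized as 1 minus the edit distance divided by the length of the longer string.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

def thresholded(a, b, threshold):
    """Score 1.0 when the similarity reaches the threshold, else 0.0."""
    return 1.0 if similarity(a, b) >= threshold else 0.0

print(similarity("Shahs Halal Food", "Shahs halal food"))        # 0.875
print(thresholded("Shahs Halal Food", "Shahs halal food", 1))    # 0.0
print(thresholded("Shahs Halal Food", "Shahs halal food", 0.85)) # 1.0
```

Under this assumption, a threshold of 1 accepts only identical strings, while a lower threshold such as 0.85 would also accept near matches.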

**Step 8:** Compute the comparison features for the pairs of records

In [11]:
features = compare_cl.compute(pairs, sample_restaurant_df_1,sample_restaurant_df_2)

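**Step 9:** Classification: set a score threshold to classify pairs as duplicates

In [12]:
# Keep pairs whose summed score is at least 1; a cutoff of 0 would keep every candidate pair
matches = features[features.sum(axis=1) >= 1]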

In [13]:
matches

Unnamed: 0,Unnamed: 1,0
8,41,1.0
284,41,1.0
300,41,1.0
313,41,1.0
314,41,1.0
14,172,1.0
39,166,1.0
52,12,1.0
119,12,1.0
54,112,1.0
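
The role of the cutoff in the classification step can be seen in a small plain-Python sketch. The score dictionary below is hypothetical (the pair with score 0.0 is invented for illustration), and `classify` is not a recordlinkage function. Because summed scores are never negative, a cutoff of 0 keeps every pair, while a cutoff of 1 keeps only pairs whose comparison scored 1.0:

```python
# Hypothetical summed scores per candidate pair: (left_index, right_index) -> score.
# The (3, 7) entry is invented to show a non-matching pair.
scores = {(8, 41): 1.0, (14, 172): 1.0, (3, 7): 0.0}

def classify(scores, cutoff):
    """Return the pairs whose summed score is at least the cutoff."""
    return [pair for pair, score in scores.items() if score >= cutoff]

keep_all = classify(scores, 0)      # nothing is filtered out
only_matches = classify(scores, 1)  # the 0.0 pair is dropped

print(keep_all)      # [(8, 41), (14, 172), (3, 7)]
print(only_matches)  # [(8, 41), (14, 172)]
```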


In [14]:
print('sample_df_1:',sample_restaurant_df_1['Restaurant_name'].loc[377],"    ","sample_df_2:",sample_restaurant_df_2['Restaurant_name'].loc[0])


sample_df_1: jojo s Grill Pizza      sample_df_2: jojo s Grill Pizza


In [15]:
print('sample_df_1:',sample_restaurant_df_1['Restaurant_name'].loc[376],"    ","sample_df_2:",sample_restaurant_df_2['Restaurant_name'].loc[236])


sample_df_1: LDN Pizza Co      sample_df_2: LDN Pizza Co


In [16]:
print('sample_df_1:',sample_restaurant_df_1['Restaurant_name'].loc[373],"    ","sample_df_2:",sample_restaurant_df_2['Restaurant_name'].loc[253])


sample_df_1: China Express      sample_df_2: China Express


In [17]:
print('sample_df_1:',sample_restaurant_df_1['Restaurant_name'].loc[224],"    ","sample_df_2:",sample_restaurant_df_2['Restaurant_name'].loc[135])


sample_df_1: Master Kebab and Pizza      sample_df_2: Master Kebab and Pizza


**Conclusion:**

The record linkage analysis identified and linked duplicate records between the two dataframes, both of which were extracted from Uber Eats pages.