# Problem Statement: What to do when your data is full of duplicates

You will often have datasets about *entities* (companies, persons, places) that are full of *duplicates*, and for which it is difficult to find a *unique identifier* for the same entity across the different datasets.   
This can happen because the same entity was entered independently in two different information systems, in two slightly different ways. Because of input errors (typos, for example) and other transcription problems (how do you type München on an English keyboard?), an exact match on the name between two datasets may be impossible.   
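
To make the transcription problem concrete, here is a minimal sketch using only the Python standard library. It is not part of Suricate; the helper `strip_accents` and the list of spellings are hypothetical, chosen for illustration. It shows that several plausible spellings of the same city are all different strings, and that simple normalization reduces the variation without removing it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFKD decomposition (e.g. 'ü' becomes 'u').

    Hypothetical helper for illustration; not a Suricate function.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Four ways the same city might be typed into different systems (made-up inputs).
variants = ["München", "Munchen", "Muenchen", "munchen "]

# Exact comparison treats every spelling as a distinct value.
print(len(set(variants)))

# Accent stripping, trimming and lower-casing merge some spellings, but not all.
normalized = {strip_accents(v).strip().lower() for v in variants}
print(normalized)
```

Even after normalization, `Muenchen` still differs from the other spellings, which is why exact matching alone is rarely enough.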

## 1. Visualize some examples using Suricate datasets
As an example, the *Suricate* package ships with sample datasets about companies.   
This data consists of two tables, called "*left*" and "*right*".

In [11]:
from suricate.data.companies import getsource, gettarget
# Load the first 200 rows of each sample table
left = getsource(nrows=200)
right = gettarget(nrows=200)

In [12]:
# Pairs of (ix_source, ix_target) index values to inspect side by side
y = [
    ('6880aa5d', '42a5e929'),
    ('68ba9560', '46bed352'),
    ('4c772645', '2839b691'),
    ('f89ddf6d', 'fbba638f'),
    ('a17ba961', '4247e534')
]

In [13]:
from suricate.lrdftransformers import LrDfVisualHelper
# Build a side-by-side view of the left and right tables, then select the pairs listed in y
Xsbs = LrDfVisualHelper().fit_transform(X=[left, right])
Xsbs.loc[y]

| ix_source | ix_target | name_source | name_target | street_source | street_target | city_source | city_target | postalcode_source | postalcode_target | duns_source | duns_target | countrycode_source | countrycode_target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6880aa5d | 42a5e929 | ge aviation systems ltd | ge aviation | arle court | cheltenham | cheltenham | cheltenham | gl51 0 | gl52 7 | | | GB | GB |
| 68ba9560 | 46bed352 | frey blumenhof | blumenhof frey | mittenheimer str | 17a wa12rmbachstraayerasse | oberschleissheim | unterschleissheim | 85764 | 85716 | 342418069.0 | | DE | DE |
| 4c772645 | 2839b691 | mewa textil mietservice | mewa textil service ag co | 5 hermann gebauer str | hermann gebauer str | meisenheim | weil im schonbuch | 77974 | 71093 | 314496969.0 | 318287794.0 | DE | DE |
| f89ddf6d | fbba638f | aeroflex test solutions | aeroflex ltd | school close | monks brook industrial park | chandler s ford | chandler s ford | so534ra | so534ra | 216834002.0 | 216834002.0 | GB | GB |
| a17ba961 | 4247e534 | honeywell aerospace | honeywell engines systems | 190th st | 2525 w 190th st | torrance | torrance | 25250 | 90504-6002 | | | US | US |


As these examples show, there are many cases where the company name or the address is very similar between the two tables. Are they the same entity? Is there a mathematical way to decide whether they are the same or not? This problem is usually called deduplication, or record linkage.    
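
One way to put a number on "very similar" is a string similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library on a few name pairs taken from the table above. It is only an illustration of the idea, not the method Suricate uses; the function name `similarity` is an assumption made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (identical strings).

    Illustrative helper built on the standard library; not a Suricate function.
    """
    return SequenceMatcher(None, a, b).ratio()


# (name_source, name_target) pairs copied from the table above
pairs = [
    ("ge aviation systems ltd", "ge aviation"),
    ("frey blumenhof", "blumenhof frey"),
    ("aeroflex test solutions", "aeroflex ltd"),
]

for left_name, right_name in pairs:
    score = similarity(left_name, right_name)
    print(f"{left_name!r} vs {right_name!r}: {score:.2f}")
```

A score like this does not answer the question on its own: a threshold still has to be chosen, and word-order swaps such as "frey blumenhof" versus "blumenhof frey" are penalized even though a human reader sees the same name.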

The Suricate project aims to let you quickly set up a machine learning model that intelligently finds matches in your datasets.

## Next topic: Similarity functions