
In [2]:
# if you don't have recordlinkage installed
!pip install recordlinkage

Collecting recordlinkage
  Downloading https://files.pythonhosted.org/packages/db/26/babbca39d74824e8bc17428a8eb04951a1d63318af7d02beeb2106a1ec26/recordlinkage-0.14-py3-none-any.whl (944kB)
Collecting jellyfish>=0.5.4
  Downloading https://files.pythonhosted.org/packages/30/a6/4d039bc827a102f62ce7a7910713e38fdfd7c7a40aa39c72fb14938a1473/jellyfish-0.8.2-cp37-cp37m-manylinux2014_x86_64.whl (90kB)
Installing collected packages: jellyfish, recordlinkage
Successfully installed jellyfish-0.8.2 recordlinkage-0.14


In [3]:
import pandas as pd
import recordlinkage

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week5/datasets/restaurants.csv", delimiter=",")
targets = df["class"]
df = df.iloc[:, :-1].set_index(keys="id")
df.head(10)

id,name,addr,city,phone,type
1,arnie morton's of chicago,435 s. la cienega blv.,los angeles,310/246-1501,american
2,arnie morton's of chicago,435 s. la cienega blvd.,los angeles,310-246-1501,steakhouses
3,art's delicatessen,12224 ventura blvd.,studio city,818/762-1221,american
4,art's deli,12224 ventura blvd.,studio city,818-762-1221,delis
5,hotel bel-air,701 stone canyon rd.,bel air,310/472-1211,californian
6,bel-air hotel,701 stone canyon rd.,bel air,310-472-1211,californian
7,cafe bizou,14016 ventura blvd.,sherman oaks,818/788-3536,french
8,cafe bizou,14016 ventura blvd.,sherman oaks,818-788-3536,french bistro
9,campanile,624 s. la brea ave.,los angeles,213/938-1447,american
10,campanile,624 s. la brea ave.,los angeles,213-938-1447,californian


### Data cleaning
It is important to clean the data before looking for duplicates. Cleaning not only makes matches easier to find but also makes the blocking index work better. Typical cleaning steps include standardising the order of first name and surname, standardising addresses, dates and phone numbers, and lowercasing all characters.

As you already know, pandas itself is very useful for data cleaning; however, the Record Linkage Toolkit also provides some cleaning functions. We are going to use `recordlinkage.preprocessing.clean()`, which is the most generic function in this package.

In [6]:
# cleaning the name column
from recordlinkage.preprocessing import clean
df["name"] = clean(df["name"], lowercase=True, strip_accents='unicode', remove_brackets=True)
# addr
df["addr"] = clean(df["addr"], lowercase=True, strip_accents='unicode', remove_brackets=True)
# city
df["city"] = clean(df["city"], lowercase=True, strip_accents='unicode', remove_brackets=True)
# type
df["type"] = clean(df["type"], lowercase=True, strip_accents='unicode', remove_brackets=True)

In [7]:
# there are also more specific cleaning functions in this package
from recordlinkage.preprocessing import phonenumbers
df["phone"] = phonenumbers(df["phone"])

In [8]:
df.head(10)

id,name,addr,city,phone,type
1,arnie mortons of chicago,435 s la cienega blv,los angeles,3102461501,american
2,arnie mortons of chicago,435 s la cienega blvd,los angeles,3102461501,steakhouses
3,arts delicatessen,12224 ventura blvd,studio city,8187621221,american
4,arts deli,12224 ventura blvd,studio city,8187621221,delis
5,hotel bel air,701 stone canyon rd,bel air,3104721211,californian
6,bel air hotel,701 stone canyon rd,bel air,3104721211,californian
7,cafe bizou,14016 ventura blvd,sherman oaks,8187883536,french
8,cafe bizou,14016 ventura blvd,sherman oaks,8187883536,french bistro
9,campanile,624 s la brea ave,los angeles,2139381447,american
10,campanile,624 s la brea ave,los angeles,2139381447,californian
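
For intuition, the effect of these cleaning steps can be approximated with the standard library alone. This is a minimal sketch, not the toolkit's implementation: `simple_clean` and `digits_only` are hypothetical helpers, and the exact character rules (apostrophes dropped, hyphens turned into spaces, phone numbers reduced to digits) are assumptions inferred from the cleaned table above.

```python
import re
import unicodedata


def simple_clean(text):
    """Approximate string cleaning: lowercase, strip accents, brackets and punctuation."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove bracketed fragments such as "(closed)".
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Assumption: apostrophes vanish, separators become spaces.
    text = text.replace("'", "")
    text = re.sub(r"[-_/]", " ", text)
    # Drop every remaining character that is not a letter, digit or space.
    text = re.sub(r"[^a-z0-9 ]", "", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


def digits_only(phone):
    """Assumption: a cleaned phone number keeps only its digits."""
    return re.sub(r"\D", "", phone)


print(simple_clean("arnie morton's of chicago"))
print(simple_clean("hotel bel-air"))
print(digits_only("310/246-1501"))
```

After this kind of normalisation, `310/246-1501` and `310-246-1501` become the same string, which is what lets the later comparison step treat them as equal.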


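Now let's choose a blocking index for this data set to generate record pairs. We will block on the `city` attribute, since it seems to be the attribute least likely to differ between duplicate records of the same restaurant. Only records that share the same `city` value are paired, which keeps the number of candidate pairs manageable.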

In [9]:
indexer = recordlinkage.Index()
indexer.block('city')
candidate_links = indexer.index(df)

print(len(candidate_links))

57943
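
To make the idea of blocking concrete, here is a small standard-library sketch under simplifying assumptions: it uses only the ten records shown in the table above and builds pairs by hand instead of calling the toolkit. The pair ordering (smaller id first) is arbitrary in this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: id -> city, copied from the first ten rows of the table above.
cities = {
    1: "los angeles", 2: "los angeles",
    3: "studio city", 4: "studio city",
    5: "bel air", 6: "bel air",
    7: "sherman oaks", 8: "sherman oaks",
    9: "los angeles", 10: "los angeles",
}

# Group record ids by their blocking key.
blocks = defaultdict(list)
for rec_id, city in cities.items():
    blocks[city].append(rec_id)

# Only records inside the same block are paired with each other.
candidate_pairs = [
    pair
    for ids in blocks.values()
    for pair in combinations(sorted(ids), 2)
]

n = len(cities)
all_pairs = n * (n - 1) // 2  # number of pairs without any blocking
print(len(candidate_pairs), "candidate pairs instead of", all_pairs)
```

The trade-off is that two duplicates whose `city` values differ (for example because of a typo) never become a candidate pair, so the blocking key should be an attribute that is reliable after cleaning.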


In [10]:
compare_cl = recordlinkage.Compare()

# exact comparison for city since it is the blocking index
compare_cl.exact('city', 'city', label='city')
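# partial similarity between string values: a similarity of at least `threshold` is scored 1, otherwise 0
# (the label 'surname' is only a column name here; it holds the restaurant name comparison)
compare_cl.string('name', 'name', method='jarowinkler', threshold=0.85, label='surname')
# no method is given, so the default string metric (Levenshtein) is used
compare_cl.string('addr', 'addr', threshold=0.85, label='addr')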
# phone is not really a numerical value, therefore it's better to use a string comparison
compare_cl.string('phone', 'phone', threshold=0.85, label='phone')
compare_cl.string('type', 'type', method='jarowinkler', threshold=0.85, label='type')

features = compare_cl.compute(candidate_links, df)

In [11]:
features.head(20)

id_1,id_2,city,surname,addr,phone,type
2,1,1,1.0,1.0,1.0,0.0
9,1,1,0.0,0.0,0.0,1.0
9,2,1,0.0,0.0,0.0,0.0
10,1,1,0.0,0.0,0.0,0.0
10,2,1,0.0,0.0,0.0,0.0
10,9,1,1.0,1.0,1.0,0.0
13,1,1,0.0,0.0,0.0,0.0
13,2,1,0.0,0.0,0.0,0.0
13,9,1,0.0,0.0,0.0,0.0
13,10,1,0.0,0.0,0.0,1.0
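
To see how a `threshold` turns a similarity score into the 0/1 values in the table above, here is a pure-Python sketch of a normalised Levenshtein similarity. It assumes the score is 1 minus the edit distance divided by the length of the longer string; the toolkit's own computation may differ in detail, and `levenshtein`, `similarity` and `binarise` are hypothetical helper names.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def binarise(a, b, threshold=0.85):
    """Score 1.0 when the similarity reaches the threshold, else 0.0."""
    return 1.0 if similarity(a, b) >= threshold else 0.0


print(binarise("435 s la cienega blv", "435 s la cienega blvd"))
print(binarise("435 s la cienega blv", "624 s la brea ave"))
```

A single missing character in a 21-character address leaves the similarity well above 0.85, so the pair is scored as agreeing on `addr`, while two unrelated addresses fall far below the threshold.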


### Classification
In general there are three ways to classify the candidate record pairs as matches or non-matches:

- Threshold-based methods
- Supervised learning methods
- Unsupervised learning methods

Once we classify each of the record pairs, we can evaluate the classification using three commonly used metrics: precision, recall and F-score.
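
For reference, these metrics have the following standard definitions, where $TP$ is the number of predicted matches that are true matches, $FP$ the number of predicted matches that are not, and $FN$ the number of true matches that were missed. The F-score is taken here to be the harmonic mean of precision and recall (the usual F1).

```latex
\begin{align}
\text{precision} &= \frac{TP}{TP + FP}, \\
\text{recall}    &= \frac{TP}{TP + FN}, \\
F                &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.
\end{align}
```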


#### Threshold based

In [12]:
# Sum the comparison results.
features.sum(axis=1).value_counts().sort_index(ascending=False)

5.0       21
4.0       29
3.0       91
2.0     8289
1.0    49513
dtype: int64

In [13]:
matches = features[features.sum(axis=1) >= 4]

print(len(matches))
matches.head(10)

50


id_1,id_2,city,surname,addr,phone,type
2,1,1,1.0,1.0,1.0,0.0
10,9,1,1.0,1.0,1.0,0.0
14,13,1,1.0,1.0,1.0,1.0
26,25,1,1.0,1.0,1.0,1.0
28,27,1,1.0,0.0,1.0,1.0
34,33,1,1.0,1.0,1.0,1.0
40,39,1,1.0,1.0,1.0,0.0
4,3,1,1.0,1.0,1.0,0.0
6,5,1,0.0,1.0,1.0,1.0
8,7,1,1.0,1.0,1.0,1.0
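
The threshold rule used above boils down to counting how many attributes agree. A minimal plain-Python sketch, using a handful of comparison vectors copied from the tables above (the dictionary layout is just an illustration, not how the toolkit stores features):

```python
# Binary comparison vectors in the order (city, name, addr, phone, type).
features = {
    (2, 1): [1, 1, 1, 1, 0],
    (9, 1): [1, 0, 0, 0, 1],
    (10, 9): [1, 1, 1, 1, 0],
    (13, 10): [1, 0, 0, 0, 1],
    (6, 5): [1, 0, 1, 1, 1],
}

THRESHOLD = 4  # a pair is a match when at least 4 of the 5 attributes agree

matches = [pair for pair, vector in features.items() if sum(vector) >= THRESHOLD]
print(matches)
```

Raising the threshold to 5 would keep only pairs that agree on every attribute, which typically improves precision at the cost of recall; lowering it does the opposite.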


##### precision and recall

In [15]:
# the restaurants gold data contains the true matches between data pairs
true_links = pd.read_csv("https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week5/datasets/restaurant_gold.csv")
true_links.head()

,class,id_1,id_2
0,'0',1,2
1,'1',3,4
2,'2',5,6
3,'3',7,8
4,'4',9,10


In [16]:
# make sure the order of the index levels matches: in the matches dataframe the first index level is
# always the larger id, whereas in the true links dataframe it is the reverse. So we set
# the multi-index of the true links dataframe in the order `id_2`, `id_1`
true_links = true_links.set_index(keys=["id_2", "id_1"])
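
Why the order matters can be shown with plain tuples: a pair written as `(2, 1)` and the same pair written as `(1, 2)` are different keys, so comparing them directly finds no overlap. The sketch below uses two pairs from the tables above; `normalise` is a hypothetical helper.

```python
predicted = {(2, 1), (10, 9)}        # (larger id, smaller id), as in `matches`
gold_as_loaded = {(1, 2), (9, 10)}   # (id_1, id_2), as in the gold file

# No overlap is found, because the tuple order differs.
print(len(predicted & gold_as_loaded))


def normalise(pair):
    """Put the larger id first so both sides use the same convention."""
    return tuple(sorted(pair, reverse=True))


gold = {normalise(pair) for pair in gold_as_loaded}
print(len(predicted & gold))
```

Skipping this alignment would not raise an error; it would silently report every predicted match as wrong.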

In [17]:
print("precision = ", recordlinkage.precision(true_links, matches))

precision =  0.94


In [18]:
print("recall = ", recordlinkage.recall(true_links, matches))

recall =  0.41964285714285715


In [19]:
print("F-score = ", recordlinkage.fscore(true_links, matches))

F-score =  0.5802469135802469
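
The three numbers above are consistent with each other. As a sanity check, the sketch below recomputes them from counts. The 50 predicted matches come from the output above, while the 47 true positives and 112 true links are inferred from the reported precision and recall rather than read from the data.

```python
predicted_matches = 50    # len(matches), from the threshold step
true_positives = 47       # inferred: 0.94 * 50
true_links_total = 112    # inferred: 47 / 0.4196...

precision = true_positives / predicted_matches
recall = true_positives / true_links_total
f_score = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f_score, 4))
```

Read this way, the threshold rule is almost always right when it declares a match, but it finds fewer than half of the true duplicates.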


#### Supervised learning
In this part we will use Logistic Regression to classify the candidate pairs.

In [20]:
from recordlinkage.classifiers import LogisticRegressionClassifier
LR = LogisticRegressionClassifier()

Let's first add labels to the feature vectors we created. With the labels attached, we can check how many actual matches and non-matches end up in the train and test sets.

In [21]:
join_feature_label = features.reset_index().join(true_links, how='left', on=["id_1", "id_2"])
join_feature_label["class"] = join_feature_label["class"].map(lambda x: 0 if pd.isnull(x) else 1)

In [22]:
join_feature_label.set_index(keys=["id_1", "id_2"], inplace=True)
join_feature_label.head(20)

id_1,id_2,city,surname,addr,phone,type,class
2,1,1,1.0,1.0,1.0,0.0,1
9,1,1,0.0,0.0,0.0,1.0,0
9,2,1,0.0,0.0,0.0,0.0,0
10,1,1,0.0,0.0,0.0,0.0,0
10,2,1,0.0,0.0,0.0,0.0,0
10,9,1,1.0,1.0,1.0,0.0,1
13,1,1,0.0,0.0,0.0,0.0,0
13,2,1,0.0,0.0,0.0,0.0,0
13,9,1,0.0,0.0,0.0,0.0,0
13,10,1,0.0,0.0,0.0,1.0,0


In [23]:
# train-test split
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(join_feature_label.iloc[:, :-1], 
                                                          join_feature_label["class"],
                                                          test_size=0.3)
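
A plain random split like the one above does not by itself guarantee the same share of matches in both parts, which matters when matches are rare. One way to enforce it is to split each class separately; scikit-learn's `train_test_split` offers a `stratify` argument for this purpose. The standard-library sketch below illustrates the idea only: `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random


def stratified_split(labels, test_size=0.3, seed=0):
    """Split indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 10 matches among 100 candidate pairs.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

print(sum(labels[i] for i in test_idx), "matches among", len(test_idx), "test pairs")
```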

In [24]:
# find the indices in the train data which contain a true match
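# use .intersection(): the `&` operator on Index objects is deprecated as a set operation in recent pandas
train_matches_index = train.index.intersection(true_links.index)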

In [25]:
train_matches_index

MultiIndex([(204, 203),
            (218, 217),
            (142, 141),
            (220, 219),
            (150, 149),
            (168, 167),
            (  8,   7),
            (166, 165),
            (  4,   3),
            (148, 147),
            (196, 195),
            (138, 137),
            (216, 215),
            (158, 157),
            (114, 113),
            (172, 171),
            (136, 135),
            (152, 151),
            (224, 223),
            (176, 175),
            (206, 205),
            ( 14,  13),
            (222, 221),
            ( 40,  39),
            (194, 193),
            (156, 155),
            (144, 143),
            ( 12,  11),
            (170, 169),
            (202, 201),
            (184, 183),
            (140, 139),
            (174, 173),
            ( 44,  43),
            (162, 161),
            (154, 153),
            (210, 209),
            (  2,   1),
            ( 30,  29),
            (164, 163),
            (182, 181),
            (160

In [26]:
# find the indices in the test data containing a true match
test_matches_index = test.index.intersection(true_links.index)

In [27]:
_ = LR.fit_predict(train, train_matches_index)

In [28]:
LR_test_matches = LR.predict(test)
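
To illustrate what a logistic regression classifier learns from comparison vectors, here is a generic gradient-descent sketch in plain Python. It is not the toolkit's implementation; the function names, learning rate, iteration count and the six toy training vectors are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Batch gradient descent on the mean logistic loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (match) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0


# Toy comparison vectors in the order (city, name, addr, phone, type).
X = [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 0, 1, 1],   # matches
     [1, 0, 0, 0, 0], [1, 0, 0, 0, 1], [1, 0, 0, 0, 0]]   # non-matches
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
print([predict(w, b, x) for x in X])
```

Unlike the fixed "at least 4 agreements" rule, the learned weights can treat attributes differently; in this toy data an agreeing name or phone number carries more evidence than an agreeing type.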

##### precision and recall

In [29]:
print("precision = ",recordlinkage.precision(test_matches_index, LR_test_matches))

precision =  0.9285714285714286


In [30]:
print("recall = ", recordlinkage.recall(test_matches_index, LR_test_matches))

recall =  0.8666666666666667


In [31]:
print("F-score = ", recordlinkage.fscore(test_matches_index, LR_test_matches))

F-score =  0.896551724137931


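We can observe a significant improvement with the supervised learning method compared to the threshold-based method: recall rises from about 0.42 to 0.87 and the F-score from about 0.58 to 0.90, while precision stays at roughly 0.93 (keeping in mind that the supervised scores are computed on the test split only).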

#### Unsupervised learning
We can also use unsupervised methods to classify candidate pairs. Here we are going to use [KMeans](https://en.wikipedia.org/wiki/K-means_clustering). KMeans partitions candidate record pairs into matches and non-matches. Each comparison vector belongs to the cluster with the nearest mean. The algorithm is calibrated for two clusters: a match cluster and a non-match cluster. The centers of these clusters can be given as arguments or set automatically.

In [32]:
kmeans = recordlinkage.KMeansClassifier()

Let's use the same train and test sets that we used for logistic regression. Note that no labels are involved in fitting here; we keep the split only to see how the model performs on new, unseen data.

In [33]:
kmeans.fit_predict(train)

kmeans_test_matches = kmeans.predict(test)
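
The two-cluster idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not the toolkit's implementation: the starting centers (all zeros for non-matches, all ones for matches), the helper names and the toy vectors are chosen for the example.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(vectors, centers):
    """Label each vector with the index of its nearest center (0 or 1)."""
    return [min((0, 1), key=lambda k: squared_distance(v, centers[k]))
            for v in vectors]


def two_means(vectors, n_iter=10):
    dim = len(vectors[0])
    # Assumed starting centers: cluster 0 = non-match, cluster 1 = match.
    centers = [[0.0] * dim, [1.0] * dim]
    for _ in range(n_iter):
        labels = assign(vectors, centers)
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centers), centers


# Toy comparison vectors in the order (city, name, addr, phone, type).
vectors = [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 0, 0, 0, 0],
           [1, 0, 0, 0, 1], [1, 0, 0, 0, 0], [1, 1, 0, 1, 1]]

labels, centers = two_means(vectors)
print(labels)
```

No labels are used at any point: a pair is called a match purely because its comparison vector lies closer to the match center than to the non-match center.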

##### precision and recall

In [34]:
print("precision = ",recordlinkage.precision(test_matches_index, kmeans_test_matches))

precision =  0.42857142857142855


In [35]:
print("recall = ", recordlinkage.recall(test_matches_index, kmeans_test_matches))

recall =  1.0


In [36]:
print("F-score = ", recordlinkage.fscore(test_matches_index, kmeans_test_matches))

F-score =  0.6
