# Automatically classify each pair
# Implement an ML Classifier

This chapter trains a machine learning classifier to label each candidate pair automatically as a match or a non-match.
To predict whether two records should be linked together, we use both:
- the similarity scores calculated in the previous chapter (X_score)
- the data labelled manually in chapter 2 (the simple questions)

## 1. Set-up of the score matrix and of the labelled data

### 1.1. Similarity score matrix (see previous chapters)

In [1]:
import pandas as pd
from suricate.data.companies import getXlr
X_lr = getXlr(nrows=500)

from suricate.lrdftransformers import FuzzyConnector, VectorizerConnector, ExactConnector
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scores = [
    ('name_vecword', VectorizerConnector(on='name', analyzer='word', ngram_range=(1,2))),
    ('name_vecchar', VectorizerConnector(on='name', analyzer='char', ngram_range=(1,3))),
    ('street_vecword', VectorizerConnector(on='street', analyzer='word', ngram_range=(1,2))),
    ('street_vecchar', VectorizerConnector(on='street', analyzer='char', ngram_range=(1,3))),
    ('city_vecchar', VectorizerConnector(on='city', analyzer='char', ngram_range=(1,3))),
    ('postalcode_exact', ExactConnector(on='postalcode')),
    ('duns_exact', ExactConnector(on='duns')),
    ('countrycode_exact', ExactConnector(on='countrycode'))
]
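# Concatenate the individual similarity scores column-wise
transformer = FeatureUnion(scores)
steps = [
    ('scorer', transformer),
    # Replace missing scores with 0
    ('imputer', Imputer(strategy='constant', fill_value=0)),
    # Standardize each score column before the PCA
    ('scaler', StandardScaler()),
    # Keep the first 3 principal components
    ('pca', PCA(n_components=3))
]
preprocessing_pipeline = Pipeline(steps)
X_score_reduced = preprocessing_pipeline.fit_transform(X=X_lr)
print(X_score_reduced.shape)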

(250000, 3)
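
The first dimension of this shape is the number of candidate pairs. The printed value suggests that `getXlr(nrows=500)` yields 500 left and 500 right records, and that every left record is compared with every right record, although that is an inference from the output and not something stated here. A minimal sketch of that cartesian pairing on toy data (the record ids below are made up and not taken from suricate):

```python
import pandas as pd

# Hypothetical toy record ids; the real data uses hashed ids such as 'c48d2aae'.
left_ids = ['a', 'b', 'c']
right_ids = ['x', 'y']

# Every left record is paired with every right record (cartesian product).
pairs = pd.MultiIndex.from_product(
    [left_ids, right_ids], names=['ix_left', 'ix_right'])

n_toy_pairs = len(pairs)     # 3 * 2 pairs
n_full_pairs = 500 * 500     # same arithmetic for 500 x 500 records
print(n_toy_pairs, n_full_pairs)
```

The second dimension, 3, is the number of components kept by the PCA step.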


### 1.2. Loading the labelled data
In the previous chapter, we saw how to take a representative sample of the possible pairs. We assume that each sampled pair can be labelled manually: 1 if it is a match, 0 if it is not.
In this tutorial, some labelled data is already available.

In [2]:
from suricate.data.companies import getytrue
y_true = getytrue()
y_true.sample(5)

ix_left   ix_right
c48d2aae  5687372d    0.0
8b16163c  d90bd6f1    0.0
c42d1668  8ae5cbfc    0.0
57fb4d86  62f043fc    0.0
d1406ede  fa66dcf6    0.0
Name: y_true, dtype: float64

In [3]:
y_true.shape[0]

4587588

## 2. Manually fit the model and predict

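We run into a problem here: y_true and X_score_reduced do not have the same number of rows (4587588 labelled pairs against 250000 scored pairs), so they cannot be passed to the classifier as they are.
To fit the classifier, we must restrict both to the pairs they have in common, that is, the intersection of their indexes.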
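
The intersection step can be sketched on toy data as follows. The pair ids and label values are invented for illustration, and the sketch only assumes that both objects are indexed by an (`ix_left`, `ix_right`) MultiIndex, as the outputs in this chapter show:

```python
import pandas as pd

names = ['ix_left', 'ix_right']

# Hypothetical pairs that were scored (stand-in for the index built from X_lr).
scored = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')], names=names)

# Hypothetical labels; ('c', 'z') was labelled but never scored.
labels = pd.Series(
    [1.0, 0.0, 0.0],
    index=pd.MultiIndex.from_tuples(
        [('a', 'x'), ('b', 'y'), ('c', 'z')], names=names),
    name='y_true')

# Keep only the pairs present on both sides.
common = labels.index.intersection(scored)
y_common = labels.loc[common]
print(sorted(common.tolist()))
```

Only the pairs that are both scored and labelled survive, which is what the classifier needs for training.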

### 2.1. Finding the labelled data from y_true in the score data

In [4]:
from suricate.preutils import createmultiindex
# Index of all pairs compared
allindex = createmultiindex(X=X_lr)
# Index common to y_true and all pairs compared
commonindex = y_true.index.intersection(allindex)
print('number of labelled samples:{}'.format(len(commonindex)))

number of labelled samples:250000


In [5]:
y_labelled = y_true.loc[commonindex]
y_labelled.value_counts()

0.0    249691
1.0       309
Name: y_true, dtype: int64

In [6]:
X_score_reduced = pd.DataFrame(X_score_reduced, index=allindex)
X_labelled = X_score_reduced.loc[commonindex]

### 2.2. Training and testing the model

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, ix_train, ix_test = train_test_split(X_labelled, y_labelled, commonindex, test_size=0.33)
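
With only 309 matches among 250000 labelled pairs, a purely random split can give the training and testing sets different shares of matches. One option, not used in this chapter, is the `stratify` parameter of `train_test_split`. A minimal sketch on synthetic data (the arrays below are invented stand-ins for `X_labelled` and `y_labelled`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for the labelled data: 1000 pairs, 3 features, 1% matches.
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

# stratify=y_toy keeps the class proportions similar in both subsets;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, stratify=y_toy, random_state=0)

print(y_tr.mean(), y_te.mean())
```

Both subsets then contain roughly 1% of matches, so the testing score is computed on a class mix comparable to the training one.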

In [8]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X=X_train, y=y_train)
print('training score:{}'.format(clf.score(X=X_train, y=y_train)))
print('testing score:{}'.format(clf.score(X=X_test, y=y_test)))



training score:0.9997194029850747
testing score:0.9996848484848485
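
The two scores above are accuracies, since `LogisticRegression.score` returns the mean accuracy. Because 249691 of the 250000 labelled pairs are non-matches, a model that always answers "no match" would already reach a very high accuracy, so precision and recall on the match class are usually more informative here. A small sketch, using the class counts from section 2.1 and ten hand-made labels that are purely illustrative:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Accuracy of a trivial model that always predicts "no match" (0),
# computed from the class counts shown in section 2.1.
baseline_accuracy = 249691 / 250000
print(round(baseline_accuracy, 6))

# Hypothetical true and predicted labels for ten pairs (1 = match).
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)          # rows: true, columns: predicted
precision = precision_score(y_true_toy, y_pred_toy)    # share of predicted matches that are right
recall = recall_score(y_true_toy, y_pred_toy)          # share of true matches that are found
print(cm)
print(precision, recall)
```

On the real data, the same functions could be called as `precision_score(y_test, clf.predict(X_test))` and `recall_score(y_test, clf.predict(X_test))`.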


### 2.3. Visualizing the predicted pairs

In [9]:
y_pred_test = pd.Series(clf.predict(X=X_test), index=ix_test)
good_matches = y_pred_test.loc[y_pred_test==1].index
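
For a binary logistic regression, `predict` amounts to a 0.5 cutoff on the predicted probability of the positive class. If a different trade-off between missed and false matches is wanted, `predict_proba` can be thresholded by hand. A minimal sketch on synthetic one-dimensional scores (the data, the `clf_toy` name and the 0.9 cutoff are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic one-feature scores: non-matches centred on 0, matches on 3.
X_toy = np.concatenate([rng.normal(0, 1, size=(200, 1)),
                        rng.normal(3, 1, size=(20, 1))])
y_toy = np.array([0] * 200 + [1] * 20)

clf_toy = LogisticRegression().fit(X_toy, y_toy)

# Probability of the positive class (column order follows clf_toy.classes_).
proba = pd.Series(clf_toy.predict_proba(X_toy)[:, 1], name='proba_match')

# A stricter cutoff than the default keeps fewer, more certain pairs.
default_matches = proba[proba > 0.5]
strict_matches = proba[proba > 0.9]
print(len(default_matches), len(strict_matches))
```

Sorting `proba` in descending order also gives a ranked list of candidate pairs, which can help when reviewing predictions by hand.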

In [10]:
from suricate.lrdftransformers.cartesian import create_lrdf_sbs
create_lrdf_sbs(X=X_lr, on_ix=good_matches).sample(5)


ix_left,ix_right,name_left,name_right,street_left,street_right,city_left,city_right,postalcode_left,postalcode_right,duns_left,duns_right,countrycode_left,countrycode_right
cfcf9b8c,cfcf9b8c,keithley instruments gmbh,keithley instruments gmbh,65 landsberger str,65 landsberger str,germering,germering,82110,82110,31609217.0,31609217.0,DE,DE
d1406ede,9bc7bcee,sna europe,sna europe deutschland a range of sna germany ...,willettstr,10 willettstr,mettmann,mettmann,40822,40822,,,DE,DE
591099fe,8b5d81b9,nespresso deutschland gmbh,nespresso deutschland gmbh,speditionsstrae,23 speditionstr,dusseldorf,dusseldorf,40221,40221,,333868649.0,DE,DE
22be1313,22be1313,fako heinrich a anton,fako heinrich a anton,sderstr,sderstr,hamburg,hamburg,20537,20537,340213235.0,340213235.0,DE,DE
37fa1e22,c8efedca,marconi selenia communications spa,marconi selenia communications spa,1a via ambrogio negrone,1 a via negrone,genoa,genoa,16153,16153,,,IT,IT
