# Automatically classify each pair
# Implement an ML Classifier

This chapter trains a machine learning classifier to automatically label each possible pair as a match or a non-match.    
To do that, we use both:
- the similarity scores calculated in the previous chapter (X_score)
- the data labelled manually in chapter 2 (the simple questions)

The goal is to predict whether two records should be linked together.

## 1. Set-up of the score matrix and of the labelled data

### 1.1. Similarity score matrix (see previous chapters)

In [11]:
import pandas as pd
from suricate.data.companies import getXlr
# Load the left and right datasets (500 rows each)
X_lr = getXlr(nrows=500)

from suricate.lrdftransformers import FuzzyConnector, VectorizerConnector, ExactConnector
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# One similarity score per (column, method):
# word and character n-grams on text columns, exact comparison on identifiers
scores = [
    ('name_vecword', VectorizerConnector(on='name', analyzer='word', ngram_range=(1,2))),
    ('name_vecchar', VectorizerConnector(on='name', analyzer='char', ngram_range=(1,3))),
    ('street_vecword', VectorizerConnector(on='street', analyzer='word', ngram_range=(1,2))),
    ('street_vecchar', VectorizerConnector(on='street', analyzer='char', ngram_range=(1,3))),
    ('city_vecchar', VectorizerConnector(on='city', analyzer='char', ngram_range=(1,3))),
    ('postalcode_exact', ExactConnector(on='postalcode')),
    ('duns_exact', ExactConnector(on='duns')),
    ('countrycode_exact', ExactConnector(on='countrycode'))
]
# Concatenate all scores side by side into one score matrix
transformer = FeatureUnion(scores)
# Fill missing scores with 0, standardize, then reduce to 3 components
steps = [
    ('scorer', transformer),
    ('imputer', Imputer(strategy='constant', fill_value=0)),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
]
preprocessing_pipeline = Pipeline(steps)
X_score_reduced = preprocessing_pipeline.fit_transform(X=X_lr)
# One row per compared pair, one column per principal component
print(X_score_reduced.shape)

(250000, 3)
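
The 250,000 rows are consistent with every one of the 500 left records being paired with every one of the 500 right records (500 × 500). The sketch below illustrates that pairing on toy data with plain pandas. The record identifiers are made up for illustration, the index names `ix_left` and `ix_right` mirror those of the labelled data, and this is not necessarily how suricate builds its index internally.

```python
import pandas as pd

# Toy record identifiers (hypothetical values, for illustration only)
left_ids = ['a1', 'a2', 'a3']
right_ids = ['b1', 'b2']

# Pair every left record with every right record (Cartesian product)
pairs = pd.MultiIndex.from_product(
    [left_ids, right_ids], names=['ix_left', 'ix_right']
)

# The number of pairs is len(left_ids) * len(right_ids)
print(len(pairs))
```

The number of pairs grows with the product of the two dataset sizes, which is why only 500 rows per side are loaded here.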


### 1.2. Loading the labelled data
In the previous chapter, we saw how to draw a representative sample of the possible pairs. We assume that each sampled pair can be labelled manually: 1 if it is a match, 0 if it is not a match.    
In this tutorial, some labelled data is already available.

In [12]:
from suricate.data.companies import getytrue
y_true = getytrue()
y_true.sample(5)

ix_left   ix_right
c8efedca  9a5ce8e1    0
1289d400  6997e77d    0
e3c0d785  a1e7869f    1
dbf54700  094d3c39    1
f02cb731  f2f90760    0
Name: y_true, dtype: int64

In [13]:
y_true.shape[0]

5535

## 2. Manually fit the model and predict

This raises a problem: y_true (5,535 labelled pairs) and X_score_reduced (250,000 scored pairs) do not cover the same pairs.    
To fit the classifier, we must restrict both to the pairs they have in common, that is, to the intersection of their indexes.

### 2.1. Finding the labelled data from y_true in the score data

In [14]:
from suricate.preutils import createmultiindex
# Index of all pairs compared
allindex = createmultiindex(X=X_lr)
# Index common to y_true and all pairs compared
commonindex = y_true.index.intersection(allindex)
print('number of labelled samples:{}'.format(len(commonindex)))

number of labelled samples:792


In [15]:
y_labelled = y_true.loc[commonindex]
y_labelled.value_counts()

0    576
1    216
Name: y_true, dtype: int64

In [16]:
X_score_reduced = pd.DataFrame(X_score_reduced, index=allindex)
X_labelled = X_score_reduced.loc[commonindex]
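
The alignment performed in this subsection can be checked on a toy example. The sketch below uses only pandas and numpy; the identifiers, scores and labels are invented for illustration, and the names `scores_toy`, `labels_toy`, `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

names = ['ix_left', 'ix_right']

# Toy score matrix: one row per compared pair (hypothetical values)
all_pairs = pd.MultiIndex.from_product([['a1', 'a2'], ['b1', 'b2']], names=names)
scores_toy = pd.DataFrame(np.arange(8, dtype=float).reshape(4, 2), index=all_pairs)

# Toy labels: the pair ('a9', 'b9') was labelled but never scored
labelled_pairs = pd.MultiIndex.from_tuples(
    [('a1', 'b1'), ('a2', 'b2'), ('a9', 'b9')], names=names
)
labels_toy = pd.Series([1, 0, 1], index=labelled_pairs, name='y_true')

# Keep only the pairs present on both sides
common = labels_toy.index.intersection(scores_toy.index)

# Selecting with the same index gives features and labels in the same row order
X_toy = scores_toy.loc[common]
y_toy = labels_toy.loc[common]
print(len(common))
```

Selecting both objects with the same index keeps each row of features aligned with its label, which is what the classifier needs.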

### 2.2. Training and testing the model

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, ix_train, ix_test = train_test_split(X_labelled, y_labelled, commonindex, test_size=0.33)

In [18]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X=X_train, y=y_train)
print('training score:{}'.format(clf.score(X=X_train, y=y_train)))
print('testing score:{}'.format(clf.score(X=X_test, y=y_test)))

training score:0.9377358490566038
testing score:0.9465648854961832
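
The labelled data is imbalanced (576 non-matches for 216 matches), so accuracy alone can hide how well the matches themselves are recovered. As a hedged sketch, the example below uses synthetic data (not the tutorial's data) to show a reproducible, stratified split together with precision and recall for the match class. The synthetic features, the names `X_syn`, `y_syn` and `clf_syn`, and the choice of `random_state=0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for the labelled scores: 3 features,
# 576 non-matches (label 0) and 216 matches (label 1)
n_neg, n_pos = 576, 216
X_syn = np.vstack([
    rng.normal(-1.0, 1.0, size=(n_neg, 3)),
    rng.normal(1.0, 1.0, size=(n_pos, 3)),
])
y_syn = np.array([0] * n_neg + [1] * n_pos)

# stratify keeps the match ratio similar in train and test;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, stratify=y_syn, random_state=0
)

clf_syn = LogisticRegression().fit(X_tr, y_tr)
y_hat = clf_syn.predict(X_te)

# Precision: share of predicted matches that are true matches
# Recall: share of true matches that were found
precision = precision_score(y_te, y_hat)
recall = recall_score(y_te, y_hat)
print('precision:{:.3f} recall:{:.3f}'.format(precision, recall))
```

The same two metrics could be computed on `y_test` and `clf.predict(X_test)` from the cells above.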




### 2.3. Visualizing the predicted pairs

In [19]:
y_pred_test = pd.Series(clf.predict(X=X_test), index=ix_test)
good_matches = y_pred_test.loc[y_pred_test==1].index

In [20]:
from suricate.lrdftransformers.cartesian import create_lrdf_sbs
create_lrdf_sbs(X=X_lr, on_ix=good_matches).sample(5)

| ix_left | ix_right | name_left | name_right | street_left | street_right | city_left | city_right | postalcode_left | postalcode_right | duns_left | duns_right | countrycode_left | countrycode_right |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f0d34671 | 253ce464 | hamilton sundstrand | hamilton sundstrand | cl4747 harrison ave | 4747 harrison ave | rockford | rockford | 61125 | 61108-7929 | | 51079937.0 | US | US |
| 816d262e | fc8bf3d0 | ge measurement control | ge sensing | fir tree lane | fir tree lane | groby | groby | le60fh | le60fh | 226525053.0 | 219144201.0 | GB | GB |
| 0908a0aa | 77f5274a | selex es spa | selex es spa | 4 piazza monte grappa | via piemonte | rome | rome | 195 | 187 | | 434003576.0 | IT | IT |
| 150322b3 | d8fa1b69 | fako gmbh | fako | peutestr | 15 peutestrae | hamburg | hamburg | 20539 | 20539 | 313518398.0 | 340213235.0 | DE | DE |
| 787940de | 3b7341ce | marconi selenia communications spa | marconi selenia communications spa | 1a via ambrogio negrone | 1a via ambrogio negrone | genoa | genoa | 16153 | 16153 | 440028405.0 | | IT | IT |
