# From Comparison to Prediction

This chapter trains a machine learning model on the similarity scores computed in the previous chapter, in order to predict whether two records are essentially the same.

## Set-up of the score matrix

### Data imports

In [16]:
import pandas as pd
from suricate.data.companies import getXlr
X_lr = getXlr(nrows=500)

In [17]:
X_lr[0].sample(5)

| ix | name | street | city | postalcode | duns | countrycode |
|---|---|---|---|---|---|---|
| ea464e59-6a49-4424-af64-5a3eb67e502d | adf technologies championdoor adf technologies... | 6 avenue jean monnet | colomiers | 31770 | | FR |
| d8fa1b69-823b-4456-b13d-5235988514ae | fako | 15 peutestrae | hamburg | 20539 | 340213235.0 | DE |
| 77c66fd9-8215-4702-945b-92fd6958b535 | rg gmbh | 7 im meiel | waldenbuch | 71111 | 319932075.0 | DE |
| 76e834db-6e0c-400a-8f51-98b2652fd479 | datalis | 8 avenue gutenberg | portet sur garonne | 31125 | 260332779.0 | FR |
| 0e02feec-dbf0-4693-b7bf-0b4d01629baa | heider druck gmbh | 102 116 paffrather str | bergisch gladbach | 51465 | 323117283.0 | DE |


### Creating the score matrix
Several comparators can be combined using standard scikit-learn operators. Here, `FeatureUnion` concatenates the output of each comparator into one column of the score matrix.

In [18]:
from suricate.lrdftransformers import FuzzyConnector, VectorizerConnector, ExactConnector
from sklearn.pipeline import FeatureUnion

In [24]:
scores = [
    ('name_vecword', VectorizerConnector(on='name', analyzer='word', ngram_range=(1,2))),
    ('name_vecchar', VectorizerConnector(on='name', analyzer='char', ngram_range=(1,3))),
    ('street_vecword', VectorizerConnector(on='street', analyzer='word', ngram_range=(1,2))),
    ('street_vecchar', VectorizerConnector(on='street', analyzer='char', ngram_range=(1,3))),
    ('city_vecchar', VectorizerConnector(on='city', analyzer='char', ngram_range=(1,3))),
    ('postalcode_exact', ExactConnector(on='postalcode')),
    ('duns_exact', ExactConnector(on='duns')),
    ('countrycode_exact', ExactConnector(on='countrycode'))
]
transformer = FeatureUnion(scores)
X_score = transformer.fit_transform(X_lr)
X_score.shape

(250000, 8)
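
The shape follows from the set-up: 500 left records compared with 500 right records give 250,000 pairs, and each of the 8 comparators contributes one column. As a minimal sketch of how one such column could be computed, the snippet below uses only scikit-learn (TF-IDF vectors and cosine similarity). It is an illustration under assumptions, not suricate's actual implementation; `pairwise_tfidf_score`, `left`, `right` and `X_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_tfidf_score(left, right, analyzer="char", ngram_range=(1, 3)):
    """Return one similarity score per (left, right) pair as a column vector."""
    vectorizer = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    # Fit on both sides so they share a single vocabulary.
    vectorizer.fit(pd.concat([left, right]))
    sim = cosine_similarity(vectorizer.transform(left), vectorizer.transform(right))
    # Flatten row by row: pair (i, j) ends up at position i * len(right) + j.
    return sim.reshape(-1, 1)


left = pd.Series(["fako", "rg gmbh", "datalis"])
right = pd.Series(["fako gmbh", "datalis sa"])

# Two comparators side by side, similar in spirit to what FeatureUnion stacks.
X_demo = np.hstack([
    pairwise_tfidf_score(left, right, analyzer="char", ngram_range=(1, 3)),
    pairwise_tfidf_score(left, right, analyzer="word", ngram_range=(1, 2)),
])
print(X_demo.shape)  # (6, 2): 3 x 2 pairs, 2 scores per pair
```

The number of rows grows with the product of the two table sizes, which is why the later steps work on a reduced matrix and on a sample of pairs.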

### Pipelining for dimension reduction

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer as Imputer
from sklearn.preprocessing import Normalizer as Scaler
from sklearn.decomposition import PCA
steps = [
    ('scorer', transformer),
    ('imputer', Imputer(strategy='constant', fill_value=0)),
    ('scaler', Scaler()),
    ('pca', PCA(n_components=3))
]
preprocessing_pipeline = Pipeline(steps)
X_score_ready = preprocessing_pipeline.fit_transform(X=X_lr)
print(X_score_ready.shape)
# from seaborn import pairplot
# %matplotlib inline
# pairplot(pd.DataFrame(X_score_ready))

(250000, 3)
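
One detail of this pipeline is easy to miss: scikit-learn's `Normalizer`, imported above under the alias `Scaler`, rescales each row (each pair) to unit norm. It does not standardize columns the way `StandardScaler` does. The sketch below reproduces the imputing, normalizing and PCA steps on a synthetic matrix, so it runs without suricate. The data and the names `X_fake`, `prep` and `X_reduced` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
# Synthetic stand-in for a score matrix: 1000 pairs, 8 scores in [0, 1].
X_fake = rng.random((1000, 8))
# Simulate missing comparisons (for example an empty duns field).
X_fake[rng.random(X_fake.shape) < 0.1] = np.nan

prep = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("normalizer", Normalizer()),  # L2-normalizes each row, not each column
])
X_norm = prep.fit_transform(X_fake)
row_norms = np.linalg.norm(X_norm, axis=1)
print(row_norms.min(), row_norms.max())

pca = PCA(n_components=3).fit(X_norm)
X_reduced = pca.transform(X_norm)
print(X_reduced.shape)
# Share of variance kept by each component; useful when choosing n_components.
print(pca.explained_variance_ratio_)
```

Inspecting `explained_variance_ratio_` is one way to check whether three components retain enough of the information in the original eight scores.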


## Training the model

### Asking representative questions
#### A sample from each cluster of pairs gives a representative subset of the whole data set
Since there are many possible pairs (n_rows_left * n_rows_right), we must take a representative subset of them in order to train the model. Using the score matrix produced by the preprocessing pipeline, we can do that with clustering: we cluster the pairs according to their similarity scores, then sample from each cluster.


In [51]:
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=20)
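# fit_predict returns one cluster label per pair;
# fit_transform on KMeans would return distances to the cluster centers instead
y_cluster = Pipeline(steps=[
        ('feature_extraction', preprocessing_pipeline),
        ('clustering', cluster)
]).fit_predict(X_lr)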
y_avg_score = Pipeline(steps=[
        ('feature_extraction', preprocessing_pipeline),
        ('reduction1d', PCA(n_components=1))
]).fit_transform(X_lr)

In [53]:
len(y_cluster)

250000

In [48]:
df = cluster.representative_questions(n_questions=10).sort_values(by=['similarity', 'cluster'], ascending=False)

In [49]:
df.to_csv('results.csv', index=True)
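
The idea of asking a fixed number of questions per cluster can also be sketched with plain pandas and scikit-learn. This is a rough equivalent under stated assumptions, not the `representative_questions` method used above, whose implementation is not shown here. The data is synthetic, the mean score serves only as an ordering proxy, and the names `X_fake`, `pairs` and `questions` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the reduced score matrix (1000 pairs, 3 components).
X_fake = rng.random((1000, 3))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
pairs = pd.DataFrame({
    "cluster": kmeans.fit_predict(X_fake),
    # Mean score as a rough similarity proxy, used only for ordering.
    "similarity": X_fake.mean(axis=1),
})

n_questions = 10
# Draw n_questions pairs from every cluster.
# Assumes each cluster holds at least n_questions pairs.
questions = (
    pairs.groupby("cluster")
    .sample(n=n_questions, random_state=0)
    .sort_values(by=["similarity", "cluster"], ascending=False)
)
print(questions["cluster"].value_counts())
```

Sampling the same number of pairs from every cluster keeps rare kinds of pairs (for example near-duplicates) in the labelling set, where a uniform random sample would be dominated by obvious non-matches.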