# Resemblance

Resemblance models try to measure how different two samples are from a multivariate perspective.

A resemblance model works as follows:
- it takes two datasets, X1 and X2, as input
- it labels each observation with 0 or 1 according to the dataset it comes from
- it builds a model that tries to predict which sample an observation belongs to
- if this model has an AUC of about 0.5, the samples are not distinguishable.

In the present case, an AUC close to 0.5 is the expected result: it means that the features are uninformative for the 'classification' of the instances into the train and the test set.
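
The idea can be sketched with scikit-learn alone. This is a minimal illustration on synthetic data, not the probatus implementation: the helper `resemblance_auc`, the random forest settings, and the 30% hold-out split are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def resemblance_auc(X1, X2, random_state=0):
    """Hypothetical helper: AUC of a classifier trained to tell X1 from X2."""
    # Stack both samples and label their origin: 0 for X1, 1 for X2.
    X = np.vstack([X1, X2])
    y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])
    # Hold out part of the data so the AUC is not inflated by overfitting.
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_fit, y_fit)
    return roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])


rng = np.random.default_rng(0)
same_a = rng.normal(size=(500, 3))            # two samples drawn from
same_b = rng.normal(size=(500, 3))            # the same distribution
shifted = rng.normal(loc=2.0, size=(500, 3))  # a sample with a shifted mean

auc_same = resemblance_auc(same_a, same_b)      # expected to be near 0.5
auc_shifted = resemblance_auc(same_a, shifted)  # expected to be near 1.0
```

When both samples come from the same distribution the classifier has nothing to learn and its AUC stays around 0.5. A shift in the features lets it separate the samples, which pushes the AUC towards 1.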

In [1]:
from probatus.datasets import lending_club
from sklearn.model_selection import train_test_split
from probatus.samples import ResemblanceModel

### Use the lending club dataset

In [2]:
credit_df, X_train, X_test, y_train, y_test = lending_club()

### Fit the model on two samples; here we compare how similar train and test are

In [3]:
rm = ResemblanceModel().fit(X_train, X_test)
rm



ResemblanceModel
	Underlying model type: RandomForestClassifier
The model is able to distinguish the samples with an AUC of 0.498

This is the expected result, as the model is not able to assign an instance to a sample based on its features. This is a first indication that the train and test sets are similar in feature distribution.
If this is not the case, looking at the feature importances may help us identify the feature(s) that cause the issue.
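
To illustrate how importances can point to a problematic feature, the sketch below shifts a single feature in synthetic data and inspects the random forest's impurity-based `feature_importances_`. The feature names are hypothetical, and whether probatus computes `importances` in exactly this way is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

# Two samples from the same distribution, except for a shift in feat_b.
X1 = rng.normal(size=(1000, 3))
X2 = rng.normal(size=(1000, 3))
X2[:, 1] += 1.5

# Label the origin of each row: 0 for X1, 1 for X2.
X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
importances = dict(zip(feature_names, clf.feature_importances_))
top_feature = max(importances, key=importances.get)
```

In this setup the shifted feature should receive the largest importance, which is the kind of signal to look for when the resemblance AUC is well above 0.5.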

### Check which features seem to have a higher importance in predicting which sample an observation belongs to

In [4]:
rm.importances

loan_amnt          0.144486
funded_amnt        0.145201
term               0.015892
int_rate           0.200380
annual_inc         0.220503
fico_range_low     0.108164
fico_range_high    0.102589
long_emp           0.028907
credit_grade_A     0.004518
credit_grade_B     0.006247
credit_grade_C     0.009524
credit_grade_D     0.006349
credit_grade_E     0.004318
credit_grade_F     0.002342
credit_grade_G     0.000579
dtype: float64