
Model Mining class - Keras blog comparison #99

Closed
MichalChromcak opened this issue May 24, 2021 · 3 comments

Hi Vincent,

as mentioned many times in private messages, I am a big fan of calm code - an endless source of inspiration for what to try, really.

Looking at the Model Mining class today and its comparison with the performance claimed in the Keras blog, I could not resist making a note about a fairer comparison.

Problem

In the blog post, the validation set contains 20% of the data, while yours is a 50/50 split with shuffle=True. I am not arguing which way is better, but I think that to compare the results more fairly one would aim to use the same test set.

Assuming the fetch_openml method and the following snippet from the Keras post return the same data in the same order (a quick sanity check of this assumption is sketched below),

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "/Users/fchollet/Downloads/creditcard.csv"

the performance of the human-learn based classifier would drop from

              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142146
        True       0.75      0.75      0.75       258

    accuracy                           1.00    142404
   macro avg       0.88      0.87      0.87    142404
weighted avg       1.00      1.00      1.00    142404

to

              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Please consider this just an observation; I don't mean it in a bad way.
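
A quick way to check that ordering assumption (just a sketch; the local creditcard.csv path is hypothetical and assumes the Kaggle file from the Keras post was downloaded):

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# CSV from https://www.kaggle.com/mlg-ulb/creditcardfraud/
df_kaggle = pd.read_csv("creditcard.csv")
df_openml = fetch_openml(data_id=1597, as_frame=True)['frame']

# if the row order matches, Time and Amount should line up row by row
print(np.allclose(df_kaggle['Time'].to_numpy(), df_openml['Time'].astype(float).to_numpy()))
print(np.allclose(df_kaggle['Amount'].to_numpy(), df_openml['Amount'].astype(float).to_numpy()))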

Proposed change

num_val_samples = int(len(df_credit) * 0.2)
credit_train_keras, credit_test_keras = df_credit.iloc[:-num_val_samples], df_credit.iloc[-num_val_samples:]

print(classification_report(credit_test_keras['group'], clf.fit(credit_train_keras, credit_train_keras['group']).predict(credit_test_keras)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Full code used

Update of Calm Code Model Mining to reflect the same validation set as the Keras blog

!cat environment.yml
name: calm_code
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas=1.2
  - scikit-learn>=0.24.1
  - numpy
  - matplotlib
  - jupyterlab
  - pip

  - pip:
    - human-learn
from sklearn.datasets import fetch_openml
df_credit = fetch_openml(
    data_id=1597,
    as_frame=True
)

df_credit = df_credit['frame'].rename(columns={"Class": "group"})
df_credit['group'] = df_credit['group'] == '1'

df_credit.head()
def split_data(df, way="original"):
    if way=="original":
        from sklearn.model_selection import train_test_split

        train, test = train_test_split(df, test_size=0.5, shuffle=True)
    else:
        # assuming the data sources are sorted in the same way (on Kaggle and in sklearn),
        # this split should keep the comparison with the Keras blog fair
        num_val_samples = int(len(df) * 0.2)
        train, test = df.iloc[:-num_val_samples], df.iloc[-num_val_samples:]

    return train, test
credit_train_orig, credit_test_orig = split_data(df_credit, way="original")
credit_train_keras, credit_test_keras = split_data(df_credit, way="keras")
from hulearn.experimental import CaseWhenRuler
from hulearn.classification import FunctionClassifier

def make_prediction(dataf):
    ruler = CaseWhenRuler(default=0)

    (ruler
    .add_rule(lambda d: (d['V11'] > 4), 1)
    .add_rule(lambda d: (d['V17'] < -3), 1)
    .add_rule(lambda d: (d['V14'] < -8), 1))

    return ruler.predict(dataf)

clf = FunctionClassifier(make_prediction)
from sklearn.metrics import classification_report
# for this rule-based system it does not matter whether we fit on train or test (the results are the same), but fitting on the train set is probably what readers expect
print(classification_report(credit_test_orig['group'], clf.fit(credit_train_orig, credit_train_orig['group']).predict(credit_test_orig)))
# print(classification_report(credit_test_orig['group'], clf.fit(credit_test_orig, credit_test_orig['group']).predict(credit_test_orig)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142146
        True       0.75      0.75      0.75       258

    accuracy                           1.00    142404
   macro avg       0.88      0.87      0.87    142404
weighted avg       1.00      1.00      1.00    142404
print(classification_report(credit_test_keras['group'], clf.fit(credit_train_keras, credit_train_keras['group']).predict(credit_test_keras)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Sources
https://calmcode.io/model-mining/benchmark.html
https://keras.io/examples/structured_data/imbalanced_classification/

koaning (Owner) commented May 24, 2021

Ah yeah, that might be fair to explore further. The general point I'm trying to make is that the rule model certainly seems competitive, but if the validation split has a large effect I should investigate that further.
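
For instance, one could re-score the exact same rules over a few random 50/50 splits and over the chronological last-20% split, just to see how much the numbers move. A rough sketch (reusing clf and df_credit from the code above):

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

def score(train, test):
    preds = clf.fit(train, train['group']).predict(test)
    return precision_score(test['group'], preds), recall_score(test['group'], preds)

# a few random 50/50 splits, as in the original benchmark
for seed in range(5):
    tr, te = train_test_split(df_credit, test_size=0.5, shuffle=True, random_state=seed)
    print("50/50 shuffle, seed", seed, score(tr, te))

# the chronological last-20% split used in the Keras post
n_val = int(len(df_credit) * 0.2)
print("last 20% (Keras-style)", score(df_credit.iloc[:-n_val], df_credit.iloc[-n_val:]))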

koaning (Owner) commented May 24, 2021

I just ran the numbers locally and I can confirm your finding.

I'll add a comment to the post to take everything with a grain of salt; everything you describe here is fair to add.

But it strikes me that the conclusion still remains the same. The Keras blog lists a 0.24 precision score, which is terrible. Even at p=0.75/r=0.63 I would argue that the rule-based system has a lot going for it compared to the Keras numbers.
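
Back of the envelope, using only the precision numbers quoted here: at precision p you get roughly (1 - p) / p false alarms for every correctly flagged fraud case.

for name, p in [("Keras blog", 0.24), ("rule-based", 0.75)]:
    print(f"{name}: about {(1 - p) / p:.1f} false alarms per correctly flagged fraud")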

koaning (Owner) commented May 24, 2021

I've just pushed a PR with the comment (should be live within a minute). The comment will link to the conversation here as well. I'm closing the issue, but if folks want to keep discussing what might be a fair comparison here: feel free.

To be perfectly honest, as much as I like the technique ... the methodology around it has some rough edges. Since there is a human-in-the-loop it can be incredibly tempting to update the rules to optimize for test set performance, which is exactly what we don't want during benchmarking.
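
One way to guard against that (just a sketch, not something the post currently does) would be to carve out a final holdout before writing any rules, tune the CaseWhenRuler thresholds on the development portion only, and score the holdout exactly once at the end:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

dev, holdout = train_test_split(df_credit, test_size=0.2, shuffle=True,
                                stratify=df_credit['group'], random_state=0)

# iterate on the rules while only ever looking at `dev` ...
# ... then report on `holdout` a single time
final_preds = clf.fit(dev, dev['group']).predict(holdout)
print(classification_report(holdout['group'], final_preds))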

koaning closed this as completed May 24, 2021