
Model Mining class - Keras blog comparison #99

Closed
MichalChromcak opened this issue May 24, 2021 · 3 comments

Hi Vincent,

as mentioned many times in private messages, I am a big fan of calm code - an endless source of inspiration for what to try, really.

Looking at the Model Mining class today and its comparison with the performance claimed in the Keras blog, I could not resist making a note about a fairer comparison.

Problem

In the blog post, the validation set contains 20% of the data, while yours is a 50/50 split with shuffle=True. I am not arguing which way is better, but I think that to compare the results more fairly one would aim to use the same test set.

Assuming the fetch_openml method and the following snippet from the Keras post return the same data in the same order (a quick sanity check of this assumption is sketched below),

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
fname = "/Users/fchollet/Downloads/creditcard.csv"

the performance of the human-learn based classifier would drop from

              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142146
        True       0.75      0.75      0.75       258

    accuracy                           1.00    142404
   macro avg       0.88      0.87      0.87    142404
weighted avg       1.00      1.00      1.00    142404

to

              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Please consider this just an observation; I don't mean it in a bad way.
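
A quick way to check that ordering assumption (just a sketch; the local creditcard.csv path is hypothetical and assumes the Kaggle file from the Keras post was downloaded):

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# CSV from https://www.kaggle.com/mlg-ulb/creditcardfraud/
df_kaggle = pd.read_csv("creditcard.csv")
df_openml = fetch_openml(data_id=1597, as_frame=True)['frame']

# if the row order matches, Time and Amount should line up row by row
print(np.allclose(df_kaggle['Time'].to_numpy(), df_openml['Time'].astype(float).to_numpy()))
print(np.allclose(df_kaggle['Amount'].to_numpy(), df_openml['Amount'].astype(float).to_numpy()))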

Proposed change

num_val_samples = int(len(df_credit) * 0.2)
credit_train_keras, credit_test_keras = df_credit.iloc[:-num_val_samples], df_credit.iloc[-num_val_samples:]

print(classification_report(credit_test_keras['group'], clf.fit(credit_train_keras, credit_train_keras['group']).predict(credit_test_keras)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Full code used

Update of Calm Code Model Mining to reflect the same validation set as the Keras blog

!cat environment.yml
name: calm_code
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas=1.2
  - scikit-learn>=0.24.1
  - numpy
  - matplotlib
  - jupyterlab
  - pip

  - pip:
    - human-learn
from sklearn.datasets import fetch_openml
df_credit = fetch_openml(
    data_id=1597,
    as_frame=True
)

df_credit = df_credit['frame'].rename(columns={"Class": "group"})
df_credit['group'] = df_credit['group'] == '1'

df_credit.head()
def split_data(df, way="original"):
    if way=="original":
        from sklearn.model_selection import train_test_split

        train, test = train_test_split(df, test_size=0.5, shuffle=True)
    else:
        # assuming the data sources are sorted in the same way (on Kaggle and in sklearn),
        # this split should keep the comparison with the Keras blog fair
        num_val_samples = int(len(df) * 0.2)
        train, test = df.iloc[:-num_val_samples], df.iloc[-num_val_samples:]

    return train, test
credit_train_orig, credit_test_orig = split_data(df_credit, way="original")
credit_train_keras, credit_test_keras = split_data(df_credit, way="keras")
from hulearn.experimental import CaseWhenRuler
from hulearn.classification import FunctionClassifier

def make_prediction(dataf):
    ruler = CaseWhenRuler(default=0)

    (ruler
    .add_rule(lambda d: (d['V11'] > 4), 1)
    .add_rule(lambda d: (d['V17'] < -3), 1)
    .add_rule(lambda d: (d['V14'] < -8), 1))

    return ruler.predict(dataf)

clf = FunctionClassifier(make_prediction)
from sklearn.metrics import classification_report
# for this rule-based system it does not matter whether we fit on train or test (the results are the same), but fitting on the train set is probably what readers expect
print(classification_report(credit_test_orig['group'], clf.fit(credit_train_orig, credit_train_orig['group']).predict(credit_test_orig)))
# print(classification_report(credit_test_orig['group'], clf.fit(credit_test_orig, credit_test_orig['group']).predict(credit_test_orig)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00    142146
        True       0.75      0.75      0.75       258

    accuracy                           1.00    142404
   macro avg       0.88      0.87      0.87    142404
weighted avg       1.00      1.00      1.00    142404
print(classification_report(credit_test_keras['group'], clf.fit(credit_train_keras, credit_train_keras['group']).predict(credit_test_keras)))
              precision    recall  f1-score   support

       False       1.00      1.00      1.00     56886
        True       0.75      0.63      0.68        75

    accuracy                           1.00     56961
   macro avg       0.87      0.81      0.84     56961
weighted avg       1.00      1.00      1.00     56961

Sources
https://calmcode.io/model-mining/benchmark.html
https://keras.io/examples/structured_data/imbalanced_classification/

koaning (Owner) commented May 24, 2021

Ah yeah, that might be fair to explore further. The general point I'm trying to make is that the rule model certainly seems competitive, but if the validation split has a large effect I should investigate that further.
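
For instance, one could re-score the exact same rules over a few random 50/50 splits and over the chronological last-20% split, just to see how much the numbers move. A rough sketch (reusing clf and df_credit from the code above):

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

def score(train, test):
    preds = clf.fit(train, train['group']).predict(test)
    return precision_score(test['group'], preds), recall_score(test['group'], preds)

# a few random 50/50 splits, as in the original benchmark
for seed in range(5):
    tr, te = train_test_split(df_credit, test_size=0.5, shuffle=True, random_state=seed)
    print("50/50 shuffle, seed", seed, score(tr, te))

# the chronological last-20% split used in the Keras post
n_val = int(len(df_credit) * 0.2)
print("last 20% (Keras-style)", score(df_credit.iloc[:-n_val], df_credit.iloc[-n_val:]))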

koaning (Owner) commented May 24, 2021

I just ran the numbers locally and I can confirm your finding.

I'll add a comment to the post to take everything with a grain of salt; everything you describe here is fair to add.

But it strikes me that the conclusion still remains the same. The Keras blog lists a 0.24 precision score, which is terrible. Even at p=0.75/r=0.63 I would argue that the rule-based system has a lot going for it compared to the Keras numbers.
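
Back of the envelope, using only the precision numbers quoted here: at precision p you get roughly (1 - p) / p false alarms for every correctly flagged fraud case.

for name, p in [("Keras blog", 0.24), ("rule-based", 0.75)]:
    print(f"{name}: about {(1 - p) / p:.1f} false alarms per correctly flagged fraud")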

koaning (Owner) commented May 24, 2021

I've just pushed a PR with the comment (should be live within a minute). The comment will link to the conversation here as well. I'm closing the issue, but if folks want to keep discussing what might be a fair comparison here: feel free.

To be perfectly honest, as much as I like the technique ... the methodology around it has some rough edges. Since there is a human-in-the-loop it can be incredibly tempting to update the rules to optimize for test set performance, which is exactly what we don't want during benchmarking.
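
One way to guard against that (just a sketch, not something the post currently does) would be to carve out a final holdout before writing any rules, tune the CaseWhenRuler thresholds on the development portion only, and score the holdout exactly once at the end:

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

dev, holdout = train_test_split(df_credit, test_size=0.2, shuffle=True,
                                stratify=df_credit['group'], random_state=0)

# iterate on the rules while only ever looking at `dev` ...
# ... then report on `holdout` a single time
final_preds = clf.fit(dev, dev['group']).predict(holdout)
print(classification_report(holdout['group'], final_preds))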

koaning closed this as completed May 24, 2021