Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add interface for multi-instance splitting (closes #126) #127

Merged
merged 6 commits into from
Nov 26, 2021
Merged

Conversation

jannisborn
Copy link
Contributor

@jannisborn jannisborn commented Nov 21, 2021

Addressing #126 (support for multi-instance splitting).
This is a draft PR, please give some feedback @kexinhuang12345

Done:
Following your suggestion, I implemented a method called create_fold_setting_cold_multi that splits on multiple instance. I tried to follow the package conventions in coding style. The code logic is to iterate over all entities on which the split should be done (e.g., ['Drug_ID', 'Cell Line_ID']) and randomly assign entity-instances to test. The test-set is created by selecting samples where all entity instances belong to test. This is then repeated for validation.
The proposed method is strictly more powerful than the current create_fold_setting_cold. Indeed, if a single entity is passed for splitting the result is always identical. I verified like this for multiple configurations.

folds_1 = create_fold_setting_cold_multi(df, 42, [0.8, 0.1, 0.1], entities=['Drug_ID'])
folds_2 = create_fold_setting_cold_multi(df, 42, [0.8, 0.1, 0.1], entity='Drug_ID')

for a,b in zip(folds_1.values(), folds_1.values()):
    assert all(a==b)

Therefore, I wanted to suggest to replace the current create_fold_setting_cold with the body of the new create_fold_setting_cold_multi. This would be perfectly backwards compatible. The only difference would be that for the positional entity argument, instead of a str you could also pass a List of strings. The beauty of this solution is that it would avoid duplicate code snippets and require minimal updates in the existing codebase. Let me know what you think

ToDo:

  • Decide whether create_fold_setting_cold is superseded by create_fold_setting_cold_multi
  • Add an example to the docs
  • unittests. Do you want to see some?
  • Remove the print commands (I added them for debugging)

@kexinhuang12345
Copy link
Collaborator

kexinhuang12345 commented Nov 21, 2021

Hi Jannis, thanks so much for the quick, excellent and elegant solution! I think it makes lots of sense to replace create_fold_setting_cold with your version. Feel free to go ahead!

And regarding the other ToDos, (1) it would be great to add an example in the doc and I will also duplicate it in the website; (2) uni test would be great! After you are done, I will also do some testing on the other dataloaders that use the create_fold_setting_cold (3) sounds good!

@jannisborn
Copy link
Contributor Author

Thanks Kexin for the quick feedback. I think the PR is ready for review!

  • I now replaced the body of the create_fold_setting_cold with the new code
  • I adapted the docstrings (since you deploy the docs automatically via sphinx I dont think anything else is needed here?)
  • I added unittests for the new multi-split version. I also renamed the remaining methods to prepend test_ such that they are discovered by unittest
  • I wanted to add an example to the website here but I didnt find the code for it in the repo. Maybe it's easier if you do that quickly?

@jannisborn jannisborn marked this pull request as ready for review November 22, 2021 15:57
@kexinhuang12345
Copy link
Collaborator

Thank you @jannisborn ! Looking great! I will review it this week and will keep you posted!
And yes, the docstring is auto generated; website is in another repo which i can copy-paste to it quickly

@kexinhuang12345
Copy link
Collaborator

Hey @jannisborn thanks again for the commit! I go through the code and conduct several tests. They all worked perfectly! Merging it now

@kexinhuang12345 kexinhuang12345 merged commit 47b8f9c into mims-harvard:main Nov 26, 2021
@kexinhuang12345
Copy link
Collaborator

Also, just added an example on the website! https://tdcommons.ai/functions/data_split/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants