Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track input data #232

Merged
merged 13 commits into from
Feb 9, 2021
Merged

Track input data #232

merged 13 commits into from
Feb 9, 2021

Conversation

edmundmiller
Copy link
Contributor

Closes #141

@Zethson is this the direction you had in mind so far?

@edmundmiller edmundmiller self-assigned this Feb 7, 2021
@edmundmiller edmundmiller changed the base branch from master to development February 7, 2021 23:02
@mlf-core mlf-core deleted a comment from github-actions bot Feb 7, 2021
@mlf-core mlf-core deleted a comment from github-actions bot Feb 7, 2021
Copy link
Contributor

@KevinMenden KevinMenden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this looks already pretty nice and simply! 👍

However, for image projects or any other projects with larger datasets, you have folders of images and the labels usually in a json file.
Since this is a typical subcase, we should probably generate md5 sums for all files in these folders, to really be on the safe side.

Only problem with this is that this could be really time-intensive if someone happens to run this on a large dataset, with half a million samples. Maybe it's not too bad though. Probably something to try out.

I've also implemented a recursive calculation of md5sums in my PR by the way, in case you want to take a shortcut :)

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Agree with everything that @KevinMenden said
  2. We usually use single quotes in the template and it would be nice to stick with them here (super nitpick, don't worry :) )
  3. I slept about it and I prefer input_name_input_hash :)
  4. Besides that I think this is what I/we had in mind 👍 Just need to update the documentation somewhere to mention that this functionality exists. I am not perfectly sure yet where. Maybe you have an idea? If not feel free to ping me and I will try to find a suitable spot.
  5. I increased the version of mlf-core in development -> please merge development into this branch and update the changelog

@edmundmiller
Copy link
Contributor Author

edmundmiller commented Feb 9, 2021

4. Besides that I think this is what I/we had in mind +1  Just need to update the documentation somewhere to mention that this functionality exists. I am not perfectly sure yet where. Maybe you have an idea? If not feel free to ping me and I will try to find a suitable spot.

I think this is the only part left. I think the documentation of mlf_core.py needs to happen, and maybe to make a follow up issue.

Do you want me to update the changelog in this PR also?

"""
Recursively go through directories and subdirectories
and generate tuples of (<file_path>, <md5sum>)
returns: list of tuples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be a type hint as well :)

@Zethson
Copy link
Member

Zethson commented Feb 9, 2021

4. Besides that I think this is what I/we had in mind +1  Just need to update the documentation somewhere to mention that this functionality exists. I am not perfectly sure yet where. Maybe you have an idea? If not feel free to ping me and I will try to find a suitable spot.

I think this is the only part left. I think the documentation of mlf_core.py needs to happen, and maybe to make a follow up issue.

Do you want me to update the changelog in this PR also?

+1 for updating the CL
You can either update the documentation right now or open an issue and do it later. Whatever you prefer.

Traceback (most recent call last):
  File "exploding_springfield/exploding_springfield.py", line 57, in <module>
    MLFCore.log_input_data(dm)
  File "/mlflow/projects/code/exploding_springfield/mlf_core/mlf_core.py", line 100, in log_input_data
    input_hash = cls.get_md5_sums(input_data)
  File "/mlflow/projects/code/exploding_springfield/mlf_core/mlf_core.py", line 85, in get_md5_sums
    elements = glob.glob(dir + "/*")
TypeError: unsupported operand type(s) for +: 'MNISTDataModule' and 'str'

This needs to be fixed :)

@edmundmiller
Copy link
Contributor Author

@Zethson I'm having trouble figuring this one out, probably due to lack of experience. Should these all just be data/?

I created #235 as a follow up.

@Zethson
Copy link
Member

Zethson commented Feb 9, 2021

@Zethson I'm having trouble figuring this one out, probably due to lack of experience. Should these all just be data/?

I created #235 as a follow up.

Yeah that's why I am not sure whether you can really nicely incorporate the functions into the templates already. These dataloaders download and save the files in /data. However, be careful because when training with Docker the Tensorboard logs are also saved in the /data directory.

@Imipenem time to finally move the tensorboard logs out from /data and save them in a new directory? /logs?

@edmundmiller
Copy link
Contributor Author

What if we comment them out with data/ in them, and slap a TODO comment above it with enable input logging?

@Zethson
Copy link
Member

Zethson commented Feb 9, 2021

What if we comment them out with data/ in them, and slap a TODO comment above it with enable input logging?

So couple of options:

  1. What you suggested
  2. We hard core /data and fix the Tensorboard logging path
  3. We don't add it at all in the templates for now.

I am fine with any of those. So feel free to do what you think is best or we wait for a comment from @Imipenem

@edmundmiller edmundmiller marked this pull request as ready for review February 9, 2021 16:43
@edmundmiller
Copy link
Contributor Author

I went with 1. while we wait to hear from @Imipenem. Should be ready for review if everything passes.

@Imipenem
Copy link
Contributor

Imipenem commented Feb 9, 2021

@emiller88 @Zethson : Yes, we should definitely move the tensorboard logging stuff then into a new directory logs/.

I just remember that I had some issues with the volume path to be shared between docker and local system, which was the reason I stayed with the data/ directory. But I have to try this and see how its going.

Copy link
Member

@Zethson Zethson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emiller88 @Imipenem can one of you please open an issue to change the tensorboard logging path?

@Zethson Zethson merged commit 185cf56 into development Feb 9, 2021
@Imipenem
Copy link
Contributor

Imipenem commented Feb 9, 2021

Done #236

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track input data with md5
4 participants