
Why do we have to manually split training and validation images? #45

Open
aferust opened this issue Jan 25, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

aferust commented Jan 25, 2021

Yes, why? It is a huge hassle. Could you implement an auto-split procedure using the scikit-learn function train_test_split?

maikherbig (Owner) commented:

Hi,
thanks for the new issue! Having such an option would indeed be cool for quick model performance tests.

Although I must admit that I deliberately omitted this feature so far. The reason is that I wanted users to choose their validation data wisely. Lots of things can go wrong when capturing a dataset (wrong labeling, badly focused images, lots of blur, dirt on the camera, oversampling one class, …), and the validation set should be chosen and checked with extra care. Furthermore, users almost always want to train models to predict events in the future. Therefore, it makes sense that the validation set is captured after the training set. Similarly, for biomedical applications, trained models still need to work on data from a new experiment or from another patient.

If the existing dataset is very large, a random allocation of validation data might be an option. Therefore, I'm now thinking about how to implement the train_test_split you suggested as an option.
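For reference, a minimal sketch of what such an option could look like under the hood, using scikit-learn's train_test_split on a list of image paths. The folder layout and file names here are hypothetical examples, not AIDeveloper's actual structure:

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset: image paths grouped by class folder
files = [f"cat/img{i:03d}.png" for i in range(5)] + [f"dog/img{i:03d}.png" for i in range(5)]
labels = [f.split("/")[0] for f in files]

# A stratified 80/20 split preserves the class proportions in both sets
train_files, valid_files, y_train, y_valid = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_files), len(valid_files))  # 8 2
```

Stratification matters here: with a plain random split, a small validation set can easily end up missing a class entirely.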

@maikherbig maikherbig added the enhancement New feature or request label Jan 26, 2021

aferust commented Jan 26, 2021

Maybe I will write a helper tool for automating that split. It does not have to be embedded in the main project; it could just create the required folder structure from a single input folder. The project seems very promising, and thank you for the great contribution. I used NVIDIA DIGITS once in this study: https://link.springer.com/article/10.1007/s11694-020-00707-7

However, installing DIGITS is a hassle, especially for my students, who have no programming background. I was looking for an alternative to DIGITS that can be easily installed, and then I saw your project. I could not find the legacy GoogLeNet in the predefined list. It is nice to know that there is an opportunity to get some support for the program.

maikherbig (Owner) commented:

Sounds like AIDeveloper could be a helpful tool for you. The students just need to download and unzip it. AIDeveloper even works with GPU support.
Within the unzipped AIDeveloper folder, you can find the following scripts:
model_zoo.py, aid_backbone.py, aid_bin.py, aid_start.py, aid_dl.py, aid_frontend.py, aid_img.py,
which you can modify to your liking. After the next start of AIDeveloper.exe the changes will take effect.
In particular, model_zoo.py contains the definitions of the neural nets. There is an explanation of how to add a model within the script. Furthermore, I uploaded a tutorial video showing how to add models to the model zoo.

Maybe you have already discovered the "Python" tab within AIDeveloper. There, you can execute any code you want in the same Python environment that is used by AIDeveloper. Hence, packages like tensorflow, scikit-learn, opencv, and so on are available without having to install anything.
You could, for example, use it to execute code for automating the train/test split :)
[Screenshot: Python tab in AIDeveloper]


aferust commented Jan 26, 2021

Thank you for the information. It looks like the user has a lot of control over it.


aferust commented Jan 26, 2021

Dear Maik,

Here is my tt_split implementation. It probably needs more error handling. I had near-zero experience with Qt, and this is my first Qt program, but it looks like it does the job :)

https://gist.github.com/aferust/55bb70359fdd3148c7e920b02907084a

maikherbig (Owner) commented:

Thanks for sharing your code!
I used Qt Designer for AIDeveloper. The resulting .ui files can be transformed into .py scripts.
It sounds like you just need some kind of easy-to-install Python environment for your course. I made the project PyBox, which provides Python in a .zip (basically like AIDeveloper): https://github.com/maikherbig/PyBox
PyBox is easier to customize (compared to AIDeveloper).

DankMemeGuy commented:

While I really appreciate the GUI and the splitting code, I think this should be a feature of the software. Having a simple option where you load a class and then set the percentages for training, validation, and testing would be very useful.

I understand the 'garbage in, garbage out' concern, where you would want people to check their images before using them, but I think a tool like this is more about developing skills and a reasonable model, fast. There are also many free and easy ways to collect pretty good data. Sure, is there going to be some garbage in a dataset? Yeah, probably. But if you have a class with 10,000 images and 90% of them are very high quality, that's good enough accuracy for a model made from a GUI. No one will be making a model in this tool and using it at Google or Facebook, right? This is just for developing understanding, hobby models, etc. Having a model that is 70% good isn't bad at all!

Plus, the problem with using an external script for this is that it just makes life harder than it has to be. The model should be trained continually so it gets better and better, and if it's bad, you retrain it or start going through the dataset, etc. I think having the option to split the dataset inside the software would help people on that journey.

maikherbig added a commit that referenced this issue Jan 4, 2023

maikherbig commented Jan 4, 2023

@DankMemeGuy thanks for your suggestions. I have implemented a (kind of) quick solution. You can now find a new checkbox 'Validation split (%)'. You can change that fraction on the fly during the training process.
Link to the new update https://github.com/maikherbig/AIDeveloper/releases/tag/0.4.7-update

DankMemeGuy commented:

Thank you very much!!
