
Create a Holdout Set #38

Closed
HarshCasper opened this issue Aug 18, 2020 · 11 comments
Comments

@HarshCasper
Member

Type

Feature

Description

While training, validating, and optimizing our model, we could over time start to overfit to the validation data without realizing it. This means the model will perform well on the validation data but poorly on unseen data.

Create a holdout set containing 200 images. We will keep this holdout set aside and only use it at the end to check how the final model performs on unseen data. Keep the code in a Scripts/ directory for future use.
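The task above could be sketched as a small script for the Scripts/ directory. This is a minimal illustration, not the actual project code; the flat `dataset/` layout, the `holdout/` output directory, and the `create_holdout` name are assumptions for the example:

```python
import random
import shutil
from pathlib import Path


def create_holdout(dataset_dir, holdout_dir, n=200, seed=42):
    """Move n randomly chosen images from dataset_dir into holdout_dir.

    A fixed seed keeps the holdout selection reproducible.
    Returns the list of moved image paths.
    """
    src = Path(dataset_dir)
    dst = Path(holdout_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Sort first so the random choice is deterministic for a given seed.
    images = sorted(
        p for p in src.iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
    )
    random.seed(seed)
    chosen = random.sample(images, n)
    for img in chosen:
        shutil.move(str(img), str(dst / img.name))
    return chosen
```

Moving (rather than copying) the files ensures the holdout images can never leak into training.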

Tools

  • Python

Have you read the Contributing Guidelines on Pull Requests?

Yes

@aryanVijaywargia
Contributor

I would like to work on this @HarshCasper

@BALaka-18
Collaborator

@aryanVijaywargia Assigned

@macabdul9
Collaborator

macabdul9 commented Aug 18, 2020

The holdout set will be drawn from the same distribution, so its performance will be about the same as the validation set's. The main problem with machine learning models is that they do not perform well on out-of-distribution data. For the demo at the client end, we can take a few samples (e.g. 10) from the main directory and then split the remaining data into train and val/test sets (so that the model doesn't get trained on the demo data). @aryanVijaywargia @HarshCasper
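The suggestion above (reserve a few demo samples first, then split what remains) could look roughly like this. A sketch only; the helper name, the fraction parameters, and the returned dict keys are illustrative, not project code:

```python
import random


def reserve_demo_then_split(files, n_demo=10, val_frac=0.1, test_frac=0.1, seed=0):
    """Set aside n_demo samples for the client demo, then split the rest.

    Shuffles once with a seeded RNG so the demo samples are removed
    before any train/val/test assignment happens.
    """
    rng = random.Random(seed)
    files = list(files)
    rng.shuffle(files)
    demo, rest = files[:n_demo], files[n_demo:]
    n_val = int(len(rest) * val_frac)
    n_test = int(len(rest) * test_frac)
    val = rest[:n_val]
    test = rest[n_val:n_val + n_test]
    train = rest[n_val + n_test:]
    return {"demo": demo, "train": train, "val": val, "test": test}
```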

@HarshCasper
Member Author

I guess we can create a separate issue for that @macabdul9

@BALaka-18
Collaborator

> The holdout set will be drawn from the same distribution, so its performance will be about the same as the validation set's. The main problem with machine learning models is that they do not perform well on out-of-distribution data. For the demo at the client end, we can take a few samples (e.g. 10) from the main directory and then split the remaining data into train and val/test sets (so that the model doesn't get trained on the demo data). @aryanVijaywargia @HarshCasper

@macabdul9 open a new issue for this. You'll be assigned to work on it.

@aryanVijaywargia
Contributor

I have a query. I have written a Python script that randomly samples 100 images from each class and moves them to the holdout_dataset directory. Should my PR contain both the holdout_dataset directory (with the images) and the code, or will the code alone suffice? @BALaka-18
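The script described above might look roughly like this, assuming a `dataset/<class>/<image>` directory layout that is mirrored under `holdout_dataset/`. All names here are illustrative; this is not the code from the actual PR:

```python
import random
import shutil
from pathlib import Path


def build_holdout(dataset_dir="dataset", holdout_dir="holdout_dataset",
                  per_class=100, seed=42):
    """Move per_class random images from each class folder into a
    mirrored holdout tree, preserving class subdirectories."""
    rng = random.Random(seed)
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        out = Path(holdout_dir) / class_dir.name
        out.mkdir(parents=True, exist_ok=True)
        # Cap at the class size so small classes don't raise ValueError.
        for img in rng.sample(images, min(per_class, len(images))):
            shutil.move(str(img), str(out / img.name))
```

Sampling per class (stratified) rather than from the pool as a whole keeps the holdout set's class balance close to the full dataset's.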

@BALaka-18
Collaborator

> I have a query. I have written a Python script that randomly samples 100 images from each class and moves them to the holdout_dataset directory. Should my PR contain both the holdout_dataset directory (with the images) and the code, or will the code alone suffice? @BALaka-18

@aryanVijaywargia Both. The sample you created can be used for initial testing, or as an example when we document our model.

@aryanVijaywargia
Contributor

Thanks for clarifying @BALaka-18

@macabdul9
Collaborator

> The holdout set will be drawn from the same distribution, so its performance will be about the same as the validation set's. The main problem with machine learning models is that they do not perform well on out-of-distribution data. For the demo at the client end, we can take a few samples (e.g. 10) from the main directory and then split the remaining data into train and val/test sets (so that the model doesn't get trained on the demo data). @aryanVijaywargia @HarshCasper
>
> @macabdul9 open a new issue for this. You'll be assigned to work on it.

I think mentors cannot contribute.

@BALaka-18
Collaborator

> The holdout set will be drawn from the same distribution, so its performance will be about the same as the validation set's. The main problem with machine learning models is that they do not perform well on out-of-distribution data. For the demo at the client end, we can take a few samples (e.g. 10) from the main directory and then split the remaining data into train and val/test sets (so that the model doesn't get trained on the demo data). @aryanVijaywargia @HarshCasper
>
> @macabdul9 open a new issue for this. You'll be assigned to work on it.
>
> I think mentors cannot contribute.

@macabdul9 I'm sorry, I forgot. Open an issue then, and participants will be assigned.

@rutujadhanawade
Contributor

Is this issue open?


5 participants