
Increase training size for feature level classifier #45

Closed
1 task
bkowshik opened this issue May 30, 2017 · 3 comments
@bkowshik
Contributor

Ref #43


  • We currently use 5,269 changesets for training our feature level classifier.
  • From changesets reviewed on osmcha with one feature modification, it looks like we can potentially add up to 4,000 changesets.
  • This increase in the number of training samples should in turn improve the model.

Next actions

  • Update dataset with the additional 4,000 changesets - @bkowshik

cc: @batpad @geohacker

@bkowshik bkowshik added this to the version-0.5 milestone May 31, 2017
@bkowshik
Contributor Author

Curious about the effect that training size has on the model's metrics, we have the following:

[Plot: validation metrics vs. number of training samples]

Notes / Questions

  • The metrics, although showing diminishing returns, still have a significant positive slope.
  • If roc_auc score is 0.8 with 6,000 samples, what would it look like with 10,000 samples?
  • When do we know that we have enough samples?

cc: @anandthakker
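For questions like the two above, scikit-learn ships a helper that does exactly this kind of experiment. A minimal sketch, assuming a scikit-learn style classifier; the synthetic dataset here is a stand-in for the real changeset features, not the project's actual data:

```python
# Sketch: validation ROC AUC as a function of training size, using
# scikit-learn's learning_curve helper. make_classification stands
# in for the real labelled changeset features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=6_000, random_state=0)

# Evaluate the model at 5 increasing training sizes, 3-fold CV each.
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="roc_auc", cv=3)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(score, 3))
```

If the validation curve has flattened by the largest training size, more samples are unlikely to help much; if it is still climbing, the 10,000-sample question is worth answering empirically.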

@bkowshik
Contributor Author

Workflow

  1. Set number of samples to use for the current run
  2. Use only this subset of samples from the labelled training data
  3. Train a model on this subset of training data
  4. Get predictions from model for the entire validation dataset
  5. Extract metrics on validation dataset
  6. Increase number of samples to use for the next run and go again
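The steps above can be sketched as a loop; this is a hand-rolled illustration, not the project's actual script, and the classifier, data, and variable names (X_train, y_val, etc.) are placeholder assumptions:

```python
# Sketch of the 6-step workflow: train on growing subsets of the
# labelled data, score the full validation set each time.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the labelled changeset features.
X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

for n_samples in range(1_000, len(X_train) + 1, 1_000):
    # Steps 1-2: take only a subset of the labelled training data.
    X_sub, y_sub = X_train[:n_samples], y_train[:n_samples]
    # Step 3: train a model on this subset.
    model = RandomForestClassifier(random_state=0).fit(X_sub, y_sub)
    # Steps 4-5: predict the entire validation set, extract a metric.
    scores = model.predict_proba(X_val)[:, 1]
    print(n_samples, round(roc_auc_score(y_val, scores), 3))
    # Step 6: the loop increments n_samples and goes again.
```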

@bkowshik
Contributor Author

Before, we had 8,620 labelled samples, of which 6,036 were used for training and 2,584 for validation. With the backfill done, we now have 10,165, of which we use 7,115 for training and 3,050 for validation.

  • In total we added 1,545 new changesets to the labelled dump. 🎉
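The split sizes quoted above can be sanity-checked with a few lines; the 70/30 train/validation ratio is an assumption read off from the numbers, not something stated in the project config:

```python
# Quick arithmetic check of the labelled-data split before and
# after the backfill.
before_total, before_train, before_val = 8_620, 6_036, 2_584
after_total, after_train, after_val = 10_165, 7_115, 3_050

# Each split should account for every labelled sample.
assert before_train + before_val == before_total
assert after_train + after_val == after_total

print(after_total - before_total)           # new changesets: 1545
print(round(after_train / after_total, 2))  # train fraction: 0.7
```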

Interestingly, the nice upward graph has now become something like the one below. I don't understand why this is happening, though.

[Plot: validation metrics vs. number of training samples, after the backfill]

We are 💯 to close here.
