
Effect of attributes on the feature level classifier #59

Closed
bkowshik opened this issue Jun 14, 2017 · 2 comments

Comments

@bkowshik (Contributor)

Similar to the work on training size, we have questions about the effect of the number of attributes on the model:

  • Does the model have enough attributes?
  • How much does each attribute contribute to the model metrics?
  • Could fewer attributes be better in the long term?

Workflow

  • Get a list of all attributes available for training
  • Grow the set of training attributes by appending one attribute at a time from this list
  • Train a model on the training dataset with the current subset of attributes
  • Get predictions from the model on the same subset of attributes from the validation dataset
  • Store the model metrics on the validation dataset and plot them
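The workflow above can be sketched as follows. This is a minimal illustration with synthetic stand-in data; the real experiment reads the training and validation CSVs, and the metric shown (F1) is an assumption about which metric was plotted.

```python
# Sketch of the attribute-increment workflow: append one attribute
# (column) at a time, retrain, and record a validation metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the real training/validation datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

scores = []
for k in range(1, X.shape[1] + 1):
    # Train on the first k attributes only.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, :k], y_train)
    # Evaluate on the same k attributes of the validation set.
    preds = model.predict(X_val[:, :k])
    scores.append(f1_score(y_val, preds))

# scores[k - 1] is the validation F1 with the first k attributes;
# plotting it gives the metric-vs-attribute-count curve.
```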

Notes

[Plot: model metrics vs. number of attributes]

  • There are interesting dips in the metrics when the following attributes are added:
    • user_changesets_with_discussions_count
    • old_user_name_special_characters_count
    • feature_version
    • feature_has_website_old
    • iD
    • Vespucci
  • Apart from the occasional dips, the metrics roughly reach their maximum around the 20-attribute mark
  • I am not sure what else to read from this graph.

cc: @anandthakker @batpad @geohacker

@bkowshik (Contributor, Author)

What would it look like if attributes were added in order of importance for prediction, instead of in the order they appear in the CSV dataset?

A fitted GradientBoostingClassifier exposes an attribute, model.feature_importances_, that gives a score for each feature: the higher the score, the more important the feature is for predictions.
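Reading and ranking these scores looks roughly like the sketch below, again on synthetic stand-in data (the real attribute names come from the dataset's CSV columns):

```python
# Minimal sketch: fit on all attributes, then rank columns by
# feature_importances_ (an array with one score per column).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=12, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Scores are non-negative and normalized to sum to 1.
importances = model.feature_importances_
ranked = np.argsort(importances)[::-1]  # column indices, most important first
top10 = ranked[:10]                     # the 10 highest-importance attributes
```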

[Screenshot: table of the 10 attributes with the highest importance scores]

Now, using the same workflow as above, we again add one attribute at a time, but starting with the most important attributes, to get the graph below.
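The importance-ordered run differs from the first only in the column order. A hedged sketch, again with synthetic stand-in data and F1 as the assumed metric:

```python
# Rerun the one-attribute-at-a-time loop, with columns reordered by
# descending importance from a model fit on all attributes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# Rank attributes using a model trained on the full attribute set.
full = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
order = np.argsort(full.feature_importances_)[::-1]  # best attributes first

scores = []
for k in range(1, len(order) + 1):
    cols = order[:k]  # the k most important attributes
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, cols], y_train)
    scores.append(f1_score(y_val, model.predict(X_val[:, cols])))
```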

[Plot: model metrics vs. number of attributes, importance-ordered]

  • Because we have the best attributes first, the metrics reach their maximum value very quickly. This is what we expect to happen.
  • We still get unusually large dips even when we are well past 50 attributes
  • The dips now occur at the following attributes:
    • feature_name_translations_count_old
    • place
    • MAPS.ME
    • feature_area
    • sport_old
    • office
    • power
    • railway_old
    • barrier_old
    • railway
    • historic
    • changeset_comment_naughty_words_count
    • public_transport_old
    • route
  • No attribute is common between this list and the earlier one

@bkowshik (Contributor, Author)

After increasing the dataset size, we still see the unusual dips. 🤔

[Plot: model metrics vs. number of attributes, with the larger dataset]
