
Effect of attributes on the feature level classifier #59

Closed
bkowshik opened this issue Jun 14, 2017 · 2 comments

Comments

@bkowshik (Contributor)

Similar to the work on training size, we have questions about the effect of the number of attributes on the model:

  • Does the model have enough attributes?
  • How much does each attribute contribute to the model metrics?
  • Could fewer attributes be better in the long term?

Workflow

  • Get a list of all attributes available for training
  • Grow the set of training attributes by appending one attribute at a time from this list
  • Train a model on the training dataset with the current subset of attributes
  • Get predictions from the model on the same subset of attributes from the validation dataset
  • Store the model metrics on the validation dataset and plot them
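The workflow above can be sketched as follows. This is a minimal illustration with synthetic stand-in data; the real experiment reads the training and validation CSVs, and the metric shown (F1) is an assumption about which metric was plotted.

```python
# Sketch of the attribute-increment workflow: append one attribute
# (column) at a time, retrain, and record a validation metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the real training/validation datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

scores = []
for k in range(1, X.shape[1] + 1):
    # Train on the first k attributes only.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, :k], y_train)
    # Evaluate on the same k attributes of the validation set.
    preds = model.predict(X_val[:, :k])
    scores.append(f1_score(y_val, preds))

# scores[k - 1] is the validation F1 with the first k attributes;
# plotting it gives the metric-vs-attribute-count curve.
```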

Notes

[Plot: model metrics vs. number of attributes]

  • There are interesting dips in the metrics when the following attributes are added:
    • user_changesets_with_discussions_count
    • old_user_name_special_characters_count
    • feature_version
    • feature_has_website_old
    • iD
    • Vespucci
  • Apart from the occasional dips, the metrics roughly reach their maximum around the 20-attribute mark
  • I am not sure what else to read from this graph.

cc: @anandthakker @batpad @geohacker

@bkowshik (Contributor, Author)

What would it look like if attributes were added in order of importance for prediction, instead of in the order they appear in the CSV dataset?

A fitted GradientBoostingClassifier exposes an attribute, model.feature_importances_, that gives a score for each feature: the higher the score, the more important the feature is for predictions.
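Reading and ranking these scores looks roughly like the sketch below, again on synthetic stand-in data (the real attribute names come from the dataset's CSV columns):

```python
# Minimal sketch: fit on all attributes, then rank columns by
# feature_importances_ (an array with one score per column).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=12, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Scores are non-negative and normalized to sum to 1.
importances = model.feature_importances_
ranked = np.argsort(importances)[::-1]  # column indices, most important first
top10 = ranked[:10]                     # the 10 highest-importance attributes
```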

[Screenshot: table of the 10 attributes with the highest importance scores]

Now, using the same workflow as above, we again add one attribute at a time, but starting with the most important attributes, to get the graph below.
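The importance-ordered run differs from the first only in the column order. A hedged sketch, again with synthetic stand-in data and F1 as the assumed metric:

```python
# Rerun the one-attribute-at-a-time loop, with columns reordered by
# descending importance from a model fit on all attributes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# Rank attributes using a model trained on the full attribute set.
full = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
order = np.argsort(full.feature_importances_)[::-1]  # best attributes first

scores = []
for k in range(1, len(order) + 1):
    cols = order[:k]  # the k most important attributes
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, cols], y_train)
    scores.append(f1_score(y_val, model.predict(X_val[:, cols])))
```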

[Plot: model metrics vs. number of attributes, importance-ordered]

  • Because we have the best attributes first, the metrics reach their maximum value very quickly. This is what we expect to happen.
  • We still get unusually large dips even when we are well past 50 attributes
  • The dips now occur at the following attributes:
    • feature_name_translations_count_old
    • place
    • MAPS.ME
    • feature_area
    • sport_old
    • office
    • power
    • railway_old
    • barrier_old
    • railway
    • historic
    • changeset_comment_naughty_words_count
    • public_transport_old
    • route
  • No attribute is common between this list and the earlier one

@bkowshik (Contributor, Author)

After increasing the dataset size, we still see the unusual dips. 🤔

[Plot: model metrics vs. number of attributes, with the larger dataset]
