Smartimputer #213

Merged
merged 35 commits into from
Mar 27, 2017

Conversation

janvanrijn
Member

Added support for data features
Added utils.preprocessor.ConditionalImputer
Much more ...

Collaborator

@mfeurer left a comment

Also, I think that the improved imputer should live in the benchmark study repository.

----------
index : int
The index of this feature
name : string
Collaborator

str instead of string.

Member Author

Agreed

The index of this feature
name : string
Name of the feature
data_type : string
Collaborator

same here.

Member Author

Agreed

LEGAL_DATA_TYPES = ['nominal', 'numeric', 'string', 'date']

def __init__(self, index, name, data_type, nominal_values, number_missing_values):
    assert type(index) is int, "Index is of wrong datatype"
Collaborator

You should use if statements here; assert statements can be turned off by the user.

Member Author

Agreed
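
The two points agreed on so far (`str` rather than `string` in the docstring, and explicit checks rather than `assert`) could look roughly like the sketch below. It is only an illustration, not the code that was merged: the class name `DataFeature`, the `data_type` check, and the choice of `TypeError` and `ValueError` are assumptions; only `LEGAL_DATA_TYPES` and the constructor signature come from the diff. Unlike `assert`, these checks still run when Python is started with `-O`.

```python
class DataFeature(object):
    """Hypothetical container for one dataset feature (class name is assumed).

    Parameters
    ----------
    index : int
        The index of this feature
    name : str
        Name of the feature
    data_type : str
        One of LEGAL_DATA_TYPES
    """

    LEGAL_DATA_TYPES = ['nominal', 'numeric', 'string', 'date']

    def __init__(self, index, name, data_type, nominal_values,
                 number_missing_values):
        # Explicit checks instead of assert: they are not stripped by -O.
        if not isinstance(index, int):
            raise TypeError('Index is of wrong datatype')
        # Assumed extra check, not shown in the reviewed diff.
        if data_type not in self.LEGAL_DATA_TYPES:
            raise ValueError('data_type should be in %s; got %r'
                             % (self.LEGAL_DATA_TYPES, data_type))
        self.index = index
        self.name = name
        self.data_type = data_type
        self.nominal_values = nominal_values
        self.number_missing_values = number_missing_values
```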

if isinstance(ignore_attribute, str):
    self.ignore_attributes = [ignore_attribute]
elif isinstance(ignore_attribute, list):
    self.ignore_attributes = ignore_attribute
Collaborator

There should be an else to make sure that we don't introduce any weird bugs here.

Member Author

Agreed, added a ValueError.
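
A minimal sketch of the if/elif chain with the requested else branch follows. The helper name `normalize_ignore_attributes` and the error message are assumptions made for illustration; the merged code sets `self.ignore_attributes` inline, and how it treats `None` is not visible in this excerpt (the sketch simply rejects it).

```python
def normalize_ignore_attributes(ignore_attribute):
    """Hypothetical helper: return ignore_attribute as a list of names."""
    if isinstance(ignore_attribute, str):
        return [ignore_attribute]
    elif isinstance(ignore_attribute, list):
        return ignore_attribute
    else:
        # Fail loudly instead of silently leaving the attribute unset.
        raise ValueError('ignore_attribute should be a str or a list, got %r'
                         % type(ignore_attribute))
```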

xmlfeature['oml:data_type'],
None, #todo add nominal values (currently not in database)
int(xmlfeature['oml:number_of_missing_values']))
assert idx == feature.index, "Data features not provided in right order"
Collaborator

Should be an if + exception.

Member Author

Agreed
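
For the ordering check, an if plus exception might be sketched as below. The function `parse_features`, the `'oml:index'` key, and the tuple return value are assumptions for illustration; only `'oml:data_type'`, `'oml:number_of_missing_values'`, and the error message appear in the diff.

```python
def parse_features(xml_features):
    """Hypothetical parser; xml_features is a list of dicts."""
    features = []
    for idx, xmlfeature in enumerate(xml_features):
        index = int(xmlfeature['oml:index'])  # assumed key name
        # Raise instead of assert so the check survives python -O.
        if idx != index:
            raise ValueError('Data features not provided in right order')
        features.append((index,
                         xmlfeature['oml:data_type'],
                         int(xmlfeature['oml:number_of_missing_values'])))
    return features
```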

the label that was predicted
predicted_probabilities : array (size=num_classes)
probabilities per class
class_labels : array (size=num_classes)
Collaborator

model_classes_mapping is not in the docstring; it's hard to tell what this function does.

Member Author

Added.

        model_classes = model.best_estimator_.classes_
    else:
        model_classes = model.classes_
except AttributeError as e:
Collaborator

Thinking about this, I don't think we should catch anything here, especially since regressors (which lack the classes_ attribute) may be able to work on classification tasks.

Member Author

It all depends on whether we want openml-python to upload runs with client errors or block them all.
In the former case, we should catch. In the latter case, we should not.
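
To make the trade-off concrete, here is a sketch of the "do not catch" variant. The helper name `get_model_classes` and the `hasattr` test are assumptions, since the condition guarding the first branch is not visible in the excerpt. With no try/except, a model without `classes_` raises `AttributeError` and the run is blocked; wrapping the call in a try/except would correspond to the other option, uploading the run together with the client error.

```python
def get_model_classes(model):
    """Hypothetical helper: return the class labels a fitted model knows."""
    # Search wrappers expose the refitted model as best_estimator_
    # (assumed way of detecting them).
    if hasattr(model, 'best_estimator_'):
        return model.best_estimator_.classes_
    # Deliberately no try/except: a missing classes_ propagates
    # as AttributeError to the caller.
    return model.classes_
```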

@@ -153,6 +153,23 @@ def test_publish_flow(self):
        flow.publish()
        self.assertIsInstance(flow.flow_id, int)

    def test_semi_legal_flow(self):
Collaborator

What exactly is tested here? OpenML should reject this flow, because it contains the bagging classifier twice.

Member Author

While that might be the case, it contains two distinguishable forms of bagging.

  • Bagging(Bagging(J48))
  • Bagging(J48)

Therefore, OpenML will be able to set the parameters of the individual components correctly for any run, so there is no problem.
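
The argument can be illustrated with a toy model of nested components. This is not OpenML's flow representation or its server-side check; the `Component` class and `subflow_names` helper are hypothetical and only show that each level of the nesting has a distinct full name, which is what makes the components distinguishable.

```python
class Component(object):
    """Toy stand-in for a flow component (hypothetical, not the OpenML API)."""

    def __init__(self, name, child=None):
        self.name = name
        self.child = child

    def full_name(self):
        # Build a nested name such as Bagging(J48).
        if self.child is None:
            return self.name
        return '%s(%s)' % (self.name, self.child.full_name())


def subflow_names(component):
    """Collect the full name of the component and each nested subcomponent."""
    names = []
    while component is not None:
        names.append(component.full_name())
        component = component.child
    return names


flow = Component('Bagging', Component('Bagging', Component('J48')))
names = subflow_names(flow)
# Both bagging levels appear, but under different full names.
print(names)
```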


        flow.publish()

    def test_illegal_flow(self):
Collaborator

Could you please add a docstring explaining why exactly this is illegal? Someone not too familiar with OpenML might not know about this.

Member Author

Added.

rep_no = 0
# TODO use different iterator to only provide a single iterator (less
# methods, less maintenance, less confusion)
for rep in task.iterate_repeats():
Collaborator

I'm not sure what exactly this code is doing here. In the end, you only want to test _prediction_to_row, right?

Member Author

Actually, I wanted to test this in the context of _run_task_get_arffcontent without publishing the run. Adjusted.

@mfeurer merged commit b3262b6 into develop Mar 27, 2017
@mfeurer deleted the smartimputer branch March 27, 2017 15:11