-
-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smartimputer #213
Smartimputer #213
Conversation
added the conditional imputer (for our benchmark algorithms), setup for function that discriminates between feature types
added testcase
…lass '3', which can now be hanndled)
bugfix conditional imputer
adapted test accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think that the improved imputer should live in the benchmark study repository.
openml/datasets/data_feature.py
Outdated
---------- | ||
index : int | ||
The index of this feature | ||
name : string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str
instead of string
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
openml/datasets/data_feature.py
Outdated
The index of this feature | ||
name : string | ||
Name of the feature | ||
data_type : string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
openml/datasets/data_feature.py
Outdated
LEGAL_DATA_TYPES = ['nominal', 'numeric', 'string', 'date'] | ||
|
||
def __init__(self, index, name, data_type, nominal_values, number_missing_values): | ||
assert type(index) is int, "Index is of wrong datatype" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use if
statements here, assert
statements can be turned off by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
openml/datasets/dataset.py
Outdated
if isinstance(ignore_attribute, str): | ||
self.ignore_attributes = [ignore_attribute] | ||
elif isinstance(ignore_attribute, list): | ||
self.ignore_attributes = ignore_attribute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be an else
to make sure that we don't introduce any weird bugs here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, added a value error
openml/datasets/dataset.py
Outdated
xmlfeature['oml:data_type'], | ||
None, #todo add nominal values (currently not in database) | ||
int(xmlfeature['oml:number_of_missing_values'])) | ||
assert idx == feature.index, "Data features not provided in right order" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an if + exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
the label that was predicted | ||
predicted_probabilities : array (size=num_classes) | ||
probabilities per class | ||
class_labels : array (size=num_classes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
model_classes_mapping
is not in the docstring; it's hard to tell what this function does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
openml/runs/functions.py
Outdated
model_classes = model.best_estimator_.classes_ | ||
else: | ||
model_classes = model.classes_ | ||
except AttributeError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this, we shouldn't catch anything here I think. Especially since attribute regressors can be able to work on classification tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It all depends on whether we want openml-python to upload runs with client errors or block them all.
In the prior case, we should catch. In the other case, we should not.
@@ -153,6 +153,23 @@ def test_publish_flow(self): | |||
flow.publish() | |||
self.assertIsInstance(flow.flow_id, int) | |||
|
|||
def test_semi_legal_flow(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly is tested here? OpenML should reject this flow, because it contains the bagging classifier twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that might be the case, it contains two distinguishable forms of bagging.
- Bagging(Bagging(J48))
- Bagging(J48)
Therefore, OpenML will be able to set the parameters of the individual components correct at any run, and there is no problem
|
||
flow.publish() | ||
|
||
def test_illegal_flow(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a docstring why exactly this is illegal? Someone not too familiar with OpenML might not know about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
rep_no = 0 | ||
# TODO use different iterator to only provide a single iterator (less | ||
# methods, less maintenance, less confusion) | ||
for rep in task.iterate_repeats(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this code is exactly doing here. In the end, you only want to test _prediction_to_row, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I wanted to test this in the context of _run_task_get_arffcontent without publishing the run. Adjusted.
Added support for data features
Added utils.preprocessor.ConditionalImputer
Much more ...