Implementation for conditional feature distributions #193

dzeber · 2020-09-10T21:10:55Z

Implement the computation for conditional feature distributions.

alberginia · 2020-09-18T17:35:24Z

I implemented this one for my Outreachy application:
https://github.com/mozilla/PRESC-Outreachy-archive/blob/master/dev/alberginia/alberginia_issue7_3datasets.ipynb

AYYYang · 2020-10-25T02:26:11Z

@dzeber @alberginia

I think I understand most of the specification, but I have a question about the input.
Two of the allowed inputs are predicted scores or a function of the features.
I am assuming that the predicted score means the y column in the test dataset. I am not sure what the output should be. More specifically, if we are plotting a histogram, what should be on the x-axis?

As for a function of the features, I am not sure what this means.

dzeber · 2020-10-29T01:36:14Z

Sorry for the delay in getting you feedback on this!

The goal would be, given a column of data, to create a separate histogram for the subset of rows that fall into each cell of the confusion matrix. For example, with two classes there would be 4 histograms. To determine this grouping, you can use the column of true labels (eg. y_true) and the column of predicted labels (eg. y_pred) to get the groups (eg. y_true == 1 and y_true == y_pred, y_true == 1 and y_true != y_pred, y_true == 2 and y_true == y_pred, y_true == 2 and y_true != y_pred). Note that we don't need to compute the counts as in the actual confusion matrix; we are just labelling each data row according to which cell it would be counted in, in order to group on this label.
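As a rough illustration of the labelling step described above, here is a minimal sketch using pandas. The helper name `confusion_cell_labels` and the label format `"<true>_predicted_as_<pred>"` are made up for this example and are not taken from the PRESC code.

```python
import pandas as pd


def confusion_cell_labels(y_true, y_pred):
    """Label each row with the confusion-matrix cell it would be counted in.

    Hypothetical helper for illustration only; not part of the PRESC package.
    """
    y_true = pd.Series(list(y_true), name="y_true")
    y_pred = pd.Series(list(y_pred), name="y_pred")
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # One string label per row, e.g. "1_predicted_as_2". No counts are computed.
    return y_true.astype(str) + "_predicted_as_" + y_pred.astype(str)


# Toy data with two classes, so up to 4 cells.
y_true = [1, 1, 2, 2, 1, 2]
y_pred = [1, 2, 2, 1, 1, 2]
feature = pd.Series([0.2, 0.9, 0.7, 0.1, 0.3, 0.8])

cells = confusion_cell_labels(y_true, y_pred)
# Split the column of interest into one subset per confusion-matrix cell.
groups = {cell: values.tolist() for cell, values in feature.groupby(cells)}
```

Each entry of `groups` would then feed one histogram. Labelling by the (true, predicted) pair rather than by "correct / incorrect" generalizes to more than two classes without changes.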

Re your question about the input, this "column of data" could in principle be any feature in the original dataset or any computed column, eg. the average of two features or the scores predicted by the model. The code should be the same either way: it should work so long as the input is a pandas Series or a list of the same length as the dataset. So you wouldn't need to do anything special to support these different cases, but they should "just work" when testing in a notebook.
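To make the "any column of the right length" point concrete, here is a small sketch. The helper `as_column` and the column names `f1` and `f2` are assumptions for illustration; the only requirement taken from the comment above is that the input is a pandas Series or list with one value per dataset row.

```python
import pandas as pd


def as_column(values, n_rows):
    """Coerce a Series or list to a Series with a fresh 0..n-1 index.

    Hypothetical helper: raises if the length does not match the dataset.
    """
    column = pd.Series(values).reset_index(drop=True)
    if len(column) != n_rows:
        raise ValueError(f"expected {n_rows} values, got {len(column)}")
    return column


# Toy dataset with two made-up feature columns.
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [3.0, 4.0, 5.0]})

raw_feature = as_column(df["f1"], len(df))               # an original feature
computed = as_column((df["f1"] + df["f2"]) / 2, len(df))  # average of two features
plain_list = as_column([0.1, 0.6, 0.9], len(df))          # e.g. predicted scores as a list
```

All three results have the same type and length, so downstream grouping and plotting code does not need to know where the column came from.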

Re the output, it would always be a histogram, so the output would have bins over the range of column values on the x-axis and (relative) frequency counts on the y-axis. (In the case of categorical data, it will be a barplot instead, but it follows the same idea).
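The numbers behind such a plot could be computed along these lines. This is only a sketch under assumptions: the function name `relative_frequencies` is invented here, numeric data is binned with `numpy.histogram`, and anything non-numeric is treated as categorical and summarized with `value_counts`.

```python
import numpy as np
import pandas as pd


def relative_frequencies(values, bins=10):
    """Return (x, y) for a histogram (numeric) or barplot (categorical).

    Hypothetical helper. For numeric data x holds the bin edges, so it has
    one more element than y; for categorical data x holds the categories.
    """
    values = pd.Series(values)
    if pd.api.types.is_numeric_dtype(values):
        counts, edges = np.histogram(values, bins=bins)
        total = counts.sum()
        # Convert raw counts to relative frequencies, guarding against empty input.
        freqs = counts / total if total else counts.astype(float)
        return edges, freqs
    shares = values.value_counts(normalize=True).sort_index()
    return shares.index.to_numpy(), shares.to_numpy()


# Numeric column: two equal-width bins over [0, 1].
edges, freqs = relative_frequencies([0.1, 0.2, 0.6, 0.9], bins=np.linspace(0, 1, 3))

# Categorical column: one bar per category.
cats, shares = relative_frequencies(["a", "b", "a", "a"])
```

When comparing histograms across confusion-matrix cells, it may help to compute the bin edges once from the full column and pass them as `bins` for every subset, so that all plots share the same x-axis.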

I think the outstanding work that needs to be done for this issue is:

- copy the existing code into the presc package dir
- implement splitting according to the confusion matrix cells as described above
- maybe refactor the existing code to have a separate function that generates a single histogram, and call that inside the for loop in histograms() to avoid code duplication
- add a couple of tests
- support for categorical variables as well (barplot rather than histogram). Looks like this may already be handled.

Don't worry about tweaking the code that controls the graphics at this point, as that may change when we integrate into a common report. This is mainly about making the functionality available.

Hope this helps! Let me know if you have further questions!

dzeber added the enhancement (New feature or request) label Sep 10, 2020

This was referenced Sep 10, 2020

Prototyping for conditional feature distributions #194

Closed

Investigate using conditional feature distributions in an automated setting #195

Closed

dzeber added this to the Starting points milestone Sep 11, 2020

dzeber mentioned this issue Oct 29, 2020

Implementation of conditional feature distribution #227

Merged

dzeber closed this as completed Feb 22, 2021
