Implementation for conditional feature distributions #193

Closed
dzeber opened this issue Sep 10, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@dzeber
Collaborator

dzeber commented Sep 10, 2020

Implement the computation for conditional feature distributions.

@AYYYang
Collaborator

AYYYang commented Oct 25, 2020

@dzeber @alberginia

I think I understand most of the specification, but I have a question regarding the input.
Two of the allowed inputs are predicted scores or a function of the features.
I am assuming that "predicted score" means the y column in the test dataset. I am not sure what the output should be. More specifically, if we are plotting a histogram, what should be on the x-axis?

As for "function of the features", I am not sure what this means.

@dzeber
Collaborator Author

dzeber commented Oct 29, 2020

Sorry for the delay in getting you feedback on this!

The goal would be, given a column of data, to create a separate histogram for the subset of rows that fall into each cell of the confusion matrix. For example, with two classes there would be 4 histograms. To determine this grouping, you can use the column of true labels (e.g. y_true) and the column of predicted labels (e.g. y_pred) to get the groups (e.g. y_true == 1 and y_true == y_pred, y_true == 1 and y_true != y_pred, y_true == 2 and y_true == y_pred, y_true == 2 and y_true != y_pred). Note that we don't need to compute the counts as in the actual confusion matrix; we are just labelling each data row according to which cell it would be counted in, so that we can group on this label.
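
To make the grouping concrete, here is a minimal sketch of the row-labelling step. The function name `confusion_cell_labels` and the label format are assumptions made for illustration, not part of the existing code. It labels each row with its (true, predicted) pair, which for two classes gives the same four groups as listed above.

```python
import pandas as pd


def confusion_cell_labels(y_true, y_pred):
    """Label each row with the confusion-matrix cell it would be counted in.

    Hypothetical helper: the name and label format are illustrative only.
    """
    # Rebuild both inputs with a fresh 0..n-1 index so lists and Series align.
    y_true = pd.Series(list(y_true))
    y_pred = pd.Series(list(y_pred))
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # One label per row, e.g. "true=1, pred=2"; no counts are computed here.
    return "true=" + y_true.astype(str) + ", pred=" + y_pred.astype(str)


# Toy data, invented for this example.
y_true = [1, 1, 2, 2, 1, 2]
y_pred = [1, 2, 2, 1, 1, 2]
feature = pd.Series([0.2, 0.9, 0.4, 0.7, 0.1, 0.5])

cells = confusion_cell_labels(y_true, y_pred)
# Group the column of interest on the cell label.
groups = {cell: values.tolist() for cell, values in feature.groupby(cells)}
```

Note that this sketch assumes the column being grouped has a default 0..n-1 index, so that it lines up with the labels row by row.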

Re your question about the input, this "column of data" could in principle be any feature in the original dataset or any computed column, e.g. the average of two features or the scores predicted by the model. The code should be the same either way: it should work so long as the input is a pandas Series or a list of the same length as the dataset. So you wouldn't need to do anything special to support these different cases; they should "just work" when testing in a notebook.

Re the output, it would always be a histogram, so the output would have bins over the range of column values on the x-axis and (relative) frequency counts on the y-axis. (In the case of categorical data, it will be a barplot instead, but it follows the same idea).
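
As a rough illustration of what the output data could look like before any plotting, the sketch below computes relative frequencies per group. The function name `grouped_frequencies`, its parameters, and the choice to share bin edges across groups are all assumptions for this example. Numeric columns get binned counts; anything else gets per-category frequencies, which would feed a barplot.

```python
import numpy as np
import pandas as pd


def grouped_frequencies(values, cells, bins=10):
    """Relative frequencies of `values` within each group given by `cells`.

    Hypothetical helper. Accepts a list or a pandas Series for either input.
    """
    values = pd.Series(list(values))
    cells = pd.Series(list(cells))
    if len(values) != len(cells):
        raise ValueError("values and cells must have the same length")

    result = {}
    if pd.api.types.is_numeric_dtype(values):
        # Bin edges span the range of the whole column, so groups are comparable.
        edges = np.histogram_bin_edges(values.dropna(), bins=bins)
        for cell, group in values.groupby(cells):
            counts, _ = np.histogram(group.dropna(), bins=edges)
            total = counts.sum()
            rel = counts / total if total else counts.astype(float)
            # Indexed by the left edge of each bin.
            result[cell] = pd.Series(rel, index=edges[:-1])
    else:
        # Categorical case: one bar per category instead of one per bin.
        categories = sorted(values.dropna().unique())
        for cell, group in values.groupby(cells):
            freq = group.value_counts(normalize=True)
            result[cell] = freq.reindex(categories, fill_value=0.0)
    return result


# Toy inputs, invented for this example.
numeric = grouped_frequencies([0.0, 1.0, 2.0, 3.0], ["a", "a", "b", "b"], bins=2)
categorical = grouped_frequencies(["x", "y", "x", "x"], ["a", "a", "b", "b"])
```

Here the x-axis of each histogram would be the bin edges (or the category names), and the y-axis the relative frequencies in the returned Series.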

I think the outstanding work that needs to be done for this issue is:

  • copy the existing code into the presc package dir
  • implement splitting according to the confusion matrix cells as described above
  • maybe refactor the existing code to have a separate function that generates a single histogram, and call that inside the for loop in histograms() to avoid code duplication
  • add a couple of tests
  • support for categorical variables as well (barplot rather than histogram). Looks like this may already be handled.
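
One possible shape for the refactoring mentioned in the list above is sketched here, assuming matplotlib is used for the graphics. The signatures of `plot_single_histogram` and `histograms` below are guesses for illustration and do not reflect the existing code. A single function draws one panel, and the loop over confusion-matrix cells just calls it.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_single_histogram(ax, values, title, bins=10):
    """Draw one histogram (numeric) or barplot (categorical) on the given axes."""
    values = pd.Series(list(values)).dropna()
    ax.set_title(title)
    if len(values) == 0:
        return ax  # nothing to draw for an empty group
    if pd.api.types.is_numeric_dtype(values):
        # Equal weights summing to 1 turn counts into relative frequencies.
        weights = np.full(len(values), 1.0 / len(values))
        ax.hist(values.to_numpy(), bins=bins, weights=weights)
    else:
        freq = values.value_counts(normalize=True).sort_index()
        ax.bar(list(freq.index.astype(str)), freq.to_numpy())
    return ax


def histograms(column, y_true, y_pred, bins=10):
    """One panel per (true, predicted) pair that occurs in the data."""
    data = pd.DataFrame(
        {"value": list(column), "true": list(y_true), "pred": list(y_pred)}
    )
    grouped = list(data.groupby(["true", "pred"]))
    fig, axes = plt.subplots(1, len(grouped), squeeze=False)
    for ax, ((true_label, pred_label), group) in zip(axes[0], grouped):
        title = f"true={true_label}, pred={pred_label}"
        plot_single_histogram(ax, group["value"], title, bins=bins)
    return fig


# Toy data, invented for this example.
fig = histograms(
    column=[0.2, 0.9, 0.4, 0.7, 0.1, 0.5],
    y_true=[1, 1, 2, 2, 1, 2],
    y_pred=[1, 2, 2, 1, 1, 2],
)
titles = [ax.get_title() for ax in fig.axes]
```

In this sketch each panel picks its own bins from its own subset; passing shared bin edges instead would make the panels directly comparable. A test along the lines of "two classes give four panels" could check `titles` as above.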

Don't worry about tweaking the code that controls the graphics at this point, as that may change when we integrate into a common report. This is mainly about making the functionality available.

Hope this helps! Let me know if you have further questions!


3 participants