
Racism and the load_boston dataset #91

Closed
koaning opened this issue Jan 28, 2021 · 10 comments

koaning commented Jan 28, 2021

Description

Your documentation lists a demo that uses the load_boston dataset to explain how to use the tool here. It also lists the variables in the dataset, so you can confirm the contents.

- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
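You can confirm this directly; here is a minimal sketch (note that recent scikit-learn versions have deprecated and then removed this loader, so it assumes an older version):

```python
# Inspect the Boston housing dataset description and feature names.
# Requires an older scikit-learn (the loader was removed in version 1.2).
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.feature_names)  # includes the 'B' column quoted above
print(boston.DESCR)          # full dataset description
```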

One of the variables used to predict a house price is skin color. That's incredibly problematic.

Given that this project is backed by a global consultancy firm, I'm especially worried. People will look at this example and copy it. Since documentation pages are typically seen as an authoritative source of truth, it's really dubious to use this dataset without even a mention of the controversy around its variables. The use of skin color to predict a house price is legitimately a bad practice, and the documentation currently makes no effort to acknowledge it. For a package that's all about causal inference, you'd expect it to acknowledge that you still need to understand the dataset that you feed in.

Note that this dataset is up for removal from scikit-learn because of the obvious controversy and it's also something that's been pointed out at many conferences. Here is one talk from me if you're interested.

qbphilip self-assigned this Jan 28, 2021

qbphilip (Contributor) commented:

Thank you for holding us accountable and for bringing this to our attention, so that we can rectify this mistake.

We suggest the following actions:

  1. We will replace the predictive tutorial exercise with the load_diabetes dataset
  2. We will engage industry experts to create a tutorial in our documentation focused on the application of causalnex on data that includes biases of the underlying population. We will reference this discussion in that section.

The latter action highlights our belief that we can draw better insights from (flawed) data when looking at it from a causal perspective. This way, I hope we can contribute to the fight for racial equity.

Thank you for sharing your talk on this, and do let us know if we can engage you for further feedback; we would also be glad to hear your thoughts on our proposed actions.

koaning (Author) commented Jan 28, 2021

I appreciate the immediate response but I feel obliged to respond in detail to this comment.

> The latter action highlights our belief that we can draw better insights from (flawed) data when looking at it from a causal perspective. This way, I hope we can contribute to the fight for racial equity.

Getting advice from experts is all well and good. But in this situation, is a tutorial on causalnex from an industry expert really what we need? From my end, this issue falls into the "understand the data before you model it" department, not the "we're using bad techniques" department. It's inappropriate not to understand the variables thrown into a model, even if the model is using causal inference under the hood.

I might urge one to be very careful here. We would not want to suggest that a library can replace the actual human intelligence involved in interpreting a dataset. The problem in the current tutorial isn't that the causal techniques can't be applied in a meaningful way; rather, the technique seems to have distracted from the actual problem in the dataset.

I appreciate work on fairness/ML tooling, but my genuine fear is that folks might use it as an excuse. It'd be a real shame if people treated the use of a technique as a checkbox to tick so that they no longer need to think about their data.

kcoleman80 commented:

The first instance of this was mentioned last year. Why did it take over a year to comment back on this? This is the three-fifths rule written into datasets that have been flagged as racist. McKinsey has a pretty bad reputation; carrying this over into data science is a poor choice.

qbphilip (Contributor) commented Feb 3, 2021

@koaning

Apologies for the late reply, I wanted to dedicate enough time to give you a proper response.

  1. The ambition is to make that tutorial a reference for how to use causal(nex) models to combat biases in data, that actually moves the needle. To make it clear, the intention is not to tick the box and move on. I have started discussions with colleagues working in the fairness field and this tutorial will be a joint effort with them. Q: Do you think there is a better approach (within scope of this package)?
  2. In my opinion a key difference of causalnex to other causal packages is in fact the ability to incorporate domain expertise in the modelling. You can do this via tabu_edges/parents/children (conditional independence) or a change of the networkx graph itself. So, from a design perspective I believe the package as a whole is aligned with your goal. However, I do think we should put more focus on that "human in the loop" component. Currently the tutorials are not explicit (enough) on the topic, see Learning the structure for example. The sklearn interface tutorial, which currently uses the Boston Housing dataset, provides an sklearn Regressor/Classifier API to the same functionality although framed as a supervised model. However, as you spotted, the human-in-the-loop component is not mentioned. We could improve this by at least linking to the more elaborate docs material. A bit further in the future I would like to streamline the docs, to explicitly demonstrate how to include domain expertise to address unobserved confounders, which includes systemic biases like racism. Q: Do you have a suggestion how to restructure the docs/tutorials to make the active role of the user clearer? Have you seen good examples in other packages?

For transparency: we also use the Boston housing dataset in the Distribution schema tutorial, albeit without the B column. We should also change that as an immediate second step.

Hello @kcoleman80 ,
We made a mistake in using the dataset when adding the mentioned tutorial two months ago. As outlined above, we are working on several steps to rectify this mistake.

kcoleman80 commented:

Thanks so much, I have some ideas:

  • Comment on the factor B = 1000(Bk - 0.63)^2 (Bk is the proportion of Blacks/African Americans per town according to the census in the 70's), and on how widely this information has been used on Kaggle and in sklearn; given how outdated it is, especially in this time of reckoning, it's time to retire it in favour of updated information from HUD.
  • Discuss the redlining that occurred in Boston in order to provide historical context.
  • Provide a section about ethics in AI, data collection, and the real impacts we must face (e.g. issues with facial recognition for non-white faces, COMPAS, Clearview AI, etc.).
There's a fantastic paper that revisits the 1978 data set 'Revisiting the Boston data set – Changing the units of observation affects estimated willingness to pay for clean air':
https://openjournals.wu.ac.at/region/paper_107/107.html

There's some pretty good open government data sets here:
https://www.huduser.gov/portal/pdrdatas_landing.html#dataset-title
https://open.canada.ca/en/open-data
Or Vaccine tracking during this pandemic:
https://mdl.library.utoronto.ca/covid-19/data
Thanks for responding, crazy times we're in. 2020 was a lesson in 'data is not just data', as the pandemic has split the inequality gap wide open. ❤️😭

koaning (Author) commented Feb 3, 2021

A buddy of mine came up with a quote a while ago:

"The learning is in the labels".

Every model suffers from this, even causal ones. If there's something not quite right about the labels, then the model is going to pick it up, even with fairness techniques in the pipeline. This is even more true when you put a model into production.

[image]

Suppose we have an application for resume scoring. How are we going to guarantee that the predictions are not going to reinforce the bias that is in the data? You typically only hear back from the candidates that you actually hire. But what about the candidates that you didn't consider? There might be some good hires in there.

Your data feed might suffer from very poor recall. So how might we prevent bad lessons from being learned? By changing the model? Or ... by changing the feedback mechanism?
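To make that concrete, here is a toy simulation of the selection effect (entirely made-up numbers, purely to illustrate the mechanism, not anyone's actual pipeline):

```python
# Toy illustration: performance labels are only observed for candidates the
# screening model selects, so the retraining data systematically misses
# everyone who was screened out.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

skill = rng.normal(size=n)                    # true (unobserved) quality
group = rng.integers(0, 2, size=n)            # a sensitive attribute
score = skill + 0.8 * group + rng.normal(scale=0.5, size=n)  # biased screening score

hired = score > np.quantile(score, 0.9)       # only the top 10% get hired

print("share of group 1 in the population:", group.mean())
print("share of group 1 among the hired  :", group[hired].mean())
print("mean skill of rejected candidates :", skill[~hired].mean())

# Good candidates in the rejected pool never generate feedback, so a model
# retrained only on the hired subset cannot correct the screening bias.
```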

You might achieve some good by fiddling around in the model space, but to me it seems far more likely that you're going to understand the actual problem if you take all of that effort and spend it on the data instead. In the recruitment case, maybe the best causal thing you can do is to hire some employees you normally might not. That way, there's at least something of a signal coming in that might challenge a model's learned bias.

A clear example of this "take the effect of the data seriously" phenomenon, somewhere early in the docs, would be grand. I don't have a nice example of a dataset that would be well suited for your project, though. I do believe that a lesson along these lines might be both appropriate and interesting to folks reading about your tool. You might even experiment with how far you can steer the model in the right direction, as long as it's not dismissed that the learning depends on the data.

kcoleman80 commented:

I feel like the data science field is getting exceedingly oversaturated, with so many courses available and a lack of professional oversight (versus, say, accounting, law, or engineering, where requirements for training and experience need to be met). If big, impactful decisions are going to be made on data, how do we keep it accountable? Do we get social scientists involved to review work done? Does the profession get certified and regulated? Context, at the very least, needs to be included in datasets.

qbphilip (Contributor) commented Feb 3, 2021

Thanks everyone for your input! I will give it a read tomorrow when my mind is fresh. Hopefully I find time and can start working on the notebook update as well.

tsanikgr pushed a commit that referenced this issue Mar 11, 2021
* replace dataset
* add fairness evaluation to all datasets used in that notebook
qbphilip (Contributor) commented:

Hello all,

Thank you again for raising awareness about the Boston housing dataset.

The updated tutorial notebook has been merged into develop, and we will get a minor release out soon, which will update the docs.
We collaborated with our Fairness R&D team and QB's Diversity & Inclusion initiative and hopefully arrived at a satisfactory result.

What we have done:

  1. Replaced Boston housing with the Breast cancer dataset (also sklearn).
  2. Added a fairness evaluation (see below) to all datasets in the notebook (this is not yet in the other notebooks).
  3. Started working on a tutorial about causalnex for counterfactual fairness on the Adult dataset; this is ongoing.

For example, what we added for the Diabetes dataset is:

Dataset bias evaluation:

As we're dealing with data about individuals, and the predictions of the model may have a profound impact on their lives, we should evaluate the dataset for the potential presence of bias.

  1. Sample bias/Data collection:
    • The papers do not explain the protocol by which the samples were generated, so unintended biases that may have been introduced cannot be detected.
    • Data biases could result from inequalities in access to healthcare, e.g. due to insurance coverage or limited access to diabetes screenings in underdeveloped regions or neighbourhoods. Undocumented sensitive variables, e.g. ethnicity, would need to be statistically independent of the choice to be included in the dataset.
  2. Data bias estimation with respect to available sensitive attributes: The dataset includes direct information on age and sex; note that both are standardized. A careful evaluation of the possible bias in the sensitive attributes includes comparing group ratios in the data with their population rates, or with benchmarks from the literature. In our case, without information about geography or ethnicity, and with the actual values of the variables masked, there is hardly any conclusion to be drawn from the bias estimation. The breakdown below shows a roughly uniform distribution across the two variables. Follow-up questions to be assessed include: which value of the variable corresponds to which sex? Can we expect the disease to progress similarly in both sexes? And similarly for the age distribution.
  3. Risks for model deployment: We cannot determine the distribution across some omitted sensitive variables, nor do we have information about how the data were gathered. This poses fairness risks when deploying a model trained on this dataset to populations different from the study cohort.

When deploying the model in a healthcare context, make sure it is equally performant across subgroups defined by sensitive attributes and their intersections.

We recommend always assessing bias and fairness risks at each step of the process (from problem understanding and data collection through processing, modelling, and deployment) when working on models to be deployed, to minimize undesired outcomes.
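(For reference, the "breakdown" mentioned in point 2 can be reproduced along these lines; this is a sketch, not the exact notebook code:)

```python
# Check how evenly the standardized sex and age variables are distributed
# in sklearn's diabetes dataset.
import pandas as pd
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True, as_frame=True)

# 'sex' takes two standardized values; compare the group shares.
print(X["sex"].value_counts(normalize=True))

# Split the standardized 'age' column into four equal-width bins and
# compare the bin sizes.
print(pd.cut(X["age"], bins=4).value_counts(normalize=True).sort_index())
```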


By adopting the 3 steps of fairness evaluation, we can hopefully identify problematic datasets before they are used.

In the future, we could think about adding something like data nutrition labels: https://datanutrition.org

kcoleman80 commented:

Thank you so much Philip! This is an amazing example of learning and fixing! We need people like this in AI/Data Science/Software. I'm so happy, you've helped restore my faith in people today! 😊😊😊😊😊😊
