| author | title | semester | footer | license |
|---|---|---|---|---|
| Christian Kaestner and Claire Le Goues | MLiP: Measuring Fairness | Spring 2023 | Machine Learning in Production/AI Engineering • Christian Kaestner, Carnegie Mellon University • Spring 2023 | Creative Commons Attribution 4.0 International (CC BY 4.0) |
Required:
- Nina Grgic-Hlaca, Elissa M. Redmiles, Krishna P. Gummadi, and Adrian Weller. Human Perceptions of Fairness in Algorithmic Decision Making: A Case Study of Criminal Risk Prediction. In WWW, 2018.
Recommended:
- Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. Big Data and Social Science: Data Science Methods and Tools for Research and Practice. Chapter 11, 2nd ed, 2020
- Solon Barocas and Moritz Hardt and Arvind Narayanan. Fairness and Machine Learning. 2019 (incomplete book)
- Pessach, Dana, and Erez Shmueli. "A Review on Fairness in Machine Learning." ACM Computing Surveys (CSUR) 55, no. 3 (2022): 1-44.
- Understand different definitions of fairness
- Discuss methods for measuring fairness
- Outline interventions to improve fairness at the model level
How do we measure fairness of an ML model?
Source: Moritz Hardt, https://fairmlclass.github.io/
- Anti-classification (fairness through blindness)
- Group fairness (independence)
- Equalized odds (separation)
- ...and numerous others and variations!
- Large loans repaid over long periods
- Home ownership is key path to build generational wealth
- Past decisions often discriminatory (redlining)
- Replace biased human decisions by accurate ML model
- income, other debt, home value
- past debt and payment behavior (credit score)
Fairness discourse asks questions about how to treat people and whether treating different groups of people differently is ethical. If two groups of people are systematically treated differently, this is often considered unfair.
- Distribute loans equally across all groups of protected attribute(s) (e.g., ethnicity)
- Prioritize those who are more likely to pay back (e.g., higher income, good credit history)
Note: slack vote
Individuals can and do fall into multiple groups!
Subgroup fairness gets extremely technically complicated quickly.
We therefore focus on the simple cases for the purposes of the material in this class.
Withhold services (e.g., mortgage, education, retail) from people in neighborhoods deemed "risky"
Map of Philadelphia, 1936, Home Owners' Loan Corps. (HOLC)
- Classification based on estimated "riskiness" of loans
Source: Federal Reserve’s Survey of Consumer Finances
Notes: much of fairness discourse here is trying to account for unequal starting positions
- Anti-classification (fairness through blindness)
- Group fairness (independence)
- Equalized odds (separation)
- ...and numerous others and variations!
- Also called fairness through blindness or fairness through unawareness
- Ignore certain sensitive attributes when making a decision
- Example: Remove gender and race from mortgage model
"After Ms. Horton removed all signs of Blackness, a second appraisal valued a Jacksonville home owned by her and her husband, Alex Horton, at 40 percent higher."
https://www.nytimes.com/2022/03/21/realestate/remote-home-appraisals-racial-bias.html
Easy to implement, but any limitations?
Features correlate with protected attributes
- Loan lending: Gender discrimination is illegal.
- Medical diagnosis: Gender-specific diagnosis may be desirable.
- ML models discriminate based on input data by construction.
- The problem is unjustified differentiation; i.e., discriminating on factors that should not matter
- Discrimination is a domain-specific concept
- Ignore certain sensitive attributes when making a decision
- Advantage: Easy to implement and test
- Limitations
- Sensitive attributes may be correlated with other features
- Some ML tasks need sensitive attributes (e.g., medical diagnosis)
How to train models that are fair w.r.t. anti-classification?
--> Simply remove features for protected attributes from training and inference data
--> Null/randomize protected attribute during inference
(does not account for correlated attributes, is not required to)
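A minimal sketch of the two options above, assuming a pandas/scikit-learn setup with numeric-encoded features and illustrative column names (`gender`, `race`, `repaid`):

```python
# Minimal sketch (illustrative column names; assumes numeric-encoded features
# and scikit-learn) of the two options for enforcing anti-classification.
import pandas as pd
from sklearn.linear_model import LogisticRegression

PROTECTED = ["gender", "race"]  # hypothetical protected attribute columns

def train_without_protected(df: pd.DataFrame, label: str = "repaid"):
    """Option 1: drop protected attributes before training (do the same at inference)."""
    X = df.drop(columns=PROTECTED + [label])
    return LogisticRegression(max_iter=1000).fit(X, df[label]), list(X.columns)

def predict_with_nulled_protected(model, df: pd.DataFrame, feature_cols):
    """Option 2: the model was trained with the protected columns, but they are
    overwritten with a constant at inference, so predictions cannot depend on them."""
    df = df.copy()
    df[PROTECTED] = 0
    return model.predict(df[feature_cols])
```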
How do we test that a classifier achieves anti-classification?
Straightforward invariant for the classifier: the prediction must not change when only the protected attribute is changed
(does not account for correlated attributes, is not required to)
Test with any test data, e.g., purely random data or existing test data
Any single inconsistency shows that the protected attribute was used. Can also report percentage of inconsistencies.
See for example: Galhotra, Sainyam, Yuriy Brun, and Alexandra Meliou. "Fairness testing: testing software for discrimination." In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 498-510. 2017.
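A hedged sketch of such an invariance test, counting how often the prediction flips when only the protected attribute is swapped (function and column names are illustrative):

```python
# Sketch of an anti-classification test: predictions must not change when only
# the protected attribute is altered. Function and column names are illustrative.
import pandas as pd

def anticlassification_violation_rate(model, X: pd.DataFrame,
                                      attr: str = "gender", values=(0, 1)):
    """Fraction of rows whose prediction flips when the protected attribute
    is swapped between two values; any value > 0 violates the invariant."""
    X_a, X_b = X.copy(), X.copy()
    X_a[attr], X_b[attr] = values[0], values[1]
    return (model.predict(X_a) != model.predict(X_b)).mean()

# Any test data works (even random inputs), since the invariant must hold for all x:
# assert anticlassification_violation_rate(model, test_df.drop(columns=["repaid"])) == 0
```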
Testing of anti-classification barely needed, because easy to ensure by constructing during training or inference!
Anti-classification is a good starting point to think about protected attributes
Useful baseline for comparison
Easy to implement, but only effective if (1) no proxies among features and (2) protected attributes add no predictive power
- Anti-classification (fairness through blindness)
- Group fairness (independence)
- Equalized odds (separation)
- ...and numerous others and variations!
Key idea: Outcomes matter, not accuracy!
Compare outcomes across two groups
- Similar rates of accepted loans across racial/gender groups?
- Similar chance of being hired/promoted between gender groups?
- Similar rates of (predicted) recidivism across racial groups?
Disparate treatment: Practices or rules that treat a certain protected group(s) differently from others
- e.g., Apply different mortgage rules for people from different backgrounds
Disparate impact: Neutral rules, but outcome is worse for one or more protected groups
- Same rules are applied, but certain groups have a harder time obtaining mortgage in a particular neighborhood
Relates to disparate impact and the four-fifths rule
Can sue organizations for discrimination if they
- mostly reject job applications from one minority group (identified by protected classes) and hire mostly from another
- reject most loans from one minority group and more frequently accept applicants from another
- $X$: Feature set (e.g., age, race, education, region, income)
- $A \in X$: Sensitive attribute (e.g., gender)
- $R$: Regression score (e.g., predicted likelihood of on-time loan payment)
- $Y'$: Classifier output
  - $Y' = 1$ if and only if $R > T$ for some threshold $T$
  - e.g., grant the loan ($Y' = 1$) if the likelihood of paying back > 80%
- $Y$: Target variable being predicted ($Y = 1$ if the person actually pays back on time)
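In code, the classifier output is simply a thresholded score; a tiny sketch with made-up score values:

```python
# Tiny sketch of the notation above: the classifier output is a thresholded score.
import numpy as np

T = 0.8                               # decision threshold from the example
R = np.array([0.55, 0.92, 0.81])      # predicted likelihoods of on-time payment (made up)
Y_pred = (R > T).astype(int)          # Y' = 1 iff R > T, i.e., grant the loan
print(Y_pred)                         # [0 1 1]
```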
Setting classification thresholds: Loan lending example
- Also called independence or demographic parity
- Mathematically, $Y' \perp A$
  - Prediction ($Y'$) must be independent of the sensitive attribute ($A$)
- Examples:
- The predicted rate of recidivism is the same across all races
- Both women and men have the same probability of being promoted
- i.e., P[promote = 1 | gender = M] = P[promote = 1 | gender = F]
Notes: probability is the same across all groups
What are limitations of group fairness?
- Ignores possible correlation between $Y$ and $A$
  - Rules out the perfect predictor $Y' = Y$ when $Y$ and $A$ are correlated!
- Permits abuse and laziness: Can be satisfied by randomly assigning a positive outcome ($Y' = 1$) to protected groups
  - e.g., randomly promote people (regardless of their job performance) to match the rate across all groups
Notes: firing practices
Select different classification thresholds ($T$) for different groups to achieve group fairness (see the sketch after the example below)
Example: Mortgage application
- R: Likelihood of paying back the loan on time
- Suppose: With a uniform threshold used (i.e., R = 80%), group fairness is not achieved
- P[R > 0.8 | A = 0] = 0.4, P[R > 0.8 | A = 1] = 0.7
- Adjust thresholds to achieve group fairness
- P[R > 0.6 | A = 0] = P[R > 0.8 | A = 1]
- Wouldn't group A = 1 argue it's unfair? When does this type of adjustment make sense?
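To make the adjustment concrete, a small sketch (names and data are illustrative) that picks one group's threshold so its positive rate matches the other group's:

```python
# Sketch of threshold adjustment (illustrative): choose a group's threshold so
# that its positive rate matches a reference group's positive rate.
import numpy as np

def threshold_for_target_rate(scores, target_rate):
    """Threshold such that the fraction of scores above it equals target_rate."""
    return float(np.quantile(scores, 1 - target_rate))

# e.g., if P[R > 0.8 | A=1] = 0.7, pick the threshold for group A=0 whose
# positive rate is also 0.7 (about 0.6 in the slide's example):
# t0 = threshold_for_target_rate(scores_group_0, target_rate=0.7)
```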
How would you test whether a classifier achieves group fairness?
Collect realistic, representative data (not randomly generated!)
- Use existing validation/test data
- Monitor production data
- (Somehow) generate realistic test data, e.g. from probability distribution of population
Separately measure the rate of positive predictions
- e.g., P[promoted = 1 | gender = M], P[promoted = 1 | gender = F] = ?
Report issue if the rates differ beyond some threshold
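A minimal sketch of this check (column names and the disparity threshold are illustrative; libraries such as Fairlearn offer similar metrics):

```python
# Minimal sketch of a group fairness check (illustrative names and threshold).
import pandas as pd

def positive_rates(y_pred, groups):
    """Rate of positive predictions per group, e.g., P[promoted = 1 | gender]."""
    return pd.Series(y_pred).groupby(pd.Series(groups)).mean()

def check_group_fairness(y_pred, groups, max_disparity=0.05):
    rates = positive_rates(y_pred, groups)
    disparity = rates.max() - rates.min()
    if disparity > max_disparity:  # report issue if rates differ beyond a threshold
        print(f"Group fairness violated: {rates.to_dict()} (disparity {disparity:.2f})")
    return disparity
```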
- Anti-classification (fairness through blindness)
- Group fairness (independence)
- Equalized odds (separation)
- ...and numerous others and variations!
Key idea: Focus on accuracy (not outcomes) across two groups
- Similar default rates on accepted loans across racial/gender groups?
- Similar rate of "bad hires" and "missed stars" between gender groups?
- Similar accuracy of predicted recidivism vs actual recidivism across racial groups?
Accuracy matters, not outcomes!
Relates to disparate treatment
Typically, lawsuits claim that protected attributes (e.g., race, gender) were used in decisions even though they were irrelevant
- e.g., fired over a complaint because of being Latino, whereas White employees with similar complaints were not fired
Must prove that the defendant had intention to discriminate
- Often difficult: Relying on shifting justifications, inconsistent application of rules, or explicit remarks overheard or documented
Statistical property of separation:
- Prediction must be independent of the sensitive attribute conditional on the target variable
Can we explain separation in terms of model errors?
$P[Y'=1 \mid Y=0, A=a] = P[Y'=1 \mid Y=0, A=b]$

$P[Y'=0 \mid Y=1, A=a] = P[Y'=0 \mid Y=1, A=b]$

- $Y' \perp A \mid Y$: Prediction must be independent of the sensitive attribute conditional on the target variable
  - i.e., all groups are susceptible to the same false positive/negative rates
- Example: $Y'$: promotion decision, $A$: gender of applicant, $Y$: actual job performance
Requires realistic representative test data (telemetry or representative test data, not random)
Separately measure false positive and false negative rates
- e.g., for FNR, compare P[promoted = 0 | female, good employee] vs P[promoted = 0 | male, good employee]
How is this different from testing group fairness?
Notes: need labels hard in some applications
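A hedged sketch of measuring separation; unlike the group-fairness check above, it needs ground-truth labels because the error rates are conditioned on the actual outcome (names are illustrative):

```python
# Sketch of measuring separation: false positive/negative rates per group
# (illustrative names; requires ground-truth labels y_true).
import pandas as pd

def error_rates_by_group(y_true, y_pred, groups):
    """Per-group FNR and FPR, e.g., P[promoted = 0 | good employee, gender]."""
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "g": groups})
    fnr = df[df.y == 1].groupby("g")["pred"].apply(lambda p: (p == 0).mean())
    fpr = df[df.y == 0].groupby("g")["pred"].apply(lambda p: (p == 1).mean())
    return pd.DataFrame({"FNR": fnr, "FPR": fpr})

# Equalized odds holds (approximately) if these rates are similar across groups;
# report an issue if they differ beyond a chosen threshold.
```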
In groups, post to #lecture
tagging members:
- Does the model meet anti-classification fairness w.r.t. gender?
- Does the model meet group fairness?
- Does the model meet equalized odds?
- Is the model fair enough to use?
Notes: prob cancer male vs female
- Anti-classification (fairness through blindness)
- Group fairness (independence)
- Equalized odds (separation)
- ...and numerous others and variations!
Many measures proposed
Some specialized for tasks (e.g., ranking, NLP)
Some consider downstream utility of various outcomes
Most are similar to the three discussed
- Comparing different measures in the error matrix (e.g., false positive rate, lift)
Next lecture: Fairness is a system-wide concern
- Identifying and negotiating fairness requirements
- Fairness beyond model predictions (product design, mitigations, data collection)
- Fairness in process and teamwork, barriers and responsibilities
- Documenting fairness at the interface
- Monitoring
- Promoting best practices
- Three definitions of fairness: Anti-classification, group fairness, equalized odds
- Tradeoffs between fairness criteria
- What is the goal?
- Key: how to deal with unequal starting positions
- Improving fairness of a model
- In all pipeline stages: data collection, data cleaning, training, inference, evaluation
- 🕮 Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter and Julia Lane. Big Data and Social Science: Data Science Methods and Tools for Research and Practice. Chapter 11, 2nd ed, 2020
- 🕮 Solon Barocas and Moritz Hardt and Arvind Narayanan. Fairness and Machine Learning. 2019 (incomplete book)
- 🗎 Pessach, Dana, and Erez Shmueli. "A Review on Fairness in Machine Learning." ACM Computing Surveys (CSUR) 55, no. 3 (2022): 1-44.