## Python Data Balance Analysis

## Context
Data Balance Analysis is relevant for the overall understanding of datasets, but it becomes essential when building Machine Learning models responsibly, especially in terms of fairness. It is all too easy to build an ML model that produces biased results for subsets of the population by training or testing the model on biased ground truth data. There are multiple case studies of biased models assisting in granting loans, healthcare, recruitment opportunities, and many other decision-making tasks. In most of these examples, the data on which the models were trained was the common issue. These findings emphasize how important it is for model creators and auditors to analyze data balance: to measure training data across various sub-populations, to ensure the data has good coverage and a balanced representation of labels across sensitive categories and category combinations, and to check that test data is representative of the target population.

In summary, Data Balance Analysis, when used as a step in building ML models, has the following benefits:
- Reduces the risk of unbalanced models, promotes service fairness, and lowers the cost of building ML models by identifying data representation gaps early and prompting data scientists to seek mitigation steps before proceeding to the training portion of Machine Learning model development.
- Enables easy end-to-end debugging of ML systems in combination with Fairlearn by giving a clear view of whether an issue is tied to the data or to the model itself.

## Usage 
Data Balance Analysis supports three different types of metrics. 
- FeatureMeasures - supervised (requires a label column)
- DistributionMeasures - unsupervised (does not require a label column)
- AggregateMeasures - unsupervised (does not require a label column)


1. First we import all of the classes we are interested in.

```python
from databalanceanalysis.databalanceanalysis.aggregate_measures import AggregateMeasures
from databalanceanalysis.databalanceanalysis.feature_measures import FeatureMeasures
from databalanceanalysis.databalanceanalysis.distribution_measures import DistributionMeasures
```


2. Load the dataset, define the features of interest and ensure that the label column is binary. Currently, the FeatureBalance measure calculator only supports binary labels. 

For example:

```python
import pandas as pd

# Sensitive columns to analyze
sensitive_features = ["Gender", "Race"]

df = pd.read_csv("datasets/AdultCensusIncome.csv")

# Convert the label to a binary 0/1 encoding
df["income"] = df["income"].apply(lambda x: 0 if x == "<=50K" else 1)
```
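
Before computing any measure, it can help to confirm that the label column really is binary. The following is a minimal sketch using only pandas. The helper name `check_binary_label` and the toy data are hypothetical and are not part of the `databalanceanalysis` package.

```python
import pandas as pd


def check_binary_label(frame: pd.DataFrame, label_col: str) -> None:
    """Raise ValueError unless label_col holds only 0 and 1 (hypothetical helper)."""
    values = set(frame[label_col].dropna().unique().tolist())
    if not values <= {0, 1}:
        shown = sorted(map(str, values))
        raise ValueError(f"Label column {label_col!r} is not binary: {shown}")


# Toy stand-in for the census data (illustrative values only)
toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})
toy["income"] = (toy["income"] != "<=50K").astype(int)

check_binary_label(toy, "income")  # passes silently for a 0/1 column
```

Running the same check on the real DataFrame right after the encoding step would catch unexpected label values, such as stray whitespace or a third category, before they reach the measure calculators.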

3. Create an instance of the FeatureMeasures class, setting the sensitive columns to the columns you are interested in and the label column to the name of the label column.

For example: 

4. Create an instance of the DistributionMeasures class and set the sensitive columns to the columns you are interested in seeing. 

For example:

5. Create an instance of the AggregateMeasures class and set the sensitive columns parameter to the columns of interest. 

## Explanations of Data Balance Measures
### Feature Balance Measures
Feature Balance Measures allow us to see whether each combination of sensitive features receives the positive outcome (true prediction) with balanced probability.

In this context, we define a feature balance measure, also referred to as the parity, for label y as the difference between the association metrics of two different sensitive classes $([x_A, x_B])$, with respect to the association metric $(A(x_i, y))$. That is:

$$parity(y \vert x_A, x_B, A(\cdot)) \coloneqq A(x_A, y) - A(x_B, y) $$

Using the Adult Census Income dataset, we can check whether the different sexes and races receive an income above 50K at equal or unequal rates.
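
As an illustration of the parity definition, the sketch below takes demographic parity as the association metric, so that $A(x_i, y) = P(y = 1 \vert x_i)$, and computes the gap between two classes by hand with pandas. The function name `demographic_parity_gap` and the toy data are assumptions made for this example. It is not the package's own implementation.

```python
import pandas as pd


def demographic_parity_gap(frame, feature_col, label_col, class_a, class_b):
    """Return P(label=1 | class_a) - P(label=1 | class_b) for a 0/1 label."""
    # The mean of a 0/1 column within each group is that group's positive rate
    positive_rate = frame.groupby(feature_col)[label_col].mean()
    return positive_rate[class_a] - positive_rate[class_b]


# Toy data (illustrative values only): 3 of 4 males and 1 of 4 females are positive
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "income": [1, 1, 1, 0, 1, 0, 0, 0],
})

gap = demographic_parity_gap(toy, "Gender", "income", "Male", "Female")
print(gap)  # 0.75 - 0.25 = 0.5
```

A gap of 0 would indicate that both classes receive the positive label at the same rate. The sign shows which class is favored.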

Note: Many of these metrics were influenced by the paper [Measuring Model Biases in the Absence of Ground Truth](https://arxiv.org/abs/2103.03417).

| Association Metric | Family | Description | Interpretation | Reference |
| --- | --- | --- | --- | --- |
| Demographic Parity | Fairness | Each segment of a protected class (e.g. gender) should receive the positive outcome at equal rates. | The closer to 0, the better the parity. $DP = P(Y \vert A = Male) - P(Y \vert A = Female)$, where $Y$ denotes the positive label. | [Link](https://en.wikipedia.org/wiki/Fairness_%28machine_learning%29) |
| Pointwise Mutual Information (PMI), normalized PMI | Entropy | The PMI of a pair of feature values (e.g. Gender=Male and Gender=Female) quantifies the discrepancy between the probability of their coincidence given their joint distribution and given their individual distributions (assuming independence). | Range (normalized): [-1, 1]. -1 for no co-occurrences, 0 for co-occurrences at random, 1 for complete co-occurrence. | [Link](https://en.wikipedia.org/wiki/Pointwise_mutual_information) |
| Sorensen-Dice Coefficient (SDC) | Intersection-over-Union | Used to gauge the similarity of two samples. Related to the F1 score. | Equals twice the number of elements common to both sets divided by the sum of the number of elements in each set. | [Link]() |
| Jaccard Index | Intersection-over-Union | Similar to SDC, gauges the similarity and diversity of sample sets. | Equals the size of the intersection divided by the size of the union of the sample sets. | [Link] |
| Kendall Rank Correlation | Correlation and Statistical Tests | Used to measure the ordinal association between two measured quantities. | High when observations have a similar rank and low when observations have a dissimilar rank between the two variables. | [Link] |
| Log-Likelihood Ratio | Correlation and Statistical Tests | Calculates the degree to which data supports one variable versus another. It is the log of the likelihood ratio, which gives the probability of correctly predicting the label relative to the probability of incorrectly predicting it. | If the likelihoods are similar, it should be close to 0. | [Link] |
| t-test | Correlation and Statistical Tests | Used to compare the means of two groups (pairwise). | The value is looked up in the t-distribution to determine whether the difference is statistically significant. | [Link] |
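
To make a few of the table's metrics concrete, the sketch below computes PMI, normalized PMI, the Sorensen-Dice coefficient, and the Jaccard index between one sensitive class and the positive label, then takes the difference between two classes as the parity. It uses the standard textbook definitions and treats the class and the positive label as sets of rows. The package may compute simplified or differently normalized variants, so this is an assumption-laden illustration rather than its implementation. The function name `association_metrics` and the toy data are hypothetical.

```python
import math

import pandas as pd


def association_metrics(frame, feature_col, feature_value, label_col):
    """Textbook PMI, nPMI, Sorensen-Dice, and Jaccard for one class vs. label == 1."""
    in_class = frame[feature_col] == feature_value
    positive = frame[label_col] == 1
    n = len(frame)
    n_class = int(in_class.sum())
    n_pos = int(positive.sum())
    n_both = int((in_class & positive).sum())
    if n_both == 0 or n_both == n:
        # PMI / nPMI are undefined or degenerate at these extremes
        raise ValueError("need 0 < P(class, positive) < 1")
    p_class, p_pos, p_both = n_class / n, n_pos / n, n_both / n
    pmi = math.log(p_both / (p_class * p_pos))
    return {
        "pmi": pmi,
        "npmi": pmi / -math.log(p_both),  # rescales PMI into [-1, 1]
        "sdc": 2 * n_both / (n_class + n_pos),
        "jaccard": n_both / (n_class + n_pos - n_both),
    }


# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "income": [1, 1, 1, 0, 1, 0, 0, 0],
})

male = association_metrics(toy, "Gender", "Male", "income")
female = association_metrics(toy, "Gender", "Female", "income")

# Parity for each metric: A(x_A, y) - A(x_B, y)
parity = {name: male[name] - female[name] for name in male}
print(parity)
```

Each parity value follows the same pattern as the parity definition earlier in this section: compute the association metric per class, then subtract.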


Source: This notebook is an adaptation of the Data Balance Analysis notebook in the SynapseML documentation, https://microsoft.github.io/SynapseML/docs/features/responsible_ai/Data%20Balance%20Analysis/, written by Kashyap Patel.