## Instructions
* Read each cell and implement the **TODOs** sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Do not delete the **TODO** comment blocks.
* Aside from the TODOs, there will be questions embedded in the notebook and a cell for you to provide your answer (denoted with A:). Answer all the markdown/text cells with **"A: "** on them. 
* You are expected to look up how some functions work on the Internet or in the docs.
* You may add new cells for "scrap work".
* The notebooks will undergo a "Restart and Run All" command, so make sure that your code is working properly.
* You are expected to understand the data set loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

Place your answers to the questions directly in the same cell as **A:**

For example:

<span style='color:red'>**Question 00:**</span> What is your favorite ice cream flavor?

<span style='color:red'>**A00:**</span> My favorite ice cream flavor is pistachio.

# Assignment 2.4 - Bias and Fairness
In this notebook, you will experiment with checking for fairness / disparity and with some simple bias reduction techniques.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier

## Case Study: Project Proposal Filtering
We want to build a model that identifies the top 1000 (worst) project submissions that are **NOT** likely to get funded, in order to prioritize resource allocation. We immediately reject them so that we don't waste resources and time on reviewing them.


Assume that we already have a trained model. The test data below shows the predictions of the model, the ground truth label, and some demographic information about the submitter.

In [None]:
initial_preds_df = pd.read_csv('data/initial_predictions_df.csv.gz', compression='gzip')
initial_preds_df

The `score` column represents the model prediction and the `label_value` is the ground truth label. 

Given a dataframe of test predictions similar to the one shown above, let's compute the true positive rate (TPR) metric.

The true positive rate is the fraction of actual positives (entities whose ground truth label is positive) that the model also predicts as positive.

$$TPR = \frac{TP}{TP + FN}$$

Hint: This is easier to compute if you first add extra columns for the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
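
To make the hint concrete, here is a minimal sketch on a tiny toy table. The dataframe `toy` and its column names `y_pred` and `y_true` are made up for illustration and are not the assignment's columns; the sketch only shows the indicator-column idea and is not meant as the required implementation.

```python
import pandas as pd

# Toy data with hypothetical column names (not the assignment's dataframe).
toy = pd.DataFrame({
    "y_pred": [1, 1, 0, 0, 1, 0],
    "y_true": [1, 0, 1, 0, 1, 1],
})

# One boolean indicator column per confusion-matrix cell.
toy["tp"] = (toy["y_pred"] == 1) & (toy["y_true"] == 1)
toy["fp"] = (toy["y_pred"] == 1) & (toy["y_true"] == 0)
toy["fn"] = (toy["y_pred"] == 0) & (toy["y_true"] == 1)
toy["tn"] = (toy["y_pred"] == 0) & (toy["y_true"] == 0)

# TPR = TP / (TP + FN); summing booleans counts the True entries.
toy_tpr = toy["tp"].sum() / (toy["tp"].sum() + toy["fn"].sum())
print(toy_tpr)
```

Every row falls into exactly one of the four indicator columns, which is a quick sanity check that the columns were built correctly.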

In [None]:
############################################################
# TODO-01: Implement a function that computes the true     #
# positive rate (TPR) given a dataframe of predictions.    #
# You can assume that the predictions are always going to  #
# be named `score`, but the target label column is a       #
# parameter.                                               #
############################################################
def compute_tpr(df, target_col):
    pass

############################################################
#                    End of your code.                     #
############################################################

In [None]:
print("True positive rate (TPR) =", compute_tpr(initial_preds_df, target_col="label_value") )

Now, let's check if there is a big disparity between different groups.

Let's first implement the function to compute TPR with respect to the grouping of interest.
This is the fraction of true positives among the label-positive entities of a group $g$.

$$TPR_g = \frac{TP_g}{TP_g + FN_g}$$

Hint: You can do this by using the `groupby` method of a pandas dataframe followed by `apply`. Here are some resources related to this.
- https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.GroupBy.apply.html
- https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html
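
As a rough illustration of the `groupby` followed by `apply` pattern, the sketch below computes a per-group ratio on a toy table. The names `toy`, `hits`, `tries`, and `hit_ratio` are hypothetical and unrelated to the assignment data. Selecting the needed columns before `apply` is one way to keep the grouping column out of the sub-frames passed to the function.

```python
import pandas as pd

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "hits": [1, 0, 1, 1, 0],
    "tries": [2, 2, 1, 2, 1],
})

def hit_ratio(sub_df):
    # Receives the rows of one group as a DataFrame and returns a scalar.
    return sub_df["hits"].sum() / sub_df["tries"].sum()

# The result is a Series indexed by the group labels.
ratios = toy.groupby("group")[["hits", "tries"]].apply(hit_ratio)
print(ratios)
```

Because the function returns one scalar per group, the result has one entry per group label, which can be read back through its `.index`.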

In [None]:
############################################################
# TODO-02: Implement a function that computes the true     #
# positive rate (TPR) with respect to each group as        #
# specified by the grouping column. You can assume that    #
# the predictions are always going to be named `score`,    #
# but the target label column is a parameter.              #
############################################################

def grouped_tpr(df, target_col, group_column):
    pass


############################################################
#                    End of your code.                     #
############################################################

Let's compute the amount of disparity. We define this as how many times smaller or larger the metric is than the reference group's metric value. For example, if the reference group A has a value of $1.0$ and group B has $3.0$, then we say group B is $3$ times larger than the reference group A. If group C has $0.25$, then we say group C is $4$ times smaller than the reference group A. To simplify things, we use a negative sign to indicate that the value is smaller, i.e., $-4$ means that it is $4$ times smaller. Note that this is only a simplifying convention; don't forget to take it into account in your analysis and use absolute values when necessary.
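
The sign convention can be checked on the numbers from the example above. The helper name `signed_ratio` is hypothetical, and the sketch assumes strictly positive metric values (a value of zero would cause a division by zero).

```python
def signed_ratio(value, reference):
    # Assumes value > 0 and reference > 0.
    if value >= reference:
        # "value is X times larger than the reference" -> positive number
        return value / reference
    # "value is X times smaller than the reference" -> negative number
    return -(reference / value)

print(signed_ratio(3.0, 1.0))   # group B against reference A
print(signed_ratio(0.25, 1.0))  # group C against reference A
print(signed_ratio(1.0, 1.0))   # the reference against itself
```

Under this convention the result never falls strictly between $-1$ and $1$, and the reference group itself always maps to $1$.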

Hint: you can use `.index` to get the row index labels

In [None]:
############################################################
# TODO-03: Implement a function that computes the disparity#
# defined as how many times smaller or larger the metric   #
# is for a particular group with respect to the reference  #
# group. Store this in a dictionary disparity_group such   #
# that disparity_group[group] = disparity.                 #
############################################################

def compute_disparity(metric_df, ref_group_label):
    pass
    
############################################################
#                    End of your code.                     #
############################################################

Now, to determine if there is a significant disparity, we need to define a tolerance threshold for the disparity. For now, let's set this to $\text{tolerance} = 1.3$.
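
One possible reading of the threshold, stated here as an assumption, is that a group shows a significant disparity when the absolute value of its signed disparity exceeds the tolerance. The sketch below applies that reading to made-up disparity values; the group names and numbers are hypothetical.

```python
tolerance = 1.3

# Hypothetical signed disparities (negative means "times smaller").
toy_disparities = {"ref": 1.0, "g1": 1.2, "g2": -1.5}

# Assumption: compare the magnitude, not the signed value, to the tolerance.
flagged = {group: abs(d) > tolerance for group, d in toy_disparities.items()}
print(flagged)
```

Taking the absolute value matters here: comparing the signed value $-1.5$ directly against $1.3$ would miss the disparity of `g2`.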

<span style='color:red'>**TODO-04:**</span> Compute the group disparities for the property `poverty_level` with respect to the group `lower`.

<span style='color:red'>**Question 01:**</span> Based on your results, is there a significant disparity between the `poverty_level` groups?

<span style='color:red'>**A01:**</span>

<span style='color:red'>**TODO-05:**</span> Compute the group disparities for the property `metro_type` with respect to the group `suburban_rural`.

<span style='color:red'>**Question 02:**</span> Based on your results, is there a significant disparity between the `metro_type` groups?

<span style='color:red'>**A02:**</span>

<span style='color:red'>**TODO-06:**</span> Compute the group disparities for the property `teacher_sex` with respect to the group `male`.

<span style='color:red'>**Question 03:**</span> Based on your results, is there a significant disparity between the `teacher_sex` groups?

<span style='color:red'>**A03:**</span>

# Bias reduction

After checking for disparities, we want to retrain our model to reduce the disparity. Let's first load the datasets.

There are two dataframes each for train and test. The first contains the input features and the second contains extra attributes.

In [None]:
traindf = pd.read_csv('data/train_20120501_20120801.csv.gz', compression='gzip')
train_attr_df = pd.read_csv('data/train_20120501_20120801_protected.csv.gz', compression='gzip')

testdf = pd.read_csv('data/test_20121201_20130201.csv.gz', compression='gzip')
test_attr_df = pd.read_csv('data/test_20121201_20130201_protected.csv.gz', compression='gzip')

In [None]:
train_attr_df

The target label column in this dataframe is `quickstart_label`.

We already provide functions below to train and test the model.

In [None]:
def train_model(target_column, train_df):
    hyperparameters = {
        'n_jobs': -1,
        'criterion': 'gini',
        'max_depth': 30,
        'max_features': 'sqrt',
        'n_estimators': 87,
        'random_state': 213500298,
        'min_samples_leaf': 44,
        'min_samples_split': 3
    }
    model = RandomForestClassifier(**hyperparameters)
    
    y_train = train_df[target_column].values
    X_train = train_df.drop(['entity_id','as_of_date', target_column], axis = 1)
    model.fit(X_train, y_train)

    return model

In [None]:
def test_model(model, target_column, test_df, test_attrdf):
    X_test = test_df.drop(['entity_id','as_of_date',target_column], axis = 1)
    y_pred = model.predict_proba(X_test)[:,1]
    preds_df = test_df[['entity_id','as_of_date',target_column]].copy()
    preds_df['predict_proba'] = y_pred
    preds_df = preds_df.sort_values('predict_proba', ascending = False).reset_index(drop=True).copy()
    # As mentioned in the case study description above, only the top 1000
    # (worst) submissions are flagged for rejection.
    preds_df['score'] = preds_df.apply(lambda x: 1.0 if int(x.name) < 1000 else 0.0, axis=1)
    
    return pd.merge(preds_df, test_attrdf, how='left', on=['entity_id','as_of_date'], sort=True, copy=True)

In [None]:
def compute_precision(preds_df, target_column):
    return preds_df[preds_df['score'] > 0][target_column].mean()

## Bias reduction via Unawareness

Bias reduction via unawareness is simply removing the protected attributes so that the model has no access to these features. We'll look at only the poverty level for now.
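
Mechanically, unawareness amounts to dropping columns before training. A minimal sketch on a toy table follows; `toy`, `toy_protected`, and the column names are hypothetical and only illustrate `DataFrame.drop`.

```python
import pandas as pd

# Toy data with hypothetical column names.
toy = pd.DataFrame({
    "feature_a": [0.1, 0.4],
    "protected_x": [1, 0],
    "protected_y": [0, 1],
    "target": [1, 0],
})
toy_protected = ["protected_x", "protected_y"]

# drop returns a new DataFrame; the original is left unchanged.
toy_unaware = toy.drop(columns=toy_protected)
print(list(toy_unaware.columns))
```

Note that the same columns would need to be removed from both the training and the test features so that the model sees the same inputs at prediction time.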

In [None]:
protected_attributes = [
    'project_features_entity_id_all_poverty_level_highpoverty',
    'project_features_entity_id_all_poverty_level_highestpoverty',
    'project_features_entity_id_all_poverty_level_lowpoverty',
    'project_features_entity_id_all_poverty_level_moderatepoverty',
]

<span style='color:red'>**TODO-07:**</span> Remove the protected attributes and train a new model. Compute the model's precision and also assess whether there are any disparities left for the poverty level groups with respect to true positive rates (TPR).

You may add as many cells below as necessary.

<span style='color:red'>**Question 04:**</span> Based on your experiment results, how effective is the strategy of unawareness? Explain your answer. 

<span style='color:red'>**A04:**</span>

## Bias reduction via Resampling

Now let's try the resampling approach. The idea is to resample the training data points so that some statistic of the groups is equalized. Common statistics are group size and prevalence.

Group size simply refers to the counts of the group.

Prevalence, on the other hand, is the proportion of positive labels within a group.
$$\text{Prevalence} = P(Y = 1 | \text{Group} ) = \frac{\text{Number of positively labeled samples}} {\text{Group size} }$$


<span style='color:red'>**TODO-08:**</span> Compute the group sizes with respect to `poverty_level`.

<span style='color:red'>**TODO-09:**</span> Compute the prevalence of each group with respect to `poverty_level`.

Hint: It would be easier to do if you merge the `traindf` and `train_attr_df`. The `groupby` followed by `apply` done above is also useful for doing this.
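
The merge-then-group idea can be sketched on two toy tables. All names here (`toy_features`, `toy_attrs`, `id`, `label`, `group`) are hypothetical. For a 0/1 label, the per-group mean equals the prevalence, so this sketch uses `size` and `mean` instead of a custom `apply` function.

```python
import pandas as pd

# Toy feature table and toy attribute table sharing a hypothetical key "id".
toy_features = pd.DataFrame({"id": [1, 2, 3, 4, 5], "label": [1, 0, 1, 1, 0]})
toy_attrs = pd.DataFrame({"id": [1, 2, 3, 4, 5], "group": ["x", "x", "x", "y", "y"]})

merged = toy_features.merge(toy_attrs, on="id", how="left")

# Group size: number of rows per group.
group_sizes = merged.groupby("group").size()

# Prevalence: share of positive labels per group (mean of a 0/1 column).
prevalence = merged.groupby("group")["label"].mean()

print(group_sizes)
print(prevalence)
```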

<span style='color:red'>**TODO-10:**</span> Resample the poverty level group `highest` to match the group size of the `lower` group. 

Hint: `numpy.random.choice` can be a useful function to use here.
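
As a sketch of how `numpy.random.choice` can resample rows, the toy example below oversamples a smaller group with replacement until it matches the size of a larger group. The table and the group names `small` and `large` are hypothetical. Whether sampling with or without replacement is appropriate depends on which group is larger in the real data, so treat `replace=True` as an assumption of this sketch.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Toy data: 3 rows in "small", 6 rows in "large".
toy = pd.DataFrame({"group": ["small"] * 3 + ["large"] * 6, "value": range(9)})

small_idx = toy.index[toy["group"] == "small"].to_numpy()
target_size = int((toy["group"] == "large").sum())

# Draw row labels of the small group with replacement up to the target size.
sampled_idx = np.random.choice(small_idx, size=target_size, replace=True)

# Combine the untouched large group with the resampled small group.
resampled = pd.concat(
    [toy[toy["group"] == "large"], toy.loc[sampled_idx]],
    ignore_index=True,
)
print(resampled["group"].value_counts())
```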

You may add as many cells below as necessary.


In [None]:
# np.random.seed(0)

<span style='color:red'>**TODO-11:**</span> Retrain a model with the newly resampled dataset having equal group sizes. Then, analyze the performance and disparities. 

You may add as many cells below as necessary.


**Note:** You should try this for 5 different random seeds (0,1,2,3,4) for the random sampling (previous TODO), so that you get a more stable estimate of the performance gains and reduce the influence of randomness.
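
One way to organize the repeated runs is a loop over the seeds that collects a metric per run and then summarizes it. In the sketch below, `run_once` is a hypothetical placeholder that returns a dummy number; in the notebook it would stand for resampling with the given seed, retraining, and computing a metric.

```python
import numpy as np

def run_once(seed):
    # Hypothetical placeholder: returns a dummy value that depends only on the seed.
    # Replace with: resample using this seed, retrain, evaluate.
    rng = np.random.RandomState(seed)
    return float(rng.uniform(0.4, 0.6))

seeds = [0, 1, 2, 3, 4]
results = [run_once(seed) for seed in seeds]

# Report the mean and the spread across seeds.
mean_result = float(np.mean(results))
std_result = float(np.std(results))
print(mean_result, std_result)
```

Reporting the standard deviation next to the mean shows how much of an observed change could be due to the random sampling alone.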

<span style='color:red'>**TODO-12:**</span> Resample the poverty level group `highest` to match the prevalence of the `lower` group. 

Hint: `numpy.random.choice` can be a useful function to use here.
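
One possible way to reach a target prevalence, given here only as a sketch, is to keep the negatively labeled samples of the group fixed and resample the positively labeled ones. The symbols $n_-$ (current number of negatives), $n_+'$ (number of positives after resampling), and $p^*$ (target prevalence) are introduced for this illustration only.

$$\frac{n_+'}{n_+' + n_-} = p^* \quad\Longrightarrow\quad n_+' = \frac{p^* \, n_-}{1 - p^*}$$

For example, with $n_- = 60$ and $p^* = 0.25$, this gives $n_+' = \frac{0.25 \cdot 60}{0.75} = 20$, and indeed $\frac{20}{20 + 60} = 0.25$. The symmetric option keeps the positives fixed and resamples the negatives to $n_-' = \frac{(1 - p^*)\, n_+}{p^*}$.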

You may add as many cells below as necessary.

In [None]:
# np.random.seed(0)

<span style='color:red'>**TODO-13:**</span> Retrain a model with the newly resampled dataset having equal prevalence. Then, analyze the performance and disparities. 

You may add as many cells below as necessary.

**Note:** You should try this for 5 different random seeds (0,1,2,3,4) for the random sampling (previous TODO), so that you get a more stable estimate of the performance gains and reduce the influence of randomness.

<span style='color:red'>**TODO-14:**</span> Resample the poverty level group `highest` to match **BOTH** the group size and prevalence of the `lower` group. 

Hint: `numpy.random.choice` can be a useful function to use here.
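
When both the group size and the prevalence are fixed, the number of positive and negative samples to draw follows directly from the two targets. The helper `target_counts` below is hypothetical, and the rounding to whole samples is an assumption of this sketch.

```python
def target_counts(target_size, target_prevalence):
    # Number of positives needed so that positives / size is close to the target.
    n_pos = int(round(target_size * target_prevalence))
    # The remaining samples are negatives, so the total matches the target size.
    n_neg = target_size - n_pos
    return n_pos, n_neg

print(target_counts(200, 0.25))
```

Positives and negatives of the group can then be resampled separately to these two counts.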

You may add as many cells below as necessary.

In [None]:
# np.random.seed(0)

<span style='color:red'>**TODO-15:**</span> Retrain a model with the newly resampled dataset having equal group size and prevalence. Then, analyze the performance and disparities. 

You may add as many cells below as necessary.

**Note:** You should try this for 5 different random seeds (0,1,2,3,4) for the random sampling (previous TODO), so that you get a more stable estimate of the performance gains and reduce the influence of randomness.

<span style='color:red'>**Question 05:**</span> Analyze the three different resampled datasets that you made and compare their performances to each other. How effective are they at reducing bias?

<span style='color:red'>**A05:**</span>

<span style='color:red'>**Question:**</span> How much time did it take you to complete this notebook?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> What parts of the assignment did you like and what parts did you not like?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> How do you think it could be improved?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> Do you have any case studies in mind that would be nice to suggest / include in the assignment?

<span style='color:red'>**A:**</span>