<a href="https://colab.research.google.com/github/newyearsnight/AI-102-AIEngineer/blob/master/CS617(SP23)_Machine_Learning_%26_Data_Bias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Learning & Data Bias ⚖️

Adapted from [Exploring Unfairness and Bias in Data](http://modelai.gettysburg.edu/2020/bias/)  by Jonathan Chen, Tom Larsen, and Marion Neumann


## Learning Objectives
*   Understand how machine learning is used to solve real problems
*   Describe on a high level how classification algorithms operate
*   Understand performance metrics such as validation scores
*   Understand how bias can arise in a model, even unintentionally


## Agenda

1. [Warm-Up](#1.-Unfairness)
2. [Data Exploration](#2.-Data-Exploration)
3. [Modelling](#3.-Building-a-Model)
4. [Discussion](#4.-Becoming-Data-and-Fairness-Aware)

## 1. Warm-Up (Video)

Watch [this video on Amazon’s AI recruitment tool](https://www.youtube.com/watch?v=QvRZuHQBTps) and have students discuss the consequences of this technology.



## 2. Data Exploration

Imagine that you are a data scientist at a bank and that one of your company's primary business areas is lending money. The current loan approval process, which has been in place since the founding of the bank, has always relied on manual review of applications, a process that is tedious and doesn't scale well in the modern age. The company wants to expand its business, but this archaic system is holding it back.

Thinking about how to approach this problem, you immediately consider using the bank's past loan approval records to build a model that can learn how a human application reviewer decides which applications to approve and which to reject.


### Acquiring the Data



Before we begin, let's make sure that we have the data. The cell below authenticates with Google Drive and extracts the ID of the shared file that holds the data.

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

link = 'https://drive.google.com/open?id=1JWvFY96F5BlIdH35pFZVy-Bi3LMupdA4'

stuff, id = link.split('=')


Next, let's load our data. In the cell below, we read our [CSV][1] file into a [Pandas][2] [`DataFrame`][3] called `data`.



[1]: https://en.wikipedia.org/wiki/Comma-separated_values
[2]: https://pandas.pydata.org/
[3]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
import pandas as pd

In [None]:
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('Filename.csv')  
data = pd.read_csv('Filename.csv')
# Dataset is now stored in a Pandas Dataframe

Let's take a look at what we have.

In [None]:
data

Unnamed: 0,loan_id,loan_status,principal,terms,effective_date,due_date,paid_off_time,past_due_days,age,education,gender
0,xqd20166231,PAIDOFF,1000,30,9/8/2016,10/7/2016,9/14/2016 19:31,,45,High School or Below,male
1,xqd20168902,PAIDOFF,1000,30,9/8/2016,10/7/2016,10/7/2016 9:00,,50,Bachelors,female
2,xqd20160003,PAIDOFF,1000,30,9/8/2016,10/7/2016,9/25/2016 16:58,,33,Bachelors,female
3,xqd20160004,PAIDOFF,1000,15,9/8/2016,9/22/2016,9/22/2016 20:00,,27,College,male
4,xqd20160005,PAIDOFF,1000,30,9/9/2016,10/8/2016,9/23/2016 21:36,,28,College,female
...,...,...,...,...,...,...,...,...,...,...,...
495,xqd20160496,COLLECTION_PAIDOFF,1000,30,9/12/2016,10/11/2016,10/14/2016 19:08,3.0,28,High School or Below,male
496,xqd20160497,COLLECTION_PAIDOFF,1000,15,9/12/2016,9/26/2016,10/10/2016 20:02,14.0,26,High School or Below,male
497,xqd20160498,COLLECTION_PAIDOFF,800,15,9/12/2016,9/26/2016,9/29/2016 11:49,3.0,30,College,male
498,xqd20160499,COLLECTION_PAIDOFF,1000,30,9/12/2016,11/10/2016,11/11/2016 22:40,1.0,38,College,female


**Question 1.** How many examples are in our data set? How many features does it have?

500, 11
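
One way to check an answer like this is the `shape` attribute of a `DataFrame`, which returns a `(rows, columns)` tuple. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the loan data) so that it runs on its own; on the notebook's `data` the same attribute would apply.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the loan data.
toy = pd.DataFrame({
    'loan_status': ['PAIDOFF', 'PAIDOFF', 'COLLECTION'],
    'principal': [1000, 800, 1000],
    'age': [45, 33, 27],
})

n_examples, n_features = toy.shape  # (rows, columns)
print(n_examples, n_features)

# The dtype of each column is a first hint at its feature type
# (numeric vs. text/categorical).
print(toy.dtypes)
```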



**Write-up!** With your neighbor, come up with a description of what you think each feature is and what type of feature each one is. Which one should be our target variable? Which ones do you think will be useful for our model?

* loan_id: id for record-keeping
* loan_status: whether the loan has been paid off
* principal: original amount borrowed
* terms: number of days given to pay back the loan
* effective_date: loan granted date
* due_date: loan due date
* paid_off_time: date and time the loan was paid off
* past_due_days: number of days the loan is past due (NaN if paid on time)
* age: age of the borrower
* education: education level of the borrower
* gender: gender of the borrower
```
Answer:
Target variable: loan_status
Useful features: education, past_due_days, age
```









### Making Some Adjustments

Now let's drop the columns in `data` that contain features we are not interested in. Since `loan_id`s are not informative for predicting new loans, we can ignore them. Additionally, the information in `effective_date`, `due_date`, and `paid_off_time` that matters for repayment is already summarized by `past_due_days`. It is unlikely that the specific date on which a loan was due is predictive of success.

In [None]:
not_interested = ['loan_id', 'effective_date', 'due_date', 'paid_off_time']

data = data.drop(not_interested, axis=1)

Let's see our new data set.

In [None]:
data.head()

Unnamed: 0,loan_status,principal,terms,past_due_days,age,education,gender
0,PAIDOFF,1000,30,,45,High School or Below,male
1,PAIDOFF,1000,30,,50,Bachelors,female
2,PAIDOFF,1000,30,,33,Bachelors,female
3,PAIDOFF,1000,15,,27,College,male
4,PAIDOFF,1000,30,,28,College,female


Did you notice that `past_due_days` has `NaN` values?

**Write-up!** Why might some of the values in `past_due_days` be `NaN`?  With your neighbor, discuss what we should do about these values and note your conclusion below.

Because the borrower paid the loan off on or before the due date, so there are no past-due days to record.

**Try this!** Replace the values in `past_due_days` with a reasonable value. `HINT` you can use the `fillna` function on `DataFrame`s to do this.

In [None]:
# your code here
data['past_due_days'] = data['past_due_days'].fillna(value=0)

Let's see if it worked.

In [None]:
data.head()

Unnamed: 0,loan_status,principal,terms,past_due_days,age,education,gender
0,PAIDOFF,1000,30,0.0,45,High School or Below,male
1,PAIDOFF,1000,30,0.0,50,Bachelors,female
2,PAIDOFF,1000,30,0.0,33,Bachelors,female
3,PAIDOFF,1000,15,0.0,27,College,male
4,PAIDOFF,1000,30,0.0,28,College,female


Nice!
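
Looking at the first five rows only shows that those rows were filled. A more complete check is to count the missing values with `isna().sum()` before and after calling `fillna`. The sketch below assumes a small made-up column (`toy` is hypothetical) so that it runs by itself.

```python
import numpy as np
import pandas as pd

# Hypothetical values: NaN marks loans that were never past due.
toy = pd.DataFrame({'past_due_days': [np.nan, 3.0, np.nan, 14.0]})

# Count missing entries, fill them with 0 days, then count again.
missing_before = int(toy['past_due_days'].isna().sum())
toy['past_due_days'] = toy['past_due_days'].fillna(0)
missing_after = int(toy['past_due_days'].isna().sum())

print(missing_before, missing_after)
```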

### Visualizing the Data Set

Now that we have narrowed down the features we want to use, let's visualize them.

**Try this!** For each feature, make a new cell below and create a plot that we can use to understand the values of that feature. These plots should be appropriate for the type of each feature (e.g. use a bar plot for categorical features). Ensure that you have all the components of a nice plot, including axes labels, a legend, and a title. Also include a `raw` cell below each, describing what you see. `HINT` you can copy and paste groups of cells by shift-clicking them on the left.

In [None]:
# import python library matplotlib
!pip install matplotlib



In [None]:
import matplotlib.pyplot as plt
# this cell is free!

# your code here

# Histogram for Age
ages = data['age']
bins = [10, 20, 30, 40, 50, 60]

plt.hist(ages, bins, histtype='bar', rwidth=0.8)

plt.xlabel('Age (years)')
plt.ylabel('Number of borrowers')
plt.title('Feature: Age')
plt.show()


In [None]:
# Bar plot for Education
data['education'].value_counts().plot(kind='bar');

In [None]:
data['gender'].value_counts().plot(kind='bar');

In [None]:
# Use "Raw" cells to describe what you see
# Your response here:
# Age: most borrowers are between 20 and 40 years old
# Education: the most common education level among borrowers is College
# Gender: there are far more male borrowers than female borrowers
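
Bar plots give a visual impression of how unbalanced a feature is, and the same observation can be expressed in numbers. The sketch below is one possible way to do that with `value_counts(normalize=True)` and `pd.crosstab`. The four-row frame `toy` is made up for illustration and does not reflect the real proportions in the loan data.

```python
import pandas as pd

# Hypothetical rows, not the real loan records.
toy = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female'],
    'loan_status': ['PAIDOFF', 'COLLECTION', 'PAIDOFF', 'PAIDOFF'],
})

# Share of each gender among all borrowers (fractions that sum to 1).
gender_share = toy['gender'].value_counts(normalize=True)

# Count of each loan status within each gender.
status_by_gender = pd.crosstab(toy['gender'], toy['loan_status'])

print(gender_share)
print(status_by_gender)
```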

## 3. Building a Model

Now that we have a sense for the nuances of our dataset we can try building some models.

Before we continue, we will need to turn our categorical features into numbers instead of the string values that they currently have. As a reminder, this is what our dataset looks like right now.

In [None]:
data

Unnamed: 0,loan_status,principal,terms,past_due_days,age,education,gender
0,PAIDOFF,1000,30,0.0,45,High School or Below,male
1,PAIDOFF,1000,30,0.0,50,Bachelors,female
2,PAIDOFF,1000,30,0.0,33,Bachelors,female
3,PAIDOFF,1000,15,0.0,27,College,male
4,PAIDOFF,1000,30,0.0,28,College,female


An easy way to do this encoding is to use the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from `sklearn`. In the cell below, we create a list called `categorical` containing the names of the columns corresponding to the categorical features in our dataset. We then create an instance of a `LabelEncoder` and use it to transform the categorical features.

In [None]:
# install the scikit-learn library (imported as `sklearn`)
!pip install scikit-learn



In [None]:
from sklearn.preprocessing import LabelEncoder

categorical = ['loan_status', 'education', 'gender']

# create an instance of a LabelEncoder
encoder = LabelEncoder()

# make a copy of our data
encoded = data.copy()

# apply the encoder's `fit_transform` method to the values for each categorical
# feature column
encoded[categorical] = data[categorical].apply(encoder.fit_transform)

Let's take a look at the results.

In [None]:
encoded

Unnamed: 0,loan_status,principal,terms,past_due_days,age,education,gender
0,2,1000,30,0.0,45,2,1
1,2,1000,30,0.0,50,0,0
2,2,1000,30,0.0,33,0,0
3,2,1000,15,0.0,27,1,1
4,2,1000,30,0.0,28,1,0
...,...,...,...,...,...,...,...
495,1,1000,30,3.0,28,2,1
496,1,1000,15,14.0,26,2,1
497,1,800,15,3.0,30,1,1
498,1,1000,30,1.0,38,1,0
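
The encoded table no longer shows which integer stands for which category. Because `apply(encoder.fit_transform)` refits the single `encoder` on every column, its `classes_` attribute afterwards describes only the last column it saw. A minimal sketch of one alternative, assuming we keep a separate encoder per column, is shown below. The names `toy`, `encoders`, and `mappings` are hypothetical and the three rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows using the category names that appear in the loan data.
toy = pd.DataFrame({
    'education': ['College', 'Bachelors', 'High School or Below'],
    'gender': ['male', 'female', 'male'],
})

encoders = {}
encoded_toy = toy.copy()
for column in toy.columns:
    # One encoder per column, so each keeps its own `classes_`.
    encoders[column] = LabelEncoder()
    encoded_toy[column] = encoders[column].fit_transform(toy[column])

# `classes_` is sorted; the position of a label is its integer code.
mappings = {
    column: {code: str(label) for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}
print(mappings['gender'])
print(mappings['education'])
```

Since `LabelEncoder` assigns codes in sorted order of the labels, `female` comes before `male`, which is why `gender == 1` selects men later in the notebook.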


Next let's separate our features (X) from our target variable, `loan_status` (y).

In [None]:
X, y = encoded.loc[:, encoded.columns != 'loan_status'], encoded.loan_status

### Establishing a Baseline

Now we're ready to start building models. First, let's create a train/test split of our data.

Notice that `test_size=0.2` means that 80% of our data goes into the training subset and the other 20% into the testing subset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3)
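
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly equal in both subsets, which matters when one class is much rarer than another. The sketch below illustrates this on made-up data (`X_toy` and `y_toy` are hypothetical, with an 80/20 class balance chosen for the example).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 examples of class 1, 20 of class 0.
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series([1] * 80 + [0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=3)

# Sizes of the two subsets.
print(len(X_tr), len(X_te))

# Fraction of class 1 in each subset; stratification keeps them aligned.
print(y_tr.mean(), y_te.mean())
```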

Then, let's train and evaluate a LogisticRegression model.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', multi_class='auto')
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

**Try this!** In the cell below, evaluate the model's performance on the testing set.

In [None]:
# your code here
print(f'''
validation score: {model.score(X_test, y_test)}
''')


validation score: 0.97



**Write-up!** How does our model perform on the test set?

A validation score of 0.97 means the model predicts the correct `loan_status` for 97% of the examples in the test set, so it performs well on the test data overall.
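
For a classifier, `model.score` returns the mean accuracy: the fraction of examples whose predicted label equals the true label. The sketch below checks that equivalence on a small made-up dataset (`X_toy`, `y_toy`, and `toy_model` are hypothetical and unrelated to the loan data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical one-feature toy data (not the loan data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression(solver='liblinear').fit(X_toy, y_toy)
predictions = toy_model.predict(X_toy)

# Three ways to compute the same number.
score = toy_model.score(X_toy, y_toy)
manual = float(np.mean(predictions == y_toy))
from_metric = accuracy_score(y_toy, predictions)

print(score, manual, from_metric)
```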

Let's also try looking at the model's performance on test examples of different genders.

In [None]:
print(f'''
validation (men) score: {model.score(X_test[X_test['gender'] == 1], y_test[X_test['gender'] == 1]):0.3f}
validation (women) score: {model.score(X_test[X_test['gender'] == 0], y_test[X_test['gender'] == 0]):0.3f}
''')


validation (men) score: 0.988
validation (women) score: 0.889



Yikes!

**Write-up!** What do you notice about these scores? How do these compare with the initial score we saw for the entire test set? What does this imply about our model?
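
The per-gender comparison above can be generalized to any grouping column. The helper below is a sketch, not part of the original exercise: `accuracy_by_group` is a hypothetical name, and the labels, predictions, and groups are made up. It also reports how many examples each group contains, since an accuracy computed on a handful of examples is much less reliable than one computed on many.

```python
import pandas as pd

def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy and example count for each value in `groups`."""
    frame = pd.DataFrame({
        'correct': pd.Series(y_true).to_numpy() == pd.Series(y_pred).to_numpy(),
        'group': pd.Series(groups).to_numpy(),
    })
    # Mean of a boolean column is the fraction of correct predictions.
    return frame.groupby('group')['correct'].agg(accuracy='mean', n='count')

# Hypothetical labels, predictions, and encoded gender (1 = male, 0 = female).
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1]
gender = [1, 1, 1, 1, 0, 0]

report = accuracy_by_group(y_true, y_pred, gender)
print(report)
```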

### Dropping Gender

So our model is biased with respect to gender and gender is a feature of the model. Would it help to ignore the gender feature during training? Let's try it out.

Let's start by creating another train/test split, but this time using a copy of `X` that doesn't include `gender`.

In [None]:
X_without_gender = X.drop(['gender'], axis=1)

X_train, X_test, y_train, y_test = \
    train_test_split(X_without_gender, y, test_size=0.2, stratify=y, random_state=3)

Let's see what `X_train` looks like now.

In [None]:
X_train.head()

Unnamed: 0,principal,terms,past_due_days,age,education
215,1000,30,0.0,29,1
196,1000,30,0.0,29,1
118,1000,30,0.0,35,0
432,800,7,2.0,34,0
496,1000,15,14.0,26,2


Now let's repeat our procedure for our baseline experiment.

In [None]:
model = LogisticRegression(solver='liblinear', multi_class='auto')
model.fit(X_train, y_train)

# look up each test example's gender by index label (`loc`, not positional `iloc`)
gender_test = X.loc[X_test.index, 'gender']

print(f'''

validation score: {model.score(X_test, y_test)}
validation (men) score: {model.score(X_test[gender_test == 1], y_test[gender_test == 1]):0.3f}
validation (women) score: {model.score(X_test[gender_test == 0], y_test[gender_test == 0]):0.3f}
''')



validation score: 0.97
validation (men) score: 0.988
validation (women) score: 0.889



The results are the same: removing `gender` from the features did not change the scores for either group. We should take care to have a representative sample of each gender in order to achieve similarly accurate outcomes for both.


**Write-up!** With your neighbor, discuss what this might imply about our model and our data. Also, discuss why it may or may not be a good idea to ignore "protected variables" like "gender" when training a model. Record your response below.

## 4. Becoming Data and Fairness Aware

Discussion with the class on machine learning fairness. 

Refer to the Lesson Plan for instructions.