# Week 3 Homework: Predicting Loan Approval Using Logistic Regression

In this homework, we will be training a classifier to predict whether or not someone is a good candidate for a loan using machine learning. We have access to various features about each person and whether or not they are considered to be good candidates. We will use this as training data for a simple logistic regression classifier.

Predicting loan eligibility with machine learning is a sensitive subject. The data used to make such decisions often includes protected class attributes, such as race and gender. Remember: bias in, bias out. Data collected from humans often has societal bias embedded in it, so it is important that your model doesn't accidentally reflect this bias.

The focus of this assignment will be seeing how we can use the same techniques/code from class for an entirely new domain. We will also examine the gender bias of our algorithm at the end of the assignment!

In [0]:
import pandas as pd
import numpy as np

np.random.seed(42)
np.set_printoptions(suppress=True)

import requests
import zipfile
import io
# Download class resources...
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit2_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import shuffle

We will be working with the German Credit Dataset. Each row represents a person, and each column gives their value for a specific feature. The features are shown at the top. The application of the resulting prediction task could be an automated way of approving loans!

Run the following cell to load the data and see a sample of it.

In [0]:
credit_df = pd.read_csv('unit2_resources/german_credit_data.csv')
# Remove unneeded columns from data.
# Note: `columns=` is used because recent pandas versions no longer accept
# the axis as a positional argument to `drop`.
credit_df = credit_df.drop(columns=['Unnamed: 0', 'Saving accounts', 'Checking account',
                                    'Credit amount', 'Duration', 'Age'])
credit_df.head()

We can see that each individual has some features associated with them (sex, job, housing, purpose), along with a ground-truth risk value. While this dataset isn't difficult for a human to interpret, it is not yet well-suited as input to a simple machine learning model. Why?

1) Our logistic regression model makes predictions on numbers, not text. Given (male, 2, own, radio/TV), a logistic regression model cannot take a dot product with text values.

2) Even for numerical features (e.g., "Job") it is often more useful to split the feature up into a one-hot encoding, which is a series of 0's with one 1 in a position indicating the value of a feature. For example, since "Job" takes on 6 values, if we want to indicate that an individual has the second of the 6 jobs, we encode this as [0, 1, 0, 0, 0, 0] (or a vector with all 0's except a 1 in the second position). At a high level, this allows our model to separate out the effects of different job types better. You can find a more detailed explanation of one-hot-encoding and why it is useful [here](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f). 
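
To make the encoding concrete, here is a minimal sketch using plain NumPy. The helper name `one_hot` is hypothetical and is not part of the course code; it assumes values are indexed from 0, so the "second of the 6 jobs" is index 1.

```python
import numpy as np

def one_hot(index, num_values):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(num_values, dtype=int)
    vec[index] = 1
    return vec

# The second of 6 possible job values (index 1 when counting from 0).
second_job = one_hot(1, 6)
print(second_job.tolist())  # [0, 1, 0, 0, 0, 0]
```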

We perform the necessary data processing steps for you below.

In [0]:
def one_hot_encode_dataset(credit_df):
    credit_df = credit_df.replace({'Sex': {'male': 0, 'female': 1}})
    credit_df = credit_df.replace({'Housing': {'own': 0, 'rent': 1, 'free': 2}})
    credit_df = credit_df.replace({'Purpose': {'car': 0, 
               'furniture/equipment': 1, 'radio/TV': 2, 
               'domestic appliances': 3, 'repairs': 4, 'education': 5, 
               'business': 6, 'vacation/others': 7}})
    credit_df = credit_df.replace({'Risk': {'good': 1, 'bad': 0}})
    enc = OneHotEncoder(categories='auto')
    enc.fit(credit_df.values)
    dataset = enc.transform(credit_df.values).toarray()
    
    # List of binary columns in the final data, where the last column is the
    # risk to be predicted.
    columns = ['female', 'job:1', 'job:2', 'job:3', 'job:4', 'job:5', 'job:6', 
               'housing:own', 'housing:rent', 'housing:free', 
               'purpose:car', 'purpose:furniture/equipment', 'purpose:radio/TV', 
               'purpose:domestic appliances', 'purpose:repairs', 
               'purpose:education', 'purpose:business', 
               'purpose:vacation/others', 'risk:good']
    
    # Convert back to dataframe for easy viewing.
    processed_credit_df = pd.DataFrame(dataset, columns=columns)
    
    X, y = dataset[:, :-1], dataset[:, -1]
    
    return shuffle(X, y, random_state=0), processed_credit_df

(X, y), processed_credit_df = one_hot_encode_dataset(credit_df)

X_train, X_dev, X_test = X[:800], X[800:900], X[900:]
y_train, y_dev, y_test = y[:800], y[800:900], y[900:]

print("Training data shape", X_train.shape, y_train.shape)
print("Dev data shape", X_dev.shape, y_dev.shape)
print("Test data shape", X_test.shape, y_test.shape)


In [0]:
processed_credit_df.head()

Let's figure out the class imbalance so we have a baseline to better understand our model's performance. Hint: you want to find the mean value in *y_train*.

In [0]:
### YOUR CODE HERE ###

### END CODE ###

**Expected output**

0.70375
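
Why is this number a useful baseline? A classifier that always predicts the majority class is correct exactly as often as that class appears. The sketch below illustrates this on a small made-up label array (not the homework data); `toy_labels` is a hypothetical name.

```python
import numpy as np

# Made-up labels for illustration only: 1 = good risk, 0 = bad risk.
toy_labels = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Fraction of positive labels.
positive_rate = toy_labels.mean()

# A trivial classifier that always predicts 1 is right whenever the label
# is 1, so its accuracy equals the positive rate.
always_good = np.ones_like(toy_labels)
baseline_accuracy = (always_good == toy_labels).mean()

print(positive_rate, baseline_accuracy)
```

A trained model should beat this accuracy before we consider it useful.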

Fill in our familiar logistic regression helpers below. These are almost the same as in this week's exercises, but since there are no word vectors, we use the input *x* directly as a vector, rather than retrieving the word vector for a word.
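
For reference, these are the standard logistic regression formulas for a single example, assuming the binary cross-entropy loss. Check them against the versions from this week's exercises, which take precedence. Here $w$ stands for *weights*, $b$ for *bias*, and $\sigma$ for *sigmoid*:

$$
\hat{y} = \sigma(w \cdot x + b), \qquad
L(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]
$$

$$
\frac{\partial L}{\partial w} = (\hat{y} - y)\, x, \qquad
\frac{\partial L}{\partial b} = \hat{y} - y
$$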

In [0]:
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))
  
def compute_logistic_regression(x, weights, bias):
  ### YOUR CODE HERE ###
  return None
  ### END CODE ###
  
def get_loss(y, y_hat):
  ### YOUR CODE HERE ###
  return None
  ### END CODE ###
  
def get_weight_gradient(y, y_hat, x):
  ### YOUR CODE HERE ###
  return None
  ### END CODE ###
  
def get_bias_gradient(y, y_hat, x):
  ### YOUR CODE HERE ###
  return None
  ### END CODE ###

We also have our handy *evaluate_model* function, which is unchanged from before. Make sure you still understand it!

In [0]:
def evaluate_model(eval_data, weights, bias):
    num_examples = len(eval_data)
    total_correct = 0.0
    true_positives = 0.0
    false_positives = 0.0
    false_negatives = 0.0
    for i in range(num_examples):
        x, y = eval_data[i]
        pred = compute_logistic_regression(x, weights, bias)
        
        total_correct += 1 if (pred > .5 and y == 1 or pred <= .5 and y == 0) else 0
        true_positives += 1 if pred > .5 and y == 1 else 0
        false_positives += 1 if pred > .5 and y == 0 else 0
        false_negatives += 1 if pred <= .5 and y == 1 else 0
    print("Evaluation accuracy: ", total_correct / num_examples)
    print("Precision: ", true_positives / (true_positives + false_positives))
    print("Recall: ", true_positives / (true_positives + false_negatives))
    print()
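
As a quick check on what these metrics mean, here is a small sketch that computes them from made-up counts. The function name `precision_recall` and the numbers are hypothetical and are not taken from the homework data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts."""
    # Precision: of everything predicted positive, how much was truly positive?
    precision = true_positives / (true_positives + false_positives)
    # Recall: of everything truly positive, how much did we predict positive?
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Made-up counts: 8 correct approvals, 2 wrong approvals,
# and 4 good candidates the model missed.
precision, recall = precision_recall(8, 2, 4)
print(precision, recall)
```

Note that, like *evaluate_model*, this sketch raises a division-by-zero error when the model makes no positive predictions.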

Fill in *fit_logistic_regression* below. This will be similar to code you've written before, but we will also be evaluating on our dev dataset every 10 epochs (iterations through the training data). Note that *weights* should have dimensionality 18, since this is the size of each input vector.

In [0]:
def fit_logistic_regression(training_data, dev_data, NUM_EPOCHS=50, LEARNING_RATE=0.0005):
    np.random.seed(42)
    # YOUR CODE HERE - initialize weights and bias

    # END CODE
    
    for epoch in range(NUM_EPOCHS):
        loss = 0
        for example in training_data:
            x, y = example
            # YOUR CODE HERE

            
            
            
            
            
            # END CODE
        if epoch % 10 == 0:
            print("Epoch %d, loss = %f" % (epoch, loss))   
            print("Evaluating model on dev data...")
            ### YOUR CODE HERE ###

            ### END CODE ###
    return weights, bias
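
As a reminder of the overall shape of stochastic gradient descent, the sketch below runs a single update on one made-up example. It assumes the binary cross-entropy gradients, a zero initialization, and hypothetical names (`toy_x`, `toy_y`, `toy_weights`, `toy_bias`). Your homework solution should follow the initialization and conventions from class, so its numbers will differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# Made-up example with 3 features and a positive label.
toy_x = np.array([1.0, 0.0, 1.0])
toy_y = 1.0

# Assumed zero initialization, for illustration only.
toy_weights = np.zeros(3)
toy_bias = 0.0
learning_rate = 0.1

# Forward pass: predicted probability of the positive class.
y_hat = sigmoid(np.dot(toy_weights, toy_x) + toy_bias)

# Gradients of the cross-entropy loss with respect to weights and bias.
weight_gradient = (y_hat - toy_y) * toy_x
bias_gradient = y_hat - toy_y

# Gradient descent step: move against the gradient.
toy_weights = toy_weights - learning_rate * weight_gradient
toy_bias = toy_bias - learning_rate * bias_gradient

print(toy_weights, toy_bias)
```

Because the label is 1 and the initial prediction is 0.5, the update nudges the bias and the weights of the active features upward, which raises the predicted probability for this example.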

We can then call our new fitting function to train while evaluating on dev data.

In [0]:
training_data = list(zip(X_train, y_train))
dev_data = list(zip(X_dev, y_dev))
weights, bias = fit_logistic_regression(training_data, dev_data)

**First few lines of expected output:**

Epoch 0, loss = 600.347192

Evaluating model on dev data...

Evaluation accuracy:  0.67

Precision:  0.7065217391304348

Recall:  0.9154929577464789

**Last few lines of expected output:**

Epoch 40, loss = 231.450579

Evaluating model on dev data...

Evaluation accuracy:  0.91

Precision:  0.9305555555555556

Recall:  0.9436619718309859



Below, we evaluate our trained model on test data. This is where we see how we did on unseen data, as in a real-world setting!

In [0]:
test_data = list(zip(X_test, y_test))
evaluate_model(test_data, weights, bias)

**Expected Output**

Evaluation accuracy:  0.91

Precision:  0.9253731343283582

Recall:  0.9393939393939394


We can see that we're doing well on both dev and test data! While this is a small toy dataset, our experiments suggest that even simple machine learning models are capable of making predictions about people that can have long-lasting effects (whether I get a loan can impact my financial situation years down the road). In coming weeks (particularly weeks 5 and 6), we will learn about the implications of potentially biased models that make predictions about people.

Let's now investigate how important the "gender" feature is to the model in making the decision. First, we will grab all females from the test set:

In [0]:
X_test_female = X_test[X_test[:, 0] == 1]
print("There are %i females in the test set." % len(X_test_female))

**Expected Output**

There are 68 females in the test set.


Now, let's see how many of them had loans approved:

In [0]:
results = [compute_logistic_regression(x, weights, bias) for x in X_test_female]
total_good = sum(1 if result > .5 else 0 for result in results)
print("%i females had loans approved." % total_good)

**Expected Output**

44 females had loans approved.

Finally, let's flip the gender feature from female to male. We will feed exactly the same features into the model, changing only the gender. Hopefully, the number of loans approved will stay the same!

In [0]:
X_test_all_male = X_test_female.copy()
X_test_all_male[:, 0] = 0

results = [compute_logistic_regression(x, weights, bias) for x in X_test_all_male]
total_good = sum(1 if result > .5 else 0 for result in results)
print("%i females with gender changed had loans approved." % total_good)

**Expected Output**

39 females with gender changed had loans approved.



Oh no! It seems the number of loans approved has decreased! In other words, with everything else held the same, changing only the gender feature causes some people (anyone not marked as female, such as males and non-binary people) to be disadvantaged by our algorithm. Although the difference looks small, remember that the test set has just 100 people. Five decisions changed, which is 5% of the test set (and about 7% of the 68 people we examined), based solely on a protected class attribute. That's huge!

This is why it's important to debias your dataset and do a thorough hyperparameter sweep. See if you can change the hyperparameters to get a fairer model!
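
To compare models while you experiment, it can help to look beyond the two approval totals and count how many individual decisions flip when one binary feature is changed. The sketch below does this under stated assumptions: the function `count_flipped_decisions`, the toy weights, and the toy inputs are all hypothetical and made up for illustration. They are not the trained model or the homework data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def count_flipped_decisions(X, weights, bias, feature_index, new_value):
    """Count rows whose thresholded prediction changes when one feature is overwritten."""
    X_changed = X.copy()
    X_changed[:, feature_index] = new_value
    before = sigmoid(X @ weights + bias) > 0.5
    after = sigmoid(X_changed @ weights + bias) > 0.5
    return int(np.sum(before != after))

# Made-up model in which feature 0 carries a positive weight.
toy_weights = np.array([1.0, 2.0, -2.0])
toy_bias = -0.5

# Made-up inputs, all with feature 0 set to 1.
toy_X = np.array([
    [1.0, 0.0, 0.0],  # logit 0.5 -> approved; -0.5 after the change -> flips
    [1.0, 1.0, 0.0],  # logit 2.5 -> approved; 1.5 after the change -> stays
    [1.0, 0.0, 1.0],  # logit -1.5 -> rejected; -2.5 after the change -> stays
])

print(count_flipped_decisions(toy_X, toy_weights, toy_bias, 0, 0.0))  # 1
```

A count of zero would mean that the feature never changes a decision for the examples tested.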