# Seldonian Algorithm Application on Predicting Students' GPA in Brazil

### Author: Dasha Asienga

We've seen how the Seldonian Algorithm can be applied to synthetic data in a regression setting. Let's now apply it to a real-world data set, still within the regression setting. The data set contains anonymized records of applicants to a university in Brazil (the Federal University of Rio Grande do Sul): each applicant's scores on nine exams taken as part of the application process, together with their GPA during the first three semesters at the university.

The codebook for the data is contained in the `Data Sets` folder of the repository. 

There are two sensitive variables: race and gender.

We'll largely mimic the set-up of the tutorial with a few edits to suit our data set, especially when defining our fairness constraints. 

### Import Necessary Libraries

In [1]:
import math
import numpy as np
import sys
import pandas as pd # Work with data sets 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import t
from scipy.optimize import minimize # The black-box optimization algorithm used to find a candidate solution 

In [2]:
np.set_printoptions(precision=5, suppress=True)

### Read in the Data

Next, let's read in the data set.

In [3]:
columns = ["Gender", "Physics", "Biology", 
    "History", "Second_Language", "Geography", 
    "Literature", "Portuguese_and_Essay", 
    "Math", "Chemistry", "GPA"]

df = pd.read_csv("DataSets/data.csv", header = None, names=columns)

In [4]:
df.head()

|   | Gender | Physics | Biology | History | Second_Language | Geography | Literature | Portuguese_and_Essay | Math | Chemistry | GPA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 622.6 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.8 | 1.33333 |
| 1 | 1 | 538.0 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 | 2.98333 |
| 2 | 1 | 455.18 | 440.0 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 | 1.97333 |
| 3 | 0 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.4 | 592.41 | 783.76 | 588.26 | 2.53333 |
| 4 | 1 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 | 1.58667 |


Let's extract the predictor variables.

In [5]:
X = df.drop(columns=["GPA"])

In [6]:
X.head()

|   | Gender | Physics | Biology | History | Second_Language | Geography | Literature | Portuguese_and_Essay | Math | Chemistry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 622.6 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.8 |
| 1 | 1 | 538.0 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 |
| 2 | 1 | 455.18 | 440.0 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 |
| 3 | 0 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.4 | 592.41 | 783.76 | 588.26 |
| 4 | 1 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 |


Let's now convert this to a `numpy` array. 

In [7]:
X = X.values

In [8]:
X

array([[  0.  , 622.6 , 491.56, ..., 711.37, 731.31, 509.8 ],
       [  1.  , 538.  , 490.58, ..., 527.58, 379.14, 488.64],
       [  1.  , 455.18, 440.  , ..., 475.63, 476.11, 407.15],
       ...,
       [  0.  , 798.75, 817.58, ..., 662.05, 773.15, 835.25],
       [  0.  , 527.66, 443.82, ..., 583.41, 395.46, 509.8 ],
       [  0.  , 512.56, 415.41, ..., 538.35, 448.02, 496.39]])

Let's extract the response variable.

In [9]:
Y = df["GPA"]
Y.head()

0    1.33333
1    2.98333
2    1.97333
3    2.53333
4    1.58667
Name: GPA, dtype: float64

In [10]:
Y = Y.values
Y

array([1.33333, 2.98333, 1.97333, ..., 3.75   , 2.5    , 3.16667])

Finally, let's extract the sensitive attribute.

In [11]:
gender = df["Gender"]
gender.head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

In [12]:
gender = gender.values
gender

array([0, 1, 1, ..., 0, 0, 0])

### Implement Simple Helper Functions

In [13]:
# Inverse CDF (quantile function) of Student's t distribution with nu degrees of freedom
def tinv(p, nu):
    return t.ppf(p, nu)

In [14]:
def stddev(v):
    n = v.size
    variance = (np.var(v) * n) / (n-1) # Variance with Bessel's correction
    return np.sqrt(variance)           # Compute the standard deviation
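
As a quick sanity check, the Bessel-corrected standard deviation computed by `stddev` should agree with NumPy's built-in `np.std` when `ddof=1` is passed. The sketch below restates `stddev` so that it runs on its own; the sample values are made up purely for illustration.

```python
import numpy as np

def stddev(v):
    n = v.size
    variance = (np.var(v) * n) / (n - 1)  # Variance with Bessel's correction
    return np.sqrt(variance)

# Made-up sample, for illustration only
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

manual = stddev(sample)
builtin = np.std(sample, ddof=1)  # ddof=1 divides by n - 1 instead of n

print(manual, builtin, np.isclose(manual, builtin))
```

If the two values agree, `np.std(v, ddof=1)` could be used as a shorter equivalent.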

In [15]:
# (1 - delta)-confidence upper bound on the mean of v, based on a one-sided Student's t-test
def ttestUpperBound(v, delta):
    n  = v.size
    res = v.mean() + stddev(v) / math.sqrt(n) * tinv(1.0 - delta, n - 1)
    return res

In [16]:
def predictTTestUpperBound(v, delta, k):
    # Conservative prediction of what the upper bound will be in the safety test for a given constraint
    res = v.mean() + 2.0 * stddev(v) / math.sqrt(k) * tinv(1.0 - delta, k - 1)
    return res
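
To see how the two bounds relate, the following sketch evaluates both on the same synthetic sample. It is only an illustration: the normally distributed `g_samples` are made up, `stddev` is restated with `np.std(..., ddof=1)` for brevity, and `k` is set to the sample size so the two bounds are directly comparable.

```python
import math
import numpy as np
from scipy.stats import t

def stddev(v):
    return np.std(v, ddof=1)  # Sample standard deviation (Bessel's correction)

def ttestUpperBound(v, delta):
    n = v.size
    return v.mean() + stddev(v) / math.sqrt(n) * t.ppf(1.0 - delta, n - 1)

def predictTTestUpperBound(v, delta, k):
    return v.mean() + 2.0 * stddev(v) / math.sqrt(k) * t.ppf(1.0 - delta, k - 1)

rng = np.random.default_rng(0)  # Seeded generator; the values below are synthetic
g_samples = rng.normal(loc=-0.5, scale=1.0, size=200)

actual = ttestUpperBound(g_samples, delta=0.1)
predicted = predictTTestUpperBound(g_samples, delta=0.1, k=g_samples.size)

print("sample mean:    ", g_samples.mean())
print("t-test bound:   ", actual)
print("predicted bound:", predicted)
```

With `k` equal to the sample size, the predicted bound sits exactly twice as far above the sample mean as the t-test bound, which is what makes it the more conservative of the two.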

### Implement the QSA

In [17]:
# Uses the weights in theta to predict the output value, y, associated with the provided x.
# This function assumes simple linear regression with a single predictor, so theta has two elements:
# the y-intercept (first parameter) and the slope (second parameter)
def predict(theta, x):
    return theta[0] + theta[1] * x
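
Because the GPA data has several predictors, a prediction function for it needs one slope per column. A minimal sketch of such a generalization follows; `predict_multi` and the toy values are hypothetical and not part of the original notebook. It assumes `theta[0]` is the intercept and `theta[1:]` holds the slopes in column order.

```python
import numpy as np

def predict_multi(theta, X):
    # Hypothetical multi-feature prediction: intercept plus the dot product
    # of each row of X with the slope coefficients theta[1:]
    return theta[0] + X @ theta[1:]

theta = np.array([1.0, 2.0, -1.0])          # Intercept 1, slopes 2 and -1 (made up)
X_demo = np.array([[1.0, 1.0],
                   [0.0, 3.0]])              # Two rows, two features (made up)

print(predict_multi(theta, X_demo))          # One prediction per row
print(predict_multi(theta, X_demo[0]))       # Also works for a single row
```

The same expression handles a full matrix or a single row, because `@` reduces over the last axis of `X` in both cases.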

In [18]:
# Estimator of the primary objective, in this case, the negative sample mean squared error
def fHat(theta, X, Y):
    n = X.size          # Number of points in the data set
    res = 0.0           # Used to store the sample MSE we are computing
    for i in range(n):  # For each point X[i] in the data set ...
        prediction = predict(theta, X[i])                # Get the prediction using theta
        res += (prediction - Y[i]) * (prediction - Y[i]) # Add the squared error to the result
    res /= n            # Divide by the number of points to obtain the sample mean squared error
    return -res         # Returns the negative sample mean squared error

In [19]:
# Returns unbiased estimates of g_1(theta), computed using the provided data
def gHat1(theta, X, Y):
    n = X.size          # Number of points in the data set
    res = np.zeros(n)   # We will get one estimate per point; initialize res to store these estimates
    for i in range(n):
        prediction = predict(theta, X[i])                   # Compute the prediction for the i-th data point
        res[i] = (prediction - Y[i]) * (prediction - Y[i])  # Compute the squared error for the i-th data point
    res = res - 2.0     # We want the MSE to be less than 2.0, so g(theta) = MSE-2.0
    return res

# Returns unbiased estimates of g_2(theta), computed using the provided data
def gHat2(theta, X, Y):
    n = X.size          # Number of points in the data set
    res = np.zeros(n)   # We will get one estimate per point; initialize res to store these estimates
    for i in range(n):
        prediction = predict(theta, X[i])                   # Compute the prediction for the i-th data point
        res[i] = (prediction - Y[i]) * (prediction - Y[i])  # Compute the squared error for the i-th data point
    res = 1.25 - res    # We want the MSE to be at least 1.25, so g(theta) = 1.25-MSE
    return res
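
For a two-dimensional `X`, the number of points is the number of rows, `X.shape[0]`, rather than `X.size`. The sketch below shows one way the objective and an upper-limit constraint could be written in vectorized form under that assumption. All names here (`predict_rows`, `squared_errors`, `fHat_rows`, `gHat_upper`) and the toy data are hypothetical.

```python
import numpy as np

def predict_rows(theta, X):
    # Assumed multi-feature prediction: intercept plus dot product with the slopes
    return theta[0] + X @ theta[1:]

def squared_errors(theta, X, Y):
    # One squared error per row of X
    return (predict_rows(theta, X) - Y) ** 2

def fHat_rows(theta, X, Y):
    # Negative sample mean squared error
    return -squared_errors(theta, X, Y).mean()

def gHat_upper(theta, X, Y, limit=2.0):
    # Per-point estimates of g(theta) = MSE - limit
    return squared_errors(theta, X, Y) - limit

# Tiny made-up data set: 3 rows, 2 features
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y_demo = np.array([1.0, 2.0, 5.0])
theta = np.array([0.0, 1.0, 2.0])

print(fHat_rows(theta, X_demo, Y_demo))
print(gHat_upper(theta, X_demo, Y_demo))
```

Each function returns one value per row, so the t-test helpers receive as many estimates as there are data points.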

In [39]:
# Run ordinary least squares linear regression on data (X,Y)
# Returns a 1-D array: the y-intercept followed by one slope coefficient per column of X
def leastSq(X, Y):
    reg = LinearRegression().fit(X, Y)
    theta0 = reg.intercept_  # Gets theta0, the y-intercept coefficient
    theta1 = reg.coef_       # Gets the slope coefficients, one per predictor
    return np.concatenate((np.array([theta0]), theta1))
In [21]:
# Our Quasi-Seldonian linear regression algorithm operating over data (X,Y).
# The pair of objects returned by QSA is the solution (first element) 
# and a Boolean flag indicating whether a solution was found (second element).
def QSA(X, Y, gHats, deltas):
    # Put 40% of the data in candidateData (D1), and the rest in safetyData (D2)
    candidateData_len = 0.40
    candidateData_X, safetyData_X, candidateData_Y, safetyData_Y = train_test_split(
        X, Y, test_size=1-candidateData_len, shuffle=False)

    # Get the candidate solution
    candidateSolution = getCandidateSolution(candidateData_X, candidateData_Y, gHats, deltas, safetyData_X.size)

    # Run the safety test
    passedSafety      = safetyTest(candidateSolution, safetyData_X, safetyData_Y, gHats, deltas)

    # Return the result and success flag
    return [candidateSolution, passedSafety]
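
The split used in `QSA` can be examined in isolation. The sketch below applies the same `train_test_split` call to a small made-up array and prints the resulting shapes, along with the difference between `.size` and `.shape[0]` for a two-dimensional array. The `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(30, dtype=float).reshape(10, 3)  # 10 rows, 3 features (made up)
Y_demo = np.arange(10, dtype=float)

# Same call pattern as in QSA: 40% candidate data, the rest safety data, no shuffling
cand_X, safe_X, cand_Y, safe_Y = train_test_split(
    X_demo, Y_demo, test_size=1 - 0.40, shuffle=False)

print(cand_X.shape, safe_X.shape)    # Row and column counts of each part
print(cand_Y)                        # With shuffle=False, the first rows go to D1
print(safe_X.size, safe_X.shape[0])  # Total elements vs. number of rows
```

With `shuffle=False` the candidate data is simply the leading block of rows, so any ordering in the file carries over into the split.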

In [22]:
# Run the safety test on a candidate solution. Returns true if the test is passed.
#   candidateSolution: the solution to test. 
#   (safetyData_X, safetyData_Y): data set D2 to be used in the safety test.
#   (gHats, deltas): vectors containing the behavioral constraints and confidence levels.
def safetyTest(candidateSolution, safetyData_X, safetyData_Y, gHats, deltas):

    for i in range(len(gHats)):  # Loop over behavioral constraints, checking each
        g         = gHats[i]  # The current behavioral constraint being checked
        delta     = deltas[i] # The confidence level of the constraint

        # This is a vector of unbiased estimates of g(candidateSolution) -- defined above
        g_samples = g(candidateSolution, safetyData_X, safetyData_Y)

        # Check if the i-th behavioral constraint is satisfied
        upperBound = ttestUpperBound(g_samples, delta)

        if upperBound > 0.0: # If the current constraint was not satisfied, the safety test failed
            return False

    # If we get here, all of the behavioral constraints were satisfied
    return True

In [23]:
# The objective function maximized by getCandidateSolution.
#     thetaToEvaluate: the candidate solution to evaluate.
#     (candidateData_X, candidateData_Y): the data set D1 used to evaluate the solution.
#     (gHats, deltas): vectors containing the behavioral constraints and confidence levels.
#     safetyDataSize: |D2|, used when computing the conservative upper bound on each behavioral constraint.
def candidateObjective(thetaToEvaluate, candidateData_X, candidateData_Y, gHats, deltas, safetyDataSize): 

    # Get the primary objective of the solution, fHat(thetaToEvaluate)
    result = fHat(thetaToEvaluate, candidateData_X, candidateData_Y)

    predictSafetyTest = True     # Prediction of what the safety test will return. Initialized to "True" = pass

    for i in range(len(gHats)):  # Loop over behavioral constraints, checking each
        g         = gHats[i]       # The current behavioral constraint being checked
        delta     = deltas[i]      # The confidence level of the constraint

        # This is a vector of unbiased estimates of g_i(thetaToEvaluate)
        g_samples = g(thetaToEvaluate, candidateData_X, candidateData_Y)

        # Get the conservative prediction of what the upper bound on g_i(thetaToEvaluate) will be in the safety test
        upperBound = predictTTestUpperBound(g_samples, delta, safetyDataSize)

        # We don't think the i-th constraint will pass the safety test if we return this candidate solution
        if upperBound > 0.0:

            if predictSafetyTest:
                # Set this flag to indicate that we don't think the safety test will pass
                predictSafetyTest = False

                # Put a barrier in the objective. Any solution that we think will fail the safety test will have a
                # large negative performance associated with it
                result = -100000.0

            # Add a shaping to the objective function that will push the search toward solutions that will pass
            # the prediction of the safety test
            result = result - upperBound

    # Negative because our optimizer (Powell) is a minimizer, but we want to maximize the candidate objective
    return -result

In [24]:
# Use the provided data to get a candidate solution expected to pass the safety test.
#    (candidateData_X, candidateData_Y): data used to compute a candidate solution.
#    (gHats, deltas): vectors containing the behavioral constraints and confidence levels.
#    safetyDataSize: |D2|, used when computing the conservative upper bound on each behavioral constraint.
def getCandidateSolution(candidateData_X, candidateData_Y, gHats, deltas, safetyDataSize):
  
    # Chooses the black-box optimizer we will use (Powell)
    minimizer_method = 'Powell'
    minimizer_options = {'disp': False}

    # Initial solution given to Powell: simple linear fit we'd get from ordinary least squares linear regression
    initialSolution = leastSq(candidateData_X, candidateData_Y)

    # Use Powell to get a candidate solution that tries to maximize candidateObjective
    res = minimize(candidateObjective, x0=initialSolution, method=minimizer_method, options=minimizer_options,
                   args=(candidateData_X, candidateData_Y, gHats, deltas, safetyDataSize))

    # Return the candidate solution we believe will pass the safety test
    return res.x
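
The pattern of maximizing an objective by handing its negation to a minimizer can be seen on a toy problem. The sketch below is not the candidate objective itself: `toy_objective`, `negated`, and `target` are hypothetical stand-ins, chosen so that the maximizer is known in advance.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(theta, target):
    # Quantity we want to MAXIMIZE: it peaks at zero when theta equals target
    return -np.sum((theta - target) ** 2)

def negated(theta, target):
    # Powell minimizes, so we hand it the negated objective
    return -toy_objective(theta, target)

target = np.array([1.0, -2.0])  # Made-up location of the maximum

# Extra arguments after theta are forwarded through args, as in getCandidateSolution
res = minimize(negated, x0=np.zeros(2), method='Powell',
               options={'disp': False}, args=(target,))

print(res.x)  # Should lie close to target
```

Powell's method uses no gradients, which suits an objective such as `candidateObjective` that has a discontinuous barrier term.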

In [25]:
def main():
    np.random.seed(0)  # Seed NumPy's global random number generator for reproducibility

    # Create the behavioral constraints - each is a gHat function and a confidence level delta
    gHats  = [gHat1, gHat2] # The 1st gHat requires MSE < 2.0. The 2nd gHat requires MSE > 1.25
    deltas = [0.1, 0.1]

    (result, found) = QSA(X, Y, gHats, deltas) # Run the Quasi-Seldonian algorithm
    
    if found:
        print("A solution was found: [%.10f, %.10f]" % (result[0], result[1]))
        print("fHat of solution (computed over all data, D):", fHat(result, X, Y))
    else:
        print("No solution found")

In [40]:
main()

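IndexError: index 17321 is out of bounds for axis 0 with size 17321

The run fails because the functions above were written for a single predictor. `fHat`, `gHat1`, and `gHat2` use `X.size` as the number of points, but for a two-dimensional `X`, `X.size` counts every element (rows times columns), so the loop index runs past the last of the 17,321 rows in the candidate data; `X.shape[0]` gives the row count. The same applies to `safetyData_X.size` in `QSA`. In addition, `predict` uses only `theta[0]` and `theta[1]`, so it needs to be generalized to `theta[0] + X @ theta[1:]` before the algorithm can use all ten predictors.

As a check, `leastSq` on its own handles the full predictor matrix and returns the intercept followed by ten slope coefficients: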

In [41]:
leastSq(X,Y)

array([ 0.5067 , -0.34821,  0.00008,  0.00043,  0.00065,  0.00039,
        0.0003 ,  0.00142,  0.00089, -0.00006,  0.0002 ])