Part of our skepticism about mechanized models of assessment rests on a simple yet overlooked fact: simple code can create inequitable situations. Below, we offer actual, runnable Python code (targeted at programming novices) that implements an adaptive assessment for a grade school mathematics student. The way to wade through this code, regardless of your programming proficiency, is to notice how basic, core decisions (ones that a programmer might not even consider!) generate inequitable outcomes for students. Assessment inequity does not require malevolence; almost everyone who is in education means well.
While building an adaptive assessment may sound complicated, it does not need to be. It is our hope that even non-programmers will be able to read the program code below and see points at which good intentions generate problematic situations. Berland teaches a class on computational research in education: an adaptive assessment like the ones below can be built by a relatively novice student for a week’s homework. As with everything in the book, building an assessment that exposes bias and that leverages, or at least considers, notions of equity in its structure is not necessarily harder.
Let’s say you start with a simple math problem. In Canvas, a system used in classes at both Stanford and UW–Madison, instructors with no programming background can create a problem generator. The text below offers such a tool in Python for a basic arithmetic class (if the code looks intimidating, the text that follows each # is a step-by-step guide explaining what each line does and how):


In [4]:

# we will use randomness in this code, so we need to "import"
# what's called a "library" - you can look up any number of these
# at python.org - there are hundreds.
import random
 
# def declares a function in python
# a function is something that can be called repeatedly
# calling simple_addend() returns an integer between 0 and 9
def simple_addend():
   # random.choice() takes a sequence and returns a random element
   # range(N) gives the integers from 0 up to (but not including) N
   return random.choice(range(10))
x = simple_addend() # random integer, eg 5
y = simple_addend() # random integer, eg 7
 
# int() takes a string (ie text) or number (eg 3.442) and tries to give you the integer of that
# e.g. int(2.332) == 2 and int("11") == 11
# input("hello!") prints that text (here: hello!) and then waits for user input
# int(input(text)) turns the user input into an integer; if the input is
# not a whole number (eg "abc" or "2.5"), python stops with a ValueError
response = int(input('\n{} + {} = ?\n'.format(x,y)))
if response == (x+y):
   print("correct")
else:
   print("incorrect")
 
# ** a sample run **
# python> 4 + 7 = ?
# user> 11
# python> correct



9 + 8 = ?
17
correct



That process is not particularly tricky. It becomes only slightly trickier when the generated problem needs to be adaptive.

In [3]:
# this time, addend() takes a parameter, difficulty
def addend(difficulty):
   # we never want difficulty below 1,
   # so we always take the maximum of
   # the two options: difficulty and 1
   difficulty = max(1,difficulty)
  
   # we multiply difficulty by our random
   # integer, likely making it
   # a more difficult number to add
   return difficulty * random.choice(range(10))
# difficulty starts at 3
# since our range above is 0-9, addends are now
# multiples of 3 from 0 to 27
difficulty = 3
x = addend(difficulty)
y = addend(difficulty)
response = int(input('\n{} + {} = ?\n'.format(x,y)))
if response == (x+y):
   # if correct, we make things harder for next time by
   # adding one to the difficulty
   difficulty += 1
   print("correct")
else:
   # if incorrect, we make things easier for next time by
   # subtracting one from the difficulty
   difficulty -= 1
   print("incorrect")
print("difficulty is now {}.".format(difficulty))
# ** a sample run **
# python> 12 + 3 = ?
# user> 22
# python> incorrect
# python> difficulty is now 2.



9 + 15 = ?
23
incorrect
difficulty is now 2.
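
The cell above asks only one question. As a rough sketch of how the same update rule behaves over a whole session, the version below replaces input() with a scripted, hypothetical student who answers correctly whenever both addends are 20 or less. The names run_session and answers_correctly, the cutoff of 20, and the max_difficulty bookkeeping are our assumptions for illustration; they are not part of the tool above.

```python
import random

def addend(difficulty):
    # same rule as above: never let difficulty drop below 1
    difficulty = max(1, difficulty)
    return difficulty * random.choice(range(10))

def answers_correctly(x, y):
    # hypothetical student: reliable only when both addends are 20 or less
    return x <= 20 and y <= 20

def run_session(questions, difficulty=3, seed=0):
    random.seed(seed)  # fixed seed so the same session can be replayed
    max_difficulty = difficulty
    history = []
    for _ in range(questions):
        x = addend(difficulty)
        y = addend(difficulty)
        # same update rule as above: up one if correct, down one if not
        if answers_correctly(x, y):
            difficulty += 1
        else:
            difficulty -= 1
        history.append(difficulty)
        max_difficulty = max(max_difficulty, difficulty)
    return history, max_difficulty

history, max_difficulty = run_session(questions=10)
print("difficulty after each question:", history)
print("max difficulty reached:", max_difficulty)
```

Changing the cutoff in answers_correctly, or the starting difficulty, is a quick way to see how much the reported score depends on choices the programmer made rather than on anything the student did.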




In some sense, we can call it a day. We have created an adaptive assessment. The more problems that a student gets right, the harder the following problems become, and the teacher gets a “max difficulty” score. Not only that, since the tool doesn’t penalize you for getting a question wrong (instead making the next question relatively easier), you are incentivized to keep going and not worry about working on problems that may be a bit too hard. Sounds good, right? So what is wrong with this?

The biggest problem is that very few problems scale in the way this generator is designed. The problems that do scale this way rarely assess understanding meaningfully. For that we need “features.”

As we saw in many of the cases in the chapter, the decision of what counts as a “feature” (e.g. a measurable aspect, an axis, a column, a dimension) is extremely fraught. As we will see through the following code examples, “feature engineering” is a salient way that bias and control are baked into the algorithms that shape modern schooling. Much of the feature engineering behind massively successful machine learning in the marketplace is done by novices and interns. Despite the cutting-edge allure of AI, the cleaning and quantizing of data is tedious, tiresome work that is relegated to rote labor.

Going back to our example, we create something simple that includes two features, the range of the numbers and the arithmetic operation:


In [7]:

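def addend(difficulty_range):
   # as before, we never want difficulty below 1
   difficulty_range = max(1,difficulty_range)
   # random.choice() takes a sequence and returns a random element
   # range(N) gives the integers from 0 up to (but not including) N
   return difficulty_range * random.choice(range(10))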
 
# this function returns the operation's symbol and a function (like in math) that carries it out
def arithmetic_operation(difficulty):
   if difficulty < 3:
       # at low difficulty, addition
       return ("+",lambda x,y: x+y)
   elif difficulty < 5:
       # at higher difficulty, subtraction
       return ("-",lambda x,y: x-y)
   else:
       # highest difficulty is multiplication
       return ("*",lambda x,y: x*y)
 
def two_feature_problem(difficulty):
   x = addend(difficulty)
   y = addend(difficulty)
  
   # calling arithmetic_operation() returns the name
   # so we can print it out to the user
   # and the function, so we can do the calculation
   # ourselves to check their answer
   op_name, op_function = arithmetic_operation(difficulty)
   response = int(input('\n{} {} {} = ?\n'.format(x,op_name,y)))
   if response == op_function(x,y):
       difficulty += 1
       print("correct")
   else:
       difficulty -= 1
       print("incorrect")
   print("difficulty is now {}.".format(difficulty))
  
   # we have changed the difficulty so we need to tell the program
   # outside the function the new difficulty
   return difficulty
 
difficulty = 4
 
# do this 3 times
for i in range(3):
   # since this changes every time, we set it equal
   # to the returned value of the
   # two_feature_problem function
   difficulty = two_feature_problem(difficulty)
 
# ** a sample run **
# python> 0 - 28 = ?
# user>  -28
# python> correct
# python> difficulty is now 5.
# python> 25 * 30 = ?
# user>  3
# python> incorrect
# python> difficulty is now 4.
# python> 36 - 24 = ?
# user> 12
# python> correct
# python> difficulty is now 5.
 


12 - 4 = ?
3
incorrect
difficulty is now 3.

15 - 9 = ?
3
incorrect
difficulty is now 2.

12 + 8 = ?
0
incorrect
difficulty is now 1.



These features introduce a new, more fundamental question: Should operation and range scale difficulty in the same way? This short code segment masks the equity implications of the scaling: different students will excel along different trajectories through these problems. For some students, problems involving negative numbers (e.g. 0 - 28) will halt their progress; for others, it may be the multiplication of multidigit numbers. There may well be a systematicity to these different trajectories. Considering the ways these would play out in the real world, it is incumbent on programmers to account for different trajectories.
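To make the point about trajectories concrete, here is a minimal sketch that runs two hypothetical students through the single difficulty scale used above. The functions simulate, student_a, and student_b are our own illustrative assumptions (real students are not this tidy); only the thresholds for switching operations are taken from the code above.

```python
import random

def arithmetic_operation(difficulty):
    # same thresholds as the two-feature code above
    if difficulty < 3:
        return "+"
    elif difficulty < 5:
        return "-"
    return "*"

def simulate(can_solve, questions=20, difficulty=1, seed=1):
    # can_solve(op_name, x, y) is a hypothetical stand-in for a student
    rng = random.Random(seed)
    max_difficulty = difficulty
    for _ in range(questions):
        scale = max(1, difficulty)
        x = scale * rng.choice(range(10))
        y = scale * rng.choice(range(10))
        if can_solve(arithmetic_operation(difficulty), x, y):
            difficulty += 1
        else:
            difficulty -= 1
        max_difficulty = max(max_difficulty, difficulty)
    return max_difficulty

# hypothetical student A: fine with + and *, but not yet with subtraction
def student_a(op_name, x, y):
    return op_name != "-"

# hypothetical student B: fine with + and -, but not yet with multiplication
def student_b(op_name, x, y):
    return op_name != "*"

print("student A max difficulty:", simulate(student_a))  # prints 3
print("student B max difficulty:", simulate(student_b))  # prints 5
```

Under these assumptions, student A never sees a multiplication problem, so the score of 3 says nothing about a skill the student may well have. The ordering of operations, not the student, decided that.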
Teachers use tools every day that carry these kinds of assumptions, and, because the assumptions might be implicit in simple code (like the code above), they go unexamined. It's not even clear that anyone involved with many of these systems would be aware of the need to look for systematic problems of this sort. Raising awareness that even simple code like this contains inherent systematic inequity (often residing in the IH quadrant of AnSpec) is not enough.
One possible best practice could be to agree that adaptive software should offer some way to expose its assumptions. While this doesn’t “fix” the inequality baked into an assessment, it at least makes such work obvious. Unfortunately, this requires either that the user (in this case, teachers who already have plenty of other demands on their schedules) dedicate time to learning how to critically analyze code, or that we create some sort of language for exposing and expressing these biases in ways that do not require teachers to go and get another Master's degree.
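One lightweight way to expose such assumptions, sketched here with hypothetical helper names (operation_for, describe_level, describe_assumptions), is to have the tool state in plain language what each difficulty level asks of a student. The thresholds and operand ranges are copied from the two-feature code above; the output format is only our guess at what a teacher might find readable.

```python
def operation_for(difficulty):
    # mirrors the thresholds in arithmetic_operation() above
    if difficulty < 3:
        return "+"
    elif difficulty < 5:
        return "-"
    return "*"

def describe_level(difficulty):
    # addend() multiplies the difficulty by an integer from 0 to 9,
    # so the largest possible operand is difficulty * 9
    return {
        "difficulty": difficulty,
        "operation": operation_for(difficulty),
        "operands": "multiples of {} from 0 to {}".format(difficulty, difficulty * 9),
    }

def describe_assumptions(levels=range(1, 8)):
    return [describe_level(d) for d in levels]

for row in describe_assumptions():
    print("level {difficulty}: {operation} on {operands}".format(**row))
```

A listing like this does not remove the assumptions, but it puts them where a teacher can question them, for example by asking why subtraction is treated as strictly harder than addition.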
Turning the two-feature code above into an assessment only requires keeping track of the assessment implicit in the adaptivity:


In [8]:

def analyze_incorrect(user_answer, x1, x2, op_function):
   # returns a pair of changes: (change in range difficulty,
   # change in operation difficulty); each one is 0 or -1
   possible_operations = [lambda x,y: x+y,
                          lambda x,y: x-y,
                          lambda x,y: x*y]
   correct_answer = op_function(x1,x2)
   # every answer the student could reach by using any of the
   # operations, with the two numbers in either order
   plausible_answers = [op(a,b) for op in possible_operations
                        for (a,b) in [(x1,x2),(x2,x1)]]
   # if they simply flipped the order or used another operation,
   # don't change range, make operation less difficult
   if user_answer in plausible_answers:
       print("are you sure you did the right operation?")
       return (0,-1)
   # or if they were within 10% of the correct answer,
   # make the range less difficult, but don't change the operation
   elif abs(user_answer - correct_answer) <= abs(0.1 * correct_answer):
       print("are you sure you did the calculation correctly?")
       return (-1,0)
   # otherwise, no guesses! difficulty for both is lowered by one
   else:
       return (-1,-1)
 
difficulty_range = 3
difficulty_operation = 3
max_range_difficulty = 0
max_operation_difficulty = 0
 
for i in range(5):
   x = addend(difficulty_range)
   y = addend(difficulty_range)
   op_name,op_function = arithmetic_operation(difficulty_operation)
   response = int(input('\n{} {} {} = ?\n'.format(x,op_name,y)))
   if response == op_function(x,y):
       difficulty_range += 1
       difficulty_operation += 1
       print("correct")
   else:
       # analyze_incorrect() returns changes that are 0 or -1,
       # so we add them to lower the difficulty
       delta_difficulty_range, delta_difficulty_operation = analyze_incorrect(response,x,y,op_function)
       difficulty_range += delta_difficulty_range
       difficulty_operation += delta_difficulty_operation
       print("incorrect")
   print("range difficulty is now {}.".format(difficulty_range))
   print("operation difficulty is now {}.".format(difficulty_operation))
   max_range_difficulty = max(max_range_difficulty,difficulty_range)
   max_operation_difficulty = max(max_operation_difficulty,difficulty_operation)
print("user score (max range difficulty): {}".format(max_range_difficulty))
print("user score (max operation difficulty): {}".format(max_operation_difficulty))
 
# ** a sample run **
# python> 24 - 3 = ?
# user> 21
# python> correct
# python> range difficulty is now 4.
# python> operation difficulty is now 4.
#
# python> 32 - 20 = ?
# user> 12
# python> correct
# python> range difficulty is now 5.
# python> operation difficulty is now 5.
#
# python> 25 * 35 = ?
# user> 875
# python> correct
# python> range difficulty is now 6.
# python> operation difficulty is now 6.
#
# python> 0 * 48 = ?
# user> 0
# python> correct
# python> range difficulty is now 7.
# python> operation difficulty is now 7.
#
# python> 21 * 21 = ?
# user> 2
# python> incorrect
# python> range difficulty is now 6.
# python> operation difficulty is now 6.
# python> user score (max range difficulty): 7
# python> user score (max operation difficulty): 7



21 - 3 = ?
0
incorrect
range difficulty is now 2.
operation difficulty is now 2.

12 + 16 = ?
0
incorrect
range difficulty is now 1.
operation difficulty is now 1.

9 + 1 = ?
0
incorrect
range difficulty is now 0.
operation difficulty is now 0.

0 + 0 = ?
0
correct
range difficulty is now 1.
operation difficulty is now 1.

3 + 9 = ?
0
incorrect
range difficulty is now 0.
operation difficulty is now 0.
user score (max range difficulty): 2
user score (max operation difficulty): 2




We can see how quickly our adaptive math tutor slides into a sort of de facto (if not de jure) assessment. In many ways, this is superior to a classic math test: it doesn't punish failure, it allows exploration, it is impossible to "cheat" the system (as the numbers are randomized), and it does not report your trajectory, so to speak. On the other hand, it may make implicit what is understood to be explicit about a test: that it is an evaluation, that it is not expected to relate to lived practice, that it is an exercise for assessment rather than learning. Unless that is understood by all parties, it is simply another potential opportunity to rank students unnecessarily. Nothing about being adaptive makes the tool authentic to any meaningful task. Assuming otherwise is a dangerous leap, but one that many people seem to make. We may just be reproducing many of the worst elements of school, faster.