# Test Generator

Read in the candidates and item data and generate a randomised test from them.

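We assume the 1PL model, in which the probability of a correct response depends on the candidate's ability $\theta$ and the item's difficulty $b$: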

$$
\Pr(X=1) = \frac{\exp(\theta-b)}{1 + \exp(\theta-b)}
$$
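
As a quick numerical check of the formula, the sketch below evaluates it for a few ability/difficulty pairs. The helper name `p_correct` is illustrative and is not part of the notebook.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """1PL probability of a correct response (hypothetical helper)."""
    # algebraically equal to exp(theta - b) / (1 + exp(theta - b))
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# ability equal to difficulty, an easy item, and an item above the candidate's ability
for theta, b in [(0.0, 0.0), (0.0, -4.2), (-6.59, -4.2)]:
    print(f"theta={theta:+.2f}, b={b:+.2f} -> P(X=1)={p_correct(theta, b):.3f}")
```

When ability equals difficulty the probability is exactly 0.5; it rises towards 1 as `theta - b` grows and falls towards 0 as it shrinks.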

The generator can produce tests with a mix of dichotomous and polytomous (scalar) items. Items are loaded from the file `data/items.csv`, which looks like this:

```
UIID,a,b,se,rating,k
A1L#01_7616,1.0,-6.59,0.0,A1,1
A1L#02_20679,1.0,-5.36,0.0,A1,1
A1L#03_5480,1.0,-4.64,0.0,A1,1
A2L#04_5483,1.0,-4.2,0.0,A2,1
```

Here, the first two items, `A1L#01_7616` and `A1L#02_20679`, are dichotomous (i.e. `k = 1`): they can take the values '0' (incorrect) or '1' (correct). Their difficulty parameters `b` are -6.59 and -5.36 respectively. The difficulty thresholds used for polytomous items are hardcoded in the `POLY_THRESHOLDS` list (see below); ideally they would be different for each item.
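
To make the column layout concrete, here is a minimal sketch that parses the sample rows above from an in-memory string and picks out the dichotomous items. The names `SAMPLE`, `parse_items` and `sample_items` are illustrative; the notebook itself reads the file with `csv.reader`.

```python
import csv
import io

# the sample rows shown above, held in memory instead of data/items.csv
SAMPLE = """UIID,a,b,se,rating,k
A1L#01_7616,1.0,-6.59,0.0,A1,1
A1L#02_20679,1.0,-5.36,0.0,A1,1
A1L#03_5480,1.0,-4.64,0.0,A1,1
A2L#04_5483,1.0,-4.2,0.0,A2,1
"""


def parse_items(text: str):
    """Return (uiid, a, b, k) tuples from CSV text with a header row."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["UIID"], float(r["a"]), float(r["b"]), int(r["k"])) for r in rows]


sample_items = parse_items(SAMPLE)
# k == 1 marks a dichotomous item; k >= 2 would mark a polytomous one
dichotomous = [uiid for uiid, a, b, k in sample_items if k == 1]
print(dichotomous)
```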

## Cohorts
There are two cohorts: high ability and low ability. Each cohort takes a slightly different set of items: high-ability candidates take the high-ability and medium-ability items, and low-ability candidates take the low-ability and medium-ability items. An item's cohort is read from its UIID: items whose UIID contains `Lo` or `Hi` belong to that cohort, and all other items are taken by everyone. Note that candidates are assigned to the 'high' or 'low' cohort at random by a binomial draw (n = 1, p = 0.5), independently of their `theta`.
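
The cohort logic can be sketched in isolation as follows. This is an illustration only: `assign_cohort` and `items_for_cohort` are hypothetical helpers, and the three UIIDs are taken from the column names in the results table further down.

```python
import re

import numpy as np

COHORT_LABELS = ["Lo", "Hi"]


def assign_cohort(rng: np.random.Generator) -> str:
    """One Bernoulli(0.5) draw: 0 -> 'Lo', 1 -> 'Hi'."""
    return COHORT_LABELS[int(rng.binomial(n=1, p=0.5))]


def items_for_cohort(uiids, cohort: str):
    """Keep items tagged with this cohort plus items with no cohort tag."""
    selected = []
    for uiid in uiids:
        match = re.search("Lo|Hi", uiid)
        if match is None or match.group() == cohort:
            selected.append(uiid)
    return selected


uiids = ["A1L#01_7616", "R3LoR#08_Cats", "R3HiR#02_Epics"]
rng = np.random.default_rng(0)
cohort = assign_cohort(rng)
print(cohort, items_for_cohort(uiids, cohort))
```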

## Data Ingest

There are two files in the `data` folder that we need: `items.csv` and `candidates.csv`. From these we generate a randomised test.

In [1]:
import numpy as np
from numpy.random import seed
from typing import List, Tuple
from csv import reader
import pandas as pd
import re


# these are the threshold params (deltas) used for the scalar items
# in a proper item bank, each item would have its own set of thresholds
POLY_THRESHOLDS = [
    {'1': -5.39, '2': 3.90}, # k = 2
    {'1': -1.8, '2': -0.2, '3': 1.2}, # k = 3
    {'1': -2.0, '2': -0.5, '3': 0.4, '4': 1.4}, # k = 4
    {'1': -2.25, '2': -1.0, '3': 0.0, '4': 1.0, '5': 2.04}, # k = 5
    {'1': -2.5, '2': -1.25, '3': -0.5, '4': 0.5, '5': 1.25, '6': 2.5}  # k = 6
]

COHORTS = ['Lo', 'Hi']

def getDataAsList(datafile: str) -> List[Tuple]:
    """Turn a CSV datafile into a list of tuples

    :param datafile: the CSV file to load data from
    :return: a list of rows (tuples)
    """
    with open(datafile, 'r', encoding='utf-8-sig') as fs:
        csv_reader = reader(fs)
        row_list = list(map(tuple, csv_reader))
        return row_list[1:]    # ignore the header row
    

# convert the raw data from the candidates.csv file into
# a simple quadruple of ( systemname, givenName, familyName, theta )
def getCandidates() -> List[Tuple]:
    candidates = getDataAsList('data/candidates.csv')
    new_list = [(c[0], c[1], c[2], float(c[3])) for c in candidates]
    return new_list
    

# convert the raw data from the items.csv file into
# a simple quadruple of ( uiid, a, b, k )
def getItems() -> List[Tuple]:
    items = getDataAsList('data/items.csv')
    new_list = [(i[0], float(i[1]), float(i[2]), int(i[5])) for i in items]
    return new_list

In [2]:
items = getItems()
candidates = getCandidates()

## Item Response Generation

The `getItemResponse()` function generates a randomised response, correct (1) or incorrect (0), for a given candidate taking a dichotomous item. `getScalarResponse()` does the same for polytomous items, returning a level from 0 to `k`.
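
The underlying idea is a single Bernoulli draw: compare a uniform random number with the model probability. The sketch below is a simplified stand-in, not the notebook's function; `simulate_response` is a hypothetical name, and the ability and difficulty values are arbitrary. It checks that the observed proportion of correct responses approaches the 1PL probability over many draws.

```python
import numpy as np


def simulate_response(b: float, theta: float, rng: np.random.Generator) -> str:
    """Return '1' with probability P(X=1) under the 1PL model, else '0'."""
    p1 = 1.0 / (1.0 + np.exp(-(theta - b)))
    return '1' if rng.random() < p1 else '0'


rng = np.random.default_rng(12345)
theta, b = 0.5, -0.5  # arbitrary illustrative values
draws = [simulate_response(b, theta, rng) for _ in range(20000)]

observed = draws.count('1') / len(draws)
expected = 1.0 / (1.0 + np.exp(-(theta - b)))
print(f"expected {expected:.3f}, observed {observed:.3f}")
```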

In [3]:
def getItemResponse(b: float, theta: float, seed: int = None) -> str:
    """Gets a randomised dichotomous item response for a given candidate
    according to the 1PL model:

    P(X=1) = e^(theta-b) / (1 + e^(theta-b))

    :param b: the difficulty parameter for the item
    :param theta: the latent ability of the candidate
    :return: '0' = incorrect, '1' = correct
    """
    if seed is None:
        rng = np.random.default_rng()
    else:
        rng = np.random.default_rng(seed)
    # uniform draw on [0, 1) from the (optionally seeded) generator, so that
    # the response is '1' with probability p1
    rv = rng.random()

    p1 = np.exp(theta - b) / (1 + np.exp(theta - b))
    p0 = 1 - p1

    assert p0 <= 1.0
    assert p1 <= 1.0

    rLookup = {
        '0': [0.00, p0],
        '1': [p0, 1.00]
    }
    r = {k: v for (k, v) in rLookup.items() if v[0] <= rv <= v[1]}
    rKey = list(r.keys())

    assert rKey[0] == '1' or rKey[0] == '0'

    return rKey[0]


def getScalarResponse(b: float, theta: float, k: int = 1, seed: int = None) -> str:
    """Gets a randomised polytomous item response for a given candidate
    according to the 1PL model; uses the partial credit model (PCM) approach.
    It uses the thresholds from POLY_THRESHOLDS[k-2]. As implemented, each
    threshold d acts as a cumulative boundary:
    P(X <= t) = 1 - e^(theta-d) / (1 + e^(theta-d)), where d is threshold t+1.
    For k >= 2 the item's own difficulty b is not used.

    :param b: the difficulty parameter for the item (used only when k = 1)
    :param theta: the latent ability of the candidate
    :param k: the number of levels for the item (1 = dichotomous, 2+ = polytomous)
    :param seed: optional seed for the random number generator
    :return: the response level as a string, from '0' up to str(k)
    """
    assert k > 0
    
    response = 0
    
    if k == 1:
        response = getItemResponse(b, theta, seed)
    else:        
        if seed is None:
            rng = np.random.default_rng()
        else:
            rng = np.random.default_rng(seed)
        rv = rng.random()
        
        rLookup = {}
        rThresholds = POLY_THRESHOLDS[k-2]
        # each entry maps a response level to its [lower, upper] cumulative-probability bounds
        p1 = p0 = 0.00
        
        for t in range(k+1):
            p0 = p1
            if t == k:
                p1 = 1.0
            else:
                b = rThresholds[str(t+1)]
                p1 = 1 - (np.exp(theta - b) / (1 + np.exp(theta - b)))
            rLookup[str(t)] = [p0, p1]

        r = {k: v for (k, v) in rLookup.items() if v[0] <= rv <= v[1]}
        rKey = list(r.keys())    
        response = rKey[0]
        
    return response
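
As the loop in `getScalarResponse()` is written, each threshold sets a cumulative boundary, and the width of each interval is the probability of that response level. The sketch below recomputes those widths for the `k = 3` thresholds at an arbitrary ability of 0. The name `category_probabilities` is hypothetical and the vectorised form is a convenience, not the notebook's code.

```python
import numpy as np


def category_probabilities(theta: float, thresholds) -> np.ndarray:
    """Probability of each response level 0..k given ascending thresholds."""
    deltas = np.asarray(thresholds, dtype=float)
    # cumulative P(X <= t) for t = 0 .. k-1; the top level closes at 1.0
    cumulative = 1.0 - 1.0 / (1.0 + np.exp(-(theta - deltas)))
    bounds = np.concatenate(([0.0], cumulative, [1.0]))
    # the interval widths are the per-level probabilities
    return np.diff(bounds)


# thresholds copied from the k = 3 entry of POLY_THRESHOLDS
probs = category_probabilities(theta=0.0, thresholds=[-1.8, -0.2, 1.2])
for level, p in enumerate(probs):
    print(f"level {level}: {p:.3f}")
```

Because the thresholds are ascending, the boundaries increase monotonically, so the intervals partition [0, 1] and the probabilities sum to 1.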


We iterate through the candidates and generate a simulated response for each item in their test. A candidate's test comprises the items for their cohort plus the items that belong to no cohort; items from the other cohort are left blank.

In [4]:
def GenerateRandomTests(seed: int = None):
    test_responses = [] # a list of lists

    # generate a header row for the results
    header = []
    header.append('systemname')
    for i in items:
        header.append(i[0])

    # now create the simulated test responses
    for c in candidates:
        test = []
        test.append(c[0])

        # choose a cohort for this candidate ('Lo', or 'Hi')
        r = np.random.binomial(n=1, p=0.5)
        candidateCohort = COHORTS[r]

        for i in items:
            uiid = i[0]
            cohortMatch = re.search("Lo|Hi", uiid)
            if cohortMatch is not None:
                itemCohort = cohortMatch.group()
            else:
                itemCohort = None
            if (itemCohort is None) or (candidateCohort == itemCohort):
                if i[3] > 1:
                    # provide a polytomous response (k >= 2)
                    r = int(getScalarResponse(i[2], c[3], i[3], seed))
                else:
                    # provide a dichotomous response
                    r = int(getItemResponse(i[2], c[3], seed))
            else:
                r = ''
            
            test.append(r)
        test_responses.append(test)

    df = pd.DataFrame(test_responses, columns=header)
    return df

In [5]:
df = GenerateRandomTests()

In [6]:
(df)

Unnamed: 0,systemname,A1L#01_7616,A1L#02_20679,A1L#03_5480,A2L#04_5483,A2L#05_24442,A2L#06_7620,A2L#07_7627,B1L#08_20849,B1L#10_21135,...,R3LoR#08_Cats,R3LoS#09_Cats,R3LoS#10_Cats,R3HiR#02_Epics,R3HiT#04_Epics,R3HiI#05_Epics,R3HiE#06_Epics,R3HiL#07_Epics,R3HiS#09_Epics,R3HiS#10_Epics
0,DT0001,1,1,1,0,0,1,1,0,0,...,1,1,1,,,,,,,
1,DT0002,1,1,1,0,1,0,1,0,0,...,0,1,1,,,,,,,
2,DT0003,1,1,0,0,0,0,0,0,0,...,0,1,0,,,,,,,
3,DT0004,1,0,1,0,0,0,0,0,0,...,0,1,0,,,,,,,
4,DT0005,1,1,0,0,0,0,0,0,0,...,,,,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,DT4996,1,1,1,1,0,0,1,0,0,...,0,1,0,,,,,,,
4996,DT4997,1,1,0,0,0,0,0,0,0,...,,,,0,0,0,0,1,0,0
4997,DT4998,1,1,1,1,0,1,1,0,0,...,,,,1,1,1,1,1,1,1
4998,DT4999,1,1,1,1,1,0,1,0,0,...,1,0,1,,,,,,,


## Next Steps

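You can call the `GenerateRandomTests()` function as many times as you want to regenerate a test. Without a seed, the results differ on every call. An integer seed is passed through to the response functions, but cohort assignment draws from NumPy's global random state, so it is not controlled by that seed.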
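
One point to watch when seeding: the response functions build a new generator from the seed on each call, and a generator re-created from the same seed always produces the same first draw. The sketch below is an illustration, not taken from the notebook. It contrasts that pattern with a single shared generator, which gives varied but still reproducible draws.

```python
import numpy as np

seed = 42

# Re-creating a generator from the same seed repeats the same first draw.
repeated = [np.random.default_rng(seed).random() for _ in range(3)]

# One shared generator yields different draws on each call...
shared = np.random.default_rng(seed)
varied = [shared.random() for _ in range(3)]

# ...and the whole sequence is reproduced when the run is repeated.
shared_again = np.random.default_rng(seed)
varied_again = [shared_again.random() for _ in range(3)]

print(repeated)
print(varied)
print(varied == varied_again)
```

If reproducible runs matter, one option is to create the generator once and pass it to the response functions instead of passing the integer seed.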

Add items and candidates to the data files to generate larger tests.

When you are happy with the results you can write out to a results CSV file like this:

In [7]:
df.to_csv('data/results.csv', index=False)
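
When the results are read back, the blank cells for items a candidate did not take come through as missing values. The sketch below shows this on a tiny made-up frame written to an in-memory buffer in place of `data/results.csv`; the column names `item_A` and `item_B` are hypothetical.

```python
import io

import pandas as pd

# a tiny stand-in for the results: '' marks an item the candidate did not take
demo = pd.DataFrame(
    [["DT0001", 1, ""], ["DT0002", 0, 1]],
    columns=["systemname", "item_A", "item_B"],
)

buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# blank cells are read back as NaN; sum() skips them by default
raw_scores = loaded.drop(columns="systemname").sum(axis=1)
print(loaded)
print(raw_scores)
```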