# Hidden Tests

This file describes the hidden tests for all the rubric points. The tests for each rubric point are enclosed between `# BEGIN <rubric_point>` and `# END <rubric_point>` NBConvert cells. For each `<rubric_point>`, `hidden_tests.py` executes the contents of the cells between those two tags. To initialize variables, `hidden_tests.py` also executes all code within `BEGIN` and `END` tags that appear before the `original` test.

Code that is not enclosed within `BEGIN` and `END` tags is not executed by `hidden_tests.py`. It is used for generating the hidden datasets.
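The tag mechanism described above can be pictured with a small sketch. This is not the code in `hidden_tests.py`; the function name `extract_tagged_code` is hypothetical, and it assumes each tag sits alone in its own cell and that cells are available as plain source strings.

```python
def extract_tagged_code(cell_sources, rubric_point):
    """Collect the source of every cell between the BEGIN and END tag cells.

    Assumption: each tag is the entire content of its own cell.
    """
    begin = "# BEGIN %s" % rubric_point
    end = "# END %s" % rubric_point
    collected = []
    inside = False
    for source in cell_sources:
        stripped = source.strip()
        if stripped == begin:
            inside = True
        elif stripped == end:
            inside = False
        elif inside:
            collected.append(source)
    # cells outside the tags are skipped entirely
    return "\n\n".join(collected)


example_cells = [
    "# BEGIN q1", "x = 1", "# END q1",
    "generate_dataset = True",  # outside any tag, so never collected
    "# BEGIN q2", "z = 3", "# END q2",
]
q1_code = extract_tagged_code(example_cells, "q1")
```

Code that sits outside every tag pair, like the `generate_dataset` cell in this example, never reaches the collected output.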

In [None]:
from hidden_tests import *
import otter_tests.gen_public_tests as gen_public_tests
import os, csv, json, copy, shutil
import random
import numpy as np

In [None]:
DIRECTORY = ...
FILE = ...

In [None]:
results = {}

In [None]:
deductions = {}
rubric = parse_rubric_file(os.path.join(DIRECTORY, "rubric.md"))
directories = get_directories(rubric)
comments = get_all_comments(directories)

In [None]:
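def write_readme(data, write_path):
    """write_readme(data, write_path) overwrites the README.txt file `write_path` with its original
    first line (the rubric point), followed by a blank line and the contents of `data`"""
    # keep the first line of the existing file; it names the rubric point
    with open(write_path, encoding='utf-8') as f:
        rubric_point = f.read().split("\n")[0].strip(" \n")

    # rewrite the file as: rubric point, blank line, new data
    with open(write_path, 'w', encoding='utf-8') as f:
        f.write(rubric_point + "\n\n" + data)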

## Variables

Useful variables that are used by many rubric tests can be stored here. The contents of this tag will be executed before each rubric test, so these variables get initialized before each rubric test.

`verify_fn_defn` holds the source code of the function `verify_fn`, which verifies that the functions `expected` and `actual` produce the same outputs for every tuple of arguments in `var_inputs`.

In [None]:
verify_fn_defn = """
def verify_fn(expected, actual, var_inputs, test_format):
    for var in var_inputs:
        try:
            actual_val = actual(*var)
        except Exception as e:
            output = "%s results: " % actual.__name__
            output += "%s error encountered on %s%s" % (type(e).__name__, actual.__name__, repr(var))
            return output
        expected_val = expected(*var)
        check = public_tests.compare(expected_val, actual_val, test_format)
        if check != public_tests.PASS:
            output = "%s results: " % actual.__name__
            output += "%s%s output: %s" % (actual.__name__, repr(var), check)
            return output
    return "%s results: All test cases passed!" % actual.__name__"""
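To make the behavior of `verify_fn` concrete, here is a standalone sketch that can run outside the grading notebook. The names `compare_sketch` and `verify_fn_sketch` are hypothetical; `compare_sketch` is only a stand-in for `public_tests.compare`, whose real comparison rules and return values are not shown in this file. Each entry of `var_inputs` is a tuple of positional arguments.

```python
def compare_sketch(expected, actual):
    """Hypothetical stand-in for public_tests.compare."""
    if expected == actual:
        return "PASS"
    return "expected %r but found %r" % (expected, actual)


def verify_fn_sketch(expected, actual, var_inputs):
    """Report the first input on which `actual` crashes or disagrees with `expected`."""
    for var in var_inputs:
        try:
            actual_val = actual(*var)
        except Exception as e:
            return "%s results: %s error encountered on %s%s" % (
                actual.__name__, type(e).__name__, actual.__name__, repr(var))
        check = compare_sketch(expected(*var), actual_val)
        if check != "PASS":
            return "%s results: %s%s output: %s" % (
                actual.__name__, actual.__name__, repr(var), check)
    return "%s results: All test cases passed!" % actual.__name__


def true_add(a, b):
    return a + b


def add(a, b):
    # deliberately wrong for negative b
    return a + abs(b)


report = verify_fn_sketch(true_add, add, [(1, 2), (3, -4)])
print(report)  # add results: add(3, -4) output: expected -1 but found 7
```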

`function_dependencies_functions` stores the previously defined functions that each function definition invokes. This variable is used for rubric points that check the logical correctness of functions as well as those that check whether a required function is used. For these rubric points, when we test a particular function, we use `function_dependencies_functions` to ensure that all the functions that it depends on are replaced with logically correct versions. This helps isolate the issue with the functions.

In [None]:
function_dependencies_functions = ...

`function_dependencies_data_structures` stores the previously defined data structures that each function definition accesses. This variable is used for rubric points that check the logical correctness of functions as well as those that check whether a required function is used. For these rubric points, when we test a particular function, we use `function_dependencies_data_structures` to ensure that all the data structures that it depends on are replaced with logically correct versions. This helps isolate the issue with the functions.

In [None]:
function_dependencies_data_structures = ...

`data_structure_dependencies_functions` stores the previously defined functions that each data structure definition invokes. This variable is used for rubric points that check the logical correctness of data structures as well as those that check whether a required data structure is used. For these rubric points, when we test a particular data structure, we use `data_structure_dependencies_functions` to ensure that all the functions that it depends on are replaced with logically correct versions. This helps isolate the issue with the data structures.

In [None]:
data_structure_dependencies_functions = ...

`data_structure_dependencies_data_structures` stores the previously defined data structures that each data structure definition accesses. This variable is used for rubric points that check the logical correctness of data structures as well as those that check whether a required data structure is used. For these rubric points, when we test a particular data structure, we use `data_structure_dependencies_data_structures` to ensure that all the data structures that it depends on are replaced with logically correct versions. This helps isolate the issue with the data structures.

In [None]:
data_structure_dependencies_data_structures = ... 
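The values of the four dependency variables are left elided in this file. Judging only from the `.get(name, [])` lookups used by the functions later on, each one is presumably a dict that maps a name to the list of names it depends on. The sketch below shows that assumed shape; every function name in it is made up for illustration, and `missing_dependencies` is a hypothetical sanity check, not part of `hidden_tests.py`.

```python
# Assumed shape: {name: [names it depends on]}. All names here are invented.
example_function_dependencies = {
    "cell": [],
    "find_room_names": ["cell"],
    "average_price": ["cell", "find_room_names"],
}


def missing_dependencies(dependency_map):
    """Return dependencies that are referenced but have no entry of their own."""
    missing = set()
    for dependencies in dependency_map.values():
        for dependency in dependencies:
            if dependency not in dependency_map:
                missing.add(dependency)
    return sorted(missing)


# A name without an entry is treated as having no dependencies.
no_entry = example_function_dependencies.get("unlisted_function", [])
unresolved = missing_dependencies(example_function_dependencies)
```

A check like this can catch a misspelled name before it silently causes a dependency to be skipped.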

## Functions

Useful functions that are used by many rubric tests can be stored here. The contents of this tag will be executed before each rubric test, so these function definitions get initialized before each rubric test.

`replace_with_false_function` replaces the given `function` with the **false version** of the function, and also replaces all **dependent** functions and data structures with their **true versions**.

In [None]:
def replace_with_false_function(nb, function, false_function):
    nb = replace_defn(nb, function, false_function)
    
    for dependent in function_dependencies_functions.get(function, []):
        nb = replace_defn(nb, dependent, true_functions[dependent])
    for dependent in function_dependencies_data_structures.get(function, []):
        indices = find_all_cell_indices(nb, "code", "grader.check('%s')" % (dependent))
        if not indices:
            # no grader.check cell for this dependency; fall back to the Question 1 cell
            indices = find_all_cell_indices(nb, "markdown", "**Question 1:**")
        idx = indices[-1]
        nb = inject_code(nb, idx, true_data_structures[dependent])
        nb = remove_initializations(nb, dependent, start=idx+1)
    return nb
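`replace_defn` is provided by `hidden_tests.py`, and its implementation is not shown in this file. As a rough illustration of what replacing a definition involves, the sketch below swaps a top-level function definition inside a single source string using Python's `ast` module (it relies on `end_lineno`, available from Python 3.8). The name `replace_defn_in_source` is hypothetical; the real helper works on a whole notebook and may be implemented quite differently.

```python
import ast


def replace_defn_in_source(source, func_name, new_defn):
    """Replace every top-level definition of `func_name` in `source` with `new_defn`."""
    tree = ast.parse(source)
    lines = source.split("\n")
    matches = [node for node in tree.body
               if isinstance(node, ast.FunctionDef) and node.name == func_name]
    # work from the bottom up so earlier line numbers stay valid
    for node in sorted(matches, key=lambda n: n.lineno, reverse=True):
        lines[node.lineno - 1:node.end_lineno] = new_defn.strip("\n").split("\n")
    return "\n".join(lines)


student_source = "def f(x):\n    return x + 1\n\ny = f(1)\n"
false_f = "def f(x):\n    return x * 2"
patched_source = replace_defn_in_source(student_source, "f", false_f)

# run the patched source to confirm that the swapped-in definition is used
namespace = {}
exec(patched_source, namespace)
```

After the swap, the call `f(1)` in the patched source uses the injected definition, which is how a false function exposes code that does or does not rely on it.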

`replace_with_false_data_structure` replaces the given `data_structure` with the **false** version of the data structure, and also replaces all **dependent** functions and data structures with their **true versions**.

In [None]:
def replace_with_false_data_structure(nb, data_structure, false_data_structure):
    indices = find_all_cell_indices(nb, "code", "grader.check('%s')" % (data_structure))
    if not indices:
        # no grader.check cell for this data structure; fall back to the Question 1 cell
        indices = find_all_cell_indices(nb, "markdown", "**Question 1:**")
    idx = indices[-1]
    nb = inject_code(nb, idx, false_data_structure)
    nb = remove_initializations(nb, data_structure, start=idx+1)
    
    for dependent in data_structure_dependencies_functions.get(data_structure, []):
        nb = replace_defn(nb, dependent, true_functions[dependent])
    for dependent in data_structure_dependencies_data_structures.get(data_structure, []):
        indices = find_all_cell_indices(nb, "code", "grader.check('%s')" % (dependent))
        if not indices:
            # no grader.check cell for this dependency; fall back to the Question 1 cell
            indices = find_all_cell_indices(nb, "markdown", "**Question 1:**")
        idx = indices[-1]
        nb = inject_code(nb, idx, true_data_structures[dependent])
        nb = remove_initializations(nb, dependent, start=idx+1)
    return nb

`get_test_text` returns test code that can be readily injected into the notebook. The input should be some code that updates the variable `test_output` and sets its value to be `"All test cases passed!"` when the conditions for passing the rubric test are met. This function places the code inside a wrapper that ensures that it does not crash the student notebook during execution and also makes the output parsable.

In [None]:
def get_test_text(qnum, test_code):
    test_text = "\"\"\"grader.check('%s')\"\"\"\n\n" % (qnum)
    test_text += "test_output = '%s results: Test crashed!'\n" % (qnum)
    test_text += add_try_except(test_code)
    test_text += "\nprint(test_output)"
    return test_text
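The following sketch shows what the generated test text looks like and why a crashing test still yields a parsable result. `add_try_except` is defined in `hidden_tests.py` and not shown here, so `add_try_except_sketch` is a hypothetical stand-in that simply wraps the code in a bare try/except; the real helper may format things differently.

```python
import textwrap


def add_try_except_sketch(text):
    """Hypothetical stand-in for add_try_except: wrap `text` in a bare try/except."""
    return "try:\n" + textwrap.indent(text, "    ") + "\nexcept:\n    pass"


def get_test_text_sketch(qnum, test_code):
    """Same structure as get_test_text, built on the stand-in wrapper."""
    test_text = "\"\"\"grader.check('%s')\"\"\"\n\n" % (qnum)
    test_text += "test_output = '%s results: Test crashed!'\n" % (qnum)
    test_text += add_try_except_sketch(test_code)
    test_text += "\nprint(test_output)"
    return test_text


crashing_test = get_test_text_sketch("q1", "test_output = 1 / 0")
passing_test = get_test_text_sketch(
    "q1", "test_output = 'q1 results: All test cases passed!'")

# run both generated cells in separate namespaces
crash_namespace, pass_namespace = {}, {}
exec(crashing_test, crash_namespace)  # prints: q1 results: Test crashed!
exec(passing_test, pass_namespace)    # prints: q1 results: All test cases passed!
```

Because `test_output` is assigned the "Test crashed!" message before the wrapped code runs, an exception inside the test leaves that default in place instead of stopping the notebook.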

`inject_function_logic_check` injects code into the `nb` that detects whether `function` produces the same outputs as the **true version** of that function (all dependent functions and data structures are also replaced with their **true versions**) on every input in `var_inputs`. The string `var_inputs_code` must contain code that defines `var_inputs`. The comparison between the outputs is performed assuming that the format of the answers is `test_format`.

In [None]:
def inject_function_logic_check(nb, function, var_inputs_code, test_format="TEXT_FORMAT"):
    for dependent in function_dependencies_functions.get(function, []):
        nb = replace_defn(nb, dependent, true_functions[dependent])
    for dependent in function_dependencies_data_structures.get(function, []):
        indices = find_all_cell_indices(nb, "code", "grader.check('%s')" % (dependent))
        if not indices:
            # no grader.check cell for this dependency; fall back to the Question 1 cell
            indices = find_all_cell_indices(nb, "markdown", "**Question 1:**")
        idx = indices[-1]
        nb = inject_code(nb, idx, true_data_structures[dependent])
        nb = remove_initializations(nb, dependent, start=idx+1)
        
    code = replace_call(true_functions[function], function, "true_"+function)
    code += "\n\n" + verify_fn_defn
    nb = inject_code(nb, len(nb['cells']), code)
    test_code = var_inputs_code + "\n"
    test_code += "test_output = verify_fn(true_%s, %s, var_inputs, '%s')" % (function, function, test_format)
    code = get_test_text(function, test_code)
    nb = inject_code(nb, len(nb['cells']), code)
    return nb
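Since the injected test calls `verify_fn(..., var_inputs, ...)`, the string passed as `var_inputs_code` has to define a variable named `var_inputs` holding one tuple of arguments per test case. One possible way to build such a string from per-argument value lists is sketched below; `make_var_inputs_code` is a hypothetical helper, not part of `hidden_tests.py`, and it assumes every value has a `repr` that evaluates back to the same value.

```python
import itertools


def make_var_inputs_code(*var_lists):
    """Build code that defines `var_inputs` as all combinations of the given value lists."""
    combinations = list(itertools.product(*var_lists))
    return "var_inputs = %r" % (combinations,)


var_inputs_code = make_var_inputs_code([1, 2], ["a", "b"])

# evaluate the generated code to confirm that it defines var_inputs
namespace = {}
exec(var_inputs_code, namespace)
```

The resulting string can then be passed as the `var_inputs_code` argument. Hand-written code works just as well, as long as it ends up defining `var_inputs`.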

`inject_data_structure_check` injects code into the `nb` that detects whether `data_structure` has the same value as the **true version** of that data structure (all dependent functions and data structures are also replaced with their **true versions**). The comparison between the outputs is performed assuming that the format of the answers is `test_format`.

In [None]:
def inject_data_structure_check(nb, data_structure, test_format="TEXT_FORMAT"):
    for dependent in data_structure_dependencies_functions.get(data_structure, []):
        nb = replace_defn(nb, dependent, true_functions[dependent])
    for dependent in data_structure_dependencies_data_structures.get(data_structure, []):
        indices = find_all_cell_indices(nb, "code", "grader.check('%s')" % (dependent))
        if not indices:
            # no grader.check cell for this dependency; fall back to the Question 1 cell
            indices = find_all_cell_indices(nb, "markdown", "**Question 1:**")
        idx = indices[-1]
        nb = inject_code(nb, idx, true_data_structures[dependent])
        nb = remove_initializations(nb, dependent, start=idx+1)
        
    code = "import copy\n%s = copy.deepcopy(%s)\n\n" % (data_structure, data_structure)
    code += replace_variable(true_data_structures[data_structure], data_structure, "true_"+data_structure)
    nb = inject_code(nb, len(nb['cells']), code)
    
    test_code = "test_output = '%s results: '" % (data_structure)
    test_code += "+ public_tests.compare(true_%s, %s, '%s')" % (data_structure, data_structure, test_format)
    code = get_test_text(data_structure, test_code)
    nb = inject_code(nb, len(nb['cells']), code)
    return nb

## Random Data Generation

Here, functions are defined that can generate **random** data that is in the correct format.

**Warning:** This is the most complex function in the file, and it is likely to have some bugs in it. So, **verify** this function **carefully**. The following **requirements** for this function **will not** be met by the function generated by GPT; it is **your responsibility** to modify the function so as to meet these requirements. Otherwise, the datasets are unlikely to produce interesting outputs for the project questions.

## True Functions

Here, the **correct** versions of all functions that are defined in the notebook are stored. These functions are compared against the functions in the student notebook to check for their correctness.

## True Data Structures

Here, the **correct** versions of all data structures that are defined in the notebook are stored. These data structures are compared against the data structures in the student notebook to check for their correctness.

## Original

The original test simply runs the student's notebook as it is (after removing cells with syntax errors, and performing other clean-up). This helps us detect if the student failed any public tests.

In [None]:
nb = clean_nb(read_nb(os.path.join(DIRECTORY, FILE)))

results['original'] = parse_nb(run_nb(nb, os.path.join(DIRECTORY, "hidden", "original", FILE)))

## Hardcode

The hardcode tests run the student's notebook on different datasets. However, `public_tests.py` remains unchanged. So, if the answers are hardcoded in the student's notebook, we expect their code to still pass the public tests on all the different datasets. If their code fails any one of the different hardcode datasets, we take that to mean that the answer is not hardcoded.

In [None]:
for subdirectory in os.listdir(os.path.join(DIRECTORY, "hidden", "hardcode")):
    path = os.path.join(DIRECTORY, "hidden", "hardcode", subdirectory)
    good_dataset = False
    while not good_dataset:
        # start every attempt from a fresh copy of the notebook
        nb = clean_nb(read_nb(os.path.join(DIRECTORY, FILE)))
        hardcode_results = parse_nb(run_nb(nb, os.path.join(path, FILE)))
        good_dataset = True
        for qnum in hardcode_results:
            # a question that still passes the unchanged public tests on this dataset
            # cannot expose a hardcoded answer, so the dataset is rejected
            if qnum.startswith('q') and hardcode_results[qnum] == 'All test cases passed!':
                print(qnum + ' failed!')
                good_dataset = False
                break
        if not good_dataset:
            # regenerate the dataset and try again
            random_data(path, 500)
    print(subdirectory + ' done!')

In [None]:
for hardcode in os.listdir(os.path.join(DIRECTORY, "hidden", "hardcode")):
    nb = clean_nb(read_nb(os.path.join(DIRECTORY, FILE)))
    results['hardcode: ' + hardcode] = parse_nb(run_nb(nb, os.path.join(DIRECTORY, "hidden", "hardcode", hardcode, FILE)))
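The decision rule described for the hardcode tests can be sketched as follows. This assumes that `parse_nb` returns a dict mapping question names to result strings, which is how the results are used in the loop above; the function name `find_hardcoded_questions` and the sample data are hypothetical.

```python
PASS_MESSAGE = "All test cases passed!"


def find_hardcoded_questions(results):
    """Return questions that pass the public tests on every hardcode dataset."""
    runs = [run for name, run in results.items() if name.startswith("hardcode: ")]
    if not runs:
        return []
    questions = set()
    for run in runs:
        questions.update(qnum for qnum in run if qnum.startswith("q"))
    # failing on even one dataset is taken to mean the answer is not hardcoded
    return sorted(qnum for qnum in questions
                  if all(run.get(qnum) == PASS_MESSAGE for run in runs))


example_results = {
    "original": {"q1": PASS_MESSAGE, "q2": PASS_MESSAGE},
    "hardcode: dataset_a": {"q1": PASS_MESSAGE, "q2": "incorrect answer"},
    "hardcode: dataset_b": {"q1": PASS_MESSAGE, "q2": PASS_MESSAGE},
}
suspected = find_hardcoded_questions(example_results)
```

In this made-up example only `q1` is flagged, because `q2` fails on one of the datasets.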

## Rubric Tests

The tests for the rubric points will be defined below. Only the code inside the tags will be executed by `hidden_tests.py`, so the code outside the tags is used for generating the hidden datasets in the first place.

### Instructions for creating rubric tests:

Functions inside `hidden_tests.py` can be used to modify the student notebook, before executing and parsing the outputs. It is recommended that before trying to create rubric tests, a user goes through all the functions inside `hidden_tests.py` first. Here is a list of commonly used functions that will be most useful:

* **`read_nb`**: `read_nb(file)` **reads** a `file` in the `.ipynb` file format and returns a `nb`.
* **`run_nb`**: `run_nb(nb, file)` **executes** `nb` at the location `file` and **writes** the contents back into `file`.
* **`parse_nb`**: `parse_nb(nb)` reads the contents of a student `nb` and **extracts** all graded questions and answers.
* **`truncate_nb`**: `truncate_nb(nb, start, end)` takes in a `nb`, and returns a **sliced** notebook between the cells indexed `start` and `end`.
* **`find_all_cell_indices`**: `find_all_cell_indices(nb, cell_type, marker)` returns **all** the indices in `nb` of cells of type `cell_type` that **contain** the `marker` in their source.
* **`inject_code`**: `inject_code(nb, idx, code)` creates a **new** code cell in `nb` **after** the index `idx` with `code` in it.
* **`count_defns`**: `count_defns(nb, func_name)` **counts** the number of times `func_name` is defined in the `nb`.
* **`replace_defn`**: `replace_defn(nb, func_name, new_defn)` **replaces** the definition of `func_name` in `nb` with `new_defn`.
* **`replace_call`**: `replace_call(text, func_name, new_name)` **replaces** all **calls** and definition **names** to `func_name` with `new_name` in `text`.
* **`find_code`**: `find_code(nb, target)` returns the **number** of times that the **text** `target` appears in a code cell in `nb`.
* **`replace_code`**: `replace_code(nb, target, new_code, start, end)` **replaces** all instances of the **text** `target` in a code cell between the indices `start` and `end` with the **text** `new_code`.
* **`add_try_except`**: `add_try_except(text)` adds a (bare) **try/except block** around any given block of code.
* **`detect_restart_and_run_all`**: `detect_restart_and_run_all(nb)` flags if any **non-empty code cell** in `nb` is **not executed**.
* **`detect_imports`**: `detect_imports(nb)` returns a list of **all** the **import** statements in the `nb`.
* **`detect_ast_objects`**: `detect_ast_objects(nb, objects)` returns a dict of **all** cells in the `nb` with the **ast objects** `objects` in them.
* **`get_first_plot`**: `get_first_plot(nb, image_file)` returns the first **image** found in the output of a code cell in `nb`, and also stores it in `image_file` for reference.
* **`get_label_plot`**: `get_label_plot(plot, kind)` **crops** the `plot` and returns a plot containing just the **label** at the location indicated by `kind` - `"left"`, `"right"`, `"top"`, or `"bottom"`.
* **`get_without_label_plot`**: `get_without_label_plot(plot, kind)` **crops** the `plot` and returns a plot containing everything **except** the **label** at the location indicated by `kind` - `"left"`, `"right"`, `"top"`, or `"bottom"`.
* **`get_ticks_plot`**: `get_ticks_plot(plot, kind)` **crops** the `plot` and returns a plot containing just the **ticks** at the location indicated by `kind` - `"left"` or `"bottom"`.
* **`get_without_ticks_plot`**: `get_without_ticks_plot(plot, kind)` **crops** the `plot` and returns a plot containing everything **except** the **ticks** at the location indicated by `kind` - `"left"` or `"bottom"`.
* **`get_bounding_box_plot`**: `get_bounding_box_plot(plot)` **crops** the `plot` and returns a plot containing just the **bounding box** of the plot.
* **`check_text_in_plot`**: `check_text_in_plot(plot, expected_text)` checks if the `expected_text` is in the `plot`, and returns both the **missing** and the **extra** text in the given `plot`.
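To get a feel for how two of these helpers combine, the sketch below gives toy versions of `find_all_cell_indices` and `inject_code` that operate on a notebook represented as a plain dict in the usual `.ipynb` layout. These are hypothetical stand-ins written from the descriptions in the list above, not the implementations in `hidden_tests.py`.

```python
def find_all_cell_indices_sketch(nb, cell_type, marker):
    """Return the indices of all cells of `cell_type` whose source contains `marker`."""
    indices = []
    for i, cell in enumerate(nb["cells"]):
        source = cell["source"]
        if isinstance(source, list):  # .ipynb files may store source as a list of lines
            source = "".join(source)
        if cell["cell_type"] == cell_type and marker in source:
            indices.append(i)
    return indices


def inject_code_sketch(nb, idx, code):
    """Insert a new code cell containing `code` right after the cell at index `idx`."""
    new_cell = {"cell_type": "code", "execution_count": None,
                "metadata": {}, "outputs": [], "source": code}
    nb["cells"].insert(idx + 1, new_cell)
    return nb


toy_nb = {"cells": [
    {"cell_type": "markdown", "metadata": {}, "source": "**Question 1:**"},
    {"cell_type": "code", "metadata": {}, "outputs": [], "source": "x = 1"},
    {"cell_type": "code", "metadata": {}, "outputs": [], "source": "grader.check('q1')"},
]}

# locate the last grader.check cell for q1 and place a new cell right after it
check_indices = find_all_cell_indices_sketch(toy_nb, "code", "grader.check('q1')")
toy_nb = inject_code_sketch(toy_nb, check_indices[-1], "y = 2")
```

Passing `len(nb['cells'])` as `idx`, as the functions earlier in this file do, appends the new cell at the end of the notebook in this sketch as well.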