A review of the code and tests in https://github.com/AgPipeline/transformer-rgb-indices, with suggestions for using pytest to create unit tests and for using the typing module to create classes/types that more finely describe the data.
Currently the algorithm_rgb module uses top-level, global variables to define meta-information about the module, e.g.:

```python
# Definitions
VERSION = '1.0'

# Information on the creator of this algorithm
ALGORITHM_AUTHOR = 'Chris Schnaufer, Clairessa Brown, David Lebauer'
ALGORITHM_AUTHOR_EMAIL = 'schnaufer@arizona.edu, clairessabrown@email.arizona.edu, dlebauer@email.arizona.edu'
ALGORITHM_CONTRIBUTORS = ["Jacob van der Leeuw"]
```
Question: Why is ALGORITHM_CONTRIBUTORS defined as a list but ALGORITHM_AUTHOR/ALGORITHM_AUTHOR_EMAIL defined as str? It would seem all of these should be lists.
One immediate danger is that these are still variables and as such are mutable:
```python
>>> import algorithm
>>> algorithm.VERSION
'1.0'
>>> algorithm.VERSION = 'foobar'
>>> algorithm.VERSION
'foobar'
```
Note: There are many instances of lines exceeding 79 characters, such as the ALGORITHM_AUTHOR_EMAIL line or this:

```python
ALGORITHM_DESCRIPTION = 'This algorithm performs a variety of calculations using RGB pixels from images in order' \
                        'to assess plant and crop health and growth'
```
According to PEP8 (https://www.python.org/dev/peps/pep-0008/#maximum-line-length): "Limit all lines to a maximum of 79 characters."
One way to fix this is to use the implicit concatenation of adjacent strings (https://www.python.org/dev/peps/pep-3126/):
```python
ALGORITHM_DESCRIPTION = ('This algorithm performs a variety of '
                         'calculations using RGB pixels from images in order '
                         'to assess plant and crop health and growth')
```
Or use + to join shorter strings:

```python
ALGORITHM_DESCRIPTION = 'This algorithm performs a variety of ' + \
                        'calculations using RGB pixels from images in order ' + \
                        'to assess plant and crop health and growth'
```
Or a join:

```python
ALGORITHM_DESCRIPTION = ''.join([
    'This algorithm performs a variety of ',
    'calculations using RGB pixels from images in order ',
    'to assess plant and crop health and growth'])
```
To access these values, the testing.py program uses the hasattr() function, e.g.:

```python
if not hasattr(algorithm_rgb, 'VARIABLE_NAMES')
```
The hasattr() function is problematic as it calls getattr() and checks for an exception. The article https://hynek.me/articles/hasattr/ does a good job explaining why this is not an ideal way to check for the existence of a property or method, so I would recommend discontinuing its use.
At the least, it would seem better to replace all hasattr() calls with getattr() using a default value.
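A minimal sketch of the getattr()-with-default approach, using a stand-in module object (types.SimpleNamespace) in place of the real algorithm_rgb module:

```python
# Stand-in for the algorithm_rgb module; only VARIABLE_NAMES is defined.
from types import SimpleNamespace

algorithm_rgb = SimpleNamespace(VARIABLE_NAMES='excess greenness index, cive')

# hasattr() plus a separate attribute access collapses into one call
# that supplies a safe default when the attribute is missing:
names = getattr(algorithm_rgb, 'VARIABLE_NAMES', '')
missing = getattr(algorithm_rgb, 'NO_SUCH_VARIABLE', '')

print(bool(names))    # True
print(bool(missing))  # False
```

This avoids the exception-swallowing behavior of hasattr() while keeping the "is it defined and non-empty?" check a single expression.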
Also, why have 14 different variables, all of which are str types with the exception of ALGORITHM_CONTRIBUTORS, which is a list (cf "stringly typed" code: https://wiki.c2.com/?StringlyTyped, https://www.techopedia.com/definition/31876/stringly-typed)?

```
VERSION
ALGORITHM_AUTHOR
ALGORITHM_AUTHOR_EMAIL
ALGORITHM_CONTRIBUTORS
ALGORITHM_NAME
ALGORITHM_DESCRIPTION
CITATION_AUTHOR
CITATION_TITLE
CITATION_YEAR
VARIABLE_NAMES
VARIABLE_UNITS
VARIABLE_LABELS
WRITE_BETYDB_CSV
WRITE_GEOSTREAMS_CSV
```
If you used a dict for this information, then you'd only be exporting one value:

```python
CONFIG = {
    'VERSION': '',
    'ALGORITHM_AUTHOR': '',
    'ALGORITHM_AUTHOR_EMAIL': '',
    'ALGORITHM_CONTRIBUTORS': '',
    'ALGORITHM_NAME': '',
    'ALGORITHM_DESCRIPTION': '',
    'CITATION_AUTHOR': '',
    'CITATION_TITLE': '',
    'CITATION_YEAR': '',
    'VARIABLE_NAMES': '',
    'VARIABLE_UNITS': '',
    'VARIABLE_LABELS': '',
    'WRITE_BETYDB_CSV': '',
    'WRITE_GEOSTREAMS_CSV': ''}
```
Or eschew using getattr() to access a module's values directly and instead call a function that returns this?

```python
def config() -> dict:
    return {
        'VERSION': '',
        'ALGORITHM_AUTHOR': '',
        'ALGORITHM_AUTHOR_EMAIL': '',
        'ALGORITHM_CONTRIBUTORS': '',
        'ALGORITHM_NAME': '',
        'ALGORITHM_DESCRIPTION': '',
        'CITATION_AUTHOR': '',
        'CITATION_TITLE': '',
        'CITATION_YEAR': '',
        'VARIABLE_NAMES': '',
        'VARIABLE_UNITS': '',
        'VARIABLE_LABELS': '',
        'WRITE_BETYDB_CSV': '',
        'WRITE_GEOSTREAMS_CSV': ''}
```
Or define a type/class to represent this as an immutable NamedTuple:

```python
from typing import NamedTuple, List

class Config(NamedTuple):
    version: str
    algorithm_author: List[str]
    algorithm_author_email: List[str]
    algorithm_contributors: List[str]
    algorithm_name: str
    algorithm_description: str
    citation_author: str
    citation_title: str
    citation_year: str
    variable_names: List[str]
    variable_units: List[str]
    variable_labels: List[str]
    write_betydb_csv: bool
    write_geostreams_csv: bool
```
And then return a Config from the function, which can be type-checked by mypy:

```python
from config import Config

def config() -> Config:
    return Config(
        version='1.0',
        algorithm_author=[
            'Chris Schnaufer', 'Clairessa Brown', 'David Lebauer'
        ],
        algorithm_author_email=[
            'schnaufer@arizona.edu', 'clairessabrown@email.arizona.edu',
            'dlebauer@email.arizona.edu'
        ],
        algorithm_contributors=["Jacob van der Leeuw"],
        algorithm_name='Greenness Transformer',
        algorithm_description=(
            'This algorithm performs a variety of '
            'calculations using RGB pixels from images in order '
            'to assess plant and crop health and growth'),
        citation_author='Clairessa Brown',
        citation_title='Woebbecke, D.M. et al',
        citation_year='2020',
        variable_names=[
            'excess greenness index', 'green leaf index', 'cive',
            'normalized difference index', 'excess red', 'exgr',
            'combined indices 1', 'combined indices 2', 'vegetative index',
            'normalized green-red difference', 'percent green'
        ],
        variable_units=[
            '[-510:510]', '[-1:1]', '[-255:255]', '[-127:129]', '[-255:255]',
            '[-255:332]', '[-1000:1000]', '[-1000:1000]', '[-255:255]',
            '[-255:255]', '[0:100]'
        ],
        variable_labels=[
            'excess_greenness_index', 'green_leaf_index', 'cive',
            'normalized_difference_index(pxarray)', 'excess_red', 'exgr',
            'combined_indices_1', 'combined_indices_2', 'vegetative_index',
            'ngrdi', 'percent_green'
        ],
        write_betydb_csv=True,
        write_geostreams_csv=True)
```
Much easier to test, too:

```python
import algorithm
from config import Config
from typing import List

def test_config():
    conf = algorithm.config()
    assert conf
    assert type(conf) == Config
    assert type(conf.version) == str
    assert type(conf.algorithm_author) == list
    assert type(conf.algorithm_author_email) == list
    assert type(conf.write_betydb_csv) == bool
```
Which leads me to ask if it’s necessary to encode this metadata into the module.
This is static information that essentially is configuration.
Further, nothing inside the algorithm_rgb module uses this information (but maybe it should?).
So perhaps this would be better encoded as JSON that lives in the same directory as the module?
You could still have this available from a function:

```python
import json
import os

def config() -> dict:
    file = os.path.join(os.path.dirname(__file__), 'config.json')
    with open(file) as fh:
        return json.load(fh)
```
Called like so:

```python
import algorithm

print(algorithm.config())
```
You could even have this structure be typed. Consider a small example:

```
$ cat config.json
{
  "version": "1.0",
  "author": ["Chris Schnaufer", "Ken Youens-Clark"],
  "author_email": ["schnaufer@arizona.edu", "kyclark@arizona.edu"],
  "write_betydb_csv": true
}
```
Where we define a Config type like so:

```
$ cat config.py
from typing import NamedTuple, List

class Config(NamedTuple):
    version: str
    author: List[str]
    author_email: List[str]
    write_betydb_csv: bool
```
Which is used by the "algorithm":

```python
import json
import os
from config import Config

def config() -> Config:
    file = os.path.join(os.path.dirname(__file__), 'config.json')
    with open(file) as fh:
        return Config(**json.load(fh))
```
Which we can call like so:

```
$ cat main.py
#!/usr/bin/env python3

import algorithm

print(algorithm.config())
```
Which will produce a typed, immutable object:

```
$ ./main.py
Config(version='1.0', author=['Chris Schnaufer', 'Ken Youens-Clark'], author_email=['schnaufer@arizona.edu', 'kyclark@arizona.edu'], write_betydb_csv=True)
```
Were this information to be stored as JSON, it still raises the question of how to produce valid JSON, so it would be good to consider a proper configuration language like Dhall. In this version, I create the "author" as a structure that includes both the "name" and "email" so that it is not possible to generate a configuration that leaves out one of these values. The same could/should be done for the variable name/label/unit:
```
$ cat config.dhall
-- ./config.dhall
let Prelude =
      https://prelude.dhall-lang.org/v11.1.0/package.dhall sha256:99462c205117931c0919f155a6046aec140c70fb8876d208c7c77027ab19c2fa

let Author = { name : Text, email : Text }

let authors : List Author =
      [ { name = "Chris Schnaufer", email = "schnaufer@arizona.edu" }
      , { name = "Ken Youens-Clark", email = "kyclark@arizona.edu" }
      ]

in  { authors = authors
    , version = "1.0"
    , write_betydb_csv = True
    }
```
From which we can derive JSON:

```
$ dhall-to-json --file config.dhall --output config.json
$ cat config.json
{
  "authors": [
    {
      "email": "schnaufer@arizona.edu",
      "name": "Chris Schnaufer"
    },
    {
      "email": "kyclark@arizona.edu",
      "name": "Ken Youens-Clark"
    }
  ],
  "version": "1.0",
  "write_betydb_csv": true
}
```
The Config class would likewise need to be changed to reflect this.
Note: Should every algorithm return the same structure/metadata? That is, are the 14 fields listed above exhaustive or just a minimal set? Can an algorithm return other/more/less data?
Inside "testing.py" is the function check_configuration(), which checks whether the VARIABLE_NAMES variable exists in the package, so it checks only one of the 14 values:

```python
def check_configuration():
    """Checks if the configuration is setup properly for testing
    """
    if not hasattr(algorithm_rgb, 'VARIABLE_NAMES') or not algorithm_rgb.VARIABLE_NAMES:
        sys.stderr.write("Variable names configuration variable is not defined yet. Please define and try again")
        sys.stderr.write("    Update configuration.py and set VALUE_NAMES variable with your variable names")
        return False

    return True
```
Further, the _get_variables_header_fields() function inspects VARIABLE_NAMES, VARIABLE_LABELS, and VARIABLE_UNITS to check whether they are all the same length:

```python
def _get_variables_header_fields() -> str:
    """Returns a string representing the variable header fields
    Return:
        Returns a string representing the variables' header fields
    """
    variables = algorithm_rgb.VARIABLE_NAMES.split(',')  # (1)
    labels = algorithm_rgb.VARIABLE_LABELS.split(',')
    labels_len = len(labels)
    units = algorithm_rgb.VARIABLE_UNITS.split(',')
    units_len = len(units)

    if labels_len != len(variables):  # (2)
        sys.stderr.write("The number of defined labels doesn't match the number of defined variables")
        sys.stderr.write("  continuing processing")
        sys.stderr.write("\n")
    if units_len != len(variables):  # (3)
        sys.stderr.write("The number of defined units doesn't match the number of defined variables")
        sys.stderr.write("  continuing processing")
        sys.stderr.write("\n")

    headers = ''
    for idx, variable_name in enumerate(variables):
        variable_header = variable_name
        if idx < labels_len:
            variable_header += ' - %s' % labels[idx]
        if idx < units_len:
            variable_header += ' (%s)' % units[idx]
        headers += variable_header + ','

    return headers  # (4)
```
1. Splitting a string on a comma to get a list. Why not store the values as a list in the first place?
2. Here and in (3), it's not a fatal error if the names/labels/units don't match. So how can we be sure the values are correct? What if there are 10 names but 5 labels and 15 units? Shouldn't that be an error?
3. Also not an error, just a warning.
4. Why return a string (note "stringly typing" above)? Wouldn't a List[str] be better? Or a List[Measurement] where the name/label/unit have been explicitly defined?
Note that these values are coded with commas and spaces separating the values:

```python
>>> VARIABLE_NAMES
'excess greenness index, green leaf index, cive, normalized difference index, excess red, exgr, combined indices 1, combined indices 2, vegetative index, normalized green-red difference, percent green'
```
So splitting them on a comma leaves a leading space that I imagine is not intentional, e.g., ' green leaf index' instead of 'green leaf index':

```python
>>> VARIABLE_NAMES.split(',')
['excess greenness index', ' green leaf index', ' cive', ' normalized difference index', ' excess red', ' exgr', ' combined indices 1', ' combined indices 2', ' vegetative index', ' normalized green-red difference', ' percent green']
```
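If the values must stay as one comma-separated string, a small sketch of stripping the stray whitespace after the split (VARIABLE_NAMES here is a shortened stand-in for the real value):

```python
# Shortened stand-in for the module's actual VARIABLE_NAMES string
VARIABLE_NAMES = 'excess greenness index, green leaf index, cive'

# strip() removes the space left behind after each comma
names = [name.strip() for name in VARIABLE_NAMES.split(',')]
print(names)  # ['excess greenness index', 'green leaf index', 'cive']
```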
Storing the name/label/units in three different strings/lists seems like an invitation to errors. Note the above representation of an Author where "name" and "email" are required values. At the least, consider a data structure that unifies the idea of a variable into one structure, even if it's just a dict.
The algorithm_rgb.calculate() function currently returns a list of floating-point values, but the return type is annotated as just list:

```python
def calculate(pxarray: np.ndarray) -> list:
    return [
        excess_greenness_index(pxarray),
        green_leaf_index(pxarray),
        cive(pxarray),
        normalized_difference_index(pxarray),
        excess_red(pxarray),
        exgr(pxarray),
        combined_indices_1(pxarray),
        combined_indices_2(pxarray),
        vegetative_index(pxarray),
        ngrdi(pxarray),
        percent_green(pxarray)
    ]
```
Recommend at least annotating the return value as List[float].
What does the calling code expect? Is every algorithm expected to return the same thing?
Should this perhaps return a dict (or TypedDict) so that the values are explicitly available by name rather than assumed to be in position?

```python
    return {
        'excess_greenness_index': excess_greenness_index(pxarray),
        'green_leaf_index': green_leaf_index(pxarray),
        'cive': cive(pxarray),
        'normalized_difference_index': normalized_difference_index(pxarray),
        'excess_red': excess_red(pxarray),
        'exgr': exgr(pxarray),
        'combined_indices_1': combined_indices_1(pxarray),
        'combined_indices_2': combined_indices_2(pxarray),
        'vegetative_index': vegetative_index(pxarray),
        'ngrdi': ngrdi(pxarray),
        'percent_green': percent_green(pxarray)
    }
```
This could, of course, just as easily be a list of ("name", "value") tuples.
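To sketch the TypedDict idea: the return type names each key and its value type, so mypy can check both. Only three of the eleven indices are shown, and the values are stand-ins rather than computed from a pixel array:

```python
from typing import TypedDict

class Indices(TypedDict):
    excess_greenness_index: float
    green_leaf_index: float
    cive: float

def calculate() -> Indices:
    # Stand-in values; the real function would compute these from pxarray
    return Indices(excess_greenness_index=14.0,
                   green_leaf_index=0.02,
                   cive=16.16)

result = calculate()
print(result['cive'])  # 16.16
```

With this, a caller asking for a misspelled key, or treating a value as a str, would be flagged by mypy rather than failing at runtime.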
Could this be better handled as a new type (perhaps based on NamedTuple)?
For instance, I see that testing._get_variables_header_fields() inspects the metadata from algorithm_rgb for VARIABLE_NAMES, VARIABLE_LABELS, and VARIABLE_UNITS, checks that they are all the same length, and then returns a str.
Does some other code use these values to match up with the measurements?
Should that data be included with the return values for each?
That is, excess_greenness_index() currently returns a float.
Should it instead return a record that includes:

- value: float
- name: str
- unit: str
- label: str
I notice the "unit" for this measurement is a str like "[-510:510]", which follows a pattern for all the other units that look like possible [low:high] bounds for the value. Could this be better represented as a tuple like (-510, 510)?
This in turn could possibly become a NewType:

```python
>>> from typing import NewType, Tuple
>>> Unit = NewType('Unit', Tuple[float, float])
>>> unit1 = Unit((-510, 501))
>>> type(unit1)
<class 'tuple'>
>>> unit1
(-510, 501)
```
Then you could use type checking to verify the return with a type:
```python
Unit = NewType('Unit', Tuple[float, float])

class Measurement(NamedTuple):
    value: float
    name: str
    label: str
    unit: Unit

def excess_greenness_index(pxarray: np.ndarray) -> Measurement:
    red, green, blue = get_red_green_blue_averages(pxarray)

    return Measurement(value=round(2 * green - (red + blue), 2),
                       name='excess greenness index',
                       label='excess_greenness_index',
                       unit=Unit((-510, 501)))
```
Then you get an immutable, typed value back from the function:

```python
>>> import algorithm_rgb_type as a2
>>> a2.excess_greenness_index(pix1)
Measurement(value=14.0, name='excess greenness index', label='excess_greenness_index', unit=(-510, 501))
```
The current https://github.com/AgPipeline/transformer-rgb-indices/blob/master/testing.py program demonstrates a way to use the algorithm_rgb.py module to see if it works in some way, but it falls short of fully testing the module/functions.
This program also manages a number of tasks manually that would be better done using standard modules.
The "testing.py" module has two functions associated with handling arguments and printing the usage:

```python
def check_arguments():
    """Checks that we have script argument parameters that appear valid
    """
    argc = len(sys.argv)  # (1)
    if argc < 2:  # (2)
        sys.stderr.write("One or more paths to images need to be specified on the command line\n")
        print_usage()
        return False  # (3)

    # Check that the paths exist.
    have_errors = False
    for idx in range(1, argc):  # (4)
        if not os.path.exists(sys.argv[idx]):  # (5)
            print("The following path doesn't exist: " + sys.argv[idx])
            have_errors = True

    if have_errors:
        sys.stderr.write("Please correct any problems and try again\n")

    return not have_errors  # (6)


def print_usage():
    """Displays information on how to use this script
    """
    argc = len(sys.argv)  # (7)
    if argc:
        our_name = os.path.basename(sys.argv[0])  # (8)
    else:
        our_name = os.path.basename(__file__)

    print(our_name + " <folder>|<filename> ...")  # (9)
    print("  folder:   path to folder containing images to process")  # (10)
    print("  filename: path to an image file to process")
    print("")  # (11)
    print("  One or more folders and/or filenames can be used")
    print("  Only files at the top level of a folder are processed")
```
1. sys.argv is a list containing the path to the currently running program (i.e., the "testing.py" program itself) followed by any other values. This program appears to rely upon positional parameters only, so no named options. If it is necessary to handle sys.argv manually, recommend at least using sys.argv[1:] so as to skip the program name and handle only the actual arguments, as this will help avoid off-by-one errors.
2. We really only need 1 argument, but the off-by-one problem shows here.
3. Three lines of code to handle printing an error, usage, and returning a False value from the function, but nothing here will make the program itself return an error code to the command line. See below.
4. Another instance of needing to skip the first value, as it is not actually an argument.
5. Manually checking that a given argument exists, which could mean either a directory or a file.
6. Recommend always using positive variable names like "is_ok" with a default of True, setting it to False when there is a problem, so that you can return is_ok. The brain has to work extra to work out the negative of not have_errors.
7. The argc variable is used just once. If you change the code to if len(sys.argv): then pylint would complain "Do not use len(SEQUENCE) without comparison to determine if a sequence is empty (len-as-condition)". The more common idiom would be if sys.argv:.
8. Given the binary choice, an if expression would be better (see below). Also, this code relies on the fact that Python's variable scoping is really terrible. In a stricter language, our_name would not be visible after the if/else block, but in Python it is. Recommend initializing the variable before the block or, better, using an if expression.
9. Manually printing the usage.
10. These 5 separate print() calls could be handled with one print() where the text is provided as a single triple-quoted (""") string, which is more idiomatic.
11. Note that print() (with no arguments) will accomplish the same thing as this.
Remark #1 above relies on a strange behavior of Python in that requesting list slices for non-existent ranges will result in an empty list rather than an exception:

```python
>>> x = ['foo', 'bar']
>>> x[10:]
[]
```
Remark #3 above is due to how this function is called:

```python
if __name__ == "__main__":
    if check_arguments() and check_configuration():
        process_files()
```
If check_arguments() returns False, then process_files() never executes, but nothing ever tells the program to exit with a non-zero value.
One way to fix this would be to add an explicit sys.exit() call:

```python
if __name__ == "__main__":
    if check_arguments() and check_configuration():
        process_files()
    else:
        sys.exit(1)
```
Note that calling sys.exit() with a str value will cause the str to be printed to sys.stderr and the program to exit with the value 1:

```python
if __name__ == "__main__":
    if check_arguments() and check_configuration():
        process_files()
    else:
        sys.exit('Something went wrong')
```
For remark #8, recommend rewriting as:

```python
our_name = os.path.basename(sys.argv[0] if sys.argv else __file__)
```
Lastly, while this program will produce a usage, it does not respond to the standard -h or --help flags:

```
$ ./testing.py -h
The following path doesn't exist: -h
Please correct any problems and try again
```
I have written an alternate version of this program at https://github.com/kyclark/configcode/blob/master/testing/testing.py that uses the standard argparse module to handle the arguments. Most of the above code can be handled by argparse.
My version will accept one or more input files.
The program will produce a usage when run with no arguments or the "help" flags:

```
$ ./testing.py -h
usage: testing.py [-h] FILE [FILE ...]

Test algorithm

positional arguments:
  FILE        Input file(s)

optional arguments:
  -h, --help  show this help message and exit
```
Any non-file argument is validated and rejected by argparse:

```
$ ./testing.py blarg
usage: testing.py [-h] FILE [FILE ...]
testing.py: error: argument FILE: can't open 'blarg': [Errno 2] No such file or directory: 'blarg'
```
Note that argparse will reject any undefined arguments such as -x:

```
$ ./testing.py -x test_input/*
usage: testing.py [-h] FILE [FILE ...]
testing.py: error: unrecognized arguments: -x
```
Here is the relevant section that handles the command-line arguments and usage:

```python
# --------------------------------------------------
def get_args():
    """Get command-line arguments"""

    parser = argparse.ArgumentParser(
        description='Test algorithm',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('file',
                        help='Input file(s)',
                        metavar='FILE',
                        type=argparse.FileType('r'),  # (1)
                        nargs='+')  # (2)

    return parser.parse_args()


# --------------------------------------------------
def main():
    """Make a jazz noise here"""

    args = get_args()  # (3)
    for fh in args.file:  # (4)
        fh.close()  # (5)
        run_test(fh.name)  # (6)
```
1. This will cause argparse to validate that the positional arguments are readable ('r') files.
2. The nargs is for the "number of arguments," and + means "one or more."
3. All the parsing and validation of the arguments happens here. If the arguments are invalid in any way, then this line will fail, the usage will be printed, and the program will exit with a non-zero value.
4. The args.file value will be a list of one or more open file handles.
5. Need to close the file handle for gdal.Open() to work.
6. Pass the file's name.
In the original "testing.py," the process_files() function needs to decide if the arguments to the program are files or directories, the latter of which it will walk to find files:

```python
def process_files():
    """Processes the command line file/folder arguments
    """
    argc = len(sys.argv)  # (1)
    if argc:  # (2)
        print("Filename," + _get_variables_header_fields())
        for idx in range(1, argc):  # (3)
            cur_path = sys.argv[idx]
            if not os.path.isdir(cur_path):  # (4)
                run_test(cur_path)
            else:
                allfiles = [os.path.join(cur_path, fn) for fn in os.listdir(cur_path)
                            if os.path.isfile(os.path.join(cur_path, fn))]  # (5)
                for one_file in allfiles:
                    run_test(one_file)
```
1. This function separately parses the command-line arguments, making it an impure function. Recommend that the arguments be parsed in one place (e.g., something like get_args()) and those values passed in. The typing module has a TextIO type that could be used in the function signature, like process_files(List[TextIO]), which would make it easier to validate and check with mypy.
2. Another instance of using a sequence's length without comparison (see above). Better to say if sys.argv:. Of greater concern is that there is no else. Should something happen if no files are passed?
3. This is a very C-like for loop using the index positions of the list. If the arguments were taken like args = sys.argv[1:] then you could use a more Pythonic for arg in args: here.
4. Recommend avoiding not and accentuating the positive. Change to if os.path.isfile(cur_path) and swap the blocks.
5. Extremely long line that should be broken up. Use a code formatter like yapf to fix. Also consider using the Python glob library to recursively find files (https://docs.python.org/3/library/glob.html).
Note that all this code is obviated by the version I show that requires the files to be passed as arguments to the program.
Also note that this function marches to 5 levels of indentation which seems a bit much to me (cf https://stackoverflow.com/questions/10959683/preferred-maximum-indentation-in-python).
The "testing.py" module has a function called run_test() that accepts a filename, processes it with the algorithm_rgb.calculate() function, and reports an error if it encounters a problem:

```python
def run_test(filename):
    """Runs the extractor code using pixels from the file
    Args:
        filename(str): Path to image file
    Return:
        The result of calling the extractor's calculate() method
    Notes:
        Assumes the path passed in is valid. An error is reported if
        the file is not an image file.
    """
    try:
        open_file = gdal.Open(filename)
        if open_file:
            # Get the pixels and call the calculation
            pix = np.array(open_file.ReadAsArray())
            calc_val = algorithm_rgb.calculate(np.rollaxis(pix, 0, 3))  # (1)

            # Check for unsupported types
            if isinstance(calc_val, set):  # (2)
                raise RuntimeError("A 'set' type of data was returned and isn't supported. Please use a list or a tuple instead")  # (3)

            # Perform any type conversions to a printable string
            if isinstance(calc_val, str):  # (4)
                print_val = calc_val
            else:
                # Check if the return is iterable and comma separate the values if it is
                try:
                    _ = iter(calc_val)  # (5)
                    print_val = ",".join(map(str, calc_val))
                except Exception:
                    print_val = str(calc_val)

            print(filename + "," + print_val)
    except Exception as ex:
        sys.stderr.write("Exception caught: " + str(ex) + "\n")
        sys.stderr.write("    File: " + filename + "\n")
```
1. Quite a bit happens before the calculate() function is called. I have yet to look into the code that would call this function, so I have to assume this will be handled elsewhere and this function is assumed to handle only a matrix of pixels.
2. Rather than checking for what is not allowed (here a set), maybe check that it's something that is allowed, i.e., list or tuple. Another instance where type hints could help? Lists and tuples are sort of interchangeable, but I would suggest allowing only one type from a function. I would probably choose a tuple given their immutability.
3. Why raise a RuntimeError when the except handles nothing but a generic Exception?
4. It would appear that the calculate() function is allowed to return lists, tuples, and strings, which seems incredibly dangerous to me. A function should return only one type. The typing.Optional or typing.Union might be useful here if it's necessary to return None, or you really need to mix strings and lists, but it still seems like a very bad idea.
5. Here the iter() function is being called to see if it will produce an exception, which seems like a really bad idea. Also note that dictionaries and file handles are iterable and so would pass this line.
Note: This is another function with 5 levels of indentation that includes try/catch inside try/catch, both of which strike me as too complicated.
The most notable problem with this function is that it never verifies that the algorithm_rgb.calculate() function returns the correct answer. For instance, I can change the function to this:
```python
def calculate(pxarray: np.ndarray) -> list:
    pass
```
And then run the program:

```
$ ./testing.py ../sample_plots/*
Filename,excess greenness index - excess_greenness_index ([-510:510]), green leaf index - green_leaf_index ( [-1:1]), cive - cive ( [-255:255]), normalized difference index - normalized_difference_index(pxarray) ( [-127:129]), excess red - excess_red ( [-255:255]), exgr - exgr ( [-255:332]), combined indices 1 - combined_indices_1 ( [-1000:1000]), combined indices 2 - combined_indices_2 ( [-1000:1000]), vegetative index - vegetative_index ( [-255:255]), normalized green-red difference - ngrdi ( [-255:255]), percent green - percent_green ( [0:100]),
../sample_plots/rgb_17_7_W.tif,None
../sample_plots/rgb_1_2_E.tif,None
../sample_plots/rgb_33_8_W.tif,None
../sample_plots/rgb_40_11_W.tif,None
../sample_plots/rgb_5_11_W.tif,None
../sample_plots/rgb_6_1_E.tif,None
```
A test should use a known input and verify that a function/program will produce an expected output.
The https://github.com/kyclark/configcode/blob/master/testing/testing.py version uses this same run_test() and so is not a recommended solution. That program is merely provided to demonstrate how the original program can be shortened from 151 lines of code (LOC) to 85, all while using standard modules.
I have provided a separate test file, https://github.com/kyclark/configcode/blob/master/testing/test.py, that demonstrates how pytest can be used to create unit tests for the https://github.com/kyclark/configcode/blob/master/testing/algorithm_rgb.py file.
Note: Is it possible for any of the functions to return anything other than a float? That is, the tests only use good input files, and I think it's crucial to run tests using known bad values to ensure the code gracefully handles errors. What would happen if a corrupted file were used, or one that could conceivably create values of 0 for R+G+B, which I see is used as a denominator in division?
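A hedged sketch of the kind of failure that note worries about: an all-black image averages R, G, and B to 0, so any index dividing by a sum of the channels raises ZeroDivisionError. green_leaf_index_like() is a hypothetical stand-in with a [green-leaf-index]-shaped formula, not the repo's actual function:

```python
# Hypothetical stand-in: divides by a sum of the channel averages, as
# several of the real indices do
def green_leaf_index_like(red: float, green: float, blue: float) -> float:
    return (2 * green - red - blue) / (2 * green + red + blue)

# A known-bad input: all channels zero, as an all-black image might produce
try:
    green_leaf_index_like(0.0, 0.0, 0.0)
    failed = False
except ZeroDivisionError:
    failed = True

print(failed)  # True: a unit test with a known-bad input would catch this
```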
Here is a simple testing file which uses two known files (input1 and input2) and verifies that each function in algorithm_rgb will return the correct value:

```python
import algorithm_rgb as al
import os
import osgeo.gdal as gdal
import numpy as np
import json

input1 = './test_input/rgb_1_2_E.tif'
input2 = './test_input/rgb_40_11_W.tif'
meta = './meta.json'


# --------------------------------------------------
def test_input_files():
    """Test input files exist"""

    assert os.path.isfile(input1)
    assert os.path.isfile(input2)


# --------------------------------------------------
def test_get_red_green_blue_averages():
    """Test get_red_green_blue_averages"""

    assert al.get_red_green_blue_averages(read_input(input1)) == (
        166.8537142857143, 160.37885714285713, 139.89971428571428)

    assert al.get_red_green_blue_averages(read_input(input2)) == (
        109.85485714285714, 144.25085714285714, 90.381)


# --------------------------------------------------
def test_excess_greenness_index():
    """Test excess_greenness_index"""

    assert al.excess_greenness_index(read_input(input1)) == 14.0
    assert al.excess_greenness_index(read_input(input2)) == 88.27


# --------------------------------------------------
def test_green_leaf_index():
    """Test green_leaf_index"""

    assert al.green_leaf_index(read_input(input1)) == 0.02
    assert al.green_leaf_index(read_input(input2)) == 0.18


# --------------------------------------------------
def test_cive():
    """Test cive"""

    assert al.cive(read_input(input1)) == 16.16
    assert al.cive(read_input(input2)) == -14.96


# --------------------------------------------------
def test_normalized_difference_index():
    """Test normalized_difference_index"""

    assert al.normalized_difference_index(read_input(input1)) == -1.53
    assert al.normalized_difference_index(read_input(input2)) == 18.33


# --------------------------------------------------
def test_excess_red():
    """Test excess_red"""

    assert al.excess_red(read_input(input1)) == 56.53
    assert al.excess_red(read_input(input2)) == -1.44


# --------------------------------------------------
def test_exgr():
    """Test exgr"""

    assert al.exgr(read_input(input1)) == -42.53
    assert al.exgr(read_input(input2)) == 89.71


# --------------------------------------------------
def test_combined_indices_1():
    """Test combined_indices_1"""

    assert al.combined_indices_1(read_input(input1)) == 30.16
    assert al.combined_indices_1(read_input(input2)) == 73.31


# --------------------------------------------------
def test_combined_indices_2():
    """Test combined_indices_2"""

    assert al.combined_indices_2(read_input(input1)) == 12.81
    assert al.combined_indices_2(read_input(input2)) == 24.98


# --------------------------------------------------
def test_vegetative_index():
    """Test vegetative_index"""

    assert al.vegetative_index(read_input(input1)) == 1.02
    assert al.vegetative_index(read_input(input2)) == 1.4


# --------------------------------------------------
def test_ngrdi():
    """Test ngrdi"""

    assert al.ngrdi(read_input(input1)) == -0.02
    assert al.ngrdi(read_input(input2)) == 0.14


# --------------------------------------------------
def test_percent_green():
    """Test percent_green"""

    assert al.percent_green(read_input(input1)) == 0.34
    assert al.percent_green(read_input(input2)) == 0.42


# --------------------------------------------------
def test_calculate():
    """Test calculate"""

    assert al.calculate(read_input(input1)) == [
        14.0, 0.02, 16.16, -1.53, 56.53, -42.53, 30.16, 12.81, 1.02, -0.02,
        0.34
    ]

    assert al.calculate(read_input(input2)) == [
        88.27, 0.18, -14.96, 18.33, -1.44, 89.71, 73.31, 24.98, 1.4, 0.14,
        0.42
    ]


# --------------------------------------------------
def read_input(file) -> np.ndarray:
    """Read pixels from a file"""

    if fh := gdal.Open(file):
        pix = np.array(fh.ReadAsArray())
        return np.rollaxis(pix, 0, 3)


# --------------------------------------------------
def test_meta():
    """Test meta"""

    assert os.path.isfile(meta)
    data = json.load(open(meta))
    assert data['authors']
```
Using pytest
to run this test suite will produce a familiar output:
```
$ pytest -xv test.py
============================= test session starts ==============================
...
test.py::test_input_files PASSED                                         [  6%]
test.py::test_get_red_green_blue_averages PASSED                         [ 13%]
test.py::test_excess_greenness_index PASSED                              [ 20%]
test.py::test_green_leaf_index PASSED                                    [ 26%]
test.py::test_cive PASSED                                                [ 33%]
test.py::test_normalized_difference_index PASSED                         [ 40%]
test.py::test_excess_red PASSED                                          [ 46%]
test.py::test_exgr PASSED                                                [ 53%]
test.py::test_combined_indices_1 PASSED                                  [ 60%]
test.py::test_combined_indices_2 PASSED                                  [ 66%]
test.py::test_vegetative_index PASSED                                    [ 73%]
test.py::test_ngrdi PASSED                                               [ 80%]
test.py::test_percent_green PASSED                                       [ 86%]
test.py::test_calculate PASSED                                           [ 93%]
test.py::test_meta PASSED                                                [100%]
============================== 15 passed in 0.20s ==============================
```
To demonstrate the output when code is failing, I can introduce an error like so:
```python
def test_cive():
    """Test cive"""

    assert al.cive(read_input(input1)) == None  # 16.16 <<<< Changing to None
    assert al.cive(read_input(input2)) == -14.96
```
And now the test output reads:
```
$ pytest -xv test.py
============================= test session starts ==============================
...
test.py::test_input_files PASSED                                         [  6%]
test.py::test_get_red_green_blue_averages PASSED                         [ 13%]
test.py::test_excess_greenness_index PASSED                              [ 20%]
test.py::test_green_leaf_index PASSED                                    [ 26%]
test.py::test_cive FAILED                                                [ 33%]
=================================== FAILURES ===================================
__________________________________ test_cive ___________________________________

    def test_cive():
        """Test cive"""

>       assert al.cive(read_input(input1)) == None  # 16.16
E       assert 16.16 == None
E         +16.16
E         -None

test.py:52: AssertionError
=========================== short test summary info ============================
FAILED test.py::test_cive - assert 16.16 == None
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 failed, 4 passed in 0.51s ==========================
```
The pytest
module will integrate with the coverage
(https://coverage.readthedocs.io/en/coverage-5.2/) module to help determine how much of the code is covered by tests:
```
$ coverage run -m pytest test.py
============================= test session starts ==============================
...
test.py ...............                                                  [100%]
============================== 15 passed in 0.24s ==============================

$ coverage report
Name               Stmts   Miss  Cover
--------------------------------------
algorithm_rgb.py      41      0   100%
test.py               58      0   100%
--------------------------------------
TOTAL                 99      0   100%
```
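If coverage checks become a routine part of the workflow, the coverage module can be configured with a `.coveragerc` file. The settings below are suggestions of mine and are not present in the repository:

```ini
# .coveragerc -- suggested settings, not in the repository
[run]
# measure only the algorithm module, not the tests themselves
source = algorithm_rgb

[report]
# make `coverage report` exit non-zero below this total percentage
fail_under = 90
# list the line numbers that are not covered
show_missing = True
```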
These unit tests cover all the functions in the algorithm_rgb.py
module.
I have only investigated how to use pytest
to create unit tests for the given module.
I do not know to what extent contributors of algorithms will be expected to create such tests.
This may or may not be beyond the capabilities of the typical programmer, but I believe a thorough explanation and demonstration will encourage people to contribute a full test suite.
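One pytest feature that might lower that barrier is `pytest.mark.parametrize`: a contributor supplies only a table of inputs and expected values, and pytest generates one test case per row. A toy sketch (the function and values here are mine, not from the module):

```python
import pytest


def add(a, b):
    """Stand-in for any algorithm function under test"""
    return a + b


# each tuple below becomes its own test case in pytest's report
@pytest.mark.parametrize('a, b, expected', [
    (1, 2, 3),
    (2, 3, 5),
    (-1, 1, 0),
])
def test_add(a, b, expected):
    assert add(a, b) == expected
```

Adding a new case is then a one-line change to the table rather than a new function.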
I especially feel that the functional nature of pytest
makes it rather easy to create and run tests (as opposed to an object-oriented test suite, cf https://docs.python.org/3/library/unittest.html).
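To make the contrast concrete, here is the same trivial check written both ways (a toy example of mine):

```python
import unittest


# unittest style: tests are methods on a TestCase subclass and use
# special assertion methods such as assertEqual()
class TestAddUnittest(unittest.TestCase):
    def test_add(self):
        self.assertEqual(2 + 2, 4)


# pytest style: a plain function and a bare assert statement
def test_add_pytest():
    assert 2 + 2 == 4
```

pytest will happily collect and run both styles, so nothing is lost by preferring the functional form.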
Next I need to see how an algorithm is integrated into the greater system, which is probably where more intense and focused testing should occur.
Ken Youens-Clark <kyclark@arizona.edu>