Skip to content

Legacy Conversion Tips

Sulekha Kulkarni edited this page Dec 8, 2020 · 10 revisions

Tips for Converting Legacy Code

This document assumes the use of the checkedc-clang compiler. For information on downloading or building the compiler see its repo.

Convert included headers first

Before doing any other conversion, change the included standard library headers to be the checked versions of these headers, e.g. stdlib_checked.h. There is a script called update-includes.py in the checked-c-convert utilities folder that will convert all the standard headers included in a list of input files.

Optionally, make use of automated conversion tools

The checked-c-convert (3C) utility will examine the source tree of a program and determine what items can be automatically converted to _Ptr types or have bounds-safe interfaces added. It takes a list of input files, similar to update-includes.py, that represents all source files used in the compilation of a program. The easiest way to produce such an input file is to use CMake's EXPORT_COMPILE_COMMANDS option to produce a compile_commands.json and then use the utility script convert-commands.py on that json file to convert it into the appropriate format. If the legacy code to be converted does not use CMake, there are other options, as documented in the UMD fork of 3C. Ideally, the results of compilation before and after running 3C should be identical. 3C can be re-run at other points after additional code has been changed, for further automated conversion help.

Decide on structure interfaces before functions

For programs that define their own data structs with pointer fields, making decisions on the conversion or bounds-safe interfacing of those fields ahead of time will prevent repeated or unnecessary work. Once the bounds are added to the field of a struct, any checked code making use of that struct will then correctly flag unchecked variables that interact with it. As these issues are resolved, adding bounds-safe interfaces to the functions and their callers should be straightforward.

Use bounds-safe interfaces rather than explicit conversions when working with externally linked functions

For functions that are static, if all callers in that translation unit work in a checked scope, then it might be possible to explicitly convert these functions to using Checked C types only. Otherwise, bounds safe-interface annotations on the function's parameters and return value allow interoperability with both external callers and the bounds-checking analysis.

Use a pragma to enforce checked compilation for entire file.

Checked C supports a #pragma for setting an entire scope to be checked. This will surface a number of errors and warnings that may not have been previously visible. You can use

#pragma CHECKED_SCOPE ON

at the top of your file, after including headers.

Use temporary variables for string lengths, bounds-checked pointer arithmetic in function arguments

As a policy, restrict Checked C related modifications to declarations as much as possible, and modify as little of the main body of code as possible.

Code parsing strings might have function calls of the form

function_call(a_string, strlen(a_string))

where a_string has type _Nt_array_ptr<char> and therefore expected length of 0 but function_call expects a string parameter whose bounds match the length parameter.

function_call(a_string + prefix_len + 1, strlen(a_string) - (prefix_len+1))

is a closely related pattern. Both of these will require a _Dynamic_bounds_cast to reset the bounds on the first argument to function_call, and that requires, as the compiler may hint, that any length calculation is done first and then cached in a temporary variable. Additionally for code readability purposes it is often easier to make a temporary variable for the first argument as well, with the appropriate bounds defined via _Dynamic_bounds_cast in the definition of the temporary variable.

Use _Dynamic_bounds_cast to validate bounds that the compiler cannot prove statically

Two common situations are when: a) the pointer moves, and we need to dynamically confirm that it is within bounds (first example below), and b) when the bounds of a pointer need to be narrowed (second example below).

int foo(....) {
     ....
     s = _Dynamic_bounds_cast<_Nt_array_ptr<char>>(base + ((s[0]&0x3f << 8 | s[1], bounds(arg_s, arg_s + len_s));
     ....
}
int foo(...) {
   .....
   _Array_ptr<const char> arg_end_1 : count(arg_l) = _Dynamic_bounds_cast<_Array_ptr<const char>>(end, count(arg_l));
   memcmp(arg_base_1, arg_end_1, arg_l)
   // NOTE: The declared bounds of end are (arg_src, arg_end). We cannot pass the variable 'end' itself as the second argument to function 
   // memcmp because the compiler cannot statically prove that the bounds of the second parameter of the bounds-safe 
   // interface of function memcmp are equal to or narrower than the bounds of variable 'end'.
   ....
}

Places where unchecked blocks might be necessary

Although we seek to convert as much code as possible, some code will require features that cannot be bounds-checked. Surrounding these snippets in _Unchecked { ... } hints to the compiler that errors in these blocks can be ignored and hints to future maintainers of this code that these spots are ones to examine carefully whenever anything goes wrong. Unchecked blocks work like any other compiler block in terms of scoping local variables. Successfully quieting the compiler warnings in an _Unchecked {...} block might require the introduction of temporary unchecked local variables to which checked variables are cast before their use.

Very fragile logic

In some cases, the compiler will warn that it cannot prove the bounds for arguments to some function calls and there is no immediately visible way to ascertain that the bounds are correct. For example, when a function takes buffers but no length arguments and then uses those buffers in other functions that require length arguments e.g. strncmp an analysis of the bounds correctness would have to examine the callers of the function and whether their logic always produces the correct conditions or not. In some cases it may be possible to use the _Dynamic_check operator to assert some fact that should always hold about the buffer (or other warning source). However, in other cases, such as when needing to prove bounds facts about uninitialized memory, it will be necessary to enclose the offending calls in an unchecked block. Functions that produce reactions of well this does just work without buffer overflows, but if anything elsewhere changes, there might be problems will certainly be places for maintainers to check in the future and the use of the unchecked block puts the logic at high priority for such attention.

Functions that take varargs

Functions like printf or sprintf take varargs: specifically a format string and then whatever additional arguments are necessary to fill in that string. At present, the checkedc-clang compiler does not support varargs because of reasoning issues, so all uses of sprintf and related functions will have to occur inside unchecked blocks.

Macros

Macros often hide pointer dereferencing in away that does not support the introduction of a bounds-safe interface. Consider putting the unchecked block in the macro itself, rather than in callers. Additionally, the compiler will emit notes about where the ultimate source of the problem is; if it is a function in a C standard header, please file a bug to see if we can convert it to having a bounds-safe interface via a *_checked.h file in checkedc.

Double pointer subtleties

In many cases, double pointers will be easily converted to appropriate checked pointers. E.g. a list of pointers to objects of type Foo might be _Array_ptr<_Ptr<Foo>> and the list can be given bounds just like a single pointer; a list of names might be _Array_ptr<_Nt_array_ptr<char>> and the length of the list is known but the individual names can be arbitrary lengths.

However, there may also be cases where e.g. char ** is actually a pointer to a list, where the the list is a checked type with known bounds. CheckedC does not currently support bounds on the inner type of a double pointer. Code requiring this may need to be left in an unchecked block. This limitation also means that taking the address of a bounded checked object is not currently supported e.g. _Nt_array_ptr<char> str ....; op(&str);.

Strings that need their bounds specified in terms of their length

_Nt_array_ptr<char> types usually have bounds of count(0). Sometimes a function that uses such a string expects its bounds to be specified with respect to its length. Right now this must be done as follows:

   ....
   unsigned int l = strlen(str);
   _Nt_array_ptr<char> s : count(l) = 0;
   _Unchecked { s = _Assume_bounds_cast<_Nt_array_ptr<char>>(str, count(l)); }
   ...

Note that using _Dynamic_bounds_cast in this case will trigger a runtime check failure.

Updating variables in a bounds expression

Updating a variable that appears in a bounds expression can be tricky in Checked C. Typically when there is a compiler warning or error that is caused by writing to a variable used in some declared bounds, we can make a copy of the original variable and modify the copy instead.

Consider the code

void f() {
  int a[] = {1, 2, 3};
  int len = 3;
  _Array_ptr<int> p : count(len) = a;
  while (len >= 0) {
    *p = 0;
    len--;
  }
}

The compiler issues an error on this code:

error: inferred bounds for 'p' are unknown after decrement
    len--;

To avoid this error, we can introduce a temporary alen that can be used in bounds expression so that the variables used in the bounds expression remain unmodified in the loop. For example,

void f() {
  int a[] = {1, 2, 3};
  int len = 3;
  int alen = len;
  _Array_ptr<int> p : count(alen) = a;
  while (len >= 0) {
    *p = 0;
    len--;
  }
}

While introducing additional variables, restrict Checked C related modifications to declarations as much as possible, and modify as little of the main body of code as possible.

Use rel_align when pointers are cast to point to larger datatypes and then dereferenced

A common situation is when a char pointer is cast to a word pointer and then dereferenced. In this situation, rel_align is required to ensure that the runtime check emitted by the compiler performs the correct bounds check. The example below illustrates this.

void foo(_Array_ptr<char> ad : count(an), unsigned int an) {
    ....
    _Array_ptr<char> d : bounds(ad, ad + an) = ad;
    _Array_ptr<size_t> w : bounds(ad, ad + an) rel_align(char) = 0;
    w = (_Array_ptr<void>) d;
    *w = ....
    ...
}

In the above code, note that the bounds of variable w (which is of type size_t) has its bounds specified by ad, (ad + an) where ad has type char. Had the rel_align(char) not been present in the declaration of w, the runtime check emitted for the dereference of w would have been: ad <= w && w < (ad + an). This would have been incorrect because *w accesses four bytes starting at w, and the last three bytes accessed may potentially be beyond (ad + an). The correct runtime check that the compiler in fact emits is: ad <= w && (w + sizeof(size_t) - 1) < (ad + an). The compiler emits this correct check when rel_align(char) is specified in the declaration of variable w.