
Static Type Checker


Overview

The static type checker will improve code safety and maintainability. Even dynamic languages like JavaScript and Python are moving towards static typing because people experience more bugs and maintenance challenges without static typing.

There is no denying that dynamic languages are easier to write, at least at first. Languages like JavaScript, Python or Ruby are joyful to work with up to about 1,000 lines of code. Beyond 10,000 lines, they become a nightmare to maintain.

There is also a common misconception that scripting languages cannot work with static typing. They are not numerous, but there are certainly scripting languages with complete static type checking (even excluding TypeScript).

What value can static typing bring us? There are two main benefits:

  • Correctness guarantee before running
  • Type-aware optimization

We are going to pursue the first point in our language. The second point is mainly a subject for natively compiled languages, but it can partly apply to languages compiled to intermediate code, like Java.

Also, I don't buy the idea of optional typing. TypeScript is a language with configurable levels of type checking, but the sloppily typed parts tend to "infect" otherwise fully typed code and corrupt the type integrity of the whole program. A language should have strict static types, or none at all.

Type Checker Architecture

There are many ways to implement a static type checker, but the one we use is a tree-walking type checker. It reads the AST and analyzes how types propagate by descending the tree recursively. It is in fact very similar to the interpreter, so we explain the type checker in the context of the interpreter.

The interpreter uses the EvalContext struct to keep track of the execution environment and uses recursive calls to walk the tree.

pub struct EvalContext<'src, 'ast, 'native, 'ctx> {
    /// RefCell to allow mutation in super context.
    /// Also, the inner values must be Rc of RefCell because a reference could be returned from
    /// a function so that the variable scope may have been ended.
    variables: RefCell<HashMap<&'src str, Rc<RefCell<Value>>>>,
    /// Function names are owned strings because it can be either from source or native.
    /// Unlike variables, functions cannot be overwritten in the outer scope, so it does not
    /// need to be wrapped in a RefCell.
    functions: HashMap<String, FuncDef<'src, 'ast, 'native>>,
    super_context: Option<&'ctx EvalContext<'src, 'ast, 'native, 'ctx>>,
}

pub fn run<'src, 'ast>(
    stmts: &'ast Vec<Statement<'src>>,
    ctx: &mut EvalContext<'src, 'ast, '_, '_>,
) -> Result<RunResult, EvalError> {
    //...
}

The type checker has a very similar struct called TypeCheckContext and a function with recursive calls. The main difference from the interpreter is that variables are tracked by their types, not their values.

pub struct TypeCheckContext<'src, 'ast, 'native, 'ctx> {
    /// Variables table for type checking.
    variables: HashMap<&'src str, TypeDecl>,
    /// Function names are owned strings because it can be either from source or native.
    functions: HashMap<String, FuncDef<'src, 'ast, 'native>>,
    super_context: Option<&'ctx TypeCheckContext<'src, 'ast, 'native, 'ctx>>,
}

pub fn type_check<'src, 'ast>(
    stmts: &'ast Vec<Statement<'src>>,
    ctx: &mut TypeCheckContext<'src, 'ast, '_, '_>,
) -> Result<TypeDecl, TypeCheckError> {
    // ...
}

We apply operations between variables and values in the same way as the interpreter does, but only check type compatibility.

Another important difference from the interpreter is that the type checker does not follow control flow constructs such as for, while, loop and if. It only checks the compatibility of the resulting types and does not actually run loops. For example, if is an expression in our language, as it is in Rust, so you can write something like the following as an expression.

if cond {
    a
} else {
    b
}

The type checker checks three things in this expression (a sketch of such a check follows the list).

  • cond has a type compatible with I32 (we use I32 for conditions because we don't have a bool type)
  • The types of a and b are compatible with each other
  • The type of the whole expression is compatible with the context in which it appears
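
As a rough sketch, the check for these three points could look like the following match arm inside the expression walk. The Expression::Conditional shape and the tc_expr helper are illustrative assumptions here, not the actual AST of the implementation.

Expression::Conditional(cond, true_branch, false_branch) => {
    // (1) The condition must be usable as the i32 "bool".
    let cond_ty = tc_expr(cond, ctx)?; // hypothetical expression-level check
    tc_coerce_type(&cond_ty, &TypeDecl::I32, span, ctx)?;
    // (2) Both branches are type checked, and their types must be compatible;
    //     binary_op_type is one way to compute their common type.
    let t_ty = type_check(true_branch, ctx)?;
    let f_ty = type_check(false_branch, ctx)?;
    // (3) The common type becomes the type of the whole `if` expression, which
    //     the caller then checks against the surrounding context.
    binary_op_type(&t_ty, &f_ty).map_err(|()| {
        TypeCheckError::new(
            "if branches have incompatible types".to_string(),
            span,
            ctx.source_file,
        )
    })?
}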

Type compatibility definition

The word "compatible" is a bit vague in this discussion. Actually, we have following 3 kinds of type compatibility, each governed by a specific function.

  • Coercion compatibility
  • Binary operator compatibility
  • Type cast compatibility

Coercion compatibility

It basically means that values of these types can be assigned to each other. The actual meaning of compatibility is defined by the "type coercion rule", which is implemented by a function like the one below.

fn tc_coerce_type<'src>(
    value: &TypeDecl,
    target: &TypeDecl,
    span: Span<'src>,
    ctx: &TypeCheckContext<'src, '_, '_, '_>,
) -> Result<TypeDecl, TypeCheckError<'src>> {
    use TypeDecl::*;
    Ok(match (value, target) {
        (_, Any) => value.clone(),
        (Any, _) => target.clone(),
        (F64 | Float, F64) => F64,
        (F32 | Float, F32) => F32,
        (I64 | Integer, I64) => I64,
        (I32 | Integer, I32) => I32,
        (Str, Str) => Str,
        (Array(v_inner), Array(t_inner)) => {
            // Recurse into the element type so arbitrarily nested arrays work.
            Array(Box::new(tc_coerce_type(v_inner, t_inner, span, ctx)?))
        }
        (Float, Float) => Float,
        (Integer, Integer) => Integer,
        _ => {
            return Err(TypeCheckError::new(
                format!(
                    "Type check error! {:?} cannot be assigned to {:?}",
                    value, target
                ),
                span,
                ctx.source_file,
            ))
        }
    })
}

You can see there are many combinations of types between value (the type of the value to assign) and target (the type of the variable or memory location being assigned to). Note that the function recurses into array elements, so it works with arbitrarily nested arrays. Rust's pattern matching is really helpful when writing this kind of function.
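
For context, this coercion check is what a variable declaration or an assignment would use. Below is a minimal sketch, assuming a Statement::VarDecl-style AST node and a tc_expr helper for typing the initializer expression (both names are illustrative, not the actual implementation):

Statement::VarDecl(name, type_decl, init_expr) => {
    // The declared type is the coercion target; the initializer only has to be
    // coercible into it, not identical to it.
    let init_ty = tc_expr(init_expr, ctx)?;
    let var_ty = tc_coerce_type(&init_ty, type_decl, span, ctx)?;
    // Record the variable's type for later lookups in this scope.
    ctx.variables.insert(*name, var_ty);
}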

Binary operator compatibility

Binary operator compatibility actually has two flavors, but they are very similar. One is for ordinary arithmetic operators, which yield the same type as their operands; the other is for comparison and logical operators, which always yield i32 (our stand-in for bool).

For example, we expect a and b to have compatible types in the expression a + b. Unlike coercion, the operands are symmetric; if you swap a and b, the compatibility should be the same.

fn binary_op_type(lhs: &TypeDecl, rhs: &TypeDecl) -> Result<TypeDecl, ()> {
    use TypeDecl::*;
    let res = match (&lhs, &rhs) {
        // `Any` type spreads contamination in the source code.
        (Any, _) => Any,
        (_, Any) => Any,
        (F64, F64) => F64,
        (F32, F32) => F32,
        (I64, I64) => I64,
        (I32, I32) => I32,
        (Str, Str) => Str,
        (Float, Float) => Float,
        (Integer, Integer) => Integer,
        (Float, F64) | (F64, Float) => F64,
        (Float, F32) | (F32, Float) => F32,
        (Integer, I64) | (I64, Integer) => I64,
        (Integer, I32) | (I32, Integer) => I32,
        (Array(lhs), Array(rhs)) => {
            return Ok(Array(Box::new(binary_op_type(lhs, rhs)?)));
        }
        _ => return Err(()),
    };
    Ok(res)
}

This function returns the type that results from a binary operator. If the operands are not compatible, it returns an error.

As noted in the comment, the Any type is compatible with any other type, and the resulting type will be Any, so the type sloppiness spills over to the whole expression. That's why I would like to remove the Any type eventually (and probably implement type inference instead).
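
For example, once an Any slips into a chain of operations, every intermediate result degrades to Any. Here is a small illustration using the function above on what would be something like any_value + 1 + 1.5 in the source:

let step1 = binary_op_type(&TypeDecl::Any, &TypeDecl::Integer).unwrap(); // Any
let step2 = binary_op_type(&step1, &TypeDecl::Float).unwrap();           // still Any
assert!(matches!(step2, TypeDecl::Any));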

Most operators follow this rule, but the comparison and logical operators, such as <, ==, && and ||, have a slightly different one.

/// Binary comparison operator type check. It will always return i32, which is used as a bool in this language.
fn binary_cmp_type(lhs: &TypeDecl, rhs: &TypeDecl) -> Result<TypeDecl, ()> {
    use TypeDecl::*;
    let res = match (&lhs, &rhs) {
        (Any, _) => I32,
        (_, Any) => I32,
        (F64, F64) => I32,
        (F32, F32) => I32,
        (I64, I64) => I32,
        (I32, I32) => I32,
        (Str, Str) => I32,
        (Float, Float) => I32,
        (Integer, Integer) => I32,
        (Float, F64 | F32) | (F64 | F32, Float) => I32,
        (Integer, I64 | I32) | (I64 | I32, Integer) => I32,
        (Array(lhs), Array(rhs)) => {
            return binary_cmp_type(lhs, rhs);
        }
        _ => return Err(()),
    };
    Ok(res)
}

It always returns I32, since that is the "boolean" type of our language.
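
Which of the two functions is called depends on the operator of the expression being checked. The sketch below shows the dispatch; the BinOpKind names, lhs_ty, rhs_ty, span and ctx are illustrative placeholders for whatever the actual expression walk has in scope:

let result_ty = match op {
    // Arithmetic operators keep the operand type...
    BinOpKind::Add | BinOpKind::Sub | BinOpKind::Mul | BinOpKind::Div => {
        binary_op_type(&lhs_ty, &rhs_ty)
    }
    // ...while comparison and logical operators always yield the i32 "bool".
    BinOpKind::Lt | BinOpKind::Gt | BinOpKind::Eq | BinOpKind::And | BinOpKind::Or => {
        binary_cmp_type(&lhs_ty, &rhs_ty)
    }
}
.map_err(|()| {
    TypeCheckError::new(
        format!("Operands {:?} and {:?} are incompatible", lhs_ty, rhs_ty),
        span,
        ctx.source_file,
    )
})?;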

Type cast compatibility

The most permissive form of type compatibility is type cast compatibility. It is the type check invoked by the type casting operator as, in an expression like a as i32.

fn tc_cast_type<'src>(
    value: &TypeDecl,
    target: &TypeDecl,
    span: Span<'src>,
    ctx: &TypeCheckContext<'src, '_, '_, '_>,
) -> Result<TypeDecl, TypeCheckError<'src>> {
    use TypeDecl::*;
    Ok(match (value, target) {
        (_, Any) => value.clone(),
        (Any, _) => target.clone(),
        (I32 | I64 | F32 | F64 | Integer | Float, F64) => F64,
        (I32 | I64 | F32 | F64 | Integer | Float, F32) => F32,
        (I32 | I64 | F32 | F64 | Integer | Float, I64) => I64,
        (I32 | I64 | F32 | F64 | Integer | Float, I32) => I32,
        (Str, Str) => Str,
        (Array(v_inner), Array(t_inner)) => {
            // Array doesn't recursively type cast for performance reasons.
            Array(Box::new(tc_coerce_type(v_inner, t_inner, span, ctx)?))
        }
        (I32 | I64 | F32 | F64 | Integer | Float, Float) => Float,
        (I32 | I64 | F32 | F64 | Integer | Float, Integer) => Integer,
        _ => {
            return Err(TypeCheckError::new(
                format!(
                    "Type check error! {:?} cannot be casted to {:?}",
                    value, target
                ),
                span,
                ctx.source_file,
            ))
        }
    })
}

In this type compatibility, all numeric types are compatible with each other, whether integer or float. However, there are two interesting exceptions to what as permits.

The first is conversion to and from a string. In our language, strings cannot be cast to numbers or vice versa, because such conversions are often expensive and should be provided by the standard library instead.

The other case is arrays, which do not recursively "cast" the element type the way coercion does. So, for example, you cannot cast an array of i32s into an array of f32s. Again, this is for performance reasons: converting every element of a container is not what the as operator is supposed to do, because it can be expensive. The standard library should provide that functionality as a function.
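
To make the two exceptions concrete, here is how the rules play out on a few casts. This is only a sketch; span and ctx stand for whatever the surrounding type_check walk has in scope.

// Numeric casts go in any direction:
tc_cast_type(&TypeDecl::I32, &TypeDecl::F64, span, ctx)?; // Ok(F64)
// Strings never cast to numbers, or vice versa:
assert!(tc_cast_type(&TypeDecl::Str, &TypeDecl::I32, span, ctx).is_err());
// Array elements are only coerced, so i32 elements cannot become f32:
assert!(tc_cast_type(
    &TypeDecl::Array(Box::new(TypeDecl::I32)),
    &TypeDecl::Array(Box::new(TypeDecl::F32)),
    span,
    ctx,
)
.is_err());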

Abstract types

We have two integer types, i64 and i32, and two float types, f64 and f32. If we tried to enforce exact types in every operation at all times, the language would not be very convenient. For example, a simple variable initialization like the one below would fail to type check, because the numeric literal 1 implicitly has the type i64.

var a: i32 = 1;

The interpreter already converts the types appropriately, but the type checker is a tad too strict here.

So we introduce the idea of abstract types. They appear only during type checking; no value has these types during actual execution. We have two abstract types:

  • Integer: an abstraction over I32 and I64
  • Float: an abstraction over F32 and F64

The type checker will accept a value of an abstract type wherever a compatible concrete type is expected, so the example above is not an error.

However, we still distinguish between concrete types, so the following lines will produce errors.

var a: i32 = 1.5; // Error!
var b: i32 = 10;
var c: i64 = b; // Error!
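
Tracing these declarations through tc_coerce_type above makes the rule visible (again a sketch; span and ctx stand for whatever the checker has in scope at that point):

// 1.5 is an abstract Float, and there is no (Float, I32) arm, so the first declaration is rejected:
assert!(tc_coerce_type(&TypeDecl::Float, &TypeDecl::I32, span, ctx).is_err());
// 10 is an abstract Integer, which narrows to the declared i32:
let b_ty = tc_coerce_type(&TypeDecl::Integer, &TypeDecl::I32, span, ctx)?; // Ok(I32)
// b is now a concrete I32, and there is no implicit widening to i64:
assert!(tc_coerce_type(&b_ty, &TypeDecl::I64, span, ctx).is_err());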