<br><br><br><br><br>

# Type checking

<br><br><br><br><br>

<br>

Most compiled languages perform an additional tree-to-tree transformation: **type checking**.

Generally, an **untyped AST** (such as the ones we've been dealing with) gets replaced by a **typed AST**, in which each node is marked by a data type, such as `double` or `boolean`. (It's also possible to mark an AST in-place with type labels, but if so, be sure that node instances are unique!)
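
The warning about unique node instances can be made concrete with a small sketch. The `Symbol` class, the `label` function, and the two scopes below are hypothetical names invented for this illustration, not part of the parser built in this section. The assumption is that the same symbol may have different types in different scopes, so a node object that is reused in two places can only hold one of the two labels.

```python
class Symbol:
    """A hypothetical AST node that carries an in-place type label."""
    def __init__(self, name):
        self.name = name
        self.type = None                  # filled in by the labeling pass

def label(symbol, scope):
    symbol.type = scope[symbol.name]      # mark the node in place
    return symbol

# Two uses of "x" in scopes that give it different types (an assumption).
scope_a, scope_b = {"x": float}, {"x": bool}

# One instance reused in both places: the second label overwrites the first.
shared = Symbol("x")
first, second = label(shared, scope_a), label(shared, scope_b)
shared_types = (first.type, second.type)

# One instance per use: each node keeps its own label.
first, second = label(Symbol("x"), scope_a), label(Symbol("x"), scope_b)
unique_types = (first.type, second.type)

print(shared_types)
print(unique_types)
```

Building a separate typed AST avoids the problem altogether, because every typed node is created fresh during the transformation.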

Type checking was traditionally motivated by the need to generate the right instructions in the output language (e.g. `__add_int32__` vs `__add_float32__` on unlabeled 32-bit registers), but it can be much more general than that:

<center style="margin-top: 20px; margin-bottom: 20px"><b>type checking is a formal proof that the program satisfies certain properties.</b></center>

The properties to prove are encoded in the **type system**, which can be specialized to a domain like particle physics.

_What properties do we want particle physics analysis scripts to satisfy?_

<br>

<br>

**Some terminology:**

   * A **type** is a _set of possible values_ that a symbol or expression can have at runtime. Types may be
      * **abstract** if they're specified without reference to a bit-representation, like "all non-negative integers less than `2**32`"
      * **concrete** if a bit-representation is given, like "unsigned 32-bit integers: 32 binary digits with no sign bit."
   
   
   * A **strongly typed** language stops processing if it encounters values that do not match function argument types: it either stops the compilation or the runtime execution.
   
   * A **weakly typed** language either passes bits without checking them or converts values to fit expectations.
   
   * A **statically typed** language type-checks programs before they are run, usually as part of compilation.
   
   * A **dynamically typed** language checks types at runtime. The same expression may be valid on one execution and invalid on another, depending on the values it receives.
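
The difference between abstract and concrete types, and the reason unlabeled registers need the right instruction, can be sketched with Python's standard `struct` module. The bit pattern `0xC0490FDB` and the helper `is_uint32` are choices made for this illustration only. The abstract type is a membership test that never mentions bits; the concrete types are three different readings of the same 32 bits.

```python
import struct

def is_uint32(value):
    """Abstract type: all non-negative integers less than 2**32."""
    return isinstance(value, int) and 0 <= value < 2**32

# Concrete representation: exactly four bytes, little-endian.
bits = struct.pack("<I", 0xC0490FDB)

# The bits carry no label; the format code decides what they mean.
as_uint32 = struct.unpack("<I", bits)[0]     # unsigned integer
as_int32 = struct.unpack("<i", bits)[0]      # two's complement integer
as_float32 = struct.unpack("<f", bits)[0]    # IEEE 754 single precision

print(is_uint32(as_uint32), is_uint32(as_int32))
print(as_uint32, as_int32, as_float32)
```

Only the unsigned reading belongs to the abstract type above; the signed reading is negative and the floating-point reading is approximately -π, even though all three come from identical bits.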

<br>

<p style="margin-bottom: 0px"><b>Weakly typed (values are assumed to fit operations)</b></p>
<ul style="margin-top: 0px">
  <li>Most assembly languages treat all values as raw bits; the programmer has to keep track of types and call the right instructions.
  <li>C is often used as a weakly typed language (e.g. passing everything as <tt>void*</tt>).
</ul>

<p style="margin-bottom: 0px"><b>Weakly typed (values are converted to fit operations)</b></p>
<ul style="margin-top: 0px">
  <li>Perl: <tt>"2" + 8 → 10</tt> and undefined variables or unconvertible strings are treated as zero.
  <li>JavaScript: <tt>"2" + 8 → "28"</tt>
  <li>MATLAB: <tt>'2' + 8 → 58</tt> (because the ASCII value of the character <tt>'2'</tt> is <tt>50</tt>...)
  <li>Python predicates: <tt>None</tt> or <tt>[]</tt> resolves to <tt>False</tt>, <tt>[0]</tt> resolves to <tt>True</tt> when used with <tt>if/and/or/not</tt>.
  <li>Python 2's handling of byte-strings vs unicode.
  <li>Most languages promote integers to floating-point values in mixed arithmetic.
</ul>

<p style="margin-bottom: 0px"><b>Strongly but dynamically typed</b></p>
<ul style="margin-top: 0px">
  <li>Everything else in Python (<tt>"2" + 8</tt> is a <tt>TypeError</tt>).
  <li>Lisp, Ruby, R, Erlang, Lua, Tcl, Smalltalk, PostScript...
</ul>

<p style="margin-bottom: 0px"><b>Strongly and statically typed</b></p>
<ul style="margin-top: 0px">
  <li>C++, Java, C#, Rust, Go, Swift, Haskell, ML, Scala, Fortran, LLVM's assembly language...
</ul>
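
Python's position in these lists can be checked directly. The function `risky` below is a hypothetical example written for this purpose: it contains an ill-typed expression that is only discovered if that line executes (dynamic typing), and when it does execute, Python refuses instead of converting (strong typing). The last lines exercise the weakly typed corners mentioned above.

```python
def risky(flag):
    if flag:
        return "2" + 8        # ill-typed, but only checked if this line executes
    return 10

ok = risky(False)             # runs fine: nothing inspected the other branch

try:
    risky(True)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"     # strongly typed: no silent conversion

# Weakly typed corners: values are converted to fit the operation.
truthiness = [bool(None), bool([]), bool([0])]
promoted = 2 + 0.5            # the int is promoted to a float

print(ok, outcome, truthiness, promoted)
```

A statically typed language would reject `risky` before running it, regardless of which branch is taken.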

```python
import lark

grammar = """
start: or
or:       and -> pass | and "or" and
and:      not -> pass | not "and" not
not:  compare -> pass | "not" not
compare: term -> pass | term "==" term -> eq | term "!=" term -> ne
                      | term  "<" term -> lt | term "<=" term -> le
                      | term  ">" term -> gt | term ">=" term -> ge
term:  factor -> pass | factor "+" factor -> add | factor "-" factor -> sub
factor:  atom -> pass | atom "*" atom -> mul     | atom "/" atom -> truediv
atom:      "(" or ")" | CNAME -> symbol | INT -> int | FLOAT -> float

%import common.CNAME
%import common.INT
%import common.FLOAT
%import common.WS
%ignore WS
"""
parser = lark.Lark(grammar)
```

```python
print(parser.parse("not x > 0.0 and 2 + 2").pretty())
```

```
start
  pass
    and
      not
        pass
          gt
            pass
              pass
                symbol	x
            pass
              pass
                float	0.0
      pass
        pass
          add
            pass
              int	2
            pass
              int	2
```



```python
# Simplify the parse tree (PT) into an abstract syntax tree (AST).
# We now call it UntypedAST because its nodes carry no type labels yet.

class UntypedAST:
    _fields = ()
    def __init__(self, *args):
        # assign positional arguments to the attribute names listed in _fields
        for n, x in zip(self._fields, args):
            setattr(self, n, x)

class UntypedLiteral(UntypedAST):                   # a literal always knows its type,
    _fields = ("value", "type")                     # even in the UntypedAST
    def __str__(self): return "{0}({1})".format(self.type.__name__, str(self.value))

class UntypedSymbol(UntypedAST):
    _fields = ("symbol",)
    def __str__(self): return self.symbol

class UntypedCall(UntypedAST):
    _fields = ("function", "arguments")
    def __str__(self):
        return "{0}({1})".format(str(self.function), ", ".join(str(x) for x in self.arguments))
```

```python
def toast(ptnode):
    """Convert a parse tree node into an UntypedAST node."""
    # pass-through nodes (including a parenthesized "atom") have exactly one child
    if ptnode.data == "start" or ptnode.data == "pass" or ptnode.data == "atom":
        return toast(ptnode.children[0])
    elif ptnode.data == "int":
        return UntypedLiteral(int(ptnode.children[0]), int)
    elif ptnode.data == "float":
        return UntypedLiteral(float(ptnode.children[0]), float)
    elif ptnode.data == "symbol":
        return UntypedSymbol(str(ptnode.children[0]))
    else:
        # everything else is an operator, represented as a function call
        return UntypedCall(str(ptnode.data), [toast(x) for x in ptnode.children])

print(toast(parser.parse("not x > 0.0 and 2 + 2")))
```

```
and(not(gt(x, float(0.0))), add(int(2), int(2)))
```
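
Note that the expression above parses without complaint even though it combines a boolean with an integer using `and`. A type-checking pass is what would catch this. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function `typeof`, the `symboltypes` dictionary, and the typing rules themselves (logical operators accept only booleans, comparisons and arithmetic accept only numbers, mixed arithmetic promotes to `float`) are all choices made for this illustration. The three classes are compact stand-ins for the ones defined above so that the block runs on its own.

```python
# Compact stand-ins for the UntypedAST classes, so this sketch runs on its own.
class UntypedLiteral:
    def __init__(self, value, type):
        self.value, self.type = value, type

class UntypedSymbol:
    def __init__(self, symbol):
        self.symbol = symbol

class UntypedCall:
    def __init__(self, function, arguments):
        self.function, self.arguments = function, arguments

NUMERIC = (int, float)

def typeof(node, symboltypes):
    """Return the type of node, or raise TypeError (hypothetical rules)."""
    if isinstance(node, UntypedLiteral):
        return node.type
    if isinstance(node, UntypedSymbol):
        return symboltypes[node.symbol]       # symbol types are declared up front

    # check the arguments first, then the call itself (bottom-up)
    argtypes = [typeof(x, symboltypes) for x in node.arguments]
    names = ", ".join(t.__name__ for t in argtypes)

    if node.function in ("and", "or", "not"):
        expected, result = (bool,), bool
    elif node.function in ("eq", "ne", "lt", "le", "gt", "ge"):
        expected, result = NUMERIC, bool
    elif node.function == "truediv":
        expected, result = NUMERIC, float
    elif node.function in ("add", "sub", "mul"):
        expected, result = NUMERIC, (float if float in argtypes else int)
    else:
        raise TypeError("unknown function: " + node.function)

    if not all(t in expected for t in argtypes):
        raise TypeError("{0}({1}) is not allowed".format(node.function, names))
    return result

# gt(x, float(0.0)) is well typed if x is declared to be a float
comparison = UntypedCall("gt", [UntypedSymbol("x"), UntypedLiteral(0.0, float)])
print(typeof(comparison, {"x": float}).__name__)

# and(not(gt(x, float(0.0))), add(int(2), int(2))) passes (bool, int) to "and"
bad = UntypedCall("and", [
    UntypedCall("not", [comparison]),
    UntypedCall("add", [UntypedLiteral(2, int), UntypedLiteral(2, int)]),
])
try:
    typeof(bad, {"x": float})
except TypeError as err:
    print("rejected:", err)
```

Under these rules the comparison has type `bool`, while the full expression is rejected before anything is evaluated. A fuller pass would return a typed AST instead of only the top-level type.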
