C parser and code generator for ruby
Latest commit 5986def Aug 6, 2017 @ch1c0t ch1c0t committed with Fix a few Minitest warnings about assert_nil


CAST

C parser and abstract syntax tree for Ruby.

Example

require 'cast'

source = File.read('file.c')
ast = C.parse(source)
ast.entities.each do |entity|
  next unless entity.Declaration?  # function definitions have no declarators
  entity.declarators.each do |declarator|
    puts "#{declarator.name}: #{declarator.type}"
  end
end

Or in irb:

irb> ast = C.parse('int main(void) { return 0; }')
 => TranslationUnit
    entities:
        - FunctionDef
            type: Function
                type: Int
                params: []
            name: "main"
            def: Block
                stmts:
                    - Return
                        expr: IntLiteral
                            val: 0

irb> puts ast
int main(void) {
    return 0;
}
 => nil

Nodes

C.parse returns a tree of Node objects. Here's the class hierarchy:

  • Node
    • TranslationUnit
    • Comment
    • Declaration
    • Declarator
    • FunctionDef
    • Parameter
    • Enumerator
    • MemberInit
    • Member
    • Statement
      • Block
      • If
      • Switch
      • While
      • For
      • Goto
      • Continue
      • Break
      • Return
      • ExpressionStatement
    • Label
      • PlainLabel
      • Default
      • Case
    • Type
      • IndirectType
        • Pointer
        • Array
        • Function
      • DirectType
        • Struct
        • Union
        • Enum
        • CustomType
        • PrimitiveType
          • Void
          • Int
          • Float
          • Char
          • Bool
          • Complex
          • Imaginary
    • Expression
      • Comma
      • Conditional
      • Variable
      • UnaryExpression
        • PostfixExpression
          • Index
          • Call
          • Dot
          • Arrow
          • PostInc
          • PostDec
        • PrefixExpression
          • Cast
          • Address
          • Dereference
          • Sizeof
          • Positive
          • Negative
          • PreInc
          • PreDec
          • BitNot
          • Not
      • BinaryExpression
        • Add
        • Subtract
        • Multiply
        • Divide
        • Mod
        • Equal
        • NotEqual
        • Less
        • More
        • LessOrEqual
        • MoreOrEqual
        • BitAnd
        • BitOr
        • BitXor
        • ShiftLeft
        • ShiftRight
        • And
        • Or
      • AssignmentExpression
        • Assign
        • MultiplyAssign
        • DivideAssign
        • ModAssign
        • AddAssign
        • SubtractAssign
        • ShiftLeftAssign
        • ShiftRightAssign
        • BitAndAssign
        • BitXorAssign
        • BitOrAssign
      • Literal
        • StringLiteral
        • CharLiteral
        • CompoundLiteral
        • IntLiteral
        • FloatLiteral
    • NodeList
      • NodeArray
      • NodeChain

The classes with subclasses listed beneath them (Node, Statement, Label, Type, Expression, NodeList, and so on) are abstract.

The last two (the NodeLists) represent lists of Nodes. They quack like standard Ruby Arrays. NodeChain is a doubly linked list; NodeArray is an array.

Node Methods

  • parent: return the parent in the tree (a Node or nil).

  • pos, pos=: the position in the source file (a Node::Pos).

  • to_s: return the code for the tree (a String).

  • inspect: return a pretty string for inspection, makes irb fun.

  • match?(str), =~(str): return true iff str parses as a Node equal to this one.

  • detach: remove this node from the tree (parent becomes nil) and return it.

  • detached?, attached?: return true if parent is nil or non-nil respectively.

  • replace_with(node): replace this node with node in the tree.

  • swap_with(node): exchange this node with node in their trees.

  • insert_prev(*nodes), insert_next(*nodes): insert nodes before/after this node in the parent list. Parent must be a NodeList! Useful for adding statements before or after a node in a block, for example.

  • Foo?: (where Foo is a module name) return self.is_a?(Foo). This is a convenience for a common need. Example:

    # print all global variables
    ast.entities.each do |node|
      node.Declaration? or next
      node.declarators.each do |decl|
        unless decl.type.Function?
          puts "#{decl.name}: #{decl.type}"
        end
      end
    end
    

The =~ method lets you do:

if declarator.type =~ 'const int *'
  puts "Ooh, a const int pointer!"
end

This is not the same as declarator.type.to_s == 'const int *'; that'd require you to guess how to_s formats its strings (most notably, the whitespace).

Fields and Children

The big table down below lists the fields of each Node. A field is an attribute which:

  • is used in equality checks (== and eql?).
  • is copied recursively by dup and clone.

Fields listed as children form the tree structure. They only have a Node or nil value, and are yielded/returned/affected by the traversal methods:

  • next, prev: return the next/prev sibling.
  • list_next, list_prev: like next/prev, but also requires the parent to be NodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
  • each, reverse_each: Yield all (non-nil) children. Node includes Enumerable, so, you know.
  • depth_first, reverse_depth_first: Walk the tree in that order, yielding two args (event, node) at each node. event is :down on the way down, :up on the way up. If the block throws :prune, it won't descend any further.
  • preorder, reverse_preorder, postorder, reverse_postorder: Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws :prune, it won't descend any further.
  • node_after(child), node_before(child): return the node after/before child (same as child.next and child.prev respectively).
  • remove_node(child): remove child from this node (same as child.detach).
  • replace_node(child, new_child): replace child with yeah you guessed it (same as child.replace_with(new_child)).

Note: don't modify the tree during traversal!

Other notes about the table:

  • Field names that end in '?' are always true-or-false.
  • If no default is listed:
    • it is false if the field name ends in a '?'
    • it is a NodeArray if it is a NodeList.
    • it is nil otherwise.
Class Field Type / values Default Comments
TranslationUnit entities * NodeList NodeChain[] The root of a parsed file.
Declaration storage :typedef, :extern, :static, :auto, :register Also:
  • #typedef? -- true iff storage == :typedef
  • #extern? -- true iff storage == :extern
  • #static? -- true iff storage == :static
  • #auto? -- true iff storage == :auto
  • #register? -- true iff storage == :register
type * DirectType
declarators * NodeList NodeArray[]
inline? true, false
Declarator indirect_type * IndirectType What's a "declarator?" Consider "int i, *ip;". This is a Declaration with two Declarators:
    Declaration
        type: Int
        declarators:
            - Declarator
                name: "i"
            - Declarator
                indirect_type: Pointer
                name: "ip"
      
The indirect_type of the ip Declarator is a Pointer to nil. To get the complete type of the variable use:
  • #type -- return the complete type. This is a clone; modifying it won't modify the tree.
So calling #type on the ip Declarator gives:
    Pointer
      type: Int
      
name String
init * Expression
num_bits * Integer
FunctionDef storage :extern, :static Also:
  • #extern? -- return true iff storage == :extern
  • #static? -- return true iff storage == :static
  • #prototype? -- same as !no_prototype?
  • #prototype=(val) -- same as no_prototype = !val
no_prototype? means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:
No prototype:
int main(argc, argv)
    int argc;
    char **argv;
{
    return 0;
}
Prototype:
int main(int argc, char **argv) {
    return 0;
}
Everyone tells you to use prototypes. That's because no type checking is done when calling a function declared without a prototype.
inline? true, false
type * Type
name String
def * Block Block.new
no_prototype? true, false
Parameter register? true, false Used in Functions.
type * Type
name String
Enumerator name String Used in Enums.
val * Expression
MemberInit member * NodeList of (Member or Expression) Used in CompoundLiterals.
init * Expression
Member name String Used in MemberInits.
Block labels * NodeList of Label NodeArray[]
stmts * NodeList of (Statement or Declaration or Comment) NodeArray[]
If labels * NodeList of Label NodeArray[]
cond * Expression
then * Statement
else * Statement
Switch labels * NodeList of Label NodeArray[]
cond * Expression
stmt * Statement
While labels * NodeList of Label NodeArray[] do? means it's a do-while loop.
do? true, false
cond * Expression
stmt * Statement
For labels * NodeList of Label NodeArray[]
init * Expression or Declaration
cond * Expression
iter * Expression
stmt * Statement
Goto labels * NodeList of Label NodeArray[]
target String
Continue labels * NodeList of Label NodeArray[]
Break labels * NodeList of Label NodeArray[]
Return labels * NodeList of Label NodeArray[]
expr * Expression
ExpressionStatement labels * NodeList of Label NodeArray[]
expr * Expression
PlainLabel name String
Default
Case expr * Expression
Comma exprs * NodeList of Expression
Conditional cond * Expression
then * Expression
else * Expression
Variable name String
Index expr * Expression
index * Expression
Call expr * Expression
args * NodeList of (Expression or Type)
Dot expr * Expression
member * String
Arrow expr * Expression
member * String
PostInc expr * Expression
PostDec expr * Expression
Cast type * Type
expr * Expression
Address expr * Expression
Dereference expr * Expression
Sizeof expr * Type or Expression
Positive expr * Expression
Negative expr * Expression
PreInc expr * Expression
PreDec expr * Expression
BitNot expr * Expression
Not expr * Expression
Add expr1 * Expression
expr2 * Expression
Subtract expr1 * Expression
expr2 * Expression
Multiply expr1 * Expression
expr2 * Expression
Divide expr1 * Expression
expr2 * Expression
Mod expr1 * Expression
expr2 * Expression
Equal expr1 * Expression
expr2 * Expression
NotEqual expr1 * Expression
expr2 * Expression
Less expr1 * Expression
expr2 * Expression
More expr1 * Expression
expr2 * Expression
LessOrEqual expr1 * Expression
expr2 * Expression
MoreOrEqual expr1 * Expression
expr2 * Expression
BitAnd expr1 * Expression
expr2 * Expression
BitOr expr1 * Expression
expr2 * Expression
BitXor expr1 * Expression
expr2 * Expression
ShiftLeft expr1 * Expression
expr2 * Expression
ShiftRight expr1 * Expression
expr2 * Expression
And expr1 * Expression
expr2 * Expression
Or expr1 * Expression
expr2 * Expression
Assign lval * Expression
rval * Expression
MultiplyAssign lval * Expression
rval * Expression
DivideAssign lval * Expression
rval * Expression
ModAssign lval * Expression
rval * Expression
AddAssign lval * Expression
rval * Expression
SubtractAssign lval * Expression
rval * Expression
ShiftLeftAssign lval * Expression
rval * Expression
ShiftRightAssign lval * Expression
rval * Expression
BitAndAssign lval * Expression
rval * Expression
BitXorAssign lval * Expression
rval * Expression
BitOrAssign lval * Expression
rval * Expression
StringLiteral val String The String in val is the literal string entered. "\n" isn't converted to a newline, for instance.
CharLiteral val String The String in val is the literal string entered. '\n' isn't converted to a newline, for instance.
CompoundLiteral type * Type

Here's an example:

(struct S){1, .x = 2, .y [3] .z = 4}

parses as:

CompoundLiteral
    type: Struct
        name: "S"
    member_inits:
        - MemberInit
            init: IntLiteral
                val: 1
        - MemberInit
            member:
                - Member
                    name: "x"
            init: IntLiteral
                val: 2
        - MemberInit
            member:
                - Member
                    name: "y"
                - IntLiteral
                    val: 3
                - Member
                    name: "z"
            init: IntLiteral
                val: 4
member_inits * NodeList of MemberInit NodeArray[]
IntLiteral format :dec, :hex, :oct :dec

Also:

  • #dec? -- return true iff format == :dec
  • #hex? -- return true iff format == :hex
  • #oct? -- return true iff format == :oct
val Integer
suffix String
FloatLiteral format :dec, :hex :dec
val Float
exponent Integer
suffix String
Pointer const? true, false
restrict? true, false
volatile? true, false
type * Type
Array const? true, false
restrict? true, false
volatile? true, false
type * Type
length * Expression
Function const? true, false
restrict? true, false
volatile? true, false
type * Type
params * NodeList of Parameter NodeArray[]
var_args? true, false
Struct const? true, false
restrict? true, false
volatile? true, false
name String
members * NodeList of Member NodeArray[]
Union const? true, false
restrict? true, false
volatile? true, false
name String
members * NodeList of Member NodeArray[]
Enum const? true, false
restrict? true, false
volatile? true, false
name String
members * NodeList of Enumerator
CustomType const? true, false For typedef'd names.
restrict? true, false
volatile? true, false
name String
Void const? true, false const is for things like const void *.
restrict? true, false
volatile? true, false
Int const? true, false Also:
  • #short? -- return true iff longness == -1
  • #plain? -- return true iff longness == 0
  • #long? -- return true iff longness == 1
  • #long_long? -- return true iff longness == 2
  • #signed? -- same as !unsigned?
  • #signed=(val) -- same as unsigned = !val
restrict? true, false
volatile? true, false
longness -1, 0, 1, 2 0
unsigned? true, false
Float const? true, false Also:
  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0
Char const? true, false Also:
  • #signed? -- return true iff signed == true
  • #unsigned? -- return true iff signed == false
  • #plain? -- return true iff signed == nil
Yes, C99 says that char, signed char, and unsigned char are 3 distinct types (unlike with int -- go figure). Like Martian chalk and Venusian cheese: completely different, but you can fit 'em each in one byte.
restrict? true, false
volatile? true, false
signed true, false, nil
Bool const? true, false This is the rarely seen _Bool type.
restrict? true, false
volatile? true, false
Complex const? true, false

This is the rarely seen _Complex type.

  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0
Imaginary const? true, false

This is the rarely seen _Imaginary type.

  • #plain? -- return true iff longness == 0
  • #double? -- return true iff longness == 1
  • #long_double? -- return true iff longness == 2
restrict? true, false
volatile? true, false
longness 0, 1, 2 0
BlockExpression block * Block Block.new Only if the block_expressions extension is enabled. See "Extensions" section below.

Parser

C.parse will use the default parser (C.default_parser), but you can also manage your own parser(s) if you need finer control over state. Parser state consists of:

  • type_names: a Set of Strings. As a parser eats typedefs, this grows.
  • pos: the Node::Pos this parser will start parsing at.

A Node::Pos has three read-write attributes: filename, line_num, col_num. Default is nil, 1, 0.

Note that the type names the parser has seen affect how it parses! For example, consider:

a * b;
  • If only a is a type, this is a declaration.
  • If neither a nor b are types, this is a multiplication statement.
  • Otherwise, it's a syntax error.

You may append type names implicitly, by parsing typedefs, or explicitly like this:

parser.type_names << 'Thing' << 'OtherThing'

Parsing Snippets

C.parse will parse the toplevel C construct, a C::TranslationUnit, but you can also parse other snippets of C:

C::Statement.parse('while (not_looking) { paint_car(); }')
C::Type.parse('void *(*)(int *(*)[][2], ...)')

This works for both concrete and abstract Node subclasses. A Parser may be given as an optional second argument.

Extensions to C99

  • Types are allowed as function arguments. This is needed to parse C99 macros like va_arg().
  • Blocks in parentheses are allowed as expressions (a gcc extension). You need to call parser.enable_block_expressions first. They appear as BlockExpression nodes.

Parsing Full Programs

This can be tricky for a number of reasons. Here are the issues you'll likely encounter.

Preprocessing

Directives that start with # are not handled by the Parser, as they're external to the C grammar. CAST ships with a Preprocessor, which wraps the preprocessor used to build your Ruby interpreter.

cpp = C::Preprocessor.new
cpp.include_path << '/usr/include' << '/usr/local/include'
cpp.macros['DEBUG'] = '1'
cpp.macros['max(a, b)'] = '((a) > (b) ? (a) : (b))'
cpp.preprocess(code)

Note however, that preprocessors tend to leave vendor-specific extensions in their output. GNU cpp, for example, leaves "linemarkers" (lines that begin with #) in the output which you'll need to filter out manually before feeding it to a Parser.
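
One way to filter linemarkers is a plain line filter, sketched below. It needs nothing from CAST. The helper name strip_linemarkers and the sample text are made up for illustration, and the filter is deliberately blunt: it drops every line starting with #, which would also drop any #pragma lines left in the output.

```ruby
# Drop every line whose first non-blank character is '#'.
# (strip_linemarkers is a hypothetical helper, not part of CAST.)
def strip_linemarkers(text)
  text.each_line.reject { |line| line =~ /\A\s*#/ }.join
end

# Sample text in the style of GNU cpp output, invented for this example.
preprocessed = <<~OUTPUT
  # 1 "file.c"
  # 1 "<built-in>"
  int x;
  # 5 "file.c"
  int main(void) { return x; }
OUTPUT

clean = strip_linemarkers(preprocessed)
puts clean
```

The filtered text can then be handed to a Parser as usual.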

Built-in types

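Preprocessed system headers can refer to types that the compiler provides itself. Mac OS 10.5's system cpp, for instance, assumes the compiler will recognize types such as __darwin_va_list. The parser only treats a name as a type if it is in its type_names, so such names need to be added explicitly before parsing.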

Syntactic Extensions

Some code may take advantage of compiler-specific extensions to the syntax. For example, gcc supports inline assembly via directives like:

asm("movl %1, %%eax;"
    "movl %%eax, %0;"
    :"=r"(y)
    :"r"(x)
    :"%eax");

Such code is fairly rare, so there is no direct support in CAST for this. You'll need to manually massage such constructs out of the Parser input. Or send me patches. Delicious patches.
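
One possible way to massage asm statements out of the input is a small text scanner run before parsing. The sketch below is not part of CAST; strip_asm is a hypothetical helper. It assumes the statements are spelled asm(...) or __asm__(...), optionally with volatile, and it does not understand comments, character literals or macros.

```ruby
# Remove `asm(...);` statements from C source text by scanning for the
# matching close parenthesis, skipping over string literals.
# (strip_asm is a hypothetical helper, not part of CAST.)
def strip_asm(source)
  out = +''
  i = 0
  pattern = /\b(?:asm|__asm__)\s*(?:volatile\s*)?\(/
  while (m = pattern.match(source, i))
    out << source[i...m.begin(0)]   # keep the text before the asm keyword
    j = m.end(0)                    # index just past the opening parenthesis
    depth = 1
    in_string = false
    while j < source.length && depth > 0
      ch = source[j]
      if in_string
        j += 1 if ch == '\\'        # skip the escaped character
        in_string = false if ch == '"'
      elsif ch == '"'
        in_string = true
      elsif ch == '('
        depth += 1
      elsif ch == ')'
        depth -= 1
      end
      j += 1
    end
    j += 1 if source[j] == ';'      # swallow the terminating semicolon
    i = j
  end
  out << source[i..-1]
  out
end

code = <<~C_SOURCE
  int y;
  asm("movl %1, %%eax;"
      "movl %%eax, %0;"
      :"=r"(y)
      :"r"(x)
      :"%eax");
  return y;
C_SOURCE

stripped = strip_asm(code)
puts stripped
```

Dropping the statement outright changes what the code means, of course, so this only makes sense when the goal is to analyze the surrounding C rather than to regenerate a working program.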

Contributing

  • Bug reports
  • Source
  • Patches: Fork on Github, send pull request.
    • Include tests where practical.
    • Leave the version alone, or bump it in a separate commit.

Copyright

Copyright (c) George Ogata. See LICENSE for details.