Skip to content
Manu edited this page Jan 26, 2016 · 18 revisions

How to parse C programs with Clang: A tutorial

Written by Nico Weber (9/2008)
Updated by Justin LaPre (10/2009)
Update by Larry Olson (12/2011)

Introduction

From Clang's website

The goal of the Clang project is to create a new C, C++, Objective C and Objective C++ front-end for the LLVM compiler.

What does that mean, and why should you care? A front-end is a program that takes a piece of code and converts it from a flat string to a structured, tree-formed representation of the same program — an Abstract Syntax Tree or AST.

Once you have an AST of a program, you can do many things with the program that are hard without an AST. For example, renaming a variable without an AST is hard: You cannot simply search for the old name and replace it with a new name, as this will change too much (for example, variables that have the old name but are in a different namespace, and so on). With an AST, it’s easy: In principle, you only have to change the name field of the right VariableDeclaration AST node, and convert the AST back to a string. (In practice, it’s a bit harder for some codebases.

Front-ends have existed for decades. So, what’s special about clang? I think the most interesting part is that clang uses a library design, which means that you can easily embed clang into your own programs (and by “easily”, I mean it. Most programs in this tutorial are well below 50 lines). This tutorial tells you how you do this.

So, do you have a large C code-base and want to perform non-trivial analysis? Would you like to have ctags that works better with C++ and at all with Objective-C? Would you like to collect some statistics about your program, and you feel that grep doesn’t cut it? Then clang is for you.

This tutorial will offer a tour through clang’s preprocessor, parser, and AST libraries.

A short word of warning: Clang is a work in progress. Although its API surface is more stable than it was in 2008 when the first version of this tutorial was written, it does not have a stable API, so this tutorial might not be completely up-to-date.

Clang works on all platforms. In this tutorial I assume that you have some Unix-based platform, but everything works on Windows, too.

Getting Started

The official release of Clang is at version 3.0. You can get it here. You can download and add that into your various binary, include, and library paths.

Alternatively, you can get the latest from SVN by following these instructions

A hint for folks on Unix like systems who are pulling straight from SVN and not the official released build. After getting the source for llvm and clang and configuring it per the instructions, run the following:

make happiness
# Note: You'll almost certainly need to run the next command under sudo
make install

make happiness will checkout the latest llvm and clang from svn, build them, and run the resulting binaries through a test suite. The result of this command should look like:

Testing Time: 202.10s
  Expected Passes    : 9678
  Expected Failures  : 74
  Unsupported Tests  : 13

If there are any lines that say:

Unexpected Failures: 3

or something like that, then run the command again, often times a fix will already be waiting and the last update just happened to miss it. Otherwise, check the mailing lists as there may be a bug.

make install will install the built libs, binaries, and include files.

Recommended reading to get some context if needed.

  • Compilers: Principles, Techniques, and Tools by Aho, Lan, Sethi, and Ullman Pay attention to the first two chapters, especially discussions of Lexical Analysis and Syntax Trees. Skim anything else that looks interesting up to the Chapter on Syntax Directed Translation (chapter 5 in the 1st edition). Note: All of this is what Clang is doing for you.
  • Compiler Construction: Principles and Practice by Kenneth C. Louden Pay attention to the first 3 chapters up to section 3.3 again.

Tutorial 1: The bare minimum

A front-end consists of multiple parts. First is usually a lexer, which converts the input from a stream of characters to a stream of tokens. For example, the input while is converted from the five characters ‘w’, ‘h’, ‘i’, ‘l’, and ‘e’ to the token kw_while. For performance reasons, clang does not have a separate preprocessor program, but does preprocessing while lexing.

The Preprocessor class is the main interface to the lexer, and it’s a class you will need in almost every program that embeds clang. So, for starters, let’s try to create Preprocessor object. Our first program will not do anything useful, it only constructs a Preprocessor and exits.

The constructor of Preprocessor takes no less than 6 arguments: A DiagnosticsEngine object, a LangOptions object, a TargetInfo object, a SourceManager object, a HeaderSearch object, and finally a Module Loader object. Let’s break down what those objects are good for, and how we can build them.

First is DiagnosticsEngine. This is used by clang to report errors and warnings to the user. A DiagnosticsEngine object can have a DiagnosticsConsumer, which is responsible for actually displaying the messages to the user. We will use clang’s built-in TextDiagnosticPrinter class, which writes errors and warnings to the console (it’s the same DiagnosticsConsumer that is used by the clang binary).

Next up is LangOptions. This class lets you configure if you’re compiling C or C++, and which language extensions you want to allow. Constructing this object is easy, as its constructor does not take any parameters.

The TargetInfo is easy, too, but we need to call a factory method as the constructor is private. The factory method takes a “host triple” as parameter that defines the architecture clang should compile for, such as “i386-apple-darwin”. We will get and pass the default host triple (getDefaultTargetTriple()), which contains the host triple describing the machine llvm was compiled on. But in principle, you can use clang as a cross-compiler very easily, too. The TargetInfo object is required so that the preprocessor can add target-specific defines, for example __APPLE__. You need to delete this object at the end of the program.

SourceManager is used by clang to load and cache source files. Its constructor takes a DiagnosticsEngine for errors and a FileManager which helps it manage files on disk and in cache.

The constructor of HeaderSearch which also requires a DiagnosticsEngine for errors and a FileManager which helps it manage files on disk and in cache. HeaderSearch configures where clang looks for include files.

Finally, a ModuleLoader is an abstract class whose concrete implementation helps resolve module names. In this case, we'll create a CompilerInstance as the default ModuleLoader.

So, to build a Preprocessor object, the following code is required:

clang::DiagnosticOptions diagnosticOptions;
clang::TextDiagnosticPrinter *pTextDiagnosticPrinter =
    new clang::TextDiagnosticPrinter(
        llvm::outs(),
        diagnosticOptions);
llvm::IntrusiveRefCntPtr<clang::DiagnosticIDs> pDiagIDs;

clang::DiagnosticsEngine *pDiagnosticsEngine =
    new clang::DiagnosticsEngine(pDiagIDs, pTextDiagnosticPrinter);

clang::LangOptions languageOptions;
clang::FileSystemOptions fileSystemOptions;
clang::FileManager fileManager(fileSystemOptions);
clang::SourceManager sourceManager(
    *pDiagnosticsEngine,
    fileManager);
clang::HeaderSearch headerSearch(fileManager, *pDiagnosticsEngine);

clang::TargetOptions targetOptions;
targetOptions.Triple = llvm::sys::getDefaultTargetTriple();

clang::TargetInfo *pTargetInfo = 
    clang::TargetInfo::CreateTargetInfo(
        *pDiagnosticsEngine,
        targetOptions);
clang::CompilerInstance compInst;

clang::Preprocessor preprocessor(
    *pDiagnosticsEngine,
    languageOptions,
    pTargetInfo,
    sourceManager,
    headerSearch,
    compInst);

Note that this is quite verbose. Since we're using a CompilerInstance anyway, I've rebuilt these tutorials using a CompilerInstance object and its helper methods. They make this setup a bit simpler.

Compiling

Now that you've written your first tutorial, you need to compile it. To do that, use llvm-config. Pass it the -fno-rtti flag otherwise you'll get a link error. Also, pass it which backend libraries to use along with a list of clang libraries. See the checked-in makefile in the project.

Tutorial 2: Processing a file

Right now, our program is not very interesting, as it does not do anything yet. We want to change it so that it feeds C files into the preprocessor object. To add an input file to the preprocessor, we call:

const clang::FileEntry *pFile = fileManager.getFile("test.c");
sourceManager.createMainFileID(pFile);
preprocessor.EnterMainSourceFile();

This three lines collectively create a file id for the input file and stores it as the “main file” in the SourceManager object, tells the preprocessor to enter the SourceManager’s main file into its file list, and does some other setup stuff (such as creating a buffer with predefines), too.

Now that a source file is added, we can ask the preprocessor to preprocess it and read the preprocessed input tokens:

clang::Token token;
do {
    preprocessor.Lex(token);
    if( pDiagnosticsEngine->hasErrorOccurred())
    {
        break;
    }
    preprocessor.DumpToken(token);
    std::cerr << std::endl;
} while( token.isNot(clang::tok::eof));

See tutorial2.cpp for the complete program. Note that this is a very complete preprocessor, it handles all kinds of corner cases correctly already. It also strips comments. The program can be compiled exactly like tutorial1.

After building, play around with it a bit. For example, here’s its output when test.c is given as an input file:

int 'int'
identifier 'i'
equal '='
numeric_constant '4'
semi ';'
extern 'extern'
int 'int'
identifier 'j'
semi ';'
int 'int'
identifier 'main'
l_paren '('
r_paren ')'
l_brace '{'
identifier 'printf'
l_paren '('
string_literal '"Hello World\n"'
r_paren ')'
semi ';'
return 'return'
numeric_constant '0'
semi ';'
r_brace '}'
eof ''

You can also try feeding erroneous programs to tutorial2. As you’ll see, some errors are detected, while others are not (yet).

Tutorial 3: Include files

If you feed something with an #include line into our program, chances are that it does not work. For example, if we feed it testInclude.c, we get an error:

testInclude.c:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
         ^

We need to tell the preprocessor where it can find its header files, in particular the standard header files for your system.

You add those paths via the HeaderSearchOptions which, via a set of flags and properties, tells the HeaderSearch object how to initialize its search paths or where else to look.

To add the default system header paths to clang, you use the following code:

clang::HeaderSearchOptions headerSearchOptions;

This creates a default HeaderSearchOptions object and tells the system to use default options for this system. This happens via a call to clang::ApplyHeaderSearchOptions which takes in the LangOptions and the TargetInfo. This is called as part of the call to InitializePreprocessor which we added to earlier tutorials.

With this tiny change, #include directives that include system headers already work. See tutorial3.cpp for the complete program.

# . . . LOTS of output omitted
int 'int'
identifier 'main'
l_paren '('
r_paren ')'
l_brace '{'
identifier 'printf'
l_paren '('
string_literal '"Hello World\n"'
r_paren ')'
semi ';'
return 'return'
numeric_constant '0'
semi ';'
r_brace '}'
eof ''

This actually outputs all the tokens found in the included file, so the output is quite long.

Tutorial 4: Parsing the file

The preprocessor gives us a stream of tokens. We could now write a parser that parses that token stream into an AST. Luckily, clang can do this for us.

So, how does clang’s parsing work? There is a call to clang::ParseAST which takes in the configured Preprocessor object, an ASTConsumer, and an ASTContext. The ASTConsumer has virtuals for high level parsing constructs while the ASTContext holds onto long lived AST elements like types and declarations for semantic analysis.

To call ParseAST, we have to build a number of objects.

const clang::TargetInfo &targetInfo = *pTargetInfo;

clang::IdentifierTable identifierTable(languageOptions);
clang::SelectorTable selectorTable;

clang::Builtin::Context builtinContext;
builtinContext.InitializeTarget(targetInfo);
clang::ASTContext astContext(
    languageOptions,
    sourceManager,
    pTargetInfo,
    identifierTable,
    selectorTable,
    builtinContext,
    0 /* size_reserve*/);
clang::ASTConsumer astConsumer;

clang::Sema sema(
    preprocessor,
    astContext,
    astConsumer);

After creating these new objects to hold state (or in the case of the ASTConsumer, in the near future to help us parse a file), we need to invoke the actual parser.

clang::ParseAST(preprocessor, &astConsumer, astContext); 

Finally, to show we've actually done something, let's print out statistics on the Identifier Table.

identifierTable.PrintStats();

Your output should look something like this:

*** Identifier Table Stats:
# Identifiers:   91
# Empty Buckets: 8101
Hash density (#identifiers per bucket): 0.011108
Ave identifier length: 8.483516
Max identifier length: 28

Number of memory regions: 1
Bytes used: 3047
Bytes allocated: 4096
Bytes wasted: 1049 (includes alignment, etc)

Tutorial 5 and 6: Doing something interesting, via semantic analysis with Clang

The original tutorials used the parser directly here to perform semantic analysis. The technique used is no longer exposed through clang's public APIs and instead you should now use Semantic analysis as tutorial 6 did.

By now, we can do some actual analysis of the input code. Remember, we want to write a program that lists information about global variables. Finding all globals in a program is a good start. In order to do that, we will create an ASTConsumer which acts on TopLevelDeclarations. So what then is a Declaration?

Here are some examples of Declarations

const unsigned long int b;  // b is just a constant
int *i;  // declares (and defines) a pointer-to-int variable
extern int a;  //declares an int variable
void printf(const char* c, ...);  // a function prototype
typedef unsigned int u32;  // a typedef
struct { int a } t;  // a variable that has an anonymous struct as type
char *(*(**foo [][8])())[];  // ...wtf? see below :-)

Reading declarations can be tricky and this has good technique for reading them.

A declaration has very roughly two parts: the declared type on the left (e.g. const unsigned long int or struct { int a }) and a list of “modifiers” and the variable name itself on the right (e.g. b, *i, or ((**foo [][8])())[]). The part on the left is represented by a DeclSpec in clang, the (potentially complex) part on the right is a list of DeclaratorChunks. A DeclaratorChunk has a Kind that can be one of Pointer, Reference, Array, or Function.

To differentiate these various types of declarations cannot be done just by a Parser, the Compiler must perform some Semantic Analysis. We gain access to the results of Semantic Analysis by making our own ASTConsumer. The method we want to override is HandleTopLevelDecl(), which is called (surprise!) for every top-level declaration. The method gets passed a Decl object, which is the AST class used for declarations.

We want to handle only VarDecls, and only those that are global and not extern. This yields

virtual bool HandleTopLevelDecl( clang::DeclGroupRef d)
{
    static int count = 0;
    clang::DeclGroupRef::iterator it;
    for( it = d.begin(); it != d.end(); it++)
    {
        count++;
        clang::VarDecl *vd = llvm::dyn_cast<clang::VarDecl>(*it);
        if(!vd)
        {
            continue;
        }
        std::cout << vd << std::endl;
        if( vd->isFileVarDecl() && !vd->hasExternalStorage() )
        {
            std::cerr << "Read top-level variable decl: '";
            std::cerr << vd->getDeclName().getAsString() ;
            std::cerr << std::endl;
        }
    }
    return true;
}

After building and running, Tutorial 6 now gives us:

Read top-level variable decl: 'a
Read top-level variable decl: 'a
Read top-level variable decl: 'b
Read top-level variable decl: 'c
Read top-level variable decl: 'funcp
Read top-level variable decl: 'fp2
Read top-level variable decl: 'fp3
Read top-level variable decl: 't