These notes were, for the most part, written before the corresponding
features were implemented, so they're likely to be speculative and
inaccurate. It's interesting (to me) to see the ideas become more refined.
'form' - a syntactic part of a program
"subr" - a built-in function
"special form" - a function that is handled specially by eval,
often because its args should not be evaluated.
"macro" - a function called from read, not from eval.
"lexpr" - ???
"environment" - ???
"dynamic extent" - the time a function activation is alive.
abstract superclass
pair (cons)
symbol (atom)
symbol-value(sym, env=None)
set-symbol-value(sym, value, env=None)
activation frame
built-in atoms
#t #f lambda
dynamic environment
exception handler
variable bindings
other numeric types
string->symbol (aka intern)
bind (global)
math primitives
string primitives
Special forms
storage: blocks of data, format objects.
Recognizing special forms.
It seems like there are three kinds of functions.
Regular functions, syntaxes (syntaces?), and, um,
special forms.
SIOD has these.
#define tc_subr_0 4
#define tc_subr_1 5
#define tc_subr_2 6
#define tc_subr_3 7
#define tc_lsubr 8
#define tc_fsubr 9
#define tc_msubr 10
#define tc_closure 11
tc_subr_0..3 is a regular function of 0..3 arguments.
tc_lsubrs are... gc-status
tc_fsubrs are... define lambda set! quote the-environment
tc_msubrs are... if begin and or let-internal
tc_closures are created at runtime.
I'm certain I don't see the difference yet.
Section 5.2, para 3.
"Variable definitions appearing within the body of such an
expression, or within the bodies of a library or top-level
program, are treated as a set of letrec* bindings"
"such an expression" refers to expressions that declare local
variables, e.g., lambda, let, and friends. So a variable
created by a definition is just another local variable.
And a definition can appear anywhere.
Symbol Table Layout.
Top level and each library have their own environment. (Q. Does
a library's env start empty? Can't be quite empty...) Binding
expressions push a scope onto the current environment. Definitions
create an entry in the current scope.
The absolute simplest implementation would have the env as a list
of scopes, and each scope is a list of (name . value) pairs.
def lookup(name):
    for scope in env:
        for var, value in scope:
            if var == name:
                return value
    raise LookupError(name)
But... name is a symbol. Each symbol is a unique object. Can
we turn it inside out, so that we start from the symbol and find the
current binding? In constant time, even? Need to be able to undo
it at scope exit, though.
In SIOD, define takes an optional env. If env is null, then
define just writes the symbol's value cell. If env is non-null,
then define does this.
(set! env (cons (cons var (car env)) (cons val (cdr env))))
So env has two parallel lists. A list of vars (symbols) and
a list of values. At environment exit time, those would just
be peeled back.
That's cool. Symbol lookup is constant time. Symbol definition
is constant time. Scope entry is constant time. Scope exit
is O(number of name conflicts), which is normally small.
The env is part of the reader's state, so the reader has
to have per-thread (probably automatic) state.
Section 5.10 says that some storage is immutable. Figure out how to do that...
define new in-memory object formats.
define new scheme data types.
define new scheme functions.
define new scheme syntax.
A scheme function has:
a namespace (library),
a name,
an arg list,
A scheme syntax has:
a namespace (library),
a name,
an arg list,
A namespace is a Scheme <library name> form, as a C string.
E.g., "(rnrs base (6))".
Imagine two include files:
scheme-extension.h - used to extend the Scheme core
scheme.h - used to embed Scheme.
So, something like this.
#include "extend.h"
DEFINE_PROC("boolean?", (x))
return make_boolean(is_boolean(x));
DEFINE_SYNTAX("set!", (name value))
DEFINE_LIBRARY_PROC("foo", (mylib sublib (1 2 3)))
return make_boolean(false);
primitives have the following macros.
primitives have the following control-flow macros.
YIELD(value); /* only used for multi-value functions */
The last three leave all local variables in an undefined state.
(Where's a good place to save them?)
SAVE(variable); /* var must be type (obj_t *) */
Implement the stack as a list?
labels in SICP section 5.4:
* ev-appl-did-operator
* ev-appl-accumulate-arg
* ev-appl-accum-last-arg
* ev-sequence-continue
* ev-if-decide
* ev-assignment-1
* ev-definition-1
* print-result
* used as return address.
eval calls
eval_list calls
eval_arglist calls
eval_procedure calls
eval_apply calls
eval_multi calls
primitives call
eval simple object:
build one frame: { cont=0, exp,env=..., val=0 }
call b_eval.
b_eval sets val=..., calls RETURN.
eval nilary procedure:
build one frame: { cont=0, exp,env=..., val=0 }
call b_eval.
b_eval pushes 2nd frame: { cont=b_have_operator, exp=... } -> { cont=0, exp,env=..., val=0}
call b_eval. <====???
b_eval sets val=#<proc '='>, calls RETURN.
call b_have_operator.
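A toy Python rendering of this frame discipline (my own sketch, not the C code): each frame records a continuation, and every block returns the next block to run, so the driver loop replaces C recursion. Here a "nilary procedure" is just a Python callable, and the operator evaluates to itself.

```python
def run(exp):
    """Drive blocks until the bottom frame (cont=None) returns."""
    stack = [{'cont': None, 'exp': exp, 'val': None}]
    block = b_eval
    while block is not None:
        block = block(stack)
    return stack[0]['val']

def b_eval(stack):
    frame = stack[-1]
    if callable(frame['exp']):
        # push a 2nd frame whose continuation applies the operator
        stack.append({'cont': b_have_operator, 'exp': frame['exp'],
                      'val': frame['exp']})   # operator evaluates to itself
        return b_ret
    frame['val'] = frame['exp']               # simple objects self-evaluate
    return b_ret

def b_ret(stack):
    frame = stack.pop()
    if frame['cont'] is None:                 # bottom frame: done
        stack.append(frame)                   # keep the result reachable
        return None
    stack[-1]['val'] = frame['val']           # hand the value up
    return frame['cont']

def b_have_operator(stack):
    frame = stack[-1]
    frame['val'] = frame['val']()             # apply the nilary operator
    return b_ret
```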
Read barriers.
I think each obj_* operation can verify its subject on entry with
something like this.
#define ENSURE_TOSPACE(x) (!is_null((x)) && ((x) < to_space || (x) >= to_space_end) ? (x) = move_obj(x) : (x))
Or, as a function,
obj_t *ensure_tospace(obj_t *obj)
if (is_null(obj))
return obj;
if (to_space <= obj && obj < to_space_end)
return obj;
return move_obj(obj);
#define ENSURE_TOSPACE(x) ((x) = ensure_tospace(x))
Anyway, every time an obj's public method dereferences a pointer field,
it also does the ensure_tospace thing.
obj_t *some_obj_get_thingie(obj_t *obj)
some_obj_t *so = (some_obj_t *)obj;
return ensure_tospace(so->so_thingie);
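The barrier can be modeled in Python with explicit forwarding pointers instead of the to-space address-range check (a simplification, and move_obj here just records where the object went; all names are mine):

```python
class Obj:
    def __init__(self, value):
        self.value = value
        self.forward = None          # set when GC moves the object

def move_obj(obj):
    """Copy obj to to-space and leave a forwarding pointer behind."""
    new = Obj(obj.value)
    obj.forward = new
    return new

def ensure_tospace(obj):
    """Read barrier: chase forwarding pointers to the current copy."""
    if obj is None:
        return None
    while obj.forward is not None:
        obj = obj.forward
    return obj
```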
The alternative is to keep auto storage locations on a linked list
or stack, and traverse that list when flipping.
struct obj_ref {
    obj_t **ref;
    struct obj_ref *next;
};
__thread struct obj_ref *auto_head;
#define OBJ(x) obj_t *x = NIL; struct obj_ref r_##x = { &x, auto_head }; auto_head = &r_##x;
and make it illegal to DECLARE an obj in a loop.
for (o = some_list; o; o = pair_cdr(o))
But how do we find implicit pops? Compare stack pointer?
Zap the list each time through the threaded eval loop?
That doesn't quite cut it. Instead, let's have a new arg that gets
passed to every function. Call it HEAP or ROOT or ALLOC or something.
Each function or block entry has code to initialize HEAP:
my_proc(HEAP__, ...) | This much comes
{ | from the DEFINE_PROC
some_struct HEAP = { HEAP__, ... }; | macro.
return make_pair(HEAP, some_car, some_cdr);
#define THIS_FUNCTION_USES_THE_HEAP some_struct HEAP = { HEAP__, ... }
Then roots are defined from HEAP.
HEAP_ROOT(obj); obj = make_symbol(L"foo");
#define HEAP_ROOT(obj) obj_t *obj = NIL; heap_record_obj(HEAP, &obj);
Third alternative. May be independent of the first two.
Make FRAME a thread-global variable.
Then blocks just return a value.
Each thread's FRAME is a root.
I'm up against the problem that a lot of primitives need to be
implemented. Some of them seem to be best implemented as macros.
E.g., let in terms of lambda and letrec (from r6rs Appendix B).
(define-syntax let
  (syntax-rules ()
    ((let ((name val) ...) body1 body2 ...)
     ((lambda (name ...) body1 body2 ...)
      val ...))
    ((let tag ((name val) ...) body1 body2 ...)
     ((letrec ((tag (lambda (name ...)
                      body1 body2 ...)))
        tag)
      val ...))))
Note that only "named let" uses letrec, so there is no circularity: you
can define letrec in terms of plain let, and named let in terms of letrec.
From the Scheme FAQ:
(define-syntax letrec
  (syntax-rules ()
    ((_ ((var init) ...) . body)
     (let ()
       (define var init) ...
       (let () . body)))))
But many primitives can be implemented as simple procedures. They
don't need macros.
Trivial example: list.
(define (list . args) args)
; update 5/9/2009 - this is wrong.
; `list' returns a newly allocated list. (ref. r6rs 11.9)
So I'm thinking I should start on a system of loading libraries from
source files at startup. That means defining a search path and a way
to control loading. Or maybe the C code should just have a read-file
primitive and a mechanism for reading a file into a namespace. Then
I could implement the import mechanism in Scheme.
Finishing the reader.
Things the scanner can return:
begin-datum-comment: #;
identifier (symbol)
booleans: #t, #f
number (fixnum w/ sign)
( ) [ ] .
begin-vector: #(
begin-bytevector: #vu8(
quote: '
quasiquote: `
unquote: ,
unquote-splicing: ,@
syntax: #'
quasisyntax: #`
unsyntax: #,
unsyntax-splicing: #,@
Things not to implement yet: many character formats,
non-decimal-fixnum numbers, vectors, bytevectors.
program ::= datum*
datum ::= lexeme | compound
lexeme ::= boolean | character | string | number | symbol
compound ::= ( datum* ) | [ datum* ] | abbrev datum
(or vector or bytevector)
How to handle comment?
comment ::= ... | #; interlexeme-space datum
Have a datum-sequence nonterminal which is made of data and
datum-comments. The actions discard the commented data. Or have a
parallel set of productions for commented data which don't build
anything. (I like the latter...)
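A sketch of the preferred approach in Python, using a flat token list in place of the real scanner (function and token names are mine): `#;` triggers a parallel parse of a full datum whose result is discarded.

```python
def read_datum(tokens):
    """Parse one datum from a list of token strings.
    '#;' comments out the datum that follows it (R6RS datum comment)."""
    tok = tokens.pop(0)
    while tok == '#;':
        read_datum(tokens)           # parse the commented datum, discard it
        tok = tokens.pop(0)
    if tok == '(':
        items = []
        while tokens[0] != ')':
            if tokens[0] == '#;':
                tokens.pop(0)
                read_datum(tokens)   # discard commented data inside lists too
            else:
                items.append(read_datum(tokens))
        tokens.pop(0)                # consume ')'
        return items
    return tok
```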
It would be good to write the libraries (import statement processing,
search path, etc.) in Scheme. But we need to read a lot of code to
get there.
There could be two import mechanisms, a simplified bootstrap method
and the full-boat method. I rejected that dichotomy for read, though.
What does the import mechanism do? Each library has a C file and a
Scheme file with the same basename (foo.c and foo.scm). The C file
has a namespace declaration, something like this.
DEFINE_LIBRARY(L"(rnrs foo (6))");
Then it has the usual function definitions.
The Scheme file is wrapped in a (library ...) declaration.
(library (rnrs foo (6))
(export foo-fcn)
(import (rnrs base)
(rnrs xyzzy))
(define (foo-fcn) ...)
Sometime during program initialization, we iterate through the
declared libraries and try to load this file.
'%s/lib/%s.scm' % (dir_name(executable), lib_name)
To load a file, use fopen() and make_file_instream(), then loop in
read_stream(). Should fail gracefully.
So the library special form has to be a built-in.
There's some recursive magic where if foo imports bar, and bar imports
baz, then baz has to be loaded first. We should also duplicate
Python's import once mechanism.
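The recursive, dependencies-first, load-once behavior can be sketched in Python (deps is a hypothetical map from library name to its imports; only the load order is modeled here):

```python
def load_order(name, deps, loaded=None, order=None):
    """Return the order in which to load `name` and everything it
    imports: dependencies first, each library loaded at most once."""
    if loaded is None:
        loaded, order = set(), []
    if name in loaded:
        return order
    loaded.add(name)              # mark before recursing, so cycles terminate
    for dep in deps.get(name, ()):
        load_order(dep, deps, loaded, order)
    order.append(name)
    return order
```

So if foo imports bar and bar imports baz, baz is loaded first, and a library that is imported twice is loaded once, like Python's sys.modules behavior.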
Or, we "just" implement the I/O library (subset) which I think
would include:
pseudo-code main program.
exec = argv.pop(0)
if argv:
    load_file(argv[0], null_env)
else:
    read_loop(stdin, r6rs_lib())
Libraries again.
Here's the R6RS example.
(library (hello)
  (export hello-world)
  (import (rnrs base)
          (rnrs io simple))
  (define (hello-world)
    (display "Hello World")))
Here are the primitives for environments.
(eval <expr> <env>)
(environment <import-spec>)
(export <export-spec> ...)
(import <import-spec> ...)
<library body>)
I think each C source file should have a LIBRARY declaration like
this. LIBRARY(L"(rnrs base (6))") That defines a static variable,
the current library. I can't think of a reason why C sources would
have dependencies.
Each library may also have a Scheme source file which appends to the C
definitions. It's probably an error to have two Scheme files define
the same library, but it's expected to have both Scheme and C files
for the same library.
Need to partition the lib module into two pieces. One implements the
correct semantics of libraries. The other reads files and parses
(library ...) forms.
Library semantics mean that each library has a namespace (an env)
and symbols imported into it are bound immutably. A library has
an import list and an export list, both with rebindings.
What about anonymous libraries? Reachable from C but not from Scheme.
Useful? I may be trying too hard to avoid polluting Scheme's namespace.
Aboard Carnival Spirit.
Macro expansion is complex. There are some other expansion things
that I'm not doing, such as splicing begin and let. To handle
all those things, I'd like to be using Scheme. The problem is that
I don't know what the best subset to implement is. Ideally, there
would be a C core which is just enough to implement the full language.
Maybe I should relax the rules, implement the let family and anything
else that seems useful, and get on with it. Then when I've written
syntax-rules using full Scheme, I can go back and figure out what it
could do without and reimplement some bits in Scheme.
And I still don't quite understand why multiple expansion levels
are necessary. R6RS Section 7.2 tries to explain it.
I'm getting the feeling that Scheme is just circular, that Scheme
is the only language for implementing Scheme.
I have a feeling that the next thing to implement is syntax objects,
and that they will be a primitive type.
Does the standard provide a way to extend the numerical tower?
Reading list: 1.4-1.10, 3, 7.2, 9, 10.
Some previously unstated goals:
Follow the standard whenever possible -- don't improvise.
Non-heap memory use is constant. No malloc() and no unbounded
recursion in C.
It should be possible to build the interpreter with good old
"./configure && make && make install".
~7/1/2009 - 8/19/2009
I upgraded my workstation from Ubuntu 8.04 to 9.04. When I finished,
libunicode was gone. Removed from the repository! It turns out
that libunicode has been deprecated for several years, and it was a
poor choice to use it. So I've been unable to compile Scheme for
a while now.
So I've been researching Unicode libraries, and also learning more
about Unicode itself. It's HUGE! We humans have had thousands of
years to design writing systems, and up until about 1980, there was no
thought about making them easy for computers to work with. So
right now I'm reading the Unicode spec and R6RS very carefully to
understand what I actually need.
I'm thinking about implementing my own Unicode library, derived
directly from the published UnicodeData.txt and related files.
There are lots of terms, and I don't know exactly what they mean yet.
- Character
- Code Point
U+0000 through U+10FFFF.
- Glyph
- Property
- Category(?)
aka the General_Category property
- "Unicode scalar value" (from R6RS 11.11) Same as a code point?
R6RS says that a scalar value is a code point outside the
surrogate range (U+D800 .. U+DFFF).
Then there are encoding and decoding. But I'm still planning to use
glibc for that (for now). Internally, it's all UTF-32.
Unicode defines four normalization forms. R6RS has four procedures
that do normalization: string-normalize-nfd, string-normalize-nfkd,
string-normalize-nfc, and string-normalize-nfkc.
Normalization Forms D and KD decompose characters. A-umlaut becomes
A followed by Umlaut.
Normalization Forms C and KC compose characters.
Normalization Forms KC and KD are compatibility forms. I don't know
what that means. Unicode, section 5.3:
"Compatibility. The decomposition may remove some information
(typically formatting information) that is important to preserve
in particular contexts."
I still don't know what that means. From the Python documentation
(module unicodedata):
In addition to these two forms, there are two additional normal
forms based on compatibility equivalence. In Unicode, certain
characters are supported which normally would be unified with
other characters. For example, U+2160 (ROMAN NUMERAL ONE) is
really the same thing as U+0049 (LATIN CAPITAL LETTER I). However,
it is supported in Unicode for compatibility with existing
character sets (e.g. gb2312).
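Python's unicodedata module makes the difference concrete (this is just an illustration of the four normalization forms, not part of the interpreter):

```python
import unicodedata

# NFD decomposes: A-umlaut becomes A followed by a combining umlaut.
assert unicodedata.normalize('NFD', '\u00c4') == 'A\u0308'
# NFC composes it back into the single code point.
assert unicodedata.normalize('NFC', 'A\u0308') == '\u00c4'
# The K (compatibility) forms additionally unify look-alikes:
# ROMAN NUMERAL ONE becomes LATIN CAPITAL LETTER I under NFKC,
# while NFC leaves it alone.
assert unicodedata.normalize('NFKC', '\u2160') == 'I'
assert unicodedata.normalize('NFC', '\u2160') == '\u2160'
```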
I think the only features Scheme needs are:
general category
case-independent comparison
char-upcase (simple uppercase mapping)
char-downcase (simple lowercase mapping)
char-titlecase (simple titlecase mapping)
case folding (both simple and complex).
Python's str type and unicodedata module implement all of those.
R6RS-lib claims that all the properties are available from these files.
We need...
general category: UnicodeData.txt column 2 (zero-based)
comparison: compare code point values
case-independent comparison: case-fold then compare code point values
char-upcase: UnicodeData.txt column 12
char-downcase: UnicodeData.txt column 13
char-titlecase: UnicodeData.txt column 14
char-foldcase: if not Turkic I,
if CaseFolding.txt column 1 in {C, S},
CaseFolding.txt column 2
char-alphabetic?: Alphabetic property from DerivedCoreProperties.txt
char-whitespace?: White_Space property from PropList.txt
char-upper-case?: Uppercase property from DerivedCoreProperties.txt
char-lower-case?: Lowercase property from DerivedCoreProperties.txt
char-title-case?: general category == Lt
string-upcase: SpecialCasing.txt column 3 or UnicodeData.txt column 12
string-downcase: SpecialCasing.txt column 1 or UnicodeData.txt column 13
string-titlecase: SpecialCasing.txt column 2 or UnicodeData.txt column 14
string-foldcase: if CaseFolding.txt column 1 in {C, F},
CaseFolding.txt column 2
normalization: See
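Python's str methods illustrate the simple/full distinction in the table above (illustration only): str.casefold does full folding (the C and F entries), while a simple, per-character fold can never change a string's length, which is why char-foldcase uses only the C and S entries.

```python
# Full case folding can grow a string: U+00DF (LATIN SMALL LETTER
# SHARP S) has only an F entry in CaseFolding.txt, folding to "ss".
assert '\u00df'.casefold() == 'ss'
# Simple mappings leave it alone, so char-foldcase of sharp s is itself.
assert '\u00df'.lower() == '\u00df'
# Capital sigma, small sigma, and final sigma all fold together.
assert '\u03a3'.casefold() == '\u03c3' == '\u03c2'.casefold()
```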
Update 8/25
The bytevectors chapter specifies encoding/decoding of UTF-8, UTF-16,
and UTF-32. And the I/O ports chapter has a whole section on
transcoders.
The above summarizes what we need. Let's do the simplest possible
thing. A table with 0x110000 entries in it, and each entry looks
like this.
typedef enum unicode_type_bit {
ut_alphabetic = 1 << 0,
ut_numeric = 1 << 1,
ut_whitespace = 1 << 2,
ut_upper_case = 1 << 3,
ut_lower_case = 1 << 4,
ut_title_case = 1 << 5,
} unicode_type_bit_t;
typedef struct unicode_data {
    enum general_category_t ud_general_category;
    bool ud_alphabetic : 1;
    bool ud_numeric : 1;
    bool ud_whitespace : 1;
    bool ud_upper_case : 1;
    bool ud_lower_case : 1;
    bool ud_title_case : 1;
    wchar_t ud_simple_upcase;
    wchar_t ud_simple_downcase;
    wchar_t ud_simple_titlecase;
    wchar_t ud_simple_foldcase;
    wchar_t *ud_full_upcase;
    wchar_t *ud_full_downcase;
    wchar_t *ud_full_titlecase;
    wchar_t *ud_full_foldcase;
} unicode_data_t;
That wastes a ton of space, but it's the simplest thing. Write a
Python script to generate the table; the runtime functions are then
trivial lookups.
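The heart of that generator script is just splitting each line on semicolons and picking out the columns listed earlier (a sketch; the sample line is the actual UnicodeData.txt entry for U+0061, and the full mappings from SpecialCasing.txt are omitted):

```python
def parse_unicode_data_line(line):
    """Extract the fields the table needs from one UnicodeData.txt line.
    An empty case-mapping field means the character maps to itself."""
    f = line.rstrip('\n').split(';')
    cp = int(f[0], 16)
    return {
        'general_category': f[2],                           # column 2, zero-based
        'simple_upcase':    int(f[12], 16) if f[12] else cp,  # column 12
        'simple_downcase':  int(f[13], 16) if f[13] else cp,  # column 13
        'simple_titlecase': int(f[14], 16) if f[14] else cp,  # column 14
    }
```

For example, parsing the entry for 'a' gives category Ll, uppercase and titlecase U+0041, and lowercase U+0061 (itself, since the field is empty).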
The Unicode problem is resolved for now. I didn't implement the full
solution outlined above, but I have the minimal subset to remove
dependencies from libunicode. So Scheme builds again. w00t.
And the framework is there to implement the rest of the Unicode
support later.
But let's get back to macros.
A while ago, I created an experimental special form called mu. mu is
the successor to lambda. The Scheme spec calls for some rewriting as
a lambda form is evaluated, which my lambda evaluator doesn't do.
mu, however, passes the incoming form to a Scheme procedure to
transform it, then creates a procedure from the transformed form.
I just now implemented "let" through mu. Observe.
~/scheme> ./scheme
> (define f (mu () (let ((v0 i0) (v1 i1)) (+ v0 v1))))
> (define i0 3)
> (define i1 4)
> (f)
7
> (procedure? f)
#t
> f
(lambda () ((lambda (v0 v1) (+ v0 v1)) i0 i1))
So that's a start.
I'm feeling the need to rewrite the evaluator. This is partly
from reading the book, "Scheme 9 from Empty Space".
It's not so much a book as a very well commented source code
listing. But it uses a somewhat different interpreter model
than I have. The primitives are more "C-like" than my weird
macro soup.
Current interpreter strengths:
works. (-:
is continuation based. call/cc works.
should allow exceptions to work without modifying the core.
Current interpreter weaknesses:
roots are fragile and slow.
dynamic-wind unimplemented
multiple values unimplemented
API too complex for primitive procs, too simple for primitive syntax.
hard to call Scheme from C
warps my brain
no checking C procs for right number of args
crashes on any error.
Here are the forms declared (non-library) syntax in r5rs.
(quote <datum>)
(<operator> <operand1> ...)
(lambda <formals> <body>)
(if <test> <consequent> <alternate>)
(if <test> <consequent>)
(set! <variable> <expression>)
(quasiquote <qq template>)
`<qq template>
<variable>, <constant>, and (<operator> <operand1> ...) are not keywords.
The keywords are quote, lambda, if, set!, and quasiquote. Notably missing
are define-syntax and friends.
Now I have an eval() that calls expand to expand the macros.
And I have a trivial expand, written in Scheme, that does
that job. It looks like this.
(lambda (form env)
The problem is that I can't load any of the standard environment
because the standard environment is defined in lib/r6rs.scm, and
lib/r6rs.scm is imported by read, expand, eval. So, there are no
useful symbols defined like cons, + or lambda. They're *defined* in
C, but they're not in an environment. The default environment, (rnrs
(6)), is created at the end of lib/r6rs.scm. So it's yet another
chicken/egg problem.
I think I have a solution. Here it is.
1. Write a trivial expand which looks like the above, except it's
in C but Scheme-callable. Install it. (Call it expand, I
suppose.)
2. Rip out all the C code to manage libraries. Instead, have a
single namespace called (implementation) and put all the C
primitives there.
3. Read lib/r6rs.scm (or whatever it's called), not as a Scheme
library, but simply as a bunch of code. Let its definitions
also fill up the (implementation) namespace.
4. Write a main() in Scheme. Give it argv, let it load a file,
implement the REPL, run the tests, or whatever. Write, in
Scheme, a library constructor which main() will use to set up
the default (rnrs (6)) environment for top-level.
But the C files already have namespaces. I guess I'll strip that out.
The first step is, change DEFINE_PROC and friends to always add
to the (implementation) namespace.
The Hygiene test. From tr356.
(define t 42)
(define-syntax or2
(syntax-rules ()
((_ e1 e2)
(let ((t e1)) (if t t e2)))))
(let ((if #f)) (or2 if t))
With hygienic expansion this evaluates to 42: the t introduced by the
template is renamed, and the template's if refers to the special form
from the macro's definition environment, not the local variable.
How do the various Schemes available in Ubuntu handle it?
chicken 3.2.7 inscrutable error
chicken 4.2.0 correct
gambit-C 2.8 inscrutable error
gambit-C 4.5 correct
gauche correct
guile no syntax-rules
mit-scheme correct
PLT scheme correct
scheme2c inscrutable error
scheme48 correct
scm no syntax-rules
SigScheme no define-syntax
stalin no high-level macros
tinyscheme no hygienic macros
Also, chibi-scheme gets it correct.
9/26/2009 Noon
Implemented steps 1-3 described above. Not quite there yet -- there
is no way to export a binding from the "builtin" environment* to a
REPL-visible environment.
A lot of C code is about to be obsolete. All of lib.c, for example.
What does (main) need?
- A primitive to add a binding to an environment
other than the current(?)
- the-environment
* I want a better name than "builtin". Maybe "root"?
9/26/2009 18:30
Works. scheme.scm runs a restricted REPL with just five ops: cons car
cdr quote eval environment. Have code to build libraries from lists
of constituent libraries and symbols. Have find-library.
Renamed builtin environment to root environment.
Macro expansion is still not near.
I need to implement expand. I think it wraps a syntax object around
the input form then traverses the wrapped form looking for macros and
things with bindings. That's more or less orthogonal to needing
Dybvig's paper explains more about syntax objects.
A syntax object is an expr + wrap.
A wrap is a list of marks and substitutions.
A mark is just an opaque cookie.
A substitution maps <symbol, set of marks> onto <label>.
A label is another opaque cookie.
A binding is a <type, value> pair.
A binding's type is macro, lexical, or core.
A binding's value is a Scheme object.
An environment is an association list mapping symbols to bindings.
Did I get that right?
Note that Dybvig's binding [bc-syntax-case] is different from mine. I
put the name inside the binding, and he put it outside. My
representation is a little faster -- one less redirection. His
representation makes it possible for two names to share a binding and
is probably "the Lisp way".
Next push:
Implement letrec* as a core form.
Modify obj_binding to have types.
a) rename binding_type() to binding_mutability() or something.
b) introduce binding_type() with values CORE, EXPAND, LEXICAL
or something. Add arg to constructor, add accessors.
c) augment expand to create a syntax obj and remove it afterwards.
Root pointers and fast alloc.
What if there were a heap pointer, next_root, which points slightly
past next_alloc, and all words between next_alloc and next_root are
roots. alloc() just sets the object header and increments next_alloc
past the words which are already stored in place.
The idea isn't quite baked yet, but seems interesting.
Or maybe the root pointers are allocated from the other end of the
free area?
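The two-ended variant can be sketched like this (a toy Python model, my names, assuming object words and root slots are both one "word" of the same free area):

```python
class Heap:
    """Objects are bump-allocated from the bottom of the free area,
    root slots from the top; the heap is full when the two pointers
    would cross."""
    def __init__(self, size):
        self.size = size
        self.next_alloc = 0        # grows upward
        self.next_root = size      # grows downward

    def alloc(self, nwords):
        if self.next_alloc + nwords > self.next_root:
            raise MemoryError('heap full')
        addr = self.next_alloc
        self.next_alloc += nwords
        return addr

    def alloc_root(self):
        if self.next_root - 1 < self.next_alloc:
            raise MemoryError('heap full')
        self.next_root -= 1
        return self.next_root
```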
10/6/2009 12:00 AM
r6rs compatibility in mzscheme
$ mzscheme
Welcome to MzScheme v4.1.3 [3m], Copyright (c) 2004-2008 PLT Scheme Inc.
> (require rnrs/main-6) ; or something like that.
10/6/2009 1:30 AM
The expander is orders of magnitude too slow, and all it's doing is
creating the initial "meta-environment". So, two things are in order.
1. Cache the initial wrap-n-env created from each env rib.
2. Turn the expander/reader relationship inside out.
Instead of
while ((form = read()) != EOF)
We need something like this.
while ((form = read()) != EOF) {
    expanded = continue_expanding(form);
    ...
}
The labels create a layer of indirection. We have wrap, which maps
sym -> label, and meta-env, mapping label -> binding. How about
if we re-engineer the environment calls to build that directly?
Then the labels are always up to date and nearly free.
Have to look at how wraps are manipulated during expansion.
(update 10/11/2009) won't work. They're separate data structures.
The expander is sort of working. It can handle constants, lexical
bindings, quote, and expressions of same. Its speed is reasonable
(not great).
But it can't handle define or define-syntax, which modify the current
environment. The problem is that the expander uses the definition for
environment in [bc-syntax-case], which treats the environment as
a list of frames.
So I need to rewrite initial-wrap-and-env, which I already
complexificated to get the caching described above.
The new initial-wrap-and-env will...
... have a cache. Each cache entry will represent a single
environment frame. Empty frames will not have cache entries. Entries
are not freed. The cache keys will be (environment . bindings). If
an environment's bindings have changed, its cache entry will be
regenerated. Since iwae is called at nearly top level, that will work.
... check each binding value against the special forms. There are
several special forms, currently quote, syntax, lambda, define,
define-syntax, and letrec*, which the expander knows about. But in a
given environment, other names could be bound to those forms. And
those names could be bound to other, non-special values. So the
expander has to watch the bound values, not the names.
... bind lexical names to themselves. I don't think that's right -- a
macro could refer to names private to a library, and when it's
expanded the private names won't be visible.
For example,
(library (foo)
(export public)
(import (rnrs (6)))
(define (private) 42)
(define-syntax public
(syntax-rules ()
[(_) (private)])))
Should export a macro, public, which expands to a call to private,
which isn't in scope when public is expanded.
I can't see how to make this work except to revamp eval so it no
longer resolves names. All names would be resolved at expand time.
So I'm going to blow that part off for now.
Writing Hygienic Macros in Scheme with Syntax-Case
R. Kent Dybvig
Technical Report 356, Indiana Computer Science Department, June 1992.
Syntactic Abstraction in Scheme
R. Kent Dybvig, Carl Bruggeman
Lisp and Symbolic Computation 5:4, 295-326, 1992.
R. Kent Dybvig. Syntactic abstraction: the syntax-case expander.
In Andy Oram and Greg Wilson, editors, Beautiful Code: Leading
Programmers Explain How They Think, chapter 25, 407-428.
O'Reilly and Associates, June 2007.
Still don't have it quite right. Implemented the cache above, and
need to fix up four routines that know the format of the wraps.
But it would be better if we only carried the wraps for symbols with
interesting wraps, and kept them in a single list. The rest could be
derived from the starting environment.
And marks and subst's shouldn't be interleaved. That's leading me to
a wrap format like this.
(define-record-type wrap (fields marks substs env))
That means you can't just prepend items to a wrap with cons, though.
So we probably need a set of methods for operating on wraps. We sort
of have those, but they're kind of irregular.
Procedures that look inside wraps:
id-label (private to id-binding)
initial-wrap-and-env wrap-one
initial-wrap-and-env make-one-wne
initial-wrap-and-env make-frame-wne
Expansion procedures that manipulate wraps:
syntax-car calls extend-wrap
syntax-cdr calls extend-wrap
strip calls top-marked?
exp-macro calls add-mark
exp-identifier calls id-binding
exp-identifier-form calls id-binding
exp-lambda calls ???
exp-define calls ???
exp-define-syntax calls ???
exp-letrec* calls ???
initial-wrap-and-env calls ...
Environment vs. meta-environment.
[bc-syntax-case] has two separate environments during expansion,
"regular" and "meta". Regular holds core, lexical, and macro
bindings. Meta contains macros. Who holds pattern variables?
[bc-syntax-case] section 11:
The run-time environment is used to process ordinary expressions
whose code will appear in the expander's output, while the meta
environment is used to process transformer expressions, e.g., on
the right-hand sides of letrec-syntax bindings, which are
evaluated and used at expansion time. The difference between the
run-time and meta environments is that the meta environment does
not contain lexical variable bindings, since these bindings are
not available when the transformer is evaluated and used.
The most interesting use of the environments is in exp-letrec-syntax.
(define exp-letrec-syntax
(lambda (x r mr)
(syntax-case x ()
[(_ ((kwd expr)) body)
(let ([label (make-label)])
(let ([b (make-binding 'macro
(eval (exp (add-subst #'kwd label #'expr)
mr mr)))])
(exp (add-subst #'kwd label #'body)
(extend-env label b r)
(extend-env label b mr))))])))
The macro name's binding is the expansion of the macro's body with
both r and mr set to mr. That binding is pushed onto both r and mr
while expanding the letrec-syntax form's body. So macros defined
in the body have access to the defined macro, and so do expressions.
Who holds pattern variables?
Pattern variables certainly aren't needed at eval time -- they've
already been replaced long before. They're needed at expansion
time when syntax replaces them. So I guess they go into meta.
I think the next question is, to what is a pattern variable bound? If
there's no ellipsis, it's bound to the form that it matches. If there
is an ellipsis, then it's bound to <matched-form, ellipsis-count>.
[Tr356], section 4, paragraph 2:
For pattern variable names, [the compile-time environment] holds
the nesting depths, i.e., the number of levels of nested ellipses
by which the pattern variable was followed.
So let's write something that parses patterns and produces pattern bindings.
For concreteness, a pattern binding would be
name = (label)
mutable = no
value = (cons ellipsis-count list-of-values)
Pattern matcher.
def match(pattern, form, depth=0):
    # Return an env frame (a dict) on success, None on failure.
    if pattern is null and form is null:
        return {}
    if pattern == '_':
        return {}
    if pattern in literals:
        return {} if form == pattern else None
    if is_symbol(pattern):
        return {pattern: (depth, form)}
    if is_pair(pattern) and cadr(pattern) == '...':
        # match (car pattern) against each leading element of form
        # at depth + 1, then the rest of the pattern against the tail
        ...
    if is_pair(pattern):
        # match car against car, cdr against cdr, merge the frames
        ...
pattern = (a ... b)
form = (1 2 3)
bind a: 1, 2
b: 3
if (cadr pattern) == '...':
bind (car pattern) -> (reverse (cdr (reverse form)))
This generates a syntax error when expanded.
(define-syntax my-macro
(lambda (x)
(syntax-case x ()
[((a ...) ...)
(syntax (a ...))])))
The problem is that a in the pattern is followed by two ellipses,
but a in the template (argument to syntax) is followed by one.
This is legal.
(syntax (b ... ...))])))
"For a pattern variable followed by an ellipsis, the generated name is
bound to a list of the corresponding forms; for two levels of
ellipses, a list of lists of the corresponding forms; and so on."
([tr356], page 28)
I think it also needs a count of the number of ellipses.
10:30 PM
I've written a pattern matcher in Scheme. It runs under Chicken
Scheme. I can match a pattern and build a (simple linked-list)
environment that has all the pattern variables bound. The value of a
pattern variable is (<level> . <list-of-values>). <level> is the
number of nested ellipses following the variable, and <list-of-values>
is what it says.
For example, if
pattern = (_ (x y ...) ...)
form = (m (a) (b c) (d e f))
x is bound to (1 . (a b d))
y is bound to (2 . (() (c) (e f)))
which makes sense.
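That behavior can be checked with a runnable Python sketch of the matcher (lists stand in for pairs, strings for symbols; literals, vectors, and improper patterns are omitted, and all names are hypothetical):

```python
def pattern_vars(pat, depth=0):
    """Map each pattern variable in pat to its ellipsis nesting depth."""
    if isinstance(pat, str):
        return {} if pat in ('_', '...') else {pat: depth}
    out = {}
    if isinstance(pat, list):
        for i, sub in enumerate(pat):
            d = depth + 1 if i + 1 < len(pat) and pat[i + 1] == '...' else depth
            if sub != '...':
                out.update(pattern_vars(sub, d))
    return out

def match(pat, form, depth=0):
    """Match form against pat; return {var: (depth, value)} or None.
    Symbols are strings, '_' is the wildcard, '...' the ellipsis."""
    if isinstance(pat, str):
        if pat == '_':
            return {}
        return {pat: (depth, form)}          # a pattern variable
    if not isinstance(pat, list):
        return {} if pat == form else None   # a self-evaluating datum
    if pat == []:
        return {} if form == [] else None
    if len(pat) >= 2 and pat[1] == '...':
        sub, rest = pat[0], pat[2:]
        if not isinstance(form, list) or len(form) < len(rest):
            return None
        split = len(form) - len(rest)
        # every variable under the ellipsis starts with an empty list,
        # so zero repetitions still bind it
        env = {v: (d, []) for v, d in pattern_vars(sub, depth + 1).items()}
        for f in form[:split]:
            sub_env = match(sub, f, depth + 1)
            if sub_env is None:
                return None
            for v, (_, val) in sub_env.items():
                env[v][1].append(val)
        rest_env = match(rest, form[split:], depth)
        if rest_env is None:
            return None
        return {**env, **rest_env}
    if not isinstance(form, list) or form == []:
        return None
    head = match(pat[0], form[0], depth)
    tail = match(pat[1:], form[1:], depth)
    if head is None or tail is None:
        return None
    return {**head, **tail}
```

On the example above, match(['_', ['x', 'y', '...'], '...'], ['m', ['a'], ['b', 'c'], ['d', 'e', 'f']]) binds x to (1, [a b d]) and y to (2, [() (c) (e f)]).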
Aside from vectors, this recognizes all the patterns of r6rs syntax-case.
However, it doesn't do any alpha-substitution, and I'm convinced I
need to use ribbed environments. So that's coming.
Ten minutes later, I have two answers.
[(and (vector? pattern) (vector? form))
(_ (vector->list form) (vector->list pattern) ...)]
match constructs one rib. So it's easy to prepend it to a ribbed
environment after it's constructed.
I still need to use real binding objects instead of cons cells,
but that's a localized change.
[later] Okay, vectors work.
[later] Okay, bindings work.
Ribbed environments will be in match's caller.
The form is a syntax object, though. So I'll need to sprinkle magic
unwrapping dust throughout. Nonetheless, progress has been made.
[later, 1:10 AM] magic unwrapping dust has been sprinkled. Time
to commit it to git.
Spent 11/14 sprinkling more magic unwrapping dust. I think it's
pretty close now.
Now I'm looking at implementing the syntax keyword. syntax expands a
template, substituting pattern variables from the environment and
expanding ellipses.
Here are the constraints on ellipses that syntax needs to enforce.
Let S be a subtemplate preceding an ellipsis.
Label each pattern variable V in S with the number of ellipses following it.
If V.depth > ellipsis count,
raise a syntax violation, "not enough ellipses".
For i in 1 to V.depth,
If the ellipsis already has a repeat count and it's different,
raise a syntax violation, "ellipsis count mismatch".
Label following ellipsis i with repeat count matching V's at that depth.
If any ellipses are unlabeled, raise a syntax violation, "too many ellipses".
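Here is how the structural part of those checks might look in Python (repeat-count consistency needs the matched values, so it's left to expansion time, and the (... tmpl) escape is ignored; the function and its error messages are my own sketch):

```python
def check_template(tmpl, var_depths, depth=0):
    """Check ellipsis placement in a template. var_depths maps each
    pattern variable to its ellipsis nesting depth in the pattern.
    Returns the variable depth 'used' by tmpl (capped at depth);
    raises SyntaxError on a violation."""
    if isinstance(tmpl, str):
        if tmpl in var_depths:
            if var_depths[tmpl] > depth:
                raise SyntaxError('not enough ellipses')
            return var_depths[tmpl]
        return 0
    if not isinstance(tmpl, list):
        return 0
    used = 0
    i = 0
    while i < len(tmpl):
        n_ell = 0                  # count ellipses following element i
        while i + 1 + n_ell < len(tmpl) and tmpl[i + 1 + n_ell] == '...':
            n_ell += 1
        sub_used = check_template(tmpl[i], var_depths, depth + n_ell)
        # an ellipsis must be "used" by some variable deep enough
        if n_ell and sub_used <= depth:
            raise SyntaxError('too many ellipses')
        used = max(used, min(sub_used, depth))
        i += 1 + n_ell
    return used
```

On the failing example above, a has pattern depth 2, so the template (a ...) raises "not enough ellipses", while (b ... ...) passes.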
I feel like that should be intermingled with the actual expansion process.
Pseudo Pythonscheme...
def expand(tmpl, pos=0, depth=0):
    if tmpl is a pattern variable v:
        if v.depth > depth:
            raise SyntaxViolation('not enough ellipses')
        return v[pos]   # sort of...
    if tmpl is a scalar:
        return tmpl
    # (... tmpl) -- escape: expand the subtemplate with ellipses disabled
    if (and (pair? tmpl) (free-identifier=? (car tmpl) '...)):
        unless (and (pair? (cdr tmpl)) (null? (cddr tmpl))):
            raise SyntaxViolation('misplaced ellipsis')
        return expand_no_ellipses((cadr tmpl), pos, depth)
    # (subtmpl ... rest)
    if (and (pair? tmpl)
            (pair? (cdr tmpl))
            (free-identifier=? (cadr tmpl) '...)):
        out = []
        subt = (car tmpl)
        for subpos in range((repeat_count subt)):
            out.append(expand(subt, subpos, depth + 1))
        out.extend(expand((cddr tmpl), pos, depth))   # splice in the tail
        return out
    if (pair? tmpl):
        return (cons (expand (car tmpl) pos depth)
                     (expand (cdr tmpl) pos depth))
So there it is. I have an algorithm. Now I need to translate it
into scheme and debug it.
In the pseudocode above the pos variable isn't quite right.
We need to know the position in every list. And some of the
lists are too short.
(define-syntax my-macro
(lambda (x)
(syntax-case x ()
[(_ ( x y ...) ...) #''((y ... x) ...)])))
(write (my-macro (a) (b c d e)))
pattern = (_ (x y ...) ...)
input = (my-macro (a) (b c d e))
template = (quote ((y ... x) ...))
The template should expand (ignoring the syntax wrapping) to
expansion = ((a) (c d e b))
The pattern matcher puts these bindings into the environment.
x -> (1 . (a b))
y -> (2 . (() (c d e)))
So we call
tmpl=(quote ((y ... x) ...))
which calls (on the cdr)
tmpl=(((y ... x) ...))
which calls
tmpl=((y ... x) ...)
We want repeat_count to return 2. The top-level output has two
elements, because y's value list has two elements at top level.
repeat_count calls
tmpl=(y ... x)
depth=2 (because y precedes an ellipsis)
returns 2 (because y has two values at depth 2)
a = (car (binding-value y))
if a == depth: return (length (cdr (binding-value y)))
if a > depth: raise SyntaxViolation('not enough ellipses')
if a < depth: return #f
returns #f (this subtemplate doesn't care)
returns #f (because x's depth is less than 2)
returns #f (this subtemplate doesn't care)
Merging the non-false answers gives 2.
If all answers are false, raise SyntaxViolation('too many ellipses')
If answers conflict, raise SyntaxViolation('???')
So the subtemplate (y ... x) will be expanded twice.
First (left) expansion:
tmpl=(y ... x)
Which returns 0, because (length (list-ref (cdr (binding-value y)) 0)) is 0.
So expand expands y zero times. Then it looks past the ellipsis.
and we look up (list-ref (cdr (binding-value x)) 0)
=> a
Then expand(tmpl=(x), pos=(0), depth=1) => (a)
expand(tmpl=(y ... x), pos=(0), depth=1) => (a)
Now we do the second expansion of (y ... x).
tmpl=(y ... x)
which returns 3, because (length (list-ref (cdr (binding-value y)) 1)) is 3.
So we'll expand y three times.
pos=(0 1)
=> c, because (cdr (binding-value y))[1][0] == c
pos=(1 1)
=> d, because (cdr (binding-value y))[1][1] == d
pos=(2 1)
=> e, because (cdr (binding-value y))[1][2] == e
Then do the tail.
=> b, because (cdr (binding-value x))[1] == b
=> ()
So expand(tmpl=(y ... x), pos=(1), depth=1) => (c d e b)
Back to the second-level expand, it now expands the tail, which
is just (), and returns the final result, ((a) (c d e b)).
And the toplevel just wraps quote around that.
That was a lot of typing. The lowest-level routine needed
takes a list and extracts the corresponding element.
(define (multi-ref lst indices)
  (if (null? indices)
      lst
      (multi-ref (list-ref lst (car indices)) (cdr indices))))
(assert (equal? 'c (multi-ref '((a b) (c d)) '(1 0))))
Note that indices is the reverse of pos above.
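The whole walkthrough can be condensed into a runnable Python sketch (same simplifications as before: lists for pairs, strings for symbols; no vectors and no (... tmpl) escape). Note that pos here is outermost-first, i.e. the indices order, rather than the innermost-first pos of the trace; all names are hypothetical:

```python
class SyntaxViolation(Exception):
    pass

def multi_ref(values, indices):
    """Index nested value lists; indices[0] is the outermost level."""
    for i in indices:
        values = values[i]
    return values

def repeat_count(tmpl, env, pos, depth):
    """Repeat count for a subtemplate before an ellipsis, taken from
    any pattern variable inside whose ellipsis depth matches; None if
    no variable in tmpl decides it."""
    if isinstance(tmpl, str):
        if tmpl not in env:
            return None
        d, values = env[tmpl]
        if d > depth:
            raise SyntaxViolation('not enough ellipses')
        if d < depth:
            return None                   # this variable doesn't care
        return len(multi_ref(values, pos))
    if not isinstance(tmpl, list):
        return None
    counts = set()
    i = 0
    while i < len(tmpl):
        k = 0                             # ellipses following element i
        while i + 1 + k < len(tmpl) and tmpl[i + 1 + k] == '...':
            k += 1
        c = repeat_count(tmpl[i], env, pos, depth + k)
        if c is not None:
            counts.add(c)
        i += 1 + k
    if len(counts) > 1:
        raise SyntaxViolation('ellipsis count mismatch')
    return counts.pop() if counts else None

def expand(tmpl, env, pos=()):
    """Expand a template; pos holds the repetition index at each
    enclosing ellipsis level, outermost first."""
    pos = list(pos)
    if isinstance(tmpl, str):
        if tmpl in env:
            d, values = env[tmpl]
            if d > len(pos):
                raise SyntaxViolation('not enough ellipses')
            return multi_ref(values, pos[:d])
        return tmpl
    if not isinstance(tmpl, list) or tmpl == []:
        return tmpl
    # (subtmpl ... rest): repeat subtmpl, then splice the expanded tail
    if len(tmpl) >= 2 and tmpl[1] == '...':
        sub, rest = tmpl[0], tmpl[2:]
        n = repeat_count(sub, env, pos, len(pos) + 1)
        if n is None:
            raise SyntaxViolation('too many ellipses')
        out = [expand(sub, env, pos + [i]) for i in range(n)]
        return out + expand(rest, env, pos)
    return [expand(tmpl[0], env, pos)] + expand(tmpl[1:], env, pos)
```

On the worked example, with x bound to (1, [a b]) and y to (2, [() (c d e)]), expanding (quote ((y ... x) ...)) produces (quote ((a) (c d e b))), matching the trace.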
expand.scm is more or less complete. expand can handle pattern
variables and ellipses. Here are the unit tests. (Note that I
actually USED a Scheme macro for something!)
The environment is set up as though...
lex is lexically bound (not a pattern variable)
pat is a pattern variable bound to p
p3 is a pattern variable at ellipsis depth 1 whose values are (a b c)
(test-expand lex => lex)
(test-expand pat => p)
(test-expand (lex . pat) => (lex . p))
(test-expand (p3 ...) => (a b c))
(test-expand (p3 ... pat) => (a b c p))
(test-expand (p3 ... p3 ... pat) => (a b c a b c p))
(test-expand ((p3x ... pat) ...) => ((p) (a p) (a b p)))
(test-expand (... ...) => ...)
(test-expand (x y (... ...)) => (x y ...))
(test-expand (... (pat ...)) => (p ...))
Now it's time to start rewriting to remove the non-core Scheme constructs
so it'll run in my Scheme.
Progress. The code from expand.scm is merged into Scheme. The
expander handles the syntax form (syntax template). It is noticeably
slow converting the global environment to the form that the expander
needs, but that can be fixed.
Now I'm up against another hard place. The expander uses two funny
environments, r and mr. mr is the compile-time environment. It
contains compile-time bindings: syntax-case pattern variables and
macro definitions. r is the run-time environment. It contains
a binding for every object that will be visible at run time.
That means core interpreter forms and lexical bindings, including
the global lexical environment.
In r, the current expression under expansion is a "wrapped syntax
object". A wrapped syntax object is a normal Scheme datum plus a
wrap. A wrap is very much like an environment. It maps <symbol, mark
set> pairs to labels. Then r, the environment, maps labels to
bindings. A label is just a cookie -- unique but no contents. The
indirection is because the same symbol, after macro expansion, can
refer to different locations, depending on which macro it came from.
And the mark set identifies the macro.
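A much-simplified Python sketch of that indirection (hypothetical names; real resolution strips and compares mark sets rather than doing an exact lookup):

```python
import itertools

_labels = itertools.count()

def fresh_label():
    """A label is just a unique cookie with no contents."""
    return next(_labels)

wrap = {}   # maps (symbol, mark set) pairs to labels
r = {}      # the environment: maps labels to bindings

# the same symbol introduced by two different macro expansions carries
# two different mark sets, hence two labels, hence two bindings
l1, l2 = fresh_label(), fresh_label()
wrap[('tmp', frozenset({1}))] = l1   # 'tmp' from the macro with mark 1
wrap[('tmp', frozenset({2}))] = l2   # 'tmp' from the macro with mark 2
r[l1] = 'binding-a'
r[l2] = 'binding-b'

def lookup(sym, marks):
    """Resolve a symbol through the wrap, then through the environment."""
    return r[wrap[(sym, frozenset(marks))]]
```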
The meta-environment, mr, is similar. At the beginning of the
expansion process, r and mr are identical.
The problem is that compile time and run time are intermingled. For
example, when a macro is defined, its body is evaluated. The body is
evaluated in the definition's compile-time environment. We've got
that. It's called mr. But mr doesn't map symbols to values; it
maps labels to bindings.
So I'm kind of stuck. I can see some ways I might be able to proceed,
but I'm not sure which, if any, work.
Alternative 1. Use anonymous symbols as labels. Then somehow process
the form to be evaluated to replace the symbols with labels.
Alternative 2. Change the evaluator to understand syntax objects and
lookup labels using the expander's rules.
Alternative 3. Build an eval-ready environment from r. It seems like
this would be slow.
Alternative 4. Expand the expression before evaluating it so that it
only references symbols in the global environment. Is that possible?
Alternative 5 -- ???
(let ([var 'value])
(letrec-syntax ([mac (lambda (x) (syntax (cons var x)))])
(mac 123)))
I think I got confused about what happens when again. When
syntax-case is EXPANDED, it just gets passed through almost unchanged.
The literals and patterns are unchanged (I think). The fenders and
output expressions get the normal expansion process. (What happens to
literals and pattern variables? They have to be kept unchanged.)
Then, when syntax-case is INVOKED (as part of a macro expansion), it
pushes pattern variables onto the runtime environment, which is a real
environment, before invoking the fenders and output expression.
So what should happen to pattern variables at EXPAND time?
At RUN time, they'll be used to fill in the environment.
But they must not conflict with other variables. So I think
they should be handled like lexicals.
I've been thinking about handling exceptions with setjmp/longjmp, and
it just occurred to me that if triggering a GC were an exception too,
it would not be necessary to keep track of root pointers on the C stack.
That implies that every C operation is restartable. Using setjmp and
longjmp for exceptions has the same requirement, though.
Just a thought...
Analyzing the code base for places where longjmp could break
interpreter consistency, I found that symbol_name() needs to be
redesigned to be consistent if its allocation fails. I think it just
needs the order of the two assignments inside the lock to be reversed.
Also, b_accum_arg() is using pair_set_cdr() to build a list of args in
place. That needs to be rewritten anyway.
The same anti-pattern occurs thrice in lib/base.c, where a list is
constructed left-to-right with set-car! . The offending functions are
let_get_args(), let_get_inits(), and the apply procedure.
Those are the only places I found where having longjmp restart the
current operation could cause problems.