#####################
High-level overview
#####################

PHP is an interpreted language. Interpreted languages differ from compiled ones in that they aren't
compiled into machine code ahead of time. Instead, the source files are read, processed and
interpreted when the program is executed. This can be very convenient for developers for rapid
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
many concepts from compilers and other interpreters.

**********
Concepts
**********

The goal of the interpreter is to read the user's source files from disk, and to simulate the user's
intent. This process can be split into distinct phases that are easier to understand and implement.

- Tokenization - splitting whole source files into words, called tokens.
- Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
- Compilation - turning the tree structure into a list of operations, called opcodes.
- Interpretation - reading and executing opcodes.

php-src as a whole can be seen as a pipeline consisting of these stages.

.. code:: haskell

   source_code
     |> tokenizer   -- tokens
     |> parser      -- ast
     |> compiler    -- opcodes
     |> interpreter

Let's go into these phases in a bit more detail.

**************
Tokenization
**************

Tokenization, often called "lexing" or "scanning", is the process of taking an entire program file
and splitting it into a list of words and symbols. Tokens generally consist of a type (a simple
integer constant identifying the kind of token) and a lexeme (the literal string used in the source
code).

.. code:: php

   if ($cond) {
       echo "Cond is true\n";
   }

.. code:: text

   T_IF                       "if"
   T_WHITESPACE               " "
                              "("
   T_VARIABLE                 "$cond"
                              ")"
   T_WHITESPACE               " "
                              "{"
   T_WHITESPACE               "\n    "
   T_ECHO                     "echo"
   T_WHITESPACE               " "
   T_CONSTANT_ENCAPSED_STRING '"Cond is true\n"'
                              ";"
   T_WHITESPACE               "\n"
                              "}"

While tokenizers are not difficult to write by hand, PHP uses a tool called ``re2c`` to automate
this process. It takes a definition file and generates efficient C code to build these tokens from a
stream of characters. The definition for PHP lives in ``Zend/zend_language_scanner.l``. Check the
`re2c documentation`_ for details.

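To make the idea concrete, here is a minimal hand-written tokenizer sketched in Python rather than
C. It is not how php-src does it (the real scanner is generated by ``re2c``). The ``TOKEN_SPEC``
table and the ``tokenize`` function are hypothetical names made up for this illustration, and only
the handful of tokens from the example above are recognized.

```python
import re

# Hypothetical token table for illustration only. The real definitions live in
# Zend/zend_language_scanner.l and cover far more cases.
TOKEN_SPEC = [
    ("T_IF", r"if\b"),
    ("T_ECHO", r"echo\b"),
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("T_CONSTANT_ENCAPSED_STRING", r'"(?:[^"\\]|\\.)*"'),
    ("T_WHITESPACE", r"\s+"),
    ("CHAR", r"[(){};]"),  # single-character tokens have no named type
]
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split source into (type, lexeme) pairs, scanning left to right."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup
        lexeme = match.group()
        # Single-character tokens are represented by the character itself.
        tokens.append((lexeme if kind == "CHAR" else kind, lexeme))
        pos = match.end()
    return tokens


source = 'if ($cond) {\n    echo "Cond is true\\n";\n}'
for kind, lexeme in tokenize(source):
    print(kind, repr(lexeme))
```

Running the sketch prints one line per token, in the same order as the listing above. A generated
scanner does the same job, but compiles the patterns into a state machine instead of trying regular
expressions at run time.
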
.. _re2c documentation: https://re2c.org/

*********
Parsing
*********

Parsing is the process of reading the tokens generated by the tokenizer and building a tree
structure from them. To humans, nesting seems obvious when looking at source code, given indentation
through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into
a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is
represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a
predetermined number of children, lists with an arbitrary number of children, and
:doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string.

Here is a simplified example of what an AST from the tokens above might look like.

.. code:: text

   zend_ast_list {
       kind: ZEND_AST_IF,
       children: 1,
       child: [
           zend_ast {
               kind: ZEND_AST_IF_ELEM,
               child: [
                   zend_ast {
                       kind: ZEND_AST_VAR,
                       child: [
                           zend_ast_zval {
                               kind: ZEND_AST_ZVAL,
                               zval: "cond",
                           },
                       ],
                   },
                   zend_ast_list {
                       kind: ZEND_AST_STMT_LIST,
                       children: 1,
                       child: [
                           zend_ast {
                               kind: ZEND_AST_ECHO,
                               child: [
                                   zend_ast_zval {
                                       kind: ZEND_AST_ZVAL,
                                       zval: "Cond is true\n",
                                   },
                               ],
                           },
                       ],
                   },
               ],
           },
       ],
   }

The nodes may also store additional flags in the ``attr`` field for various purposes depending on
the node kind. They also store their original position in the source code in the ``lineno`` field.
These fields are omitted in the example for brevity.

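The three node shapes described in this section can be modelled in a few lines of Python. This is
only a sketch to show how the pieces relate: the class names ``Ast``, ``AstList`` and ``AstZval``
and the ``dump`` helper are hypothetical, the node kinds are plain strings instead of integer
constants, and the real C structures in php-src are laid out differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Ast:
    """A "normal" node: the number of children is fixed by its kind."""
    kind: str
    child: list
    attr: int = 0    # kind-specific flags
    lineno: int = 0  # position in the source file


@dataclass
class AstList:
    """A list node: holds an arbitrary number of children."""
    kind: str
    child: list = field(default_factory=list)
    attr: int = 0
    lineno: int = 0


@dataclass
class AstZval:
    """A leaf node wrapping a primitive value such as a string."""
    value: Any
    kind: str = "ZEND_AST_ZVAL"
    attr: int = 0
    lineno: int = 0


# The tree from the example above, this time with line numbers filled in.
TREE = AstList("ZEND_AST_IF", [
    Ast("ZEND_AST_IF_ELEM", [
        Ast("ZEND_AST_VAR", [AstZval("cond", lineno=1)], lineno=1),
        AstList("ZEND_AST_STMT_LIST", [
            Ast("ZEND_AST_ECHO", [AstZval("Cond is true\n", lineno=2)], lineno=2),
        ], lineno=1),
    ], lineno=1),
], lineno=1)


def dump(node, depth=0):
    """Return one indented line per node, children after their parent."""
    if isinstance(node, AstZval):
        return ["  " * depth + f"{node.kind} {node.value!r}"]
    lines = ["  " * depth + node.kind]
    for child in node.child:
        if child is not None:
            lines.extend(dump(child, depth + 1))
    return lines


print("\n".join(dump(TREE)))
```

Because every node carries a ``kind``, code that walks the tree can dispatch on that field without
knowing in advance which shape of node it is looking at.
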
As with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a
grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
`Bison documentation`_ for details. Luckily, the syntax is quite approachable.

.. _bison documentation: https://www.gnu.org/software/bison/manual/

*************
Compilation
*************

Computers don't understand human language, or even programming languages. They only understand
machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For
example, they may add two numbers, load a value from RAM, jump to an instruction under a certain
condition, etc. It turns out that even complex expressions can be reduced to a number of these
simple instructions.

PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
physical machine that understands these instructions, but that this machine is implemented in
software. This is our interpreter. This also means that we are free to make up instructions
ourselves at will. Some of these instructions look very similar to something you'd find in an actual
CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load
property of object by name).

With that little detour out of the way, the job of the compiler is to read the AST and translate it
into our virtual machine instructions, also called opcodes. This code lives in
``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a
list of opcodes.

Here's what the opcodes for the AST above might look like:

.. code:: text

   0000 JMPZ CV0($cond) 0002
   0001 ECHO string("Cond is true\n")
   0002 RETURN int(1)

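As an illustration of this step, the following Python sketch lowers a tiny tree into a flat opcode
list. Everything in it is an assumption made for the example: the tuple-based tree, the
``compile_stmt`` and ``compile_file`` helpers and the operand encoding are hypothetical and far
simpler than what ``Zend/zend_compile.c`` does. It shows one common way to handle forward jumps:
emit the jump with an empty target, then patch the target once the body has been compiled.

```python
def compile_stmt(node, ops):
    """Append the opcodes for one statement node to ops."""
    kind = node[0]
    if kind == "echo":
        ops.append(["ECHO", node[1], None])
    elif kind == "if":
        _, cond, body = node
        jmpz = ["JMPZ", cond, None]  # the jump target is not known yet
        ops.append(jmpz)
        for stmt in body:
            compile_stmt(stmt, ops)
        jmpz[2] = len(ops)  # patch: skip to the first opcode after the body
    else:
        raise ValueError(f"unsupported node kind: {kind}")


def compile_file(stmts):
    """Compile a list of top-level statements into a list of opcodes."""
    ops = []
    for stmt in stmts:
        compile_stmt(stmt, ops)
    ops.append(["RETURN", ("const", 1), None])  # implicit return at the end
    return ops


# Hypothetical tuple encoding of the if statement from the example above.
AST = [("if", ("cv", "$cond"), [("echo", ("const", "Cond is true\n"))])]

for index, (opcode, op1, op2) in enumerate(compile_file(AST)):
    print(f"{index:04d}", opcode, op1, op2)
```

The three opcodes it produces correspond to the ``JMPZ``, ``ECHO`` and ``RETURN`` lines in the
listing above, with the ``JMPZ`` target patched to ``2``.
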
*************
Interpreter
*************

Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
instructions. This essentially means that each instruction may have a result value, and at most two
operands. Many CPU instruction sets use a similar format. Both result and operands in PHP are
:doc:`zvals <../core/data-structures/zval>`.

.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code

How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h``
file, which contains a custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the
``Zend/zend_vm_execute.h`` file, containing the actual VM code.

Let's step through the opcodes from the example above:

- We start at the top, i.e. ``JMPZ``. If its first operand contains a "falsy" value, it will jump to
  the instruction encoded in its second operand. If it is truthy, it will simply fall through to
  the next instruction.

- The ``ECHO`` instruction prints its first operand.

- The ``RETURN`` instruction terminates the current function.

With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
truthy, and skip over the ``echo`` otherwise.

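The stepping rules above can be sketched as a small dispatch loop in Python. This is a toy model,
not the code generated into ``Zend/zend_vm_execute.h``: the ``execute`` function and the tuple
encoding of opcodes and operands are hypothetical, and Python's notion of truthiness is used in
place of PHP's, which differs in some cases.

```python
def execute(ops, variables):
    """Run a list of (opcode, op1, op2) tuples; return (output, return value)."""
    output = []
    pc = 0  # index of the next instruction to execute

    def fetch(operand):
        # Operands are ("cv", name) for variables or ("const", value).
        kind, value = operand
        return variables[value] if kind == "cv" else value

    while True:
        opcode, op1, op2 = ops[pc]
        if opcode == "JMPZ":
            # Jump to op2 when op1 is falsy, otherwise fall through.
            pc = op2 if not fetch(op1) else pc + 1
        elif opcode == "ECHO":
            output.append(str(fetch(op1)))
            pc += 1
        elif opcode == "RETURN":
            return "".join(output), fetch(op1)
        else:
            raise ValueError(f"unknown opcode: {opcode}")


OPS = [
    ("JMPZ", ("cv", "$cond"), 2),
    ("ECHO", ("const", "Cond is true\n"), None),
    ("RETURN", ("const", 1), None),
]

print(execute(OPS, {"$cond": True}))   # ('Cond is true\n', 1)
print(execute(OPS, {"$cond": False}))  # ('', 1)
```

The output is collected in a list rather than printed directly, which keeps the loop easy to test.
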
That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The
VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.

*********
Opcache
*********

As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file
is included, we can check in the cache whether the file is already there, and verify via timestamp
that it has not been modified since it was compiled. If it has not, we may reuse the opcodes from
cache. This dramatically speeds up the execution of PHP programs. This is precisely what the opcache
extension does. It lives in the ``ext/opcache`` directory.

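The lookup described above can be sketched in a few lines of Python. This is an illustration of the
idea only: the ``OpcodeCache`` class and the ``fake_compile`` stand-in are hypothetical, the cache
is a plain dictionary in one process, and the real extension is considerably more involved.

```python
import os
import tempfile


class OpcodeCache:
    """Map file paths to compiled opcodes, revalidated by modification time."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # hypothetical: path -> list of opcodes
        self._entries = {}          # path -> (mtime, opcodes)

    def get(self, path):
        mtime = os.stat(path).st_mtime_ns
        entry = self._entries.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]  # cache hit: skip the whole pipeline
        opcodes = self._compile(path)  # miss or stale entry: recompile
        self._entries[path] = (mtime, opcodes)
        return opcodes


def fake_compile(path):
    """Hypothetical stand-in for tokenizing, parsing and compiling a file."""
    print("compiling", os.path.basename(path))
    with open(path) as handle:
        return [("ECHO", handle.read())]


with tempfile.TemporaryDirectory() as directory:
    script = os.path.join(directory, "index.php")
    with open(script, "w") as handle:
        handle.write("hello")
    cache = OpcodeCache(fake_compile)
    cache.get(script)  # first request: compiles the file
    cache.get(script)  # second request: served from the cache
```

Only the first ``get`` call reaches ``fake_compile``. The second call finds an entry with a matching
timestamp and returns the stored opcodes.
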
Opcache also performs some optimizations on the opcodes before caching them. As cached opcodes are
expected to be reused many times, it is profitable to spend some additional time simplifying them if
possible to improve performance during execution. The optimizer lives in ``Zend/Optimizer``.

JIT
===

Opcache also implements a JIT (just-in-time) compiler. This compiler takes the virtual PHP opcodes
and turns them into actual machine instructions, using additional information gained at runtime.
JITs are very complex pieces of software, so this book will likely barely scratch the surface of how
it works. It lives in ``ext/opcache/jit``.