
Introduce a type check cache (TCC) #5096


Closed
wants to merge 2 commits into from

Conversation


@dtakken dtakken commented Jan 19, 2020

This was an experiment I did while trying to learn more about PHP internals. When the union types feature was merged, it raised some concerns regarding the cost of complex type checks. I was wondering whether type checks were cached in some way. It turned out they are not: the same type check is redone over and over again, both simple and complex checks.

I also noticed that the JIT compiler does not generate efficient code for complex type checks. A cache would turn complex checks into simple lookups, probably simple enough to implement in the JIT compiler. A double gain.

So here it is: a type check cache for PHP. The PR also extends the JIT compiler to exploit the cache. Where the JIT-generated code previously had to bail out to slow code paths, it now keeps running at full jitty speed.

Some key characteristics:

  • Only type checks involving classes are cached
  • Cache is global, shared between all op arrays
  • Cache size is dynamic, it grows on demand at run time
  • Cache is limited to a configurable maximum size
  • Argument type checks, return type checks and typed property writes are supported

An additional benefit of having a cache is that it might allow the PHP type system to continue developing in directions that are currently not considered due to the performance cost involved.

Weaknesses

  • When the cache hits the configured memory limit, some type checks will not be cached. Classes are assigned a cache slot on a first-come, first-served basis, so there is no guarantee that the classes used most in type checks get a cache slot. However, it is possible to re-assign cache slots at run time; it may be advantageous to track cache hits and misses at run time and optimize the slot assignment at some point. This could be implemented later.
  • For reasons of simplicity the cache will typically contain a lot of entries for type / class combinations that are never actually checked in the application.
  • Significant performance gains are only to be expected for hot, type check heavy code paths.
  • This PR only extends the JIT compiler to exploit the cache for doing argument type checks, because handlers for these checks are already in place. Handling of return type checking and typed property writes appears to be missing in the JIT compiler at this time. Accelerating these using the cache will have to wait.

Things I am not sure about

  • Sane defaults for cache capacity. Currently, the default is to have at most 1024 classes and 1024 type declarations, which means the cache can grow to 1 MB of memory. Having 1024 classes (not counting interfaces, abstract classes and traits) and 1024 globally distinct type declarations sounds like a lot to me, but I have no numbers for average real-world code bases.
  • Each entry in the TCC occupies one byte of memory to store a zero or a one, which is a bit wasteful. Using single bits is possible but this makes TCC lookups more expensive. In practice I do not expect the TCC to ever require much more than a couple of MB at most as it is, so I'm not sure if compressing it makes sense. There may be CPU caching effects to think about as well here.

Some numbers finally

Using a benchmarking script I measured the performance difference relative to current master. The script is based on the script written by Dmitry Stogov to benchmark union type checks. It can be found here:

https://gist.github.com/dtakken/1539d64170921363dc8d1ed62effcd45

Below I placed the benchmark results obtained from the master branch and the tcc branch side by side to compare the overall performance gain. The numbers in the leftmost columns are the time spent doing a large number of operations that trigger type checks in a tight loop; the overhead of the loop itself is subtracted. First, some numbers with the JIT turned off:

                                     master  tcc     speedup
Foo::$static_prop = ...              0.390   0.402   -3%
Foo::$class_union_static_prop = ...  0.878   0.627   40%
$o->prop = ...                       0.375   0.384   -2%
$o->class_union_prop = ...           0.831   0.682   22%
func($x)                             0.814   0.817   0%
func($obj)                           1.049   0.948   11%
func(A|B|C|null $obj) (null)         0.784   0.890   -12%
func(A|B|C|null $obj) (A)            1.148   1.155   -1%
func(A|B|C|null $obj) (C)            1.427   1.157   23%
func(A|B|C|null $obj) (D)            1.526   1.154   32%
func(A|B|C|null $obj) (E)            1.608   1.153   39%
func(A|B|C|null $obj) (F)            1.599   1.159   38%
func($obj): A|B|C|null (A)           0.986   0.991   -1%
func($obj): A|B|C|null (C)           1.218   0.989   23%
func($obj): A|B|C|null (D)           1.302   1.002   30%
func($obj): A|B|C|null (E)           1.341   1.006   33%
func($obj): A|B|C|null (F)           1.451   0.995   46%

The numbers are slightly noisy. Still, the effect of the TCC shows nicely here. With the TCC enabled, the cost of simple and complex checks is similar.

Next, the same run with JIT enabled:

                                     master  tcc     speedup
Foo::$static_prop = ...              0.608   0.641   -5%
Foo::$class_union_static_prop = ...  1.341   0.888   51%
$o->prop = ...                       0.685   0.685   0%
$o->class_union_prop = ...           1.344   0.898   50%
func($x)                             0.305   0.349   -13%
func($obj)                           0.430   0.425   1%
func(A|B|C|null $obj) (null)         0.364   0.380   -4%
func(A|B|C|null $obj) (A)            0.731   0.471   55%
func(A|B|C|null $obj) (C)            1.747   0.472   270%
func(A|B|C|null $obj) (D)            2.255   0.472   378%
func(A|B|C|null $obj) (E)            2.388   0.471   407%
func(A|B|C|null $obj) (F)            2.508   0.472   431%
func($obj): A|B|C|null (A)           0.914   0.927   -1%
func($obj): A|B|C|null (C)           1.217   0.926   31%
func($obj): A|B|C|null (D)           1.392   0.931   50%
func($obj): A|B|C|null (E)           1.543   0.930   66%
func($obj): A|B|C|null (F)           1.627   0.927   76%

While these numbers look really nice, there are some important things to take into consideration here.

  • The results for the argument type checks are a bit unfair, because they partly compare statically compiled code with fully JIT-generated code. On the other hand, without the TCC there would not have been efficient JIT code to start with.
  • Some operations, like typed property assignments and return type checks, have no JIT equivalent yet, which means that the JIT currently slows them down. The observed gains will increase once support for these operations is added.

The final measurement compares the performance of the master branch to the performance of the tcc branch while setting the TCC capacity to zero. This shows the worst case scenario of having a cache in place while badly misconfiguring it:

                                     master  tcc miss  speedup
Foo::$static_prop = ...              0.390   0.397     -2%
Foo::$class_union_static_prop = ...  0.878   0.908     -3%
$o->prop = ...                       0.375   0.384     -2%
$o->class_union_prop = ...           0.831   0.889     -7%
func($x)                             0.814   0.827     -2%
func($obj)                           1.049   1.013     4%
func(A|B|C|null $obj) (null)         0.784   0.836     -6%
func(A|B|C|null $obj) (A)            1.148   1.424     -19%
func(A|B|C|null $obj) (C)            1.427   1.612     -11%
func(A|B|C|null $obj) (D)            1.526   1.697     -10%
func(A|B|C|null $obj) (E)            1.608   1.82      -12%
func(A|B|C|null $obj) (F)            1.599   2.027     -21%
func($obj): A|B|C|null (A)           0.986   1.204     -18%
func($obj): A|B|C|null (C)           1.218   1.457     -16%
func($obj): A|B|C|null (D)           1.302   1.532     -15%
func($obj): A|B|C|null (E)           1.341   1.655     -19%
func($obj): A|B|C|null (F)           1.451   1.701     -15%

Please note that this is my first significant contribution; I'm not familiar with the code I had to touch. Careful review is highly appreciated.

@dtakken
Author

dtakken commented Jan 19, 2020

I noticed I need to rebase (master is moving fast!), will do that later.

@Girgias
Member

Girgias commented Jan 20, 2020

CI failures are legit, seems there are various Segfaults and Bus errors.

@@ -1039,6 +1040,9 @@ ZEND_API int pass_two(zend_op_array *op_array)
opline++;
}

// TODO: Should we re-assign CE columns in opcache after loading them from cache?
tcc_assign_ce_columns();
Member


So yeah, this is a problem. Classes loaded from opcache may be immutable, which means that it's not possible to change the index. One could use a MAP pointer for this purpose, which adds an extra level of indirection.

Author


CI failures are legit, seems there are various Segfaults and Bus errors.

I am looking into these. Thanks.

Author


So yeah, this is a problem. Classes loaded from opcache may be immutable, which means that it's not possible to change the index. One could use a MAP pointer for this purpose, which adds an extra level of indirection.

Ah, I did not consider this. Classes can be shared between processes, so these processes cannot write their own stuff into them.

Reconsidering, I think I still need the class entries to have a consecutive integer ID that is unique. But unique for all classes that exist in opcache, which means that opcache should assign them and guarantee uniqueness. Sounds tricky. Then, each process could map that global ID to a column index in the local TCC. I'm not sure if that is what you mean by a MAP pointer though.

@cmb69
Member

cmb69 commented Dec 28, 2021

I'm closing this PR due to inactivity. @dtakken, feel free to fix the merge conflicts, address the test failures, and re-open.

Thanks for your work, anyway! :)
