<center>
    
# CLI Performance (part 2)
<br>
<hr>
<br>


# New rustls/tokio Library

https://github.com/denoland/rustls-tokio-stream/

 - Replaces `tokio-rustls` + our custom code to split read/write halves
 - Old code is stable, but difficult to modify

# New rustls/tokio Library

 - Designed in layers:
   * A tokio task that takes a TLS connection and drives a handshake in the background
   * A stream for a TLS connection whose handshake has already completed
   * A stream for a TLS connection that buffers writes and pauses reads until handshake is complete
 - More robust: extensive testing at each layer, written in Rust
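
The third layer (buffer writes, pause reads until the handshake completes) can be pictured with a small sketch. It is purely illustrative and is not the `rustls-tokio-stream` API: the class `HandshakeGatedStream`, its methods, and the in-memory `transport` object are hypothetical names invented for this example.

```javascript
// Hypothetical sketch: gate reads and writes on handshake completion.
class HandshakeGatedStream {
  constructor(transport) {
    this.transport = transport; // assumed shape: { send(chunk), recv() }
    this.handshakeDone = false;
    this.pendingWrites = []; // chunks written before the handshake finished
    this.pendingReads = []; // read requests parked until the handshake finishes
  }

  // Before the handshake completes, writes are queued instead of sent.
  write(chunk) {
    if (this.handshakeDone) {
      this.transport.send(chunk);
    } else {
      this.pendingWrites.push(chunk);
    }
  }

  // Reads resolve only once the handshake has completed.
  read() {
    return new Promise((resolve) => {
      const doRead = () => resolve(this.transport.recv());
      if (this.handshakeDone) {
        doRead();
      } else {
        this.pendingReads.push(doRead);
      }
    });
  }

  // Called by whatever drives the handshake (a background task in the slides).
  completeHandshake() {
    this.handshakeDone = true;
    for (const chunk of this.pendingWrites) this.transport.send(chunk);
    this.pendingWrites = [];
    for (const resume of this.pendingReads) resume();
    this.pendingReads = [];
  }
}

// Usage with a fake in-memory transport.
const sent = [];
const transport = {
  send: (chunk) => sent.push(chunk),
  recv: () => "application data",
};
const stream = new HandshakeGatedStream(transport);
stream.write("GET / HTTP/1.1\r\n");
stream.write("\r\n");
const firstRead = stream.read(); // parked until the handshake completes
const sentBeforeHandshake = sent.length;
stream.completeHandshake();
const sentAfterHandshake = sent.length;
```

Nothing reaches the transport until `completeHandshake()` runs, at which point the queued writes are flushed in their original order and the parked read resumes.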

# New rustls/tokio Library

 - Current focus is on reliability; follow-up work will target performance

# Fast Streams

 - Slowly working on replacing all resource read/write operations with `deno_core` code
 - Will allow for "big bang" optimizations once we have fewer implementations
 - Big project, will take some time

# Fast Streams

 - Major output from first round: `resourceForReadableStream`
   * Optimized resource layer over a `ReadableStream`
   * Supports backpressure and packet aggregation
   * Replaces custom code in `Deno.serve`
   * Will shortly replace code in `fetch` and `node:http`
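
To make "backpressure and packet aggregation" concrete, here is a minimal sketch built on the standard `ReadableStream` API. It is not the `deno_core` implementation of `resourceForReadableStream`; the helper names `concat` and `aggregate`, the 256-byte threshold, and the 100-byte chunks are all assumptions made for illustration.

```javascript
// Join several Uint8Array chunks into one buffer of `total` bytes.
function concat(chunks, total) {
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}

// Aggregation: collect small chunks until at least `minBytes` are available,
// then hand them to the consumer as one packet.
async function* aggregate(stream, minBytes) {
  const reader = stream.getReader();
  let pending = [];
  let size = 0;
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      pending.push(value);
      size += value.byteLength;
      if (size >= minBytes) {
        yield concat(pending, size);
        pending = [];
        size = 0;
      }
    }
    if (size > 0) yield concat(pending, size); // flush the remainder
  } finally {
    reader.releaseLock();
  }
}

// Backpressure: with highWaterMark 1, pull() runs only when the internal
// queue has room, so the producer never gets far ahead of the consumer.
let pulls = 0;
const source = new ReadableStream(
  {
    pull(controller) {
      pulls++;
      if (pulls > 10) {
        controller.close();
        return;
      }
      controller.enqueue(new Uint8Array(100)); // ten 100-byte chunks in total
    },
  },
  { highWaterMark: 1 },
);

const packetSizes = (async () => {
  const sizes = [];
  for await (const packet of aggregate(source, 256)) sizes.push(packet.byteLength);
  return sizes;
})();
```

Ten 100-byte chunks come out as packets of 300, 300, 300 and 100 bytes, so the consumer handles four packets instead of ten.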

In [16]:
console.log("Running benchmark...");
// output() drains stdout and stderr while waiting for exit, so cargo cannot stall on a full pipe.
let process = new Deno.Command("cargo", { args: ["bench", "--bench", "ops_sync", "--features=unsafe_runtime_options"], cwd: "../deno_core/", stdout: "piped", stderr: "piped" });
let benchOutSync = new TextDecoder().decode((await process.output()).stdout);
console.log("Finished benchmark...");
console.log(benchOutSync);

Running benchmark...
Finished benchmark...

running 34 tests
test baseline                               ... bench:         470 ns/iter (+/- 13)
test bench_op_arraybuffer                   ... bench:       4,040 ns/iter (+/- 21)
test bench_op_bigint                        ... bench:       2,213 ns/iter (+/- 24)
test bench_op_bigint_return                 ... bench:       2,147 ns/iter (+/- 32)
test bench_op_buffer                        ... bench:       3,034 ns/iter (+/- 134)
test bench_op_buffer_nofast                 ... bench:      28,873 ns/iter (+/- 852)
test bench_op_buffer_old                    ... bench:       2,114 ns/iter (+/- 47)
test bench_op_external                      ... bench:       2,363 ns/iter (+/- 21)
test bench_op_external_nofast               ... bench:       8,050 ns/iter (+/- 433)
test bench_op_option_u32                    ... bench:       5,859 ns/iter (+/- 172)
test bench_op_string                        ... bench:       7,759 ns/iter (+/- 171)
test bench

In [31]:
console.log("Running benchmark...");
// output() drains stdout and stderr while waiting for exit, so cargo cannot stall on a full pipe.
let process = new Deno.Command("cargo", { args: ["bench", "--bench", "ops_async", "--features=unsafe_runtime_options"], cwd: "../deno_core/", stdout: "piped", stderr: "piped" });
let benchOutAsync = new TextDecoder().decode((await process.output()).stdout);
console.log("Finished benchmark...");
console.log(benchOutAsync);

Running benchmark...
Finished benchmark...

running 12 tests
test baseline                             ... bench:         824 ns/iter (+/- 13)
test bench_op_async_void                  ... bench:     103,436 ns/iter (+/- 4,794)
test bench_op_async_void_deferred         ... bench:     547,586 ns/iter (+/- 45,903)
test bench_op_async_void_deferred_nofast  ... bench:     540,914 ns/iter (+/- 34,145)
test bench_op_async_void_lazy             ... bench:     482,150 ns/iter (+/- 24,172)
test bench_op_async_void_lazy_nofast      ... bench:     545,937 ns/iter (+/- 126,705)
test bench_op_async_yield                 ... bench:     528,709 ns/iter (+/- 26,613)
test bench_op_async_yield_deferred        ... bench:     530,796 ns/iter (+/- 23,708)
test bench_op_async_yield_deferred_nofast ... bench:     527,600 ns/iter (+/- 27,955)
test bench_op_async_yield_lazy            ... bench:     560,303 ns/iter (+/- 129,685)
test bench_op_async_yield_lazy_nofast     ... bench:     553,959 ns/iter (+/- 122,

In [None]:
import pl from "npm:nodejs-polars"

In [93]:
let names = [], times = [];
for (let line of benchOutSync.split('\n')) {
    if (line.startsWith('test ') && line.includes('...')) {
        let [nameBits, timeBits, ...rest] = line.split('bench:');
        let [_, name] = nameBits.trim().split(" ");
        let [timeComma] = timeBits.trim().split(" ");
        let time = timeComma.replace(/,/g, '');
        names.push(name);
        times.push(time);
    }
}

let df = new pl.DataFrame({
    name: names,
    time: times,
})


let r = df.toRecords()
    .filter((row) => row.name.includes('op_string'))
    .reduce((input, row) => { input[row.name] = +row.time; return input }, {});
    
let comparisons = [
    ["small", "bench_op_string_old", "bench_op_string"],
    ["1,000", "bench_op_string_old_large_1000", "bench_op_string_large_1000"],
    ["1,000,000", "bench_op_string_old_large_1000000", "bench_op_string_large_1000000"],
    ["1,000 utf8", "bench_op_string_old_large_utf8_1000", "bench_op_string_large_utf8_1000"],
    ["1,000,000 utf8", "bench_op_string_old_large_utf8_1000000", "bench_op_string_large_utf8_1000000"],
    ["ByteString", "bench_op_string_bytestring", "bench_op_string_onebyte"],
];

let dfrec = { name: [], old: [], new: [], speedup: [] };

for (let row of comparisons) {
    dfrec.name.push(row[0]);
    dfrec.old.push(r[row[1]].toLocaleString());
    dfrec.new.push(r[row[2]].toLocaleString());
    dfrec.speedup.push( ((r[row[1]] - r[row[2]]) / r[row[1]] * 100).toFixed(2) + "%" );
}

const dfString = new pl.DataFrame(dfrec);
dfString

name,old,new,speedup
small,9196,7759,15.63%
1000,93889,72969,22.28%
1000000,170818,281887,-65.02%
"1,000 utf8",2008520,1443703,28.12%
"1,000,000 utf8",10019408,7640741,23.74%
ByteString,40429,3245,91.97%
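
The `speedup` column is the percentage reduction in time per iteration, so a negative value means the new op is slower. The sketch below reworks two rows of the table by hand. The helper names `parseBenchLine` and `percentFaster` are made up for this example, and the regular expression assumes the `cargo bench` line layout shown in the output above.

```javascript
// Parse one line of `cargo bench` output into { name, ns }.
// Returns null for lines that are not benchmark results.
function parseBenchLine(line) {
  const match = /^test\s+(\S+)\s+\.\.\.\s+bench:\s+([\d,]+)\s+ns\/iter/.exec(line);
  if (!match) return null;
  return { name: match[1], ns: Number(match[2].replace(/,/g, "")) };
}

// Percentage reduction in ns/iter going from old to new.
// Positive: new is faster. Negative: new is slower.
function percentFaster(oldNs, newNs) {
  return ((oldNs - newNs) / oldNs) * 100;
}

const parsed = parseBenchLine(
  "test bench_op_string                        ... bench:       7,759 ns/iter (+/- 171)",
);

// "small" row: old 9196 ns, new 7759 ns.
const small = percentFaster(9196, parsed.ns).toFixed(2) + "%"; // "15.63%"

// "1000000" row: old 170818 ns, new 281887 ns, i.e. a regression.
const large = percentFaster(170818, 281887).toFixed(2) + "%"; // "-65.02%"
```

Parsing the time into a number once avoids the repeated `+row.time` coercions, and spelling out the sign convention makes the regression on the 1,000,000-character row easier to spot.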


In [63]:
let names = [], times = [];
for (let line of benchOutAsync.split('\n')) {
    if (line.startsWith('test ') && line.includes('...')) {
        let [nameBits, timeBits, ...rest] = line.split('bench:');
        let [_, name] = nameBits.trim().split(" ");
        let [timeComma] = timeBits.trim().split(" ");
        let time = timeComma.replace(/,/g, '');
        names.push(name);
        times.push(time);
    }
}

let df = new pl.DataFrame({
    name: names,
    time: times,
})

let r = df.toRecords()
    .filter((row) => row.name != 'baseline')
    .reduce((input, row) => { input[row.name] = row.time; return input }, {});
    
let dfrec = { name: [], "slowdown vs sync": [] };
console.log(r);
// Assumes the parsed output contains a `sync_baseline` entry to divide by.
const baseline = +r["sync_baseline"];
for (let row of Object.keys(r)) {
    console.log(r[row]);
    dfrec.name.push(row);
    dfrec["slowdown vs sync"].push((+r[row] / baseline).toFixed(2) + "x");
}

let dfAsync = new pl.DataFrame(dfrec);
dfAsync

name,slowdown vs sync
bench_op_async_void,2.42x
bench_op_async_void_deferred,12.82x
bench_op_async_void_deferred_nofast,12.66x
bench_op_async_void_lazy,11.29x
bench_op_async_void_lazy_nofast,12.78x
bench_op_async_yield,12.38x
bench_op_async_yield_deferred,12.43x
bench_op_async_yield_deferred_nofast,12.35x
bench_op_async_yield_lazy,13.12x
bench_op_async_yield_lazy_nofast,12.97x


{
  bench_op_async_void: "103436",
  bench_op_async_void_deferred: "547586",
  bench_op_async_void_deferred_nofast: "540914",
  bench_op_async_void_lazy: "482150",
  bench_op_async_void_lazy_nofast: "545937",
  bench_op_async_yield: "528709",
  bench_op_async_yield_deferred: "530796",
  bench_op_async_yield_deferred_nofast: "527600",
  bench_op_async_yield_lazy: "560303",
  bench_op_async_yield_lazy_nofast: "553959",
  sync_baseline: "42720"
}
103436
547586
540914
482150
545937
528709
530796
527600
560303
553959
42720
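
Each "slowdown vs sync" figure is the async time divided by the `sync_baseline` time. The sketch below reproduces two of them from the logged values; the helper name `slowdownVsSync` is hypothetical.

```javascript
// Values copied from the logged record (ns/iter, still strings at this point).
const timings = {
  bench_op_async_void: "103436",
  bench_op_async_void_deferred: "547586",
  sync_baseline: "42720",
};

// Ratio of an async benchmark's time to the sync baseline, formatted like the table.
function slowdownVsSync(record, name) {
  return (Number(record[name]) / Number(record.sync_baseline)).toFixed(2) + "x";
}

const voidSlowdown = slowdownVsSync(timings, "bench_op_async_void"); // "2.42x"
const deferredSlowdown = slowdownVsSync(timings, "bench_op_async_void_deferred"); // "12.82x"
```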



# `#[op2]`

<img src="https://miro.medium.com/v2/resize:fit:864/format:webp/1*nR8Gow0ukWXWZdkofgkT8A.png">

# A brief history of ops
### (incomplete)

 * Note: a high-level reconstruction that skips or misses some details

- early ops: JSON and binary buffers sent from JS to Rust
  * _lots_ of serialization overhead

# A brief history of ops (incomplete)

- `serde_v8` + codegen via traits
  * No more JSON overhead
  * ops dispatched via central table

- `#[op]`: proc macros
  * one function per op
  * ops can now have a custom number of parameters
  * still using `serde_v8` + codegen

# A brief history of ops (incomplete)

- fastcalls and custom per-type dispatch
  * Skip serde_v8 for some basic types
  * Allow v8 to call Rust directly from JIT'd code

# `#[op2]`

Evolution from `#[op]`
 
<table><tr style="background-color: white"><td style="text-align:left !important;">

Before:

```rust
 #[op]
 pub fn op_do_something(...) {
   do_something()
 }
```
 
</td><td style="text-align:left !important;">

After:
    
```rust
 #[op2]
 pub fn op_do_something(...) {
   do_something()
 }
```
    
</td></tr></table>

# `#[op2]`

 - Maintainability:
   - Parsing and codegen split into distinct steps
   - Fast and slow codegen are separate, so each can evolve as we have bandwidth
   - Codegen for each input/output type is separate

# `#[op2]`

 - Designed for performance:
   - Locks removed for almost all sync ops (unless they touch the state)
   - Context and other objects only created as necessary
   - Metrics are pluggable and have near-zero cost when disabled
   - Removed allocations for most strings

In [3]:
dfString

name,old,new,speedup
small,12677,11428,9.85%
1000,137553,106861,22.31%
1000000,240685,399521,-65.99%
"1,000 utf8",2962908,2132591,28.02%
"1,000,000 utf8",14986862,11187025,25.35%
ByteString,58914,4773,91.90%


```
running 34 tests
test baseline                               ... bench:         470 ns/iter (+/- 13)
test bench_op_arraybuffer                   ... bench:       4,040 ns/iter (+/- 21)
test bench_op_bigint                        ... bench:       2,213 ns/iter (+/- 24)
test bench_op_bigint_return                 ... bench:       2,147 ns/iter (+/- 32)
test bench_op_buffer                        ... bench:       3,034 ns/iter (+/- 134)
test bench_op_buffer_nofast                 ... bench:      28,873 ns/iter (+/- 852)
test bench_op_buffer_old                    ... bench:       2,114 ns/iter (+/- 47)
test bench_op_external                      ... bench:       2,363 ns/iter (+/- 21)
test bench_op_external_nofast               ... bench:       8,050 ns/iter (+/- 433)
test bench_op_option_u32                    ... bench:       5,859 ns/iter (+/- 172)
test bench_op_u32                           ... bench:       2,143 ns/iter (+/- 47)
test bench_op_v8_global                     ... bench:      20,497 ns/iter (+/- 4,137)
test bench_op_v8_global_scope               ... bench:      28,486 ns/iter (+/- 667)
test bench_op_v8_local                      ... bench:       3,072 ns/iter (+/- 69)
test bench_op_v8_local_nofast               ... bench:       5,774 ns/iter (+/- 151)
test bench_op_v8_local_scope                ... bench:      11,753 ns/iter (+/- 233)
test bench_op_void                          ... bench:       2,146 ns/iter (+/- 49)
test bench_op_void_2x                       ... bench:       4,440 ns/iter (+/- 105)
test bench_op_void_nofast                   ... bench:       5,656 ns/iter (+/- 91)

test result: ok. 0 passed; 0 failed; 0 ignored; 34 measured
```

# `#[op2]`

 - _Explicit_ over _implicit_, clarity for developers
 - Annotations indicate where the developer should pay attention to an argument's type for performance or other reasons
 
```rust
#[op2]
pub fn op_something(
    #[smi] id: u32,
    #[string] name: &str,
    #[buffer(copy)] buffer_in: JsBuffer,
    #[buffer] buffer_out: &mut [u8],
    #[serde] control: ComplexStruct,
) {
}
```

# `#[op2]`

 - _Explicit_ over _implicit_, clarity for developers
 - Shortcuts for common patterns: `#[state]`, `v8::Global`
 - Fast is now very explicit: `#[op2(fast)]` is self-checking
 
```
custom attribute panicked
message: Failed to parse #[op2]:
 - This op is fast-compatible and should be marked as (fast)
```


# `#[op2]`

Self-documenting: https://docs.rs/deno_ops/latest/deno_ops/attr.op2.html

<img src="docs.png">

# `#[op2]`

Async is still a problem (but we'll fix that)



In [64]:
dfAsync

name,slowdown vs sync
bench_op_async_void,2.42x
bench_op_async_void_deferred,12.82x
bench_op_async_void_deferred_nofast,12.66x
bench_op_async_void_lazy,11.29x
bench_op_async_void_lazy_nofast,12.78x
bench_op_async_yield,12.38x
bench_op_async_yield_deferred,12.43x
bench_op_async_yield_deferred_nofast,12.35x
bench_op_async_yield_lazy,13.12x
bench_op_async_yield_lazy_nofast,12.97x


# `#[op2]`

 - Future plans:
   * Final benchmark of op vs op2 to ensure op2 is as fast or faster
   * More helpers: `#[resource]`, `ScopeFunction`
   * Updating all the old docs (roll-your-own JavaScript, etc.)
   * `async` rewrite (difficult while `#[op]` is still around)
   * Fancy fast return options