Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x11rb-protocol takes very long to build #883

Open
kchibisov opened this issue Sep 26, 2023 · 13 comments
Open

x11rb-protocol takes very long to build #883

kchibisov opened this issue Sep 26, 2023 · 13 comments

Comments

@kchibisov
Copy link

kchibisov commented Sep 26, 2023

During profiling both debug and release builds of both alacritty and winit, I've noticed that
x11rb-protocol builds around 5x longer than most massive deps like (serde).

And if you happen to have 2 copies of x11rb-protocol in your tree, it'll really slow things down.

For example alacritty on my system builds in around 20seconds, and each of the x11rb-protocol takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).

I was testing with the cargo build --timings and cargo build --release --timings on the tip of the alacritty/alacritty repo. You could do that on the winit repo alone for example.

It was used on both rustc 1.72.0 and rustc 1.74.0-nightly (0288f2e19 2023-09-25) showing the exact same compilation times.

@psychon
Copy link
Owner

psychon commented Sep 26, 2023

Random guess would be "lots of XML for protocol stuff resulting in even more generated Rust code that rustc has to cope with".

For example alacritty on my system builds in around 20seconds, and each of the x11rb-protocol takes around 12s (I have 32 threads, so it doesn't really matter, but on slow systems it could really decrease perf).

I was curious about enabled features, but that is apparently not so easy to figure out just with a web browser.

According to https://github.com/alacritty/alacritty/blob/a58fb39b68caa34b073f66911c0ac6945f56eac2/Cargo.lock, a dependency on x11rb 0.12.0 comes from winit. x11-clipboard depends on x11rb 0.10.1

x11-clipboard just enables xfixes (and apparently has already been updated to 0.12.0 in git): https://github.com/quininer/x11-clipboard/blob/c7fd2a93c93a7690478953db87de849d47ce9f2a/Cargo.toml#L12

winit enables allow-unsafe-code, dl-libxcb, randr, resource_manager, xinput, xkb: https://github.com/rust-windowing/winit/blob/master/Cargo.toml#L155

I am ignoring the non-X11-extension features (since I kinda expect the generated code to be the problem since it really is huge). This leads to the following files being enabled to build in x11rb-protocol/src/protocol/:

  • 135.214 bytes / 3.645 lines in xfixes.rs
  • 285.344 bytes / 7.352 lines in randr.rs
  • 584.999 bytes / 14.709 lines in xkb.rs
  • 742.184 bytes / 19.405 lines in xinput.rs
  • 947.122 bytes / 25.897 lines in xproto.rs
    For a grand total of of 2.694.863 bytes / 71.008 lines of generated code. (mod.rs in that file also has some more lines that are gates behind these features.)

And random time measurements with cargo check (which of course does less than cargo build, but I am just trying to measure the influence of the extensions). I am running cargo clean && time cargo check -p x11rb-protocol $EXTRA_STUFF:

  • --no-default-features: 4.9 seconds of wall clock time
  • default features: 5.0 seconds of wall clock time
  • --features resource_manager: 5.3 seconds
  • --features xfixes,randr,xkb,xinput: 15.3 seconds

So, to me this is clearly due to the huge generated code files.

However, that still does not provide any hint on what to do about this. Adding more feature flags doesn't sound like a workable solution.

@kchibisov
Copy link
Author

I guess the only way is to reduce the amount of things being generated, since for example wayland stuff is pretty fast to generate on the other side and we use a lot of protocols.

@kchibisov
Copy link
Author

I'm not familiar with X11 internals though to say whether what you generate makes sense or not in the end of the day, but the amount of code you say sounds like really a lot (it's like alarcitty + glutin + winit + some wayland crates combined)... Especially when those types have derive impls, etc, etc.

One technique to reduce code with generated source code is outlining, but I'm not sure how it's applicable.

@psychon
Copy link
Owner

psychon commented Sep 26, 2023

wayland-protocols seems to use wayland-scanner to generate code for the XML from another thingie called wayland-protocols. The docs for wayland-scanner do not really say / show what it generates.

I don't know much about Wayland, but there does not seem to be anything like nested structs in there. I hacked together a quick Python script to show which XML tags exist and found ['arg', 'copyright', 'description', 'entry', 'enum', 'event', 'interface', 'protocol', 'request']. The same script run on the XML from xcb-proto finds ['allowed', 'bit', 'bitcase', 'brief', 'case', 'description', 'doc', 'enum', 'enumref', 'error', 'errorcopy', 'event', 'eventcopy', 'eventstruct', 'example', 'exprfield', 'fd', 'field', 'fieldref', 'import', 'item', 'length', 'list', 'listelement-ref', 'op', 'pad', 'paramref', 'popcount', 'reply', 'request', 'required_start_align', 'see', 'struct', 'sumof', 'switch', 'type', 'typedef', 'union', 'unop', 'value', 'xcb', 'xidtype', 'xidunion'].

Perhaps wayland was actually designed to be simple. X11 certainly was not.

One technique to reduce code with generated source code is outlining, but I'm not sure how it's applicable.

https://github.com/psychon/x11rb/blob/master/doc/generated_code.md has some introduction to the generated code and it shows how much we generate. Random example: <enum>s do not have a size in the XML (because C is so lenient about integer casts, I guess...), so for each enum, we generate code to convert to/from ever integer type that may fit, e.g. u8, u16, u32. According to the example on that page, one <enum> turned into seven <impl> blocks.

I feel like cutting down on the build time here means cutting down on features. :-(

@psychon
Copy link
Owner

psychon commented Sep 26, 2023

https://fasterthanli.me/articles/why-is-my-rust-build-so-slow suggested some crimes: RUSTC_BOOTSTRAP=1 cargo rustc -p x11rb-protocol --features xfixes,randr,xkb,xinput --verbose -- -Z self-profile

summarize looks like this is just lots of code and thus taking long (first 20 lines of output):

+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| Item                                                | Self time | % of total time | Time     | Item count | Incremental result hashing time |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen_emit_obj                        | 13.51s    | 29.391          | 13.51s   | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| typeck                                              | 5.04s     | 10.966          | 5.44s    | 17024      | 79.56ms                         |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_passes                                         | 4.72s     | 10.274          | 4.72s    | 1          | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| codegen_module                                      | 4.54s     | 9.873           | 5.24s    | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_borrowck                                        | 2.77s     | 6.028           | 5.29s    | 17024      | 6.54ms                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| LLVM_module_codegen                                 | 1.50s     | 3.263           | 15.01s   | 135        | 0.00ns                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_drops_elaborated_and_const_checked              | 1.36s     | 2.949           | 1.56s    | 17029      | 2.96ms                          |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| optimized_mir                                       | 973.42ms  | 2.118           | 2.53s    | 14807      | 140.88ms                        |
+-----------------------------------------------------+-----------+-----------------+----------+------------+---------------------------------+
| mir_built                                           | 936.56ms  | 2.038           | 1.39s    | 17024      | 235.99ms                        |

The SVG from flamegraph is not much more interesting. A third of the time is analyis, which is stuff like borrow checking and type checking. Most everything else sounds like codegen.

The chrome://tracing thingie didn't work. crox produced a 1006 MiB JSON, but chromium didn't like it. On ui.perfetto.dev, I got an out-of-memory error.

@notgull
Copy link
Collaborator

notgull commented Sep 26, 2023

It seems that the issue is the raw amount of code generated being passed to LLVM. It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq and Ord ones, which are rarely used). I think that might alleviate the issue, at least partially.

@kchibisov
Copy link
Author

This won't help unfortunately. The only thing that could probably help is speculating on the data and reducing the amount of matches you have. Like if you look at common code it has lots of very long matches doing basically the same thing as getting the same field from various event structs. From what I remember with X11 they sort of share the place in the struct when it comes to byte offset, because the X11 API from what I remember in Xlib uses a bunch of unions to model things.

Maybe with the help of some unsafe and speculations on how things are located in memory it could reduce at least giant matches?

Probably don't need a bunch of impls of from basic integer types for bitmasks, since you can obviously cast them and then call From.

The issue is not only about the build times, but the amount of code ending in the binary. The stuff like Eq and Ord won't help because they simply will be optimized out.

@notgull
Copy link
Collaborator

notgull commented Sep 27, 2023

Idea similar to one I tried in winit: replace the match statement with a per-display HashMap that's keyed to take some part of the bytes and then translate that to a function pointer that parses the bytes as the relevant request/response/event/whatever. It will shift some part of the complexity from compile time to runtime (although that may be a good thing), but it should eliminate these huge matches.

@psychon
Copy link
Owner

psychon commented Sep 27, 2023

I made an attempt to delete all code related to parsing requests (we normally only have to serialise them, but our xtrace example uses code for parsing; this seemed like something that could be hidden behind a feature flag). The time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput measurement from above says 13 seconds, so this saves about 2 seconds.

The result does not properly build (e.g. x11rb is broken), but at least x11rb-protocol builds fine.

The patch for the code generator
diff --git a/generator/src/generator/namespace/request.rs b/generator/src/generator/namespace/request.rs
index f9d9d32..72d4d9d 100644
--- a/generator/src/generator/namespace/request.rs
+++ b/generator/src/generator/namespace/request.rs
@@ -193,6 +193,7 @@ pub(super) fn generate_request(
         header = generator.ns.header,
         lifetime = lifetime_block
     ));
+    /*
     if gathered.has_fds() {
         enum_cases.request_parse_cases.push(format!(
             "{header}::{opcode_name}_REQUEST => return \
@@ -214,6 +215,7 @@ pub(super) fn generate_request(
             name = name,
         ));
     }
+    */
     emit_request_function(
         generator,
         request_def,
@@ -299,7 +301,7 @@ pub(super) fn generate_request(
             &reply_fields,
             &[],
             false,
-            true,
+            false,
             StructSizeConstraint::EmbeddedLength { minimum: 32 },
             true,
             true,
@@ -316,6 +318,7 @@ pub(super) fn generate_request(
         ));
     }
 
+/*
     if gathered.needs_lifetime {
         enum_cases.request_into_owned_cases.push(format!(
             "Request::{ns_prefix}{name}(req) => Request::{ns_prefix}{name}(req.into_owned()),",
@@ -329,6 +332,7 @@ pub(super) fn generate_request(
             name = name,
         ));
     }
+    */
 }
 
 fn generate_aux(
@@ -341,7 +345,7 @@ fn generate_aux(
     let aux_name = format!("{}Aux", request_def.name);
 
     if switch_field.kind == xcbdefs::SwitchKind::Case {
-        switch::emit_switch_type(generator, switch_field, &aux_name, true, true, None, out);
+        switch::emit_switch_type(generator, switch_field, &aux_name, false, true, None, out);
     } else {
         let doc = format!(
             "Auxiliary and optional information for the `{}` function",
@@ -351,7 +355,7 @@ fn generate_aux(
             generator,
             switch_field,
             &aux_name,
-            true,
+            false,
             true,
             Some(&doc),
             out,
@@ -912,6 +916,7 @@ fn emit_request_struct(
         outln!(out, "}}");
 
         // Parsing implementation.
+        /*
         outln!(
             out,
             "/// Parse this request given its header, its body, and any fds that go along \
@@ -1071,6 +1076,7 @@ fn emit_request_struct(
             });
             outln!(out, "}}");
         }
+        */
     });
     outln!(out, "}}");
     outln!(
@@ -1118,6 +1124,7 @@ fn emit_request_struct(
     });
     outln!(out, "}}");
 
+    /*
     let request_trait = if request_def.reply.is_none() {
         "crate::x11_utils::VoidRequest"
     } else if gathered.reply_has_fds {
@@ -1136,6 +1143,7 @@ fn emit_request_struct(
         outln!(out.indent(), "type Reply = {}Reply;", name);
     };
     outln!(out, "}}");
+    */
     num_slices_opt.unwrap()
 }
 
diff --git a/generator/src/generator/namespace/struct_type.rs b/generator/src/generator/namespace/struct_type.rs
index 73b7d86..be9f268 100644
--- a/generator/src/generator/namespace/struct_type.rs
+++ b/generator/src/generator/namespace/struct_type.rs
@@ -25,7 +25,7 @@ pub(super) fn emit_struct_type(
     out: &mut Output,
 ) {
     assert!(!generate_serialize || fields_need_serialize);
-    assert!(matches!(parse_size_constraint, StructSizeConstraint::None) || generate_try_parse);
+    //assert!(matches!(parse_size_constraint, StructSizeConstraint::None) || generate_try_parse);
 
     let deducible_fields = gather_deducible_fields(fields);
 
diff --git a/generator/src/generator/requests_replies.rs b/generator/src/generator/requests_replies.rs
index 4b29d82..6100fb1 100644
--- a/generator/src/generator/requests_replies.rs
+++ b/generator/src/generator/requests_replies.rs
@@ -276,6 +276,7 @@ pub(super) fn generate(out: &mut Output, module: &xcbdefs::Module, mut enum_case
             );
         });
         outln!(out, "}}");
+        /*
         outln!(
             out,
             "/// Get the matching reply parser (if any) for this request."
@@ -340,6 +341,7 @@ pub(super) fn generate(out: &mut Output, module: &xcbdefs::Module, mut enum_case
             outln!(out, "}}");
         });
         outln!(out, "}}");
+        */
     });
     outln!(out, "}}");
     outln!(out, "");

Diffstat for the generated code:

$ git diff --stat generator x11rb-protocol/                                                                                                       master 37
 generator/src/generator/namespace/request.rs     |   14 +-
 generator/src/generator/namespace/struct_type.rs |    2 +-
 generator/src/generator/requests_replies.rs      |    2 +
 x11rb-protocol/src/protocol/bigreq.rs            |   30 -
 x11rb-protocol/src/protocol/composite.rs         |  180 ------
 x11rb-protocol/src/protocol/damage.rs            |  100 ---
 x11rb-protocol/src/protocol/dbe.rs               |  184 ------
 x11rb-protocol/src/protocol/dpms.rs              |  199 ------
 x11rb-protocol/src/protocol/dri2.rs              |  507 ---------------
 x11rb-protocol/src/protocol/dri3.rs              |  349 ----------
 x11rb-protocol/src/protocol/ge.rs                |   36 --
 x11rb-protocol/src/protocol/glx.rs               | 3321 -----------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/mod.rs               | 3266 ----------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/present.rs           |  176 ------
 x11rb-protocol/src/protocol/randr.rs             | 1591 +---------------------------------------------
 x11rb-protocol/src/protocol/record.rs            |  222 -------
 x11rb-protocol/src/protocol/render.rs            | 1222 -----------------------------------
 x11rb-protocol/src/protocol/res.rs               |  219 -------
 x11rb-protocol/src/protocol/screensaver.rs       |  298 ---------
 x11rb-protocol/src/protocol/shape.rs             |  283 ---------
 x11rb-protocol/src/protocol/shm.rs               |  242 -------
 x11rb-protocol/src/protocol/sync.rs              |  558 ----------------
 x11rb-protocol/src/protocol/xc_misc.rs           |  100 ---
 x11rb-protocol/src/protocol/xevie.rs             |  167 -----
 x11rb-protocol/src/protocol/xf86dri.rs           |  369 -----------
 x11rb-protocol/src/protocol/xf86vidmode.rs       |  839 ------------------------
 x11rb-protocol/src/protocol/xfixes.rs            |  768 ----------------------
 x11rb-protocol/src/protocol/xinerama.rs          |  198 ------
 x11rb-protocol/src/protocol/xinput.rs            | 2248 -----------------------------------------------------------------
 x11rb-protocol/src/protocol/xkb.rs               | 2383 +-------------------------------------------------------------------
 x11rb-protocol/src/protocol/xprint.rs            |  667 --------------------
 x11rb-protocol/src/protocol/xproto.rs            | 4595 ------------------------------------------------------------------------------------------------------------------------------------
 x11rb-protocol/src/protocol/xselinux.rs          |  691 --------------------
 x11rb-protocol/src/protocol/xtest.rs             |  110 ----
 x11rb-protocol/src/protocol/xv.rs                |  642 -------------------
 x11rb-protocol/src/protocol/xvmc.rs              |  267 --------
 36 files changed, 24 insertions(+), 27021 deletions(-)

One one hand, saving two seconds is not much. On the other hand, saving about 15% of time is a relative large gain. Dunno how that number evolved when the code is not simply deleted, but actually hidden behind a flag.

Besides that, not much caught my eye when looking through doc/generated_code.md. We could drop serialize_into from the Serialize trait. Or we could hide impl From<&FooEvent> for [u8; 32] behind a flag (since that is only necessary for conn.send_event(), which perhaps is not used much?). Some of the trait-business around request sending seems "too much" and might be able to be cut down a bit, but this part is complicated (there is a free function for the request calling a serialize function and in parallel there is a impl Request for FooRequest that also does stuff; I don't know where to cut).

Hm... perhaps the switch stuff can be simplified to the compiler with helper function? I am talking about the FooAux stuff that generates long chains of code for each case.

@psychon
Copy link
Owner

psychon commented Sep 28, 2023

It would be nice if there was a feature that disabled some of the lesser used trait implementations (like the Eq and Ord ones, which are rarely used). I

That helped surprisingly much. Removing Debug, Copy, Clone breaks the build. Removing everything else makes time cargo check -p x11rb-protocol --features xfixes,randr,xkb,xinput take 11.3 s instead of 15 s. 30% reduction in compile time.

diff --git a/generator/src/generator/namespace/helpers.rs b/generator/src/generator/namespace/helpers.rs
index f57ebb2..2c95235 100644
--- a/generator/src/generator/namespace/helpers.rs
+++ b/generator/src/generator/namespace/helpers.rs
@@ -225,12 +225,12 @@ impl Derives {
             debug: true,
             clone: true,
             copy: true,
-            default_: true,
-            partial_eq: true,
-            eq: true,
-            partial_ord: true,
-            ord: true,
-            hash: true,
+            default_: false,
+            partial_eq: false,
+            eq: false,
+            partial_ord: false,
+            ord: false,
+            hash: false,
         }
     }
 

Combined with yesterday's patch, the result is 10.4 seconds. Uhm.... I would have expected more.

@psychon
Copy link
Owner

psychon commented Oct 14, 2023

I was testing with the cargo build --timings and cargo build --release --timings on the tip of the alacritty/alacritty repo. You could do that on the winit repo alone for example.

Okay, so here are some measurements for winit bbeacc46d52c9f0455a376efb4d561b33b14e8a1 using Debian's cargo 1.65 and rustc 1.70. I used [patch.crates-io] to override the x11rb and x11rb-protocol versions with my local checkout (master from a week ago (commit 89a3197) and current master (4fcd1c6)) for the second build. So, I am testing the effect that #884 had.

For non-release x11rb, I had to apply the following patch to winit for it to build
diff --git a/src/platform_impl/linux/x11/window.rs b/src/platform_impl/linux/x11/window.rs
index bab2f8c..e7c2f0e 100644
--- a/src/platform_impl/linux/x11/window.rs
+++ b/src/platform_impl/linux/x11/window.rs
@@ -1322,7 +1322,7 @@ impl UnownedWindow {
             self.xwindow as xproto::Window,
             xproto::AtomEnum::WM_NORMAL_HINTS,
         )?
-        .reply()?;
+        .reply()?.unwrap();
         callback(&mut normal_hints);
         normal_hints
             .set(
@@ -1375,6 +1375,7 @@ impl UnownedWindow {
         )
         .ok()
         .and_then(|cookie| cookie.reply().ok())
+        .flatten()
         .and_then(|hints| hints.size_increment)
         .map(|(width, height)| (width as u32, height as u32).into())
     }
@@ -1764,6 +1765,7 @@ impl UnownedWindow {
             WmHints::get(self.xconn.xcb_connection(), self.xwindow as xproto::Window)
                 .ok()
                 .and_then(|cookie| cookie.reply().ok())
+                .flatten()
                 .unwrap_or_default();
 
         wm_hints.urgent = request_type.is_some();

I used cargo build --timings --no-default-features --features x11 and the same with --release added. The time shown are Total time and the time for x11rb-protocol according to --timings

latest x11rb release old x11rb master current x11rb master
debug build 1 31.5s and 24.4s 29.1s and 27.3s 23.9s and 20.9s
debug build 2 31.6s and 24.6s 30.1s and 27.0s 23.8s and 20.6s
release build 1 45s and 39.1s 38.5s and 36.9s 31.9s and 30.5s
release build 2 44.9s and 38.7s 38.8s and 37.2s 32.2s and 30.0s

So... #883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!

@kchibisov
Copy link
Author

So... #883 clearly helped, but we are still dominating build times. And somehow the total build time goes down when patching from latest x11rb release to my local, newer copy, but the times for x11rb-protocol goes up (for a debug build)?!

I suggest to test single threaded build, because modern CPUs schedule things weirdly and the core used for longer crates could have less clock. With massive build times like we have, general testing could be done 'normally', since any improvement will be noticeable.

I'm afraid though, that feature gating won't help, unless it's enforced, since other crates could enforce features not used by winit (x11-clipboard) which will result in basically no perf changes. Though, we could just suggest everyone to disable default features as of now.

@psychon
Copy link
Owner

psychon commented Oct 14, 2023

Though, we could just suggest everyone to disable default features as of now.

The new features are not enabled by default (at least for x11rb, x11rb-protocol has a default feature, but is used by x11rb with default features disabled). So, no need to have everyone disable default features for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants