Feature request: assembling to object files and linking #48

Open
physnoct opened this issue Oct 11, 2020 · 11 comments

@physnoct

It would be interesting to be able to assemble to object files and link them. The SDCC code could be a good starting point.
I used customasm in my little project. I'm able to use 128-bit constants!
https://github.com/physnoct/softmicro

@aslak3

aslak3 commented Oct 11, 2020

Linking would be pretty incredible to have. Making up larger projects via includes gets tedious pretty quickly. I'm assuming it wouldn't be possible to generate elf format files and link them using standard tools?

@mindstorm38

mindstorm38 commented Oct 11, 2020

I have been thinking about this feature for several days, and I may have found something elegant.

Since there are several object formats (COFF, ELF, ...), I think it would be great to have a block type #obj <format> {<options>} which could be used to describe the options required for that format.

For example, in ELF, it's necessary to have in the header identifier the number of bits used (32/64) as well as the ABI.

To do this I thought of the following syntax:

#obj elf {
    bits = 32           ; maybe guessed from #bits ?
    abi = linux         ; literal or integer (linux = 3)
    machine = x86
}

This way, the assembler can tell whether a CPU definition is valid for assembling to a given object format.
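
As a rough illustration of where those options would end up, here is a minimal sketch that turns an #obj-style option set into a 32-bit ELF header for a relocatable file. The numeric constants come from the ELF specification (linux = 3, as noted above; x86 is EM_386 = 3). The function name `elf32_header` and the options dictionary are hypothetical and are not part of customasm.

```python
import struct

# Constants from the ELF specification.
ELFCLASS32 = 1                     # EI_CLASS: 32-bit objects
ELFDATA2LSB = 1                    # EI_DATA: little-endian
EV_CURRENT = 1                     # EI_VERSION and e_version
OSABI = {"sysv": 0, "linux": 3}    # EI_OSABI
MACHINE = {"x86": 3}               # e_machine: EM_386
ET_REL = 1                         # e_type: relocatable object file


def elf32_header(options):
    """Build a 52-byte ELF32 header from hypothetical #obj-style options."""
    if options["bits"] != 32:
        raise ValueError("this sketch only lays out the 32-bit header")
    abi = options["abi"]
    osabi = OSABI[abi] if isinstance(abi, str) else abi  # literal or integer
    # e_ident: magic, class, data encoding, version, OS ABI, ABI version, padding.
    ident = struct.pack("<4sBBBBB7x", b"\x7fELF", ELFCLASS32, ELFDATA2LSB,
                        EV_CURRENT, osabi, 0)
    # e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
    # e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx.
    rest = struct.pack("<HHIIIIIHHHHHH", ET_REL, MACHINE[options["machine"]],
                       EV_CURRENT, 0, 0, 0, 0, 52, 0, 0, 0, 0, 0)
    return ident + rest


header = elf32_header({"bits": 32, "abi": "linux", "machine": "x86"})
```

A real writer would also have to emit section headers, a symbol table and relocation sections; the sketch only shows that bits, abi and machine map directly onto header fields.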

@moonheart08

The problem with existing formats is that they don't exactly support custom architectures.
The final object linking to an existing format makes sense, but, for example, ELF only supports 32bit and 64bit machines.

@mindstorm38

mindstorm38 commented Oct 11, 2020

The problem with existing formats is that they don't exactly support custom architectures.
The final object linking to an existing format makes sense, but, for example, ELF only supports 32bit and 64bit machines.

That's why I think it's interesting to use a new #obj block, by default all CPU definitions will not support assembly of object files. Only explicit declarations of #obj <format> will allow you to assemble an object file of a specific <format>.

@hlorenzi
Owner

Ohh, that project is so cool! It would be nice to have a list of all projects using the assembler.

About linking, I think I'd need more input from you guys about the advantages of using a linker, and the current annoyances of working without one. @aslak3, what's your current project structure and build pipeline, and what would they ideally look like? Would you use a makefile or something else?

There are also some considerations about the structure of the .asm files. For example, I'd rather have something akin to the Rust compiler, where it reads and considers the entire project as a whole, than to have a C-like compiler, where files must textually include every dependency in a certain order and you end up having to split up declarations and definitions. But then again, that might not be the best architecture for an assembler, but I don't have enough data at hand to make an informed decision...

@moonheart08

An assembler pretty much needs to go with the old linking method of doing things. As nice as a full project system is, it's not viable here.

@rj45

rj45 commented Feb 27, 2021

For a C toolchain to target customasm, you need to be able to build object files from assembler, that are later linked together into binaries. That's just the way C works, and going against the grain on this just makes it really hard to port existing software.

An object file needs to be able to have multiple "segments" (which are pretty much banks with auto-sizing) with a header that specifies the size of each segment. A segment can hold code, read-only constants (usually a series of strings), initialized data (usually global variables) or uninitialized data (which usually isn't stored in the object file at all, it just has its address and size specified in the header).

Another important feature of object files is importing symbols from other object files. So you would have a relocation table that specifies where in the segments each imported symbol is found so that the linker can go in and patch those locations with the actual address of the symbol. In the case that the address is encoded into an instruction, it would be important to specify which rule produced the instruction so the linker knows which bits are address bits that it needs to change.

Then of course you need to be able to list out all the exported symbols and their address/segment info so that the linker knows what address to insert into the imports of dependent objects.

There are standard object file formats. A 32-bit ELF file, for example, can work with an 8-bit, 16-bit or 24-bit CPU no problem; it just means it uses 32-bit values for addresses internally. It doesn't mean the CPU architecture needs to use all those bits. ELF is pretty well documented and there are libraries in most languages for working with it. Not sure if it supports custom relocation rules though, but that's okay, a separate file could be produced with the rules.

With the above, a linker could be made. If the linker also understands the rules, it could even be generic enough to work with any customasm CPU definition. That would be pretty awesome actually.
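
To make the pieces above concrete, here is a minimal sketch of such a linker, assuming a made-up in-memory object format with a single code segment per object, an export table and a relocation table. The names `ObjectFile`, `Relocation` and `link` are hypothetical, and the big-endian address fields are an arbitrary choice.

```python
from dataclasses import dataclass, field


@dataclass
class Relocation:
    offset: int      # byte offset of the address field inside the segment
    symbol: str      # name of the imported symbol
    width: int = 2   # size of the address field in bytes


@dataclass
class ObjectFile:
    code: bytes
    exports: dict = field(default_factory=dict)       # name -> offset in code
    relocations: list = field(default_factory=list)   # imported symbol uses


def link(objects, base=0):
    """Concatenate code segments and patch every imported address."""
    # Pass 1: place segments back to back and resolve exported symbols.
    symbols, address = {}, base
    for obj in objects:
        for name, offset in obj.exports.items():
            if name in symbols:
                raise ValueError(f"duplicate symbol: {name}")
            symbols[name] = address + offset
        address += len(obj.code)
    # Pass 2: patch each relocation with the final symbol address.
    image = bytearray()
    for obj in objects:
        code = bytearray(obj.code)
        for reloc in obj.relocations:
            if reloc.symbol not in symbols:
                raise KeyError(f"undefined symbol: {reloc.symbol}")
            value = symbols[reloc.symbol].to_bytes(reloc.width, "big")
            code[reloc.offset:reloc.offset + reloc.width] = value
        image += code
    return bytes(image), symbols


main = ObjectFile(bytes([0x10, 0x00, 0x00, 0xFF]), exports={"start": 0},
                  relocations=[Relocation(offset=1, symbol="helper")])
lib = ObjectFile(bytes([0x20, 0x30]), exports={"helper": 1})
image, symbols = link([main, lib], base=0x100)
```

A relocation here only records a byte offset and a width. Handling addresses packed into arbitrary instruction bits would need the extra per-rule information described above.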

@rj45

rj45 commented Feb 27, 2021

There is possibly also another solution: make customasm assembly files the "object" files, and treat customasm as a linker instead.

I think there might be issues with imported / exported symbols and file-local "static" symbols colliding between files if you just concat all the assembly files together and run customasm on it. There might be a way to use scoped labels to get around that, maybe having each "object" assembly file have a top-level label that all other labels are scoped under, and only exported/imported labels are unprefixed. But if this doesn't work, it might be fairly easy to make customasm support the right semantics.

But it would still be nice to have dynamically sized banks, so "executables" can be produced by customasm rather than just ROM images. I think the minimum feature for this would be the ability to generate a header "bank" with the sizes of other banks specified as #d values. Then the ability to make a bank size "dynamic" and assign a label to the size for use in the header bank. But maybe there are better solutions.
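
For illustration, the kind of output a header bank would produce can be sketched outside customasm. The layout below (a 16-bit bank count followed by one 16-bit size per bank, big-endian) is an assumption chosen for the example, not an existing format, and `build_executable` is a hypothetical name.

```python
import struct


def build_executable(banks):
    """Prefix the concatenated banks with a header listing each bank's size.

    `banks` maps bank names to their assembled bytes, in output order.
    """
    header = struct.pack(">H", len(banks))       # number of banks
    for data in banks.values():
        header += struct.pack(">H", len(data))   # size of each bank
    return header + b"".join(banks.values())


image = build_executable({"code": b"\x01\x02\x03", "data": b"\xAA"})
```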

Then the final piece is probably just wrapping customasm in a script that adapts the command line parameters that a linker would usually use, to the ones that customasm expects. That would be easy enough.
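
A minimal sketch of such a wrapper follows. It accepts a linker-style command line, writes nothing to disk, and returns the text of a top-level file that pulls in every "object" with #include plus the command to run. The file name `link_main.asm` is hypothetical, and the `-o` output flag on the customasm side is assumed from its usual command-line usage, so check it against the installed version.

```python
import argparse


def linker_args_to_customasm(argv):
    """Translate `-o OUT obj1.asm obj2.asm ...` into a customasm invocation."""
    parser = argparse.ArgumentParser(prog="ld-wrapper")
    parser.add_argument("-o", dest="output", default="a.out")
    parser.add_argument("inputs", nargs="+")
    args = parser.parse_args(argv)
    # One #include per assembly "object", in command-line order.
    top_level = "".join(f'#include "{path}"\n' for path in args.inputs)
    # Hypothetical top-level file name; the caller writes it before running.
    command = ["customasm", "link_main.asm", "-o", args.output]
    return top_level, command


top_level, command = linker_args_to_customasm(
    ["-o", "prog.bin", "crt0.asm", "main.asm"])
```

The CPU definition would also have to be included once, either from the generated top-level file or from each "object".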

I haven't tried it, but that might allow a C program to be ported to a CPU that customasm can support, provided of course that there's a C compiler that can produce the assembly "object" files, and not have to mess with the build system of the C program too much. LCC for example is more than capable of producing customasm assembly files and it has a book that explains how to retarget it.

@Phlosioneer

So resurrecting this old issue here.

I'm using customasm for a custom CPU I made with extremely little RAM and a primitive MMU. That means I need to deal with overlays. And that means I need a linker to manage the code layout, compute the overlay tables, include those in the output, and potentially compute what symbols should go in which overlay blocks.

At this point I have three options:

  1. Write my own linker, and write my own assembler that outputs to a format the linker can work with.
  2. Write my own linker, and extend customasm to output to a format the linker can work with.
  3. Extend customasm with linker functionality.

To get an understanding for the difficulty in doing each, I started doing all three as a way to measure the level-of-effort required for the change. Here's what I found:

  1. Writing my own linker is pretty easy, as long as I have control over the object-file format, and as long as I manually place labels with special names at the start and end of each function.
  2. Writing my own assembler is doable but mildly difficult for my particular choice of ruleset, due to the way I've encoded some of my instructions.
  3. Extending customasm with linker functionality is really hard, because there is the potential (in the general case) for different object files to be assembled with different rulesets, or different versions/iterations of the same ruleset.
  4. Extending customasm to output to an object-file format is surprisingly difficult. This is due to how label resolution affects program layout - the specific label address can change the instruction being used (imagine switching from a jump-unaligned instruction to a shorter jump-aligned instruction if a label is word-aligned, and then that change unaligns the label), which can cascade in unpredictable ways through the program. This is a problem that is unique to the way customasm works, and no other assembler has to be so general.
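
The cascade described in the last point can be reproduced with a toy model. The sketch below assumes a made-up instruction set in which a jump takes 2 bytes when its target is word-aligned and 3 bytes otherwise, and it re-runs layout until the sizes stop changing. It is not how customasm resolves labels; all names are hypothetical.

```python
LONG_JUMP, SHORT_JUMP = 3, 2


def jump_size(target_address):
    """The short encoding is only available for word-aligned targets."""
    return SHORT_JUMP if target_address % 2 == 0 else LONG_JUMP


def layout(program, max_passes=10):
    """Iterate label addresses and jump sizes until they agree.

    `program` is a list of ("bytes", n), ("label", name) or ("jump", name).
    """
    sizes = []
    for kind, arg in program:
        if kind == "bytes":
            sizes.append(arg)
        elif kind == "jump":
            sizes.append(LONG_JUMP)   # start from the pessimistic guess
        else:
            sizes.append(0)           # labels occupy no space
    for _ in range(max_passes):
        labels, address = {}, 0
        for (kind, arg), size in zip(program, sizes):
            if kind == "label":
                labels[arg] = address
            address += size
        new_sizes = [jump_size(labels[arg]) if kind == "jump" else size
                     for (kind, arg), size in zip(program, sizes)]
        if new_sizes == sizes:
            return labels, sizes
        sizes = new_sizes
    raise RuntimeError("layout did not stabilise")


stable = [("jump", "end"), ("bytes", 2), ("label", "end")]
unstable = [("jump", "end"), ("bytes", 1), ("label", "end")]
labels, sizes = layout(stable)
```

In `unstable`, shrinking the jump moves `end` from address 4 to address 3, which forces the long encoding again, so `layout(unstable)` raises instead of settling. An object-file writer hits the same problem whenever a label's final address is unknown at assembly time.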

I think the best approach is to have an opt-in system of separate ruleset(s) for instructions that need to be delayed until link time (I'm going to call these linker-rulesets).

During assembly to executable, all linker-rulesets are ignored, and no special behavior happens. During assembly to object file:

  1. All linker-ruleset rules are allowed to reference undeclared labels in their inputs, as long as those inputs have bounded maximum size, and as long as the rule size is also bounded. This will avoid the complexity of the size-guessing system, which I don't think I understand well enough to modify effectively.
  2. The assembler will assume any assert statement fails for unknown labels, unless the assert statement references bits outside the known bounded maximum, or is doing a comparison outside the known bounded maximum. (So for example, assert(unk_8_bit_label < 50) will always fail, but assert(unk_8_bit_label[8:15] == 0) is fine, and assert(unk_8_bit_label < 256) is also fine)
  3. Generation will decide whether to use the normal rulesets or the linker-rulesets depending on whether unknown labels are present.
  4. Undeclared labels may be bounded manually using special syntax (something like #labelbound 256 <= farFunction < 65536) and may be aligned (something like #labelalign farFunction 4). A command-line option will be added to force all unknown labels to use these directives, as a stricter way to catch more errors (like misspelled labels) at assemble-time instead of link-time.
  5. During assembly, keep a table of the addresses and sizes of each linker rule's output used in the final assembly; and also the name of the linker-ruleset the rule came from. This table includes instructions generated as part of asm { } blocks (even if the instruction with the asm { } block is also in a named linker-ruleset).
  6. When generating the output files, in addition to the normal assembled binary, customasm will generate a file in either json or csv format with the built table.
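
The assert rule in point 2 amounts to interval reasoning: an assert on an unknown label passes only if it holds for every value the label could take. Here is a minimal sketch, assuming non-negative bounds and using hypothetical names (`Bounded`, `assert_less_than`, `assert_bits_zero`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounded:
    """An unknown label known only to lie in lo..hi (inclusive, lo >= 0)."""
    lo: int
    hi: int


def assert_less_than(label, limit):
    """Provable only if even the largest possible value is below the limit."""
    return label.hi < limit


def assert_bits_zero(label, low_bit):
    """Provable if no possible value can reach bit `low_bit` or above.

    This is a conservative check: it may reject asserts that happen to hold.
    """
    return label.hi < (1 << low_bit)


unk_8_bit_label = Bounded(0, 255)
results = (
    assert_less_than(unk_8_bit_label, 50),    # not provable, so it fails
    assert_less_than(unk_8_bit_label, 256),   # holds for every 8-bit value
    assert_bits_zero(unk_8_bit_label, 8),     # bits 8 and above are always 0
)
```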

The following details are left to end-users:

  • Translate csv/json + binary files into object file(s) on their own.
  • Craft their linker-rulesets such that the ruleset name, the address, and the size are enough info to find the bits/bytes within the instruction to change. (For example, a linker-ruleset named ByteAddressAfterSecondByte, another named QWordAddressAfterThirdByte, and another named WordAddressInBits3Thru18.)
  • Handle "far" jumps discovered during linking, possibly by re-assembling the file with proper #labelbound statements appended. I don't think this will come up often, users will generally have a good idea of the bounds for unknown labels, or will write jumps assuming the largest outcome, and if a jump happens to be "near" or smaller, then the linker can overwrite the opcodes itself and fill in NOPs for unused
  • Choose and/or write their linker, and all the options that come with it.
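
As a sketch of what the end-user side could look like, the code below patches a binary from the generated table, using the ruleset name to locate the address field inside each instruction. The JSON shape (including the `symbol` key), the field positions chosen for each ruleset name and the big-endian encoding are all assumptions made for the example.

```python
import json

# Hypothetical convention: ruleset name -> (byte offset of the address field
# within the instruction, width of the field in bytes).
PATCH_FIELDS = {
    "ByteAddressAfterSecondByte": (2, 1),
    "QWordAddressAfterThirdByte": (3, 8),
}


def apply_table(binary, table_json, symbols):
    """Fill in every delayed address field listed in the table."""
    image = bytearray(binary)
    for entry in json.loads(table_json):
        field_offset, width = PATCH_FIELDS[entry["ruleset"]]
        if field_offset + width > entry["size"]:
            raise ValueError("address field does not fit in the instruction")
        start = entry["address"] + field_offset
        # to_bytes raises OverflowError if the address is too large ("far").
        value = symbols[entry["symbol"]].to_bytes(width, "big")
        image[start:start + width] = value
    return bytes(image)


table = json.dumps([{"address": 1, "size": 3,
                     "ruleset": "ByteAddressAfterSecondByte",
                     "symbol": "farFunction"}])
patched = apply_table(bytes([0xAA, 0x01, 0x02, 0x00, 0xBB]), table,
                      {"farFunction": 0x7F})
```

The `OverflowError` from `to_bytes` is where a linker would detect a jump that turned out to be "far" and trigger the re-assembly step mentioned above.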

My main ask in this comment is whether @hlorenzi would be interested in the above as a PR, or if it should stay a private fork. It would be a significant amount of code to add this feature. If there's no interest in the code as part of a PR, I would not include support for the features I'm not using; or I may just write my own purpose-built assembler, I don't know for sure.

@Phlosioneer

After some more coding, this design is much simpler:

  • All external symbols are pre-declared with specified upper bounds using an #external directive. The lower bounds and alignment are optional, both default to 0.
  • External symbols can have a specific constant value.
  • The linking capabilities are always on, no flag is needed for them to function.
  • External symbols can still only be used within rules defined in linker ruleset blocks, which still must have names.
  • The assert rules above still apply.
  • I don't think any further restrictions are necessary on the final output size of rules.

Nice-to-have features that I may add:

  • Allow specifying external symbols using a special input file in a standard format, like CSV.
  • Allow any named rule block to accept external symbols, and remove the #linkerruleset directive.
  • Allow #addr directives to accept external symbols as a value (or as part of an expression). Expressions would need to be aware of linear equations to make this feature useful; all labels after the address would be encoded as unknown_addr + known_offset, so that subtraction can cancel out the unknown_addr, and so that bounds and alignment can be computed for the expressions. But the general case looks more like unknown_external0 * known_constant0 + unknown_external1 * known_constant1 + ... + unknown_externalN * known_constantN + known_offset.
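
The linear-equation idea in the last item can be sketched as a small value type that tracks a known offset plus a coefficient per unknown external. The class name `Linear` and its methods are hypothetical, and alignment tracking is left out.

```python
class Linear:
    """known_offset + sum(coefficient * unknown_external)."""

    def __init__(self, offset=0, terms=None):
        self.offset = offset
        # Drop zero coefficients so cancelled unknowns disappear.
        self.terms = {n: c for n, c in (terms or {}).items() if c != 0}

    @staticmethod
    def lift(value):
        return value if isinstance(value, Linear) else Linear(value)

    def __add__(self, other):
        other = Linear.lift(other)
        terms = dict(self.terms)
        for name, coeff in other.terms.items():
            terms[name] = terms.get(name, 0) + coeff
        return Linear(self.offset + other.offset, terms)

    def __neg__(self):
        return Linear(-self.offset, {n: -c for n, c in self.terms.items()})

    def __sub__(self, other):
        return self + (-Linear.lift(other))

    def is_known(self):
        return not self.terms

    def bounds(self, ranges):
        """Inclusive (lo, hi) given inclusive ranges for each external."""
        lo = hi = self.offset
        for name, coeff in self.terms.items():
            a, b = ranges[name]
            lo += min(coeff * a, coeff * b)
            hi += max(coeff * a, coeff * b)
        return lo, hi


base = Linear(0, {"load_addr": 1})   # an #addr set to an external symbol
start = base + 0x10                  # labels become unknown_addr + offset
end = base + 0x30
length = end - start                 # the unknown cancels out
```

Here `length.is_known()` is true with an offset of 0x20, while `start` stays symbolic and only has bounds once a range for `load_addr` is supplied.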

@hlorenzi
Owner

hlorenzi commented Sep 26, 2022

@Phlosioneer I'd like to discuss this further to see if I understand all the nuances! Would you be available on Discord? You can find an invite to my server on the readme.
