Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add address property to physicalLocation object #302

Closed
fishoak opened this issue Dec 21, 2018 · 20 comments
Closed

Add address property to physicalLocation object #302

fishoak opened this issue Dec 21, 2018 · 20 comments
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. e-ballot e-ballot-3 enhancement impact-non-breaking-change merged Changes merged into provisional draft. p1 Priority 1 issue to close resolved-fixed tc-33 Issues to present at SARIF TC33

Comments

@fishoak
Copy link

fishoak commented Dec 21, 2018

Binary analysis tools may report issues in terms of an effective address. These are not physical locations (as we have been thinking of them) and none of the options that the regions offer suffice to express such addresses adequately. Addresses are not offsets in any meaningful sense. One could shoe-horn them into location.fullyQualifiedLogicalName using kind="address", but that feels unsatisfactory.

The only place in the spec where an address is allowed is in a stackFrame object, but note that stackFrame.location exists too.

Consequently, I propose that we move the address object into the location. It could either go in as a new property location.address, or it could be pushed down into either the physical or logical location properties. Conceptually, it is neither, but I concede that an address is more like a physical thing than a logical thing, so it could go into a new optional property of physicalLocation named "address".

@ghost
Copy link

ghost commented Dec 26, 2018

I agree that it is neither fish nor fowl, so it belongs on the location object. @michaelcfanning, do you agree?

@ghost ghost added enhancement impact-non-breaking-change 2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. labels Dec 26, 2018
@michaelcfanning
Copy link
Contributor

location object could contain this, potentially, or perhaps some other property. stack frames do have this notion. some tools provide a relative virtual address against some other unknown base. to be discussed at the f2f.

@fishoak
Copy link
Author

fishoak commented Jan 10, 2019

I asked some of our local binary analysis experts about this topic. Below are two responses:

From Tom Johnson:

On this particular topic, I'd say you almost definitely want addresses specified with respect to some base. And possibly allow multiple kinds of bases to be specified. Since the introduction of ASLR, most binaries these days are relocatable - and are actively relocated on purpose. That means dynamic analysis tools may see different addresses from one run to the next (unless one explicitly disables this). Even without ASLR, DLLs are relocated often and can even be relocated differently from one run to the next of the same program (e.g., consider dynamically loaded DLLs).

Also, if you're analyzing object files, there isn't really a fixed notion of absolute address in this context. Technically addresses have not been determined for object files. You can kind-of think of it as every section in an object file starts at address 0. CodeSonar deals with this by artificially loading the object files into a synthetic address space. Other tools may handle object files entirely differently. Certainly it's unlikely that other tools would have the same notion of artificial loading that CodeSonar does.

There are several common notions of address that you'll encounter:

  • "Effective address" - the absolute address in a program's logical memory that something gets loaded at. This is what most people think of when you say "address". And it is the concept that SWYX uses. But it will vary due to relocation operations as mentioned above. If you wanted to use this as the concept for Sarif, the "base" would probably be some notion a "default load". That is, you'd determine where an OS would put the program by default without any consideration for ASLR, some pre-specified resolution for relocation of conflicting DLLs, etc. Whatever absolute address you get there, that's the value you communicate here (instead of the actual address observed by a dynamic tool, say). Figuring this out in a canonical way could be non-trivial, and it's not something tools normally do when looking at something that has been relocated. So it might be a bit of a fuzzy/confusing concept to get right.

  • "Image or section offset" - this is the offset wrt some piece of the logical structure of the program when loaded into memory. Many static tools will know how to work with this concept. And if you say, "this is an address with respect to a base", this is the first type of thing that comes to mind. Whether it's an image or a section offset depends on what kind of artifacts you're dealing with. Section offsets are more natural when working with object files (which aren't combined in a full image yet). Image offsets show up more often with linked executables. However, that's by no means a firm rule. You can certainly talk about section offsets within a linked executable.

  • "File offset/address" - this is the absolute offset of a location within the binary file artifact, ignoring internal logical structure of the file. These are rare, but do show up on occasion. For example, the headers in a binary file often include file offsets here and there. The nice thing about a file offset is that you don't need to parse any of the structure of the file to know the location of what you might be interested in.

If forced to pick only one of these notions, the one that is likely going to allow the most reliable communication method would be section offset. Section offsets survive relocation operations. And section offsets generally make sense for both object files as well as linked executables and dynamic libraries. Most tools already know how to work with them.

The one caveat there is that section offsets cannot be used to locate objects that are not within a program's image. For example, if something is allocated on the heap or stack. Say something generates self-modifying code or something like that. Might be a high-class worry, but might motivate offering flexibility in the specification too.

From Eric Schulte:

The only thing I could add to Tom's response is to suggest that you use
the name "address" instead of "effective address." We did a review of
what most binary analysis tools use for this concept and "address" was
the clear winner.

@fishoak fishoak closed this as completed Jan 10, 2019
@fishoak
Copy link
Author

fishoak commented Jan 10, 2019

Sorry, closed in error!

@fishoak fishoak reopened this Jan 10, 2019
@ghost
Copy link

ghost commented Jan 10, 2019

Thanks for Tom and Eric for that detailed analysis.

The only thing I will add is that SARIF already has a construct for the "file offset/address" concept: it is physicalLocation.region.byteOffset. This is what we would expect a tool to use if it wanted to report, for example, that there was something wrong in the file's header at offset 0xA0 from the start of the file.

I agree with you that "image or section offset" makes sense to express an address that might be relocated. We might imagine an address object that fits into the existing SARIF object model like this:

{                             # A SARIF location object.
  "physicalLocation": {
    "fileLocation": {
      "uri": "main.obj"
    },
    "address": {
      "section": ".TEXT",
      "offset": 1234
    }
  }
}

Tom, could you please explain the scenario where we'd need to specify multiple bases? If we need to support that, we could simply give the locations object an addresses property whose value is an array of address objects:

{                             # A SARIF location object.
  "physicalLocation": {
    "fileLocation": {
      "uri": "main.obj"
    },
    "addresses": [
      {
        "section": ".TEXT",
        "offset": 1234
      },
      ...
    ]
  }
}

@tjohnson-gt
Copy link

I'm not sure there's a need for multiple bases for a single location. My comment was that you might want to recognize multiple kinds of bases. And it sounds like you're already considering that with file-offset and section-offset.

So for a structured executable file, you have a clear notion of content organized by sections. And those sections could be relocated in memory at different locations at load time. So, a section-based offset makes sense here.

Another possible subject might be a ROM image for firmware that just gets directly injected into memory somewhere. There may be no internal structure to the image - it's just a block of bytes. In this case, the file-offset notion is appropriate.

We had an in-house discussion that in ELF binaries, there is also a notion of "segment". Typically a segment is a combination of sections that all get loaded contiguously in memory w/ the same access permissions. Sometimes, though, segments include additional parts of the binary file (for example, headers). Someone may find it useful to talk in terms of segment+offset locations rather than section+offset locations.

@ghost
Copy link

ghost commented Jan 11, 2019

Got it. How about this for an address object:

{
 "sectionName": ".text",
 "sectionOffset": 1234,   # Byte offset from start of specified section.
 "segmentIndex": 4,       # 0-based index of the segment's entry in the ELF program header table.
 "segmentOffset": 5678    # Byte offset from start of specified segment.
}

Each pair is optional, both pairs can be present.
Note that since this is JSON, the numbers are all decimal.

@michaelcfanning
Copy link
Contributor

I'm not sure we want to create a bucket object that contains mutually exclusive properties that define the address. It would be preferable to provide generic object that could be repurposable for various scenarios.

Conceptually, as Tom notes, these addresses are chained, e.g., address of PE + sections header.virtual address (gives us start of a section) + offset. if you think about the problem this way, it is similar to logical locations (except that we're using actual binary internals to build the 'path to address'. the size of these pieces is relevant to analysis.

The BinSkim code may be helpful illustrating how that particular tool manipulates low level binary address information. Poking around, I notice a small other wrinkle, which relates to padding that may be required to produce the address. This BinSkim ImageFieldData construct actually looks close to what might work. I'm thinking something along the lines of:

{
baseAddress, // could be gleaned by examining the parent, think 'fullyQualifiedLogicalName'
baseIndex, // information for current base, such as the section, think 'logicalLocationIndex'
size, // size of this thing
offset, // offset from the current base, to identify a thing within a container, like a specific import
//
name, // would be '.text' for an address that denotes this section,
kind or type, // what is this thing? a segment? a section?
padding and/or trailing bytes count, // ? not sure whether this needs to be expressed
}

Again, very similar to the logical location and nested files mechanisms already in the format. And so, suffers/benefits from the same advantages and disadvantages. Note that the non-deterministic base address is similar in concept to the value associated with uriBaseId. Section header addresses + offsets are similar to the deterministic relative of paths under source control. Just trying to make SARIF connections to help connect this concept to existing patterns.

Here's the basic view of what we're up to:

(Virtual address of something) - (base address) = relative virtual address

The RVA + some unknown base address (such as the base address of the loaded module).

@ghost
Copy link

ghost commented Jan 11, 2019

I don't think it needs to be that complicated. Unlike nested files and logical locations, these address objects have a limited nesting. I think the notion of a "parent" is overkill.

The design I proposed is specific to ELFs; you might consider that a drawback. OTOH, it supports ELF's well, and takes into account that in ELFs, sections have names but segments do not (hence the string-valued sectionName and the integer-valued segmentIndex). We could certainly change it to

{
  "sectionId": ....      # Either a string or on integer
  "byteOffset": 1234,
  "kind": ....           # Either "section" or "segment"; leaves open possible future support for
                         #  other executable formats.
}

... but it does run into the usual problem where our code gen can't deal with a property that might have more than one type.

Tom mentioned (see his bullet point on "effective address") that the absolute address might be hard to define in a clear way; I took that as a clue not to include it in the format. OTOH, for tools that know how to compute it, like BinSkim, we could include absoluteAddress and/or relativeVirtualAddress fields in the address object.

@michaelcfanning
Copy link
Contributor

The limited nesting is a good point, one that occurred to me as well after posting the above. For me, I think the next best step is to look in detail at some analysis results of this kind. The SARIF, as always, should be a support for developers to prove to themselves that there is, in fact, a problem to act on. We should ask ourselves what that process would look like.

Consider a simple example, 'your imports section is executable'. Here are the addresses of interest:

  1. Import table RVA location. Somewhere in the binary, we will find the RVA of the import table. This location has an RVA.
  2. The section that contains the import table.
  3. The section header contains a piece of data that indicates it is executable.
  4. The import table itself, i.e., the RVA from Define driving principles for SARIF effort #1

The interesting question is how a viewer wants to take the user through this information to conclude, yes, the section that contains my import section is marked executable. I could imagine getting a hex view of the binary with some outlines/chrome demarcating various things. You could imagine a structural view of that data within the binary.

@michaelcfanning michaelcfanning added the p1 Priority 1 issue to close label Jan 24, 2019
@michaelcfanning
Copy link
Contributor

michaelcfanning commented Jan 24, 2019

A proposed design for an address descriptor:

baseAddress
kind
name
index
offset
length

we should prepare a change draft and review in advance of a TC discussion

@michaelcfanning
Copy link
Contributor

In addition to this core design, SARIF could populate a load map that would be persisted in a similar way as the files and logicalLocations table. address object could refer to these entries, which would also allow for parenting/nesting as other constructs do.

@michaelcfanning
Copy link
Contributor

michaelcfanning commented Feb 27, 2019

EBALLOT PROPOSAL Provide an address property on the location object in order to provide address details for results. Addresses can refer to a parent address, in order to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address objects at run.addresses consistent with other SARIF tables.

API IMPACT
Define an address object.
Add location.address (an address instance)
Add run.addresses (an array of address instances)
Add externalPropertyFiles.addresses to permit externalizing address data.
Remove stackFrame.address and stackFrame.offset (both of type integer) and replace them with a new stackFrame.address property of type address.
The address object contains the following properties:
baseAddress : a base address rendered as a hexadecimal string
kind: an open-ended string that identifies the address kind. 'section' and 'segment' are well-known values.
name: a name that is associated with the address, e.g., '.text'
offset: an offset from the base address, if present, rendered as a hexadecimal string
index: an index into run.addresses used to retrieve a cached instance to represent the address
parentIndex: an index into run.addresses to retrieve a parent address. the parent can provide a base address (from which the current offset value is relevant) and other details

@ghost ghost added the to-be-written label Mar 1, 2019
@ghost ghost deleted a comment from michaelcfanning Mar 4, 2019
@ghost
Copy link

ghost commented Mar 4, 2019

Feedback from MS [UPDATED]:

  • Move address to the physical location object.
  • Remove run.addresses.
  • Add an artifact role for memoryObject and anything else the TC thinks appropriate.

By associated an address with a PLC, we will naturally have a mechanism for specifying regions, etc. The standard artifact parenting mechanism can be leveraged. An artifact has a timestamp, settling that concern.

michaelcfanning pushed a commit to microsoft/sarif-sdk that referenced this issue Mar 7, 2019
… (#1323)

* solution builds post change + transformer logic (UTCs fail)

* fixing utcs and release md

* manually merging some files cleanly

* modifying transformer logic due to change in node name!

* rc++
@kupsch
Copy link

kupsch commented Mar 7, 2019

fileOffset can come from the physicalLocation property.

Should add a snippet object for the contents

@michaelcfanning michaelcfanning added tc-33 Issues to present at SARIF TC33 and removed tc-32 labels Mar 7, 2019
@michaelcfanning
Copy link
Contributor

TC33 conclusions:

address moves to physicalLocation
add a fullyQualifiedName property to address
add an artifactContents.rendered property (a multiformat string)

@ghost
Copy link

ghost commented Mar 28, 2019

E-BALLOT #3 PROPOSAL

Provide an address property on the physicalLocation object to provide address details for results. Addresses can refer to a parent address, to render address 'chains' that describe structure within an analysis target (e.g., an offset from a section header). Provide a table of cached address objects at run.addresses consistent with other SARIF tables.

SCHEMA CHANGES

  • Define an address object with the following properties:

    • baseAddress of type string: a base address rendered as a hexadecimal string
    • kind of type string: an open-ended string that identifies the address kind. "section" and "segment" are well-known values.
    • name of type string: a name that is associated with the address, e.g., ".text".
    • fullyQualifiedName of type string.
    • offset of type string: an offset from baseAddress, if present, rendered as a hexadecimal string.
    • index of type integer: an index into run.addresses used to retrieve a cached instance to represent the address.
    • parentIndex of type integer: an index into run.addresses to retrieve a parent address. The parent can provide a base address (from which the current offset value is relevant) and other details.
  • In the physicalLocation object:

    • Add a property address of type address, optional.
  • In the run object:

    • Add a property addresses of type address[], optional, minItems: 0, default: [], externalizable.
  • In the stackFrame object:

    • Remove the properties address and offset.
  • In the artifactContents object:

    • Add a property rendered of type multiformatString, optional.
  • In the artifact object:

    • In the roles property, add a value "memoryContents".

NOTES

An earlier proposal included adding stackFrame.address. But this isn't necessary because stackFrame has a location property of type location, which has a physicalLocation property, which will now have an address.

@ghost ghost changed the title Add address property to a location object Add address property to physicalLocation object Mar 29, 2019
ghost pushed a commit that referenced this issue Mar 29, 2019
@ghost ghost added change-draft-available merged Changes merged into provisional draft. and removed to-be-written labels Mar 29, 2019
@kupsch
Copy link

kupsch commented Apr 3, 2019

The types of baseAddress and offset should be integer as this is their natural type. Other than for display purposes, they need to integers to compute absolute addresses or differences of addresses. Making this an integer means that JSON libraries will produce the correct type automatically. The viewer should determine the output format and displaying the value as hexadecimal is generally no more difficult than outputing a string or decimal representation of a number.

@ghost ghost removed the change-draft-available label Apr 6, 2019
@ghost ghost self-assigned this Apr 6, 2019
@ghost
Copy link

ghost commented Apr 6, 2019

Approved in e-ballot-3.

@ghost ghost closed this as completed Apr 6, 2019
@ghost ghost removed the schema-todo label Apr 29, 2019
@ghost
Copy link

ghost commented Apr 29, 2019

I confirm that this change is correctly in the schema.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.1.0-CSD.1 Will be fixed in SARIF v2.1.0 CSD.1. e-ballot e-ballot-3 enhancement impact-non-breaking-change merged Changes merged into provisional draft. p1 Priority 1 issue to close resolved-fixed tc-33 Issues to present at SARIF TC33
Projects
None yet
Development

No branches or pull requests

4 participants