Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pysrc2cpg] Model Field-like Behaviour of Module Variables #3750

Merged
merged 6 commits into from
Oct 23, 2023

Conversation

DavidBakerEffendi
Copy link
Collaborator

@DavidBakerEffendi DavidBakerEffendi commented Oct 20, 2023

The motivation for this PR is that module-level variables in Python can be accessed by other files by some implicit global scope. This behaviour is more akin to fields rather than identifiers, and should be modelled closer to that direction.

Given the caveat that the data-flow engine sees fields a bit differently, especially if the flow goes across closures, this aims to be the least intrusive change to achieve this result. The benefits of this would be:

  • Immediately see the global variables of a module by referring to the <module> type declaration's members.
  • Improve import resolution when an entity is being introduced from across modules.
  • Type recovery and propagation does not need to "guess" which identifiers have interprocedural scope.
  • Ability to traverse from a module's global variable to its usages across other modules.

Main changes

  • Create a module-level identifier assigned to the module's type ref to be used as the base of module-level member field accesses.
  • At module-level identifier STOREs, create a block that writes to the module field, then aliases the result to the identifier, e.g., x = 1 becomes
x = {
      <module>.x = 1
      tmp = <module>.x
      tmp
    }
  • Added helper methods, e.g., isModuleContext to ContextStack
  • Separated VariableReference class into Reference trait shared between VariableReference and FieldReference
  • During createMemberLinks, FieldReference calls will have a REF edge to their referencing member.
  • If a FieldAccess had an untyped base that is recovered during the type recovery, the REF edge will be recovered to the target type and member, if either are present.
  • Have reachableByFlows ignore the desugaring when returning path elements to keep the result as if there was no desugaring occuring.

Misc Changes

  • Fix pysrc2cpg/ImportResolver bug when an __init__.py entity is imported from a non-sibling module.
  • Added shortcut in PythonTypeRecovery.visitStatementsInBlock to detect these strong-update blocks and "re-sugar" them to skip propagating the type via member
  • Removed Python slicer tests, I would rather be intentional in how I support Python usage slicing at a later stage

The motivation for this PR is that module-level variables in Python can be accessed by other files by some implicit global scope. This behaviour is more akin to fields rather than identifiers, and should be modelled closer to that direction.

Given the caveat that the data-flow engine sees fields a bit differently, especially if the flow goes across closures, this aims to be the least intrusive change to achieve this result.

### Main changes
* Create a module-level identifier assigned to the module's type ref to be used as the base of module-level member field accesses.
* At module-level identifier STOREs, create a block that writes to the module field, then aliases the result to the identifier, e.g., `x = 1` becomes
```python
x = {
      <module>.x = 1
      tmp = <module>.x
      tmp
    }
```
* Added helper methods, e.g., `isModuleContext` to `ContextStack`
* Separated `VariableReference` class into `Reference` trait shared between `VariableReference` and `FieldReference`
* During `createMemberLinks`, `FieldReference` calls will have a `REF` edge to their referencing member. TODO: Recover these `REF` edges during the type recovery pass.
*

## Misc Changes
* Fix `pysrc2cpg/ImportResolver` bug when an `__init__.py` entity is imported from a non-sibling module.
* Added shortcut in `PythonTypeRecovery.visitStatementsInBlock` to detect these strong-update blocks and "re-sugar" them to skip propagating the type via member
* Removed Python slicer tests, I would rather be intentional in how I support Python usage slicing at a later stage
@DavidBakerEffendi DavidBakerEffendi added the python Relates to pysrc2cpg label Oct 20, 2023
@DavidBakerEffendi DavidBakerEffendi self-assigned this Oct 20, 2023
@DavidBakerEffendi DavidBakerEffendi marked this pull request as ready for review October 20, 2023 15:02
Copy link
Contributor

@fabsx00 fabsx00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this contain a test for a flow we can now find which we could not find previously? It looks to me like a few test expectations were edited, but there is no new test. Also, can you comment on why the number of flows changes? Are we ok with suddenly getting 2 flows in a place where we previously only had one?

@@ -214,6 +215,23 @@ class TypeRecoveryPassTests extends PySrc2CpgFixture(withOssDataflow = false) {
"bar.py"
).cpg

"be able to traverse from `foo.[x|y|db]` to its members" in {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One additional (non-data-flow) test to show recovered member references that were previously not there


val source = cpg.literal("42").l
val sink = cpg.call.code("print.*").argument.l
sink.reachableByFlows(source).size shouldBe 1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this PR, we can now obtain the following flow:

    /*
    ____________________________________________________________
    | nodeType| tracked         | lineNumber| method   | file   |
    |===========================================================|
    | Literal | <module>.a = 42 | 2         | <module> | foo.py |
    | Call    | print(foo.a)    | 4         | <module> | bar.py |
     */

_.method.typeDecl
_.method.typeDecl,
// for Python, we have moved to replacing strong updates of module-level variables with their members
_.target.isCall.nameExact(Operators.fieldAccess).argument(1).isIdentifier.name("<module>")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added to compensate for the de-sugared blocks to reason between the module-level variable identifier and its field access

expr.method.typeDecl.map { typeDecl => (typeDecl, expr) } ++
expr.inCall.fieldAccess.referencedMember.flatMap { member =>
member.typeDecl.map { typeDecl => (typeDecl, member) }
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extends on the Expression case to access the referenced members, effectively combining the behaviour of both cases below this one

@@ -37,6 +37,18 @@ class ExtendedCfgNode(val traversal: Iterator[CfgNode]) extends AnyVal {
reachedSources.cast[NodeType]
}

/** Trims out path elements that represent module-level variable assignment desugaring
*/
private def trimModuleVariableDesugarBlock(path: Vector[PathElement]): Vector[PathElement] = {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabsx00 I've added this to remove the visibility of the desugaring, now the existing Python data-flows no longer need to be +1'd in certain places since they were duplicates in any case

DavidBakerEffendi added a commit that referenced this pull request Oct 23, 2023
@fabsx00 fabsx00 self-requested a review October 23, 2023 13:55
@fabsx00 fabsx00 merged commit 4d58f8f into master Oct 23, 2023
5 checks passed
@DavidBakerEffendi DavidBakerEffendi deleted the dave/python/module-var-as-field branch October 23, 2023 13:57
DavidBakerEffendi added a commit that referenced this pull request Nov 14, 2023
This query package, `moduelvariable`, adds the necessary complexity to reason and handle the changes introduced by #3750

See the README.md for additional context.

One can now navigate the module members, the block references, the locals and identifiers referencing these variables, and go navigate back concisely.
DavidBakerEffendi added a commit that referenced this pull request Nov 14, 2023
This query package, `modulevariable`, adds the necessary complexity to reason and handle the changes introduced by #3750

See the README.md for additional context.

One can now navigate the module members, the block references, the locals and identifiers referencing these variables, and go navigate back concisely.
DavidBakerEffendi added a commit that referenced this pull request Nov 21, 2023
…Steps

* Reverts #3750's module block structure in Python and performs more on-the-fly interprocedural usage resolution
* Adds `MODULE` modifier on module-defining methods in JS and Ruby (Python already has this). Note: Other frontends don't seem to have a module-defining method, though I may be wrong.
* Transferred import resolution query steps and classes to `semanticcpg` under `importresolver` to perform on-the-fly CPG entity retrieval of resolved imports.
* Modified existing `modulevariable` query steps to now reflect `LOCAL` as the base of the `ModuleVariable` node extension and created a bunch of queries around that.
* Modified `SourcesToStartingPoints` to handle module variables similarly to fields for interprocedural flows.
DavidBakerEffendi added a commit that referenced this pull request Nov 22, 2023
…Steps (#3839)

In an ongoing effort to consolidate the interprocedural behaviour of "global" or "module-level" variables, this PR allows frontends to define module-defining methods the same as other modules, while treating the `LOCAL` variables on these methods closer to `TypeDecl.Member` nodes in the data-flow engine.

This change allows all frontends that specify the `MODULE` modifier to benefit from inter-module flows from module-level local variables, i.e., a module importing another module will now benefit from data-flow tracking across importing modules.

### Changelist

* Reverts #3750 module block structure in Python and performs more on-the-fly interprocedural usage resolution
* Adds `MODULE` modifier on module-defining methods in JS and Ruby (Python already has this). Note: Other frontends don't seem to have a module-defining method, though I may be wrong.
* Transferred import resolution query steps and classes to `semanticcpg` under `importresolver` to perform on-the-fly CPG entity retrieval of resolved imports.
* Modified existing `modulevariable` query steps to now reflect `LOCAL` as the base of the `ModuleVariable` node extension and created a bunch of queries around that.
* Modified `SourcesToStartingPoints` to handle module variables similarly to fields for interprocedural flows.
* Query steps of node:
  * `isModule` added to `Method` nodes
  * `method` step added to `Declaration` nodes
  * `isModuleVariable` added to `Identifier` and `Local` nodes
  * `Expression` nodes have `moduleVariables` step which filters in nodes that relate to some module variable, and casts these nodes to `OpNodes.ModuleVariable` (extends `Local`)
  * `references` added to `OpNodes.ModuleVariable` which behaves like `Local.referencingIdentifiers` but also includes `fieldIdentifier` nodes whenever a module variable is referenced off an imported module field access.

### Things to Note

* In "lowered" code blocks, including lambdas, there were a few flow increases due to method variable scoping behaving more like field accesses:
  * For Ruby, in the deprecated frontend, there was 1 fewer in some tests. In the new frontend, the assignment flows doubled.
  * JavaScript flows also increased here and there in 1 lowering example and 2 lambdas.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Relates to pysrc2cpg
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants